@arunsrinivasan
Last active July 18, 2018 08:58
data.table, dplyr and R - floating point comparisons

Checking for exact equality of FPs

require(dplyr)
DF = data.frame(a=seq(0, 1, by=0.2), b=1:2)

merge(data.frame(a=0.6), DF, all.x=TRUE)
#     a  b
# 1 0.6 NA

left_join(data.frame(a=0.6), DF)
# Joining by: "a"
#     a  b
# 1 0.6 NA
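
To see why neither join finds a match, it can help to compare the stored values directly. The following is a minimal sketch in base R that rebuilds the same `a` column; the names `x`, `y` and `gap` are only illustrative and not part of the original example.

```r
# Rebuild the column from the example above.
a <- seq(0, 1, by = 0.2)

x <- a[4]   # the value that prints as 0.6 in DF
y <- 0.6    # the literal we try to join on

# Exact comparison: this is what merge() and left_join() rely on.
same <- x == y
print(same)

# Print both values with 17 significant digits to expose the difference
# that the default printing hides.
print(sprintf("%.17g", x))
print(sprintf("%.17g", y))

# The gap is non-zero but far smaller than anything the default
# print method would show.
gap <- abs(x - y)
print(gap)
```

Both values print as `0.6` by default, yet the exact comparison fails, which is why the joins above return `NA`.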

Yes, floating point math is hard! But that's really not an answer. This post (in fact, the entire series on that blog) is an excellent read about the ways one can overcome such surprises. It also explains why relying on a tolerance is rubbish. There isn't really one perfect answer to this issue (including the one provided in that blog), which also becomes obvious from reading the comments under this link.

What we do in data.table is round off the last 2 bytes by default for numeric comparisons (you can turn this off, if you really wish, with setNumericRounding(0L)). This is just another way to tackle the problem. It is more than sufficient unless we deal with really large numerics. Personally, I've not seen a floating point number that huge, with decimal places, that's of any use, e.g. 123456789987654321.12345. We recommend using bit64::integer64 for really large numerics.
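
To make the idea of dropping the last 2 bytes concrete, here is a rough sketch in base R. It is an assumption-laden illustration, not data.table's implementation: the helper `drop_last_bytes()` is hypothetical, and it simply zeroes the two least significant bytes (truncation) rather than rounding them the way the library might.

```r
# Hypothetical helper (not part of data.table): zero the n least
# significant bytes of each double in x. This truncates; it is only
# meant to illustrate the idea of ignoring the lowest bits.
drop_last_bytes <- function(x, n = 2L) {
  vapply(x, function(v) {
    # 8 bytes per double; little-endian puts the least significant first
    bytes <- writeBin(v, raw(), endian = "little")
    bytes[seq_len(n)] <- as.raw(0)
    readBin(bytes, what = "double", n = 1L, endian = "little")
  }, numeric(1))
}

x <- seq(0, 1, by = 0.2)[4]  # prints as 0.6, but is not the literal 0.6
y <- 0.6

exact_equal   <- x == y
rounded_equal <- drop_last_bytes(x) == drop_last_bytes(y)

print(exact_equal)
print(rounded_equal)
```

The two values differ only in their lowest bits, so they compare equal once those bits are ignored. Plain truncation like this can still separate two neighbouring values that straddle a byte boundary, which is one reason to treat it as a sketch only.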

Like I said, this is just another way of attempting to avoid surprises like the case above. But it's essential to not let it slide by saying floating point math is hard, IMHO.

require(data.table)
getNumericRounding() # [1] 2
DT = data.table(DF, key="a")
DT[.(0.6)]
#      a b
# 1: 0.6 2

setNumericRounding(0L) # no rounding
DT[.(0.6)]
#      a  b
# 1: 0.6 NA
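
Since setNumericRounding() changes a session-wide setting, it may be worth saving and restoring the previous value when experimenting. The sketch below assumes data.table is installed; the names `old`, `with_rounding` and `without_rounding` are illustrative. It also sets the rounding explicitly before each lookup, because the default may differ between data.table versions.

```r
library(data.table)

# Remember the current setting so it can be restored afterwards.
old <- getNumericRounding()

DT <- data.table(a = seq(0, 1, by = 0.2), b = 1:2, key = "a")

# Round off the last 2 bytes: the lookup for 0.6 should find its row.
setNumericRounding(2L)
with_rounding <- DT[.(0.6)]$b

# No rounding: the lookup falls back to exact comparison.
setNumericRounding(0L)
without_rounding <- DT[.(0.6)]$b

# Restore whatever was in effect before.
setNumericRounding(old)

print(with_rounding)
print(without_rounding)
```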

fabeit commented Jul 18, 2018

I am having an issue in this regard: I have to use signif() on one of the two floating point columns I merge on, otherwise merge misses some matches.