Skip to content

Instantly share code, notes, and snippets.

@patternproject
Created June 9, 2017 16:43
Show Gist options
  • Save patternproject/76637a1e0abd26e28330f8c90370c9f7 to your computer and use it in GitHub Desktop.
Save patternproject/76637a1e0abd26e28330f8c90370c9f7 to your computer and use it in GitHub Desktop.
# The main idea comes from this post:
# http://zevross.com/blog/2014/08/05/using-the-r-function-anti_join-to-find-unmatched-records/
# Setting up
df.1 <- data.frame(
state = rep(1:3, each=2),
country = rep(c("A","B"),3),
vals = rnorm(6)
)
df.2 = df.1[c(1,3,4),]
# Anti Join on 1 Parameter
anti_join(df.1,df.2, by=c("state"))
# state country vals
# 3 A -0.1566940
# 3 B 0.5666763
# Anti Join on 2 Parameters
anti_join(df.1,df.2, by=c("state","country"))
# state country vals
# 1 B -0.4089718
# 3 B 0.5666763
# 3 A -0.1566940
# Anti Join on 3 or All Parameters
anti_join(df.1,df.2, by=c("state","country","vals"))
# state country vals
# 3 B 0.5666763
# 3 A -0.1566940
# 1 B -0.4089718
setdiff(df.1,df.2)
# state country vals
# 1 B -0.4089718
# 3 B 0.5666763
# 3 A -0.1566940
# setdiff behavior is similar to anti-join with all parameters.
@bhive01
Copy link

bhive01 commented Jun 12, 2017

You are correct.
Anti-join is looking only at the column or column combinations that you give it. setdiff is looking across the entire row. So, if you give anti-join all of the columns it works similar enough, but obviously there is some different sorting going on underneath. I haven't dug at the guts recently as they are written in C now and are harder for me to play with.

You can see the difference more if you add one more row to df.1 and take away Vals from the Anti-join.

## add a row that has same state and country, but different value
df.1 <- structure(list(state = c(1L, 1L, 2L, 2L, 3L, 3L, 1L), country = structure(c(1L, 
2L, 1L, 2L, 1L, 2L, 1L), .Label = c("A", "B"), class = "factor"), 
    vals = c(-0.611971622156942, 0.729355537031257, -0.9603762909679, 
    1.05861829907496, -0.167709554493079, -1.67932110281417, -1.342348572908456)), .Names = c("state", 
"country", "vals"), row.names = c(NA, -7L), class = "data.frame")


df.2 = df.1[c(1,3,4),]

# anti on every column
anti_join(df.1,df.2, by=c("state","country","vals"))
  # state country       vals
# 1     1       A -1.3423486
# 2     3       B -1.6793211
# 3     3       A -0.1677096
# 4     1       B  0.7293555

#anti only on 2, missing that new row
anti_join(df.1,df.2, by=c("state","country"))
  # state country       vals
# 1     3       B -1.6793211
# 2     3       A -0.1677096
# 3     1       B  0.7293555

# set diff compares across the entire row, similar to anti on everything
setdiff(df.1,df.2)
  # state country       vals
# 1     1       B  0.7293555
# 2     3       A -0.1677096
# 3     3       B -1.6793211
# 4     1       A -1.3423486

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment