Last active
April 8, 2020 11:02
Corner Cases With Non-Standard Evaluation in data.table
# Because there is no way to tell data.table
# "interpret this variable as a column name", it's possible to come up
# with corner cases. I'll grant these are unlikely to occur in
# day-to-day use, but any function that uses `data.table` must account for
# them

library(data.table)

# Low odds, and yes, there are workarounds, but this is
# what I mean by you have to think carefully to avoid
# corner cases

# Ex 1

my.dt <- data.table(col=letters[1:5], col2=1:5)
fun <- mean
col <- "col2"
# inside `[`, `col` resolves to the column `col`, not to the variable holding "col2"
my.dt[, fun(get(col))]

# this one in particular very unlikely, but illustrating a point

# Ex 2

mtcars.dt <- data.table(mtcars)
mtcars.dt[,`cyl,am`:= 1]
grp <- "cyl,am"
# groups by the two columns `cyl` and `am`, not by the single column `cyl,am`
mtcars.dt[,mean(hp), by=grp]
grp <- "`cyl,am`"
mtcars.dt[,mean(hp), by=grp]

# This one actually works fine, but again, you have to be careful
# by signaling your intent with an expression instead of a symbol
# name, which is not at all intuitive to anyone familiar with R.
# The `get` solution is internally consistent, at least, though
# with the collision issue I highlighted earlier

# Ex 3

cols <- c("hp", "mpg")
fun <- mean
# bare `cols` on the left of `:=` is taken as a literal column name
(data.table(mtcars)[, cols:=lapply(.SD, fun), .SDcols=cols])
# `(cols)` is evaluated, so the columns `hp` and `mpg` are the ones assigned
(data.table(mtcars)[, (cols):=lapply(.SD, fun), .SDcols=cols])

# Let's try to group by expressions (to be fair, you can't
# really do this with `dplyr`)

# Ex 4

exp <- list(a=quote(gear %% 2), b=quote(cut(hp, 5)))
data.table(mtcars)[, mean(mpg), by=list(a=gear %% 2, b=cut(hp, 5))]
data.table(mtcars)[, mean(mpg), by=exp]   # argh

# Ex 5

group_by_exp <- function(exp)
  data.table(mtcars)[, mean(mpg), by=eval(substitute(exp))]
group_by_exp(list(a=gear %% 2, b=cut(hp, 5)))   # this kind of works

# Ex 6

exp.q <- quote(list(a=gear %% 2, b=cut(hp, 5)))
group_by_exp(exp.q)   # argh
group_by_exp2 <- function(exp)
  data.table(mtcars)[, mean(mpg), by=eval(eval(substitute(exp)))]
group_by_exp2(exp.q)   # now we're getting crazy...
data.table(mtcars)[, mean(mpg), by=exp.q]   # this actually works!, but not documented

# Again, every one of these has workarounds, though they require
# some care. I'd like a version of `[.data.table` that allows me
# to very explicitly tell it how to interpret things so that I don't
# have to worry about funny corner cases due to the flexibility in
# data.table. Don't get me wrong, for the most part the flexibility
# is fantastic.
Also, one comment re 2, 3, and 4-6 that's worth highlighting. The workarounds are all different. For one we need to use `get`, for the other to use `()` or some such, and for the last we need to evaluate quoted expressions. By providing one SE version that handles all this stuff you make `data.table` much more accessible to programmers (as opposed to command line users).
Note: discussion is being continued on e-mail. Will report back with conclusions.
Any news?
AFAIU all those corner cases are addressed by Rdatatable/data.table#4304
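
For readers who want to see what that could look like: my understanding (not confirmed in this thread) is that the referenced PR is the one that added the `env` argument to `[.data.table`, which I believe ships in data.table 1.15.0 and later. Under that assumption, the name collision from Ex 1 might be sidestepped roughly as below. The placeholder names `target` and `f` are made up for this sketch, and the version guard is there only so the snippet still runs on older installs.

```r
library(data.table)

# Setup from Ex 1: the table has a column named `col`, and the calling
# scope also has a variable `col` that holds the name of the wanted column.
my.dt <- data.table(col = letters[1:5], col2 = 1:5)
col <- "col2"

# ASSUMPTION: `env` exists in data.table >= 1.15.0; skip on older versions.
if (packageVersion("data.table") >= "1.15.0") {
  # `target` and `f` are placeholders that get substituted into `j` before
  # it is evaluated. `env` itself is an ordinary argument evaluated in the
  # calling scope, so `col` here is the variable, not the column.
  res <- my.dt[, f(target), env = list(target = col, f = "mean")]
  print(res)
}
```

Character values in `env` are turned into symbols, so `j` effectively becomes `mean(col2)` without the caller ever writing `get()`.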
Re 1:
That said, even if the above were not possible in `dplyr`, I think `data.table` should provide a mechanism for a programmer to provide a variable column name and not have to first check that the variable name they chose in code doesn't also exist as a column name in the table (and if it does, twist themselves into pretzels to change the variable name, or error out asking the user to change the column names of the data table).
Re point 2:
I appreciate the feature aspect of it for interactive use, but now if I'm writing a function and asking people to provide character column names, I need to check to make sure that they aren't using the "x,y" syntax, and if they are, check for a column called "x,y", and if they are, hope there aren't also "x" and "y" columns, and if there are, throw a warning alerting them to the possible ambiguity.
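
To make that chain of checks concrete, here is a rough sketch in base R. `classify_by` is a hypothetical helper invented for illustration, not something data.table provides, and the category labels it returns are arbitrary.

```r
# Hypothetical helper, not part of data.table: classify a character `by`
# value against a table's column names, following the checks described above.
classify_by <- function(by, col.names) {
  stopifnot(is.character(by), is.character(col.names))
  # Only a single string containing a comma can be read as the "x,y" syntax.
  if (length(by) != 1L || !grepl(",", by, fixed = TRUE)) return("plain")
  pieces <- strsplit(by, ",", fixed = TRUE)[[1L]]
  whole.exists <- by %in% col.names            # a column literally named "x,y"
  pieces.exist <- all(pieces %in% col.names)   # columns "x" and "y" both exist
  if (whole.exists && pieces.exist) {
    warning("'", by, "' could mean one column or several: ",
            paste(pieces, collapse = ", "))
    return("ambiguous")
  }
  if (whole.exists) "single column" else "comma-separated"
}

classify_by("cyl", names(mtcars))
classify_by("cyl,am", names(mtcars))
```

Calling it with `"cyl,am"` against a table that has `cyl`, `am`, and a column literally named `cyl,am` raises the warning and returns `"ambiguous"`. Every function that accepts character column names would need something like this, which is the burden I'm describing.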
Note `c(grp)` does not fix the problem. The problem is the support of the `"col1,col2,col3"` syntax along with the support of `c("col1", "col2", "col3")` syntax. In my example `c(grp)` still groups by the two columns `cyl` and `am`, instead of the single column `cyl,am` which I created (and agree is completely contrived).
Re point 3:
Seems like this one has no possible ambiguity, so it's probably fine as is.
Re point 4-6:
You're right that I've probably overcomplicated some of the cases. That said, I don't think there is any documentation about how to programmatically pass grouping expressions (at least that I saw). I built these through trial and error.
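
One approach that relies only on base R call construction, rather than on undocumented behavior of `by`, is to build the full `[` call and then evaluate it. This is a sketch, not a documented data.table idiom; the names `grp.exprs`, `by.call`, and `dt` are illustrative.

```r
library(data.table)

# Grouping expressions held as data: a named list of quoted calls (as in Ex 4).
grp.exprs <- list(a = quote(gear %% 2), b = quote(cut(hp, 5)))

# Build the call list(a = gear %% 2, b = cut(hp, 5)) from the list elements.
by.call <- as.call(c(as.name("list"), grp.exprs))

# bquote() splices by.call into the `by` argument, producing the same call
# one would type interactively; eval() then runs it.
dt <- data.table(mtcars)
res <- eval(bquote(dt[, mean(mpg), by = .(by.call)]))
res
```

Because `[.data.table` sees exactly the expression an interactive user would have typed, this avoids guessing how many layers of `eval` are needed, at the cost of constructing the call by hand.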
Re:
This is exactly my point. I think NSE is very useful for interactive use, but why not provide a SE version of `[.data.table` for non-interactive use so that programmers don't have to learn all the subtleties and gotchas, and protect against corner cases created by functionality that is most useful only in interactive mode?
I'm not sure how much this has to do with `lazyeval`, vs the following from the `lazyeval` vignette:
I'm not familiar with `lazyeval` (and had never heard of it until you mentioned it), but it seems to me that it's the philosophy of having a programmer friendly version (SE) of the NSE function that makes it work.
`dplyr` (partially) provides such escape hatches, which is what I was pointing out in the SO question.