@brodieG
Last active April 8, 2020 11:02
Corner Cases With Non-Standard Evaluation in data.table
# Because there is no way to tell data.table
# "interpret this variable as a column name", it's possible to come up
# with corner cases. I'll grant these are unlikely to occur in day
# to day use, but any function that uses `data.table` must account for
# them
# Low odds, and yes, there are workarounds, but this is
# what I mean by you have to think carefully to avoid
# corner cases
# Ex 1
library(data.table)
my.dt <- data.table(col=letters[1:5], col2=1:5)
fun <- mean
col <- "col2"
my.dt[, fun(get(col))] # here `col` resolves to the column, not to the variable set above
# this one in particular very unlikely, but illustrating a point
# Ex 2
mtcars.dt <- data.table(mtcars)
mtcars.dt[,`cyl,am`:= 1]
grp <- "cyl,am"
mtcars.dt[,mean(hp), by=grp]
grp <- "`cyl,am`"
mtcars.dt[,mean(hp), by=grp]
# This one actually works fine, but again, you have to be careful
# by signaling your intent with an expression instead of a symbol
# name, which is not at all intuitive to anyone familiar with R.
# The `get` solution is internally consistent, at least, though
# with the collision issue I highlighted earlier
# Ex 3
cols <- c("hp", "mpg")
fun <- mean
(data.table(mtcars)[, cols:=lapply(.SD, fun), .SDcols=cols])   # LHS is the literal name `cols`
(data.table(mtcars)[, (cols):=lapply(.SD, fun), .SDcols=cols]) # LHS is the value of `cols`: hp and mpg
# Let's try to group by expressions (to be fair, you can't
# really do this with `dplyr`)
# Ex 4
exp <- list(a=quote(gear %% 2), b=quote(cut(hp, 5)))
data.table(mtcars)[, mean(mpg), by=list(a=gear %% 2, b=cut(hp, 5))]
data.table(mtcars)[, mean(mpg), by=exp] # argh
# Ex 5
group_by_exp <- function(exp)
  data.table(mtcars)[, mean(mpg), by=eval(substitute(exp))]
group_by_exp(list(a=gear %% 2, b=cut(hp, 5))) # this kind of works
# Ex 6
exp.q <- quote(list(a=gear %% 2, b=cut(hp, 5)))
group_by_exp(exp.q) # argh
group_by_exp2 <- function(exp)
  data.table(mtcars)[, mean(mpg), by=eval(eval(substitute(exp)))]
group_by_exp2(exp.q) # now we're getting crazy...
data.table(mtcars)[, mean(mpg), by=exp.q] # this actually works, but is not documented
# Again, every one of these has workarounds, though they require
# some care. I'd like a version of `[.data.table` that allows me
# to very explicitly tell it how to interpret things so that I don't
# have to worry about funny corner cases due to the flexibility in
# data.table. Don't get me wrong, for the most part the flexibility
# is fantastic.
@arunsrinivasan

Thanks. It'd be nice if you could number your examples so that it's easy to refer to them.

  1. Could you please provide the dplyr equivalent for the first case: my.dt[, fun(get(col))]?
  2. mtcars.dt[,mean(hp), by=grp] - this is really a feature to avoid the "" while dealing with single columns in an interactive session. You should do: mtcars.dt[,mean(hp), by=c(grp)]. This is not something I'd consider as "being careful". But alright. Will think about this.
    While testing (sometime ago) I came across something similar in dplyr - shown here - my first reply.
  3. Again cols := and (cols) := are features to provide an easier way during interactive sessions. But alright. Will think about this as well.
  4. by=exp case is again the result of a feature. But maybe we can fix that.
  5. (and 6.) huh? exp.q <- quote(list(a=gear %% 2, b=cut(hp, 5))) - shouldn't you be doing eval(exp)? It's already an expression - and you wrap that with substitute.
foo <- function(x, exp) { eval(substitute(exp), x, parent.frame()) }
foo(mtcars, exp.q)
# list(a = gear%%2, b = cut(hp, 5))
bar <- function(x, exp) { eval(exp, x, parent.frame()) }
bar(mtcars, exp.q)
# gives the intended result.

To summarise:
Point 1 - I'd like to see a dplyr solution using lazyeval to see how else it could be done.
Points 2, 3 and 4 are basically the same issue, in that each is a feature that takes time to get used to. But like I said, I'll think about it / discuss with Matt.
Point 5 (and 6) work as intended.

In general, with NSE, you can always break intended functionality if you look carefully enough. What I'm curious about is why one doesn't need to be careful with lazyeval.

@brodieG
Author

brodieG commented Nov 6, 2014

Re 1:

my.df <- data.frame(col=letters[1:5], col2=1:5)
col <- "col2"
fun <- mean
my.df %>% summarise_each_(funs(fun), col)

That said, even if the above were not possible in dplyr, I think data.table should provide a mechanism for a programmer to supply a variable column name without first having to check that the variable name chosen in code doesn't also exist as a column name in the table (and, if it does, twisting themselves into pretzels to change the variable name, or erroring out and asking the user to rename the columns of the data table).
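
For the simple, ungrouped case, one way to sidestep the collision is to skip `[.data.table` entirely and extract the column with `[[`, which is ordinary list extraction by name and involves no non-standard evaluation. A minimal sketch, where `col_summary` is a hypothetical helper name and not part of data.table:

```r
library(data.table)

# Same setup as Ex 1: a column named `col` and a variable named `col`
my.dt <- data.table(col = letters[1:5], col2 = 1:5)
col <- "col2"
fun <- mean

# Hypothetical helper (not a data.table function): `[[` looks the column up
# by the string in `col.name`, so column names cannot shadow variables
col_summary <- function(dt, col.name, fun) {
  stopifnot(is.character(col.name), length(col.name) == 1L,
            col.name %in% names(dt))
  fun(dt[[col.name]])
}

res <- col_summary(my.dt, col, fun)
```

This does not extend to grouped operations, which is exactly where a standard-evaluation interface would be needed.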

Re point 2:
I appreciate the feature aspect of it for interactive use, but now if I'm writing a function and asking people to provide character column names, I need to check to make sure that they aren't using the "x,y" syntax, and if they are, check for a column called "x,y", and if they are, hope there aren't also "x" and "y" columns, and if there are, throw a warning alerting them to the possible ambiguity.

Note c(grp) does not fix the problem. The problem is the support of the "col1,col2,col3" syntax along with the support of c("col1", "col2", "col3") syntax. In my example c(grp) still groups by the two columns cyl and am, instead of the single column cyl,am which I created (and agree is completely contrived).
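
The kind of defensive check described above might look like the following sketch. `check_by_cols` is a hypothetical helper, not a data.table function, and rejecting every name that contains a comma is only one possible policy (an assumption made here for illustration). It inspects the names with base R and never passes them to `by`.

```r
library(data.table)

# Hypothetical validator for character grouping columns supplied by a caller
check_by_cols <- function(dt, cols) {
  stopifnot(is.character(cols))
  # Names containing a comma are ambiguous with the "x,y" shorthand
  with.comma <- cols[grepl(",", cols, fixed = TRUE)]
  if (length(with.comma)) {
    stop("grouping names must not contain commas: ",
         paste(sQuote(with.comma), collapse = ", "))
  }
  # Every remaining name must be an actual column
  unknown <- setdiff(cols, names(dt))
  if (length(unknown)) {
    stop("no such column(s): ", paste(sQuote(unknown), collapse = ", "))
  }
  invisible(cols)
}

mtcars.dt <- data.table(mtcars)
mtcars.dt[, `cyl,am` := 1]

ok  <- check_by_cols(mtcars.dt, c("cyl", "am"))
bad <- tryCatch(check_by_cols(mtcars.dt, "cyl,am"),
                error = function(e) "rejected")
```

The cost is that a legitimately named `cyl,am` column can no longer be used for grouping through this path, which is the trade-off the ambiguity forces.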

Re point 3:
Seems like this one has no possible ambiguity, so it's probably fine as is.

Re point 4-6:
You're right that I've probably overcomplicated some of the cases. That said, I don't think there is any documentation about how to programmatically pass grouping expressions (at least that I saw). I built these through trial and error.
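
One pattern that relies only on base R is to splice the quoted grouping expression into the call with `substitute()` and then evaluate the finished call, so that `[.data.table` sees the expression exactly as if it had been typed in place. A sketch under that assumption; `group_mean_mpg` and the placeholder `BY` are hypothetical names chosen for illustration:

```r
library(data.table)

group_mean_mpg <- function(dt, by.expr) {
  # Replace the placeholder symbol BY with the quoted expression, leaving
  # the rest of the call untouched, then evaluate the completed call here
  cl <- substitute(dt[, mean(mpg), by = BY], list(BY = by.expr))
  eval(cl)
}

exp.q <- quote(list(a = gear %% 2, b = cut(hp, 5)))
res <- group_mean_mpg(data.table(mtcars), exp.q)

# The same grouping typed out directly, for comparison
direct <- data.table(mtcars)[, mean(mpg),
                             by = list(a = gear %% 2, b = cut(hp, 5))]
```

Because the splice happens before data.table evaluates anything, the name `by.expr` can never collide with a column.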

Re:

In general, with NSE, you can always break intended functionality if you look carefully enough.

This is exactly my point. I think NSE is very useful for interactive use, but why not provide an SE version of [.data.table for non-interactive use, so that programmers don't have to learn all the subtleties and gotchas, and protect against corner cases created by functionality that is mostly useful only in interactive mode?

What I'm curious about is to know why one doesn't need to be careful with lazyeval.

I'm not sure how much this has to do with lazyeval, vs the following from the lazyeval vignette:

Every function that uses NSE should have a standard evaluation (SE) escape hatch that does the actual computation. The SE-function name should end with _. The SE-function has a flexible input specification to make it easy for people to program with.

I'm not familiar with lazyeval (and had never heard of it until you mentioned it), but it seems to me that it's the philosophy of having a programmer friendly version (SE) of the NSE function that makes it work.
dplyr (partially) provides such escape hatches, which is what I was pointing out in the SO question.

@brodieG
Author

brodieG commented Nov 6, 2014

Also, one comment re 2, 3, and 4-6 that's worth highlighting: the workarounds are all different. For one we need to use get, for another to use () or some such, and for the last we need to evaluate quoted expressions. Providing one SE version that handles all of this would make data.table much more accessible to programmers (as opposed to command-line users).

@brodieG
Author

brodieG commented Nov 6, 2014

Note: discussion is being continued on e-mail. Will report back with conclusions.

@wolkym

wolkym commented Oct 6, 2015

Any news?

@jangorecki

jangorecki commented Apr 8, 2020

AFAIU all those corner cases are addressed by Rdatatable/data.table#4304
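
As far as I know, that pull request is the one that added the `env` argument to `[.data.table`, released in data.table 1.15.0; treat both the version number and the exact behaviour shown here as assumptions to check against the installed documentation. Under that assumption, Ex 1 could be written with a placeholder symbol that is replaced by the column name held in a variable:

```r
library(data.table)

my.dt <- data.table(col = letters[1:5], col2 = 1:5)
col <- "col2"
fun <- mean

# Assumption: `env` is only available in data.table >= 1.15.0
if (packageVersion("data.table") >= "1.15.0") {
  # `v` is a placeholder; the string held in `col` is substituted in as a
  # column name before the expression is evaluated
  res <- my.dt[, fun(v), env = list(v = col)]
} else {
  # Fallback for older versions: plain list extraction, no NSE involved
  res <- fun(my.dt[[col]])
}
```

The argument `env = list(v = col)` is evaluated in the calling scope like any ordinary argument, so the column named `col` does not shadow the variable.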
