The goal of this document is to outline proposed new behavior for
DataFramesMeta and its integration with the new DataFrames.jl
"piping" functions, e.g. transform, select, combine, and filter.
- The most glaring limitation of the current behavior of DataFramesMeta is its undefined behavior. Consider the following:
using DataFrames, DataFramesMeta
df = DataFrame(
    g = [1, 1, 1, 2, 2],
    i = 1:5,
    t = ["a", "b", "c", "c", "e"],
    y = [:v, :w, :x, :y, :z],
    c = [:g, :quote, :body, :transform, missing]
)
@transform(df, Not(:i))
5×6 DataFrames.DataFrame
│ Row │ g │ i │ t │ y │ c │ Not │
│ │ Int64 │ Int64 │ String │ Symbol │ Symbol? │ Int64 │
├─────┼───────┼───────┼────────┼────────┼───────────┼───────┤
│ 1 │ 1 │ 1 │ a │ v │ g │ 1 │
│ 2 │ 1 │ 2 │ b │ w │ quote │ 2 │
│ 3 │ 1 │ 3 │ c │ x │ body │ 3 │
│ 4 │ 2 │ 4 │ c │ y │ transform │ 4 │
│ 5 │ 2 │ 5 │ e │ z │ missing │ 5 │
Additionally, many features in DataFrames.transform are simply not available.
- Variables 2. It's not possible to programmatically generate a column with a new name that's stored in a variable.
newname = :a_new_variable
transform(df, :i => identity => newname)
5×6 DataFrames.DataFrame
│ Row │ g │ i │ t │ y │ c │ a_new_variable │
│ │ Int64 │ Int64 │ String │ Symbol │ Symbol? │ Int64 │
├─────┼───────┼───────┼────────┼────────┼───────────┼────────────────┤
│ 1 │ 1 │ 1 │ a │ v │ g │ 1 │
│ 2 │ 1 │ 2 │ b │ w │ quote │ 2 │
│ 3 │ 1 │ 3 │ c │ x │ body │ 3 │
│ 4 │ 2 │ 4 │ c │ y │ transform │ 4 │
│ 5 │ 2 │ 5 │ e │ z │ missing │ 5 │
compared with
@transform(df, cols(newname) = :i)
which errors.
- Variables 1. Using a variable that is a vector of symbols representing columns.
n = [:g, :i]
transform(df, n => ByRow(+) => :newvar)
5×6 DataFrames.DataFrame
│ Row │ g │ i │ t │ y │ c │ newvar │
│ │ Int64 │ Int64 │ String │ Symbol │ Symbol? │ Int64 │
├─────┼───────┼───────┼────────┼────────┼───────────┼────────┤
│ 1 │ 1 │ 1 │ a │ v │ g │ 2 │
│ 2 │ 1 │ 2 │ b │ w │ quote │ 3 │
│ 3 │ 1 │ 3 │ c │ x │ body │ 4 │
│ 4 │ 2 │ 4 │ c │ y │ transform │ 6 │
│ 5 │ 2 │ 5 │ e │ z │ missing │ 7 │
In DataFramesMeta, using a vector of names evaluates to a DataFrame:
function show_intermediate(x)
    print(x)
    return fill(1, nrow(x))
end
@transform(df, y = show_intermediate(cols(n)))
5×5 DataFrames.DataFrame
│ Row │ g │ i │ t │ y │ c │
│ │ Int64 │ Int64 │ String │ Int64 │ Symbol? │
├─────┼───────┼───────┼────────┼───────┼───────────┤
│ 1 │ 1 │ 1 │ a │ 1 │ g │
│ 2 │ 1 │ 2 │ b │ 1 │ quote │
│ 3 │ 1 │ 3 │ c │ 1 │ body │
│ 4 │ 2 │ 4 │ c │ 1 │ transform │
│ 5 │ 2 │ 5 │ e │ 1 │ missing │
Literate.jl doesn't actually show the intermediate value, but you can try it for yourself. I'm not sure how many users are aware of this behavior or rely on it.
- Performance. None of the implementations in DataFramesMeta actually use DataFrames in the backend. Milan is working on multithreading for DataFrames and we want to be able to benefit from that infrastructure.
- Naming. Currently we have @based_on instead of @combine and @where instead of @filter. This probably causes confusion for new users.
The following is a small implementation that should supersede @transform, @select, and @based_on.
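# cols is the identity at run time, so cols(name) in the generated code
# evaluates to whatever name holds.
cols(x) = x

function make_vec_to_fun(kw::Expr)
    if kw.head == :(=) || kw.head == :kw
        output = kw.args[1]
        membernames = Dict{Any, Symbol}()
        funname = gensym()
        # Replace each :col on the right-hand side with a gensym-ed variable.
        body = DataFramesMeta.replace_syms!(kw.args[2], membernames)
        if kw.args[1] isa Symbol
            # z = ... : quote the bare name so it becomes the column :z.
            t = quote
                $(Expr(:vect, keys(membernames)...)) => function $funname($(values(membernames)...))
                    $body
                end => $(QuoteNode(output))
            end
        elseif kw.args[1] isa QuoteNode || DataFramesMeta.onearg(kw.args[1], :cols)
            # :z = ... or cols(name) = ... : pass the left-hand side through.
            t = quote
                $(Expr(:vect, keys(membernames)...)) => function $funname($(values(membernames)...))
                    $body
                end => $(output)
            end
        else
            # Without this branch, t would be undefined at the return below.
            throw(ArgumentError("unsupported left-hand side: $(kw.args[1])"))
        end
        return t
    else
        # Anything that is not an assignment is handed to DataFrames unchanged.
        return kw
    end
end

function make_vec_to_fun(kw::QuoteNode)
    return kw
end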
function transform_helper2(x, args...)
    t = [make_vec_to_fun(arg) for arg in args]
    quote
        $DataFrames.transform($x, $(t...))
    end
end

macro transform2(x, args...)
    esc(transform_helper2(x, args...))
end

function based_on_helper2(x, args...)
    t = [make_vec_to_fun(arg) for arg in args]
    quote
        $DataFrames.combine($x, $(t...))
    end
end

macro based_on2(x, args...)
    esc(based_on_helper2(x, args...))
end

function select_helper2(x, args...)
    t = [make_vec_to_fun(arg) for arg in args]
    quote
        $DataFrames.select($x, $(t...))
    end
end

macro select2(x, args...)
    esc(select_helper2(x, args...))
end
@select2 (macro with 1 method)
These macros do the following.
- Take an expression. If it is of the form :z = :x + :y, we branch to special DataFramesMeta parsing. Otherwise we don't do anything special and parse normally.
- If it is of the form :z = :x + :y, we use existing DataFramesMeta tools to make a dictionary of symbols to gensym-ed variables and a function. From this dictionary of symbols, we construct the expression
[:x, :y] => function(x1, x2) x1 + x2 end => :z
You can see this transformation when you evaluate the following
df = DataFrame(x = [1, 2], y = [3, 4])
@transform2(df, :z = :x + :y)
2×3 DataFrames.DataFrame
│ Row │ x │ y │ z │
│ │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1 │ 1 │ 3 │ 4 │
│ 2 │ 2 │ 4 │ 6 │
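The rewriting step can be illustrated without DataFramesMeta at all. The sketch below uses a hypothetical helper, replace_quoted!, as a simplified stand-in for DataFramesMeta.replace_syms! (it is not the package's implementation), and assembles the function by hand from the parsed expression.

```julia
# Hypothetical, simplified stand-in for DataFramesMeta.replace_syms!.
# Walks an expression, swaps every quoted symbol (:x) for a gensym-ed
# variable, and records which column maps to which variable.
function replace_quoted!(e, colnames::Vector{Symbol}, varnames::Vector{Symbol})
    if e isa QuoteNode && e.value isa Symbol
        i = findfirst(==(e.value), colnames)
        if i === nothing
            push!(colnames, e.value)
            push!(varnames, gensym(e.value))
            i = length(colnames)
        end
        return varnames[i]
    elseif e isa Expr
        return Expr(e.head, [replace_quoted!(a, colnames, varnames) for a in e.args]...)
    else
        return e
    end
end

ex = Meta.parse(":z = :x + :y")   # the same expression a macro would receive
colnames, varnames = Symbol[], Symbol[]
body = replace_quoted!(ex.args[2], colnames, varnames)

# Assemble (x1, x2) -> x1 + x2 and evaluate it into a callable function.
fun = eval(Expr(:->, Expr(:tuple, varnames...), body))
output = ex.args[1].value         # the target column name

# The three pieces of the pair  colnames => fun => output
println(colnames, " => fun => ", repr(output))
println(Base.invokelatest(fun, [1, 2], [3, 4]))
```

Two vectors are used here instead of a Dict so that the column order and the argument order visibly line up; the implementation above relies on keys and values of the same Dict iterating in matching order.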
For robustness, I also allow for the traditional "keyword argument" syntax @transform(df, z = :x + :y). This brings me to my next point:
In dplyr and Stata, you don't need to refer to a column name as a string;
you can simply write
mutate(df, z = x + y)
Adding the extra : in DataFramesMeta is admittedly a pain. Is there a consensus
that we should deprecate requiring : everywhere in favor of symbols as
literals?
First, note that a major benefit of the proposed @select is that
it allows you to use "normal" DataFrames behavior alongside DataFramesMeta
behavior.
@select2(df, Not(:x), :z = :y)
2×2 DataFrames.DataFrame
│ Row │ y │ z │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 3 │ 3 │
│ 2 │ 4 │ 4 │
Because of that, I think it would be awkward to have some arguments parse as "literal as variable" and others parse as "variable as variable", for example
@select(df, :x, z = y)
On the other hand, we could search the whole expression for symbol
literals, except for those prefaced by cols or $, and apply
replace_syms! on those. This would ensure something like
@select2(df, z = y + x, Between(a, q), cols(t))
would work. In the above example, z, y, x, a, and q are
parsed as symbols: they mean the same columns :z, :y, etc.
- Is this feasible? I think so. We just replace literals with QuoteNodes and do a second parsing step. Or better yet, change replace_syms! to work with Symbols (i.e. literals in the code) instead of QuoteNodes.
- Would this behavior make it very hard to put DataFramesMeta calls into functions? If so, I wouldn't want to do it. Allowing variables to represent columns easily is the major benefit of Stata over dplyr. I would rather make it easier to put into functions than have the convenience of not writing :.
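As a rough feasibility check for the first point, the sketch below walks an expression and turns bare symbols into QuoteNodes, skipping function names, cols(...) calls, and $-escaped terms. The helper name quote_bare_symbols is hypothetical, and the rules it applies (for instance, leaving the function position of every call alone) are assumptions rather than settled design.

```julia
# Hypothetical helper: convert bare symbols such as y into quoted symbols :y,
# leaving function names, cols(...) calls, and $-escaped terms untouched.
function quote_bare_symbols(e)
    if e isa Symbol
        return QuoteNode(e)
    elseif e isa Expr && e.head == :$
        return e
    elseif e isa Expr && e.head == :call
        e.args[1] == :cols && return e
        # Keep the function name (e.g. +, Between) as an ordinary identifier.
        return Expr(:call, e.args[1], map(quote_bare_symbols, e.args[2:end])...)
    elseif e isa Expr
        return Expr(e.head, map(quote_bare_symbols, e.args)...)
    else
        return e
    end
end

# Print each argument of the example call before and after the rewrite.
for s in ["z = y + x", "Between(a, q)", "cols(t)"]
    println(s, "  becomes  ", quote_bare_symbols(Meta.parse(s)))
end
```

This sketch treats every bare symbol in argument position as a column, so a local variable used inside an expression would also be quoted. That is exactly the concern raised in the second point about putting calls into functions.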
This proposed implementation faces a major challenge. We haven't solved
the problem of working with multiple arguments in cols. The following
code works
@select2(df, :z = sum([:x, :y]))
2×1 DataFrames.DataFrame
│ Row │ z │
│ │ Int64 │
├─────┼───────┤
│ 1 │ 4 │
│ 2 │ 6 │
However, we can't simply write
n = [:x, :y]
@select2(df, :z = :x + sum(cols(n)))
The problem is that this evaluates to
select(df, [:x, [:x, :y]] => fun => :z)
which is not currently allowed in DataFrames. All elements of the
input vector need to be either all Symbols or all Strings.
Possible solutions to this are
- Add a pre-processing step where we walk through the expression and look for all the cols calls, evaluate them in the local scope, and then replace them in the expression tree with [:x, :y] or whatever else the contents of cols evaluate to. Is this even possible with macros? Not sure.
- Change DataFrames.jl to allow multiple types of inputs. I discuss this in this issue. One type-stable option is to make every non-singleton input in cols evaluate to an AsTable object.
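On the first option, one thing that can be said with confidence is that a macro receives its arguments as unevaluated expressions, so the value of a local variable such as n is not visible while the macro runs. The sketch below demonstrates this with a hypothetical macro, @arg_type, which is only an illustration and not part of DataFramesMeta. Any substitution of cols(n) would therefore probably have to happen in the generated code, at run time, rather than inside the macro itself.

```julia
# Hypothetical macro, for illustration only: report what the macro itself
# receives for its argument at expansion time.
macro arg_type(x)
    return typeof(x)
end

n = [:x, :y]

println(@arg_type(n))          # the macro sees the name n, not the vector
println(@arg_type(cols(n)))    # the macro sees a call expression
println(typeof(n))             # the value only exists at run time
```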
@byrow! has an annoying limitation where you have to declare the type
of the new column when you want to make a new column.
@byrow! df begin
    @newcol z::Vector{Float64}
    :z = :x + :y
end
2×3 DataFrames.DataFrame
│ Row │ x │ y │ z │
│ │ Int64 │ Int64 │ Float64 │
├─────┼───────┼───────┼─────────┤
│ 1 │ 1 │ 3 │ 4.0 │
│ 2 │ 2 │ 4 │ 6.0 │
This is clunky and probably intimidating to new users. Additionally,
the current implementation makes using cols with multiple columns
difficult.
In conclusion, I hope this is a fruitful document that leads to a good discussion.
This page was generated using Literate.jl.