@pdeffebach
Last active July 31, 2020 14:16
Proposal for new DataFramesMeta behavior

New DataFramesMeta.jl behavior

Goal

The goal of this document is to outline proposed new behavior for DataFramesMeta and its integration with the new DataFrames.jl "piping" functions, e.g. transform, select, combine, and filter.

Limitations of current implementation

  1. The most glaring limitation of the current behavior of DataFramesMeta is its undefined behavior. Consider the following, where Not(:i) silently produces a new column named Not:
using DataFrames, DataFramesMeta

df = DataFrame(
        g = [1, 1, 1, 2, 2],
        i = 1:5,
        t = ["a", "b", "c", "c", "e"],
        y = [:v, :w, :x, :y, :z],
        c = [:g, :quote, :body, :transform, missing]
    )

 @transform(df, Not(:i))
5×6 DataFrames.DataFrame
│ Row │ g     │ i     │ t      │ y      │ c         │ Not   │
│     │ Int64 │ Int64 │ String │ Symbol │ Symbol?   │ Int64 │
├─────┼───────┼───────┼────────┼────────┼───────────┼───────┤
│ 1   │ 1     │ 1     │ a      │ v      │ g         │ 1     │
│ 2   │ 1     │ 2     │ b      │ w      │ quote     │ 2     │
│ 3   │ 1     │ 3     │ c      │ x      │ body      │ 3     │
│ 4   │ 2     │ 4     │ c      │ y      │ transform │ 4     │
│ 5   │ 2     │ 5     │ e      │ z      │ missing   │ 5     │

Additionally, many features in DataFrames.transform are simply not available.

  2. Variables 2. It's not possible to programmatically generate a column whose name is stored in a variable. In DataFrames.jl this works:
newname = :a_new_variable
transform(df, :i => identity => newname)
5×6 DataFrames.DataFrame
│ Row │ g     │ i     │ t      │ y      │ c         │ a_new_variable │
│     │ Int64 │ Int64 │ String │ Symbol │ Symbol?   │ Int64          │
├─────┼───────┼───────┼────────┼────────┼───────────┼────────────────┤
│ 1   │ 1     │ 1     │ a      │ v      │ g         │ 1              │
│ 2   │ 1     │ 2     │ b      │ w      │ quote     │ 2              │
│ 3   │ 1     │ 3     │ c      │ x      │ body      │ 3              │
│ 4   │ 2     │ 4     │ c      │ y      │ transform │ 4              │
│ 5   │ 2     │ 5     │ e      │ z      │ missing   │ 5              │

compared with

@transform(df, cols(newname) = :i)

which errors.

  3. Variables 1. Using a variable that is a vector of symbols representing columns.
n = [:g, :i]
transform(df, n => ByRow(+) => :newvar)
5×6 DataFrames.DataFrame
│ Row │ g     │ i     │ t      │ y      │ c         │ newvar │
│     │ Int64 │ Int64 │ String │ Symbol │ Symbol?   │ Int64  │
├─────┼───────┼───────┼────────┼────────┼───────────┼────────┤
│ 1   │ 1     │ 1     │ a      │ v      │ g         │ 2      │
│ 2   │ 1     │ 2     │ b      │ w      │ quote     │ 3      │
│ 3   │ 1     │ 3     │ c      │ x      │ body      │ 4      │
│ 4   │ 2     │ 4     │ c      │ y      │ transform │ 6      │
│ 5   │ 2     │ 5     │ e      │ z      │ missing   │ 7      │

In DataFramesMeta, cols applied to a vector of names evaluates to a DataFrame:

function show_intermediate(x)
	print(x)
	return fill(1, nrow(x))
end

@transform(df, y = show_intermediate(cols(n)))
5×5 DataFrames.DataFrame
│ Row │ g     │ i     │ t      │ y     │ c         │
│     │ Int64 │ Int64 │ String │ Int64 │ Symbol?   │
├─────┼───────┼───────┼────────┼───────┼───────────┤
│ 1   │ 1     │ 1     │ a      │ 1     │ g         │
│ 2   │ 1     │ 2     │ b      │ 1     │ quote     │
│ 3   │ 1     │ 3     │ c      │ 1     │ body      │
│ 4   │ 2     │ 4     │ c      │ 1     │ transform │
│ 5   │ 2     │ 5     │ e      │ 1     │ missing   │

Literate.jl doesn't actually show the intermediate value, but you can try it for yourself. I'm not sure how many users are aware of this behavior or rely on it.

  4. Performance. None of the implementations in DataFramesMeta actually use DataFrames in the backend. Milan is working on multithreading for DataFrames and we want to be able to benefit from that infrastructure.

  5. Naming. Currently we have @based_on instead of @combine and @where instead of @filter. This probably causes confusion for new users.

Proposal

The following is a small implementation that should supersede @transform, @select, and @based_on.

cols(x) = x

function make_vec_to_fun(kw::Expr)

    if kw.head == :(=) || kw.head == :kw
        output = kw.args[1]

        membernames = Dict{Any, Symbol}()
        funname = gensym()
        body = DataFramesMeta.replace_syms!(kw.args[2], membernames)
        if kw.args[1] isa Symbol
            t = quote
                $(Expr(:vect, keys(membernames)...)) => function $funname($(values(membernames)...))
                    $body
                end => $(QuoteNode(output))
            end
        elseif kw.args[1] isa QuoteNode || DataFramesMeta.onearg(kw.args[1], :cols)
            t = quote
                $(Expr(:vect, keys(membernames)...)) => function $funname($(values(membernames)...))
                    $body
                end => $(output)
            end
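        else
            # Fail at macro-expansion time instead of leaving `t` undefined.
            throw(ArgumentError("left-hand side must be a Symbol, a quoted Symbol, or a cols(...) call"))
        end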
        return t
    else
        return kw
    end
end

function make_vec_to_fun(kw::QuoteNode)
    return kw
end

function transform_helper2(x, args...)

    t = [make_vec_to_fun(arg) for arg in args]

    quote
        $DataFrames.transform($x, $(t...))
    end
end

macro transform2(x, args...)
    esc(transform_helper2(x, args...))
end

function based_on_helper2(x, args...)

    t = [make_vec_to_fun(arg) for arg in args]

    quote
        $DataFrames.combine($x, $(t...))
    end
end

macro based_on2(x, args...)
    esc(based_on_helper2(x, args...))
end

function select_helper2(x, args...)
    t = [make_vec_to_fun(arg) for arg in args]

    quote
        $DataFrames.select($x, $(t...))
    end
end

macro select2(x, args...)
    esc(select_helper2(x, args...))
end
@select2 (macro with 1 method)

These macros do the following.

  1. Take an expression. If it is of the form :z = :x + :y, we branch to special DataFramesMeta parsing. Otherwise we don't do anything special and pass the argument through to DataFrames unchanged.

  2. If it is of the form :z = :x + :y, we use existing DataFramesMeta tools to make a dictionary of symbols to gensym-ed variables and a function. From this dictionary of symbols, we construct the expression

[:x, :y] => function(x1, x2) x1 + x2 end => :z

You can see this transformation when you evaluate the following

df = DataFrame(x = [1, 2], y = [3, 4])
@transform2(df, :z = :x + :y)
2×3 DataFrames.DataFrame
│ Row │ x     │ y     │ z     │
│     │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1   │ 1     │ 3     │ 4     │
│ 2   │ 2     │ 4     │ 6     │
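
The rewriting step described above can be sketched in plain Julia, without DataFrames. This is only an illustration under assumptions: replace_quoted and to_pair_expr are hypothetical names standing in for DataFramesMeta.replace_syms! and make_vec_to_fun, and an anonymous function is used in place of the gensym-named one.

```julia
# Hypothetical stand-in for DataFramesMeta.replace_syms!: swap every quoted
# symbol for a generated variable name, recording column => variable pairs.
function replace_quoted(ex, names::Vector{Pair{Symbol, Symbol}})
    if ex isa QuoteNode && ex.value isa Symbol
        i = findfirst(p -> first(p) == ex.value, names)
        i === nothing || return last(names[i])   # reuse the name for a repeated column
        var = gensym(ex.value)
        push!(names, ex.value => var)
        return var
    elseif ex isa Expr
        return Expr(ex.head, (replace_quoted(a, names) for a in ex.args)...)
    else
        return ex
    end
end

# Turn the parsed form of `:z = :x + :y` into `[:x, :y] => f => :z`.
function to_pair_expr(ex::Expr)
    ex.head == :(=) || return ex                 # anything else passes through
    names = Pair{Symbol, Symbol}[]
    body = replace_quoted(ex.args[2], names)
    inputs = Expr(:vect, (QuoteNode(first(p)) for p in names)...)
    fun = Expr(:->, Expr(:tuple, (last(p) for p in names)...), body)
    return :($inputs => $fun => $(ex.args[1]))
end

ex = Meta.parse(":z = :x + :y")                  # what the macro receives
pair_expr = to_pair_expr(ex)
src, (fun, dest) = eval(pair_expr)               # a Pair of the shape DataFrames accepts
```

Calling fun([1, 2], [3, 4]) applies the generated function to two column vectors directly, which is roughly what transform does with the source columns named in src.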

For robustness, I also allow for the traditional "keyword argument" syntax @transform(df, z = :x + :y). This brings me to my next point:

Variables and Symbols

In dplyr and Stata, you don't need to refer to a column name as a string, you can simply write

mutate(df, z = x + y)

Adding the extra : in DataFramesMeta is admittedly a pain. Is there a consensus that we should deprecate requiring : everywhere in favor of symbols as literals?

First, note that a major benefit of the proposed @select is that it allows you to use "normal" DataFrames behavior alongside DataFramesMeta behavior.

@select2(df, Not(:x), :z = :y)
2×2 DataFrames.DataFrame
│ Row │ y     │ z     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 3     │ 3     │
│ 2   │ 4     │ 4     │

Because of that, I think it would be awkward to have some arguments parse as "literal as variable" and others parse as "variable as variable", for example

@select(df, :x, z = y)

On the other hand, we could search the whole expression for symbol literals, except for those prefaced by cols or $ and apply replace_syms! on those. This would ensure something like

@select2(df, z = y + x, Between(a, q), cols(t))

would work. In the above example, z, y, x, a, and q are parsed as symbols: they refer to the columns :z, :y, etc.

  1. Is this feasible? I think so. We just replace literals with QuoteNodes and do a second parsing step. Or better yet, change replace_syms! to work with Symbols (i.e. literals in the code) instead of QuoteNodes.

  2. Would this behavior make it very hard to put DataFramesMeta calls into functions? If so, I wouldn't want to do it. Allowing variables to represent columns easily is the major benefit over Stata and dplyr. I would rather make it easier to put calls into functions than have the convenience of not writing :.
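
To make the two questions above concrete, here is a minimal sketch of walking an expression for bare identifiers while leaving cols(...) alone. bare_columns! is a hypothetical helper, not an existing DataFramesMeta function, and it handles only cols, not $.

```julia
# Hypothetical helper, not part of DataFramesMeta: gather the bare identifiers
# in an expression that would be read as column names.
function bare_columns!(found::Vector{Symbol}, ex)
    if ex isa Symbol
        ex in found || push!(found, ex)
    elseif ex isa Expr
        # Leave cols(...) alone so its argument keeps ordinary variable meaning.
        ex.head == :call && ex.args[1] == :cols && return found
        # Skip the callee of a call, otherwise `+` or `f` would count as columns.
        args = ex.head == :call ? ex.args[2:end] : ex.args
        for a in args
            bare_columns!(found, a)
        end
    end
    return found
end

rhs = Meta.parse("y + x * f(cols(t))")
columns = bare_columns!(Symbol[], rhs)
```

The sketch also shows where the second concern comes from: every bare identifier is collected, so an ordinary local variable used in the expression would be treated as a column unless it is wrapped in cols.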

Working with multiple columns

This proposed implementation faces a major challenge. We haven't solved the problem of working with multiple arguments in cols. The following code works

@select2(df, :z = sum([:x, :y]))
2×1 DataFrames.DataFrame
│ Row │ z     │
│     │ Int64 │
├─────┼───────┤
│ 1   │ 4     │
│ 2   │ 6     │

However, we can't simply write

n = [:x, :y]
@select2(df, :z = :x + sum(cols(n)))

The problem is that this evaluates to

select(df, [:x, [:x, :y]] => fun => :z)

which is not currently allowed in DataFrames. All elements of the input vector need to be either all Symbols or all Strings. Possible solutions to this are

  1. Add a pre-processing step where we walk through the expression, look for all the cols calls, evaluate them in the local scope, and then replace them in the expression tree with [:x, :y] or whatever else the argument of cols evaluates to.

    Is this even possible with macros? Not sure.

  2. Change DataFrames.jl to allow multiple types of inputs. I discuss this in this issue (https://github.com/JuliaData/DataFrames.jl/issues/2328). One type-stable option is to make every non-singleton input in cols evaluate to an AsTable object.
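
On the question raised under option 1: a macro runs on syntax before the surrounding code executes, so it cannot look up the value of a local variable such as n. What it can do is emit code that flattens the names at run time. The sketch below shows both points in plain Julia. @captured and flatten_names are hypothetical names, and the sketch leaves open how the generated function should receive the flattened columns.

```julia
# A macro receives syntax, not values: `cols(n)` arrives as an expression.
macro captured(ex)
    return QuoteNode(ex)    # hand the unevaluated expression back to the caller
end

# Hypothetical run-time helper: splice vectors of names into one flat vector.
function flatten_names(items...)
    out = Symbol[]
    for item in items
        item isa Symbol ? push!(out, item) : append!(out, item)
    end
    return out
end

cols(x) = x                          # same placeholder definition as in the proposal
n = [:x, :y]

seen = @captured cols(n)             # the macro only ever sees this expression
flat = flatten_names(:x, cols(n))    # evaluated at run time, when `n` has a value
```

Here seen is the expression :(cols(n)), while flat is a homogeneous Vector{Symbol} of the kind DataFrames accepts as a source.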

Other topics

@byrow! has an annoying limitation where you have to declare the type of the new column when you want to make a new column.

@byrow! df begin
    @newcol z::Vector{Float64}
    :z = :x + :y
end
2×3 DataFrames.DataFrame
│ Row │ x     │ y     │ z       │
│     │ Int64 │ Int64 │ Float64 │
├─────┼───────┼───────┼─────────┤
│ 1   │ 1     │ 3     │ 4.0     │
│ 2   │ 2     │ 4     │ 6.0     │

This is clunky and probably intimidating to new users. Additionally, the current implementation makes using cols with multiple columns difficult.
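
For comparison, the same row-wise sum can be written with the DataFrames.jl functions already used earlier in this document. This is a sketch of the existing transform and ByRow pattern, not a proposed replacement for @byrow!. Note that the inferred column type here is an integer type rather than the Float64 declared above.

```julia
using DataFrames

df = DataFrame(x = [1, 2], y = [3, 4])

# ByRow wraps `+` so it is applied to each row of :x and :y; the element type
# of :z is inferred from the results, with no @newcol declaration.
out = transform(df, [:x, :y] => ByRow(+) => :z)
```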

In conclusion, I hope this is a fruitful document that leads to a good discussion.


This page was generated using Literate.jl.

# # New DataFramesMeta.jl behavior
# ## Goal
# The goal of this document is to outline proposed new behavior for
# DataFramesMeta and its integration with the new DataFrames.jl
# "piping" functions, e.g. `transform`, `select`, `combine`, and `filter`.
# ## Limitations of current implementation
#
# 1. The most glaring limitation of the current behavior of DataFramesMeta is
# its undefined behavior. Consider the following, where `Not(:i)` silently
# produces a new column named `Not`:
using DataFrames, DataFramesMeta
df = DataFrame(
    g = [1, 1, 1, 2, 2],
    i = 1:5,
    t = ["a", "b", "c", "c", "e"],
    y = [:v, :w, :x, :y, :z],
    c = [:g, :quote, :body, :transform, missing]
)
@transform(df, Not(:i))
# Additionally, many features in DataFrames.transform are simply
# not available.
#
# 2. Variables 2. It's not possible to programmatically generate a column
# whose name is stored in a variable. In DataFrames.jl this works:
newname = :a_new_variable
transform(df, :i => identity => newname)
# compared with
#
# ```
# @transform(df, cols(newname) = :i)
# ```
#
# which errors.
# 3. Variables 1. Using a variable that is a vector of symbols representing
# columns.
n = [:g, :i]
transform(df, n => ByRow(+) => :newvar)
# In DataFramesMeta, `cols` applied to a vector of names evaluates to a DataFrame:
function show_intermediate(x)
    print(x)
    return fill(1, nrow(x))
end
@transform(df, y = show_intermediate(cols(n)))
# Literate.jl doesn't actually show the intermediate value, but you can try it
# for yourself. I'm not sure how many users are aware of this behavior
# or rely on it.
# 4. Performance. None of the implementations in DataFramesMeta actually
# use DataFrames in the backend. Milan is working on multithreading for
# DataFrames and we want to be able to benefit from that infrastructure.
# 5. Naming. Currently we have `@based_on` instead of `@combine` and
# `@where` instead of `@filter`. This probably causes confusion for new
# users.
# ## Proposal
#
# The following is a small implementation that should supersede `@transform`,
# `@select`, and `@based_on`.
cols(x) = x

function make_vec_to_fun(kw::Expr)
    if kw.head == :(=) || kw.head == :kw
        output = kw.args[1]
        membernames = Dict{Any, Symbol}()
        funname = gensym()
        body = DataFramesMeta.replace_syms!(kw.args[2], membernames)
        if kw.args[1] isa Symbol
            t = quote
                $(Expr(:vect, keys(membernames)...)) => function $funname($(values(membernames)...))
                    $body
                end => $(QuoteNode(output))
            end
        elseif kw.args[1] isa QuoteNode || DataFramesMeta.onearg(kw.args[1], :cols)
            t = quote
                $(Expr(:vect, keys(membernames)...)) => function $funname($(values(membernames)...))
                    $body
                end => $(output)
            end
        else
            # Fail at macro-expansion time instead of leaving `t` undefined.
            throw(ArgumentError("left-hand side must be a Symbol, a quoted Symbol, or a cols(...) call"))
        end
        return t
    else
        return kw
    end
end

function make_vec_to_fun(kw::QuoteNode)
    return kw
end

function transform_helper2(x, args...)
    t = [make_vec_to_fun(arg) for arg in args]
    quote
        $DataFrames.transform($x, $(t...))
    end
end

macro transform2(x, args...)
    esc(transform_helper2(x, args...))
end

function based_on_helper2(x, args...)
    t = [make_vec_to_fun(arg) for arg in args]
    quote
        $DataFrames.combine($x, $(t...))
    end
end

macro based_on2(x, args...)
    esc(based_on_helper2(x, args...))
end

function select_helper2(x, args...)
    t = [make_vec_to_fun(arg) for arg in args]
    quote
        $DataFrames.select($x, $(t...))
    end
end

macro select2(x, args...)
    esc(select_helper2(x, args...))
end
# These macros do the following.
#
# 1. Take an expression. If it is of the form `:z = :x + :y`, we branch
# to special DataFramesMeta parsing. Otherwise we don't do anything special
# and pass the argument through to DataFrames unchanged.
#
# 2. If it is of the form `:z = :x + :y`, we use existing DataFramesMeta tools
# to make a dictionary of symbols to `gensym`-ed variables and a function.
# From this dictionary of symbols, we construct the expression
#
# ```
# [:x, :y] => function(x1, x2) x1 + x2 end => :z
# ```
#
# You can see this transformation when you evaluate the following
df = DataFrame(x = [1, 2], y = [3, 4])
@transform2(df, :z = :x + :y)
# For robustness, I also allow for the traditional "keyword argument" syntax
# `@transform(df, z = :x + :y)`. This brings me to my next point:
# ## Variables and Symbols
#
# In `dplyr` and Stata, you don't need to refer to a column name as a string,
# you can simply write
#
# ```
# mutate(df, z = x + y)
# ```
#
# Adding the extra `:` in DataFramesMeta is admittedly a pain. Is there a consensus
# that we should deprecate requiring `:` everywhere in favor of symbols as
# literals?
#
# First, note that a major benefit of the proposed `@select` is that
# it allows you to use "normal" DataFrames behavior alongside DataFramesMeta
# behavior.
@select2(df, Not(:x), :z = :y)
# Because of that, I think it would be awkward to have some arguments parse
# as "literal as variable" and others parse as "variable as variable", for example
#
# ```
# @select(df, :x, z = y)
# ```
# On the other hand, we could search the whole expression for symbol
# literals, except for those prefaced by `cols` or `$` and apply
# `replace_syms!` on those. This would ensure something like
#
# ```
# @select2(df, z = y + x, Between(a, q), cols(t))
# ```
#
# would work. In the above example, `z`, `y`, `x`, `a`, and `q` are
# parsed as symbols: they refer to the columns `:z`, `:y`, etc.
#
# 1. Is this feasible? I think so. We just replace literals with `QuoteNode`s and do
# a second parsing step. Or better yet, change `replace_syms!` to work with
# `Symbols` (i.e. literals in the code) instead of `QuoteNode`s.
#
# 2. Would this behavior make it very hard to put DataFramesMeta
# calls into functions? If so, I wouldn't want to do it. Allowing variables to represent
# columns easily is *the* major benefit over Stata and `dplyr`. I would rather make
# it easier to put calls into functions than have the convenience of not writing `:`.
#
# ## Working with multiple columns
#
# This proposed implementation faces a major challenge. We haven't solved
# the problem of working with multiple arguments in `cols`. The following
# code works
@select2(df, :z = sum([:x, :y]))
# However, we can't simply write
#
# ```
# n = [:x, :y]
# @select2(df, :z = :x + sum(cols(n)))
# ```
#
# The problem is that this evaluates to
#
# ```
# select(df, [:x, [:x, :y]] => fun => :z)
# ```
#
#
# which is not currently allowed in DataFrames. All elements of the
# input vector need to be either all `Symbol`s or all `String`s.
# Possible solutions to this are
#
# 1. Add a pre-processing step where we walk through the expression,
# look for all the `cols` calls, evaluate them in the local scope,
# and then replace them in the expression tree with `[:x, :y]` or
# whatever else the argument of `cols` evaluates to.
#
# Is this even possible with macros? Not sure.
#
# 2. Change DataFrames.jl to allow multiple types of inputs. I discuss this
# in [this issue](https://github.com/JuliaData/DataFrames.jl/issues/2328).
# One type-stable option is to make every non-singleton input in `cols`
# evaluate to an `AsTable` object.
#
# ## Other topics
#
# `@byrow!` has an annoying limitation where you have to declare the type
# of the new column when you want to make a new column.
@byrow! df begin
    @newcol z::Vector{Float64}
    :z = :x + :y
end
# This is clunky and probably intimidating to new users. Additionally,
# the current implementation makes using `cols` with multiple columns
# difficult.
#
# In conclusion, I hope this is a fruitful document that leads to a good
# discussion.