Skip to content

Instantly share code, notes, and snippets.

@Vindaar
Last active October 10, 2023 17:32
Show Gist options
  • Star 5 You must be signed in to star a gist
  • Fork 1 You must be signed in to fork a gist
  • Save Vindaar/6908c038707c7d8293049edb3d204f84 to your computer and use it in GitHub Desktop.
Save Vindaar/6908c038707c7d8293049edb3d204f84 to your computer and use it in GitHub Desktop.
A comparison of dplyr, Pandas and data frames in Nim (using ggplotnim)

Dplyr (R), Pandas (Python) & Nim data frame comparison

This comparison is inspired by the comparison here: https://gist.github.com/conormm/fd8b1980c28dd21cfaf6975c86c74d07

The Nim data frame implementation we use here is the Datamancer data frame.

Note that due to UFCS in Nim we can write the commands we present either similar to the Python notation as if the function were a method of the object (method call syntax: df.filter(...)) or using regular function call syntax similar to R (filter(df, ...)). In R using the %>% pipe operator we can achieve something that looks a bit like method call syntax. In the R examples in which both ways are presented, we replace that by the equivalent in Nim.

We will use the method call syntax for all Nim examples with the exceptions mentioned. Also, unless the expression does not return anything we will discard all Nim statements.

And as stated in the gist linked above, possibly take the Python examples with a grain of salt in terms of notation.

Filter

R

filter(df, var > 20000 & var < 30000)
filter(df, var == 'string') # df %>% filter(var != 'string')
df %>% filter(var != 'string')
df %>% group_by(group) %>% filter(sum(var) > 2000000)

Python

df[(df['var'] > 20000) & (df['var'] < 30000)]
df[df['var'] == 'string']
df[df['var'] != 'string']
df.groupby('group').filter(lambda x: sum(x['var']) > 2000000)

Nim

discard df.filter(f{`var` > 20000 and `var` < 30000})
# Note: `"string"` here refers to a specific string literal and not the string
# type (which may very well be the case in languages like Python),
# using function call syntax
discard filter(df, f{`var` == "string"})
# or using method call syntax with a similar example
discard df.filter(f{`var` != "string"})
# this returns all subgroups that pass the reducing predicate
discard df.group_by("group").filter(f{sum(`var`) > 2000000})

In these four cases the major difference between the R and Nim implementations is two things stemming from the major fact that R is an interpreted language and Nim compiled. The “pseudo code” described in the R examples needs to be converted into something that can be compiled as regular Nim code. Nim’s powerful macro system allows us to achieve the elegance of R with a small amount of additional fluff.

The two differences are:

  • we need to enter an untyped macro to allow us to rewrite code. This is done by using the f{} macro
  • References to columns are not done by untyped identifiers as (in essence) is done in R. There would be no way to differentiate between a variable foo and a column of name foo.
    let foo = 5
    # let `df` be a dataframe with column "foo"
    discard df.filter({ foo > 10 })
        

    For our implementation this code snippet means that foo will be interpreted as the local variable resulting in a formula that is constantly false. The resulting data frame will thus be empty.

For the latter we could abuse Nim’s macro system to plainly have an identifier refer to a column iff there is no symbol defined of that identifier. However, that is too ambiguous and makes it undesirable to use column names that might clash with typical symbols in code. There are different ways to refer to columns, which won’t be discussed in detail (a couple of ways will be mentioned below though as there are certain restrictions).

Side note: Nim’s implementation could be handled in a way similar to Pandas. That means essentially having broadcast operations for DF column + scalar operations. However, this has the incredible downside of introducing two for loops in code such as:

df["var"] > 200 and df["var"] < 500

Each comparison operation then needs to be broadcast and evaluated individually using a full loop over the data. This is wasted work. That’s the reason we take the full macro path in Nim (and having the additional beauty of being able to emulate the R syntax closely) to convert the DSL into a single for loop over the data independent of the complexity of the formula { multiple formulas after one another cannot be fused unfortunately. That would add significant complexity requiring analysis of the loop bodies, which in addition would require even the procedures filter, mutate, ... to be macros so that multiple formulas can even be analyzed together }. One currently still missing feature is lifting of constant expressions. Something like f{`var` < mean(`var`)} currently recomputes the mean in each iteration, just as it would in a regular for loop. Soon those things will be lifted out of the loop body.

A note on the last formula in this section

The last formula (in R):

df %>% group_by(group) %>% filter(sum(var) > 2000000)

in Nim:

df.group_by("group").filter(f{sum(`var`) > 2000000})

Depending on your expectations this case might be a bit weird. The formula argument to filter is a reducing formula that returns a boolean. If that boolean is true, the full subgroup (determined by the groub_by call) will be kept, else dropped.

One might on the other hand think it would return the group labels that match that predicate. We will now take a detour to do that instead, because it allow us to explain a few more things.

We will deconstruct (one of) the ways to determine the labels that pass the above predicate.

The Nim code to do that is (line breaks for readability):

df.group_by("group")
  .summarize(f{int: sum(`var`)})
  .filter(f{idx("(sum var)") > 2000000})

From a “high-level” point of view the only difference really is the splitting of the reduce + filter into individual calls. First a reducing operation of the kind we wish (summing the individual groups of “group”) and then filtering on the result.

There are now 2 things in particular that stand out.

  1. in the summarize formula we see the int: at the beginning. This is a type hint for the formula macro. It means to read the column (“var” in this case) as a column of integers. It is required, because from the sum(`var`) it isn’t possible to tell which type needs to be read as there are multiple overloads of the symbol sum.
  2. The filter proc now uses the idx “function”. It’s not really a function, but a different way to refer to data frame columns. There are again 2 things to untangle here:
    • why is the argument this string “sum var” when the original data frame contains a column “var”? Because the summarize operation returns a new data frame that contains those columns that are described by the formulas. As we did not give a name for the formula, the default naming scheme is used. That scheme is simply a lisp like representation of the code in the formula. A function call sum with the argument var: “(sum var)”. For people used to lisp this should be comfortable. This allows to write the default formula name easily by hand even if more complex expressions are used (though they get lengthy). A name to the formula is given by:
      df.summarize(f{int: "summed" << sum(`var`)})
              

      where the new column is called “summed”. The << is to refer to a reducing formula. The other two are:

      • ~ mapping operation T -> T. For any formula that is not reducing in nature a =”name” ~ RHS= can be used to name the newly created column. See the additional mutate example below.
      • <- as a constant assignment, typically only used in the context of the rename procedure to rename DF columns. See the section on rename below.
    • why do we use idx suddenly? The default naming schemes of formulas contains spaces. The accented quotes in Nim (identifiers in `) cannot properly handle spaces as we lose information. We can either reconstruct expressions in accented quotes without any spaces (what we do) or with spaces between every identifier. E.g.:
      `x/y` 
      `x / y`
              

      cannot be distinguished and it’s up to us to convert such an expression into a string. We choose to convert both to =”x/y”=.

    The idx function simply refers to a column and can take:

    • something in accented quotes
    • a literal string
    • any Nim expression returning a string

So the following is equivalent to the above version.

df.group_by("group") 
  .summarize(f{int: "sumVar" << sum(`var`)}) # name new column
  .filter(f{`sumVar` > 2000000}) # filter by new column

Select

R

select(df, var1, var2)
select(df, -var3)

Python

df[['var1', 'var2']]
df.drop('var3', 1)

Nim:

discard df.select("var1", "var2")
# or
discard df.select(@["var1", "var2"])
discard df.select(df.getKeys().filterIt(it != "var3"))
# or more naturally
discard df.drop("var3")

Arrange

R

arrange(df, var1)
arrange(df, desc(var1))

Python

df.sort_values('var1')
df.sort_values('var1', ascending=False)

Nim

df.arrange("var1")
df.arrange("var1", order = SortOrder.Descending)

Grouping

R

df %>% group_by(group)
df %>% group_by(group1, group2)
df %>% ungroup()

Python

df.groupby('group1')
df.groupby(['group1', 'group2'])
df.reset_index() / or when grouping: df.groupby('group1', as_index=False)

Nim

discard df.group_by("group")
# the following is currently broken because of Nim issue #17429
# (we could introduce another overload w/o `add` argument to work around)
discard df.group_by("group1", "group2")
discard df.group_by(@["group1", "group2"])
# ungroup is just
discard df.group_by()

Summarise / Aggregate df by group

R

df %>% group_by(group) %>% summarise(mean_var1 = mean(var1))
df %>%
  group_by(group1, group2) %>%
  summarise(mean_var1 = mean(var1),
            sum_var1 = sum(var1),
            count_var1 = n())

df %>%
  group_by(group1, group2) %>%
  summarise(mean_var1 = mean(var1),
            sum_2 = sum(var2),
            var3 = first(var3))

Python

df.groupby('group1')['var1'].agg({'mean_col' : np.mean()}) # pass dict to specify column name

df.groupby(['group1', 'group2'])['var1'].agg(['mean', 'sum', 'count']) # for count also consider 'size'. size will return n for NaN values also, whereas 'count' will not.

# first perform the aggregation
group_agg = df.groupby(["group1", "group2"]).agg({
  "var1" : ["mean"],
  "var2" : ["sum"],
  "var3" : ["first"]
  })
# second rename the columns by joining the column name with the agg function (e.g. "var1_mean")
group_agg.columns = ["_".join(x) for x in group_agg.columns.ravel()]

# You can also pass multiple functions to aggregate the same column e.g:
group_agg = df.groupby(["group1", "group2"]).agg({"var1" : ["mean", "std", "sum"]})

Nim

discard df.group_by("group").summarize(f{float: "mean_var1" << mean(`var1`)})

discard df.group_by(@["group1", "group2"])
  .summarize(f{float: "mean_var1" << mean(`var1`)},
             f{float: "sum_var1" << sum(`var1`)},
             f{int: "count_var1" << col(`var1`).len})

discard df.group_by(@["group1", "group2"])
  .summarize(f{float: "mean_var1" << mean(`var1`)},
             f{float: "sum_2" << sum(`var2`)},
             f{float: "var3" << col(`var3`)[0]})

After the explanations above, these should be mostly straight forward. We need the type hints in all cases as we simply call procedures that are all overloaded. The correct types to use of course depend on the individual dataframe. It might be int in each case depending on data and use case (even if that’s weird for a mean).

The col “function” used here is the equivalent to the idx function introduced above in the case one wishes to explicitly refer to a full data frame column. In this specific use case we can use it to essentially access the underlying tensor and use regular Nim procedures on it. In the first case:

  • col(`var1`).len will be turned into var1T.len, where var1T is the underlying tensor of the column “var1” let var1T = df["var1", float]. Given that the col(...) is converted to a simple tensor, using any procedure that’s valid for tensors should work as expected (for the readers who know arraymancer well: ggplotnim contains an overload len for tensors that checks the dimension and returns the tensor size).
  • in the second case col(`var3`)[0] it’s the same idea. We simply get the regular tensor and access the 0-th element, i.e. the equivalent of a hypothetical first procedure. We could define such a thing of course, too.

This possibly leaves the question “why do we sometimes just write `var1` and other times explicitly col(...) or idx(...)?”. The macro magic behind the formula tries to determine the correct usage in the accented quote column reference. Typically it will be interpreted as an access to a individual element of a column. However, in calls to functions we try to determine the signature and choose accordingly (e.g. for mean we know it’s reducing and thus the input is the full column). There are cases though where one needs to specify what one needs. Either due to ambiguous overloads or to in the case shown here where we want to work on the actual tensor in a more atypical way.

Mutate / transform df by group

This code example is imo rather bad, as it should be done using summarise in dplyr instead as mutate is normally used for mapping operations. The point I think this is trying to get across is that one can mutate on a group, such that including operations that would be affected by different groups (i.e. the mean of one group is probably different than the mean of the whole dataset).

R

df %>% group_by(group) %>% mutate(mean_var1 = mean(var1))

Python

df.groupby('group').assign(mean_var1 = lambda x: np.mean(x.var1)

Nim

discard df.group_by("group").mutate(f{"mean_var1" << mean(var1)})

We need to clarify in Nim that it’s a reducing operation. If we write replaced << by a tilde ~ instead, it would throw us a CT error saying that the operation on the RHS is reducing in nature.

A more typical example utilizing mutate might be something like:

discard df.mutate(f{float: "sumA+B" ~ `A` + `B`})

which creates a new column named “sumA+B” of type float.

Distinct

R

df %>% distinct()
df %>% distinct(col1) # returns dataframe with unique values of col1

Python

df.drop_duplicates()
df.drop_duplicates(subset='col1') # returns dataframe with unique values of col1

Nim

discard df.unique()
discard df.unique("col1") # returns dataframe with unique values of col1

In Nim we replaced the word distinct, because it’s a keyword. unique seems like a good alternative and I like it much better than sequtils deduplicate.

Rename

rename is used to (guess what) rename different columns. For this we use a formula with <-. Let df be a data frame with columns “col1” and “col2”:

discard df.rename(f{"foo" <- "col1"},
                  f{"bar" <- "col2"})

which simply renames both columns.

Sample

Sampling from a data frame is not implemented in ggplotnim’s data frame so far.

R

sample_n(df, 100)
sample_frac(df, 0.5)

Python

df.sample(100)
df.sample(frac=0.5)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment