Vindaar/dplyr_pandas_comparison_to_nim.org

## dplyr_pandas_comparison_to_nim.org

      
    Raw
  

              dplyr_pandas_comparison_to_nim.org
            
          
    Dplyr (R), Pandas (Python) & Nim data frame comparison

This comparison is inspired by the comparison here:
  https://gist.github.com/conormm/fd8b1980c28dd21cfaf6975c86c74d07
The Nim data frame implementation we use here is the Datamancer data
  frame.
Note that due to UFCS in Nim we can write the commands we present either
  similar to the Python notation as if the function were a method of the
  object (method call syntax: df.filter(...)) or using regular
  function call syntax similar to R (filter(df, ...)). In R using the
  %>% pipe operator we can achieve something that looks a bit like
  method call syntax. In the R examples in which both ways are
  presented, we replace that by the equivalent in Nim.
We will use the method call syntax for all Nim examples with the
  exceptions mentioned. Also, unless the expression does not return
  anything we will discard all Nim statements.
And as stated in the gist linked above, possibly take the Python
  examples with a grain of salt in terms of notation.
Filter

R
filter(df, var > 20000 & var < 30000)
filter(df, var == 'string') # df %>% filter(var != 'string')
df %>% filter(var != 'string')
df %>% group_by(group) %>% filter(sum(var) > 2000000)
Python
df[(df['var'] > 20000) & (df['var'] < 30000)]
df[df['var'] == 'string']
df[df['var'] != 'string']
df.groupby('group').filter(lambda x: sum(x['var']) > 2000000)
Nim
discard df.filter(f{`var` > 20000 and `var` < 30000})
# Note: `"string"` here refers to a specific string literal and not the string
# type (which may very well be the case in languages like Python),
# using function call syntax
discard filter(df, f{`var` == "string"})
# or using method call syntax with a similar example
discard df.filter(f{`var` != "string"})
# this returns all subgroups that pass the reducing predicate
discard df.group_by("group").filter(f{sum(`var`) > 2000000})
In these four cases the major difference between the R and Nim
  implementations is two things stemming from the major fact that R is
  an interpreted language and Nim compiled. The “pseudo code” described
  in the R examples needs to be converted into something that can be
  compiled as regular Nim code. Nim’s powerful macro system allows us to
  achieve the elegance of R with a small amount of additional fluff.
The two differences are:

  we need to enter an untyped macro to allow us to rewrite
    code. This is done by using the f{} macro
  References to columns are not done by untyped identifiers as (in
    essence) is done in R. There would be no way to differentiate
    between a variable foo and a column of name foo.
    let foo = 5
# let `df` be a dataframe with column "foo"
discard df.filter({ foo > 10 })
    
    For our implementation this code snippet means that foo will be
      interpreted as the local variable resulting in a formula that is
      constantly false. The resulting data frame will thus be empty.
  

For the latter we could abuse Nim’s macro system to plainly have an
  identifier refer to a column iff there is no symbol defined of that
  identifier. However, that is too ambiguous and makes it undesirable to
  use column names that might clash with typical symbols in code.
  There are different ways to refer to columns, which won’t be discussed
  in detail (a couple of ways will be mentioned below though as there
  are certain restrictions).
Side note: Nim’s implementation could be handled in a way similar to
  Pandas. That means essentially having broadcast operations for DF
  column + scalar operations. However, this has the incredible downside
  of introducing two for loops in code such as:
df["var"] > 200 and df["var"] < 500

Each comparison operation then needs to be broadcast and evaluated
  individually using a full loop over the data. This is wasted
  work. That’s the reason we take the full macro path in Nim (and having
  the additional beauty of being able to emulate the R syntax closely)
  to convert the DSL into a single for loop over the data independent
  of the complexity of the formula { multiple formulas after one another
  cannot be fused unfortunately. That would add significant complexity
  requiring analysis of the loop bodies, which in addition would require
  even the procedures filter, mutate, ... to be macros so that
  multiple formulas can even be analyzed together }. One currently still
  missing feature is lifting of constant expressions. Something like
  f{`var` < mean(`var`)} currently recomputes the mean in each
  iteration, just as it would in a regular for loop. Soon those things
  will be lifted out of the loop body.
A note on the last formula in this section

The last formula (in R):
df %>% group_by(group) %>% filter(sum(var) > 2000000)
in Nim:
df.group_by("group").filter(f{sum(`var`) > 2000000})
Depending on your expectations this case might be a bit weird. The
  formula argument to filter is a reducing formula that returns a
  boolean. If that boolean is true, the full subgroup (determined by the
  groub_by call) will be kept, else dropped.
One might on the other hand think it would return the group labels
  that match that predicate. We will now take a detour to do that
  instead, because it allow us to explain a few more things.
We will deconstruct (one of) the ways to determine the labels that
  pass the above predicate.
The Nim code to do that is (line breaks for readability):
df.group_by("group")
  .summarize(f{int: sum(`var`)})
  .filter(f{idx("(sum var)") > 2000000})
From a “high-level” point of view the only difference really is the
  splitting of the reduce + filter into individual calls. First a
  reducing operation of the kind we wish (summing the individual groups
  of “group”) and then filtering on the result.
There are now 2 things in particular that stand out.

  in the summarize formula we see the int: at the beginning. This
    is a type hint for the formula macro. It means to read the column
    (“var” in this case) as a column of integers. It is required,
    because from the sum(`var`) it isn’t possible to tell which type
    needs to be read as there are multiple overloads of the symbol
    sum.
  The filter proc now uses the idx “function”. It’s not really a
    function, but a different way to refer to data frame columns.
    There are again 2 things to untangle here:
    
      why is the argument this string “sum var” when the original data
        frame contains a column “var”? Because the summarize operation
        returns a new data frame that contains those columns that are
        described by the formulas. As we did not give a name for the
        formula, the default naming scheme is used. That scheme is simply
        a lisp like representation of the code in the formula. A function
        call sum with the argument var: “(sum var)”. For people used
        to lisp this should be comfortable. This allows to write the
        default formula name easily by hand even if more complex
        expressions are used (though they get lengthy).
        A name to the formula is given by:
        df.summarize(f{int: "summed" << sum(`var`)})
        
        where the new column is called “summed”. The << is to refer to
          a reducing formula. The other two are:
        
          ~ mapping operation T -> T. For any formula that is not reducing in nature a =”name” ~ RHS= can be
            used to name the newly created column. See the additional
            mutate example below.
          <- as a constant assignment, typically only used in the
            context of the rename procedure to rename DF columns. See the
            section on rename below.
        
      
      why do we use idx suddenly? The default naming schemes of
        formulas contains spaces. The accented quotes in Nim (identifiers
        in `) cannot properly handle spaces as we lose information. We
        can either reconstruct expressions in accented quotes without any
        spaces (what we do) or with spaces between every
        identifier. E.g.:
        `x/y` 
`x / y`
        
        cannot be distinguished and it’s up to us to convert such an
          expression into a string. We choose to convert both to =”x/y”=.
      
    
    The idx function simply refers to a column and can take:
    
      something in accented quotes
      a literal string
      any Nim expression returning a string
    
  
So the following is equivalent to the above version.
df.group_by("group") 
  .summarize(f{int: "sumVar" << sum(`var`)}) # name new column
  .filter(f{`sumVar` > 2000000}) # filter by new column
Select

R
select(df, var1, var2)
select(df, -var3)
Python
df[['var1', 'var2']]
df.drop('var3', 1)
Nim:
discard df.select("var1", "var2")
# or
discard df.select(@["var1", "var2"])
discard df.select(df.getKeys().filterIt(it != "var3"))
# or more naturally
discard df.drop("var3")
Arrange

R
arrange(df, var1)
arrange(df, desc(var1))
Python
df.sort_values('var1')
df.sort_values('var1', ascending=False)
Nim
df.arrange("var1")
df.arrange("var1", order = SortOrder.Descending)
Grouping

R
df %>% group_by(group)
df %>% group_by(group1, group2)
df %>% ungroup()
Python
df.groupby('group1')
df.groupby(['group1', 'group2'])
df.reset_index() / or when grouping: df.groupby('group1', as_index=False)
Nim
discard df.group_by("group")
# the following is currently broken because of Nim issue #17429
# (we could introduce another overload w/o `add` argument to work around)
discard df.group_by("group1", "group2")
discard df.group_by(@["group1", "group2"])
# ungroup is just
discard df.group_by()
Summarise / Aggregate df by group

R
df %>% group_by(group) %>% summarise(mean_var1 = mean(var1))
df %>%
  group_by(group1, group2) %>%
  summarise(mean_var1 = mean(var1),
            sum_var1 = sum(var1),
            count_var1 = n())

df %>%
  group_by(group1, group2) %>%
  summarise(mean_var1 = mean(var1),
            sum_2 = sum(var2),
            var3 = first(var3))
Python
df.groupby('group1')['var1'].agg({'mean_col' : np.mean()}) # pass dict to specify column name

df.groupby(['group1', 'group2'])['var1'].agg(['mean', 'sum', 'count']) # for count also consider 'size'. size will return n for NaN values also, whereas 'count' will not.

# first perform the aggregation
group_agg = df.groupby(["group1", "group2"]).agg({
  "var1" : ["mean"],
  "var2" : ["sum"],
  "var3" : ["first"]
  })
# second rename the columns by joining the column name with the agg function (e.g. "var1_mean")
group_agg.columns = ["_".join(x) for x in group_agg.columns.ravel()]

# You can also pass multiple functions to aggregate the same column e.g:
group_agg = df.groupby(["group1", "group2"]).agg({"var1" : ["mean", "std", "sum"]})
Nim
discard df.group_by("group").summarize(f{float: "mean_var1" << mean(`var1`)})

discard df.group_by(@["group1", "group2"])
  .summarize(f{float: "mean_var1" << mean(`var1`)},
             f{float: "sum_var1" << sum(`var1`)},
             f{int: "count_var1" << col(`var1`).len})

discard df.group_by(@["group1", "group2"])
  .summarize(f{float: "mean_var1" << mean(`var1`)},
             f{float: "sum_2" << sum(`var2`)},
             f{float: "var3" << col(`var3`)[0]})
After the explanations above, these should be mostly straight
  forward. We need the type hints in all cases as we simply call
  procedures that are all overloaded. The correct types to use of course
  depend on the individual dataframe. It might be int in each case
  depending on data and use case (even if that’s weird for a mean).
The col “function” used here is the equivalent to the idx function
  introduced above in the case one wishes to explicitly refer to a full
  data frame column. In this specific use case we can use it to
  essentially access the underlying tensor and use regular Nim
  procedures on it. In the first case:

  col(`var1`).len will be turned into var1T.len, where var1T is
    the underlying tensor of the column “var1” let var1T = df["var1",
    float].
    Given that the col(...) is converted to a simple tensor, using any
    procedure that’s valid for tensors should work as expected (for the
    readers who know arraymancer well: ggplotnim contains an overload
    len for tensors that checks the dimension and returns the tensor
    size).
  in the second case col(`var3`)[0] it’s the same idea. We simply
    get the regular tensor and access the 0-th element, i.e. the
    equivalent of a hypothetical first procedure. We could define such
    a thing of course, too.

This possibly leaves the question “why do we sometimes just write
  `var1` and other times explicitly col(...) or idx(...)?”. The
  macro magic behind the formula tries to determine the correct usage in
  the accented quote column reference. Typically it will be interpreted
  as an access to a individual element of a column. However, in calls to
  functions we try to determine the signature and choose accordingly
  (e.g. for mean we know it’s reducing and thus the input is the full
  column). There are cases though where one needs to specify what one
  needs. Either due to ambiguous overloads or to in the case shown here
  where we want to work on the actual tensor in a more atypical way.
Mutate / transform df by group

This code example is imo rather bad, as it should be done using
  summarise in dplyr instead as mutate is normally used for mapping
  operations. The point I think this is trying to get across is that one
  can mutate on a group, such that including operations that would be
  affected by different groups (i.e. the mean of one group is probably
  different than the mean of the whole dataset).
R
df %>% group_by(group) %>% mutate(mean_var1 = mean(var1))
Python
df.groupby('group').assign(mean_var1 = lambda x: np.mean(x.var1)
Nim
discard df.group_by("group").mutate(f{"mean_var1" << mean(var1)})
We need to clarify in Nim that it’s a reducing operation. If we write
  replaced << by a tilde ~ instead, it would throw us a CT error
  saying that the operation on the RHS is reducing in nature.
A more typical example utilizing mutate might be something like:
discard df.mutate(f{float: "sumA+B" ~ `A` + `B`})
which creates a new column named “sumA+B” of type float.
Distinct

R
df %>% distinct()
df %>% distinct(col1) # returns dataframe with unique values of col1
Python
df.drop_duplicates()
df.drop_duplicates(subset='col1') # returns dataframe with unique values of col1
Nim
discard df.unique()
discard df.unique("col1") # returns dataframe with unique values of col1
In Nim we replaced the word distinct, because it’s a
  keyword. unique seems like a good alternative and I like it much
  better than sequtils deduplicate.
Rename

rename is used to (guess what) rename different columns. For this we
  use a formula with <-. Let df be a data frame with columns “col1”
  and “col2”:
discard df.rename(f{"foo" <- "col1"},
                  f{"bar" <- "col2"})
which simply renames both columns.
Sample

Sampling from a data frame is not implemented in ggplotnim’s data frame
  so far.
R
sample_n(df, 100)
sample_frac(df, 0.5)
Python
df.sample(100)
df.sample(frac=0.5)