dmbates/gist:5303570

## gistfile1.md

      
    Raw
  

              gistfile1.md
            
          
    Formula, ModelFrame and ModelMatrix types

In R, the functions for fitting statistical models based on a linear predictor usually allow for a formula/data specification of the model.  The data part is a data.frame object in R.  In Julia the corresponding type is a DataFrame as implemented in the DataFrames package.
A formula object in R is an expression whose top-level operator is a unary or binary tilde, ~.  The interpretation of many operators, including +, *, ^, \ and : are overridden in the formula language.  See the result of
?formula
in R for the details.
Model-fitting functions in R produce the model matrix from a formula/data specification and other optional arguments in two stages.  First the formula and the data frame are used to create a model.frame object as


Create a terms object from the formula and data arguments.  The specific R method is terms.formula which, in turn, calls a C function termsform in $RSRC/src/library/stats/models.c.  Stages within that function are

Check for a valid formula
Determine the names of variables used in the formula
Check if the intercept term is present.  Because it is present by default this amounts to checking if it has been suppressed, either by a term of the form 0 + ... or ... - 1 or -1 + ...
Recode the model in a binary form, expanding various operators.  After this stage the model terms are either names of variables, interactions of variables or function calls.  The binary form is a binary matrix with rows corresponding to the variables and columns corresponding to the terms.  The bitcount for each column is the order of the term, which is the number of variables in the interaction.  That is, an intercept term has order zero, a main-effects term has order 1, a second-order interaction has order 2, etc.
Reorder the model terms by bitcount, otherwise preserving the original order.
Compute the factor pattern, 0 indicates that the term does not appear in the model, 1 indicates it will appear as contrasts and 2 indicates it will appear as indicators.  This is essentially the binary matrix described above except for the case where the intercept term is suppressed and the first term that would be a contrast becomes the indicators.


A few fixups are generally performed after this.  See the code for the R function terms.formula.


Extract the variables from the data frame and apply any subsetting specification.


Apply the NA handler, such as na.omit to the reduced data frame.


Drop unused levels in any factor columns.  In Julia these are of type PooledDataArray.


Create the model matrix.  The R function model.matrix delegates most of the work to the C function modelmatrix in the same file as modelframe.  Most of the work here is in handling the contrasts for any factor variables and creating interaction terms from sets of contrasts.  After the symbolic analysis to create the Terms object in a ModelFrame, creation of a ModelMatrix depends only on the factor pattern.


In Julia the formula specification is as a Julia expression.  For example
julia> :(y ~ a + b + log(c))
:(~(y,+(a,b,log(c))))
which has many benefits such as not needing to write a parser, but also a drawback related to operator precedence.  If we use the colon, :, operator to specify an interaction, as in R, the term will need to be enclosed in parentheses.  A formula that could be written in R as
julia> :(y ~ a + b + a:b)
:(~(y,+(a,b,a):b))
must be written as
julia> :(y ~ a + b + (a:b))
:(~(y,+(a,b,a:b)))
We don't actually need to define a colon operator on columns of a data frame to be able to evaluate the model matrix because the information on interactions is encoded in the factor matrix.
Other than that, I think the pieces are getting into place in the db/formula branch of the DataFrames package.  Extraction of variable names is working, I think
julia> ff = Formula(:(y ~ a + b + (a:b)))
Formula: y ~ :(+(a,b,a:b))

julia> DataFrames.allvars(ff)
3-element Symbol Array:
 :b
 :y
 :a
as is dropUnusedLevels! and a basic na_omit NA handler.  The construction of the Terms object still needs more work. Expansion from the PooledDataArray refs field to the contrasts columns is done through a contrasts generator.  At present the only contrasts generator is contr_treatment but it is easy to add others.