@lokeshh
Last active April 4, 2016 13:23

Implement Formula language and categorical variable support in Statsample::Regression

Below is how Statsample performs regression.

def self.multiple(ds,y_var, opts=Hash.new)
  missing_data= (opts[:missing_data].nil? ) ? :listwise : opts.delete(:missing_data)
  if missing_data==:pairwise
     Statsample::Regression::Multiple::RubyEngine.new(ds,y_var, opts)
  else
    if Statsample.has_gsl? and false
      Statsample::Regression::Multiple::GslEngine.new(ds, y_var, opts)
    else
      ds2=ds.dup_only_valid
      Statsample::Regression::Multiple::RubyEngine.new(ds2,y_var, opts)
    end
  end
end

y_var is the name of the variable to be predicted, and ds is the DataFrame containing all the vectors that will be used in the regression.

For example, if the regression expression is y ~ a + b, then y_var = y and ds contains the vectors a, b and y.
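To make the mapping from expression to arguments concrete, here is a minimal sketch in plain Ruby. The name split_formula is hypothetical and is not part of Statsample or the proposed Formula class; it only handles simple additive expressions of the form "lhs ~ term + term".

```ruby
# Hypothetical helper (not part of Statsample): split a simple additive
# formula into the predicted variable and the list of predictor terms.
# Interactions, transformations and intercept handling are out of scope.
def split_formula(expression)
  lhs, rhs = expression.split('~', 2).map(&:strip)
  if lhs.nil? || rhs.nil? || lhs.empty? || rhs.empty?
    raise ArgumentError, "expected something like 'y ~ a + b', got #{expression.inspect}"
  end
  { y_var: lhs, terms: rhs.split('+').map(&:strip) }
end

parts = split_formula('y ~ a + b')
parts[:y_var]  # => "y"
parts[:terms]  # => ["a", "b"]
```

The real parser would presumably need to do more than this; the sketch only shows which piece of the expression becomes y_var and which pieces become the vectors in ds.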

The proposed strategy is as follows:

  1. There will be a helper function, called multiple_helper for now. It will parse the expression using the Formula class. More info about the parser can be found here.
  2. After parsing, we will have a list of RHS terms.
  3. We will create an empty DataFrame, then take one term at a time and code it using #contrast_code.
  4. #contrast_code takes a DataFrame and the term to code as arguments, and returns a DataFrame containing only the vectors produced by coding that term.
  5. After coding all the terms, we will have a DataFrame consisting only of the vectors to be considered in the regression.
  6. Finally, we call multiple with the new DataFrame and y_var as arguments, and we are done.
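Steps 3 to 5 can be sketched without Daru by letting a Hash of column name => Array stand in for a DataFrame. Everything here is an assumption made for illustration: the standalone contrast_code function, the "term_level" column naming, and the choice of treatment (dummy) coding with the first sorted level as the reference category. The proposal itself does not fix any of these.

```ruby
# Hypothetical stand-in for the proposed #contrast_code.
# Numeric columns pass through unchanged. Any other column is expanded into
# 0/1 indicator columns, one per level, dropping the first sorted level as
# the reference category (treatment coding, assumed here).
def contrast_code(columns, term)
  values = columns.fetch(term)
  return { term => values } if values.all? { |v| v.is_a?(Numeric) }

  levels = values.uniq.sort
  levels.drop(1).each_with_object({}) do |level, coded|
    coded["#{term}_#{level}"] = values.map { |v| v == level ? 1 : 0 }
  end
end

data = {
  'y' => [1.0, 2.0, 3.0, 4.0],
  'a' => [10, 20, 30, 40],
  'b' => %w[low high mid low]   # categorical predictor
}

# Steps 3 to 5: start empty, code each RHS term, collect the resulting columns.
reg_columns = {}
%w[a b].each { |term| reg_columns.merge!(contrast_code(data, term)) }

# The outcome must sit in the same frame, because multiple looks up y_var in ds.
reg_columns['y'] = data['y']

reg_columns.keys      # => ["a", "b_low", "b_mid", "y"]
reg_columns['b_low']  # => [1, 0, 0, 1]
```

Dropping one level per categorical variable avoids perfect collinearity with the intercept, which is why b expands into two columns rather than three.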

Here's how it might look in code:

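def multiple_helper(df, exp)
  f = Formula.new
  f.from_formula exp
  y_var = f.lhs_terms       # Variable to be predicted using regression
  terms = f.rhs_terms       # Variables to be used for regression

  # Collects the coded vectors of every RHS term
  reg_df = Daru::DataFrame.new

  terms.each do |term|
    coded_term = df.contrast_code term    # proposed method, see step 4
    reg_df.concate_along_col coded_term   # placeholder name for column-wise concatenation
  end

  # Note: multiple expects the y_var vector to be present in the DataFrame it receives
  multiple reg_df, y_var
end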