Below is how Statsample performs regression.
def self.multiple(ds, y_var, opts=Hash.new)
  missing_data = (opts[:missing_data].nil?) ? :listwise : opts.delete(:missing_data)
  if missing_data == :pairwise
    Statsample::Regression::Multiple::RubyEngine.new(ds, y_var, opts)
  else
    if Statsample.has_gsl? and false
      Statsample::Regression::Multiple::GslEngine.new(ds, y_var, opts)
    else
      ds2 = ds.dup_only_valid
      Statsample::Regression::Multiple::RubyEngine.new(ds2, y_var, opts)
    end
  end
end
`y_var` is the variable to be predicted, and `ds` is the DataFrame containing all the vectors that are going to be used in the regression.
So, for example, if `y ~ a + b` is the regression expression, then `y_var = y` and `ds` would contain the `a`, `b` and `y` vectors.
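To make the `y ~ a + b` example concrete, here is a minimal sketch in plain Ruby. It assumes a simple additive formula only, uses a plain Hash as a stand-in for the DataFrame, and the name `split_formula` is hypothetical; it is not the Formula class or any Statsample API.

```ruby
# Hypothetical helper: splits a simple additive formula such as "y ~ a + b"
# into the response name and the predictor names. It does not handle
# interactions, parentheses, or any other formula syntax.
def split_formula(expression)
  lhs, rhs = expression.split('~', 2).map(&:strip)
  raise ArgumentError, "formula needs a '~'" if rhs.nil?
  [lhs, rhs.split('+').map(&:strip)]
end

# Plain Hash standing in for the DataFrame `ds`: column name => values.
ds = {
  'y' => [3.1, 4.9, 7.2],
  'a' => [1.0, 2.0, 3.0],
  'b' => [0.5, 0.1, 0.9],
  'c' => [9.0, 9.0, 9.0] # not named in the formula
}

y_var, predictors = split_formula('y ~ a + b')

# Keep only the columns the regression needs: the response and its predictors.
needed = ds.select { |name, _| ([y_var] + predictors).include?(name) }
```

Here `y_var` is the response named on the left of `~`, and `needed` holds just the `y`, `a` and `b` columns, which is what `ds` is expected to contain.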
Now below is the strategy:
- There will be a helper function. Let's call it `multiple_helper` for now. It would parse the expression using the Formula class. More info about the parser can be found here.
- Next, after parsing, we would have a list of RHS terms.
- We will create an empty DataFrame, then take one term at a time and code it using `#contrast_code`.
- `#contrast_code` would take a DataFrame and the term to code as arguments, and would return a DataFrame containing only the vectors produced by coding that term.
- We will code all the terms and end up with a DataFrame consisting only of the vectors that have to be considered for regression.
- Finally, we call `multiple` with the new DataFrame and `y_var` as arguments, and we are done.
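The coding step is the least obvious part of this plan, so here is a rough sketch of what it could produce. Everything in it is an assumption made for illustration: plain Hashes stand in for DataFrames, the `contrast_code` below is a hypothetical function that uses dummy (treatment) coding with the first level as the reference, and `build_regression_frame` is a made-up name. The real `#contrast_code` may use a different coding scheme and different column names.

```ruby
# Hypothetical stand-in for #contrast_code. A numeric column passes through
# unchanged; any other column is dummy coded, with its first level as the
# reference level (so it gets no column of its own).
def contrast_code(df, term)
  values = df.fetch(term)
  return { term => values } if values.all? { |v| v.is_a?(Numeric) }

  levels = values.uniq
  levels.drop(1).each_with_object({}) do |level, coded|
    coded["#{term}_#{level}"] = values.map { |v| v == level ? 1 : 0 }
  end
end

# Collects the response column plus the coded vectors of every RHS term.
def build_regression_frame(df, y_var, terms)
  reg_df = { y_var => df.fetch(y_var) }
  terms.each { |term| reg_df.merge!(contrast_code(df, term)) }
  reg_df
end

df = {
  'y' => [1.0, 2.0, 3.0, 4.0],
  'a' => [0.5, 1.5, 2.5, 3.5],
  'b' => %w[low high low mid] # categorical term
}

reg_df = build_regression_frame(df, 'y', %w[a b])
```

In this sketch the categorical term `b` expands into two indicator columns, `b_high` and `b_mid`, while `a` is carried over as it is. The response column is kept in the result because, as described earlier, the DataFrame handed to `multiple` has to contain the `y` vector as well.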
Here's how it might look in code:
def multiple_helper(df, exp)
  f = Formula.new
  f.from_formula exp
  y_var = f.lhs_terms # Variable to be predicted using regression
  terms = f.rhs_terms # Variables to be used for regression

  # Start from an empty DataFrame and append the coded vectors of each term
  reg_df = Daru::DataFrame.new({})
  terms.each do |term|
    coded_term = df.contrast_code term
    reg_df.concate_along_col coded_term
  end

  # Run the regression on the coded DataFrame
  multiple reg_df, y_var
end