lokeshh/redundancy.md

## redundancy.md

      
(taken from GSOC proposal)
...
contrast_interact: This is used to code interaction terms. In a dataframe with columns ‘a’ and ‘b’, ‘a:b’ is an interaction term. As before, we need to code this term to produce some number of variables, but in this case the coding is somewhat different. I’ll explain with an example how to code ‘a:b’, and one can generalize the behavior from there. Let’s say column ‘a’ has m categories and ‘b’ has n categories. If ‘a’ has been mentioned in our regression expression, then we will code column ‘b’ with n-1 variables, and similarly if ‘b’ has been mentioned in the regression expression, then we will code column ‘a’ with m-1 variables. If ‘a’ hasn’t been mentioned in our regression expression, then ‘b’ will be coded with n variables, and similarly if ‘b’ hasn’t been mentioned in our regression expression, then ‘a’ will be coded with m variables.
Here’s a general rule to follow when an interaction involves more than two columns. Say we have ‘a:b:c’ and we need to decide whether to code ‘a’ with m variables or m-1 variables. We need to see whether the term made of everything except ‘a’ appears in the regression expression. In this case that term is ‘b:c’. If it’s there, then we will code ‘a’ with m-1 variables, otherwise with m variables. Similarly, to decide for ‘b’ we need to see whether ‘a:c’ is in the regression expression, and to decide for ‘c’ we need to see whether ‘a:b’ is there. If we do not follow this rule, we will end up with redundant (linearly dependent) columns that would break our regression.
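The rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the proposal's actual implementation: the function name `coding_sizes`, its parameters, and the representation of the regression expression as a list of term strings are all hypothetical. It assumes the term being coded involves at least two columns.

```python
def coding_sizes(term, formula_terms, levels):
    """Return {column: number of coded variables} for one interaction term.

    term          -- interaction such as "a:b:c" (hypothetical string form)
    formula_terms -- every term present in the regression expression
    levels        -- {column: number of categories}
    """
    factors = term.split(":")
    # Compare terms as sets of columns so that "c:b" matches "b:c".
    present = {frozenset(t.split(":")) for t in formula_terms}
    sizes = {}
    for f in factors:
        # The term made of every column in the interaction except f.
        rest = frozenset(x for x in factors if x != f)
        # If that term is in the expression, drop one variable for f.
        sizes[f] = levels[f] - 1 if rest in present else levels[f]
    return sizes


levels = {"a": 3, "b": 4, "c": 2}
two_way = coding_sizes("a:b", ["a", "a:b"], levels)
three_way = coding_sizes("a:b:c", ["a", "b:c", "a:b:c"], levels)
```

In the two-way call only ‘a’ is mentioned, so ‘b’ gets n-1 = 3 variables while ‘a’ keeps all m = 3. In the three-way call only ‘b:c’ is present, so only ‘a’ is reduced to m-1 variables.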
Finally, once every column in our interaction term has been coded to the correct number of variables, we are ready to code the interaction term itself. I’ll again explain this with the help of an example. Say ‘a’ has been coded to the variables a_1, a_2 and ‘b’ has been coded to the variables b_1, b_2, b_3. The coding for the interaction term ‘a:b’ will then consist of the variables a_1:b_1, a_1:b_2, a_1:b_3, a_2:b_1, a_2:b_2, a_2:b_3. (Here a_i:b_j means the multiplication of a_i and b_j, not their interaction.) Similarly, the coding of ‘a:b:c’ will be the set of variables { a_i:b_j:c_k | a_i, b_j and c_k are variables used to code ‘a’, ‘b’ and ‘c’ respectively }.
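As a rough sketch of this last step, the snippet below builds every product column from already-coded columns. The function name `interaction_columns`, the dict-of-lists representation, and the sample 0/1 values are assumptions made for illustration and do not come from the proposal.

```python
from itertools import product
from math import prod


def interaction_columns(coded_factors):
    """Build interaction variables as element-wise products.

    coded_factors -- list with one dict per column of the interaction,
                     each mapping a coded variable name to its values.
    """
    result = {}
    # Take one coded variable from each column, in every combination.
    for combo in product(*(f.items() for f in coded_factors)):
        name = ":".join(var for var, _ in combo)
        columns = [values for _, values in combo]
        # Multiply the chosen variables row by row.
        result[name] = [prod(row) for row in zip(*columns)]
    return result


# Made-up 0/1 codings for four rows of data.
a = {"a_1": [1, 0, 0, 1], "a_2": [0, 1, 0, 0]}
b = {"b_1": [1, 0, 0, 0], "b_2": [0, 1, 0, 0], "b_3": [0, 0, 1, 1]}
cols = interaction_columns([a, b])
```

With two variables for ‘a’ and three for ‘b’, `cols` holds the six products listed above. Passing a third dict for ‘c’ would give the ‘a:b:c’ case in the same way.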
...