@alaakh42
Created August 16, 2017 16:59
Multiple Linear Regression
Multicollinearity → when one independent variable can be predicted from another independent variable, like the two dummy variables New York and California below.
For example, take the case of ‘Dummy Variables’ - the encoding of a categorical variable into numerical variables. Suppose a categorical column called ‘State’ has 2 levels: [New York, California]
State
New York
California
California
New York
California

then the dummy variables will be (note that D2 = 1 - D1):

New York (D1)    California (D2)
1                0
0                1
0                1
1                0
0                1
In a Multiple LR problem:
Y = b0 + b1*x1 + b2*x2 + b3*x3 + b4*D1
You cannot add both dummy variables to the LR equation: because D2 = 1 - D1, they are perfectly correlated and the LR model cannot distinguish the effect of D1 from the effect of D2. This is called the ‘Dummy Variable Trap’.
So, as a rule of thumb, always omit one dummy variable when you are creating the LR model: if you have 100 dummy variables, include only 99. You also have to apply the same rule to every categorical variable column.
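As a quick illustration (a minimal sketch with pandas, using the toy ‘State’ column above; this snippet is not part of the original tutorial code):
import pandas as pd
# toy 'State' column from the table above
df = pd.DataFrame({"State": ["New York", "California", "California", "New York", "California"]})
# drop_first=True keeps only one dummy column ('State_New York' here), making
# 'California' the baseline and avoiding the Dummy Variable Trap automatically
dummies = pd.get_dummies(df["State"], prefix="State", drop_first=True)
print(dummies)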
How to build a model (Step-by-Step)
When building a model you have to choose which variables to include, because if you include too many garbage variables you will end up with a garbage model. You also need to understand the impact of each variable on the dependent variable, and to be able to explain it (e.g. to your boss), which is not practical with hundreds of variables.
5 Methods of building a Model
All-in → to use all the variables,
Cases:
If you have prior knowledge, e.g. you’ve built this model before
You have to use them all, e.g. a framework in a bank requires certain variables
When you are preparing for ‘Backward Elimination’
Backward Elimination → is the one that will be used in the tutorial as it is the fastest
STEP 1: Select a significance level to stay in the model (e.g. SL = 0.05)
STEP 2: Fit the full model with all possible predictors, add all the variables to your model
STEP 3: Consider the predictor with the highest P-value. If P > SL, go to STEP 4; otherwise go to FIN ‘Finish’, which means your model is ready
STEP 4: Remove the predictor
STEP 5: Fit the model without this variable, i.e. rebuild the whole model without the variable whose P-value exceeded the significance level; it will be a new model with new coefficients and a new constant. After STEP 5, return to STEP 3 and repeat until the highest remaining P-value is below your SL
Forward Selection →
STEP 1: Select a significance level to enter the model (e.g. SL = 0.05)
STEP 2: Fit all simple regression models y ~ xn and select the one with the lowest P-value. This means using every possible variable on its own to make a simple LR model with only one variable.
STEP 3: Keep the variable(s) you already selected and add each of the remaining variables one by one, creating LR models with one extra variable - always including the variables selected so far
STEP 4: Consider the predictor with the lowest P-value. If P < SL, go back to STEP 3, otherwise go to FIN. That is, from all the candidate models constructed in STEP 3, select the one with the lowest new P-value, then return to STEP 3 to add another variable and fit again, always choosing the lowest P-value. The trick is that you stop when the best new variable turns out to be insignificant, and you keep the previous model - the one you had before adding that insignificant variable.
Bidirectional Elimination/ Stepwise Regression →
STEP 1: Select a significance level to enter and to stay in the model
E.g. SLENTER = 0.05, SLSTAY = 0.05
STEP 2: Perform the next step of Forward Selection (new variables must have P < SLENTER to enter)
STEP 3: Perform all steps of Backward Elimination (old variables must have P < SLSTAY to stay)
STEP 4: When no new variables can enter and no old variables can exit, your model is READY
Score Comparison/ All Possible Models →
STEP 1: Select a criterion of goodness of fit (e.g. Akaike criterion)
STEP 2: Construct all possible regression models: 2^N - 1 total combinations
STEP 3: Select the one with the best criterion
Example: a dataset with 10 columns means 2^10 - 1 = 1023 models! It is not a practical approach as the number of models grows exponentially with the number of variables (see the sketch after this list)
NOTE ::: Numbers 2, 3, and 4 are called “Stepwise Regression”
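As a rough sketch of the ‘All Possible Models’ idea (not from the tutorial; X is assumed to be a pandas DataFrame of candidate predictors and y the target vector, both hypothetical names), scoring each subset with statsmodels’ Akaike criterion:
from itertools import combinations
import statsmodels.api as sm

def best_subset_by_aic(X, y):
    # fit all 2^N - 1 non-empty subsets of the N candidate predictors
    best_aic, best_cols = float("inf"), None
    cols = list(X.columns)
    for k in range(1, len(cols) + 1):
        for subset in combinations(cols, k):
            model = sm.OLS(y, sm.add_constant(X[list(subset)])).fit()
            if model.aic < best_aic:
                best_aic, best_cols = model.aic, subset
    return best_cols, best_aic
With 10 candidate columns this loop fits 1023 models, which is exactly why the approach doesn’t scale.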
-------------------------------------------------------------------------------------------------------------------------
In the Python practical tutorial, we are building a model that checks whether there is a linear relationship between the 4 independent variables [R&D Spend, Administration, Marketing Spend, State] and the one dependent variable [Profit]; we want to see if we can predict the Profit value using the 4 independent variables.
STEP1: Import the dataset
import numpy as np
import pandas as pd
dataset = pd.read_csv("50_Startups.csv")
STEP2: define the independent variables matrix x and the dependent variable vector y
x = dataset.iloc[:, :-1].values # independent variables
y = dataset.iloc[:, 4].values # dependent variable
STEP3: we will start encoding the categorical variables using LabelEncoder and OneHotEncoder
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_x = LabelEncoder()
x[:,3] = labelencoder_x.fit_transform(x[:,3])
# Dummy Encoding
onehotencoder = OneHotEncoder(categorical_features=[3]) # encode the state col
x = onehotencoder.fit_transform(x).toarray()
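Note: the categorical_features argument was removed in newer scikit-learn releases; a rough equivalent using ColumnTransformer (a sketch, assuming column index 3 is still the State column) is:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# one-hot encode column 3 (State) and pass the numeric columns through unchanged;
# the dummy columns are placed first, so the trap fix in STEP4 still applies
ct = ColumnTransformer([("state", OneHotEncoder(), [3])],
                       remainder="passthrough", sparse_threshold=0.0)
x = ct.fit_transform(x)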
STEP4: Avoid the dummy variable trap by dropping one dummy column, to eliminate the dependency between the dummy variables. Note: for ML libraries like sklearn’s LinearRegression you don’t strictly need to do this, as it is already taken care of by the library
x = x [:, 1:]
STEP5: Split the data into training and testing sets
from sklearn.model_selection import train_test_split # was sklearn.cross_validation in older scikit-learn versions
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, train_size=0.8, random_state= 0)
STEP6: Fit the regressor to the training data set
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(x_train, y_train)
STEP7: Evaluate the multivariable LR model on our Test set x_test
y_pred = regressor.predict(x_test) # compare between y_pred and y_test
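One way to quantify that comparison (an optional sketch; the choice of metrics is mine, not part of the tutorial):
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error

print("Test R^2 :", r2_score(y_test, y_pred))
print("Test RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))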
If you are satisfied with your model results, you do not need to do the following.
But, I want to get better results:
Build the optimal multiple LR model using Backward Elimination. Here we build the optimal model by eliminating the statistically insignificant variables that have no major impact on predicting the dependent variable. We will end up with the team of independent variables that have a real impact [positive ‘increase profit’ / negative ‘decrease profit’] on the dependent variable prediction.
STEP1: import the statsmodels library
import statsmodels.api as sm # OLS lives in statsmodels.api; statsmodels.formula.api is for R-style formulas
STEP2: add a column of ones to the x matrix of independent variables so that statsmodels accounts for the constant b0 in the multivariable LR equation y = b0 + b1*x1 + ... + bn*xn; b0 is called the ‘intercept’
x = np.append(arr = np.ones((50, 1)).astype(int), values = x, axis = 1)
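An equivalent alternative (just a sketch, using the sm alias imported above): statsmodels can prepend that intercept column itself.
x = sm.add_constant(x) # prepends a column of ones for the intercept b0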
STEP3: create a matrix x_optimal that will end up containing only the statistically significant variables (P-value < SL); start with all the columns
x_optimal = x[:,[0,1,2,3,4,5]]
STEP4: fit all the possible independent variables to the OLS (ordinary least squares) model
regressor_OLS = sm.OLS(endog = y, exog = x_optimal).fit()
STEP5: check each independent variable’s P-value against the significance level SL and remove the independent variable with the highest P-value > 0.05; the following line returns a very useful statistical summary of the fitted model
regressor_OLS.summary()
coef std err t P>|t| [95.0% Conf. Int.]
------------------------------------------------------------------------------
const 5.013e+04 6884.820 7.281 0.000 3.62e+04 6.4e+04
x1 198.7888 3371.007 0.059 0.953 -6595.030 6992.607
x2 -41.8870 3256.039 -0.013 0.990 -6604.003 6520.229
x3 0.8060 0.046 17.369 0.000 0.712 0.900
x4 -0.0270 0.052 -0.517 0.608 -0.132 0.078
x5 0.0270 0.017 1.574 0.123 -0.008 0.062
#### note, that x2 variable has the highest P-value 0.990 which is > 0.05, so we will remove x2
x_optimal = x[:,[0,1,3,4,5]]
regressor_OLS = sm.OLS(endog = y, exog = x_optimal).fit()
regressor_OLS.summary()
#### note, that x1 variable has the highest P-value 0.953 which is > 0.05, so we will remove x1
x_optimal = x[:,[0,3,4,5]]
regressor_OLS = sm.OLS(endog = y, exog = x_optimal).fit()
regressor_OLS.summary()
#### note, that x4 variable has the highest P-value 0.608 which is > 0.05, so we will remove x4
x_optimal = x[:,[0,3,5]]
regressor_OLS = sm.OLS(endog = y, exog = x_optimal).fit()
regressor_OLS.summary()
#### note, that x5 variable has the highest P-value 0.060 which is > 0.05, so we will remove x5
x_optimal = x[:,[0,3]]
regressor_OLS = sm.OLS(endog = y, exog = x_optimal).fit()
regressor_OLS.summary()
Now x_optimal is final: it contains only x0 (the intercept column of ones) and x3 (R&D Spend), which is highly statistically significant and has a major impact on the prediction.
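The same manual procedure can be wrapped in a loop (a sketch, assuming x still contains the intercept column of ones added above and SL = 0.05):
import numpy as np
import statsmodels.api as sm

def backward_elimination(x, y, sl=0.05):
    x_opt = x.copy()
    while True:
        regressor_OLS = sm.OLS(endog=y, exog=x_opt).fit()
        p_values = regressor_OLS.pvalues
        worst = int(np.argmax(p_values))         # predictor with the highest P-value
        if p_values[worst] <= sl:                # everything left is significant: done
            return x_opt, regressor_OLS
        x_opt = np.delete(x_opt, worst, axis=1)  # remove it and refit

x_optimal, regressor_OLS = backward_elimination(x, y)
regressor_OLS.summary()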
NOTE:: in linear regression models you don’t need to do ‘feature scaling’; it is taken care of by sklearn’s LinearRegression, and the same holds for the LR implementation in R.
NOTE:: the lower the P-value, the more statistically significant the independent variable is and the more impact it has on the prediction of the dependent variable.
The threshold you should compare your P-value to here is 0.05.
-----------------------------------------------------------------------------------------------------------------------
R-Language Tutorial
# Multiple Linear Regression
# import data
dataset = read.csv('50_Startups.csv')
# Encoding categorical data
dataset$State = factor(dataset$State,
levels = c("California", "New York", "Florida"),
labels = c (1,2,3))
# the R library (lm) takes care of the dummy variable trap automatically
# splitting data into train and test set
library(caTools)
set.seed(123)
split = sample.split(dataset$Profit, SplitRatio = 0.8) # Profit is the dependent variable, SplitRatio is the training set ratio
train_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)
# Fitting Multiple Linear regression to the training set
#regressor = lm(formula = Profit ~ R.D.Spend + Administration + Marketing.Spend + State)
regressor = lm(formula = Profit ~ ., data = train_set) # . --> means a combination of all the independent variables
summary(regressor)
# output
> summary(regressor)
Call:
lm(formula = Profit ~ ., data = train_set)
Residuals:
Min 1Q Median 3Q Max
-33128 -4865 5 6098 18065
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.977e+04 7.516e+03 6.622 1.36e-07 ***
R.D.Spend 7.986e-01 5.604e-02 14.251 6.70e-16 ***
Administration -2.942e-02 5.828e-02 -0.505 0.617
Marketing.Spend 3.268e-02 2.127e-02 1.537 0.134
State2 -1.213e+02 3.751e+03 -0.032 0.974
State3 1.162e+02 4.048e+03 0.029 0.977
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 9908 on 34 degrees of freedom
Multiple R-squared: 0.9499, Adjusted R-squared: 0.9425
F-statistic: 129 on 5 and 34 DF, p-value: < 2.2e-16
NOTE:: ‘***’ means that the variable is highly significant: its P-value falls in the smallest category, between 0 and 0.001
NOTE:: We should not use Multiple Linear Regression to predict a dependent variable that is growing exponentially with time.
NOTE:: In R any space in the column names is converted into ‘.’
So we can notice that the only variable with high statistical significance is the R.D.Spend column.
# Predicting the Test set results
y_pred = predict(regressor, newdata = test_set)
# Now, we will build the optimal model using Backward Elimination
# here we will build the LR model on all the dataset so it can learn the significant and
# insignificant independent variables in the whole dataset
# so we are fitting the whole model to the regressor
regressor = lm(formula = Profit ~ R.D.Spend + Administration + Marketing.Spend + State,
data = dataset)
###### then we will go through the backward elimination by removing the independent variables with P-value > SL(significance level) =0.05 ######
> regressor_opt = lm(formula = Profit ~ R.D.Spend + Administration + Marketing.Spend + State,
+ data = dataset)
> summary(regressor_opt)
Call:
lm(formula = Profit ~ R.D.Spend + Administration + Marketing.Spend +
State, data = dataset)
Residuals:
Min 1Q Median 3Q Max
-33504 -4736 90 6672 17338
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.013e+04 6.885e+03 7.281 4.44e-09 ***
R.D.Spend 8.060e-01 4.641e-02 17.369 < 2e-16 ***
Administration -2.700e-02 5.223e-02 -0.517 0.608
Marketing.Spend 2.698e-02 1.714e-02 1.574 0.123
State2 -4.189e+01 3.256e+03 -0.013 0.990
State3 1.988e+02 3.371e+03 0.059 0.953
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 9439 on 44 degrees of freedom
Multiple R-squared: 0.9508, Adjusted R-squared: 0.9452
F-statistic: 169.9 on 5 and 44 DF, p-value: < 2.2e-16
> regressor_opt = lm(formula = Profit ~ R.D.Spend + Administration + Marketing.Spend,
+ data = dataset)
> summary(regressor_opt)
Call:
lm(formula = Profit ~ R.D.Spend + Administration + Marketing.Spend,
data = dataset)
Residuals:
Min 1Q Median 3Q Max
-33534 -4795 63 6606 17275
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.012e+04 6.572e+03 7.626 1.06e-09 ***
R.D.Spend 8.057e-01 4.515e-02 17.846 < 2e-16 ***
Administration -2.682e-02 5.103e-02 -0.526 0.602
Marketing.Spend 2.723e-02 1.645e-02 1.655 0.105
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 9232 on 46 degrees of freedom
Multiple R-squared: 0.9507, Adjusted R-squared: 0.9475
F-statistic: 296 on 3 and 46 DF, p-value: < 2.2e-16
> regressor_opt = lm(formula = Profit ~ R.D.Spend + Marketing.Spend,
+ data = dataset)
> summary(regressor_opt)
Call:
lm(formula = Profit ~ R.D.Spend + Marketing.Spend, data = dataset)
Residuals:
Min 1Q Median 3Q Max
-33645 -4632 -414 6484 17097
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.698e+04 2.690e+03 17.464 <2e-16 ***
R.D.Spend 7.966e-01 4.135e-02 19.266 <2e-16 ***
Marketing.Spend 2.991e-02 1.552e-02 1.927 0.06 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 9161 on 47 degrees of freedom
Multiple R-squared: 0.9505, Adjusted R-squared: 0.9483
F-statistic: 450.8 on 2 and 47 DF, p-value: < 2.2e-16
> regressor_opt = lm(formula = Profit ~ R.D.Spend,
+ data = dataset)
> summary(regressor_opt)
Call:
lm(formula = Profit ~ R.D.Spend, data = dataset)
Residuals:
Min 1Q Median 3Q Max
-34351 -4626 -375 6249 17188
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.903e+04 2.538e+03 19.32 <2e-16 ***
R.D.Spend 8.543e-01 2.931e-02 29.15 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 9416 on 48 degrees of freedom
Multiple R-squared: 0.9465, Adjusted R-squared: 0.9454
F-statistic: 849.8 on 1 and 48 DF, p-value: < 2.2e-16
---------------------------------------------------------------------------------------------------------------------
NOTE:: I did some experiments to find the more optimal LR model, using the following 2 models
# first model
regressor_opt1 = lm(formula = Profit ~ R.D.Spend + Marketing.Spend,
data = dataset)
summary(regressor_opt1)
# second model
regressor_opt = lm(formula = Profit ~ R.D.Spend,
data = dataset)
summary(regressor_opt)
y_pred = predict(regressor_opt, newdata = test_set)
y_pred1 = predict(regressor_opt1, newdata = test_set) # separate name so the two predictions can be compared
Observation:: the prediction using the first model (regressor_opt1) was better, meaning its predicted values are closer to the actual dependent-variable values in test_set. A possible reason is that the ‘Marketing.Spend’ independent variable has a P-value of 0.06, which is very close to the significance level SL, so whether to remove it is a borderline decision. And I can say that removing it made the prediction results worse.
IMP NOTE
Also, observe the values of Multiple R-squared and Adjusted R-squared →
Adding another independent variable to the model will always increase (or at least not decrease) the Multiple R-squared value, but it will decrease the Adjusted R-squared value if the variable doesn’t actually help the model. So the Adjusted R-squared value is an excellent measure of how good the regression model is.
For example: when removing the dummy variables ‘State2’ and ‘State3’, the value of Multiple R-squared decreased only slightly while the value of Adjusted R-squared increased, which means it was good to remove the dummy variables as they were doing no good for the regression model.
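This behaviour follows from the standard formula Adjusted R-squared = 1 - (1 - R-squared) * (n - 1) / (n - p - 1), where n is the number of observations and p the number of predictors: every added predictor increases p and therefore the penalty, so R-squared has to improve enough to compensate, otherwise the Adjusted R-squared falls.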
Do you remember this observation?!
Observation:: the prediction using the first model (regressor_opt1, which keeps ‘Marketing.Spend’) was better, even though ‘Marketing.Spend’ has a P-value of 0.06, slightly above the significance level, so Backward Elimination says it should be removed.
Now I know why: by comparing the Adjusted R-squared values of those 2 models, we can see that removing the ‘Marketing.Spend’ independent variable decreased the regression model’s performance, even though the P-value of ‘Marketing.Spend’ is 0.06, which is > the significance level of 0.05, and according to the ‘Backward Elimination’ method it should have been removed.
Linear Regression Coefficients Interpretation
Increasing either of the independent variables ‘R&D Spend’ or ‘Marketing Spend’ by one unit increases the dependent variable ‘Profit’ by an amount equal to that variable’s coefficient, holding the other variable fixed. Why ‘R&D Spend’ and ‘Marketing Spend’? Because building the regression model using only those 2 independent variables gave us the best model.
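For example, using the coefficients printed above for regressor_opt1: spending one extra dollar on R&D increases the predicted Profit by about $0.80 (coefficient 7.966e-01), while one extra dollar on Marketing increases it by about $0.03 (coefficient 2.991e-02), all else being equal.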