@alaakh42
Created August 16, 2017 16:59
Multiple Linear Regression
Multicollinearity → when one independent variable can be predicted from another independent variable, like the two dummy variables New York and California below.
For example, take the case of ‘Dummy Variables’ - the encoding of a categorical variable into numerical variables. Suppose a categorical column called ‘State’ has 2 levels: [New York, California]
State
New York
California
California
New York
California

then the dummy variables will be (note that D2 = 1 - D1):

New York (D1)    California (D2)
1                0
0                1
0                1
1                0
0                1
In a Multiple LR problem:
Y = b0 + b1*x1 + b2*x2 + b3*x3 + b4*D1
You cannot add both dummy variables to the LR equation: because D2 = 1 - D1, they are perfectly correlated and the LR model cannot distinguish the effect of D1 from the effect of D2. This is called the ‘Dummy Variable Trap’.
So, as a rule of thumb, always omit one dummy variable when you are creating the LR model: if you have 100 dummy variables, include only 99. You also have to apply the same rule to every categorical variable column.
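As a quick illustration (a minimal sketch with pandas, using the toy ‘State’ column above; this snippet is not part of the original tutorial code):
import pandas as pd
# toy 'State' column from the table above
df = pd.DataFrame({"State": ["New York", "California", "California", "New York", "California"]})
# drop_first=True keeps only one dummy column ('State_New York' here), making
# 'California' the baseline and avoiding the Dummy Variable Trap automatically
dummies = pd.get_dummies(df["State"], prefix="State", drop_first=True)
print(dummies)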
How to build a model (Step-by-Step)
When building a model you have to choose which variables to include, because if you include too many garbage variables you will end up with a garbage model. You also need to understand the impact of each variable on the dependent variable, and to be able to explain it (e.g. to your boss), which is not practical with hundreds of variables.
5 Methods of building a Model
All-in → to use all the variables,
Cases:
If you have prior knowledge, e.g. you’ve built this model before
You have to use them all, e.g. a framework in a bank requires certain variables
When you are preparing for ‘Backward Elimination’
Backward Elimination → is the one that will be used in the tutorial as it is the fastest
STEP 1: Select a significance level to stay in the model (e.g. SL = 0.05)
STEP 2: Fit the full model with all possible predictors, add all the variables to your model
STEP 3: Consider the predictor with the highest P-value. If P > SL, go to STEP 4; otherwise go to FIN ‘Finish’, which means your model is ready
STEP 4: Remove the predictor
STEP 5: Fit the model without this variable, i.e. rebuild the whole model without the variable whose P-value exceeded the significance level; it will be a new model with new coefficients and a new constant. After STEP 5, return to STEP 3 and repeat until the highest remaining P-value is below your SL
Forward Selection →
STEP 1: Select a significance level to enter the model (e.g. SL = 0.05)
STEP 2: Fit all simple regression models y ~ xn and select the one with the lowest P-value. This means using every possible variable on its own to make a simple LR model with only one variable.
STEP 3: Keep the variable(s) you already selected and add each of the remaining variables one by one, creating LR models with one extra variable - always including the variables selected so far
STEP 4: Consider the predictor with the lowest P-value. If P < SL, go back to STEP 3, otherwise go to FIN. That is, from all the candidate models constructed in STEP 3, select the one with the lowest new P-value, then return to STEP 3 to add another variable and fit again, always choosing the lowest P-value. The trick is that you stop when the best new variable turns out to be insignificant, and you keep the previous model - the one you had before adding that insignificant variable.
Bidirectional Elimination/ Stepwise Regression →
STEP 1: Select a significance level to enter and to stay in the model
E.g. SLENTER = 0.05, SLSTAY = 0.05
STEP 2: Perform the next step of Forward Selection (new variables must have P < SLENTER to enter)
STEP 3: Perform all steps of Backward Elimination (old variables must have P < SLSTAY to stay)
STEP 4: When no new variables can enter and no old variables can exit, your model is READY
Score Comparison/ All Possible Models →
STEP 1: Select a criterion of goodness of fit (e.g. Akaike criterion)
STEP 2: Construct all possible regression models: 2^N - 1 total combinations
STEP 3: Select the one with the best criterion
Example: a dataset with 10 columns means 2^10 - 1 = 1023 models! It is not a practical approach as the number of models grows exponentially with the number of variables (see the sketch after this list)
NOTE ::: Numbers 2, 3, and 4 are called “Stepwise Regression”
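As a rough sketch of the ‘All Possible Models’ idea (not from the tutorial; X is assumed to be a pandas DataFrame of candidate predictors and y the target vector, both hypothetical names), scoring each subset with statsmodels’ Akaike criterion:
from itertools import combinations
import statsmodels.api as sm

def best_subset_by_aic(X, y):
    # fit all 2^N - 1 non-empty subsets of the N candidate predictors
    best_aic, best_cols = float("inf"), None
    cols = list(X.columns)
    for k in range(1, len(cols) + 1):
        for subset in combinations(cols, k):
            model = sm.OLS(y, sm.add_constant(X[list(subset)])).fit()
            if model.aic < best_aic:
                best_aic, best_cols = model.aic, subset
    return best_cols, best_aic
With 10 candidate columns this loop fits 1023 models, which is exactly why the approach doesn’t scale.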
-------------------------------------------------------------------------------------------------------------------------
In the Python practical tutorial, we are building a model that checks whether there is a linear relationship between the 4 independent variables [R&D Spend, Administration, Marketing Spend, State] and the one dependent variable [Profit]; we want to see if we can predict the Profit value using the 4 independent variables.
STEP1: Import the dataset
import numpy as np
import pandas as pd
dataset = pd.read_csv("50_Startups.csv")
STEP2: define the independent variables matrix x and the dependent variable vector y
x = dataset.iloc[:, :-1].values # independent variables
y = dataset.iloc[:, 4].values # dependent variable
STEP3: we will start encoding the categorical variables using LabelEncoder and OneHotEncoder
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_x = LabelEncoder()
x[:,3] = labelencoder_x.fit_transform(x[:,3])
# Dummy Encoding
onehotencoder = OneHotEncoder(categorical_features=[3]) # encode the state col
x = onehotencoder.fit_transform(x).toarray()
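Note: the categorical_features argument was removed in newer scikit-learn releases; a rough equivalent using ColumnTransformer (a sketch, assuming column index 3 is still the State column) is:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# one-hot encode column 3 (State) and pass the numeric columns through unchanged;
# the dummy columns are placed first, so the trap fix in STEP4 still applies
ct = ColumnTransformer([("state", OneHotEncoder(), [3])],
                       remainder="passthrough", sparse_threshold=0.0)
x = ct.fit_transform(x)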
STEP4: Avoid the dummy variable trap by dropping one dummy column, to eliminate the dependency between the dummy variables. Note: for ML libraries like sklearn’s LinearRegression you don’t strictly need to do this, as it is already taken care of by the library
x = x [:, 1:]
STEP5: Split the data into training and testing sets
from sklearn.model_selection import train_test_split # was sklearn.cross_validation in older scikit-learn versions
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, train_size=0.8, random_state= 0)
STEP6: Fit the regressor to the training data set
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(x_train, y_train)
STEP7: Evaluate the multivariable LR model on our Test set x_test
y_pred = regressor.predict(x_test) # compare between y_pred and y_test
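One way to quantify that comparison (an optional sketch; the choice of metrics is mine, not part of the tutorial):
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error

print("Test R^2 :", r2_score(y_test, y_pred))
print("Test RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))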
If you are satisfied with your model results, you do not need to do the following.
But, I want to get better results:
Build the optimal multiple LR model using Backward Elimination. Here we build the optimal model by eliminating the statistically insignificant variables that have no major impact on predicting the dependent variable. We will end up with the team of independent variables that have a real impact [positive ‘increase profit’ / negative ‘decrease profit’] on the dependent variable prediction.
STEP1: import the statsmodels library
import statsmodels.api as sm # OLS lives in statsmodels.api; statsmodels.formula.api is for R-style formulas
STEP2: add a column of ones to the x matrix of independent variables so that statsmodels accounts for the constant b0 in the multivariable LR equation y = b0 + b1*x1 + ... + bn*xn; b0 is called the ‘intercept’
x = np.append(arr = np.ones((50, 1)).astype(int), values = x, axis = 1)
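An equivalent alternative (just a sketch, using the sm alias imported above): statsmodels can prepend that intercept column itself.
x = sm.add_constant(x) # prepends a column of ones for the intercept b0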
STEP3: create a matrix x_optimal that will end up containing only the statistically significant variables (P-value < SL); start with all the columns
x_optimal = x[:,[0,1,2,3,4,5]]
STEP4: fit all the possible independent variables to the OLS (ordinary least squares) model
regressor_OLS = sm.OLS(endog = y, exog = x_optimal).fit()
STEP5: check each independent variable’s P-value against the significance level SL and remove the independent variable with the highest P-value > 0.05; the following line returns a very useful statistical summary of the fitted model
regressor_OLS.summary()
coef std err t P>|t| [95.0% Conf. Int.]
------------------------------------------------------------------------------
const 5.013e+04 6884.820 7.281 0.000 3.62e+04 6.4e+04
x1 198.7888 3371.007 0.059 0.953 -6595.030 6992.607
x2 -41.8870 3256.039 -0.013 0.990 -6604.003 6520.229
x3 0.8060 0.046 17.369 0.000 0.712 0.900
x4 -0.0270 0.052 -0.517 0.608 -0.132 0.078
x5 0.0270 0.017 1.574 0.123 -0.008 0.062
#### note, that x2 variable has the highest P-value 0.990 which is > 0.05, so we will remove x2
x_optimal = x[:,[0,1,3,4,5]]
regressor_OLS = sm.OLS(endog = y, exog = x_optimal).fit()
regressor_OLS.summary()
#### note, that x1 variable has the highest P-value 0.953 which is > 0.05, so we will remove x1
x_optimal = x[:,[0,3,4,5]]
regressor_OLS = sm.OLS(endog = y, exog = x_optimal).fit()
regressor_OLS.summary()
#### note, that x4 variable has the highest P-value 0.608 which is > 0.05, so we will remove x4
x_optimal = x[:,[0,3,5]]
regressor_OLS = sm.OLS(endog = y, exog = x_optimal).fit()
regressor_OLS.summary()
#### note, that x5 variable has the highest P-value 0.060 which is > 0.05, so we will remove x5
x_optimal = x[:,[0,3]]
regressor_OLS = sm.OLS(endog = y, exog = x_optimal).fit()
regressor_OLS.summary()
Now x_optimal is final: it contains only x0 (the intercept column of ones) and x3 (R&D Spend), which is highly statistically significant and has a major impact on the prediction.
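The same manual procedure can be wrapped in a loop (a sketch, assuming x still contains the intercept column of ones added above and SL = 0.05):
import numpy as np
import statsmodels.api as sm

def backward_elimination(x, y, sl=0.05):
    x_opt = x.copy()
    while True:
        regressor_OLS = sm.OLS(endog=y, exog=x_opt).fit()
        p_values = regressor_OLS.pvalues
        worst = int(np.argmax(p_values))         # predictor with the highest P-value
        if p_values[worst] <= sl:                # everything left is significant: done
            return x_opt, regressor_OLS
        x_opt = np.delete(x_opt, worst, axis=1)  # remove it and refit

x_optimal, regressor_OLS = backward_elimination(x, y)
regressor_OLS.summary()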
NOTE:: in linear regression models you don’t need to do ‘feature scaling’; it is taken care of by sklearn’s LinearRegression, and the same holds for the LR implementation in R.
NOTE:: the lower the P-value, the more statistically significant the independent variable is and the more impact it has on the prediction of the dependent variable.
The threshold you should compare your P-value to here is 0.05.
-----------------------------------------------------------------------------------------------------------------------
R-Language Tutorial
# Multiple Linear Regression
# import data
dataset = read.csv('50_Startups.csv')
# Encoding categorical data
dataset$State = factor(dataset$State,
levels = c("California", "New York", "Florida"),
labels = c (1,2,3))
# the R library (lm) takes care of the dummy variable trap automatically
# splitting data into train and test set
library(caTools)
set.seed(123)
split = sample.split(dataset$Profit, SplitRatio = 0.8) # Profit is the dependent variable, SplitRatio is the training set ratio
train_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)
# Fitting Multiple Linear regression to the training set
#regressor = lm(formula = Profit ~ R.D.Spend + Administration + Marketing.Spend + State)
regressor = lm(formula = Profit ~ ., data = train_set) # . --> means a combination of all the independent variables
summary(regressor)
# output
> summary(regressor)
Call:
lm(formula = Profit ~ ., data = train_set)
Residuals:
Min 1Q Median 3Q Max
-33128 -4865 5 6098 18065
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.977e+04 7.516e+03 6.622 1.36e-07 ***
R.D.Spend 7.986e-01 5.604e-02 14.251 6.70e-16 ***
Administration -2.942e-02 5.828e-02 -0.505 0.617
Marketing.Spend 3.268e-02 2.127e-02 1.537 0.134
State2 -1.213e+02 3.751e+03 -0.032 0.974
State3 1.162e+02 4.048e+03 0.029 0.977
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 9908 on 34 degrees of freedom
Multiple R-squared: 0.9499, Adjusted R-squared: 0.9425
F-statistic: 129 on 5 and 34 DF, p-value: < 2.2e-16
NOTE:: ‘***’ means that the variable is highly significant: its P-value falls in the smallest category, between 0 and 0.001
NOTE:: We should not use Multiple Linear Regression to predict a dependent variable that is growing exponentially with time.
NOTE:: In R any space in the column names is converted into ‘.’
So we can notice that the only variable with high statistical significance is the R.D.Spend column.
# Predicting the Test set results
y_pred = predict(regressor, newdata = test_set)
# Now, we will build the optimal model using Backward Elimination
# here we will build the LR model on all the dataset so it can learn the significant and
# insignificant independent variables in the whole dataset
# so we are fitting the whole model to the regressor
regressor = lm(formula = Profit ~ R.D.Spend + Administration + Marketing.Spend + State,
data = dataset)
###### then we will go through the backward elimination by removing the independent variables with P-value > SL(significance level) =0.05 ######
> regressor_opt = lm(formula = Profit ~ R.D.Spend + Administration + Marketing.Spend + State,
+ data = dataset)
> summary(regressor_opt)
Call:
lm(formula = Profit ~ R.D.Spend + Administration + Marketing.Spend +
State, data = dataset)
Residuals:
Min 1Q Median 3Q Max
-33504 -4736 90 6672 17338
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.013e+04 6.885e+03 7.281 4.44e-09 ***
R.D.Spend 8.060e-01 4.641e-02 17.369 < 2e-16 ***
Administration -2.700e-02 5.223e-02 -0.517 0.608
Marketing.Spend 2.698e-02 1.714e-02 1.574 0.123
State2 -4.189e+01 3.256e+03 -0.013 0.990
State3 1.988e+02 3.371e+03 0.059 0.953
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 9439 on 44 degrees of freedom
Multiple R-squared: 0.9508, Adjusted R-squared: 0.9452
F-statistic: 169.9 on 5 and 44 DF, p-value: < 2.2e-16
> regressor_opt = lm(formula = Profit ~ R.D.Spend + Administration + Marketing.Spend,
+ data = dataset)
> summary(regressor_opt)
Call:
lm(formula = Profit ~ R.D.Spend + Administration + Marketing.Spend,
data = dataset)
Residuals:
Min 1Q Median 3Q Max
-33534 -4795 63 6606 17275
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.012e+04 6.572e+03 7.626 1.06e-09 ***
R.D.Spend 8.057e-01 4.515e-02 17.846 < 2e-16 ***
Administration -2.682e-02 5.103e-02 -0.526 0.602
Marketing.Spend 2.723e-02 1.645e-02 1.655 0.105
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 9232 on 46 degrees of freedom
Multiple R-squared: 0.9507, Adjusted R-squared: 0.9475
F-statistic: 296 on 3 and 46 DF, p-value: < 2.2e-16
> regressor_opt = lm(formula = Profit ~ R.D.Spend + Marketing.Spend,
+ data = dataset)
> summary(regressor_opt)
Call:
lm(formula = Profit ~ R.D.Spend + Marketing.Spend, data = dataset)
Residuals:
Min 1Q Median 3Q Max
-33645 -4632 -414 6484 17097
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.698e+04 2.690e+03 17.464 <2e-16 ***
R.D.Spend 7.966e-01 4.135e-02 19.266 <2e-16 ***
Marketing.Spend 2.991e-02 1.552e-02 1.927 0.06 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 9161 on 47 degrees of freedom
Multiple R-squared: 0.9505, Adjusted R-squared: 0.9483
F-statistic: 450.8 on 2 and 47 DF, p-value: < 2.2e-16
> regressor_opt = lm(formula = Profit ~ R.D.Spend,
+ data = dataset)
> summary(regressor_opt)
Call:
lm(formula = Profit ~ R.D.Spend, data = dataset)
Residuals:
Min 1Q Median 3Q Max
-34351 -4626 -375 6249 17188
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.903e+04 2.538e+03 19.32 <2e-16 ***
R.D.Spend 8.543e-01 2.931e-02 29.15 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 9416 on 48 degrees of freedom
Multiple R-squared: 0.9465, Adjusted R-squared: 0.9454
F-statistic: 849.8 on 1 and 48 DF, p-value: < 2.2e-16
---------------------------------------------------------------------------------------------------------------------
NOTE:: I did some experiments to find the more optimal LR model, using the following 2 models
# first model
regressor_opt1 = lm(formula = Profit ~ R.D.Spend + Marketing.Spend,
data = dataset)
summary(regressor_opt1)
# second model
regressor_opt = lm(formula = Profit ~ R.D.Spend,
data = dataset)
summary(regressor_opt)
y_pred = predict(regressor_opt, newdata = test_set)
y_pred1 = predict(regressor_opt1, newdata = test_set) # separate name so the two predictions can be compared
Observation:: the prediction using the first model (regressor_opt1) was better, meaning its predicted values are closer to the actual dependent-variable values in test_set. A possible reason is that the ‘Marketing.Spend’ independent variable has a P-value of 0.06, which is very close to the significance level SL, so whether to remove it is a borderline decision. And I can say that removing it made the prediction results worse.
IMP NOTE
Also, observe the values of Multiple R-squared and Adjusted R-squared →
Adding another independent variable to the model will always increase (or at least not decrease) the Multiple R-squared value, but it will decrease the Adjusted R-squared value if the variable doesn’t actually help the model. So the Adjusted R-squared value is an excellent measure of how good the regression model is.
For example: when removing the dummy variables ‘State2’ and ‘State3’, the value of Multiple R-squared decreased only slightly while the value of Adjusted R-squared increased, which means it was good to remove the dummy variables as they were doing no good for the regression model.
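This behaviour follows from the standard formula Adjusted R-squared = 1 - (1 - R-squared) * (n - 1) / (n - p - 1), where n is the number of observations and p the number of predictors: every added predictor increases p and therefore the penalty, so R-squared has to improve enough to compensate, otherwise the Adjusted R-squared falls.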
Do you remember this observation?!
Observation:: the prediction using the first model (regressor_opt1, which keeps ‘Marketing.Spend’) was better, even though ‘Marketing.Spend’ has a P-value of 0.06, slightly above the significance level, so Backward Elimination says it should be removed.
Now I know why: by comparing the Adjusted R-squared values of those 2 models, we can see that removing the ‘Marketing.Spend’ independent variable decreased the regression model’s performance, even though the P-value of ‘Marketing.Spend’ is 0.06, which is > the significance level of 0.05, and according to the ‘Backward Elimination’ method it should have been removed.
Linear Regression Coefficients Interpretation
Increasing either of the independent variables ‘R&D Spend’ or ‘Marketing Spend’ by one unit increases the dependent variable ‘Profit’ by an amount equal to that variable’s coefficient, holding the other variable fixed. Why ‘R&D Spend’ and ‘Marketing Spend’? Because building the regression model using only those 2 independent variables gave us the best model.
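For example, using the coefficients printed above for regressor_opt1: spending one extra dollar on R&D increases the predicted Profit by about $0.80 (coefficient 7.966e-01), while one extra dollar on Marketing increases it by about $0.03 (coefficient 2.991e-02), all else being equal.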