Data Analysis with Python
Python libraries for Data Analysis:
1. Scientific computing libraries:
Pandas -> data structures and tools
Numpy -> arrays and matrices
Scipy -> integrals, solving differential equations, optimization
2. Visualization libraries
Matplotlib -> plots & graphs, most popular
Seaborn -> plots: heat maps, time series, violin plots
3. Algorithmic libraries
Scikit-learn -> Machine learning: regression, classification,...
Statsmodels -> explore data, estimate statistical models and perform statistical tests.
PANDAS
Formats: xls, csv, json, hdf
import pandas as pd
url = "https:\\l......"
df = pd.read_csv(url, header = None) # read_csv assumes headers by default in the csv file.
df.head(n) -> shows the first n rows of the dataframe
df.tail(n) -> shows the last n rows from the dataframe
To assign column names:
Create a list with the names of the columns: headers = [col1, col2, col3, ...]
Assign the headers to the df: df.columns = headers
Export the csv to a path:
path = "C:\Windows\..."
df.to_csv(path)
Formats:
pd.read_csv()    df.to_csv()
pd.read_json()   df.to_json()
pd.read_excel()  df.to_excel()
pd.read_sql()    df.to_sql()
Check data types of the df: df.dtypes
Check statistical data of the df: df.describe()
df.describe(include="all") to list statistics for all the columns, including the non-numeric ones
df.info() provides a concise summary of the dataframe: column names, non-null counts, dtypes and memory usage
df.describe(include=['object'])
You can select the columns of a data frame by indicating the name of each column; for example, you can select three columns as follows:
dataframe[['column 1', 'column 2', 'column 3']]
Where "column" is the name of the column, you can apply the method ".describe()" to get the statistics of those columns as follows:
dataframe[['column 1', 'column 2', 'column 3']].describe()
Drop missing values from a column: df.dropna(subset=["price"], axis=0)
df.columns -> print the headers
DATA WRANGLING
For help: https://pandas.pydata.org/
To drop missing values: dataframe.dropna(subset=['column name'], axis=, inplace=True). Use axis=0 to drop the row and axis=1 to drop the column; inplace=True writes the result back into the same dataframe. If the inplace parameter is not added, the dataframe is not changed.
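For instance (a minimal sketch reusing the car-price dataframe from above; df_no_na_cols is just an illustrative name):
df.dropna(subset=["price"], axis=0, inplace=True)   # drop the rows where "price" is missing, in place
df_no_na_cols = df.dropna(axis=1)                   # returns a copy with every column that contains missing values removed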
To replace missing values: dataframe.replace(missing_value, new_value)
For example, to replace a NAN value with the mean of the column:
df.replace("?", np.nan, inplace = True)
mean = dataframe["column"].mean()
dataframe["column"].replace(np.nan, mean)
Evaluating for Missing Data
Missing entries (for example "?") are first converted to NaN, the default missing-value marker. We then use pandas' built-in methods to identify these missing values. There are two methods to detect missing data:
.isnull()
.notnull()
The output is a boolean value indicating whether the value passed into the argument is in fact missing data. "True" stands for a missing value, while "False" means the value is present.
missing_data = df.isnull()
missing_data.head(5)
Count missing values in each column
Using a for loop in Python, we can quickly figure out the number of missing values in each column. As mentioned above, "True" represents a missing value, "False" means the value is present in the dataset. In the body of the for loop the method ".value_counts()" counts the number of "True" values.
for column in missing_data.columns.values.tolist():
    print(column)
    print(missing_data[column].value_counts())
    print("")
Calculate the average of the column
avg_norm_loss = df["normalized-losses"].astype("float").mean(axis=0)
print("Average of normalized-losses:", avg_norm_loss)
To see which values are present in a particular column, we can use the ".value_counts()" method:
df['num-of-doors'].value_counts()
We can see that four doors is the most common type. We can also use the ".idxmax()" method to find the most common type automatically:
df['num-of-doors'].value_counts().idxmax()
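The most frequent category can then be used to fill in the missing entries of that column (a small sketch following the same replacement pattern as for the mean above; assumes numpy is imported as np):
most_common = df['num-of-doors'].value_counts().idxmax()
df['num-of-doors'] = df['num-of-doors'].replace(np.nan, most_common)   # replace missing values with the most common category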
DATA FORMATTING
For example, in the data set of car prices, to change the consumption from miles/gallon to liters / 100 km:
df["city-mpg"] = 235 / df["city-mpg"] -> divide y aplica el cambio a toda la columna
df.rename(columns={"city-mpg":"city-L/100km}, inplace = true)
Correct data types, example: from objets to numbers.
Identidy data types: dataframe.dtypes()
Convert data types: dataframe.astype()
ex: df["price"] = df["price"].astype("int") (int, float, object...)
DATA NORMALIZATION:
Simple feature scaling: Xnew = Xold /Xmax
df["column1"] = df["column1"]/df["column1"].max()
Min-max feature scaling: Xnew = (Xold - Xmin)/(Xmax- Xmin)
df["column1"] = (df["column1"] - df["column1"].min()) /
(df["column1"].max() - df["column1"].min())
Z-score feature scaling: Xnew = (Xold - Average of the feature mu)/ Standar deviation sigma ---- tipically varies between -3 and 3, but can by higher
df["column1"] = (df["column1"] - df["column1"].mean()) /
df["column1"].std()
BINNING:
Binning is grouping values into bins.
It can convert numerical variables into categorical variables by grouping them into a set of "bins".
To bin in Python:
bins = np.linspace(min(df["price"]),max(df["price"]),4) -> create 3 bins (need 4 dividers)
group_names = ['low', 'medium', 'high']
df["price_binned"] = pd.cut(df["price"], bins, labels=group_names, include_lowest = True)
Plot:
%matplotlib inline
import matplotlib.pyplot as plt
plt.bar(group_names, df["horsepower-binned"].value_counts())
# set x/y labels and plot title
plt.xlabel("horsepower")
plt.ylabel("count")
plt.title("horsepower bins")
Or:
%matplotlib inline
import matplotlib.pyplot as plt
# draw a histogram of the attribute "horsepower" with bins = 3
plt.hist(df["horsepower"], bins=3)
# set x/y labels and plot title
plt.xlabel("horsepower")
plt.ylabel("count")
plt.title("horsepower bins")
CONVERT CATEGORICAL VARIABLES INTO QUANTITATIVE VARIABLES, OR INDICATOR VARIABLES
One-Hot encoding: we create new feature columns for each category and set to 1 or 0.
dummy_variable_1 = pd.get_dummies(df["fuel-type"])
# merge data frame "df" and "dummy_variable_1"
df = pd.concat([df, dummy_variable_1], axis=1)
# drop the original column "fuel-type" from "df"
df.drop("fuel-type", axis=1, inplace=True)
EXPLORATORY DATA ANALYSIS
import pandas as pd
import numpy as np
path='https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DA0101EN/automobileEDA.csv'
df = pd.read_csv(path)
df.head()
%%capture
! pip install seaborn
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
df.corr()
df[['bore','stroke' ,'compression-ratio','horsepower']].corr()
Describe statistics: df.describe()
df.describe(include=['object'])
Summarize the categorical data: drive_wheels_counts = df['drive-wheels'].value_counts().to_frame()
drive_wheels_counts.rename(columns={'drive-wheels': 'value_counts'}, inplace=True)
drive_wheels_counts.index.name = 'drive-wheels'
Or:
df['drive-wheels'].value_counts()
df['drive-wheels'].value_counts().to_frame() -> to convert to a data frame
Box plots: sns.boxplot(x="body-style", y="price", data=df)
Box plots are a great way to visualize numeric data, since you can visualize the various distributions of the data. The main feature the box plot shows is the median of the data, which represents where the middle data point is.
The upper quartile shows where the 75th percentile is.
The lower quartile shows where the 25th percentile is.
The data between the upper and lower quartile represents the interquartile range.
Next, you have the lower and upper extremes. These are calculated as 1.5 times the interquartile range above the 75th percentile and as 1.5 times the IQR below the 25th percentile.
Finally, box plots also display outliers as individual dots that occur outside the upper and lower extremes. With box plots, you can easily spot outliers and also see the distribution and skewness of the data. Box plots make it easy to compare between groups. In this example, using a box plot we can see the distribution of the different categories of the drive-wheels feature over the price feature. We can see that the distribution of price for rear-wheel drive is distinct from the other categories, but the prices for front-wheel drive and four-wheel drive are almost indistinguishable.
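The quartiles and whisker limits the box plot draws can also be computed by hand (a minimal sketch on the "price" column, just to illustrate the 1.5 x IQR rule):
q1 = df['price'].quantile(0.25)          # lower quartile (25th percentile)
q3 = df['price'].quantile(0.75)          # upper quartile (75th percentile)
iqr = q3 - q1                            # interquartile range
lower_extreme = q1 - 1.5 * iqr           # lower whisker limit
upper_extreme = q3 + 1.5 * iqr           # upper whisker limit
print(q1, q3, iqr, lower_extreme, upper_extreme)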
Scatter plots:
Each observation in the scatter plot is represented as a point. This plot shows the relationship between two variables. The predictor variable is the variable you are using to predict an outcome; in this case, our predictor variable is the engine size. The target variable is the variable you are trying to predict; in this case, our target variable is the price, since this is the outcome.
x=df["engine_size"]
y=df["price"]
plt.scatter(x,y)
plt.title("Scatter plot Engine Size vs. Price)
plt.ylabel("Price")
plt.xlabel("Engine Size")
Continuous numerical variables are variables that may contain any value within some range. Continuous numerical variables can have the type "int64" or "float64". A great way to visualize these variables is by using scatterplots with fitted lines.
In order to start understanding the (linear) relationship between an individual variable and the price, we can use "regplot", which plots the scatterplot plus the fitted regression line for the data.
# Engine size as potential predictor variable of price
sns.regplot(x="engine-size", y="price", data=df)
plt.ylim(0,)
GroupBy in Python:
The groupby method is used on categorical variables; it groups the data into subsets according to the different categories of that variable. You can group by a single variable or by multiple variables by passing in multiple variable names.
Select the columns that we are interested:
df_test = df[['drive-wheels', 'body-style', 'price']]
df_grp = df_test.groupby(['drive-wheels', 'body-style'], as_index=False).mean()
Or: df[['price', 'drive-wheels', 'body-style']].groupby(['body-style', 'drive-wheels'], as_index=False).mean()
df['drive-wheels'].unique() -> to see all the categories of a variable
To see the result in a table with each variable in an axis:
df_pivot = df_grp.pivot(index='drive-wheels', columns='body-style')
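Combinations that do not occur in the data show up as missing cells in the pivoted table; it is common to fill them with 0 before plotting (the heat map code below assumes this filled table, here called grouped_pivot):
grouped_pivot = df_pivot.fillna(0)   # fill missing cells with 0 so the heat map can be drawn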
HeatMaps:
A heat map takes a rectangular grid of data and assigns a color intensity based on the data value at the grid points. It is a great way to plot the target variable over multiple variables, and through this get visual clues about the relationship between these variables and the target.
plt.pcolor(df_pivot, cmap = 'RdBu')
plt.colorbar()
plt.show()
fig, ax = plt.subplots()
im = ax.pcolor(grouped_pivot, cmap='RdBu')
#label names
row_labels = grouped_pivot.columns.levels[1]
col_labels = grouped_pivot.index
#move ticks and labels to the center
ax.set_xticks(np.arange(grouped_pivot.shape[1]) + 0.5, minor=False)
ax.set_yticks(np.arange(grouped_pivot.shape[0]) + 0.5, minor=False)
#insert labels
ax.set_xticklabels(row_labels, minor=False)
ax.set_yticklabels(col_labels, minor=False)
#rotate label if too long
plt.xticks(rotation=90)
fig.colorbar(im)
plt.show()
Correlation:
Correlation is a statistical metric for measuring to what extent different variables are interdependent. In other words, when we look at two variables over time, if one variable changes how does this affect change in the other variable?
sns.regplot(x='engine-size', y='price', data=df)
plt.ylim(0,)
The Pearson Correlation method will give you two values:
the correlation coefficient:
close to one implies a large positive correlation
close to negative one, implies a large negative correlation
value close to zero, implies no correlation between the variables
the p-value: how certain we are about the correlation that we calculated
less than 0.001 gives us a strong certainty about the correlation coefficient that we calculated
between 0.001 and 0.05 gives us moderate certainty
between 0.05 and 0.1 gives us weak certainty
larger than 0.1 gives us no certainty of correlation at all.
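Both values can be obtained with scipy (a minimal sketch, assuming numeric columns such as 'horsepower' and 'price' exist in df):
from scipy import stats
pearson_coef, p_value = stats.pearsonr(df['horsepower'], df['price'])
print("Pearson correlation coefficient:", pearson_coef, " p-value:", p_value)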
ANOVA (Analysis of Variance):
Statistical comparison of groups, for example: the average price of different vehicle makes.
It finds correlation between different groups of a categorical variable.
The test returns:
F-test score: the F-test calculates the ratio of the variation between the group means to the variation within each of the sample groups. ANOVA assumes the means of all groups are the same, calculates how much the actual means deviate from that assumption, and reports it as the F-test score. A larger score means there is a larger difference between the means.
The p-value shows whether the obtained result is statistically significant.
df_anova = df[["make", "price"]]
grouped_anova = df_anova.groupby(["make"])
from scipy import stats
anova_results_1 = stats.f_oneway(grouped_anova.get_group("honda")["price"], grouped_anova.get_group("subaru")["price"])
A large F (>> 1) and a small p (close to 0) indicate a correlation between the categorical variable and the other variable.
MODEL DEVELOPMENT:
LINEAR REGRESSION: (SLR)
Linear regression refers to one independent variable to make a prediction; multiple linear regression refers to multiple independent variables. Simple linear regression (SLR) is a method to help us understand the relationship between two variables: the predictor (independent) variable x and the target (dependent) variable y.
The parameter b_0 is the intercept.
The parameter b_1 is the slope
from sklearn.linear_model import LinearRegression
lm=LinearRegression()
X = df[['highway-mpg']]
Y = df['price']
lm.fit(X, Y) -> get the parameters
Yhat = lm.predict(X) -> get a prediction
b0 (intercept) -> lm.intercept_
b1 (slope) -> lm.coef_
MULTIPLE LINEAR REGRESSION: (MLR)
MLR is used to explain the relationship between one continuous target variable y and two or more predictor variables x. If we have, for example, 4 predictor variables, then b0 is the intercept (the value of Yhat when all x are zero), b1 is the coefficient of x1, b2 the coefficient of x2, and so on. If there are only two predictor variables we can still visualize the fit.
Yhat = b0 + b1*x1 + b2*x2 + ...
Z = df[['horsepower', 'curb-weight', 'engine-size', 'highway-mpg']]
Y = df['price']
lm.fit(Z, Y) -> get the parameters
Yhat = lm.predict(Z) -> get a prediction
b0 (intercept) -> lm.intercept_
b1, b2, ... (coefficients) -> lm.coef_
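With several predictors it helps to pair each coefficient with its column name (a small illustrative sketch, not part of the original lab):
coefficients = dict(zip(Z.columns, lm.coef_))   # map each predictor to its fitted coefficient
print("intercept:", lm.intercept_)
print("coefficients:", coefficients)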
Model Evaluation using Visualization
Now that we've developed some models, how do we evaluate our models and how do we choose the best one? One way to do this is by using visualization.
import the visualization package: seaborn
# import the visualization package: seaborn
import seaborn as sns
%matplotlib inline
Regression Plot
When it comes to simple linear regression, an excellent way to visualize the fit of our model is by using regression plots.
This plot will show a combination of scattered data points (a scatter plot), as well as the fitted linear regression line going through the data. This will give us a reasonable estimate of the relationship between the two variables, the strength of the correlation, as well as the direction (positive or negative correlation).
Let's visualize highway-mpg as a potential predictor variable of price:
width = 12
height = 10
plt.figure(figsize=(width, height))
sns.regplot(x="highway-mpg", y="price", data=df)
plt.ylim(0,)
We can see from this plot that price is negatively correlated to highway-mpg, since the regression slope is negative. One thing to keep in mind when looking at a regression plot is to pay attention to how scattered the data points are around the regression line. This will give you a good indication of the variance of the data, and whether a linear model would be the best fit or not. If the data is too far off from the line, this linear model might not be the best model for this data. Let's compare this plot to the regression plot of "peak-rpm".
plt.figure(figsize=(width, height))
sns.regplot(x="peak-rpm", y="price", data=df)
plt.ylim(0,)
Comparing the regression plot of "peak-rpm" and "highway-mpg", we see that the points for "highway-mpg" are much closer to the generated line and, on average, decrease. The points for "peak-rpm" have more spread around the predicted line, and it is much harder to determine if the points are decreasing or increasing as "peak-rpm" increases.
Residual Plot
A good way to visualize the variance of the data is to use a residual plot.
What is a residual?
The difference between the observed value (y) and the predicted value (Yhat) is called the residual (e). When we look at a regression plot, the residual is the distance from the data point to the fitted regression line.
So what is a residual plot?
A residual plot is a graph that shows the residuals on the vertical y-axis and the independent variable on the horizontal x-axis.
What do we pay attention to when looking at a residual plot?
We look at the spread of the residuals:
- If the points in a residual plot are randomly spread out around the x-axis, then a linear model is appropriate for the data. Why is that? Randomly spread out residuals means that the variance is constant, and thus the linear model is a good fit for this data.
width = 12
height = 10
plt.figure(figsize=(width, height))
sns.residplot(x=df['highway-mpg'], y=df['price'])
plt.show()
What is this plot telling us?
We can see from this residual plot that the residuals are not randomly spread around the x-axis, which leads us to believe that maybe a non-linear model is more appropriate for this data.
Multiple Linear Regression
How do we visualize a model for Multiple Linear Regression? This gets a bit more complicated, because you can't visualize it with a regression or residual plot.
One way to look at the fit of the model is by looking at the distribution plot: We can look at the distribution of the fitted values that result from the model and compare it to the distribution of the actual values.
First, let's make a prediction. The parameter hist=False is used to get a distribution curve instead of a histogram.
Y_hat = lm.predict(Z)
plt.figure(figsize=(width, height))
ax1 = sns.distplot(df['price'], hist=False, color="r", label="Actual Value")
sns.distplot(Y_hat, hist=False, color="b", label="Fitted Values", ax=ax1)
plt.title('Actual vs Fitted Values for Price')
plt.xlabel('Price (in dollars)')
plt.ylabel('Proportion of Cars')
plt.show()
plt.close()
We can see that the fitted values are reasonably close to the actual values, since the two distributions overlap.
Part 3: Polynomial Regression and Pipelines
Polynomial regression is a particular case of the general linear regression model or multiple linear regression models.
We get non-linear relationships by squaring or setting higher-order terms of the predictor variables.
There are different orders of polynomial regression:
Quadratic - 2nd order:
Yhat = a + b1*X + b2*X^2
Cubic - 3rd order:
Yhat = a + b1*X + b2*X^2 + b3*X^3
Higher order:
Yhat = a + b1*X + b2*X^2 + b3*X^3 + ...
We saw earlier that a linear model did not provide the best fit while using highway-mpg as the predictor variable. Let's see if we can try fitting a polynomial model to the data instead.
We will use the following function to plot the data:
def PlotPolly(model, independent_variable, dependent_variable, Name):
    x_new = np.linspace(15, 55, 100)
    y_new = model(x_new)
    plt.plot(independent_variable, dependent_variable, '.', x_new, y_new, '-')
    plt.title('Polynomial Fit with Matplotlib for Price ~ Length')
    ax = plt.gca()
    ax.set_facecolor((0.898, 0.898, 0.898))
    fig = plt.gcf()
    plt.xlabel(Name)
    plt.ylabel('Price of Cars')
    plt.show()
    plt.close()
Let's get the variables:
x = df['highway-mpg']
y = df['price']
Let's fit the polynomial using the function polyfit, then use the function poly1d to display the polynomial function.
# Here we use a polynomial of the 3rd order (cubic)
f = np.polyfit(x, y, 3)
p = np.poly1d(f)
print(p)
-1.557 x^3 + 204.8 x^2 - 8965 x + 1.379e+05
Let's plot the function
PlotPolly(p, x, y, 'highway-mpg')
np.polyfit(x, y, 3)
array([-1.55663829e+00, 2.04754306e+02, -8.96543312e+03, 1.37923594e+05])
We can already see from plotting that this polynomial model performs better than the linear model. This is because the generated polynomial function "hits" more of the data points.
The analytical expression for a multivariate polynomial function gets complicated. For example, the expression for a second-order (degree=2) polynomial with two variables is given by:
Yhat = a + b1*X1 + b2*X2 + b3*X1*X2 + b4*X1^2 + b5*X2^2
We can perform a polynomial transform on multiple features. First, we import the module:
from sklearn.preprocessing import PolynomialFeatures
We create a PolynomialFeatures object of degree 2:
pr=PolynomialFeatures(degree=2)
pr
PolynomialFeatures(degree=2, include_bias=True, interaction_only=False)
Z_pr=pr.fit_transform(Z)
The original data has 201 samples and 4 features.
Z.shape
(201, 4)
After the transformation, there are 201 samples and 15 features.
Z_pr.shape
(201, 15)
Pipeline
Data Pipelines simplify the steps of processing the data. We use the module Pipeline to create a pipeline. We also use StandardScaler as a step in our pipeline.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
We create the pipeline by creating a list of tuples, each containing the name of the step (model or estimator) and its corresponding constructor.
Input=[('scale',StandardScaler()), ('polynomial', PolynomialFeatures(include_bias=False)), ('model',LinearRegression())]
We input the list as an argument to the pipeline constructor:
pipe=Pipeline(Input)
pipe
Pipeline(memory=None,
steps=[('scale', StandardScaler(copy=True, with_mean=True, with_std=True)), ('polynomial', PolynomialFeatures(degree=2, include_bias=False, interaction_only=False)), ('model', LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
normalize=False))])
We can normalize the data, perform a transform and fit the model simultaneously.
pipe.fit(Z,y)
Pipeline(memory=None,
steps=[('scale', StandardScaler(copy=True, with_mean=True, with_std=True)), ('polynomial', PolynomialFeatures(degree=2, include_bias=False, interaction_only=False)), ('model', LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
normalize=False))])
Similarly, we can normalize the data, perform a transform and produce a prediction simultaneously
ypipe=pipe.predict(Z)
ypipe[0:4]
Part 4: Measures for In-Sample Evaluation
When evaluating our models, not only do we want to visualize the results, but we also want a quantitative measure to determine how accurate the model is.
Two very important measures that are often used in Statistics to determine the accuracy of a model are:
R^2 / R-squared
Mean Squared Error (MSE)
R-squared
R squared, also known as the coefficient of determination, is a measure to indicate how close the data is to the fitted regression line.
The value of the R-squared is the percentage of variation of the response variable (y) that is explained by a linear model.
Mean Squared Error (MSE)
The Mean Squared Error measures the average of the squares of errors, that is, the difference between actual value (y) and the estimated value (ŷ).
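Both measures can also be written out directly with numpy (a sketch of the formulas, assuming y holds the actual values and yhat the model predictions):
import numpy as np
mse = np.mean((y - yhat) ** 2)                                            # average squared error
r_squared = 1 - np.sum((y - yhat) ** 2) / np.sum((y - np.mean(y)) ** 2)   # share of variance explained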
Model 1: Simple Linear Regression
Let's calculate the R^2
#highway_mpg_fit
X=df[['highway-mpg']]
lm.fit(X, Y)
# Find the R^2
print('The R-square is: ', lm.score(X, Y))
The R-square is: 0.4965911884339175
We can say that ~49.659% of the variation in price is explained by this simple linear model "highway_mpg_fit".
Let's calculate the MSE
We can predict the output i.e., "yhat" using the predict method, where X is the input variable:
Yhat=lm.predict(X)
print('The output of the first four predicted value is: ', Yhat[0:4])
The output of the first four predicted value is: [16236.50464347 16236.50464347 17058.23802179 13771.3045085 ]
Let's import the function mean_squared_error from the module metrics:
from sklearn.metrics import mean_squared_error
we compare the predicted results with the actual results
mse = mean_squared_error(df['price'], Yhat)
print('The mean square error of price and predicted value is: ', mse)
The mean square error of price and predicted value is: 31635042.944639895
Model 2: Multiple Linear Regression
Let's calculate the R^2
# fit the model
lm.fit(Z, df['price'])
# Find the R^2
print('The R-square is: ', lm.score(Z, df['price']))
The R-square is: 0.8093562806577458
We can say that ~80.94% of the variation in price is explained by this multiple linear regression "multi_fit".
Let's calculate the MSE
we produce a prediction
Y_predict_multifit = lm.predict(Z)
we compare the predicted results with the actual results
print('The mean square error of price and predicted value using multifit is: ', \
mean_squared_error(df['price'], Y_predict_multifit))
The mean square error of price and predicted value using multifit is: 11980366.870726489
Model 3: Polynomial Fit
Let's calculate the R^2
let’s import the function r2_score from the module metrics as we are using a different function
from sklearn.metrics import r2_score
We apply the function to get the value of r^2
r_squared = r2_score(y, p(x))
print('The R-square value is: ', r_squared)
The R-square value is: 0.6741946663906517
We can say that ~ 67.419 % of the variation of price is explained by this polynomial fit
MSE
We can also calculate the MSE:
mean_squared_error(df['price'], p(x))
20474146.426361226
Part 5: Prediction and Decision Making
Prediction
In the previous section, we trained the model using the method fit. Now we will use the method predict to produce a prediction. Let's import pyplot for plotting; we will also be using some functions from numpy.
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
Create a new input
new_input=np.arange(1, 101, 1).reshape(-1, 1)
Fit the model
lm.fit(X, Y)
lm
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
normalize=False)
Produce a prediction
yhat=lm.predict(new_input)
yhat[0:5]
array([37601.57247984, 36779.83910151, 35958.10572319, 35136.37234487,
34314.63896655])
we can plot the data
plt.plot(new_input, yhat)
plt.show()
Decision Making: Determining a Good Model Fit
Now that we have visualized the different models, and generated the R-squared and MSE values for the fits, how do we determine a good model fit?
What is a good R-squared value?
When comparing models, the model with the higher R-squared value is a better fit for the data.
What is a good MSE?
When comparing models, the model with the smallest MSE value is a better fit for the data.
Let's take a look at the values for the different models.
Simple Linear Regression: Using Highway-mpg as a Predictor Variable of Price.
R-squared: 0.49659118843391759
MSE: 3.16 x10^7
Multiple Linear Regression: Using Horsepower, Curb-weight, Engine-size, and Highway-mpg as Predictor Variables of Price.
R-squared: 0.80896354913783497
MSE: 1.2 x10^7
Polynomial Fit: Using Highway-mpg as a Predictor Variable of Price.
R-squared: 0.6741946663906514
MSE: 2.05 x 10^7
Simple Linear Regression model (SLR) vs Multiple Linear Regression model (MLR)
Usually, the more variables you have, the better your model is at predicting, but this is not always true. Sometimes you may not have enough data, you may run into numerical problems, or many of the variables may not be useful or may even act as noise. As a result, you should always check the MSE and R^2.
So to be able to compare the results of the MLR vs SLR models, we look at a combination of both the R-squared and MSE to make the best conclusion about the fit of the model.
MSE: The MSE of SLR is 3.16x10^7, while MLR has an MSE of 1.2x10^7. The MSE of MLR is much smaller.
R-squared: In this case, we can also see that there is a big difference between the R-squared of the SLR and the R-squared of the MLR. The R-squared for the SLR (~0.497) is very small compared to the R-squared for the MLR (~0.809).
This R-squared in combination with the MSE show that MLR seems like the better model fit in this case, compared to SLR.
Simple Linear Model (SLR) vs Polynomial Fit
MSE: We can see that Polynomial Fit brought down the MSE, since this MSE is smaller than the one from the SLR.
R-squared: The R-squared for the Polyfit is larger than the R-squared for the SLR, so the Polynomial Fit also brought up the R-squared quite a bit.
Since the Polynomial Fit resulted in a lower MSE and a higher R-squared, we can conclude that this was a better fit model than the simple linear regression for predicting Price with Highway-mpg as a predictor variable.
Multiple Linear Regression (MLR) vs Polynomial Fit
MSE: The MSE for the MLR is smaller than the MSE for the Polynomial Fit.
R-squared: The R-squared for the MLR is also much larger than for the Polynomial Fit.
Conclusion:
Comparing these three models, we conclude that the MLR model is the best model to be able to predict price from our dataset. This result makes sense, since we have 27 variables in total, and we know that more than one of those variables are potential predictors of the final car price.