Introduction to Linear Modeling in Python

Lecturer

Jason Vestuto, Data Scientist, University of Texas at Austin

Exploring Linear Trends

We start the course with an initial exploration of linear relationships, including some motivating examples of how linear models are used, and demonstrations of data visualization methods from matplotlib. We then use descriptive statistics to quantify the shape of our data and use correlation to quantify the strength of linear relationships between two variables.

  • Models are used to make predictions for future performance (extrapolation)
  • Models can interpolate the data
  • Will use "matplotlib.pyplot" package to visualize the models
  • Linear models use the slope-intercept form of the linear equation
    • slope: dy / dx (rise-over-run)
    • intercept: when x=0, y=?
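  • A minimal sketch (with made-up data and parameter values) of visualizing data together with a slope-intercept model using matplotlib.pyplot:
import numpy as np
import matplotlib.pyplot as plt

# Made-up data for illustration only
x_data = np.array([0, 1, 2, 3, 4, 5])
y_data = np.array([0.1, 0.9, 2.2, 2.8, 4.1, 5.2])

# Slope-intercept form: y = a0 + a1*x
a0, a1 = 0.0, 1.0
x_model = np.linspace(0, 5, 100)
y_model = a0 + a1 * x_model

plt.plot(x_data, y_data, 'o', label='data')
plt.plot(x_model, y_model, '-', label='model')
plt.xlabel('x')
plt.ylabel('y')
plt.legend()
plt.show()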
  • Review of Single Variable Statistics
Name                 Formula
Mean                 sum(x) / len(x)
Deviation            x - mean(x)
Variance             np.mean(dx * dx)
Standard Deviation   np.sqrt(variance)
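  • For example, a short numpy sketch (made-up sample) that follows the formulas in the table above:
import numpy as np

x = np.array([1.0, 2.0, 4.0, 7.0])   # made-up sample

mean = np.sum(x) / len(x)            # same as np.mean(x)
dx = x - mean                        # deviations from the mean
variance = np.mean(dx * dx)          # same as np.var(x)
std_dev = np.sqrt(variance)          # same as np.std(x)

print(mean, variance, std_dev)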
  • Quantifying Linear Relationships
    • Covariance
      • Deviation of two variables: dx, dy
      • Co-vary means to vary together: deviation_products = dx * dy
      • Covariance as the mean: covariance = np.mean(dx*dy)
    • Correlation
      • divide deviations by standard deviation:
        • zx = dx/np.std(x)
        • zy = dy/np.std(y)
      • mean of the normalized deviations: correlation = np.mean(zx*zy)
    • Normalization helps to compare (correlate) signals of different magnitudes
    • Correlation has a magnitude (strength of the linear relationship) and a direction: positive, negative, or zero
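  • A minimal sketch (made-up arrays) of computing covariance and correlation exactly as defined above, with np.corrcoef used only as a cross-check:
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])      # made-up data
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

dx = x - np.mean(x)
dy = y - np.mean(y)

covariance = np.mean(dx * dy)                 # mean of the deviation products

zx = dx / np.std(x)                           # normalized deviations
zy = dy / np.std(y)
correlation = np.mean(zx * zy)                # between -1 and +1

print(covariance, correlation)
print(np.corrcoef(x, y)[0, 1])                # cross-check with numpy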

Building Linear Models

Here we look at the parts that go into building a linear model. Using the concept of a Taylor Series, we focus on the parameters slope and intercept, how they define the model, and how to interpret them in several applied contexts. We apply a variety of Python modules to find the model that best fits the data, by computing the optimal values of slope and intercept, using least-squares, numpy, statsmodels, and scikit-learn.

  • Taylor Series
    • Approximates any curve
    • Polynomial form: y = a0 + a1*x + a2*x^2 + a3*x^3 + ... + an*x^n
    • often, first order is enough: y = a0 + a1*x
  • First order with a1=0: flat line
  • First order with a1=1: diagonal line
  • Second order with a2=1: parabolic (quadratic) curve, not a straight line
  • Overfitting the model causes problems -- predictions beyond the training data can be significantly incorrect (see the sketch below)
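  • A small illustrative sketch (made-up data, not from the course) of that overfitting problem, comparing a first-order fit and a fourth-order fit when extrapolating beyond the training data; np.polyfit and np.polyval are used only for convenience:
import numpy as np

# Made-up, roughly linear training data
x_train = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_train = np.array([0.2, 1.1, 1.9, 3.2, 3.9])

# First-order (linear) fit and a fourth-order fit (as many parameters as points)
linear_coefs = np.polyfit(x_train, y_train, deg=1)
wiggly_coefs = np.polyfit(x_train, y_train, deg=4)

# Extrapolate well beyond the training domain
x_new = 8.0
print(np.polyval(linear_coefs, x_new))   # stays close to the linear trend
print(np.polyval(wiggly_coefs, x_new))   # typically far from the trend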
  • Interpreting Slope and Intercept
    • y = a0 + a1*x
    • x = independent variable, e.g. time
    • y = dependent variable, e.g. distance traveled
    • slope = average rate of change of y with respect to x; for time vs. distance traveled, the slope is the average speed
  • Residuals, squared, variations
>>> residuals = y_model - y_data
>>> len(residuals) == len(y_data)
  True
>>> print( np.sum(residuals) )
  0.0
>>> residuals_squared = np.square(y_model - y_data)
>>> print( np.sum(residuals_squared) )
  65.1
>>> RSS = np.sum(residuals_squared)
>>> print( RSS )
  65.1
  • Least-Squares Optimization
    • Minima of RSS
    • a1 = covariance(x, y) / variance(x)
    • a0 = mean(y) − a1 * mean(x)
  • Optimized by Numpy
>>> x_mean = np.sum(x) / len(x)
>>> y_mean = np.sum(y) / len(y)
>>> x_dev = x - x_mean
>>> y_dev = y - y_mean
>>> a1 = np.sum( x_dev * y_dev ) / np.sum( x_dev**2 )
>>> a0 = y_mean - (a1 * x_mean)
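  • As a cross-check (not part of the course code), numpy's np.polyfit returns the same least-squares coefficients for a first-order fit, highest power first, using the same x and y arrays as above:
>>> a1_check, a0_check = np.polyfit(x, y, deg=1)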
  • Optimized by Scipy
>>> from scipy import optimize
>>> x_data, y_data = load_data()
>>> def model_func(x, a0, a1):
...     return a0 + (a1 * x)
>>> param_opt, param_cov = optimize.curve_fit(model_func, x_data, y_data)
>>> a0 = param_opt[0] # a0 is the intercept in y = a0 + a1*x
>>> a1 = param_opt[1] # a1 is the slope in y = a0 + a1*x
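  • As a side note (standard scipy/numpy usage, not in the original notes, and assuming numpy is imported as np), the covariance matrix returned by curve_fit gives rough one-standard-deviation uncertainties for the fitted parameters:
>>> param_err = np.sqrt(np.diag(param_cov))   # [uncertainty of a0, uncertainty of a1]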
  • Optimized by Statsmodels
>>> import pandas as pd
>>> from statsmodels.formula.api import ols
>>> x_data, y_data = load_data()
>>> df = pd.DataFrame(dict(x_name=x_data, y_name=y_data))
>>> model_fit = ols(formula="y_name ~ x_name", data=df).fit()
>>> y_model = model_fit.predict(df)
>>> x_model = x_data
>>> a0 = model_fit.params['Intercept']
>>> a1 = model_fit.params['x_name']
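  • Two standard extras on the statsmodels results object that later sections rely on: parameter standard errors via .bse, and predictions at new x values by passing a DataFrame with the same column name (the values 10 and 11 are made up):
>>> e0, e1 = model_fit.bse['Intercept'], model_fit.bse['x_name']
>>> y_new = model_fit.predict(pd.DataFrame({'x_name': [10.0, 11.0]}))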

Making Model Predictions

Next we will apply models to real data and make predictions. We will explore some of the most common pitfalls and limitations of predictions, and we will evaluate and compare models by quantifying and contrasting several measures of goodness-of-fit, including RMSE and R-squared.

  • Optimized by Scikit-Learn
from sklearn.linear_model import LinearRegression
 
# Initialize a general model
model = LinearRegression(fit_intercept=True)

# Load and shape the data
x_raw, y_raw = load_data()
x_data = x_raw.reshape(len(x_raw), 1)
y_data = y_raw.reshape(len(y_raw), 1)

# Fit the model to the data
model_fit = model.fit(x_data, y_data)
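  • A short follow-up sketch (standard LinearRegression attributes; the grid of new x values is made up) for reading off the fitted intercept and slope and predicting on new inputs:
import numpy as np

# Fitted intercept (a0) and slope (a1); indexed because y_data is 2-D
a0 = model.intercept_[0]
a1 = model.coef_[0, 0]

# Predict on a grid of new x values (must be 2-D, like x_data)
x_fit = np.linspace(np.min(x_data), np.max(x_data), 100).reshape(-1, 1)
y_fit = model.predict(x_fit)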
  • Predictions, Parameters, Uncertainty
# Parameter estimates from the statsmodels fit (the x column is named 'times' here)
a0 = model_fit.params['Intercept']
a1 = model_fit.params['times']

# Standard errors of the parameter estimates (their uncertainties)
e0 = model_fit.bse['Intercept']
e1 = model_fit.bse['times']
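  • One possible sketch (not the course's exact code) of turning these standard errors into a rough uncertainty band around the trend line, assuming numpy is imported as np and x_data holds the time values:
# Central trend line
x_fit = np.linspace(np.min(x_data), np.max(x_data), 100)
y_fit = a0 + a1 * x_fit

# Rough +/- 1 standard-error band (illustrative only)
y_hi = (a0 + e0) + (a1 + e1) * x_fit
y_lo = (a0 - e0) + (a1 - e1) * x_fit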
  • Limits of prediction, Domain of Validity
  • Goodness of Fit -- 3 different R's
    • Building Models
      • RSS
    • Evaluating Models
      • RMSE
      • R-squared
residuals = y_model - y_data
RSS = np.sum( np.square(residuals) )

mean_squared_residuals = np.sum( np.square(residuals) ) / len(residuals) 

MSE = np.mean( np.square(residuals) )

RMSE = np.sqrt(np.mean( np.square(residuals)))
RMSE = np.std(residuals)   # equivalent only when the residuals have zero mean (true for a least-squares fit with an intercept)

# Deviations of the data from its mean
deviations = y_data - np.mean(y_data)
VAR = np.sum(np.square(deviations))

# R-squared
r_squared = 1 - (RSS / VAR)
r = correlation(y_data, y_model)   # correlation() as defined earlier: np.mean(zx * zy)
  • RMSE vs R-Squared
    • RMSE -- how much variation remains in the residuals (in the units of y)
    • R-squared -- what fraction of the total variation is explained by the linear model
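  • A compact end-to-end sketch (made-up data) contrasting the two measures: RMSE is in the units of y, while R-squared is a dimensionless fraction; for a least-squares linear fit, R-squared also equals the squared correlation between the data and the model values:
import numpy as np

# Made-up data and a first-order model fit
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0.3, 1.2, 1.8, 3.3, 3.8, 5.1])
a1, a0 = np.polyfit(x, y, deg=1)
y_model = a0 + a1 * x

residuals = y_model - y
RMSE = np.sqrt(np.mean(np.square(residuals)))

deviations = y - np.mean(y)
r_squared = 1 - np.sum(np.square(residuals)) / np.sum(np.square(deviations))

print(RMSE)                               # spread of the residuals, in units of y
print(r_squared)                          # fraction of variation explained
print(np.corrcoef(y, y_model)[0, 1]**2)   # equals r_squared for a linear fit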