@Mengyuz
Created June 5, 2015 15:19
import numpy as np
import pandas
import statsmodels.api as sm

"""
In this question, you need to:
1) implement the linear_regression() procedure
2) Select features (in the predictions procedure) and make predictions.
"""


def linear_regression(features, values):
    """
    Perform linear regression given a data set with an arbitrary number of features.
    This can be the same code as in the lesson #3 exercise.
    """
    ###########################
    ### YOUR CODE GOES HERE ###
    ###########################
    # Prepend a column of ones so the model fits an intercept term
    features = sm.add_constant(features)
    model = sm.OLS(values, features)
    results = model.fit()
    # With NumPy inputs, params[0] is the intercept; the rest are feature coefficients
    intercept = results.params[0]
    params = results.params[1:]
    return intercept, params


def predictions(dataframe):
    '''
    The NYC turnstile data is stored in a pandas dataframe called weather_turnstile.
    Using the information stored in the dataframe, let's predict the ridership of
    the NYC subway using linear regression with gradient descent.

    You can download the complete turnstile weather dataframe here:
    https://www.dropbox.com/s/meyki2wl9xfa7yk/turnstile_data_master_with_weather.csv

    Your prediction should have an R^2 value of 0.40 or better.
    You need to experiment using various input features contained in the dataframe.
    We recommend that you don't use the EXITSn_hourly feature as an input to the
    linear model because we cannot use it as a predictor: we cannot use exit
    counts as a way to predict entry counts.

    Note: Due to the memory and CPU limitation of our Amazon EC2 instance, we will
    give you a random subset (~10%) of the data contained in
    turnstile_data_master_with_weather.csv. You are encouraged to experiment with
    this exercise on your own computer, locally. If you do, you may want to complete
    Exercise 8 using gradient descent, or limit your number of features to 10 or so,
    since ordinary least squares can be very slow for a large number of features.

    If you receive a "server has encountered an error" message, that means you are
    hitting the 30-second limit that's placed on running your program. Try using a
    smaller number of features.
    '''
    # Select Features (try different features!)
    features = dataframe[['rain', 'precipi', 'Hour', 'meantempi']]

    # Add UNIT to features using dummy variables.
    # dtype=float keeps the joined frame numeric on recent pandas versions,
    # where get_dummies returns boolean columns by default.
    dummy_units = pandas.get_dummies(dataframe['UNIT'], prefix='unit', dtype=float)
    features = features.join(dummy_units)

    # Values
    values = dataframe['ENTRIESn_hourly']

    # Get the numpy arrays
    features_array = features.values
    values_array = values.values

    # Perform linear regression (ordinary least squares via statsmodels here,
    # not gradient descent)
    intercept, params = linear_regression(features_array, values_array)

    # Predicted entries = intercept + features . coefficients
    predictions = intercept + np.dot(features_array, params)
    return predictions
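
The docstring asks for an R^2 of 0.40 or better, but the gist never computes it. A minimal sketch of such a check follows, assuming the usual definition R^2 = 1 - SS_res / SS_tot. The helper name compute_r_squared and the toy arrays are illustrative only and are not part of the original exercise code.

```python
import numpy as np


def compute_r_squared(observed, predicted):
    """Return the coefficient of determination, 1 - SS_res / SS_tot.

    Hypothetical helper for illustration; not part of the original gist.
    """
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    # Residual sum of squares: variation the model fails to explain
    ss_res = np.sum((observed - predicted) ** 2)
    # Total sum of squares: variation around the mean of the observations
    ss_tot = np.sum((observed - observed.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


# Toy data (made up) to show the two reference points of the metric
observed = np.array([1.0, 2.0, 3.0, 4.0])
perfect = observed.copy()                             # exact predictions
mean_only = np.full_like(observed, observed.mean())   # always predict the mean

print(compute_r_squared(observed, perfect))    # 1.0
print(compute_r_squared(observed, mean_only))  # 0.0
```

With the turnstile data, the same function could presumably be called as compute_r_squared(dataframe['ENTRIESn_hourly'], predictions(dataframe)) to see whether a given feature selection reaches the 0.40 target.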