@ffund
Created October 19, 2022 23:14
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Train a `LinearRegression` model"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"from datetime import datetime\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.linear_model import LinearRegression\n",
"from sklearn.metrics import r2_score"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this question, we will try to predict the number of bicycle trips across the Fremont Bridge in Seattle, WA using weather, season, and other factors as potential predictive features.\n",
"\n",
"First, we'll load two datasets (a dataset of bike trips and a dataset of weather) from October 2012 through August 2015, indexing each by date:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"bike = pd.read_csv('FremontBridge.csv', index_col='Date', parse_dates=True)\n",
"weather = pd.read_csv('BicycleWeather.csv', index_col='DATE', parse_dates=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can add some code here to inspect the data, see the names of features, and see the data types - the cell below will not be graded."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We will do some further processing on this dataset in preparation for training a model. We will construct a new data frame, with the target variable and the features that we will use."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First, we will prepare the target variable - the value we want to predict - by computing the number of bike trips per day, and saving the result in a new data frame called `daily_bike`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"daily_bike = bike.resample('d').sum()\n",
"daily_bike['Total'] = daily_bike.sum(axis=1)\n",
"daily_bike = daily_bike[['Total']] # remove other columns"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, we'll prepare some features and add them to this `daily_bike` data frame."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First, since bike traffic patterns vary by day of the week, we will add a binary variable corresponding to each day of the week:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"days = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']\n",
"for i in range(7):\n",
"    daily_bike[days[i]] = (daily_bike.index.dayofweek == i).astype(float)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Another relevant feature might be the number of daylight hours. The following cell uses a standard astronomical calculation to add this information to `daily_bike`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def hours_of_daylight(date, axis=23.44, latitude=47.61):\n",
"    \"\"\"Compute the hours of daylight for the given date\"\"\"\n",
"    days = (date - datetime(2000, 12, 21)).days\n",
"    m = (1. - np.tan(np.radians(latitude))\n",
"         * np.tan(np.radians(axis) * np.cos(days * 2 * np.pi / 365.25)))\n",
"    return 24. * np.degrees(np.arccos(1 - np.clip(m, 0, 2))) / 180.\n",
"\n",
"daily_bike['daylight_hrs'] = list(map(hours_of_daylight, daily_bike.index))\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we'll add some weather features!\n",
"\n",
"* average temperature (in degrees Celsius)\n",
"* total precipitation (in inches)\n",
"* a binary variable indicating whether the day is dry (zero precipitation) or not"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# temperatures are given in 1/10 deg C, so this converts to deg C\n",
"weather['TMIN'] /= 10\n",
"weather['TMAX'] /= 10\n",
"weather['temp_c'] = 0.5 * (weather['TMIN'] + weather['TMAX'])\n",
"\n",
"# precipitation is given in 1/10 mm, this converts to inches\n",
"weather['precip_in'] = weather['PRCP']/254\n",
"\n",
"# this indicates whether the day is dry - zero precipitation - or not\n",
"weather['dry_day'] = (weather['precip_in'] == 0).astype(int)\n",
"\n",
"daily_bike = daily_bike.join(weather[['precip_in', 'temp_c', 'dry_day']])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, we'll add a counter that increases every day, expressed in years since the first day in the data. This will help us see if there is a general trend over time, e.g. if bike riding is becoming more or less popular over the duration of this data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"daily_bike['counter'] = (daily_bike.index - daily_bike.index[0]).days / 365."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can add some code here to inspect `daily_bike` - the cell below will not be graded."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"daily_bike.describe()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now it's time to split this `daily_bike` data into training and test sets, using `train_test_split`! Reserve the last 30% of the data for testing, and use the first 70% for training.\n",
"\n",
"The following cell should create `Xtr` and `Xts` as pandas data frames including only the features, and `ytr` and `yts` as 1d numpy arrays containing the target variable.\n",
"\n",
"(Note that this is time series data, and it is ordered by time. It would not be appropriate to shuffle this data. By default, `train_test_split` shuffles the rows unless `shuffle=False` is passed.)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#grade (write your code in this cell and DO NOT DELETE THIS LINE)\n",
"\n",
"# Xtr = ...\n",
"# Xts = ...\n",
"# ytr = ...\n",
"# yts = ...\n",
"\n",
"# Make sure target variable is a numpy array in the shape that the auto-grader expects (1d)\n",
"ytr = np.array(ytr).reshape(-1, )\n",
"yts = np.array(yts).reshape(-1, )"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we are ready to fit the `LinearRegression` model. Using the default settings, fit the model on the training data. Then, use it to make predictions for the test samples, and save these predictions in `yts_hat`. Evaluate the R2 score of the model on the test data, and save it in `rsq`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#grade (write your code in this cell and DO NOT DELETE THIS LINE)\n",
"\n",
"# yts_hat = ...\n",
"# rsq = ..."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To evaluate your work, compare the actual and predicted values visually:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import matplotlib.pyplot as plt\n",
"\n",
"_ = plt.plot(Xts.index, yts, label=\"Actual\")\n",
"_ = plt.plot(Xts.index, yts_hat, label=\"Predicted\")\n",
"_ = plt.legend()\n",
"_ = plt.xticks(rotation = 90) \n",
"_ = plt.ylabel(\"Total bike trips\")\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3.8.10 64-bit",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.10"
},
"vscode": {
"interpreter": {
"hash": "e7370f93d1d0cde622a1f8e1c04877d8463912d04d973331ad4851f04de6915a"
}
}
},
"nbformat": 4,
"nbformat_minor": 4
}