Skip to content

Instantly share code, notes, and snippets.

@gumdropsteve
Last active December 18, 2019 00:39
Show Gist options
  • Save gumdropsteve/47e1ed4293c1ada257c896b5e7fa70c0 to your computer and use it in GitHub Desktop.
Save gumdropsteve/47e1ed4293c1ada257c896b5e7fa70c0 to your computer and use it in GitHub Desktop.
cuML_Taxi_Fare_Prediction.ipynb
Display the source blob
Display the rendered blob
Raw
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "view-in-github"
},
"source": [
"<a href=\"https://colab.research.google.com/gist/gumdropsteve/6eddc72fe8a33a3f1d0421b7c35f07e8/blazingsql_cuml_taxi_fare_prediction.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "l4fOFMjbRvkZ"
},
"source": [
"# BlazingSQL + cuML NYC Taxi Cab Fare Prediction\n",
"\n",
"This demo uses pubically availible [NYC Taxi Cab Data](https://www.kaggle.com/c/new-york-city-taxi-fare-prediction) to predict the total fare of a taxi ride in New York City given the pickup and dropoff locations. \n",
"\n",
"In this notebook, we will cover: \n",
"- How to read & query CSV files BlazingSQL.\n",
"- How to implement a linear regression model with cuML.\n",
"\n",
"![Impression](https://www.google-analytics.com/collect?v=1&tid=UA-39814657-5&cid=555&t=event&ec=guides&ea=taxi_fare_prediction&dt=taxi_fare_prediction)\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "RnUyVHwHmKyk"
},
"source": [
"## Import packages "
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [],
"source": [
"from cuml import LinearRegression\n",
"from blazingsql import BlazingContext"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create BlazingContext\n",
"You can think of the BlazingContext much like a Spark Context (i.e. where information such as FileSystems you have registered and Tables you have created will be stored). If you have issues running this cell, restart runtime and try running it again."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "St0Yc3O7zW3e",
"outputId": "7c9db9e2-c652-4aae-e153-c413d26202db"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"lo\n",
"BlazingContext ready\n"
]
}
],
"source": [
"bc = BlazingContext()"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "Gt0TPBqif50q"
},
"source": [
"### Download Data\n",
"For this demo we will train our model with 25,000,000 rows of data from 5 csv files (5,000,000 rows each). \n",
"\n",
"The cell below will download them from AWS for you."
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "6uR3aNkqD1Pb"
},
"outputs": [],
"source": [
"!wget https://blazingsql-colab.s3.amazonaws.com/taxi_data/taxi_00.csv\n",
"!wget https://blazingsql-colab.s3.amazonaws.com/taxi_data/taxi_01.csv\n",
"!wget https://blazingsql-colab.s3.amazonaws.com/taxi_data/taxi_02.csv\n",
"!wget https://blazingsql-colab.s3.amazonaws.com/taxi_data/taxi_03.csv\n",
"!wget https://blazingsql-colab.s3.amazonaws.com/taxi_data/taxi_04.csv "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"BlazingSQL makes use of the full file path when generating tables, the cell below finds the path to this notebook, and then adds a wildcard (*) so you can create a table from all 5 files at once."
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'/home/winston/taxi*'"
]
},
"execution_count": 38,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# identify file path to wherever this folder\n",
"path = !pwd\n",
"\n",
"# extract string of path from SList \n",
"path = path[0] + '/'\n",
"\n",
"# add taxi wildcard to filepath \n",
"taxi_path = path + 'taxi*'\n",
"\n",
"# what's it look like?\n",
"taxi_path"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Extract, transform, load\n",
"In order to train our Linear Regression model, we must first preform ETL so to prepare our data.\n",
"\n",
"### ETL: Read and Join CSVs"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 11.3 ms, sys: 1.24 ms, total: 12.5 ms\n",
"Wall time: 9.9 ms\n"
]
},
{
"data": {
"text/plain": [
"<pyblazing.apiv2.context.BlazingTable at 0x7efdb1cd7198>"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%time\n",
"# tag column names and types\n",
"col_names = ['key', 'fare_amount', 'pickup_longitude', 'pickup_latitude', \n",
" 'dropoff_longitude', 'dropoff_latitude', 'passenger_count']\n",
"col_types = ['date64', 'float32', 'float32', 'float32',\n",
" 'float32', 'float32', 'float32']\n",
"\n",
"# create a table from all 5 taxi files at once\n",
"bc.create_table('taxi', taxi_path, names=col_names, dtype=col_types)"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"30334\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>key</th>\n",
" <th>fare_amount</th>\n",
" <th>pickup_longitude</th>\n",
" <th>pickup_latitude</th>\n",
" <th>dropoff_longitude</th>\n",
" <th>dropoff_latitude</th>\n",
" <th>passenger_count</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>24999995</th>\n",
" <td>2011-02-24 16:06:26.001</td>\n",
" <td>6.9</td>\n",
" <td>-73.966537</td>\n",
" <td>40.804974</td>\n",
" <td>-73.949043</td>\n",
" <td>40.804226</td>\n",
" <td>2.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>24999996</th>\n",
" <td>2009-09-22 19:20:22.009</td>\n",
" <td>9.7</td>\n",
" <td>-73.980064</td>\n",
" <td>40.752533</td>\n",
" <td>-74.006432</td>\n",
" <td>40.739613</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>24999997</th>\n",
" <td>2012-04-19 02:17:32.001</td>\n",
" <td>14.1</td>\n",
" <td>-73.998512</td>\n",
" <td>40.745308</td>\n",
" <td>-73.953186</td>\n",
" <td>40.799362</td>\n",
" <td>2.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>24999998</th>\n",
" <td>2012-06-08 11:09:47.006</td>\n",
" <td>3.3</td>\n",
" <td>-73.953636</td>\n",
" <td>40.778801</td>\n",
" <td>-73.946068</td>\n",
" <td>40.775555</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>24999999</th>\n",
" <td>2009-06-21 11:07:00.036</td>\n",
" <td>6.5</td>\n",
" <td>-73.981583</td>\n",
" <td>40.772572</td>\n",
" <td>-73.963326</td>\n",
" <td>40.762135</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" key fare_amount pickup_longitude \\\n",
"24999995 2011-02-24 16:06:26.001 6.9 -73.966537 \n",
"24999996 2009-09-22 19:20:22.009 9.7 -73.980064 \n",
"24999997 2012-04-19 02:17:32.001 14.1 -73.998512 \n",
"24999998 2012-06-08 11:09:47.006 3.3 -73.953636 \n",
"24999999 2009-06-21 11:07:00.036 6.5 -73.981583 \n",
"\n",
" pickup_latitude dropoff_longitude dropoff_latitude \\\n",
"24999995 40.804974 -73.949043 40.804226 \n",
"24999996 40.752533 -74.006432 40.739613 \n",
"24999997 40.745308 -73.953186 40.799362 \n",
"24999998 40.778801 -73.946068 40.775555 \n",
"24999999 40.772572 -73.963326 40.762135 \n",
"\n",
" passenger_count \n",
"24999995 2.0 \n",
"24999996 1.0 \n",
"24999997 2.0 \n",
"24999998 1.0 \n",
"24999999 1.0 "
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# query the whole table and display the tail of the results \n",
"bc.sql(\"select * from taxi\").tail() # note: BlazingSQL queries return cuDF DataFrame results "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### ETL: Query Table for Training Data"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"30334\n",
"CPU times: user 2.64 s, sys: 2.11 s, total: 4.75 s\n",
"Wall time: 2.68 s\n"
]
}
],
"source": [
"%%time\n",
"# extract time columns, long & lat, # riders (all floats)\n",
"query = '''\n",
" select \n",
" cast(hour(key) as float) hours, \n",
" cast(dayofmonth(key) as float) days,\n",
" cast(month(key) as float) months, \n",
" cast(year(key) - 2000 as float) years, \n",
" cast(dropoff_longitude - pickup_longitude as float) longitude_distance, \n",
" cast(dropoff_latitude - pickup_latitude as float) latitude_distance, \n",
" cast(passenger_count as float) passenger_count\n",
" from \n",
" taxi\n",
" '''\n",
"\n",
"# run query on table (returns cuDF DataFrame)\n",
"X_train = bc.sql(query)"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>hours</th>\n",
" <th>days</th>\n",
" <th>months</th>\n",
" <th>years</th>\n",
" <th>longitude_distance</th>\n",
" <th>latitude_distance</th>\n",
" <th>passenger_count</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>3.0</td>\n",
" <td>12.0</td>\n",
" <td>4.0</td>\n",
" <td>9.0</td>\n",
" <td>0.051445</td>\n",
" <td>0.050167</td>\n",
" <td>3.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>16.0</td>\n",
" <td>28.0</td>\n",
" <td>1.0</td>\n",
" <td>10.0</td>\n",
" <td>0.038200</td>\n",
" <td>0.056992</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>20.0</td>\n",
" <td>15.0</td>\n",
" <td>8.0</td>\n",
" <td>10.0</td>\n",
" <td>0.080971</td>\n",
" <td>-0.006474</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>10.0</td>\n",
" <td>6.0</td>\n",
" <td>2.0</td>\n",
" <td>10.0</td>\n",
" <td>-0.011604</td>\n",
" <td>-0.009991</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>20.0</td>\n",
" <td>22.0</td>\n",
" <td>4.0</td>\n",
" <td>10.0</td>\n",
" <td>0.008423</td>\n",
" <td>-0.009270</td>\n",
" <td>3.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" hours days months years longitude_distance latitude_distance \\\n",
"0 3.0 12.0 4.0 9.0 0.051445 0.050167 \n",
"1 16.0 28.0 1.0 10.0 0.038200 0.056992 \n",
"2 20.0 15.0 8.0 10.0 0.080971 -0.006474 \n",
"3 10.0 6.0 2.0 10.0 -0.011604 -0.009991 \n",
"4 20.0 22.0 4.0 10.0 0.008423 -0.009270 \n",
"\n",
" passenger_count \n",
"0 3.0 \n",
"1 1.0 \n",
"2 1.0 \n",
"3 1.0 \n",
"4 3.0 "
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# fill null values from remaining origional columns\n",
"X_train['longitude_distance'] = X_train['longitude_distance'].fillna(0)\n",
"X_train['latitude_distance'] = X_train['latitude_distance'].fillna(0)\n",
"X_train['passenger_count'] = X_train['passenger_count'].fillna(0)\n",
"\n",
"# how's it look?\n",
"X_train.head()"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "m97O0c5HzW39",
"outputId": "cc6e484c-a297-4611-f6b5-04db48d31a3e"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"30334\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>fare_amount</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>14.6</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>16.9</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>5.7</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>4.9</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>6.9</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" fare_amount\n",
"0 14.6\n",
"1 16.9\n",
"2 5.7\n",
"3 4.9\n",
"4 6.9"
]
},
"execution_count": 41,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# query dependent variable y\n",
"y_train = bc.sql(\"select fare_amount from taxi\")\n",
"\n",
"y_train.head()"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "-J3ZzyD6iwQy"
},
"source": [
"## Linear Regression\n",
"### LR: Train Model"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 531
},
"colab_type": "code",
"id": "tVUZvT9TB6Ii",
"outputId": "d61c0249-47ee-40b8-a72f-9d62383f23dd"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Coefficients:\n",
"0 -0.027069\n",
"1 0.003295\n",
"2 0.107199\n",
"3 0.636710\n",
"4 0.000932\n",
"5 -0.000494\n",
"6 0.092028\n",
"dtype: float32\n",
"\n",
"Y intercept:\n",
"3.3607497215270996\n",
"\n",
"CPU times: user 367 ms, sys: 128 ms, total: 495 ms\n",
"Wall time: 794 ms\n"
]
}
],
"source": [
"%%time\n",
"# create model\n",
"lr = LinearRegression()\n",
"\n",
"# train model on 25,000,000 rows of data\n",
"reg = lr.fit(X_train, y_train)\n",
"\n",
"# display results\n",
"print(f\"Coefficients:\\n{reg.coef_}\\n\")\n",
"print(f\"Y intercept:\\n{reg.intercept_}\\n\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "pHtni9xcl-ht"
},
"source": [
"### LR: Use Model to Predict Future Taxi Fares \n",
"\n",
"For this we are using a second dataset with no fare amount. The cell below will download this dataset for you."
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "C8wND8RbzW4F",
"outputId": "59797d96-021b-4938-c8b6-cec4d7b9056b"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"--2019-12-17 23:28:58-- https://blazingsql-demos.s3-us-west-1.amazonaws.com/test.csv\n",
"Resolving blazingsql-demos.s3-us-west-1.amazonaws.com (blazingsql-demos.s3-us-west-1.amazonaws.com)... 52.219.24.36\n",
"Connecting to blazingsql-demos.s3-us-west-1.amazonaws.com (blazingsql-demos.s3-us-west-1.amazonaws.com)|52.219.24.36|:443... connected.\n",
"HTTP request sent, awaiting response... 200 OK\n",
"Length: 982916 (960K) [text/csv]\n",
"Saving to: ‘test.csv’\n",
"\n",
"test.csv 100%[===================>] 959.88K 4.42MB/s in 0.2s \n",
"\n",
"2019-12-17 23:28:59 (4.42 MB/s) - ‘test.csv’ saved [982916/982916]\n",
"\n"
]
}
],
"source": [
"# download test data\n",
"!wget 'https://blazingsql-demos.s3-us-west-1.amazonaws.com/test.csv'"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "yRM5PosNiuGh",
"outputId": "a2b291d2-edce-4701-931e-c80ecf263baf"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 11.7 ms, sys: 194 µs, total: 11.9 ms\n",
"Wall time: 9.5 ms\n"
]
},
{
"data": {
"text/plain": [
"<pyblazing.apiv2.context.BlazingTable at 0x7efdb1f36e80>"
]
},
"execution_count": 45,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%time\n",
"# set column names and types\n",
"column_names = ['key', 'fare_amount', 'pickup_longitude', 'pickup_latitude', \n",
" 'dropoff_longitude', 'dropoff_latitude', 'passenger_count']\n",
"column_types = ['date64', 'float32', 'float32', 'float32', 'float32', 'float32', 'float32']\n",
"\n",
"# create test table \n",
"bc.create_table('test', path+'test.csv', names=column_names, dtype=column_types)"
]
},
{
"cell_type": "code",
"execution_count": 70,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "g4I8AJ51dpW5",
"outputId": "6213681d-1f9f-4aeb-c195-125804c5d064"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"30334\n",
"CPU times: user 30 ms, sys: 11.9 ms, total: 42 ms\n",
"Wall time: 28 ms\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>hours</th>\n",
" <th>days</th>\n",
" <th>months</th>\n",
" <th>years</th>\n",
" <th>longitude_distance</th>\n",
" <th>latitude_distance</th>\n",
" <th>passenger_count</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>13.0</td>\n",
" <td>27.0</td>\n",
" <td>1.0</td>\n",
" <td>15.0</td>\n",
" <td>-0.008110</td>\n",
" <td>-0.019970</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>13.0</td>\n",
" <td>27.0</td>\n",
" <td>1.0</td>\n",
" <td>15.0</td>\n",
" <td>-0.012024</td>\n",
" <td>0.019814</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>11.0</td>\n",
" <td>8.0</td>\n",
" <td>10.0</td>\n",
" <td>11.0</td>\n",
" <td>0.002869</td>\n",
" <td>-0.005119</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>21.0</td>\n",
" <td>1.0</td>\n",
" <td>12.0</td>\n",
" <td>12.0</td>\n",
" <td>-0.009277</td>\n",
" <td>-0.016178</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>21.0</td>\n",
" <td>1.0</td>\n",
" <td>12.0</td>\n",
" <td>12.0</td>\n",
" <td>-0.022537</td>\n",
" <td>-0.045345</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" hours days months years longitude_distance latitude_distance \\\n",
"0 13.0 27.0 1.0 15.0 -0.008110 -0.019970 \n",
"1 13.0 27.0 1.0 15.0 -0.012024 0.019814 \n",
"2 11.0 8.0 10.0 11.0 0.002869 -0.005119 \n",
"3 21.0 1.0 12.0 12.0 -0.009277 -0.016178 \n",
"4 21.0 1.0 12.0 12.0 -0.022537 -0.045345 \n",
"\n",
" passenger_count \n",
"0 1.0 \n",
"1 1.0 \n",
"2 1.0 \n",
"3 1.0 \n",
"4 1.0 "
]
},
"execution_count": 70,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%time\n",
"# extract time columns, long & lat, # riders (all floats)\n",
"query = '''\n",
" select \n",
" cast(hour(key) as float) hours, \n",
" cast(dayofmonth(key) as float) days,\n",
" cast(month(key) as float) months, \n",
" cast(year(key) - 2000 as float) years, \n",
" cast(dropoff_longitude - pickup_longitude as float) longitude_distance, \n",
" cast(dropoff_latitude - pickup_latitude as float) latitude_distance, \n",
" cast(passenger_count as float) passenger_count\n",
" from \n",
" test\n",
" '''\n",
"\n",
"# run query on table (returns cuDF DataFrame)\n",
"X_test = bc.sql(query)\n",
"\n",
"# fill null values \n",
"X_test['longitude_distance'] = X_test['longitude_distance'].fillna(0)\n",
"X_test['latitude_distance'] = X_test['latitude_distance'].fillna(0)\n",
"X_test['passenger_count'] = X_test['passenger_count'].fillna(0)\n",
"\n",
"# how's it look? \n",
"X_test.head()"
]
},
{
"cell_type": "code",
"execution_count": 72,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "zCft6P5QkepN",
"outputId": "7e954f1a-9e1a-49a3-9e2a-53e410e9e230"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 3 µs, sys: 0 ns, total: 3 µs\n",
"Wall time: 5.72 µs\n"
]
},
{
"data": {
"text/plain": [
"0 12.847698\n",
"1 12.847675\n",
"2 11.257184\n",
"3 11.814528\n",
"4 11.814531\n",
"5 11.814524\n",
"6 11.223511\n",
"dtype: float32"
]
},
"execution_count": 72,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%time\n",
"# predict fares \n",
"predictions = lr.predict(X_test)\n",
"\n",
"# display first 7 predictions\n",
"predictions.head(7)"
]
},
{
"cell_type": "code",
"execution_count": 73,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "GdjUjJ42l2BI",
"outputId": "cc8898b4-7289-4a77-baec-5b92168570fc"
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>hours</th>\n",
" <th>days</th>\n",
" <th>months</th>\n",
" <th>years</th>\n",
" <th>longitude_distance</th>\n",
" <th>latitude_distance</th>\n",
" <th>passenger_count</th>\n",
" <th>predicted_fare</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>2082</th>\n",
" <td>22.0</td>\n",
" <td>13.0</td>\n",
" <td>12.0</td>\n",
" <td>11.0</td>\n",
" <td>0.012161</td>\n",
" <td>0.018772</td>\n",
" <td>1.0</td>\n",
" <td>11.190294</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2152</th>\n",
" <td>12.0</td>\n",
" <td>7.0</td>\n",
" <td>6.0</td>\n",
" <td>11.0</td>\n",
" <td>0.032173</td>\n",
" <td>0.019627</td>\n",
" <td>1.0</td>\n",
" <td>10.798038</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2682</th>\n",
" <td>22.0</td>\n",
" <td>25.0</td>\n",
" <td>1.0</td>\n",
" <td>12.0</td>\n",
" <td>-0.009979</td>\n",
" <td>-0.051266</td>\n",
" <td>1.0</td>\n",
" <td>10.687368</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6094</th>\n",
" <td>15.0</td>\n",
" <td>9.0</td>\n",
" <td>10.0</td>\n",
" <td>10.0</td>\n",
" <td>-0.005768</td>\n",
" <td>0.000046</td>\n",
" <td>1.0</td>\n",
" <td>10.515482</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2227</th>\n",
" <td>18.0</td>\n",
" <td>17.0</td>\n",
" <td>10.0</td>\n",
" <td>11.0</td>\n",
" <td>0.027267</td>\n",
" <td>0.015953</td>\n",
" <td>1.0</td>\n",
" <td>11.097368</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8146</th>\n",
" <td>8.0</td>\n",
" <td>8.0</td>\n",
" <td>7.0</td>\n",
" <td>14.0</td>\n",
" <td>-0.002602</td>\n",
" <td>-0.007404</td>\n",
" <td>2.0</td>\n",
" <td>13.018948</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1226</th>\n",
" <td>16.0</td>\n",
" <td>29.0</td>\n",
" <td>9.0</td>\n",
" <td>10.0</td>\n",
" <td>-0.011093</td>\n",
" <td>-0.012394</td>\n",
" <td>1.0</td>\n",
" <td>10.447118</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3841</th>\n",
" <td>23.0</td>\n",
" <td>26.0</td>\n",
" <td>2.0</td>\n",
" <td>15.0</td>\n",
" <td>0.018929</td>\n",
" <td>-0.054253</td>\n",
" <td>1.0</td>\n",
" <td>12.680952</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4315</th>\n",
" <td>21.0</td>\n",
" <td>8.0</td>\n",
" <td>9.0</td>\n",
" <td>12.0</td>\n",
" <td>0.002541</td>\n",
" <td>0.015274</td>\n",
" <td>1.0</td>\n",
" <td>11.515991</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3829</th>\n",
" <td>7.0</td>\n",
" <td>17.0</td>\n",
" <td>8.0</td>\n",
" <td>10.0</td>\n",
" <td>0.019821</td>\n",
" <td>0.019352</td>\n",
" <td>1.0</td>\n",
" <td>10.544013</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" hours days months years longitude_distance latitude_distance \\\n",
"2082 22.0 13.0 12.0 11.0 0.012161 0.018772 \n",
"2152 12.0 7.0 6.0 11.0 0.032173 0.019627 \n",
"2682 22.0 25.0 1.0 12.0 -0.009979 -0.051266 \n",
"6094 15.0 9.0 10.0 10.0 -0.005768 0.000046 \n",
"2227 18.0 17.0 10.0 11.0 0.027267 0.015953 \n",
"8146 8.0 8.0 7.0 14.0 -0.002602 -0.007404 \n",
"1226 16.0 29.0 9.0 10.0 -0.011093 -0.012394 \n",
"3841 23.0 26.0 2.0 15.0 0.018929 -0.054253 \n",
"4315 21.0 8.0 9.0 12.0 0.002541 0.015274 \n",
"3829 7.0 17.0 8.0 10.0 0.019821 0.019352 \n",
"\n",
" passenger_count predicted_fare \n",
"2082 1.0 11.190294 \n",
"2152 1.0 10.798038 \n",
"2682 1.0 10.687368 \n",
"6094 1.0 10.515482 \n",
"2227 1.0 11.097368 \n",
"8146 2.0 13.018948 \n",
"1226 1.0 10.447118 \n",
"3841 1.0 12.680952 \n",
"4315 1.0 11.515991 \n",
"3829 1.0 10.544013 "
]
},
"execution_count": 73,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# add predictions to test table \n",
"X_test['predicted_fare'] = predictions\n",
"\n",
"# how's that look? (convert to pandas for .sample())\n",
"X_test.to_pandas().sample(10)"
]
}
],
"metadata": {
"accelerator": "GPU",
"colab": {
"collapsed_sections": [],
"include_colab_link": true,
"name": "BlazingSQL_cuML_Taxi_Fare_Prediction.ipynb",
"provenance": []
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.7"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment