@darthgera123
Created March 27, 2020 11:03
Baseline Submission for YPMSD Challenge
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Baseline Submission for the YPMSD Challenge"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"from sklearn.model_selection import train_test_split \n",
"from sklearn.linear_model import LinearRegression\n",
"from sklearn import metrics"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load Data"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"train_data = pd.read_csv('../data/public/train.csv')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Clean and analyse the data"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>year</th>\n",
" <th>timbre_mean_0</th>\n",
" <th>timbre_mean_1</th>\n",
" <th>timbre_mean_2</th>\n",
" <th>timbre_mean_3</th>\n",
" <th>timbre_mean_4</th>\n",
" <th>timbre_mean_5</th>\n",
" <th>timbre_mean_6</th>\n",
" <th>timbre_mean_7</th>\n",
" <th>timbre_mean_8</th>\n",
" <th>...</th>\n",
" <th>timbre_cov_68</th>\n",
" <th>timbre_cov_69</th>\n",
" <th>timbre_cov_70</th>\n",
" <th>timbre_cov_71</th>\n",
" <th>timbre_cov_72</th>\n",
" <th>timbre_cov_73</th>\n",
" <th>timbre_cov_74</th>\n",
" <th>timbre_cov_75</th>\n",
" <th>timbre_cov_76</th>\n",
" <th>timbre_cov_77</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>2001</td>\n",
" <td>49.94357</td>\n",
" <td>21.47114</td>\n",
" <td>73.07750</td>\n",
" <td>8.74861</td>\n",
" <td>-17.40628</td>\n",
" <td>-13.09905</td>\n",
" <td>-25.01202</td>\n",
" <td>-12.23257</td>\n",
" <td>7.83089</td>\n",
" <td>...</td>\n",
" <td>13.01620</td>\n",
" <td>-54.40548</td>\n",
" <td>58.99367</td>\n",
" <td>15.37344</td>\n",
" <td>1.11144</td>\n",
" <td>-23.08793</td>\n",
" <td>68.40795</td>\n",
" <td>-1.82223</td>\n",
" <td>-27.46348</td>\n",
" <td>2.26327</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2001</td>\n",
" <td>48.73215</td>\n",
" <td>18.42930</td>\n",
" <td>70.32679</td>\n",
" <td>12.94636</td>\n",
" <td>-10.32437</td>\n",
" <td>-24.83777</td>\n",
" <td>8.76630</td>\n",
" <td>-0.92019</td>\n",
" <td>18.76548</td>\n",
" <td>...</td>\n",
" <td>5.66812</td>\n",
" <td>-19.68073</td>\n",
" <td>33.04964</td>\n",
" <td>42.87836</td>\n",
" <td>-9.90378</td>\n",
" <td>-32.22788</td>\n",
" <td>70.49388</td>\n",
" <td>12.04941</td>\n",
" <td>58.43453</td>\n",
" <td>26.92061</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2001</td>\n",
" <td>50.95714</td>\n",
" <td>31.85602</td>\n",
" <td>55.81851</td>\n",
" <td>13.41693</td>\n",
" <td>-6.57898</td>\n",
" <td>-18.54940</td>\n",
" <td>-3.27872</td>\n",
" <td>-2.35035</td>\n",
" <td>16.07017</td>\n",
" <td>...</td>\n",
" <td>3.03800</td>\n",
" <td>26.05866</td>\n",
" <td>-50.92779</td>\n",
" <td>10.93792</td>\n",
" <td>-0.07568</td>\n",
" <td>43.20130</td>\n",
" <td>-115.00698</td>\n",
" <td>-0.05859</td>\n",
" <td>39.67068</td>\n",
" <td>-0.66345</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>2001</td>\n",
" <td>48.24750</td>\n",
" <td>-1.89837</td>\n",
" <td>36.29772</td>\n",
" <td>2.58776</td>\n",
" <td>0.97170</td>\n",
" <td>-26.21683</td>\n",
" <td>5.05097</td>\n",
" <td>-10.34124</td>\n",
" <td>3.55005</td>\n",
" <td>...</td>\n",
" <td>34.57337</td>\n",
" <td>-171.70734</td>\n",
" <td>-16.96705</td>\n",
" <td>-46.67617</td>\n",
" <td>-12.51516</td>\n",
" <td>82.58061</td>\n",
" <td>-72.08993</td>\n",
" <td>9.90558</td>\n",
" <td>199.62971</td>\n",
" <td>18.85382</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>2001</td>\n",
" <td>50.97020</td>\n",
" <td>42.20998</td>\n",
" <td>67.09964</td>\n",
" <td>8.46791</td>\n",
" <td>-15.85279</td>\n",
" <td>-16.81409</td>\n",
" <td>-12.48207</td>\n",
" <td>-9.37636</td>\n",
" <td>12.63699</td>\n",
" <td>...</td>\n",
" <td>9.92661</td>\n",
" <td>-55.95724</td>\n",
" <td>64.92712</td>\n",
" <td>-17.72522</td>\n",
" <td>-1.49237</td>\n",
" <td>-7.50035</td>\n",
" <td>51.76631</td>\n",
" <td>7.88713</td>\n",
" <td>55.66926</td>\n",
" <td>28.74903</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows × 91 columns</p>\n",
"</div>"
],
"text/plain": [
" year timbre_mean_0 timbre_mean_1 timbre_mean_2 timbre_mean_3 \\\n",
"0 2001 49.94357 21.47114 73.07750 8.74861 \n",
"1 2001 48.73215 18.42930 70.32679 12.94636 \n",
"2 2001 50.95714 31.85602 55.81851 13.41693 \n",
"3 2001 48.24750 -1.89837 36.29772 2.58776 \n",
"4 2001 50.97020 42.20998 67.09964 8.46791 \n",
"\n",
" timbre_mean_4 timbre_mean_5 timbre_mean_6 timbre_mean_7 timbre_mean_8 \\\n",
"0 -17.40628 -13.09905 -25.01202 -12.23257 7.83089 \n",
"1 -10.32437 -24.83777 8.76630 -0.92019 18.76548 \n",
"2 -6.57898 -18.54940 -3.27872 -2.35035 16.07017 \n",
"3 0.97170 -26.21683 5.05097 -10.34124 3.55005 \n",
"4 -15.85279 -16.81409 -12.48207 -9.37636 12.63699 \n",
"\n",
" ... timbre_cov_68 timbre_cov_69 timbre_cov_70 timbre_cov_71 \\\n",
"0 ... 13.01620 -54.40548 58.99367 15.37344 \n",
"1 ... 5.66812 -19.68073 33.04964 42.87836 \n",
"2 ... 3.03800 26.05866 -50.92779 10.93792 \n",
"3 ... 34.57337 -171.70734 -16.96705 -46.67617 \n",
"4 ... 9.92661 -55.95724 64.92712 -17.72522 \n",
"\n",
" timbre_cov_72 timbre_cov_73 timbre_cov_74 timbre_cov_75 timbre_cov_76 \\\n",
"0 1.11144 -23.08793 68.40795 -1.82223 -27.46348 \n",
"1 -9.90378 -32.22788 70.49388 12.04941 58.43453 \n",
"2 -0.07568 43.20130 -115.00698 -0.05859 39.67068 \n",
"3 -12.51516 82.58061 -72.08993 9.90558 199.62971 \n",
"4 -1.49237 -7.50035 51.76631 7.88713 55.66926 \n",
"\n",
" timbre_cov_77 \n",
"0 2.26327 \n",
"1 26.92061 \n",
"2 -0.66345 \n",
"3 18.85382 \n",
"4 28.74903 \n",
"\n",
"[5 rows x 91 columns]"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"train_data.head()"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>year</th>\n",
" <th>timbre_mean_0</th>\n",
" <th>timbre_mean_1</th>\n",
" <th>timbre_mean_2</th>\n",
" <th>timbre_mean_3</th>\n",
" <th>timbre_mean_4</th>\n",
" <th>timbre_mean_5</th>\n",
" <th>timbre_mean_6</th>\n",
" <th>timbre_mean_7</th>\n",
" <th>timbre_mean_8</th>\n",
" <th>...</th>\n",
" <th>timbre_cov_68</th>\n",
" <th>timbre_cov_69</th>\n",
" <th>timbre_cov_70</th>\n",
" <th>timbre_cov_71</th>\n",
" <th>timbre_cov_72</th>\n",
" <th>timbre_cov_73</th>\n",
" <th>timbre_cov_74</th>\n",
" <th>timbre_cov_75</th>\n",
" <th>timbre_cov_76</th>\n",
" <th>timbre_cov_77</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>count</th>\n",
" <td>463715.000000</td>\n",
" <td>463715.000000</td>\n",
" <td>463715.000000</td>\n",
" <td>463715.000000</td>\n",
" <td>463715.000000</td>\n",
" <td>463715.000000</td>\n",
" <td>463715.000000</td>\n",
" <td>463715.000000</td>\n",
" <td>463715.000000</td>\n",
" <td>463715.000000</td>\n",
" <td>...</td>\n",
" <td>463715.000000</td>\n",
" <td>463715.000000</td>\n",
" <td>463715.000000</td>\n",
" <td>463715.000000</td>\n",
" <td>463715.000000</td>\n",
" <td>463715.000000</td>\n",
" <td>463715.000000</td>\n",
" <td>463715.000000</td>\n",
" <td>463715.000000</td>\n",
" <td>463715.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>mean</th>\n",
" <td>1998.386095</td>\n",
" <td>43.385488</td>\n",
" <td>1.261091</td>\n",
" <td>8.650195</td>\n",
" <td>1.130763</td>\n",
" <td>-6.512725</td>\n",
" <td>-9.565527</td>\n",
" <td>-2.384609</td>\n",
" <td>-1.793722</td>\n",
" <td>3.714584</td>\n",
" <td>...</td>\n",
" <td>15.743361</td>\n",
" <td>-73.067753</td>\n",
" <td>41.423976</td>\n",
" <td>37.780868</td>\n",
" <td>0.345259</td>\n",
" <td>17.599280</td>\n",
" <td>-26.364826</td>\n",
" <td>4.444985</td>\n",
" <td>19.739307</td>\n",
" <td>1.323326</td>\n",
" </tr>\n",
" <tr>\n",
" <th>std</th>\n",
" <td>10.939767</td>\n",
" <td>6.079139</td>\n",
" <td>51.613473</td>\n",
" <td>35.264750</td>\n",
" <td>16.334672</td>\n",
" <td>22.855820</td>\n",
" <td>12.836758</td>\n",
" <td>14.580245</td>\n",
" <td>7.961876</td>\n",
" <td>10.579241</td>\n",
" <td>...</td>\n",
" <td>32.086356</td>\n",
" <td>175.376872</td>\n",
" <td>121.794610</td>\n",
" <td>94.874474</td>\n",
" <td>16.153797</td>\n",
" <td>114.336522</td>\n",
" <td>174.187892</td>\n",
" <td>13.320996</td>\n",
" <td>184.843503</td>\n",
" <td>22.045404</td>\n",
" </tr>\n",
" <tr>\n",
" <th>min</th>\n",
" <td>1922.000000</td>\n",
" <td>1.749000</td>\n",
" <td>-337.092500</td>\n",
" <td>-301.005060</td>\n",
" <td>-154.183580</td>\n",
" <td>-181.953370</td>\n",
" <td>-81.794290</td>\n",
" <td>-188.214000</td>\n",
" <td>-72.503850</td>\n",
" <td>-126.479040</td>\n",
" <td>...</td>\n",
" <td>-437.722030</td>\n",
" <td>-4402.376440</td>\n",
" <td>-1810.689190</td>\n",
" <td>-3098.350310</td>\n",
" <td>-341.789120</td>\n",
" <td>-3168.924570</td>\n",
" <td>-4319.992320</td>\n",
" <td>-236.039260</td>\n",
" <td>-7458.378150</td>\n",
" <td>-318.223330</td>\n",
" </tr>\n",
" <tr>\n",
" <th>25%</th>\n",
" <td>1994.000000</td>\n",
" <td>39.957540</td>\n",
" <td>-26.153810</td>\n",
" <td>-11.441920</td>\n",
" <td>-8.515155</td>\n",
" <td>-20.636960</td>\n",
" <td>-18.468705</td>\n",
" <td>-10.776340</td>\n",
" <td>-6.461400</td>\n",
" <td>-2.303600</td>\n",
" <td>...</td>\n",
" <td>-1.798085</td>\n",
" <td>-139.062035</td>\n",
" <td>-20.918635</td>\n",
" <td>-4.711470</td>\n",
" <td>-6.758160</td>\n",
" <td>-31.563615</td>\n",
" <td>-101.396245</td>\n",
" <td>-2.572830</td>\n",
" <td>-59.598030</td>\n",
" <td>-8.813335</td>\n",
" </tr>\n",
" <tr>\n",
" <th>50%</th>\n",
" <td>2002.000000</td>\n",
" <td>44.262570</td>\n",
" <td>8.371550</td>\n",
" <td>10.470520</td>\n",
" <td>-0.691610</td>\n",
" <td>-5.992740</td>\n",
" <td>-11.208850</td>\n",
" <td>-2.047850</td>\n",
" <td>-1.735440</td>\n",
" <td>3.816840</td>\n",
" <td>...</td>\n",
" <td>9.161360</td>\n",
" <td>-52.878010</td>\n",
" <td>28.709870</td>\n",
" <td>33.494550</td>\n",
" <td>0.828350</td>\n",
" <td>15.554490</td>\n",
" <td>-21.123570</td>\n",
" <td>3.111120</td>\n",
" <td>7.586950</td>\n",
" <td>0.052840</td>\n",
" </tr>\n",
" <tr>\n",
" <th>75%</th>\n",
" <td>2006.000000</td>\n",
" <td>47.833650</td>\n",
" <td>36.143780</td>\n",
" <td>29.741165</td>\n",
" <td>8.756995</td>\n",
" <td>7.749590</td>\n",
" <td>-2.422590</td>\n",
" <td>6.515710</td>\n",
" <td>2.905130</td>\n",
" <td>9.950960</td>\n",
" <td>...</td>\n",
" <td>26.248290</td>\n",
" <td>13.620660</td>\n",
" <td>89.419995</td>\n",
" <td>77.674700</td>\n",
" <td>8.495715</td>\n",
" <td>67.743725</td>\n",
" <td>52.299850</td>\n",
" <td>9.948955</td>\n",
" <td>86.203115</td>\n",
" <td>9.670740</td>\n",
" </tr>\n",
" <tr>\n",
" <th>max</th>\n",
" <td>2011.000000</td>\n",
" <td>61.970140</td>\n",
" <td>384.065730</td>\n",
" <td>322.851430</td>\n",
" <td>289.527430</td>\n",
" <td>262.068870</td>\n",
" <td>119.815590</td>\n",
" <td>172.402680</td>\n",
" <td>105.210280</td>\n",
" <td>146.297950</td>\n",
" <td>...</td>\n",
" <td>840.973380</td>\n",
" <td>4469.454870</td>\n",
" <td>3210.701700</td>\n",
" <td>1672.647100</td>\n",
" <td>260.544900</td>\n",
" <td>3662.065650</td>\n",
" <td>2833.608950</td>\n",
" <td>463.419500</td>\n",
" <td>7393.398440</td>\n",
" <td>600.766240</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>8 rows × 91 columns</p>\n",
"</div>"
],
"text/plain": [
" year timbre_mean_0 timbre_mean_1 timbre_mean_2 \\\n",
"count 463715.000000 463715.000000 463715.000000 463715.000000 \n",
"mean 1998.386095 43.385488 1.261091 8.650195 \n",
"std 10.939767 6.079139 51.613473 35.264750 \n",
"min 1922.000000 1.749000 -337.092500 -301.005060 \n",
"25% 1994.000000 39.957540 -26.153810 -11.441920 \n",
"50% 2002.000000 44.262570 8.371550 10.470520 \n",
"75% 2006.000000 47.833650 36.143780 29.741165 \n",
"max 2011.000000 61.970140 384.065730 322.851430 \n",
"\n",
" timbre_mean_3 timbre_mean_4 timbre_mean_5 timbre_mean_6 \\\n",
"count 463715.000000 463715.000000 463715.000000 463715.000000 \n",
"mean 1.130763 -6.512725 -9.565527 -2.384609 \n",
"std 16.334672 22.855820 12.836758 14.580245 \n",
"min -154.183580 -181.953370 -81.794290 -188.214000 \n",
"25% -8.515155 -20.636960 -18.468705 -10.776340 \n",
"50% -0.691610 -5.992740 -11.208850 -2.047850 \n",
"75% 8.756995 7.749590 -2.422590 6.515710 \n",
"max 289.527430 262.068870 119.815590 172.402680 \n",
"\n",
" timbre_mean_7 timbre_mean_8 ... timbre_cov_68 timbre_cov_69 \\\n",
"count 463715.000000 463715.000000 ... 463715.000000 463715.000000 \n",
"mean -1.793722 3.714584 ... 15.743361 -73.067753 \n",
"std 7.961876 10.579241 ... 32.086356 175.376872 \n",
"min -72.503850 -126.479040 ... -437.722030 -4402.376440 \n",
"25% -6.461400 -2.303600 ... -1.798085 -139.062035 \n",
"50% -1.735440 3.816840 ... 9.161360 -52.878010 \n",
"75% 2.905130 9.950960 ... 26.248290 13.620660 \n",
"max 105.210280 146.297950 ... 840.973380 4469.454870 \n",
"\n",
" timbre_cov_70 timbre_cov_71 timbre_cov_72 timbre_cov_73 \\\n",
"count 463715.000000 463715.000000 463715.000000 463715.000000 \n",
"mean 41.423976 37.780868 0.345259 17.599280 \n",
"std 121.794610 94.874474 16.153797 114.336522 \n",
"min -1810.689190 -3098.350310 -341.789120 -3168.924570 \n",
"25% -20.918635 -4.711470 -6.758160 -31.563615 \n",
"50% 28.709870 33.494550 0.828350 15.554490 \n",
"75% 89.419995 77.674700 8.495715 67.743725 \n",
"max 3210.701700 1672.647100 260.544900 3662.065650 \n",
"\n",
" timbre_cov_74 timbre_cov_75 timbre_cov_76 timbre_cov_77 \n",
"count 463715.000000 463715.000000 463715.000000 463715.000000 \n",
"mean -26.364826 4.444985 19.739307 1.323326 \n",
"std 174.187892 13.320996 184.843503 22.045404 \n",
"min -4319.992320 -236.039260 -7458.378150 -318.223330 \n",
"25% -101.396245 -2.572830 -59.598030 -8.813335 \n",
"50% -21.123570 3.111120 7.586950 0.052840 \n",
"75% 52.299850 9.948955 86.203115 9.670740 \n",
"max 2833.608950 463.419500 7393.398440 600.766240 \n",
"\n",
"[8 rows x 91 columns]"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"train_data.describe()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Split Data for Train and Validation"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"X = train_data.drop(columns='year')\n",
"y = train_data['year']\n",
"# Hold out 20% of the training rows for validation\n",
"X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Define the Regressor and Train"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"regressor = LinearRegression() \n",
"regressor.fit(X_train, y_train)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Inspect the fitted coefficients of the first five features"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Coefficient</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>timbre_mean_0</th>\n",
" <td>0.873376</td>\n",
" </tr>\n",
" <tr>\n",
" <th>timbre_mean_1</th>\n",
" <td>-0.055835</td>\n",
" </tr>\n",
" <tr>\n",
" <th>timbre_mean_2</th>\n",
" <td>-0.043576</td>\n",
" </tr>\n",
" <tr>\n",
" <th>timbre_mean_3</th>\n",
" <td>0.004539</td>\n",
" </tr>\n",
" <tr>\n",
" <th>timbre_mean_4</th>\n",
" <td>-0.015032</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Coefficient\n",
"timbre_mean_0 0.873376\n",
"timbre_mean_1 -0.055835\n",
"timbre_mean_2 -0.043576\n",
"timbre_mean_3 0.004539\n",
"timbre_mean_4 -0.015032"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"coeff_df = pd.DataFrame(regressor.coef_, X.columns, columns=['Coefficient']) \n",
"coeff_df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Predict on validation"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"y_pred = regressor.predict(X_val)"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Actual</th>\n",
" <th>Predicted</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>332595</th>\n",
" <td>2004</td>\n",
" <td>2002.979558</td>\n",
" </tr>\n",
" <tr>\n",
" <th>230573</th>\n",
" <td>1989</td>\n",
" <td>1996.446079</td>\n",
" </tr>\n",
" <tr>\n",
" <th>364530</th>\n",
" <td>1987</td>\n",
" <td>1995.333451</td>\n",
" </tr>\n",
" <tr>\n",
" <th>82857</th>\n",
" <td>2002</td>\n",
" <td>1998.163320</td>\n",
" </tr>\n",
" <tr>\n",
" <th>108108</th>\n",
" <td>1971</td>\n",
" <td>1998.303355</td>\n",
" </tr>\n",
" <tr>\n",
" <th>446568</th>\n",
" <td>2005</td>\n",
" <td>2000.499458</td>\n",
" </tr>\n",
" <tr>\n",
" <th>27815</th>\n",
" <td>2004</td>\n",
" <td>1995.818434</td>\n",
" </tr>\n",
" <tr>\n",
" <th>214974</th>\n",
" <td>1997</td>\n",
" <td>1999.666288</td>\n",
" </tr>\n",
" <tr>\n",
" <th>304899</th>\n",
" <td>2006</td>\n",
" <td>2005.025704</td>\n",
" </tr>\n",
" <tr>\n",
" <th>257881</th>\n",
" <td>2007</td>\n",
" <td>1998.581968</td>\n",
" </tr>\n",
" <tr>\n",
" <th>144054</th>\n",
" <td>2004</td>\n",
" <td>2001.671307</td>\n",
" </tr>\n",
" <tr>\n",
" <th>292186</th>\n",
" <td>1993</td>\n",
" <td>1991.859040</td>\n",
" </tr>\n",
" <tr>\n",
" <th>260055</th>\n",
" <td>1984</td>\n",
" <td>1989.047164</td>\n",
" </tr>\n",
" <tr>\n",
" <th>50427</th>\n",
" <td>2007</td>\n",
" <td>2001.741216</td>\n",
" </tr>\n",
" <tr>\n",
" <th>380270</th>\n",
" <td>1975</td>\n",
" <td>1994.628844</td>\n",
" </tr>\n",
" <tr>\n",
" <th>43122</th>\n",
" <td>2003</td>\n",
" <td>2004.481078</td>\n",
" </tr>\n",
" <tr>\n",
" <th>431264</th>\n",
" <td>2000</td>\n",
" <td>1999.890466</td>\n",
" </tr>\n",
" <tr>\n",
" <th>75602</th>\n",
" <td>2009</td>\n",
" <td>1998.807686</td>\n",
" </tr>\n",
" <tr>\n",
" <th>461034</th>\n",
" <td>1987</td>\n",
" <td>1989.828458</td>\n",
" </tr>\n",
" <tr>\n",
" <th>336805</th>\n",
" <td>1996</td>\n",
" <td>1995.843295</td>\n",
" </tr>\n",
" <tr>\n",
" <th>375889</th>\n",
" <td>1999</td>\n",
" <td>2001.135930</td>\n",
" </tr>\n",
" <tr>\n",
" <th>182008</th>\n",
" <td>2008</td>\n",
" <td>2008.060799</td>\n",
" </tr>\n",
" <tr>\n",
" <th>283427</th>\n",
" <td>2002</td>\n",
" <td>1998.591879</td>\n",
" </tr>\n",
" <tr>\n",
" <th>345613</th>\n",
" <td>1955</td>\n",
" <td>1988.051633</td>\n",
" </tr>\n",
" <tr>\n",
" <th>97235</th>\n",
" <td>1999</td>\n",
" <td>1988.706992</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Actual Predicted\n",
"332595 2004 2002.979558\n",
"230573 1989 1996.446079\n",
"364530 1987 1995.333451\n",
"82857 2002 1998.163320\n",
"108108 1971 1998.303355\n",
"446568 2005 2000.499458\n",
"27815 2004 1995.818434\n",
"214974 1997 1999.666288\n",
"304899 2006 2005.025704\n",
"257881 2007 1998.581968\n",
"144054 2004 2001.671307\n",
"292186 1993 1991.859040\n",
"260055 1984 1989.047164\n",
"50427 2007 2001.741216\n",
"380270 1975 1994.628844\n",
"43122 2003 2004.481078\n",
"431264 2000 1999.890466\n",
"75602 2009 1998.807686\n",
"461034 1987 1989.828458\n",
"336805 1996 1995.843295\n",
"375889 1999 2001.135930\n",
"182008 2008 2008.060799\n",
"283427 2002 1998.591879\n",
"345613 1955 1988.051633\n",
"97235 1999 1988.706992"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pd.DataFrame({'Actual': y_val, 'Predicted': y_pred})\n",
"df1 = df.head(25)\n",
"df1"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Evaluate the Performance"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Mean Absolute Error: 6.77395050034565\n",
"Mean Squared Error: 90.87071514117896\n",
"Root Mean Squared Error: 9.53261323778422\n"
]
}
],
"source": [
"print('Mean Absolute Error:', metrics.mean_absolute_error(y_val, y_pred)) \n",
"print('Mean Squared Error:', metrics.mean_squared_error(y_val, y_pred)) \n",
"print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_val, y_pred)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load Test Set"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"test_data = pd.read_csv('../data/public/test.csv')"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Unnamed: 1</th>\n",
" <th>Unnamed: 2</th>\n",
" <th>Unnamed: 3</th>\n",
" <th>Unnamed: 4</th>\n",
" <th>Unnamed: 5</th>\n",
" <th>Unnamed: 6</th>\n",
" <th>Unnamed: 7</th>\n",
" <th>Unnamed: 8</th>\n",
" <th>Unnamed: 9</th>\n",
" <th>Unnamed: 10</th>\n",
" <th>...</th>\n",
" <th>Unnamed: 81</th>\n",
" <th>Unnamed: 82</th>\n",
" <th>Unnamed: 83</th>\n",
" <th>Unnamed: 84</th>\n",
" <th>Unnamed: 85</th>\n",
" <th>Unnamed: 86</th>\n",
" <th>Unnamed: 87</th>\n",
" <th>Unnamed: 88</th>\n",
" <th>Unnamed: 89</th>\n",
" <th>Unnamed: 90</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>45.44200</td>\n",
" <td>-30.74976</td>\n",
" <td>31.78587</td>\n",
" <td>4.63569</td>\n",
" <td>-15.14894</td>\n",
" <td>0.23370</td>\n",
" <td>-11.97968</td>\n",
" <td>-9.59708</td>\n",
" <td>6.48111</td>\n",
" <td>-8.89073</td>\n",
" <td>...</td>\n",
" <td>-8.84046</td>\n",
" <td>-0.15439</td>\n",
" <td>137.44210</td>\n",
" <td>77.54739</td>\n",
" <td>-4.22875</td>\n",
" <td>-61.92657</td>\n",
" <td>-33.52722</td>\n",
" <td>-3.86253</td>\n",
" <td>36.42400</td>\n",
" <td>7.17309</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>52.67814</td>\n",
" <td>-2.88914</td>\n",
" <td>43.95268</td>\n",
" <td>-1.39209</td>\n",
" <td>-14.93379</td>\n",
" <td>-15.86877</td>\n",
" <td>1.19379</td>\n",
" <td>0.31401</td>\n",
" <td>-4.44235</td>\n",
" <td>-5.78934</td>\n",
" <td>...</td>\n",
" <td>-5.74356</td>\n",
" <td>-42.57910</td>\n",
" <td>-2.91103</td>\n",
" <td>48.72805</td>\n",
" <td>-3.08183</td>\n",
" <td>-9.38888</td>\n",
" <td>-7.27179</td>\n",
" <td>-4.00966</td>\n",
" <td>-68.96211</td>\n",
" <td>-5.21525</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>45.74235</td>\n",
" <td>12.02291</td>\n",
" <td>11.03009</td>\n",
" <td>-11.60763</td>\n",
" <td>11.80054</td>\n",
" <td>-11.12389</td>\n",
" <td>-5.39058</td>\n",
" <td>-1.11981</td>\n",
" <td>-7.74086</td>\n",
" <td>-3.33421</td>\n",
" <td>...</td>\n",
" <td>-4.70606</td>\n",
" <td>-24.22599</td>\n",
" <td>-35.22686</td>\n",
" <td>27.77729</td>\n",
" <td>15.38934</td>\n",
" <td>58.20036</td>\n",
" <td>-61.12698</td>\n",
" <td>-10.92522</td>\n",
" <td>26.75348</td>\n",
" <td>-5.78743</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>52.55883</td>\n",
" <td>2.87222</td>\n",
" <td>27.38848</td>\n",
" <td>-5.76235</td>\n",
" <td>-15.35766</td>\n",
" <td>-15.01592</td>\n",
" <td>-5.86893</td>\n",
" <td>-0.31447</td>\n",
" <td>-5.06922</td>\n",
" <td>-4.62734</td>\n",
" <td>...</td>\n",
" <td>-8.35215</td>\n",
" <td>-16.86791</td>\n",
" <td>-10.58277</td>\n",
" <td>40.10173</td>\n",
" <td>-0.54005</td>\n",
" <td>-11.54746</td>\n",
" <td>-45.35860</td>\n",
" <td>-4.55694</td>\n",
" <td>-43.17368</td>\n",
" <td>-3.33725</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>51.34809</td>\n",
" <td>9.02702</td>\n",
" <td>25.33757</td>\n",
" <td>-6.62537</td>\n",
" <td>0.03367</td>\n",
" <td>-12.69565</td>\n",
" <td>-3.13400</td>\n",
" <td>2.98649</td>\n",
" <td>-6.71750</td>\n",
" <td>-1.85804</td>\n",
" <td>...</td>\n",
" <td>-6.87366</td>\n",
" <td>-20.03371</td>\n",
" <td>-66.38940</td>\n",
" <td>50.56569</td>\n",
" <td>0.27747</td>\n",
" <td>67.05657</td>\n",
" <td>-55.58846</td>\n",
" <td>-7.50859</td>\n",
" <td>28.23511</td>\n",
" <td>-0.72045</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows × 90 columns</p>\n",
"</div>"
],
"text/plain": [
" Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4 Unnamed: 5 Unnamed: 6 \\\n",
"0 45.44200 -30.74976 31.78587 4.63569 -15.14894 0.23370 \n",
"1 52.67814 -2.88914 43.95268 -1.39209 -14.93379 -15.86877 \n",
"2 45.74235 12.02291 11.03009 -11.60763 11.80054 -11.12389 \n",
"3 52.55883 2.87222 27.38848 -5.76235 -15.35766 -15.01592 \n",
"4 51.34809 9.02702 25.33757 -6.62537 0.03367 -12.69565 \n",
"\n",
" Unnamed: 7 Unnamed: 8 Unnamed: 9 Unnamed: 10 ... Unnamed: 81 \\\n",
"0 -11.97968 -9.59708 6.48111 -8.89073 ... -8.84046 \n",
"1 1.19379 0.31401 -4.44235 -5.78934 ... -5.74356 \n",
"2 -5.39058 -1.11981 -7.74086 -3.33421 ... -4.70606 \n",
"3 -5.86893 -0.31447 -5.06922 -4.62734 ... -8.35215 \n",
"4 -3.13400 2.98649 -6.71750 -1.85804 ... -6.87366 \n",
"\n",
" Unnamed: 82 Unnamed: 83 Unnamed: 84 Unnamed: 85 Unnamed: 86 \\\n",
"0 -0.15439 137.44210 77.54739 -4.22875 -61.92657 \n",
"1 -42.57910 -2.91103 48.72805 -3.08183 -9.38888 \n",
"2 -24.22599 -35.22686 27.77729 15.38934 58.20036 \n",
"3 -16.86791 -10.58277 40.10173 -0.54005 -11.54746 \n",
"4 -20.03371 -66.38940 50.56569 0.27747 67.05657 \n",
"\n",
" Unnamed: 87 Unnamed: 88 Unnamed: 89 Unnamed: 90 \n",
"0 -33.52722 -3.86253 36.42400 7.17309 \n",
"1 -7.27179 -4.00966 -68.96211 -5.21525 \n",
"2 -61.12698 -10.92522 26.75348 -5.78743 \n",
"3 -45.35860 -4.55694 -43.17368 -3.33725 \n",
"4 -55.58846 -7.50859 28.23511 -0.72045 \n",
"\n",
"[5 rows x 90 columns]"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"test_data.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Predict on test set"
]
},
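{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# The test columns load as 'Unnamed: N'; reuse the training feature names (same column order assumed)\n",
"test_data.columns = X.columns\n",
"y_test = regressor.predict(test_data)"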
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since the target is an integer year, convert the predictions to integers (the conversion truncates the fractional part)."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Truncates toward zero, the same result as applying int() to each prediction\n",
"y_inttest = y_test.astype(int)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Save it in correct format"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"df = pd.DataFrame(y_inttest,columns=['year'])\n",
"df.to_csv('../data/public/submission.csv',index=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To participate in the challenge, click [here](https://www.aicrowd.com/challenges/olnwp-online-news-prediction)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.10"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
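
The notebook converts the test predictions to integers by truncation. As a rough illustration of how that choice can affect the error, the sketch below compares truncation with rounding to the nearest year on the first five (actual, predicted) pairs shown in the validation preview. The helper `mean_absolute_error` is a hypothetical pure-Python stand-in for `metrics.mean_absolute_error`, and five rows are far too few to draw any conclusion about the full validation set.

```python
# First five (actual, predicted) pairs copied from the validation preview above.
pairs = [
    (2004, 2002.979558),
    (1989, 1996.446079),
    (1987, 1995.333451),
    (2002, 1998.163320),
    (1971, 1998.303355),
]


def mean_absolute_error(actual, predicted):
    """Hypothetical helper: average absolute difference between two sequences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


actual = [a for a, _ in pairs]
raw = [p for _, p in pairs]

# Truncation drops the fractional part (what the notebook does).
truncated = [int(p) for p in raw]
# Rounding picks the nearest integer year instead (an alternative, not the notebook's method).
rounded = [round(p) for p in raw]

mae_raw = mean_absolute_error(actual, raw)
mae_truncated = mean_absolute_error(actual, truncated)
mae_rounded = mean_absolute_error(actual, rounded)

print("MAE, raw predictions:      ", mae_raw)
print("MAE, truncated predictions:", mae_truncated)
print("MAE, rounded predictions:  ", mae_rounded)
```

On these five rows only the first prediction changes (2002 when truncated, 2003 when rounded), so the two integer conversions differ by a single year of total absolute error. Whether rounding helps on the actual submission would have to be checked against the full validation split.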