@darthgera123
Created March 27, 2020 11:10
Baseline submission for OLNWP
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Baseline Submission for the Challenge OLNWP"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"from sklearn.model_selection import train_test_split \n",
"from sklearn.linear_model import LinearRegression\n",
"from sklearn import metrics"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load Data"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"train_data = pd.read_csv('../data/public/train.csv')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Clean and analyse the data"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style>\n",
" .dataframe thead tr:only-child th {\n",
" text-align: right;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: left;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>timedelta</th>\n",
" <th>n_tokens_title</th>\n",
" <th>n_tokens_content</th>\n",
" <th>n_unique_tokens</th>\n",
" <th>n_non_stop_words</th>\n",
" <th>n_non_stop_unique_tokens</th>\n",
" <th>num_hrefs</th>\n",
" <th>num_self_hrefs</th>\n",
" <th>num_imgs</th>\n",
" <th>num_videos</th>\n",
" <th>...</th>\n",
" <th>min_positive_polarity</th>\n",
" <th>max_positive_polarity</th>\n",
" <th>avg_negative_polarity</th>\n",
" <th>min_negative_polarity</th>\n",
" <th>max_negative_polarity</th>\n",
" <th>title_subjectivity</th>\n",
" <th>title_sentiment_polarity</th>\n",
" <th>abs_title_subjectivity</th>\n",
" <th>abs_title_sentiment_polarity</th>\n",
" <th>shares</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>525.0</td>\n",
" <td>10.0</td>\n",
" <td>238.0</td>\n",
" <td>0.658120</td>\n",
" <td>1.0</td>\n",
" <td>0.821918</td>\n",
" <td>7.0</td>\n",
" <td>5.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>...</td>\n",
" <td>0.100000</td>\n",
" <td>0.4</td>\n",
" <td>-0.133333</td>\n",
" <td>-0.166667</td>\n",
" <td>-0.10</td>\n",
" <td>0.250000</td>\n",
" <td>0.000000</td>\n",
" <td>0.250000</td>\n",
" <td>0.000000</td>\n",
" <td>782</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>273.0</td>\n",
" <td>11.0</td>\n",
" <td>545.0</td>\n",
" <td>0.474170</td>\n",
" <td>1.0</td>\n",
" <td>0.587719</td>\n",
" <td>21.0</td>\n",
" <td>2.0</td>\n",
" <td>21.0</td>\n",
" <td>1.0</td>\n",
" <td>...</td>\n",
" <td>0.100000</td>\n",
" <td>0.9</td>\n",
" <td>-0.248214</td>\n",
" <td>-0.300000</td>\n",
" <td>-0.05</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.500000</td>\n",
" <td>0.000000</td>\n",
" <td>6200</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>423.0</td>\n",
" <td>10.0</td>\n",
" <td>453.0</td>\n",
" <td>0.518265</td>\n",
" <td>1.0</td>\n",
" <td>0.669173</td>\n",
" <td>21.0</td>\n",
" <td>5.0</td>\n",
" <td>15.0</td>\n",
" <td>0.0</td>\n",
" <td>...</td>\n",
" <td>0.100000</td>\n",
" <td>0.5</td>\n",
" <td>-0.380000</td>\n",
" <td>-0.700000</td>\n",
" <td>-0.20</td>\n",
" <td>0.300000</td>\n",
" <td>0.200000</td>\n",
" <td>0.200000</td>\n",
" <td>0.200000</td>\n",
" <td>723</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>80.0</td>\n",
" <td>11.0</td>\n",
" <td>814.0</td>\n",
" <td>0.456885</td>\n",
" <td>1.0</td>\n",
" <td>0.608787</td>\n",
" <td>2.0</td>\n",
" <td>2.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>...</td>\n",
" <td>0.033333</td>\n",
" <td>1.0</td>\n",
" <td>-0.195312</td>\n",
" <td>-0.600000</td>\n",
" <td>-0.05</td>\n",
" <td>0.277273</td>\n",
" <td>0.218182</td>\n",
" <td>0.222727</td>\n",
" <td>0.218182</td>\n",
" <td>809</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>653.0</td>\n",
" <td>11.0</td>\n",
" <td>113.0</td>\n",
" <td>0.711712</td>\n",
" <td>1.0</td>\n",
" <td>0.878788</td>\n",
" <td>5.0</td>\n",
" <td>4.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>...</td>\n",
" <td>0.136364</td>\n",
" <td>0.8</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.00</td>\n",
" <td>0.375000</td>\n",
" <td>-0.125000</td>\n",
" <td>0.125000</td>\n",
" <td>0.125000</td>\n",
" <td>1600</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows × 60 columns</p>\n",
"</div>"
],
"text/plain": [
" timedelta n_tokens_title n_tokens_content n_unique_tokens \\\n",
"0 525.0 10.0 238.0 0.658120 \n",
"1 273.0 11.0 545.0 0.474170 \n",
"2 423.0 10.0 453.0 0.518265 \n",
"3 80.0 11.0 814.0 0.456885 \n",
"4 653.0 11.0 113.0 0.711712 \n",
"\n",
" n_non_stop_words n_non_stop_unique_tokens num_hrefs num_self_hrefs \\\n",
"0 1.0 0.821918 7.0 5.0 \n",
"1 1.0 0.587719 21.0 2.0 \n",
"2 1.0 0.669173 21.0 5.0 \n",
"3 1.0 0.608787 2.0 2.0 \n",
"4 1.0 0.878788 5.0 4.0 \n",
"\n",
" num_imgs num_videos ... min_positive_polarity \\\n",
"0 1.0 0.0 ... 0.100000 \n",
"1 21.0 1.0 ... 0.100000 \n",
"2 15.0 0.0 ... 0.100000 \n",
"3 1.0 0.0 ... 0.033333 \n",
"4 0.0 0.0 ... 0.136364 \n",
"\n",
" max_positive_polarity avg_negative_polarity min_negative_polarity \\\n",
"0 0.4 -0.133333 -0.166667 \n",
"1 0.9 -0.248214 -0.300000 \n",
"2 0.5 -0.380000 -0.700000 \n",
"3 1.0 -0.195312 -0.600000 \n",
"4 0.8 0.000000 0.000000 \n",
"\n",
" max_negative_polarity title_subjectivity title_sentiment_polarity \\\n",
"0 -0.10 0.250000 0.000000 \n",
"1 -0.05 0.000000 0.000000 \n",
"2 -0.20 0.300000 0.200000 \n",
"3 -0.05 0.277273 0.218182 \n",
"4 0.00 0.375000 -0.125000 \n",
"\n",
" abs_title_subjectivity abs_title_sentiment_polarity shares \n",
"0 0.250000 0.000000 782 \n",
"1 0.500000 0.000000 6200 \n",
"2 0.200000 0.200000 723 \n",
"3 0.222727 0.218182 809 \n",
"4 0.125000 0.125000 1600 \n",
"\n",
"[5 rows x 60 columns]"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"train_data = train_data.drop(columns='url')\n",
"train_data.head()"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style>\n",
" .dataframe thead tr:only-child th {\n",
" text-align: right;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: left;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>timedelta</th>\n",
" <th>n_tokens_title</th>\n",
" <th>n_tokens_content</th>\n",
" <th>n_unique_tokens</th>\n",
" <th>n_non_stop_words</th>\n",
" <th>n_non_stop_unique_tokens</th>\n",
" <th>num_hrefs</th>\n",
" <th>num_self_hrefs</th>\n",
" <th>num_imgs</th>\n",
" <th>num_videos</th>\n",
" <th>...</th>\n",
" <th>min_positive_polarity</th>\n",
" <th>max_positive_polarity</th>\n",
" <th>avg_negative_polarity</th>\n",
" <th>min_negative_polarity</th>\n",
" <th>max_negative_polarity</th>\n",
" <th>title_subjectivity</th>\n",
" <th>title_sentiment_polarity</th>\n",
" <th>abs_title_subjectivity</th>\n",
" <th>abs_title_sentiment_polarity</th>\n",
" <th>shares</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>count</th>\n",
" <td>26561.000000</td>\n",
" <td>26561.000000</td>\n",
" <td>26561.000000</td>\n",
" <td>26561.000000</td>\n",
" <td>26561.000000</td>\n",
" <td>26561.000000</td>\n",
" <td>26561.000000</td>\n",
" <td>26561.000000</td>\n",
" <td>26561.000000</td>\n",
" <td>26561.000000</td>\n",
" <td>...</td>\n",
" <td>26561.000000</td>\n",
" <td>26561.000000</td>\n",
" <td>26561.000000</td>\n",
" <td>26561.000000</td>\n",
" <td>26561.000000</td>\n",
" <td>26561.000000</td>\n",
" <td>26561.000000</td>\n",
" <td>26561.000000</td>\n",
" <td>26561.000000</td>\n",
" <td>26561.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>mean</th>\n",
" <td>354.110802</td>\n",
" <td>10.403449</td>\n",
" <td>552.377282</td>\n",
" <td>0.555933</td>\n",
" <td>1.009337</td>\n",
" <td>0.696678</td>\n",
" <td>10.898648</td>\n",
" <td>3.304733</td>\n",
" <td>4.588344</td>\n",
" <td>1.259177</td>\n",
" <td>...</td>\n",
" <td>0.094825</td>\n",
" <td>0.757686</td>\n",
" <td>-0.259757</td>\n",
" <td>-0.522776</td>\n",
" <td>-0.107678</td>\n",
" <td>0.282236</td>\n",
" <td>0.071113</td>\n",
" <td>0.342243</td>\n",
" <td>0.156345</td>\n",
" <td>3369.156094</td>\n",
" </tr>\n",
" <tr>\n",
" <th>std</th>\n",
" <td>213.485655</td>\n",
" <td>2.122533</td>\n",
" <td>472.605248</td>\n",
" <td>4.300199</td>\n",
" <td>6.389915</td>\n",
" <td>3.987187</td>\n",
" <td>11.254509</td>\n",
" <td>3.855560</td>\n",
" <td>8.377796</td>\n",
" <td>4.212860</td>\n",
" <td>...</td>\n",
" <td>0.070493</td>\n",
" <td>0.247909</td>\n",
" <td>0.128229</td>\n",
" <td>0.290208</td>\n",
" <td>0.096784</td>\n",
" <td>0.324309</td>\n",
" <td>0.266373</td>\n",
" <td>0.188296</td>\n",
" <td>0.227084</td>\n",
" <td>10971.259269</td>\n",
" </tr>\n",
" <tr>\n",
" <th>min</th>\n",
" <td>8.000000</td>\n",
" <td>3.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>...</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>-1.000000</td>\n",
" <td>-1.000000</td>\n",
" <td>-1.000000</td>\n",
" <td>0.000000</td>\n",
" <td>-1.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>1.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>25%</th>\n",
" <td>165.000000</td>\n",
" <td>9.000000</td>\n",
" <td>248.000000</td>\n",
" <td>0.470000</td>\n",
" <td>1.000000</td>\n",
" <td>0.625430</td>\n",
" <td>4.000000</td>\n",
" <td>1.000000</td>\n",
" <td>1.000000</td>\n",
" <td>0.000000</td>\n",
" <td>...</td>\n",
" <td>0.050000</td>\n",
" <td>0.600000</td>\n",
" <td>-0.327976</td>\n",
" <td>-0.700000</td>\n",
" <td>-0.125000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.166667</td>\n",
" <td>0.000000</td>\n",
" <td>948.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>50%</th>\n",
" <td>339.000000</td>\n",
" <td>10.000000</td>\n",
" <td>415.000000</td>\n",
" <td>0.538251</td>\n",
" <td>1.000000</td>\n",
" <td>0.690323</td>\n",
" <td>8.000000</td>\n",
" <td>3.000000</td>\n",
" <td>1.000000</td>\n",
" <td>0.000000</td>\n",
" <td>...</td>\n",
" <td>0.100000</td>\n",
" <td>0.800000</td>\n",
" <td>-0.253385</td>\n",
" <td>-0.500000</td>\n",
" <td>-0.100000</td>\n",
" <td>0.142857</td>\n",
" <td>0.000000</td>\n",
" <td>0.500000</td>\n",
" <td>0.000000</td>\n",
" <td>1400.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>75%</th>\n",
" <td>540.000000</td>\n",
" <td>12.000000</td>\n",
" <td>724.000000</td>\n",
" <td>0.607735</td>\n",
" <td>1.000000</td>\n",
" <td>0.754011</td>\n",
" <td>14.000000</td>\n",
" <td>4.000000</td>\n",
" <td>4.000000</td>\n",
" <td>1.000000</td>\n",
" <td>...</td>\n",
" <td>0.100000</td>\n",
" <td>1.000000</td>\n",
" <td>-0.187500</td>\n",
" <td>-0.300000</td>\n",
" <td>-0.050000</td>\n",
" <td>0.500000</td>\n",
" <td>0.146667</td>\n",
" <td>0.500000</td>\n",
" <td>0.250000</td>\n",
" <td>2800.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>max</th>\n",
" <td>731.000000</td>\n",
" <td>23.000000</td>\n",
" <td>7185.000000</td>\n",
" <td>701.000000</td>\n",
" <td>1042.000000</td>\n",
" <td>650.000000</td>\n",
" <td>304.000000</td>\n",
" <td>116.000000</td>\n",
" <td>128.000000</td>\n",
" <td>91.000000</td>\n",
" <td>...</td>\n",
" <td>1.000000</td>\n",
" <td>1.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>1.000000</td>\n",
" <td>1.000000</td>\n",
" <td>0.500000</td>\n",
" <td>1.000000</td>\n",
" <td>690400.000000</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>8 rows × 60 columns</p>\n",
"</div>"
],
"text/plain": [
" timedelta n_tokens_title n_tokens_content n_unique_tokens \\\n",
"count 26561.000000 26561.000000 26561.000000 26561.000000 \n",
"mean 354.110802 10.403449 552.377282 0.555933 \n",
"std 213.485655 2.122533 472.605248 4.300199 \n",
"min 8.000000 3.000000 0.000000 0.000000 \n",
"25% 165.000000 9.000000 248.000000 0.470000 \n",
"50% 339.000000 10.000000 415.000000 0.538251 \n",
"75% 540.000000 12.000000 724.000000 0.607735 \n",
"max 731.000000 23.000000 7185.000000 701.000000 \n",
"\n",
" n_non_stop_words n_non_stop_unique_tokens num_hrefs \\\n",
"count 26561.000000 26561.000000 26561.000000 \n",
"mean 1.009337 0.696678 10.898648 \n",
"std 6.389915 3.987187 11.254509 \n",
"min 0.000000 0.000000 0.000000 \n",
"25% 1.000000 0.625430 4.000000 \n",
"50% 1.000000 0.690323 8.000000 \n",
"75% 1.000000 0.754011 14.000000 \n",
"max 1042.000000 650.000000 304.000000 \n",
"\n",
" num_self_hrefs num_imgs num_videos ... \\\n",
"count 26561.000000 26561.000000 26561.000000 ... \n",
"mean 3.304733 4.588344 1.259177 ... \n",
"std 3.855560 8.377796 4.212860 ... \n",
"min 0.000000 0.000000 0.000000 ... \n",
"25% 1.000000 1.000000 0.000000 ... \n",
"50% 3.000000 1.000000 0.000000 ... \n",
"75% 4.000000 4.000000 1.000000 ... \n",
"max 116.000000 128.000000 91.000000 ... \n",
"\n",
" min_positive_polarity max_positive_polarity avg_negative_polarity \\\n",
"count 26561.000000 26561.000000 26561.000000 \n",
"mean 0.094825 0.757686 -0.259757 \n",
"std 0.070493 0.247909 0.128229 \n",
"min 0.000000 0.000000 -1.000000 \n",
"25% 0.050000 0.600000 -0.327976 \n",
"50% 0.100000 0.800000 -0.253385 \n",
"75% 0.100000 1.000000 -0.187500 \n",
"max 1.000000 1.000000 0.000000 \n",
"\n",
" min_negative_polarity max_negative_polarity title_subjectivity \\\n",
"count 26561.000000 26561.000000 26561.000000 \n",
"mean -0.522776 -0.107678 0.282236 \n",
"std 0.290208 0.096784 0.324309 \n",
"min -1.000000 -1.000000 0.000000 \n",
"25% -0.700000 -0.125000 0.000000 \n",
"50% -0.500000 -0.100000 0.142857 \n",
"75% -0.300000 -0.050000 0.500000 \n",
"max 0.000000 0.000000 1.000000 \n",
"\n",
" title_sentiment_polarity abs_title_subjectivity \\\n",
"count 26561.000000 26561.000000 \n",
"mean 0.071113 0.342243 \n",
"std 0.266373 0.188296 \n",
"min -1.000000 0.000000 \n",
"25% 0.000000 0.166667 \n",
"50% 0.000000 0.500000 \n",
"75% 0.146667 0.500000 \n",
"max 1.000000 0.500000 \n",
"\n",
" abs_title_sentiment_polarity shares \n",
"count 26561.000000 26561.000000 \n",
"mean 0.156345 3369.156094 \n",
"std 0.227084 10971.259269 \n",
"min 0.000000 1.000000 \n",
"25% 0.000000 948.000000 \n",
"50% 0.000000 1400.000000 \n",
"75% 0.250000 2800.000000 \n",
"max 1.000000 690400.000000 \n",
"\n",
"[8 rows x 60 columns]"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"train_data.describe()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Split Data for Train and Validation"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# The target column name has a leading space: ' shares'\n",
"X = train_data.drop(columns=' shares')\n",
"y = train_data[' shares']\n",
"# Hold out 20% of the training data for validation\n",
"X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Define the Regressor and Train"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"regressor = LinearRegression() \n",
"regressor.fit(X_train, y_train)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Inspect the coefficients of the first five variables"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style>\n",
" .dataframe thead tr:only-child th {\n",
" text-align: right;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: left;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Coefficient</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>timedelta</th>\n",
" <td>1.371829</td>\n",
" </tr>\n",
" <tr>\n",
" <th>n_tokens_title</th>\n",
" <td>134.279025</td>\n",
" </tr>\n",
" <tr>\n",
" <th>n_tokens_content</th>\n",
" <td>0.321616</td>\n",
" </tr>\n",
" <tr>\n",
" <th>n_unique_tokens</th>\n",
" <td>4477.371557</td>\n",
" </tr>\n",
" <tr>\n",
" <th>n_non_stop_words</th>\n",
" <td>-2579.368312</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Coefficient\n",
" timedelta 1.371829\n",
" n_tokens_title 134.279025\n",
" n_tokens_content 0.321616\n",
" n_unique_tokens 4477.371557\n",
" n_non_stop_words -2579.368312"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"coeff_df = pd.DataFrame(regressor.coef_, X.columns, columns=['Coefficient']) \n",
"coeff_df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Predict on the validation set"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"y_pred = regressor.predict(X_val)"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"df = pd.DataFrame({'Actual': y_val, 'Predicted': y_pred})\n",
"df1 = df.head(25)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Evaluate the Performance"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Mean Absolute Error: 3174.901687993749\n",
"Mean Squared Error: 168520453.62599948\n",
"Root Mean Squared Error: 12981.542806076613\n"
]
}
],
"source": [
"print('Mean Absolute Error:', metrics.mean_absolute_error(y_val, y_pred)) \n",
"print('Mean Squared Error:', metrics.mean_squared_error(y_val, y_pred)) \n",
"print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_val, y_pred)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load Test Set"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"test_data = pd.read_csv('../data/public/test.csv')"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style>\n",
" .dataframe thead tr:only-child th {\n",
" text-align: right;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: left;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>timedelta</th>\n",
" <th>n_tokens_title</th>\n",
" <th>n_tokens_content</th>\n",
" <th>n_unique_tokens</th>\n",
" <th>n_non_stop_words</th>\n",
" <th>n_non_stop_unique_tokens</th>\n",
" <th>num_hrefs</th>\n",
" <th>num_self_hrefs</th>\n",
" <th>num_imgs</th>\n",
" <th>num_videos</th>\n",
" <th>...</th>\n",
" <th>avg_positive_polarity</th>\n",
" <th>min_positive_polarity</th>\n",
" <th>max_positive_polarity</th>\n",
" <th>avg_negative_polarity</th>\n",
" <th>min_negative_polarity</th>\n",
" <th>max_negative_polarity</th>\n",
" <th>title_subjectivity</th>\n",
" <th>title_sentiment_polarity</th>\n",
" <th>abs_title_subjectivity</th>\n",
" <th>abs_title_sentiment_polarity</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>121.0</td>\n",
" <td>12.0</td>\n",
" <td>1015.0</td>\n",
" <td>0.422018</td>\n",
" <td>1.0</td>\n",
" <td>0.545031</td>\n",
" <td>10.0</td>\n",
" <td>6.0</td>\n",
" <td>33.0</td>\n",
" <td>1.0</td>\n",
" <td>...</td>\n",
" <td>0.333534</td>\n",
" <td>0.100000</td>\n",
" <td>0.8</td>\n",
" <td>-0.160714</td>\n",
" <td>-0.50</td>\n",
" <td>-0.071429</td>\n",
" <td>0.0</td>\n",
" <td>0.00</td>\n",
" <td>0.5</td>\n",
" <td>0.00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>532.0</td>\n",
" <td>9.0</td>\n",
" <td>503.0</td>\n",
" <td>0.569697</td>\n",
" <td>1.0</td>\n",
" <td>0.737542</td>\n",
" <td>9.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>...</td>\n",
" <td>0.419786</td>\n",
" <td>0.136364</td>\n",
" <td>1.0</td>\n",
" <td>-0.157500</td>\n",
" <td>-0.25</td>\n",
" <td>-0.100000</td>\n",
" <td>0.0</td>\n",
" <td>0.00</td>\n",
" <td>0.5</td>\n",
" <td>0.00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>435.0</td>\n",
" <td>9.0</td>\n",
" <td>232.0</td>\n",
" <td>0.646018</td>\n",
" <td>1.0</td>\n",
" <td>0.748428</td>\n",
" <td>12.0</td>\n",
" <td>3.0</td>\n",
" <td>4.0</td>\n",
" <td>1.0</td>\n",
" <td>...</td>\n",
" <td>0.468750</td>\n",
" <td>0.375000</td>\n",
" <td>0.5</td>\n",
" <td>-0.427500</td>\n",
" <td>-1.00</td>\n",
" <td>-0.187500</td>\n",
" <td>0.0</td>\n",
" <td>0.00</td>\n",
" <td>0.5</td>\n",
" <td>0.00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>134.0</td>\n",
" <td>12.0</td>\n",
" <td>171.0</td>\n",
" <td>0.722892</td>\n",
" <td>1.0</td>\n",
" <td>0.867925</td>\n",
" <td>9.0</td>\n",
" <td>5.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>...</td>\n",
" <td>0.500000</td>\n",
" <td>0.500000</td>\n",
" <td>0.5</td>\n",
" <td>-0.216667</td>\n",
" <td>-0.25</td>\n",
" <td>-0.166667</td>\n",
" <td>0.4</td>\n",
" <td>-0.25</td>\n",
" <td>0.1</td>\n",
" <td>0.25</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>728.0</td>\n",
" <td>11.0</td>\n",
" <td>286.0</td>\n",
" <td>0.652632</td>\n",
" <td>1.0</td>\n",
" <td>0.800000</td>\n",
" <td>5.0</td>\n",
" <td>2.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>...</td>\n",
" <td>0.303429</td>\n",
" <td>0.100000</td>\n",
" <td>0.6</td>\n",
" <td>-0.251786</td>\n",
" <td>-0.50</td>\n",
" <td>-0.100000</td>\n",
" <td>0.2</td>\n",
" <td>-0.10</td>\n",
" <td>0.3</td>\n",
" <td>0.10</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows × 59 columns</p>\n",
"</div>"
],
"text/plain": [
" timedelta n_tokens_title n_tokens_content n_unique_tokens \\\n",
"0 121.0 12.0 1015.0 0.422018 \n",
"1 532.0 9.0 503.0 0.569697 \n",
"2 435.0 9.0 232.0 0.646018 \n",
"3 134.0 12.0 171.0 0.722892 \n",
"4 728.0 11.0 286.0 0.652632 \n",
"\n",
" n_non_stop_words n_non_stop_unique_tokens num_hrefs num_self_hrefs \\\n",
"0 1.0 0.545031 10.0 6.0 \n",
"1 1.0 0.737542 9.0 0.0 \n",
"2 1.0 0.748428 12.0 3.0 \n",
"3 1.0 0.867925 9.0 5.0 \n",
"4 1.0 0.800000 5.0 2.0 \n",
"\n",
" num_imgs num_videos ... \\\n",
"0 33.0 1.0 ... \n",
"1 1.0 1.0 ... \n",
"2 4.0 1.0 ... \n",
"3 0.0 1.0 ... \n",
"4 0.0 0.0 ... \n",
"\n",
" avg_positive_polarity min_positive_polarity max_positive_polarity \\\n",
"0 0.333534 0.100000 0.8 \n",
"1 0.419786 0.136364 1.0 \n",
"2 0.468750 0.375000 0.5 \n",
"3 0.500000 0.500000 0.5 \n",
"4 0.303429 0.100000 0.6 \n",
"\n",
" avg_negative_polarity min_negative_polarity max_negative_polarity \\\n",
"0 -0.160714 -0.50 -0.071429 \n",
"1 -0.157500 -0.25 -0.100000 \n",
"2 -0.427500 -1.00 -0.187500 \n",
"3 -0.216667 -0.25 -0.166667 \n",
"4 -0.251786 -0.50 -0.100000 \n",
"\n",
" title_subjectivity title_sentiment_polarity abs_title_subjectivity \\\n",
"0 0.0 0.00 0.5 \n",
"1 0.0 0.00 0.5 \n",
"2 0.0 0.00 0.5 \n",
"3 0.4 -0.25 0.1 \n",
"4 0.2 -0.10 0.3 \n",
"\n",
" abs_title_sentiment_polarity \n",
"0 0.00 \n",
"1 0.00 \n",
"2 0.00 \n",
"3 0.25 \n",
"4 0.10 \n",
"\n",
"[5 rows x 59 columns]"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"test_data = test_data.drop(columns='url')\n",
"test_data.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Predict on test set"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"y_test = regressor.predict(test_data)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since the target is an integer share count, convert the predictions to integers."
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Truncate toward zero, matching int() on each element\n",
"y_inttest = y_test.astype(int)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Save it in the correct format"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"df = pd.DataFrame(y_inttest,columns=[' shares'])\n",
"df.to_csv('../data/public/submission.csv',index=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To participate in the challenge, click [here](https://www.aicrowd.com/challenges/olnwp-online-news-prediction)."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.10"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
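
A note on the integer conversion step: the notebook truncates each prediction toward zero, and an unconstrained linear regression can return values below the smallest share count seen in training (1). The sketch below shows one possible alternative, rounding to the nearest integer and clipping at a lower bound. The helper name `to_share_counts` and the default bound of 1 are assumptions made for illustration; they are not part of the original notebook or of the challenge rules.

```python
import numpy as np


def to_share_counts(predictions, minimum=1):
    """Round float predictions to the nearest integer and clip them at `minimum`.

    `minimum=1` is an assumption based on the smallest `shares` value shown
    by `train_data.describe()`; adjust it if the challenge allows zero.
    """
    # np.rint rounds to the nearest integer (ties go to the even neighbour).
    rounded = np.rint(np.asarray(predictions, dtype=float))
    # Replace anything below the lower bound, then cast to an integer dtype.
    return np.clip(rounded, minimum, None).astype(int)


# Hypothetical predictions, not produced by the notebook's model.
example = np.array([-250.7, 0.4, 1399.6, 2800.5])
print(to_share_counts(example))
```

In the notebook this could stand in for the conversion cell, for example `y_inttest = to_share_counts(y_test)`. Whether it improves the leaderboard score would need to be checked against the validation split.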