ameerkat/worldnews_fastai_classifier.ipynb

## worldnews_fastai_classifier.ipynb
{
 "cells": [
  {
   "source": [
    "# Pre-Requisites\n",
    "Before you get started you'll need [Jupyter Notebook](https://jupyter.org/install) and fast.ai (`pip install fastai`). The code below uses the fastai v1 API (`fastai.text`, `TextList`), so you may need to pin a 1.x release. If you also want to upload or download your data and models to S3, install the AWS CLI and set up your credentials.\n",
    "\n",
    "Based on https://github.com/fastai/course-v3/blob/master/nbs/dl1/lesson3-imdb.ipynb\n",
    "\n",
    "# Setup\n",
    "The dataset is a zipped folder containing a \"submissions.json\" file with one JSON object per line. This file is generated from pushshift.io responses, following [this blog post](https://www.osrsbox.com/blog/2019/03/18/watercooler-scraping-an-entire-subreddit-2007scape/). The zipped folder also contains a `data` subfolder that holds the text of every referenced article, with files named in the format `<post id>.txt`."
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
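  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `submissions.json` layout described above (one JSON object per line) can be read lazily, one record at a time. The following is a minimal sketch rather than the code used later in this notebook; the helper name `iter_json_lines` is hypothetical, and skipping blank lines is an assumption about the file.\n",
    "\n",
    "```python\n",
    "import json\n",
    "\n",
    "def iter_json_lines(path):\n",
    "    # Yield one parsed object per non-empty line of a JSON-lines file.\n",
    "    with open(path, 'r', encoding='utf-8') as handle:\n",
    "        for line in handle:\n",
    "            line = line.strip()\n",
    "            if line:  # assumption: blank lines carry no record\n",
    "                yield json.loads(line)\n",
    "```\n",
    "\n",
    "Reading lazily keeps memory use flat regardless of how many submissions the file holds."
   ]
  },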
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Optional: load the dataset from S3. Note that this dataset isn't \"public\":\n",
    "# the bucket does not allow others to download from it.\n",
    "!aws s3 cp s3://ameerayoub-datasets/reddit/r-worldnews-2018-2.zip ./\n",
    "!mkdir -p /home/ameerkat/src/reddit-post-master/20201102\n",
    "!unzip -q ./r-worldnews-2018-2.zip -d /home/ameerkat/src/reddit-post-master/20201102\n",
    "# !ls /home/ameerkat/src/reddit-post-master/20201102/r-worldnews/data -1 | wc -l"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "##\n",
    "# Config\n",
    "##\n",
    "\n",
    "subreddit_id = \"20201102/r-worldnews\"\n",
    "subreddit_name = \"r-worldnews\"\n",
    "root_path = \"/home/ameerkat/src/reddit-post-master/\"\n",
    "bs=50 # batch size for fast.ai\n",
    "max_lm_articles = 300000 # max size for LM training (> 300k we run out of memory)\n",
    "max_classifier_articles = 100000 # max size for classifier training"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from fastai import *\n",
    "from fastai.text import *\n",
    "\n",
    "import json\n",
    "import pandas as pd\n",
    "import os\n",
    "\n",
    "##\n",
    "# Paths\n",
    "##\n",
    "subreddit_root_path = os.path.join(root_path, subreddit_id)\n",
    "submission_file_path = os.path.join(subreddit_root_path, \"submissions.loaded2.json\")\n",
    "# where all the text files with article text are\n",
    "data_path = os.path.join(subreddit_root_path, \"data\")\n",
    "# Note: I use the term \"batch\" in two senses. The large batch is the subset of the full\n",
    "# data set that I process at once (usually on the order of 100k articles), limited by\n",
    "# memory. The training batch size is the parameter passed to fast.ai.\n",
    "# The batch path here refers to the former.\n",
    "batch_path = os.path.join(subreddit_root_path, \"batch\")\n",
    "models_path = os.path.join(root_path, f\"models/{subreddit_name}\")\n",
    "\n",
    "##\n",
    "# Control\n",
    "##\n",
    "recreate_lm_databunch = False\n",
    "\n",
    "##\n",
    "# Setup\n",
    "##\n",
    "os.makedirs(models_path, exist_ok=True)\n",
    "os.makedirs(batch_path, exist_ok=True)\n",
    "torch.cuda.set_device(0)"
   ]
  },
  {
   "source": [
    "# Language Model Training\n",
    "\n",
    "Because training takes hours per epoch, we save checkpoints along the way, named by the cumulative number of fine-tuning epochs: `rwn_fine_tuned_5`, `rwn_fine_tuned_7`, and so on."
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "c79158375bb447a98d70d52564524a60",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=300000.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    }
   ],
   "source": [
    "from random import sample\n",
    "import shutil  # needed for shutil.rmtree below\n",
    "from shutil import copyfile\n",
    "from tqdm.notebook import tqdm\n",
    "\n",
    "lm_databunch_file = os.path.join(models_path, f\"{subreddit_name}_data_lm.pkl\")\n",
    "if recreate_lm_databunch or not os.path.exists(lm_databunch_file):\n",
    "    shutil.rmtree(batch_path)\n",
    "    os.makedirs(batch_path)\n",
    "\n",
    "    all_data_files = os.listdir(data_path)\n",
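    "    # take a random sample of the dataset; this can be repeated over multiple iterations\n",
    "    # (capped at the number of available files so sample() cannot raise ValueError)\n",
    "    keep_data_files = sample(all_data_files, min(max_lm_articles, len(all_data_files)))\n",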
    "    for file in tqdm(keep_data_files):\n",
    "        copyfile(os.path.join(data_path, file), os.path.join(batch_path, file))\n",
    "\n",
    "    # as per https://github.com/fastai/fastai/issues/1737\n",
    "    defaults.cpus=1\n",
    "    data_lm = (TextList.from_folder(batch_path)\n",
    "            .split_by_rand_pct(0.1)\n",
    "            .label_for_lm()\n",
    "            .databunch(bs=bs))\n",
    "    data_lm.save(lm_databunch_file)\n",
    "else:\n",
    "    data_lm = load_data(models_path, lm_databunch_file, bs=bs)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# You don't need to run this every time: run it once, find a good learning rate, and hardcode it below\n",
    "learn.lr_find()\n",
    "learn.recorder.plot()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: left;\">\n",
       "      <th>epoch</th>\n",
       "      <th>train_loss</th>\n",
       "      <th>valid_loss</th>\n",
       "      <th>accuracy</th>\n",
       "      <th>time</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <td>0</td>\n",
       "      <td>3.583047</td>\n",
       "      <td>3.379510</td>\n",
       "      <td>0.392645</td>\n",
       "      <td>3:03:58</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "\n",
       "    <div>\n",
       "        <style>\n",
       "            /* Turns off some styling */\n",
       "            progress {\n",
       "                /* gets rid of default border in Firefox and Opera. */\n",
       "                border: none;\n",
       "                /* Needs to be in here for Safari polyfill so background images work as expected. */\n",
       "                background-size: auto;\n",
       "            }\n",
       "            .progress-bar-interrupted, .progress-bar-interrupted::-webkit-progress-bar {\n",
       "                background: #F44336;\n",
       "            }\n",
       "        </style>\n",
       "      <progress value='2' class='' max='10' style='width:300px; height:20px; vertical-align: middle;'></progress>\n",
       "      20.00% [2/10 6:15:02<25:00:09]\n",
       "    </div>\n",
       "    \n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: left;\">\n",
       "      <th>epoch</th>\n",
       "      <th>train_loss</th>\n",
       "      <th>valid_loss</th>\n",
       "      <th>accuracy</th>\n",
       "      <th>time</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <td>0</td>\n",
       "      <td>3.212387</td>\n",
       "      <td>3.141225</td>\n",
       "      <td>0.424708</td>\n",
       "      <td>3:07:21</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>1</td>\n",
       "      <td>3.081309</td>\n",
       "      <td>3.090769</td>\n",
       "      <td>0.432667</td>\n",
       "      <td>3:07:40</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table><p>\n",
       "\n",
       "    <div>\n",
       "        <style>\n",
       "            /* Turns off some styling */\n",
       "            progress {\n",
       "                /* gets rid of default border in Firefox and Opera. */\n",
       "                border: none;\n",
       "                /* Needs to be in here for Safari polyfill so background images work as expected. */\n",
       "                background-size: auto;\n",
       "            }\n",
       "            .progress-bar-interrupted, .progress-bar-interrupted::-webkit-progress-bar {\n",
       "                background: #F44336;\n",
       "            }\n",
       "        </style>\n",
       "      <progress value='29969' class='' max='51184' style='width:300px; height:20px; vertical-align: middle;'></progress>\n",
       "      58.55% [29969/51184 1:45:54<1:14:58 3.0464]\n",
       "    </div>\n",
       "    "
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "learn.fit_one_cycle(1, 1e-2, moms=(0.8,0.7))\n",
    "learn.save(os.path.join(models_path, 'rwn_fit_head'))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: left;\">\n",
       "      <th>epoch</th>\n",
       "      <th>train_loss</th>\n",
       "      <th>valid_loss</th>\n",
       "      <th>accuracy</th>\n",
       "      <th>time</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <td>0</td>\n",
       "      <td>3.200785</td>\n",
       "      <td>3.145802</td>\n",
       "      <td>0.425478</td>\n",
       "      <td>3:06:32</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>1</td>\n",
       "      <td>3.142123</td>\n",
       "      <td>3.066345</td>\n",
       "      <td>0.434870</td>\n",
       "      <td>3:07:09</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>2</td>\n",
       "      <td>3.206728</td>\n",
       "      <td>3.012934</td>\n",
       "      <td>0.442035</td>\n",
       "      <td>3:07:22</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>3</td>\n",
       "      <td>3.075316</td>\n",
       "      <td>2.958103</td>\n",
       "      <td>0.450822</td>\n",
       "      <td>3:07:29</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>4</td>\n",
       "      <td>2.960969</td>\n",
       "      <td>2.935036</td>\n",
       "      <td>0.454467</td>\n",
       "      <td>3:07:53</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "learn.load(os.path.join(models_path, 'rwn_fit_head'))\n",
    "learn.unfreeze()\n",
    "learn.fit_one_cycle(5, 1e-3, moms=(0.8,0.7))\n",
    "learn.save(os.path.join(models_path, 'rwn_fine_tuned_5'))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: left;\">\n",
       "      <th>epoch</th>\n",
       "      <th>train_loss</th>\n",
       "      <th>valid_loss</th>\n",
       "      <th>accuracy</th>\n",
       "      <th>time</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <td>0</td>\n",
       "      <td>3.099285</td>\n",
       "      <td>3.018965</td>\n",
       "      <td>0.441490</td>\n",
       "      <td>3:07:38</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>1</td>\n",
       "      <td>3.020579</td>\n",
       "      <td>2.933351</td>\n",
       "      <td>0.454636</td>\n",
       "      <td>3:07:48</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "learn.load(os.path.join(models_path, 'rwn_fine_tuned_5'))\n",
    "learn.unfreeze()\n",
    "learn.fit_one_cycle(2, 1e-3, moms=(0.8,0.7))\n",
    "learn.save(os.path.join(models_path, 'rwn_fine_tuned_7'))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: left;\">\n",
       "      <th>epoch</th>\n",
       "      <th>train_loss</th>\n",
       "      <th>valid_loss</th>\n",
       "      <th>accuracy</th>\n",
       "      <th>time</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <td>0</td>\n",
       "      <td>3.054885</td>\n",
       "      <td>3.002252</td>\n",
       "      <td>0.444502</td>\n",
       "      <td>3:07:43</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>1</td>\n",
       "      <td>3.161615</td>\n",
       "      <td>3.026899</td>\n",
       "      <td>0.440154</td>\n",
       "      <td>3:08:01</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>2</td>\n",
       "      <td>3.076710</td>\n",
       "      <td>2.986131</td>\n",
       "      <td>0.446628</td>\n",
       "      <td>3:07:42</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>3</td>\n",
       "      <td>3.045680</td>\n",
       "      <td>2.934598</td>\n",
       "      <td>0.454457</td>\n",
       "      <td>3:07:39</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>4</td>\n",
       "      <td>3.051756</td>\n",
       "      <td>2.914251</td>\n",
       "      <td>0.457602</td>\n",
       "      <td>3:07:45</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "learn.load(os.path.join(models_path, 'rwn_fine_tuned_7'))\n",
    "learn.unfreeze()\n",
    "learn.fit_one_cycle(5, 1e-3, moms=(0.8,0.7))\n",
    "learn.save(os.path.join(models_path, 'rwn_fine_tuned_12'))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "learn.load(os.path.join(models_path, 'rwn_fine_tuned_12'))\n",
    "final_lm_model_path = os.path.join(models_path, 'rwn_fine_tuned_12_enc')\n",
    "learn.save_encoder(final_lm_model_path)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Learn for score prediction\n",
    "\n",
    "We'll use 5 classes (negative, neutral, okay, good, great), with score cutoffs of < 1, exactly 1, 2-9, 10-99, and 100+ to match the function below. Note that the right cutoffs could vary by subreddit. Scores that fall outside every bracket get the label \"unknown\"."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def score_to_class(score):\n",
    "    # lower inclusive, upper exclusive, label\n",
    "    score_brackets = [\n",
    "        [None, 1, \"negative\"],\n",
    "        [1, 2, \"neutral\"],\n",
    "        [2, 10, \"okay\"],\n",
    "        [10, 100, \"good\"],\n",
    "        [100, None, \"great\"]\n",
    "    ]\n",
    "\n",
    "    for lower_bound, upper_bound, label in score_brackets:\n",
    "        if lower_bound is None or score >= lower_bound:\n",
    "            if upper_bound is None or score < upper_bound:\n",
    "                return label\n",
    "    return \"unknown\""
   ]
  },
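  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The bracket lookup in `score_to_class` can also be expressed with the standard library's `bisect` module. This is a sketch of an equivalent mapping, assuming numeric scores and the same cutoffs as the cell above; the names `CUTOFFS`, `LABELS`, and `score_to_class_bisect` are hypothetical and are not used elsewhere in this notebook.\n",
    "\n",
    "```python\n",
    "from bisect import bisect_right\n",
    "\n",
    "# Upper-exclusive cutoffs copied from score_brackets above.\n",
    "CUTOFFS = [1, 2, 10, 100]\n",
    "LABELS = ['negative', 'neutral', 'okay', 'good', 'great']\n",
    "\n",
    "def score_to_class_bisect(score):\n",
    "    # bisect_right counts how many cutoffs are <= score, which is the label index.\n",
    "    return LABELS[bisect_right(CUTOFFS, score)]\n",
    "```\n",
    "\n",
    "Because the cutoffs cover the whole number line, this version never needs an \"unknown\" fallback for numeric input."
   ]
  },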
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th>text</th>\n",
       "      <th>target</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <td>xxbos xxmaj our local experts specialize in xxmaj homes xxmaj for xxmaj sale , representing both xxmaj home xxmaj buyers and xxmaj home xxmaj sellers . \\n  xxmaj local xxmaj real xxmaj estate xxmaj pros \\n  &gt; \\n  xxmaj albuquerque , xxup nm | xxmaj rhode xxmaj island | xxmaj illinois | xxmaj xxunk , xxup fl | xxmaj xxunk , xxup ar | xxmaj new xxmaj</td>\n",
       "      <td>neutral</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>xxbos xxmaj trump warns xxmaj kim xxmaj jong - un he could end up like xxmaj libya 's xxmaj gaddafi unless he makes nuclear deal \\n  xxmaj the comments were made at the xxmaj white   xxmaj house   \\n  xxmaj thursday 17 xxmaj may 2018 xxunk \\n  { { ^morethanten } } \\n  { { total } } comments \\n  { { / morethanten</td>\n",
       "      <td>great</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>xxbos xxmaj news &gt; xxmaj world &gt; xxmaj asia \\n  xxmaj south and xxmaj north xxmaj korea to hold another round of high - level talks on the border \\n  xxmaj talks would follow unprecedented xxmaj april meeting between xxmaj north xxmaj korean leader xxmaj kim xxmaj jong - un and xxmaj south xxmaj korean president xxmaj moon xxmaj jae - in \\n  xxmaj tuesday 15 xxmaj</td>\n",
       "      <td>good</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>xxbos xxmaj pope xxmaj francis xxmaj blasts ‘ xxmaj fake xxmaj news , ’ xxmaj compares xxmaj it to ‘ xxmaj crafty xxmaj serpent ’ xxmaj who xxmaj deceived xxmaj eve \\n  “ xxmaj we need to unmask what could be called the ‘ snake tactics ’ , ” says pope \\n  xxmaj jon xxmaj levine | xxmaj january 24 , 2018 @ xxunk xxup am xxmaj last</td>\n",
       "      <td>negative</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>xxbos xxup cnn 's xxmaj anthony xxmaj bourdain dead at 61 \\n  xxmaj by xxmaj brian xxmaj stelter , xxup cnn \\n  xxmaj updated 8:49 xxup pm xxup et , xxmaj fri xxmaj june 8 , 2018 \\n  xxmaj chat with us in xxmaj facebook xxmaj messenger . xxmaj find out what 's happening in the world as it unfolds . \\n  xxup just xxup watched</td>\n",
       "      <td>okay</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# we're going to split the loaded text data across multiple .pkl files since \n",
    "# they can be very large\n",
    "file_idx = 0\n",
    "max_line_idx = max_classifier_articles\n",
    "cur_line_idx = 0\n",
    "loaded_fields = [\"created_utc\", \"id\", \"score\", \"title\", \"url\"]\n",
    "\n",
    "# merge in the text that comes from the <id>.txt file and the label which comes \n",
    "# from the score_to_class function above\n",
    "def initialize_data_dict():\n",
    "    global merged_result\n",
    "    merged_result = {\"text\": [], \"label\": []}\n",
    "    for field in loaded_fields:\n",
    "        merged_result[field] = []\n",
    "\n",
    "def save_data_dict(file_idx):\n",
    "    global data\n",
    "    data_frame = pd.DataFrame.from_dict(merged_result)\n",
    "    data = (TextList.from_df(data_frame)\n",
    "            .split_by_rand_pct(0.2)\n",
    "            .label_from_df(cols=\"label\")\n",
    "            .databunch(bs=bs))\n",
    "    data.save(os.path.join(models_path, f'rwn_data_clas_{file_idx}.pkl'))\n",
    "\n",
    "initialize_data_dict()\n",
    "for line in open(submission_file_path, \"r\"):\n",
    "    if cur_line_idx >= max_line_idx:\n",
    "        cur_line_idx = 0\n",
    "        save_data_dict(file_idx)\n",
    "        file_idx += 1\n",
    "        initialize_data_dict()\n",
    "\n",
    "    o = json.loads(line)\n",
    "    text_file_path = os.path.join(data_path, o[\"id\"] + \".txt\")\n",
    "    if not os.path.exists(text_file_path):\n",
    "        continue\n",
    "    cur_line_idx += 1\n",
    "    merged_result[\"label\"].append(score_to_class(o[\"score\"]))\n",
    "    with open(text_file_path, \"r\") as text_file:\n",
    "        merged_result[\"text\"].append(text_file.read())\n",
    "    for field in loaded_fields:\n",
    "        merged_result[field].append(o[field])\n",
    "\n",
    "if merged_result[\"text\"]:  # the dict always has keys, so check for actual rows\n",
    "    save_data_dict(file_idx)\n",
    "\n",
    "data.show_batch()"
   ]
  },
  {
   "source": [
    "# I train on only one \"file\" of loaded data at a time because of memory constraints;\n",
    "# to use a different file, change the index below and retrain\n",
    "file_idx_to_use = 0\n",
    "\n",
    "data = load_data(models_path, f'rwn_data_clas_{file_idx_to_use}.pkl', bs=bs)\n",
    "data_lm = load_data(models_path, lm_databunch_file, bs=bs)\n",
    "data.vocab.itos = data_lm.vocab.itos\n",
    "learn = text_classifier_learner(data, AWD_LSTM, drop_mult=0.5)\n",
    "learn.load_encoder(final_lm_model_path)"
   ],
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: left;\">\n",
       "      <th>epoch</th>\n",
       "      <th>train_loss</th>\n",
       "      <th>valid_loss</th>\n",
       "      <th>accuracy</th>\n",
       "      <th>time</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <td>0</td>\n",
       "      <td>1.321669</td>\n",
       "      <td>1.266969</td>\n",
       "      <td>0.462900</td>\n",
       "      <td>27:41</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "learn.fit_one_cycle(1, 5e-3)\n",
    "learn.save(os.path.join(models_path, 'first'))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "\n",
       "    <div>\n",
       "        <style>\n",
       "            /* Turns off some styling */\n",
       "            progress {\n",
       "                /* gets rid of default border in Firefox and Opera. */\n",
       "                border: none;\n",
       "                /* Needs to be in here for Safari polyfill so background images work as expected. */\n",
       "                background-size: auto;\n",
       "            }\n",
       "            .progress-bar-interrupted, .progress-bar-interrupted::-webkit-progress-bar {\n",
       "                background: #F44336;\n",
       "            }\n",
       "        </style>\n",
       "      <progress value='2' class='' max='5' style='width:300px; height:20px; vertical-align: middle;'></progress>\n",
       "      40.00% [2/5 57:17<1:25:56]\n",
       "    </div>\n",
       "    \n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: left;\">\n",
       "      <th>epoch</th>\n",
       "      <th>train_loss</th>\n",
       "      <th>valid_loss</th>\n",
       "      <th>accuracy</th>\n",
       "      <th>time</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <td>0</td>\n",
       "      <td>1.302755</td>\n",
       "      <td>1.241917</td>\n",
       "      <td>0.470550</td>\n",
       "      <td>32:02</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>1</td>\n",
       "      <td>1.346404</td>\n",
       "      <td>1.243466</td>\n",
       "      <td>0.464650</td>\n",
       "      <td>25:15</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table><p>\n",
       "\n",
       "    <div>\n",
       "        <style>\n",
       "            /* Turns off some styling */\n",
       "            progress {\n",
       "                /* gets rid of default border in Firefox and Opera. */\n",
       "                border: none;\n",
       "                /* Needs to be in here for Safari polyfill so background images work as expected. */\n",
       "                background-size: auto;\n",
       "            }\n",
       "            .progress-bar-interrupted, .progress-bar-interrupted::-webkit-progress-bar {\n",
       "                background: #F44336;\n",
       "            }\n",
       "        </style>\n",
       "      <progress value='943' class='' max='1600' style='width:300px; height:20px; vertical-align: middle;'></progress>\n",
       "      58.94% [943/1600 13:32<09:26 1.3236]\n",
       "    </div>\n",
       "    "
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "learn.load(os.path.join(models_path, 'first'))\n",
    "learn.fit_one_cycle(5, 5e-3)\n",
    "learn.save(os.path.join(models_path, 'second'))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: left;\">\n",
       "      <th>epoch</th>\n",
       "      <th>train_loss</th>\n",
       "      <th>valid_loss</th>\n",
       "      <th>accuracy</th>\n",
       "      <th>time</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <td>0</td>\n",
       "      <td>1.217338</td>\n",
       "      <td>1.135739</td>\n",
       "      <td>0.507369</td>\n",
       "      <td>36:14</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>1</td>\n",
       "      <td>1.194410</td>\n",
       "      <td>1.140757</td>\n",
       "      <td>0.506670</td>\n",
       "      <td>35:03</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>2</td>\n",
       "      <td>1.219232</td>\n",
       "      <td>1.141407</td>\n",
       "      <td>0.509768</td>\n",
       "      <td>29:28</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>3</td>\n",
       "      <td>1.157828</td>\n",
       "      <td>1.132628</td>\n",
       "      <td>0.509418</td>\n",
       "      <td>29:37</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>4</td>\n",
       "      <td>1.225461</td>\n",
       "      <td>1.151327</td>\n",
       "      <td>0.504172</td>\n",
       "      <td>34:45</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "\n",
       "    <div>\n",
       "        <style>\n",
       "            /* Turns off some styling */\n",
       "            progress {\n",
       "                /* gets rid of default border in Firefox and Opera. */\n",
       "                border: none;\n",
       "                /* Needs to be in here for Safari polyfill so background images work as expected. */\n",
       "                background-size: auto;\n",
       "            }\n",
       "            .progress-bar-interrupted, .progress-bar-interrupted::-webkit-progress-bar {\n",
       "                background: #F44336;\n",
       "            }\n",
       "        </style>\n",
       "      <progress value='0' class='' max='15' style='width:300px; height:20px; vertical-align: middle;'></progress>\n",
       "      0.00% [0/15 00:00<00:00]\n",
       "    </div>\n",
       "    \n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: left;\">\n",
       "      <th>epoch</th>\n",
       "      <th>train_loss</th>\n",
       "      <th>valid_loss</th>\n",
       "      <th>accuracy</th>\n",
       "      <th>time</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "  </tbody>\n",
       "</table><p>\n",
       "\n",
       "    <div>\n",
       "        <style>\n",
       "            /* Turns off some styling */\n",
       "            progress {\n",
       "                /* gets rid of default border in Firefox and Opera. */\n",
       "                border: none;\n",
       "                /* Needs to be in here for Safari polyfill so background images work as expected. */\n",
       "                background-size: auto;\n",
       "            }\n",
       "            .progress-bar-interrupted, .progress-bar-interrupted::-webkit-progress-bar {\n",
       "                background: #F44336;\n",
       "            }\n",
       "        </style>\n",
       "      <progress value='1104' class='' max='1601' style='width:300px; height:20px; vertical-align: middle;'></progress>\n",
       "      68.96% [1104/1601 18:11<08:11 1.1724]\n",
       "    </div>\n",
       "    "
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "learn.load(os.path.join(models_path, 'second'))\n",
    "learn.fit_one_cycle(5, 5e-3)\n",
    "learn.save(os.path.join(models_path, 'third'))\n",
    "learn.freeze_to(-3)\n",
    "learn.fit_one_cycle(15, slice(5e-3/(2.6**4), 5e-3), moms=(0.8,0.7))\n",
    "learn.save(os.path.join(models_path, 'third_fine'))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "learn.load(os.path.join(models_path, 'third'))\n",
    "learn.export(os.path.join(models_path, 'third.pkl'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Inference"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Collecting boilerpipe3\n",
      "\u001b[?25l  Downloading https://files.pythonhosted.org/packages/07/65/06b50fb9dc05a2a286f5e169b11d4048c15b17e69f0b54b61bf34d8c14b6/boilerpipe3-1.3.tar.gz (1.3MB)\n",
      "\u001b[K     |████████████████████████████████| 1.3MB 1.6MB/s eta 0:00:01\n",
      "\u001b[?25hCollecting JPype1 (from boilerpipe3)\n",
      "\u001b[?25l  Downloading https://files.pythonhosted.org/packages/03/25/d137466fb4a6b145c3dfac3a53475a77b0c6c320fc14d9cb273703c2c12c/JPype1-0.7.3-cp37-cp37m-manylinux1_x86_64.whl (2.9MB)\n",
      "\u001b[K     |████████████████████████████████| 2.9MB 3.6MB/s eta 0:00:01\n",
      "\u001b[?25hCollecting charade (from boilerpipe3)\n",
      "\u001b[?25l  Downloading https://files.pythonhosted.org/packages/74/26/565610c87e951b8a3182df890589c280a16c5897cfbca97eebd73705e0c6/charade-1.0.3.tar.gz (168kB)\n",
      "\u001b[K     |████████████████████████████████| 174kB 4.5MB/s eta 0:00:01\n",
      "\u001b[?25hBuilding wheels for collected packages: boilerpipe3, charade\n",
      "  Building wheel for boilerpipe3 (setup.py) ... \u001b[?25ldone\n",
      "\u001b[?25h  Created wheel for boilerpipe3: filename=boilerpipe3-1.3-cp37-none-any.whl size=1321065 sha256=5472f8f1c74b3af67052cc02664d2f205be61b98defb55874c5ca263d6794cb9\n",
      "  Stored in directory: /home/ameerkat/.cache/pip/wheels/b0/3f/95/1451acd92dc1a911f9f3b7877a1c9dda45009bab520bb9417c\n",
      "  Building wheel for charade (setup.py) ... \u001b[?25ldone\n",
      "\u001b[?25h  Created wheel for charade: filename=charade-1.0.3-cp37-none-any.whl size=187072 sha256=52f25cee729c4923b6cd3e5513aef69654884aebe728e4dc0c776f8569d35355\n",
      "  Stored in directory: /home/ameerkat/.cache/pip/wheels/17/e4/b6/f27d4d6c000855ea7180a28099cd8d758b1d5debce4fac2d65\n",
      "Successfully built boilerpipe3 charade\n",
      "Installing collected packages: JPype1, charade, boilerpipe3\n",
      "Successfully installed JPype1-0.7.3 boilerpipe3-1.3 charade-1.0.3\n"
     ]
    }
   ],
   "source": [
    "!pip install boilerpipe3"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Viewpoints and reviews of Digital Startups\n",
      "Viewpoints and reviews of Digital Startups\n",
      "Menu\n",
      "Home » Advice » Airbnb to Cut 25% of Workforce: What does a new-normal mean for the gig economy?\n",
      "Airbnb to Cut 25% of Workforce: What does a new-normal mean for the gig economy?\n",
      "Gig workers were among the hardest hit the business segments as a fallout of the COVID-19 pandemic, lockdowns around the globe. Gig workers earn when they work, and hence most of them saw their earnings come to a grinding halt as large segments of the world went under lockdown and stay-at-home orders in March and April.\n",
      "Gig economy startups like Uber, Airbnb, Lyft, Etsy or TaskRabbit, food and package delivery startups and others attracted individuals who were attracted to flexible schedules and the ability to drive their earnings at their personal pace. This sector employs millions of people across the globe.  The Bureau of Labor Statistics in the US reported in 2017 that 55 million people in the U.S. are “gig workers”.  In Europe, 9.7 percent of adults from 14 EU countries participated in the Gig economy. Asian giants like India also added millions of flexi workers in the past few years.  The official estimates aren’t very accurate since many people also moonlight as gig-workers to complement their regular jobs.\n",
      "Until the pandemic stuck, the gig economy was heralded as pathbreaker that could upend existing business models. Enabled by well designed digital platforms, many of these consolidators prompted individuals to explore opportunities as self-employed drivers, task and service providers. Gig consolidators attracted individuals who didn’t want to be shackled by corporate norms with an ability to choose assignments that make the most of their talents and reflect their true interests. However, as the global economy slowly limps back, many industry leaders wonder what the ‘new normal’ and social distancing could mean for the gig-economy.\n",
      "Most companies and governments classify gig-workers as freelancers and not full-time employees. These don’t have protections like guaranteed wages, sick pay and health care, which corporate workers take for granted. According to a recent World Economic Forum report\n",
      "Gig workers are quitting jobs due to a fall in demand and safety concerns.\n",
      "The majority of gig workers now have no income due to COVID-19.\n",
      "Almost 70% of gig workers were not satisfied with the support provided by the company they work for.\n",
      "Image: observer.com\n",
      "Airbnb effect on Gig economy\n",
      "The Gig economy darling, Airbnb Inc. announced that it is cutting a quarter of its workforce (about 1,900 jobs) while reducing investments in noncore operations.\n",
      "Airbnb’s Co-founder and Chief Executive Brian Chesky told employees about the cuts in a memo Tuesday, adding that the company’s revenue forecast for this year is “less than half” of last year’s level. “We are collectively living through the most harrowing crisis of our lifetime, and as it began to unfold, global travel came to a standstill,” Mr. Chesky told employees in a memo Tuesday. “Airbnb’s business has been hit hard.”\n",
      "Rideshare giant Uber is considering layoffs of 5000 of staff (about 20% of the workforce). Its big rival Lyft also announced it would be reducing the employee count by 17% (about 982 employees) and furloughing an additional 288, due to the effects of the COVID-19 pandemic and its impact on its business.\n",
      "These layoff announcements are just the tip of the iceberg since they only count full-time employees employed by these firms directly in support, IT and operations management. These announcements don’t count the number of gig-workers who either drop off the platforms or are unable to continue to make a living as uber drivers or Airbnb hosts.\n",
      "Longer term lessons on sharing economy startups\n",
      "The success of sharing economy hinges on a healthy ecosystem that includes\n",
      "Independent workers paid by the gig (i.e., a task or a project)\n",
      "Consumers who need a specific service, for example, a ride to their next destination, or items delivered\n",
      "Gig consolidators including app-based technology platforms connect the worker to the consumer\n",
      "The announcement by the gig-working giant is likely to send ripples among other gig-economy giants. Startups have been innovating in business models across a wide cross-section of the economy trying to upend traditional businesses in travel, hospitality and supply chain segments. In a post-COVID world individuals will be hesitant to become gig workers without a social safety net. Likewise, consumers will be slow to get back to their old habits.\n",
      "Fewer people are likely to travel till the end of 2020, and the few people who venture out may be hesitant to get on a stranger’s car or home without adequate safeguards. All of this will put additional cost burdens and expenses on gig workers, service providers and home hosts.\n",
      "Startups, entrepreneurs and gig consolidators will have to factor in the new normal while re-designing newer sharing-economy models. Gig companies will also have to address the additional financial risks gig workers take while engaging with the platform.\n",
      "Other References\n",
      "\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "(Category neutral, tensor(3), tensor([0.0663, 0.0317, 0.0459, 0.7888, 0.0673]))"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "learn.load(os.path.join(models_path, 'third'))\n",
    "test_url = \"http://www.mydigitalstartup.net/2020/05/06/gig-economy/\"\n",
    "from boilerpipe.extract import Extractor\n",
    "extractor = Extractor(extractor='ArticleExtractor', url=test_url)\n",
    "extracted_text = extractor.getText()\n",
    "print(extracted_text)\n",
    "\n",
    "learn.predict(extracted_text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Upload"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "upload: models/worldnews/first.pth to s3://ameerayoub-datasets/reddit/models/worldnews/first.pth\n",
      "upload: models/worldnews/models/tmp.pth to s3://ameerayoub-datasets/reddit/models/worldnews/models/tmp.pth\n",
      "upload: models/worldnews/rwn_fine_tuned_12_enc.pth to s3://ameerayoub-datasets/reddit/models/worldnews/rwn_fine_tuned_12_enc.pth\n",
      "upload: models/worldnews/rwn_fine_tuned_12.pth to s3://ameerayoub-datasets/reddit/models/worldnews/rwn_fine_tuned_12.pth\n",
      "upload: models/worldnews/rwn_fine_tuned_5.pth to s3://ameerayoub-datasets/reddit/models/worldnews/rwn_fine_tuned_5.pth\n",
      "upload: models/worldnews/rwn_fit_head.pth to s3://ameerayoub-datasets/reddit/models/worldnews/rwn_fit_head.pth\n",
      "upload: models/worldnews/rwn_fine_tuned_7.pth to s3://ameerayoub-datasets/reddit/models/worldnews/rwn_fine_tuned_7.pth\n",
      "upload: models/worldnews/second.pth to s3://ameerayoub-datasets/reddit/models/worldnews/second.pth\n",
      "upload: models/worldnews/rwn_data_clas.pkl to s3://ameerayoub-datasets/reddit/models/worldnews/rwn_data_clas.pkl\n",
      "upload: models/worldnews/third.pkl to s3://ameerayoub-datasets/reddit/models/worldnews/third.pkl\n",
      "upload: models/worldnews/third.pth to s3://ameerayoub-datasets/reddit/models/worldnews/third.pth\n",
      "upload: models/worldnews/r_worldnews_data_lm.pkl to s3://ameerayoub-datasets/reddit/models/worldnews/r_worldnews_data_lm.pkl\n"
     ]
    }
   ],
   "source": [
    "!aws s3 sync ./models s3://ameerayoub-datasets/reddit/models"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}