samk3211/Classifying Yelp Reviews using BERT.ipynb

## Classifying Yelp Reviews using BERT.ipynb
{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Importing the necessary modules\n",
    "import pandas as pd\n",
    "import tarfile\n",
    "from sklearn.preprocessing import LabelEncoder\n",
    "from sklearn.model_selection import train_test_split"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The downloaded dataset is in tgz format. We can open it using `tarfile.open()` and then extract the csv files using the `extractall()` method:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "data_tg = tarfile.open('data/yelp_review_polarity_csv.tgz')\n",
    "data_tg.extractall('data')\n",
    "data_tg.close()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<tarfile.TarFile at 0x196b1b28048>"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data_tg"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's look at the the first 5 rows of the train and test datasets to understand the data we are dealing with:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>Unfortunately, the frustration of being Dr. Go...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2</td>\n",
       "      <td>Been going to Dr. Goldberg for over 10 years. ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1</td>\n",
       "      <td>I don't know what Dr. Goldberg was like before...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1</td>\n",
       "      <td>I'm writing this review to give you a heads up...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>2</td>\n",
       "      <td>All the food is great here. But the best thing...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   0                                                  1\n",
       "0  1  Unfortunately, the frustration of being Dr. Go...\n",
       "1  2  Been going to Dr. Goldberg for over 10 years. ...\n",
       "2  1  I don't know what Dr. Goldberg was like before...\n",
       "3  1  I'm writing this review to give you a heads up...\n",
       "4  2  All the food is great here. But the best thing..."
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    " train_df = pd.read_csv('data/yelp_review_polarity_csv/train.csv', header=None)\n",
    "\n",
    "train_df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>2</td>\n",
       "      <td>Contrary to other reviews, I have zero complai...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>Last summer I had an appointment to get new ti...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2</td>\n",
       "      <td>Friendly staff, same starbucks fair you get an...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1</td>\n",
       "      <td>The food is good. Unfortunately the service is...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>2</td>\n",
       "      <td>Even when we didn't have a car Filene's Baseme...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   0                                                  1\n",
       "0  2  Contrary to other reviews, I have zero complai...\n",
       "1  1  Last summer I had an appointment to get new ti...\n",
       "2  2  Friendly staff, same starbucks fair you get an...\n",
       "3  1  The food is good. Unfortunately the service is...\n",
       "4  2  Even when we didn't have a car Filene's Baseme..."
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "test_df = pd.read_csv('data/yelp_review_polarity_csv/test.csv', header=None)\n",
    "\n",
    "test_df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can see that in our dataset a label of 1 means the review is bad while a label of 2 means the review is good.\n",
    "\n",
    "Let's change this to a more standard pattern — 0 and 1 labels. Let's have a label 0 for the bad review and a label 1 for the good review:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "train_df[0] = (train_df[0] == 2).astype(int)\n",
    "test_df[0] = (test_df[0] == 2).astype(int)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>Unfortunately, the frustration of being Dr. Go...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>Been going to Dr. Goldberg for over 10 years. ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0</td>\n",
       "      <td>I don't know what Dr. Goldberg was like before...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0</td>\n",
       "      <td>I'm writing this review to give you a heads up...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>1</td>\n",
       "      <td>All the food is great here. But the best thing...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   0                                                  1\n",
       "0  0  Unfortunately, the frustration of being Dr. Go...\n",
       "1  1  Been going to Dr. Goldberg for over 10 years. ...\n",
       "2  0  I don't know what Dr. Goldberg was like before...\n",
       "3  0  I'm writing this review to give you a heads up...\n",
       "4  1  All the food is great here. But the best thing..."
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "train_df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>Contrary to other reviews, I have zero complai...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0</td>\n",
       "      <td>Last summer I had an appointment to get new ti...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1</td>\n",
       "      <td>Friendly staff, same starbucks fair you get an...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0</td>\n",
       "      <td>The food is good. Unfortunately the service is...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>1</td>\n",
       "      <td>Even when we didn't have a car Filene's Baseme...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   0                                                  1\n",
       "0  1  Contrary to other reviews, I have zero complai...\n",
       "1  0  Last summer I had an appointment to get new ti...\n",
       "2  1  Friendly staff, same starbucks fair you get an...\n",
       "3  0  The food is good. Unfortunately the service is...\n",
       "4  1  Even when we didn't have a car Filene's Baseme..."
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "test_df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Making things BERT friendly\n",
    "\n",
    "1. First let's make the data compliant with BERT:\n",
    "\n",
    "    - Column 0: An ID for the row. (Required both for *train* and *test* data.)<br>\n",
    "    - Column 1: The class label for the row. (Required only for *train* data.)<br>\n",
    "    - Column 2: A column of the same letter for all rows — this is a throw-away column that we need to include because BERT expects it. (Required only for *train* data.)<br>\n",
    "    - Column 3: The text examples we want to classify. (Required both for *train* and *test* data.)<BR><br>\n",
    "    \n",
    "2. We need to split the files into the format expected by BERT: BERT comes with data loading classes that expects two files called *train* and *dev* for training. In addition, BERT’s data loading classes can also use a *test* file but it expects the test file to be unlabelled. <br><br>\n",
    "\n",
    "3. Once the data is in the correct format, we need to save the files as .tsv (BERT doesn't take .csv as input.)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>id</th>\n",
       "      <th>label</th>\n",
       "      <th>alpha</th>\n",
       "      <th>text</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>80338</th>\n",
       "      <td>80338</td>\n",
       "      <td>1</td>\n",
       "      <td>a</td>\n",
       "      <td>The best Italian around.....service matches th...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>170833</th>\n",
       "      <td>170833</td>\n",
       "      <td>0</td>\n",
       "      <td>a</td>\n",
       "      <td>This place used to be good...we've been going ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>40179</th>\n",
       "      <td>40179</td>\n",
       "      <td>1</td>\n",
       "      <td>a</td>\n",
       "      <td>We stumbled upon this location while heading f...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>383376</th>\n",
       "      <td>383376</td>\n",
       "      <td>0</td>\n",
       "      <td>a</td>\n",
       "      <td>Last night we went to this location.  It was r...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>125165</th>\n",
       "      <td>125165</td>\n",
       "      <td>1</td>\n",
       "      <td>a</td>\n",
       "      <td>Quiet place with dim lighting, cozy atmosphere...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "            id  label alpha                                               text\n",
       "80338    80338      1     a  The best Italian around.....service matches th...\n",
       "170833  170833      0     a  This place used to be good...we've been going ...\n",
       "40179    40179      1     a  We stumbled upon this location while heading f...\n",
       "383376  383376      0     a  Last night we went to this location.  It was r...\n",
       "125165  125165      1     a  Quiet place with dim lighting, cozy atmosphere..."
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Creating training dataframe according to BERT by adding the required columns\n",
    "df_bert = pd.DataFrame({\n",
    "    'id':range(len(train_df)),\n",
    "    'label':train_df[0],\n",
    "    'alpha':['a']*train_df.shape[0],\n",
    "    'text': train_df[1].replace(r'\\n', ' ', regex=True)\n",
    "})\n",
    "\n",
    "\n",
    "# Splitting training data file into *train* and *dev*\n",
    "df_bert_train, df_bert_dev = train_test_split(df_bert, test_size=0.01)\n",
    "\n",
    "df_bert_train.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>id</th>\n",
       "      <th>text</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>Contrary to other reviews, I have zero complai...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>Last summer I had an appointment to get new ti...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2</td>\n",
       "      <td>Friendly staff, same starbucks fair you get an...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>3</td>\n",
       "      <td>The food is good. Unfortunately the service is...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>4</td>\n",
       "      <td>Even when we didn't have a car Filene's Baseme...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   id                                               text\n",
       "0   0  Contrary to other reviews, I have zero complai...\n",
       "1   1  Last summer I had an appointment to get new ti...\n",
       "2   2  Friendly staff, same starbucks fair you get an...\n",
       "3   3  The food is good. Unfortunately the service is...\n",
       "4   4  Even when we didn't have a car Filene's Baseme..."
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Creating test dataframe according to BERT\n",
    "df_bert_test = pd.DataFrame({\n",
    "    'id':range(len(test_df)),\n",
    "    'text': test_df[1].replace(r'\\n', ' ', regex=True)\n",
    "})\n",
    "\n",
    "df_bert_test.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Saving dataframes to .tsv format as required by BERT\n",
    "df_bert_train.to_csv('data/train.tsv', sep='\\t', index=False, header=False)\n",
    "df_bert_dev.to_csv('data/dev.tsv', sep='\\t', index=False, header=False)\n",
    "df_bert_test.to_csv('data/test.tsv', sep='\\t', index=False, header=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we are ready for training using the scripts in the BERT repo."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
	{
	"cells": [
	{
	"cell_type": "code",
	"execution_count": 1,
	"metadata": {},
	"outputs": [],
	"source": [
	"# Importing the necessary modules\n",
	"import pandas as pd\n",
	"import tarfile\n",
	"from sklearn.preprocessing import LabelEncoder\n",
	"from sklearn.model_selection import train_test_split"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"The downloaded dataset is in tgz format. We can open it using `tarfile.open()` and then extract the csv files using the `extractall()` method:"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 2,
	"metadata": {},
	"outputs": [],
	"source": [
	"data_tg = tarfile.open('data/yelp_review_polarity_csv.tgz')\n",
	"data_tg.extractall('data')\n",
	"data_tg.close()"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 3,
	"metadata": {},
	"outputs": [
	{
	"data": {
	"text/plain": [
	"<tarfile.TarFile at 0x196b1b28048>"
	]
	},
	"execution_count": 3,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"data_tg"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"Let's look at the the first 5 rows of the train and test datasets to understand the data we are dealing with:"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 4,
	"metadata": {},
	"outputs": [
	{
	"data": {
	"text/html": [
	"<div>\n",
	"<style scoped>\n",
	" .dataframe tbody tr th:only-of-type {\n",
	" vertical-align: middle;\n",
	" }\n",
	"\n",
	" .dataframe tbody tr th {\n",
	" vertical-align: top;\n",
	" }\n",
	"\n",
	" .dataframe thead th {\n",
	" text-align: right;\n",
	" }\n",
	"</style>\n",
	"<table border=\"1\" class=\"dataframe\">\n",
	" <thead>\n",
	" <tr style=\"text-align: right;\">\n",
	" <th></th>\n",
	" <th>0</th>\n",
	" <th>1</th>\n",
	" </tr>\n",
	" </thead>\n",
	" <tbody>\n",
	" <tr>\n",
	" <th>0</th>\n",
	" <td>1</td>\n",
	" <td>Unfortunately, the frustration of being Dr. Go...</td>\n",
	" </tr>\n",
	" <tr>\n",
	" <th>1</th>\n",
	" <td>2</td>\n",
	" <td>Been going to Dr. Goldberg for over 10 years. ...</td>\n",
	" </tr>\n",
	" <tr>\n",
	" <th>2</th>\n",
	" <td>1</td>\n",
	" <td>I don't know what Dr. Goldberg was like before...</td>\n",
	" </tr>\n",
	" <tr>\n",
	" <th>3</th>\n",
	" <td>1</td>\n",
	" <td>I'm writing this review to give you a heads up...</td>\n",
	" </tr>\n",
	" <tr>\n",
	" <th>4</th>\n",
	" <td>2</td>\n",
	" <td>All the food is great here. But the best thing...</td>\n",
	" </tr>\n",
	" </tbody>\n",
	"</table>\n",
	"</div>"
	],
	"text/plain": [
	" 0 1\n",
	"0 1 Unfortunately, the frustration of being Dr. Go...\n",
	"1 2 Been going to Dr. Goldberg for over 10 years. ...\n",
	"2 1 I don't know what Dr. Goldberg was like before...\n",
	"3 1 I'm writing this review to give you a heads up...\n",
	"4 2 All the food is great here. But the best thing..."
	]
	},
	"execution_count": 4,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	" train_df = pd.read_csv('data/yelp_review_polarity_csv/train.csv', header=None)\n",
	"\n",
	"train_df.head()"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 5,
	"metadata": {},
	"outputs": [
	{
	"data": {
	"text/html": [
	"<div>\n",
	"<style scoped>\n",
	" .dataframe tbody tr th:only-of-type {\n",
	" vertical-align: middle;\n",
	" }\n",
	"\n",
	" .dataframe tbody tr th {\n",
	" vertical-align: top;\n",
	" }\n",
	"\n",
	" .dataframe thead th {\n",
	" text-align: right;\n",
	" }\n",
	"</style>\n",
	"<table border=\"1\" class=\"dataframe\">\n",
	" <thead>\n",
	" <tr style=\"text-align: right;\">\n",
	" <th></th>\n",
	" <th>0</th>\n",
	" <th>1</th>\n",
	" </tr>\n",
	" </thead>\n",
	" <tbody>\n",
	" <tr>\n",
	" <th>0</th>\n",
	" <td>2</td>\n",
	" <td>Contrary to other reviews, I have zero complai...</td>\n",
	" </tr>\n",
	" <tr>\n",
	" <th>1</th>\n",
	" <td>1</td>\n",
	" <td>Last summer I had an appointment to get new ti...</td>\n",
	" </tr>\n",
	" <tr>\n",
	" <th>2</th>\n",
	" <td>2</td>\n",
	" <td>Friendly staff, same starbucks fair you get an...</td>\n",
	" </tr>\n",
	" <tr>\n",
	" <th>3</th>\n",
	" <td>1</td>\n",
	" <td>The food is good. Unfortunately the service is...</td>\n",
	" </tr>\n",
	" <tr>\n",
	" <th>4</th>\n",
	" <td>2</td>\n",
	" <td>Even when we didn't have a car Filene's Baseme...</td>\n",
	" </tr>\n",
	" </tbody>\n",
	"</table>\n",
	"</div>"
	],
	"text/plain": [
	" 0 1\n",
	"0 2 Contrary to other reviews, I have zero complai...\n",
	"1 1 Last summer I had an appointment to get new ti...\n",
	"2 2 Friendly staff, same starbucks fair you get an...\n",
	"3 1 The food is good. Unfortunately the service is...\n",
	"4 2 Even when we didn't have a car Filene's Baseme..."
	]
	},
	"execution_count": 5,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"test_df = pd.read_csv('data/yelp_review_polarity_csv/test.csv', header=None)\n",
	"\n",
	"test_df.head()"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"We can see that in our dataset a label of 1 means the review is bad while a label of 2 means the review is good.\n",
	"\n",
	"Let's change this to a more standard pattern — 0 and 1 labels. Let's have a label 0 for the bad review and a label 1 for the good review:"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 6,
	"metadata": {},
	"outputs": [],
	"source": [
	"train_df[0] = (train_df[0] == 2).astype(int)\n",
	"test_df[0] = (test_df[0] == 2).astype(int)"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 7,
	"metadata": {},
	"outputs": [
	{
	"data": {
	"text/html": [
	"<div>\n",
	"<style scoped>\n",
	" .dataframe tbody tr th:only-of-type {\n",
	" vertical-align: middle;\n",
	" }\n",
	"\n",
	" .dataframe tbody tr th {\n",
	" vertical-align: top;\n",
	" }\n",
	"\n",
	" .dataframe thead th {\n",
	" text-align: right;\n",
	" }\n",
	"</style>\n",
	"<table border=\"1\" class=\"dataframe\">\n",
	" <thead>\n",
	" <tr style=\"text-align: right;\">\n",
	" <th></th>\n",
	" <th>0</th>\n",
	" <th>1</th>\n",
	" </tr>\n",
	" </thead>\n",
	" <tbody>\n",
	" <tr>\n",
	" <th>0</th>\n",
	" <td>0</td>\n",
	" <td>Unfortunately, the frustration of being Dr. Go...</td>\n",
	" </tr>\n",
	" <tr>\n",
	" <th>1</th>\n",
	" <td>1</td>\n",
	" <td>Been going to Dr. Goldberg for over 10 years. ...</td>\n",
	" </tr>\n",
	" <tr>\n",
	" <th>2</th>\n",
	" <td>0</td>\n",
	" <td>I don't know what Dr. Goldberg was like before...</td>\n",
	" </tr>\n",
	" <tr>\n",
	" <th>3</th>\n",
	" <td>0</td>\n",
	" <td>I'm writing this review to give you a heads up...</td>\n",
	" </tr>\n",
	" <tr>\n",
	" <th>4</th>\n",
	" <td>1</td>\n",
	" <td>All the food is great here. But the best thing...</td>\n",
	" </tr>\n",
	" </tbody>\n",
	"</table>\n",
	"</div>"
	],
	"text/plain": [
	" 0 1\n",
	"0 0 Unfortunately, the frustration of being Dr. Go...\n",
	"1 1 Been going to Dr. Goldberg for over 10 years. ...\n",
	"2 0 I don't know what Dr. Goldberg was like before...\n",
	"3 0 I'm writing this review to give you a heads up...\n",
	"4 1 All the food is great here. But the best thing..."
	]
	},
	"execution_count": 7,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"train_df.head()"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 8,
	"metadata": {},
	"outputs": [
	{
	"data": {
	"text/html": [
	"<div>\n",
	"<style scoped>\n",
	" .dataframe tbody tr th:only-of-type {\n",
	" vertical-align: middle;\n",
	" }\n",
	"\n",
	" .dataframe tbody tr th {\n",
	" vertical-align: top;\n",
	" }\n",
	"\n",
	" .dataframe thead th {\n",
	" text-align: right;\n",
	" }\n",
	"</style>\n",
	"<table border=\"1\" class=\"dataframe\">\n",
	" <thead>\n",
	" <tr style=\"text-align: right;\">\n",
	" <th></th>\n",
	" <th>0</th>\n",
	" <th>1</th>\n",
	" </tr>\n",
	" </thead>\n",
	" <tbody>\n",
	" <tr>\n",
	" <th>0</th>\n",
	" <td>1</td>\n",
	" <td>Contrary to other reviews, I have zero complai...</td>\n",
	" </tr>\n",
	" <tr>\n",
	" <th>1</th>\n",
	" <td>0</td>\n",
	" <td>Last summer I had an appointment to get new ti...</td>\n",
	" </tr>\n",
	" <tr>\n",
	" <th>2</th>\n",
	" <td>1</td>\n",
	" <td>Friendly staff, same starbucks fair you get an...</td>\n",
	" </tr>\n",
	" <tr>\n",
	" <th>3</th>\n",
	" <td>0</td>\n",
	" <td>The food is good. Unfortunately the service is...</td>\n",
	" </tr>\n",
	" <tr>\n",
	" <th>4</th>\n",
	" <td>1</td>\n",
	" <td>Even when we didn't have a car Filene's Baseme...</td>\n",
	" </tr>\n",
	" </tbody>\n",
	"</table>\n",
	"</div>"
	],
	"text/plain": [
	" 0 1\n",
	"0 1 Contrary to other reviews, I have zero complai...\n",
	"1 0 Last summer I had an appointment to get new ti...\n",
	"2 1 Friendly staff, same starbucks fair you get an...\n",
	"3 0 The food is good. Unfortunately the service is...\n",
	"4 1 Even when we didn't have a car Filene's Baseme..."
	]
	},
	"execution_count": 8,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"test_df.head()"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"#### Making things BERT friendly\n",
	"\n",
	"1. First let's make the data compliant with BERT:\n",
	"\n",
	" - Column 0: An ID for the row. (Required both for train and test data.)<br>\n",
	" - Column 1: The class label for the row. (Required only for train data.)<br>\n",
	" - Column 2: A column of the same letter for all rows — this is a throw-away column that we need to include because BERT expects it. (Required only for train data.)<br>\n",
	" - Column 3: The text examples we want to classify. (Required both for train and test data.)<BR><br>\n",
	" \n",
	"2. We need to split the files into the format expected by BERT: BERT comes with data loading classes that expects two files called train and dev for training. In addition, BERT’s data loading classes can also use a test file but it expects the test file to be unlabelled. <br><br>\n",
	"\n",
	"3. Once the data is in the correct format, we need to save the files as .tsv (BERT doesn't take .csv as input.)"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 9,
	"metadata": {},
	"outputs": [
	{
	"data": {
	"text/html": [
	"<div>\n",
	"<style scoped>\n",
	" .dataframe tbody tr th:only-of-type {\n",
	" vertical-align: middle;\n",
	" }\n",
	"\n",
	" .dataframe tbody tr th {\n",
	" vertical-align: top;\n",
	" }\n",
	"\n",
	" .dataframe thead th {\n",
	" text-align: right;\n",
	" }\n",
	"</style>\n",
	"<table border=\"1\" class=\"dataframe\">\n",
	" <thead>\n",
	" <tr style=\"text-align: right;\">\n",
	" <th></th>\n",
	" <th>id</th>\n",
	" <th>label</th>\n",
	" <th>alpha</th>\n",
	" <th>text</th>\n",
	" </tr>\n",
	" </thead>\n",
	" <tbody>\n",
	" <tr>\n",
	" <th>80338</th>\n",
	" <td>80338</td>\n",
	" <td>1</td>\n",
	" <td>a</td>\n",
	" <td>The best Italian around.....service matches th...</td>\n",
	" </tr>\n",
	" <tr>\n",
	" <th>170833</th>\n",
	" <td>170833</td>\n",
	" <td>0</td>\n",
	" <td>a</td>\n",
	" <td>This place used to be good...we've been going ...</td>\n",
	" </tr>\n",
	" <tr>\n",
	" <th>40179</th>\n",
	" <td>40179</td>\n",
	" <td>1</td>\n",
	" <td>a</td>\n",
	" <td>We stumbled upon this location while heading f...</td>\n",
	" </tr>\n",
	" <tr>\n",
	" <th>383376</th>\n",
	" <td>383376</td>\n",
	" <td>0</td>\n",
	" <td>a</td>\n",
	" <td>Last night we went to this location. It was r...</td>\n",
	" </tr>\n",
	" <tr>\n",
	" <th>125165</th>\n",
	" <td>125165</td>\n",
	" <td>1</td>\n",
	" <td>a</td>\n",
	" <td>Quiet place with dim lighting, cozy atmosphere...</td>\n",
	" </tr>\n",
	" </tbody>\n",
	"</table>\n",
	"</div>"
	],
	"text/plain": [
	" id label alpha text\n",
	"80338 80338 1 a The best Italian around.....service matches th...\n",
	"170833 170833 0 a This place used to be good...we've been going ...\n",
	"40179 40179 1 a We stumbled upon this location while heading f...\n",
	"383376 383376 0 a Last night we went to this location. It was r...\n",
	"125165 125165 1 a Quiet place with dim lighting, cozy atmosphere..."
	]
	},
	"execution_count": 9,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"# Creating training dataframe according to BERT by adding the required columns\n",
	"df_bert = pd.DataFrame({\n",
	" 'id':range(len(train_df)),\n",
	" 'label':train_df[0],\n",
	" 'alpha':['a']*train_df.shape[0],\n",
	" 'text': train_df[1].replace(r'\\n', ' ', regex=True)\n",
	"})\n",
	"\n",
	"\n",
	"# Splitting training data file into train and dev\n",
	"df_bert_train, df_bert_dev = train_test_split(df_bert, test_size=0.01)\n",
	"\n",
	"df_bert_train.head()"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 10,
	"metadata": {},
	"outputs": [
	{
	"data": {
	"text/html": [
	"<div>\n",
	"<style scoped>\n",
	" .dataframe tbody tr th:only-of-type {\n",
	" vertical-align: middle;\n",
	" }\n",
	"\n",
	" .dataframe tbody tr th {\n",
	" vertical-align: top;\n",
	" }\n",
	"\n",
	" .dataframe thead th {\n",
	" text-align: right;\n",
	" }\n",
	"</style>\n",
	"<table border=\"1\" class=\"dataframe\">\n",
	" <thead>\n",
	" <tr style=\"text-align: right;\">\n",
	" <th></th>\n",
	" <th>id</th>\n",
	" <th>text</th>\n",
	" </tr>\n",
	" </thead>\n",
	" <tbody>\n",
	" <tr>\n",
	" <th>0</th>\n",
	" <td>0</td>\n",
	" <td>Contrary to other reviews, I have zero complai...</td>\n",
	" </tr>\n",
	" <tr>\n",
	" <th>1</th>\n",
	" <td>1</td>\n",
	" <td>Last summer I had an appointment to get new ti...</td>\n",
	" </tr>\n",
	" <tr>\n",
	" <th>2</th>\n",
	" <td>2</td>\n",
	" <td>Friendly staff, same starbucks fair you get an...</td>\n",
	" </tr>\n",
	" <tr>\n",
	" <th>3</th>\n",
	" <td>3</td>\n",
	" <td>The food is good. Unfortunately the service is...</td>\n",
	" </tr>\n",
	" <tr>\n",
	" <th>4</th>\n",
	" <td>4</td>\n",
	" <td>Even when we didn't have a car Filene's Baseme...</td>\n",
	" </tr>\n",
	" </tbody>\n",
	"</table>\n",
	"</div>"
	],
	"text/plain": [
	" id text\n",
	"0 0 Contrary to other reviews, I have zero complai...\n",
	"1 1 Last summer I had an appointment to get new ti...\n",
	"2 2 Friendly staff, same starbucks fair you get an...\n",
	"3 3 The food is good. Unfortunately the service is...\n",
	"4 4 Even when we didn't have a car Filene's Baseme..."
	]
	},
	"execution_count": 10,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"# Creating test dataframe according to BERT\n",
	"df_bert_test = pd.DataFrame({\n",
	" 'id':range(len(test_df)),\n",
	" 'text': test_df[1].replace(r'\\n', ' ', regex=True)\n",
	"})\n",
	"\n",
	"df_bert_test.head()"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 11,
	"metadata": {},
	"outputs": [],
	"source": [
	"# Saving dataframes to .tsv format as required by BERT\n",
	"df_bert_train.to_csv('data/train.tsv', sep='\\t', index=False, header=False)\n",
	"df_bert_dev.to_csv('data/dev.tsv', sep='\\t', index=False, header=False)\n",
	"df_bert_test.to_csv('data/test.tsv', sep='\\t', index=False, header=False)"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"Now we are ready for training using the scripts in the BERT repo."
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": []
	}
	],
	"metadata": {
	"kernelspec": {
	"display_name": "Python 3",
	"language": "python",
	"name": "python3"
	},
	"language_info": {
	"codemirror_mode": {
	"name": "ipython",
	"version": 3
	},
	"file_extension": ".py",
	"mimetype": "text/x-python",
	"name": "python",
	"nbconvert_exporter": "python",
	"pygments_lexer": "ipython3",
	"version": "3.7.3"
	}
	},
	"nbformat": 4,
	"nbformat_minor": 2
	}