@samk3211
Last active December 13, 2023 19:49
This gist is part of my blogpost on BERT. Find the complete blogpost, covering both theory and hands-on part, here: https://towardsml.com/2019/09/17/bert-explained-a-complete-guide-with-theory-and-tutorial/
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"# Importing the necessary modules\n",
"import pandas as pd\n",
"import tarfile\n",
"from sklearn.preprocessing import LabelEncoder\n",
"from sklearn.model_selection import train_test_split"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The downloaded dataset is in tgz format. We can open it using `tarfile.open()` and then extract the csv files using the `extractall()` method:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"# The context manager closes the archive even if extraction fails\n",
"with tarfile.open('data/yelp_review_polarity_csv.tgz') as data_tg:\n",
"    data_tg.extractall('data')"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<tarfile.TarFile at 0x196b1b28048>"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data_tg"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's look at the first 5 rows of the train and test datasets to understand the data we are dealing with:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" <th>1</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>Unfortunately, the frustration of being Dr. Go...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" <td>Been going to Dr. Goldberg for over 10 years. ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>1</td>\n",
" <td>I don't know what Dr. Goldberg was like before...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>1</td>\n",
" <td>I'm writing this review to give you a heads up...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>2</td>\n",
" <td>All the food is great here. But the best thing...</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" 0 1\n",
"0 1 Unfortunately, the frustration of being Dr. Go...\n",
"1 2 Been going to Dr. Goldberg for over 10 years. ...\n",
"2 1 I don't know what Dr. Goldberg was like before...\n",
"3 1 I'm writing this review to give you a heads up...\n",
"4 2 All the food is great here. But the best thing..."
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"train_df = pd.read_csv('data/yelp_review_polarity_csv/train.csv', header=None)\n",
"\n",
"train_df.head()"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" <th>1</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>2</td>\n",
" <td>Contrary to other reviews, I have zero complai...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1</td>\n",
" <td>Last summer I had an appointment to get new ti...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2</td>\n",
" <td>Friendly staff, same starbucks fair you get an...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>1</td>\n",
" <td>The food is good. Unfortunately the service is...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>2</td>\n",
" <td>Even when we didn't have a car Filene's Baseme...</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" 0 1\n",
"0 2 Contrary to other reviews, I have zero complai...\n",
"1 1 Last summer I had an appointment to get new ti...\n",
"2 2 Friendly staff, same starbucks fair you get an...\n",
"3 1 The food is good. Unfortunately the service is...\n",
"4 2 Even when we didn't have a car Filene's Baseme..."
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"test_df = pd.read_csv('data/yelp_review_polarity_csv/test.csv', header=None)\n",
"\n",
"test_df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that in our dataset a label of 1 means the review is bad while a label of 2 means the review is good.\n",
"\n",
"Let's change this to a more standard pattern — 0 and 1 labels. Let's have a label 0 for the bad review and a label 1 for the good review:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"train_df[0] = (train_df[0] == 2).astype(int)\n",
"test_df[0] = (test_df[0] == 2).astype(int)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" <th>1</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0</td>\n",
" <td>Unfortunately, the frustration of being Dr. Go...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1</td>\n",
" <td>Been going to Dr. Goldberg for over 10 years. ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>0</td>\n",
" <td>I don't know what Dr. Goldberg was like before...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>0</td>\n",
" <td>I'm writing this review to give you a heads up...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>1</td>\n",
" <td>All the food is great here. But the best thing...</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" 0 1\n",
"0 0 Unfortunately, the frustration of being Dr. Go...\n",
"1 1 Been going to Dr. Goldberg for over 10 years. ...\n",
"2 0 I don't know what Dr. Goldberg was like before...\n",
"3 0 I'm writing this review to give you a heads up...\n",
"4 1 All the food is great here. But the best thing..."
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"train_df.head()"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" <th>1</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>Contrary to other reviews, I have zero complai...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>0</td>\n",
" <td>Last summer I had an appointment to get new ti...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>1</td>\n",
" <td>Friendly staff, same starbucks fair you get an...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>0</td>\n",
" <td>The food is good. Unfortunately the service is...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>1</td>\n",
" <td>Even when we didn't have a car Filene's Baseme...</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" 0 1\n",
"0 1 Contrary to other reviews, I have zero complai...\n",
"1 0 Last summer I had an appointment to get new ti...\n",
"2 1 Friendly staff, same starbucks fair you get an...\n",
"3 0 The food is good. Unfortunately the service is...\n",
"4 1 Even when we didn't have a car Filene's Baseme..."
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"test_df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Making things BERT friendly\n",
"\n",
"1. First let's make the data compliant with BERT:\n",
"\n",
" - Column 0: An ID for the row. (Required for both *train* and *test* data.)<br>\n",
" - Column 1: The class label for the row. (Required only for *train* data.)<br>\n",
" - Column 2: A column holding the same letter in every row. This is a throw-away column that we need to include because BERT expects it. (Required only for *train* data.)<br>\n",
" - Column 3: The text examples we want to classify. (Required for both *train* and *test* data.)<br><br>\n",
" \n",
"2. We need to split the files into the format expected by BERT: BERT comes with data loading classes that expect two files called *train* and *dev* for training. In addition, BERT's data loading classes can also use a *test* file, but they expect the test file to be unlabelled. <br><br>\n",
"\n",
"3. Once the data is in the correct format, we need to save the files as .tsv (BERT doesn't take .csv as input.)"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>id</th>\n",
" <th>label</th>\n",
" <th>alpha</th>\n",
" <th>text</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>80338</th>\n",
" <td>80338</td>\n",
" <td>1</td>\n",
" <td>a</td>\n",
" <td>The best Italian around.....service matches th...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>170833</th>\n",
" <td>170833</td>\n",
" <td>0</td>\n",
" <td>a</td>\n",
" <td>This place used to be good...we've been going ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>40179</th>\n",
" <td>40179</td>\n",
" <td>1</td>\n",
" <td>a</td>\n",
" <td>We stumbled upon this location while heading f...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>383376</th>\n",
" <td>383376</td>\n",
" <td>0</td>\n",
" <td>a</td>\n",
" <td>Last night we went to this location. It was r...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>125165</th>\n",
" <td>125165</td>\n",
" <td>1</td>\n",
" <td>a</td>\n",
" <td>Quiet place with dim lighting, cozy atmosphere...</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" id label alpha text\n",
"80338 80338 1 a The best Italian around.....service matches th...\n",
"170833 170833 0 a This place used to be good...we've been going ...\n",
"40179 40179 1 a We stumbled upon this location while heading f...\n",
"383376 383376 0 a Last night we went to this location. It was r...\n",
"125165 125165 1 a Quiet place with dim lighting, cozy atmosphere..."
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Creating training dataframe according to BERT by adding the required columns\n",
"df_bert = pd.DataFrame({\n",
" 'id':range(len(train_df)),\n",
" 'label':train_df[0],\n",
" 'alpha':['a']*train_df.shape[0],\n",
" 'text': train_df[1].replace(r'\\n', ' ', regex=True)\n",
"})\n",
"\n",
"\n",
"# Splitting training data file into *train* and *dev*\n",
"df_bert_train, df_bert_dev = train_test_split(df_bert, test_size=0.01)\n",
"\n",
"df_bert_train.head()"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>id</th>\n",
" <th>text</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0</td>\n",
" <td>Contrary to other reviews, I have zero complai...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1</td>\n",
" <td>Last summer I had an appointment to get new ti...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2</td>\n",
" <td>Friendly staff, same starbucks fair you get an...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>3</td>\n",
" <td>The food is good. Unfortunately the service is...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>4</td>\n",
" <td>Even when we didn't have a car Filene's Baseme...</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" id text\n",
"0 0 Contrary to other reviews, I have zero complai...\n",
"1 1 Last summer I had an appointment to get new ti...\n",
"2 2 Friendly staff, same starbucks fair you get an...\n",
"3 3 The food is good. Unfortunately the service is...\n",
"4 4 Even when we didn't have a car Filene's Baseme..."
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Creating test dataframe according to BERT\n",
"df_bert_test = pd.DataFrame({\n",
" 'id':range(len(test_df)),\n",
" 'text': test_df[1].replace(r'\\n', ' ', regex=True)\n",
"})\n",
"\n",
"df_bert_test.head()"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"# Saving dataframes to .tsv format as required by BERT\n",
"df_bert_train.to_csv('data/train.tsv', sep='\\t', index=False, header=False)\n",
"df_bert_dev.to_csv('data/dev.tsv', sep='\\t', index=False, header=False)\n",
"df_bert_test.to_csv('data/test.tsv', sep='\\t', index=False, header=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we are ready for training using the scripts in the BERT repo."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
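
The layout built in the notebook can be sanity-checked on a toy frame before running it on the full dataset. The following is a minimal sketch, not the author's code: the three sample rows are made up, and an in-memory buffer stands in for `data/train.tsv`. It applies the same label mapping and newline replacement as the notebook, writes the frame as tab-separated text, and reads it back.

```python
import io

import pandas as pd

# Toy stand-in for the Yelp data (made-up rows): column 0 is the label
# (1 = bad, 2 = good), column 1 is the review text.
raw = pd.DataFrame({0: [1, 2, 2], 1: ["Bad\nservice", "Great food", "Loved it"]})

# Map labels {1, 2} to {0, 1}, as in the notebook.
labels = (raw[0] == 2).astype(int)

# Build the four-column layout: id, label, throw-away letter, text.
df_bert = pd.DataFrame({
    "id": range(len(raw)),
    "label": labels,
    "alpha": ["a"] * len(raw),
    "text": raw[1].replace(r"\n", " ", regex=True),
})

# Write to an in-memory buffer (assumption: stands in for data/train.tsv),
# then read it back without a header row.
buf = io.StringIO()
df_bert.to_csv(buf, sep="\t", index=False, header=False)
buf.seek(0)
round_trip = pd.read_csv(buf, sep="\t", header=None)
```

If `round_trip` has four columns and the text column contains no line breaks, the same steps should produce a file with one example per line on the real data.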
@TaylorHawkes

For some reason, when saving, the alpha ('a') column was written as the first column, which was messing up the training.
I changed the "alpha" column name to "poop" and that fixed it. (I think it is just saving columns alphabetically; maybe there is a better fix here, haha.)

df_bert = pd.DataFrame({
'id':range(len(train_df)),
'label':train_df[0],
'poop':['a']*train_df.shape[0],
'text': train_df[1].replace(r'\n', ' ', regex=True)
})
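
A possible alternative to renaming the column, assuming the cause is that older pandas versions (before 0.23, or running on Python older than 3.6) sort dict keys alphabetically when building a DataFrame: pass the desired order through the constructor's `columns` argument. This is only a sketch; the two-row `train_df` and the `BERT_COLUMNS` name are hypothetical stand-ins.

```python
import pandas as pd

# Hypothetical two-row stand-in for train_df (label in column 0, text in column 1).
train_df = pd.DataFrame({0: [0, 1], 1: ["bad", "good"]})

# Column order the .tsv file should have.
BERT_COLUMNS = ["id", "label", "alpha", "text"]

df_bert = pd.DataFrame(
    {
        "id": range(len(train_df)),
        "label": train_df[0],
        "alpha": ["a"] * train_df.shape[0],
        "text": train_df[1].replace(r"\n", " ", regex=True),
    },
    columns=BERT_COLUMNS,  # pins the order regardless of how dict keys are handled
)
```

With the order pinned this way, the throw-away column can keep the name `alpha`.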

@Gaurav-AL

Hi, could you tell me how I can map words (only nouns and adjectives) to particular classes and properties of a graph database? Do I have to create a dataset of each word with its respective classes and properties in an ontology?

I am working on a Q&A system for a graph database. I found research papers that do this, but I can't work out how to achieve it.
