Last active
December 13, 2023 19:49
This gist is part of my blogpost on BERT. Find the complete blogpost, covering both theory and hands-on part, here: https://towardsml.com/2019/09/17/bert-explained-a-complete-guide-with-theory-and-tutorial/
{ | |
"cells": [ | |
{ | |
"cell_type": "code", | |
"execution_count": 1, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# Importing the necessary modules\n", | |
"import pandas as pd\n", | |
"import tarfile\n", | |
"from sklearn.preprocessing import LabelEncoder\n", | |
"from sklearn.model_selection import train_test_split" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"The downloaded dataset is in tgz format. We can open it using `tarfile.open()` and then extract the csv files using the `extractall()` method:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 2, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"data_tg = tarfile.open('data/yelp_review_polarity_csv.tgz')\n", | |
"data_tg.extractall('data')\n", | |
"data_tg.close()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 3, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"<tarfile.TarFile at 0x196b1b28048>" | |
] | |
}, | |
"execution_count": 3, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"data_tg" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Let's look at the first 5 rows of the train and test datasets to understand the data we are dealing with:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 4, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>0</th>\n", | |
" <th>1</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>1</td>\n", | |
" <td>Unfortunately, the frustration of being Dr. Go...</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>2</td>\n", | |
" <td>Been going to Dr. Goldberg for over 10 years. ...</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>1</td>\n", | |
" <td>I don't know what Dr. Goldberg was like before...</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>1</td>\n", | |
" <td>I'm writing this review to give you a heads up...</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td>2</td>\n", | |
" <td>All the food is great here. But the best thing...</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" 0 1\n", | |
"0 1 Unfortunately, the frustration of being Dr. Go...\n", | |
"1 2 Been going to Dr. Goldberg for over 10 years. ...\n", | |
"2 1 I don't know what Dr. Goldberg was like before...\n", | |
"3 1 I'm writing this review to give you a heads up...\n", | |
"4 2 All the food is great here. But the best thing..." | |
] | |
}, | |
"execution_count": 4, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"train_df = pd.read_csv('data/yelp_review_polarity_csv/train.csv', header=None)\n", | |
"\n", | |
"train_df.head()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 5, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>0</th>\n", | |
" <th>1</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>2</td>\n", | |
" <td>Contrary to other reviews, I have zero complai...</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>1</td>\n", | |
" <td>Last summer I had an appointment to get new ti...</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>2</td>\n", | |
" <td>Friendly staff, same starbucks fair you get an...</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>1</td>\n", | |
" <td>The food is good. Unfortunately the service is...</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td>2</td>\n", | |
" <td>Even when we didn't have a car Filene's Baseme...</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" 0 1\n", | |
"0 2 Contrary to other reviews, I have zero complai...\n", | |
"1 1 Last summer I had an appointment to get new ti...\n", | |
"2 2 Friendly staff, same starbucks fair you get an...\n", | |
"3 1 The food is good. Unfortunately the service is...\n", | |
"4 2 Even when we didn't have a car Filene's Baseme..." | |
] | |
}, | |
"execution_count": 5, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"test_df = pd.read_csv('data/yelp_review_polarity_csv/test.csv', header=None)\n", | |
"\n", | |
"test_df.head()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"We can see that in our dataset a label of 1 means the review is bad while a label of 2 means the review is good.\n", | |
"\n", | |
"Let's change this to a more standard pattern — 0 and 1 labels. Let's have a label 0 for the bad review and a label 1 for the good review:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 6, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"train_df[0] = (train_df[0] == 2).astype(int)\n", | |
"test_df[0] = (test_df[0] == 2).astype(int)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 7, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>0</th>\n", | |
" <th>1</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>0</td>\n", | |
" <td>Unfortunately, the frustration of being Dr. Go...</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>1</td>\n", | |
" <td>Been going to Dr. Goldberg for over 10 years. ...</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>0</td>\n", | |
" <td>I don't know what Dr. Goldberg was like before...</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>0</td>\n", | |
" <td>I'm writing this review to give you a heads up...</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td>1</td>\n", | |
" <td>All the food is great here. But the best thing...</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" 0 1\n", | |
"0 0 Unfortunately, the frustration of being Dr. Go...\n", | |
"1 1 Been going to Dr. Goldberg for over 10 years. ...\n", | |
"2 0 I don't know what Dr. Goldberg was like before...\n", | |
"3 0 I'm writing this review to give you a heads up...\n", | |
"4 1 All the food is great here. But the best thing..." | |
] | |
}, | |
"execution_count": 7, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"train_df.head()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 8, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>0</th>\n", | |
" <th>1</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>1</td>\n", | |
" <td>Contrary to other reviews, I have zero complai...</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>0</td>\n", | |
" <td>Last summer I had an appointment to get new ti...</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>1</td>\n", | |
" <td>Friendly staff, same starbucks fair you get an...</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>0</td>\n", | |
" <td>The food is good. Unfortunately the service is...</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td>1</td>\n", | |
" <td>Even when we didn't have a car Filene's Baseme...</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" 0 1\n", | |
"0 1 Contrary to other reviews, I have zero complai...\n", | |
"1 0 Last summer I had an appointment to get new ti...\n", | |
"2 1 Friendly staff, same starbucks fair you get an...\n", | |
"3 0 The food is good. Unfortunately the service is...\n", | |
"4 1 Even when we didn't have a car Filene's Baseme..." | |
] | |
}, | |
"execution_count": 8, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"test_df.head()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"#### Making things BERT friendly\n", | |
"\n", | |
"1. First let's make the data compliant with BERT:\n", | |
"\n", | |
" - Column 0: An ID for the row. (Required both for *train* and *test* data.)<br>\n", | |
" - Column 1: The class label for the row. (Required only for *train* data.)<br>\n", | |
" - Column 2: A column of the same letter for all rows — this is a throw-away column that we need to include because BERT expects it. (Required only for *train* data.)<br>\n", | |
" - Column 3: The text examples we want to classify. (Required both for *train* and *test* data.)<br><br>\n", | |
" \n", | |
"2. We need to split the files into the format expected by BERT: BERT comes with data loading classes that expect two files called *train* and *dev* for training. In addition, BERT’s data loading classes can also use a *test* file, but they expect the test file to be unlabelled. <br><br>\n", | |
"\n", | |
"3. Once the data is in the correct format, we need to save the files as .tsv (BERT doesn't take .csv as input.)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 9, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>id</th>\n", | |
" <th>label</th>\n", | |
" <th>alpha</th>\n", | |
" <th>text</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>80338</th>\n", | |
" <td>80338</td>\n", | |
" <td>1</td>\n", | |
" <td>a</td>\n", | |
" <td>The best Italian around.....service matches th...</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>170833</th>\n", | |
" <td>170833</td>\n", | |
" <td>0</td>\n", | |
" <td>a</td>\n", | |
" <td>This place used to be good...we've been going ...</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>40179</th>\n", | |
" <td>40179</td>\n", | |
" <td>1</td>\n", | |
" <td>a</td>\n", | |
" <td>We stumbled upon this location while heading f...</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>383376</th>\n", | |
" <td>383376</td>\n", | |
" <td>0</td>\n", | |
" <td>a</td>\n", | |
" <td>Last night we went to this location. It was r...</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>125165</th>\n", | |
" <td>125165</td>\n", | |
" <td>1</td>\n", | |
" <td>a</td>\n", | |
" <td>Quiet place with dim lighting, cozy atmosphere...</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" id label alpha text\n", | |
"80338 80338 1 a The best Italian around.....service matches th...\n", | |
"170833 170833 0 a This place used to be good...we've been going ...\n", | |
"40179 40179 1 a We stumbled upon this location while heading f...\n", | |
"383376 383376 0 a Last night we went to this location. It was r...\n", | |
"125165 125165 1 a Quiet place with dim lighting, cozy atmosphere..." | |
] | |
}, | |
"execution_count": 9, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# Creating training dataframe according to BERT by adding the required columns\n", | |
"df_bert = pd.DataFrame({\n", | |
" 'id':range(len(train_df)),\n", | |
" 'label':train_df[0],\n", | |
" 'alpha':['a']*train_df.shape[0],\n", | |
" 'text': train_df[1].replace(r'\\n', ' ', regex=True)\n", | |
"})\n", | |
"\n", | |
"\n", | |
"# Splitting training data file into *train* and *dev*\n", | |
"df_bert_train, df_bert_dev = train_test_split(df_bert, test_size=0.01)\n", | |
"\n", | |
"df_bert_train.head()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 10, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<style scoped>\n", | |
" .dataframe tbody tr th:only-of-type {\n", | |
" vertical-align: middle;\n", | |
" }\n", | |
"\n", | |
" .dataframe tbody tr th {\n", | |
" vertical-align: top;\n", | |
" }\n", | |
"\n", | |
" .dataframe thead th {\n", | |
" text-align: right;\n", | |
" }\n", | |
"</style>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>id</th>\n", | |
" <th>text</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>0</td>\n", | |
" <td>Contrary to other reviews, I have zero complai...</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>1</td>\n", | |
" <td>Last summer I had an appointment to get new ti...</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>2</td>\n", | |
" <td>Friendly staff, same starbucks fair you get an...</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>3</td>\n", | |
" <td>The food is good. Unfortunately the service is...</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td>4</td>\n", | |
" <td>Even when we didn't have a car Filene's Baseme...</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" id text\n", | |
"0 0 Contrary to other reviews, I have zero complai...\n", | |
"1 1 Last summer I had an appointment to get new ti...\n", | |
"2 2 Friendly staff, same starbucks fair you get an...\n", | |
"3 3 The food is good. Unfortunately the service is...\n", | |
"4 4 Even when we didn't have a car Filene's Baseme..." | |
] | |
}, | |
"execution_count": 10, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# Creating test dataframe according to BERT\n", | |
"df_bert_test = pd.DataFrame({\n", | |
" 'id':range(len(test_df)),\n", | |
" 'text': test_df[1].replace(r'\\n', ' ', regex=True)\n", | |
"})\n", | |
"\n", | |
"df_bert_test.head()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 11, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# Saving dataframes to .tsv format as required by BERT\n", | |
"df_bert_train.to_csv('data/train.tsv', sep='\\t', index=False, header=False)\n", | |
"df_bert_dev.to_csv('data/dev.tsv', sep='\\t', index=False, header=False)\n", | |
"df_bert_test.to_csv('data/test.tsv', sep='\\t', index=False, header=False)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Now we are ready for training using the scripts in the BERT repo." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [] | |
} | |
], | |
"metadata": { | |
"kernelspec": { | |
"display_name": "Python 3", | |
"language": "python", | |
"name": "python3" | |
}, | |
"language_info": { | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 3 | |
}, | |
"file_extension": ".py", | |
"mimetype": "text/x-python", | |
"name": "python", | |
"nbconvert_exporter": "python", | |
"pygments_lexer": "ipython3", | |
"version": "3.7.3" | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 2 | |
} |
Hey sir, can you tell me how I can map words (only nouns and adjectives) to particular classes and properties of a graph database? Do I have to create a dataset pairing each word with its respective classes and properties in an ontology?
I am working on a Q&A system for a graph database. I found research papers that do this, but I can't figure out how to achieve it.
For some reason, when saving, the alpha column (a) was written as the first column, which was messing up the training.
I renamed the "alpha" column to "poop" and that fixed it. (I think the columns are just being saved alphabetically; maybe there is a better fix here, haha.)
df_bert = pd.DataFrame({
    'id': range(len(train_df)),
    'label': train_df[0],
    'poop': ['a'] * train_df.shape[0],
    'text': train_df[1].replace(r'\n', ' ', regex=True)
})
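
A possible alternative to renaming the column, sketched under the assumption that the cause is an older pandas version ordering dict keys alphabetically when building a DataFrame: pass the desired order explicitly through the `columns` argument. The two-row `train_df` below is a made-up stand-in for the real Yelp data, and `bert_columns` is an illustrative name, not something from the gist.

```python
import pandas as pd

# Made-up stand-in for train_df: column 0 is the label, column 1 is the text.
train_df = pd.DataFrame({0: [0, 1], 1: ["Bad\nservice", "Great food"]})

# Illustrative name: the column order the BERT scripts are expected to read.
bert_columns = ['id', 'label', 'alpha', 'text']

df_bert = pd.DataFrame(
    {
        'id': range(len(train_df)),
        'label': train_df[0],
        'alpha': ['a'] * train_df.shape[0],
        # Replace newline characters inside reviews with spaces.
        'text': train_df[1].replace(r'\n', ' ', regex=True),
    },
    columns=bert_columns,  # pin the order instead of relying on dict ordering
)

print(list(df_bert.columns))
```

With the order pinned this way, `to_csv(..., header=False)` should write id, label, alpha, text in that order no matter how the dict keys happen to sort, so the throw-away column can keep its original name.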