{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Introduction to NLP: N-Grams and RandomForest"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"_Christie-Carol Beauchamp_\n",
"\n",
"This is an article about Natural Language Processing (and emojis 😜). What do we mean by \"natural language\"? A natural language or ordinary language is defined as any language that has evolved naturally in humans without conscious planning or premeditation. This includes languages such as English, as opposed to constructed languages such as programming languages.\n",
"\n",
"**What you will learn:** \n",
"* What is an N-gram? How can we convert a sentence into N-grams? Why are N-grams useful?\n",
"* How to generate features from a sentence and how to make them accessible to a scikit-learn classifier."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import pickle\n",
"import re\n",
"from collections import Counter\n",
"\n",
"from sklearn.ensemble import RandomForestClassifier\n",
"from sklearn.metrics import accuracy_score\n",
"from sklearn.model_selection import train_test_split"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Reading our data\n",
"\n",
"Lets start by reading our data. Here's a link to the dataset: \n",
"\n",
"[🤗 HuggingFace Datasets - Emotion](https://huggingface.co/datasets/emotion), \n",
"\n",
"[Download Link](https://www.dropbox.com/s/607ptdakxuh5i4s/merged_training.pkl). "
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"def load_from_pickle(directory):\n",
" return pickle.load(open(directory,\"rb\"))"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Data shape: (20840, 2)\n"
]
}
],
"source": [
"data = load_from_pickle(directory=\"merged_training.pkl\")\n",
"\n",
"emotions = [\"sadness\", \"joy\", \"love\", \"anger\", \"fear\", \"surprise\"]\n",
"data = data[data[\"emotions\"].isin(emotions)]\n",
"\n",
"# The original dataset is huge, here we'll use only 5%:\n",
"data = data[:int(0.05 * len(data))]\n",
"print(f\"Data shape: {data.shape}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## N-Grams\n",
"\n",
"An N-gram is a sequence of N consecutive elements from a sample of text. For example, the text `\"today is Tuesday\"` contains 2 bigrams: `\"today is\"`, and `\"is Tuesday\"`.\n",
"\n",
"N-grams are useful because, for one reason, they help us to identify collocations of words. Collocations are expressions or phrases containing several words which are likely to occur together (e.g. \"United States\", \"weapon of mass destruction\", \"broad daylight\"). One interesting feature of N-grams is that their meaning can't always be drawn from their individual parts: for example, \"it's a piece of cake\", taken as a whole means \"it's easy\", yet the separate words in this phrase suggest something different entirely. "
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"def ngram(token, n):\n",
" \"\"\"\n",
" Given a token (usually a list of strings), split it into\n",
" groups of length n (i.e. N-grams). \n",
" \"\"\"\n",
" output = []\n",
" for i in range(n-1, len(token)):\n",
" ngram = ' '.join(token[i-n+1:i+1])\n",
" output.append(ngram)\n",
" return output"
]
},
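{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check, here is what `ngram` produces for the example sentence from above (tokenised with a plain `str.split`):\n",
"\n",
"```python\n",
"tokens = \"today is tuesday\".split()\n",
"ngram(tokens, 2)  # ['today is', 'is tuesday']\n",
"ngram(tokens, 3)  # ['today is tuesday']\n",
"```"
]
},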
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, we will use `collection.Counter` to find how frequently these N-grams occur in our sample of text and this is the information we'll give to our model later!"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"def create_feature(text, nrange=(1, 4)):\n",
" \"\"\"\n",
" Find a series of N-grams in the text and return a dictionary\n",
" telling us how commonly each N-gram occurs.\n",
" \"\"\"\n",
" text_features = []\n",
" text = text.lower()\n",
" for i in range(nrange[0], nrange[1]+1):\n",
" text_features += ngram(text.split(), i)\n",
" return Counter(text_features)"
]
},
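{
"cell_type": "markdown",
"metadata": {},
"source": [
"For instance, with the default `nrange=(1, 4)` a short sentence becomes a Counter over its unigrams, bigrams, trigrams and 4-grams:\n",
"\n",
"```python\n",
"create_feature(\"today is Tuesday\")\n",
"# Counter({'today': 1, 'is': 1, 'tuesday': 1, 'today is': 1, 'is tuesday': 1, 'today is tuesday': 1})\n",
"```"
]
},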
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Model and predict"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"y_all = data[\"emotions\"]\n",
"X_all = data[\"text\"].apply(create_feature)\n",
"\n",
"X_train, X_test, y_train, y_test = train_test_split(X_all, y_all, test_size = 0.2, random_state = 123)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Vectorization\n",
"\n",
"Next we use scikit-learn's DictVectorizer to convert our features into NumPy vectors suitable for feeding into a classifier. \n",
"\n",
"At this point, X_all looks like this:\n",
"`Counter({'the': 1, 'morning': 1, 'after': 1, 'the morning': 1, 'morning after': 1 ... }), ...`\n",
"which must be transformed into a vector like this: \n",
"\n",
"```\n",
" (0, 0)\t1.0\n",
" (0, 1)\t1.0\n",
" (0, 2)\t1.0\n",
" (0, 3)\t1.0\n",
" (0, 4)\t1.0\n",
"\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.feature_extraction import DictVectorizer\n",
"vectorizer = DictVectorizer(sparse = True)\n",
"X_train = vectorizer.fit_transform(X_train)\n",
"X_test = vectorizer.transform(X_test) "
]
},
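{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you're curious which N-gram each column corresponds to, you can peek at the fitted vectorizer. A minimal sketch (the exact shape and feature names depend on your data and split):\n",
"\n",
"```python\n",
"print(X_train.shape)                   # (n_documents, n_unique_ngrams)\n",
"print(vectorizer.feature_names_[:5])   # the first few N-grams, one per column\n",
"```"
]
},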
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Fitting RandomForestClassifier to the data"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"def train_test(clf, x_train, x_test, y_train, y_test):\n",
" clf.fit(x_train, y_train)\n",
" train_acc = accuracy_score(y_train, clf.predict(x_train))\n",
" test_acc = accuracy_score(y_test, clf.predict(x_test))\n",
" return train_acc, test_acc"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"| RandomForestClassifier | 0.996641074856046 | 0.7430422264875239 |\n"
]
}
],
"source": [
"rforest = RandomForestClassifier(n_estimators=16, n_jobs=-1, random_state=123)\n",
"train_acc, test_acc = train_test(rforest, X_train, X_test, y_train, y_test)\n",
"\n",
"print(f'| {rforest.__class__.__name__} | {train_acc} | {test_acc} |')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Prediction"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"def emojify(text):\n",
" input_text = re.sub('[^a-z0-9#]', ' ', text).lower()\n",
" features = create_feature(input_text, nrange=(1, 4))\n",
" features = vectorizer.transform(features)\n",
" prediction = rforest.predict(features)[0]\n",
" print(text, emoji_dict[prediction])\n",
" return emoji_dict[prediction]"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"scrolled": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"We had gelato, it was so good! 😁\n",
"My dog died last week. 😢\n",
"I have a fear of bats. 😱\n",
"They were very gracious: I felt blessed and so grateful. ❤️\n"
]
}
],
"source": [
"emoji_dict = {\"joy\":\"😁\", \"fear\":\"😱\", \"anger\":\"😠\", \"sadness\":\"😢\", \"love\": \"❤️\", \"surprise\": \"😳\"}\n",
"\n",
"texts = [\n",
" \"We had gelato, it was so good!\",\n",
" \"My dog died last week.\",\n",
" \"I have a fear of bats.\",\n",
" \"They were very gracious: I felt blessed and so grateful.\"\n",
"]\n",
"\n",
"for text in texts:\n",
" emojify(text)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Saving the model\n",
"\n",
"We're going to pickle our model and the vectorizer we trained on our data. This saves us from needing to train it again next time."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"\n",
"if 'bot' not in os.listdir():\n",
" os.makedirs('bot')\n",
"\n",
"with open('bot/vectorizer.pk', 'wb') as fin:\n",
" pickle.dump(vectorizer, fin)\n",
"with open('bot/emotion_model.sav', 'wb') as fin:\n",
" pickle.dump(rforest, fin)"
]
},
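{
"cell_type": "markdown",
"metadata": {},
"source": [
"Later (for example inside the bot we build below), the saved vectorizer and model can be loaded back with a couple of lines; a minimal sketch:\n",
"\n",
"```python\n",
"with open('bot/vectorizer.pk', 'rb') as f:\n",
"    vectorizer = pickle.load(f)\n",
"with open('bot/emotion_model.sav', 'rb') as f:\n",
"    rforest = pickle.load(f)\n",
"```"
]
},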
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Emojify your Chat AI\n",
"\n",
"Next, we'll create a chat AI using HuggingFace's Accelerated Inference API and deploy it to [Chai](https://chai.ml)! Follow the link for more information on Chai and to download the mobile app.\n",
"\n",
"<img src=\"https://i.imgur.com/IjZ12pt.png\" width=\"500\">\n",
"\n",
"Chai is a platform for creating, sharing and interacting with conversational AI's. It allows us to chat with our new bot through a mobile app. This means you can show it off really easily, no need to whip out your laptop and fire up a colab instance, simply open the app and get chatting!\n",
"\n",
"There is also a bot leaderboard to climb. We can see how our new bot compares to others on the platform:\n",
"\n",
"<img src=\"https://i.imgur.com/ctPYQVZ.png\" width=\"850\">"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Requirement already satisfied: chaipy in /Users/ccb/.pyenv/versions/3.7.10/lib/python3.7/site-packages (0.3.4)\n",
"Requirement already satisfied: halo>=0.0.31 in /Users/ccb/.pyenv/versions/3.7.10/lib/python3.7/site-packages (from chaipy) (0.0.31)\n",
"Requirement already satisfied: colorama>=0.4.4 in /Users/ccb/.pyenv/versions/3.7.10/lib/python3.7/site-packages (from chaipy) (0.4.4)\n",
"Requirement already satisfied: requests>=2.23.0 in /Users/ccb/.pyenv/versions/3.7.10/lib/python3.7/site-packages (from chaipy) (2.25.1)\n",
"Requirement already satisfied: segno>=1.3.3 in /Users/ccb/.pyenv/versions/3.7.10/lib/python3.7/site-packages (from chaipy) (1.3.3)\n",
"Requirement already satisfied: typing-extensions>=3.7.4.3 in /Users/ccb/.pyenv/versions/3.7.10/lib/python3.7/site-packages (from chaipy) (3.10.0.0)\n",
"Requirement already satisfied: spinners>=0.0.24 in /Users/ccb/.pyenv/versions/3.7.10/lib/python3.7/site-packages (from halo>=0.0.31->chaipy) (0.0.24)\n",
"Requirement already satisfied: six>=1.12.0 in /Users/ccb/.pyenv/versions/3.7.10/lib/python3.7/site-packages (from halo>=0.0.31->chaipy) (1.16.0)\n",
"Requirement already satisfied: log-symbols>=0.0.14 in /Users/ccb/.pyenv/versions/3.7.10/lib/python3.7/site-packages (from halo>=0.0.31->chaipy) (0.0.14)\n",
"Requirement already satisfied: termcolor>=1.1.0 in /Users/ccb/.pyenv/versions/3.7.10/lib/python3.7/site-packages (from halo>=0.0.31->chaipy) (1.1.0)\n",
"Requirement already satisfied: chardet<5,>=3.0.2 in /Users/ccb/.pyenv/versions/3.7.10/lib/python3.7/site-packages (from requests>=2.23.0->chaipy) (4.0.0)\n",
"Requirement already satisfied: urllib3<1.27,>=1.21.1 in /Users/ccb/.pyenv/versions/3.7.10/lib/python3.7/site-packages (from requests>=2.23.0->chaipy) (1.26.4)\n",
"Requirement already satisfied: certifi>=2017.4.17 in /Users/ccb/.pyenv/versions/3.7.10/lib/python3.7/site-packages (from requests>=2.23.0->chaipy) (2020.12.5)\n",
"Requirement already satisfied: idna<3,>=2.5 in /Users/ccb/.pyenv/versions/3.7.10/lib/python3.7/site-packages (from requests>=2.23.0->chaipy) (2.10)\n",
"\u001b[33mWARNING: You are using pip version 21.1.3; however, version 21.2.3 is available.\n",
"You should consider upgrading via the '/Users/ccb/.pyenv/versions/3.7.10/bin/python3 -m pip install --upgrade pip' command.\u001b[0m\n"
]
}
],
"source": [
"!pip install --upgrade chaipy"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import chai_py\n",
"chai_py.setup_notebook()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%write_and_run bot bot.py Blenderbot\n",
"import json\n",
"import pickle\n",
"import re\n",
"import time\n",
"from collections import Counter\n",
"\n",
"import requests\n",
"\n",
"from chai_py import ChaiBot, Update\n",
"\n",
"class Blenderbot(ChaiBot):\n",
" ENDPOINT = \"https://api-inference.huggingface.co/models/facebook/blenderbot-400M-distill\"\n",
" def setup(self):\n",
" self.headers = {\"Authorization\": \"Bearer <API-TOKEN>\"} # Get this API token from HuggingFace\n",
" self.first_response = \"😈\"\n",
" self.emoji = Emoji()\n",
"\n",
" async def on_message(self, update: Update) -> str:\n",
" \"\"\"\n",
" Every time the user sends our bot a message, this function gets triggered with\n",
" an Update, containing the user's LatestMessage. See the API reference for more\n",
" information: https://chai-py.readthedocs.io/en/latest/\n",
" \"\"\"\n",
" if update.latest_message.text == self.FIRST_MESSAGE_STRING:\n",
" return self.first_response\n",
" payload = await self.get_payload(update)\n",
" text = self.query(payload)\n",
" return self.emoji.emojify(text)\n",
"\n",
" def query(self, payload):\n",
" data = json.dumps(payload)\n",
" response = requests.post(self.ENDPOINT, headers=self.headers, data=data)\n",
" if response.status_code == 503:\n",
" response = self.sleep_to_load(response, data)\n",
" return json.loads(response.content.decode(\"utf-8\"))[\"generated_text\"]\n",
"\n",
" def sleep_to_load(self, response, data):\n",
" estimated_time = response.json()['estimated_time']\n",
" time.sleep(estimated_time)\n",
" self.logger.info(f\"Sleeping for model to load: {estimated_time}\")\n",
" data = json.loads(data)\n",
" data[\"options\"] = {\"use_cache\": False, \"wait_for_model\": True}\n",
" data = json.dumps(data)\n",
" response = requests.post(self.ENDPOINT, headers=self.headers, data=data)\n",
" return response\n",
"\n",
" async def get_payload(self, update):\n",
" messages = await self.get_messages(update.conversation_id)\n",
" past_user_inputs = [\"Hey\"]\n",
" generated_responses = [self.first_response]\n",
" for message in messages:\n",
" content = message.content\n",
" if content == self.FIRST_MESSAGE_STRING:\n",
" continue\n",
" if message.sender_uid == self.uid:\n",
" past_user_inputs.append(content)\n",
" else:\n",
" generated_responses.append(content)\n",
" return {\n",
" \"inputs\": {\n",
" \"past_user_inputs\": past_user_inputs,\n",
" \"generated_responses\": generated_responses,\n",
" \"text\": update.latest_message.text,\n",
" },\n",
" }\n",
"\n",
"class Emoji:\n",
" def __init__(self):\n",
" self.loaded_model = pickle.load(open('emotion_model.sav', 'rb'))\n",
" self.vectorizer = pickle.load(open('vectorizer.pk', 'rb'))\n",
" self.emoji_dict = {\"joy\":\"😁\", \"fear\":\"😱\", \"anger\":\"😠\", \"sadness\":\"😢\", \"love\": \"❤️\", \"surprise\": \"😳\"}\n",
"\n",
" def emojify(self, text):\n",
" \"\"\"\n",
" Given text, load it into the vectorizer and use the loaded model to predict an emoji.\n",
" \"\"\"\n",
" input_text = re.sub('[^a-z0-9#]', ' ', text)\n",
" input_text = input_text.lower()\n",
" feats = self.create_feature(input_text, nrange=(1, 4))\n",
" feats = self.vectorizer.transform(feats)\n",
" prediction = self.loaded_model.predict(feats)[0]\n",
" return text + self.emoji_dict[prediction]\n",
"\n",
" def create_feature(self, text, nrange=(1, 4)):\n",
" \"\"\"\n",
" Find a series of N-grams in the text and return a dictionary\n",
" telling us how commonly each N-gram occurs.\n",
"\n",
" \"\"\"\n",
" text_features = []\n",
" text = text.lower()\n",
" for i in range(nrange[0], nrange[1]+1):\n",
" text_features += self.ngram(text.split(), i)\n",
" return Counter(text_features)\n",
"\n",
" def ngram(self, token, n):\n",
" output = []\n",
" for i in range(n-1, len(token)):\n",
" ngram = ' '.join(token[i-n+1:i+1])\n",
" output.append(ngram)\n",
" return output"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Chatting with the bot\n",
"\n",
"Let's check that our bot works as expected before deploying it. We can do so using chai_py's TRoom for testing. ☕️😃"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from chai_py import TRoom\n",
"\n",
"t_room = TRoom([Blenderbot()])\n",
"t_room.test_chat([\"hi!\", \"what's up?\"])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Deploying to Chai\n",
"\n",
"1. Get the [Chai](https://chai.ml) app.\n",
"2. Sign up to Chai Developer Platform.\n",
"3. Scroll to the bottom and get your \"Developer Unique ID\" and \"Developer Key\" from the bottom of the [Chai Developer Platform](https://chai.ml/dev)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from chai_py.auth import set_auth\n",
"\n",
"DEV_UID = input(\"Enter dev UID: \")\n",
"DEV_KEY = input(\"Enter dev key: \")\n",
"set_auth(DEV_UID, DEV_KEY)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from chai_py import package, Metadata\n",
"\n",
"IMAGE_URL = \"https://i.imgur.com/V4CS55M.jpg\"\n",
"\n",
"package(\n",
" Metadata(\n",
" name=\"BlenderBot (emojified 🔥)\",\n",
" image_url=IMAGE_URL,\n",
" color=\"f1a2b3\",\n",
" description=\"☺️🔥🤩\",\n",
" input_class=Blenderbot,\n",
" memory=4000,\n",
" ),\n",
" requirements=[\"scikit-learn\", \"scipy\", \"numpy\"],\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"from chai_py import upload_and_deploy\n",
"\n",
"bot_uid = upload_and_deploy(\"bot/_package.zip\")\n",
"share_bot(bot_uid)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.10"
}
},
"nbformat": 4,
"nbformat_minor": 2
}