{
"cells": [
{
"cell_type": "markdown",
"id": "6d820a9d-5e26-4e55-8010-1aaf1d1ab7cc",
"metadata": {},
"source": [
"# Natural Language Fundamentals - Intro & Language Model Implementation of Sentiment Analysis, Machine Translation and Named Entity Recognition\n",
"\n",
"We humans use words, sounds, gestures and symbols to convey complex concepts and abstracts to each other in different forms such as speech, writing and signs. With the advent of computers and in order to take advantage of these powerful machines, we had to come up with ways for computers to understand human communications and the existing corpus of knowledge. Hence, the Natural Language Processing (NLP), Understanding (NLU) and Generation (NLG) branches of Artificial Intelligence were born. Boundaries of these three areas are quite merky and the overall Natural Language space encompasses various applications in today's Computer and Data Science world. Probably the most common of such applications are (I) Sentiment Analysis, (II) Machine Translation and (III) Named-Entity Recognition (NER), which we will define and implement in this post. \n",
"\n",
"In order to implement these three tasks, we will leverage existing \"Pre-Trained Language Models\". Therefore, let's first understand what language modeling is. I do recommend glancing through the \"Language Modeling\" section but if you are mainly interested in the application and/or implementation, feel free to skip it.\n",
"\n",
"## Language Modeling\n",
"\n",
"Language modeling encompasses various methods that use statistics and probability, combined with machine learning architectures (such as and especially deep neural networks) to determine the likelihood of a sequence of words occurring in a string, such as a sentence. Based on the calculated probability, certain decisions can be made - for example, a model can generate a string/response to a user-provided prompt (such as ChatGPT), perform a text classification to determine whether a word in question is a noun, a verb, etc. Thanks to the large corpora of textual data available all around us these days, such language models are usually trained on vast amounts of textual data. Consequently, such models are also referred to as Large Language Models. At this point you may be wondering how this is relevant to our post - we are just getting there. These pre-trained language models then can be further trained (i.e. fine-tuned) to perform specific tasks, such as sentiment analysis, machine translation and named-entity recognition, which we will explore in further detail today. Going deeper into the architecture and training strategies of language models is beyond the intent of this post, but if you are interested in that topic, feel free to visit the post below:\n",
"\n",
"[Pre-Trained Models in NLP](https://medium.com/@fmnobar/intro-to-pre-trained-models-in-nlp-6bf7490a49fa)\n",
"\n",
"Now that we are familiar with what Natural Language space and Language Modeling are, let's go to the fun part of using these models!\n",
"\n",
"## 1. Sentiment Analysis\n",
"\n",
"Task of identifying the sentiment of a piece of text, such as whether it is positive, negative, or neutral. It is used in applications such as social media monitoring, customer feedback analysis, and product review analysis. As you can imagine, this one is quite useful for a lot of companies. For example, a large online retail company would not be able to dedicate the human resources required to manually read all the comments about various products. Instead, they can run a Sentiment Analysis model on the reviews and analyze the results. Next, let's see how we can implement this. \n",
"\n",
"### 1.1. Sentiment Analysis - Implementation\n",
"\n",
"In this example, we first load a pre-trained model from the transformers library. Then, we use the model to generate sentiment from the input sentence. Then we test this on two different sentences, one positive and one negative to verify model's performance. Below are the two sentences that we will be using:\n",
"\n",
"- \"I loved this movie!\", which we expect to be classified as a \"Positive\" sentiment by the model\n",
"- \"I did not like this movie.\", which we expect to be classified as a \"Negative\" sentiment by the model\n",
"\n",
"Let's see how it works!"
]
},
{
"cell_type": "code",
"execution_count": 41,
"id": "40231f01-9d53-47d6-9008-b8ee245fb8a5",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"'I loved this movie!' has a POSITIVE sentiment, with a score of 0.9999!\n",
"\n",
"'I did not like this movie.' has a NEGATIVE sentiment, with a score of 0.9925!\n",
"\n"
]
}
],
"source": [
"# Import libraries\n",
"from transformers import pipeline\n",
"\n",
"# Load the pre-trained model\n",
"nlp = pipeline('sentiment-analysis', model='distilbert-base-uncased-finetuned-sst-2-english')\n",
"\n",
"# Define the function to perform sentiment analysis\n",
"def sentiment_analyzer(input_text):\n",
" # Perform sentiment analysis\n",
" result = nlp(input_text)[0]\n",
" \n",
" # Return results\n",
" return f\"'{input_text}' has a {result['label']} sentiment, with a score of {round(result['score'], 4)}!\\n\"\n",
"\n",
"# Define example sentences\n",
"sentence_1 = \"I loved this movie!\"\n",
"sentence_2 = \"I did not like this movie.\"\n",
"sentence_list = [sentence_1, sentence_2]\n",
"\n",
"# Analyze the sentiment of each sentence\n",
"for sentence in sentence_list:\n",
" print(sentiment_analyzer(sentence))\n"
]
},
{
"cell_type": "markdown",
"id": "42cc3aaa-a82f-4494-bb51-2f7ac96b0aa1",
"metadata": {},
"source": [
"Results look pretty good and as we expected.\n",
"\n",
"## 2. Machine Translation\n",
"\n",
"Task of automatically translating text from one language to another. The most wellknown example for most users is Google Translate - what Google Translate does is called Machine Translation! Applications are plentiful. For example, one can read and understand information in other languages. \n",
"\n",
"### 2.1. Machine Translation - Implementation\n",
"To implement Machine Translation, we are going to use mBART-50 from transformers library, which is a pre-trained model for Machine Translation. The steps are very similar to what we did in the Sentiment Analysis and are as follows:\n",
"\n",
"1. Installation - you may need to install transformers as follows: `pip install transformers`\n",
"2. Import the library\n",
"3. Load the pre-trained model\n",
"4. Run example sentences through the model and return the results\n",
"\n",
"What is interesting about mBART-50 is that it is a multilingual Machine Translation model, meaning it can translate to and from different languages. Let's test this capability in action!"
]
},
{
"cell_type": "code",
"execution_count": 90,
"id": "2dc5482b-ba1d-4e81-95bc-63861d8d5e89",
"metadata": {},
"outputs": [],
"source": [
"# Import library\n",
"from transformers import MBartForConditionalGeneration, MBart50TokenizerFast\n",
"\n",
"# Load model and tokenizer\n",
"model = MBartForConditionalGeneration.from_pretrained(\"facebook/mbart-large-50-many-to-many-mmt\")\n",
"tokenizer = MBart50TokenizerFast.from_pretrained(\"facebook/mbart-large-50-many-to-many-mmt\")\n",
"\n",
"def translator(source_sentence, source_language, target_language):\n",
" # Encode sentence\n",
" tokenizer.src_lang = source_language\n",
" input_ids = tokenizer(source_sentence, return_tensors=\"pt\").input_ids\n",
"\n",
" # Translate sentence\n",
" output_ids = model.generate(input_ids, forced_bos_token_id=tokenizer.lang_code_to_id[target_language])\n",
" translation = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0]\n",
"\n",
" # return translation\n",
" return translation"
]
},
{
"cell_type": "markdown",
"id": "f1d74c53-2f3d-4420-ad5e-ec55613447d2",
"metadata": {},
"source": [
"In the above block, we imported the libraries, loaded the models and created the \"translator\" function that accepts a sentence (`source_sentence`), the language of the provided sentence (`source_language`) and the language that we would like the sentence to be translated to (`target_language`). Then \"translator\" returns the translation as instructed. \n",
"\n",
"Next, let's test our function by translating `Multilingual machine translation is impressive!` to French, Spanish, Italian, German, Simplified Chinese and Japanese."
]
},
{
"cell_type": "code",
"execution_count": 95,
"id": "459a8c64-92c5-4269-b971-6e3947539935",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Generating machine translations for: \n",
"'Multilingual machine translation is impressive!'\n",
"\n",
"fr_XX:\n",
"La traduction multilingue est impressionnante!\n",
"\n",
"es_XX:\n",
"La traducción de máquinas multilingues es impresionante!\n",
"\n",
"it_IT:\n",
"La traduzione multilingua è impressionante!\n",
"\n",
"de_DE:\n",
"Mehrsprachige maschinale Übersetzung ist beeindruckend!\n",
"\n",
"zh_CN:\n",
"多语言机器翻译令人印象深刻!\n",
"\n",
"ja_XX:\n",
"多言語機械翻訳は印象的です!\n",
"\n"
]
}
],
"source": [
"# Define sentence to be translated\n",
"original_sentence = 'Multilingual machine translation is impressive!'\n",
"\n",
"# Define source language\n",
"english = \"en_XX\"\n",
"\n",
"# Define target languages\n",
"french = \"fr_XX\"\n",
"spanish = \"es_XX\"\n",
"italian = \"it_IT\"\n",
"german = \"de_DE\"\n",
"simplified_chinese = \"zh_CN\"\n",
"japanese = \"ja_XX\"\n",
"\n",
"# Create a list of target languages\n",
"target_list = [french, spanish, italian, german, simplified_chinese, japanese]\n",
"\n",
"# Create a prompt list of lists\n",
"prompt_list = []\n",
"\n",
"for target in target_list:\n",
" prompt_list.append([original_sentence, english, target])\n",
"\n",
"# Create translations\n",
"print(f\"Generating machine translations for: \\n'{original_sentence}'\\n\")\n",
"\n",
"for i in enumerate(prompt_list):\n",
" translation = translator(source_sentence=i[1][0], source_language=i[1][1], target_language=i[1][2])\n",
" print(f\"{i[1][2]}:\")\n",
" print(f\"{translation}\\n\")"
]
},
{
"cell_type": "markdown",
"id": "90a928f9-9308-44c5-b053-b84a5ce76e17",
"metadata": {},
"source": [
"And we see the translations in the target languages in the results! I do not personally speak any of these languages but I verified them using Google Translate and the translations seem accurate!\n",
"\n",
"## 3. Named-Entity Recognition (NER)\n",
"\n",
"Task of identifying and categorizing named entities in text and categorizing them into pre-defined classes is called Named-Entity Recognition or NER for short. Some examples of these categories are: person names, locations, dates, organizations, numbers, etc. You may be wondering why we would need such a model. NER has many applications in the Natural Language space. For example, Visa, American Express, Amazon, etc. can use NER to identify and black-out sensitive information in a customer communication to protect customers' sensitive information, such as date of birth and credit card information. Another application for social media companies such as Meta is identifying locations and individual names in comments/posts and using them for content recommendation.\n",
"\n",
"Now that we understand what NER is, let's implement it and look at the results. \n",
"\n",
"### 3.1. NER - Implementation\n",
"\n",
"In this example, we are going to use [spaCy](https://spacy.io) pre-trained model in Python for NER. The implementation is pretty straightforward. We will follow these steps:\n",
"\n",
"1. Installation (skip if you have it) and download the required data\n",
"2. Import the library\n",
"3. Load the pre-trained tasks, inclusive of NER\n",
"4. Running an example sentence through the model and return the results\n",
"\n",
"If you need to install spaCy and download the data, use the following command ([source](https://spacy.io/usage/models)):\n",
"```\n",
"pip3 install spacy\n",
"python -m spacy download en_core_web_sm\n",
"```\n",
"I ran the installation above using the Command Line Interface. It is as simple as (I) opening the Terminal and then (II) running the above two lines. If you need a tutorial, feel free to check out this post: [CLI Tutorial](https://medium.com/towards-data-science/command-line-interface-cli-tutorial-how-advanced-users-interact-with-computers-28cf88f81ce)\n",
"\n",
"Next, let's implement and apply NER to the following sentence: \n",
"\n",
"`Farzad wrote this Medium article in March 2023, using an Apple laptop, on a Jupyter notebook!`"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "78a3d150-feff-4fa2-8842-e62ea5930c57",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Noun phrases:['Farzad', 'this Medium article', 'March', 'an Apple laptop', 'a Jupyter notebook']\n",
"\n",
"Verbs: ['write', 'use']\n",
"\n",
"Medium LOC\n",
"March 2023 DATE\n",
"Apple ORG\n",
"Jupyter PERSON\n"
]
}
],
"source": [
"# Import library\n",
"import spacy\n",
"\n",
"# Load English tokenizer, tagger, parser and NER\n",
"nlp = spacy.load(\"en_core_web_sm\")\n",
"\n",
"# Define example sentence\n",
"sentence = \"Farzad wrote this Medium article in March 2023, using an Apple laptop, on a Jupyter notebook!\"\n",
"\n",
"# Apply NER\n",
"doc = nlp(sentence)\n",
"\n",
"# Analyze syntax\n",
"print(f\"Noun phrases:{[chunk.text for chunk in doc.noun_chunks]}\\n\")\n",
"print(\"Verbs:\", [token.lemma_ for token in doc if token.pos_ == \"VERB\"])\n",
"print(\"\")\n",
"\n",
"# Find named entities, phrases and concepts\n",
"for entity in doc.ents:\n",
" print(entity.text, entity.label_)"
]
},
{
"cell_type": "markdown",
"id": "58b6769c-2dbf-4653-82ac-c5745d6d853e",
"metadata": {},
"source": [
"That's quite interesting! Let's talk about the results. The first line identified the nouns - I do not personally agree with all of them but still, that is quite impressive! The second line has correctly identified \"write\" and \"use\" as verbs and the third block has identified \"Medium\" as a location, \"March 2023\" as a date, \"Apple\" as an organization (this one is interesting since apple could also be the name of a fruit but the model recognized the company name, presumably based on the context of the sentence) and \"Jupyter\" as a person (this one needs some improvement). There are ways to further train these pre-trained models to ensure NER works more accurately for every use case but the point we wanted to articulate here was to showcase how these pre-trained language models can be used to accomplish tasks such as NER with a reasonable level of accuracy.\n",
"\n",
"## Conclusion\n",
"\n",
"In this post, we briefly walked through the world of Natural Language Processing (NLP), Understanding (NLU) and Generation (NLG) and tried to understand their importance by introducing and implementing some of the most common tasks within the Natural Language space, using language modeling. We then moved on to the introduction and language model implementation of (I) Sentiment Analysis, (II) Machine Translation and (III) Named-Entity Recognition (NER) and looked at the impressive results of these powerful pre-trained language models in multiple languages. "
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.7"
}
},
"nbformat": 4,
"nbformat_minor": 5
}