{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Introduction to Natural Language Processing with Python"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Natural language processing (NLP) is a field of computer science, artificial intelligence and computational linguistics concerned with the interactions between computers and human (natural) languages, and, in particular, concerned with programming computers to fruitfully process large natural language corpora. Challenges in natural language processing frequently involve natural language understanding, natural language generation (frequently from formal, machine-readable logical forms), connecting language and machine perception, dialog systems, or some combination thereof.\n",
"\n",
"(Ref: https://en.wikipedia.org/wiki/Natural_language_processing)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This is an introduction notebook where we will see how we can use Python to do some basic processing of natural language in Python."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The package which handles most of the task in Python is - 'nltk'. We will start with importing the nltk package."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you have not downloaded the nltk package, check the below YouTube video how to download it. \n",
"https://www.youtube.com/watch?v=68aHmFcO-W4"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import nltk"
]
},
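{
"cell_type": "markdown",
"metadata": {},
"source": [
"If the NLTK data used later in this notebook is not installed yet, a one-time download along the lines of the sketch below should work. The resource names assumed here are 'punkt' (tokenizers), 'stopwords', 'wordnet', and 'averaged_perceptron_tagger' (the model behind nltk.pos_tag)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# One-time setup (sketch): download the NLTK data used later in this notebook\n",
"for resource in ['punkt', 'stopwords', 'wordnet', 'averaged_perceptron_tagger']:\n",
"    nltk.download(resource)  # skips the download if the resource is already up to date"
]
},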
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Create a text (Multiple Lines)\n",
"text = 'Mary had a little lamb. Her fleece was white as snow.'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can convert a text into a group of sentences and into a group of words using sent_tokenize and word_tokenize from nltk."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Import word_tokenize,sent_tokenize\n",
"from nltk.tokenize import word_tokenize,sent_tokenize"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"['Mary had a little lamb.', 'Her fleece was white as snow.']"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Using sent_tokenize we will convert the text into a list of strings or sentences.\n",
"sents = sent_tokenize(text)\n",
"sents"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"[['Mary', 'had', 'a', 'little', 'lamb', '.'],\n",
" ['Her', 'fleece', 'was', 'white', 'as', 'snow', '.']]"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Using word_tokenize we will convert the text into a list of words.\n",
"words = [word_tokenize(word) for word in sents]\n",
"words"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"English text consists of many words like - a, an, the, period(.), is, was etc. These words are not useful for our anlysis. They will not provide meaningful essence to our analysis of text.So, we will remove them.\n",
"\n",
"This is done using stopwords from nltk and punctuation from string."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from nltk.corpus import stopwords\n",
"from string import punctuation\n",
"# We will make a set of stopwords and punctuation and name it as customStopWords\n",
"# If you are aware of a list of words which are not useful, you may add that to your set\n",
"customStopWords = set(stopwords.words('english')+list(punctuation))"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"['Mary', 'little', 'lamb', 'Her', 'fleece', 'white', 'snow']"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Below we are creating wordsWOstopWords which is all the words from text which are not present\n",
"# in the customStopWords set\n",
"wordsWOstopWords = [word for word in word_tokenize(text) if word not in customStopWords]\n",
"wordsWOstopWords"
]
},
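{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that 'Her' survives the filter because the NLTK stopword list is lowercase. A minimal sketch that lowercases each token before checking it against the set (without changing wordsWOstopWords):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Lowercase each token before the stopword check, so 'Her' is filtered out as well\n",
"[word for word in word_tokenize(text) if word.lower() not in customStopWords]"
]
},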
{
"cell_type": "markdown",
"metadata": {},
"source": [
"N-Gram:\n",
"N-Grams means keeping multiple words together as they occur most often. \n",
"If we keep two words together, it’s called Bigrams. \n",
"\n",
"This is done using the collocation module in nltk. The BigramCollocationFinder function returns group of words and their frequency."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"[(('Her', 'fleece'), 1),\n",
" (('Mary', 'little'), 1),\n",
" (('fleece', 'white'), 1),\n",
" (('lamb', 'Her'), 1),\n",
" (('little', 'lamb'), 1),\n",
" (('white', 'snow'), 1)]"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from nltk.collocations import *\n",
"bigram_measures = nltk.collocations.BigramAssocMeasures() \n",
"finder = BigramCollocationFinder.from_words(wordsWOstopWords)\n",
"sorted(finder.ngram_fd.items())"
]
},
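{
"cell_type": "markdown",
"metadata": {},
"source": [
"The bigram_measures object created above provides scoring functions for ranking bigrams. As a small follow-on sketch, we can ask the finder for the top bigrams by raw frequency (on this tiny text every bigram occurs once, so the ranking is not very informative):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Rank the bigrams by raw frequency and return the top 3\n",
"finder.nbest(bigram_measures.raw_freq, 3)"
]
},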
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Stemming:\n",
"\n",
"Stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form—generally a written word form.\n",
"\n",
"Example: A stemmer for English, for example, should identify the string \"cats\" (and possibly \"catlike\", \"catty\" etc.) as based on the root \"cat\", and \"stems\", \"stemmer\", \"stemming\", \"stemmed\" as based on \"stem\". A stemming algorithm reduces the words \"fishing\", \"fished\", and \"fisher\" to the root word, \"fish\". On the other hand, \"argue\", \"argued\", \"argues\", \"arguing\", and \"argus\" reduce to the stem \"argu\" (illustrating the case where the stem is not itself a word or root) but \"argument\" and \"arguments\" reduce to the stem \"argument\".\n",
"\n",
"(Ref: https://en.wikipedia.org/wiki/Stemming)"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['mary', 'clos', 'on', 'the', 'clos', 'night', 'when', 'she', 'want', 'to', 'clos']\n"
]
}
],
"source": [
"text2 = \"Mary closed on the closing night when she wanted to close\"\n",
"from nltk.stem.lancaster import LancasterStemmer\n",
"st = LancasterStemmer()\n",
"stemmerWords = [st.stem(word) for word in word_tokenize(text2)]\n",
"print(stemmerWords)\n",
"# If you see here, words - closed, closing, and close - reduced to clos (the root)"
]
},
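{
"cell_type": "markdown",
"metadata": {},
"source": [
"The Lancaster stemmer used above is a fairly aggressive stemmer. For comparison, here is a minimal sketch that runs nltk's PorterStemmer on the same sentence; the exact stems it produces differ from the Lancaster output."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Stem the same sentence with the (generally less aggressive) Porter stemmer for comparison\n",
"from nltk.stem.porter import PorterStemmer\n",
"porter = PorterStemmer()\n",
"print([porter.stem(word) for word in word_tokenize(text2)])"
]
},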
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Part-of-speech:\n",
"\n",
"Sometimes, it may be required to know that a certain word in a sentence is a noun, verb, or a preposition. This is called position tagging. We can use the nlt.pos_tag() function to find that."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[('Mary', 'NNP'), ('closed', 'VBD'), ('on', 'IN'), ('the', 'DT'), ('closing', 'NN'), ('night', 'NN'), ('when', 'WRB'), ('she', 'PRP'), ('wanted', 'VBD'), ('to', 'TO'), ('close', 'VB')]\n"
]
}
],
"source": [
"# Position Tagging\n",
"print(nltk.pos_tag(word_tokenize(text2)))\n",
"# Shows if part of the speech is a verb, or a noun"
]
},
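{
"cell_type": "markdown",
"metadata": {},
"source": [
"The tag abbreviations come from the Penn Treebank tagset. If you are unsure what a tag means, nltk provides a lookup utility; a minimal sketch (this may first require nltk.download('tagsets')):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Look up the meaning of the 'VBD' tag (past-tense verb) in the Penn Treebank tagset\n",
"nltk.help.upenn_tagset('VBD')"
]
},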
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Meaning of words:\n",
"\n",
"We can find the meaning of any word using the wordnet module from nltk.corpus.\n",
"When we print different meanings of words, it will show all meanings and also the part of the speech."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Synset('bass.n.01') the lowest part of the musical range\n",
"Synset('bass.n.02') the lowest part in polyphonic music\n",
"Synset('bass.n.03') an adult male singer with the lowest voice\n",
"Synset('sea_bass.n.01') the lean flesh of a saltwater fish of the family Serranidae\n",
"Synset('freshwater_bass.n.01') any of various North American freshwater fish with lean flesh (especially of the genus Micropterus)\n",
"Synset('bass.n.06') the lowest adult male singing voice\n",
"Synset('bass.n.07') the member with the lowest range of a family of musical instruments\n",
"Synset('bass.n.08') nontechnical name for any of numerous edible marine and freshwater spiny-finned fishes\n",
"Synset('bass.s.01') having or denoting a low vocal or instrumental range\n"
]
}
],
"source": [
"# Meaning of word\n",
"from nltk.corpus import wordnet as wn\n",
"for ss in wn.synsets('bass'):\n",
" print(ss,ss.definition())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Word Sense Disambiguation:\n",
"\n",
"If you see the meaning of the word - 'bass' - it has different meanings in different contexts. \n",
"Synset('bass.n.07') - is a musical instrument range, but, Synset('sea_bass.n.01') is a sea fish.\n",
"\n",
"The nltk is so powerful that it can find the context in which the word has occured."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Synset('bass.n.07') the member with the lowest range of a family of musical instruments\n",
"Synset('sea_bass.n.01') the lean flesh of a saltwater fish of the family Serranidae\n"
]
}
],
"source": [
"# We can see that bass has different meanings, sometimes a sea fish and sometimes a low music tone\n",
"# If we want to check the meaning in different context\n",
"from nltk.wsd import lesk\n",
"sensel = lesk(word_tokenize('Sing in a lower tone, along with the bass'),'bass')\n",
"print(sensel,sensel.definition())\n",
"\n",
"sensel2 = lesk(word_tokenize('This sea bass is really hard to catch'),'bass')\n",
"print(sensel2,sensel2.definition())"
]
},
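{
"cell_type": "markdown",
"metadata": {},
"source": [
"lesk also accepts an optional part-of-speech argument that restricts the candidate senses. A minimal sketch, restricting 'bass' to its noun senses:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Restrict the lesk candidates to noun senses of 'bass'\n",
"sense3 = lesk(word_tokenize('This sea bass is really hard to catch'),'bass',pos='n')\n",
"print(sense3,sense3.definition())"
]
},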
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python [Root]",
"language": "python",
"name": "Python [Root]"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.2"
}
},
"nbformat": 4,
"nbformat_minor": 0
}