{
"cells": [
{
"metadata": {},
"cell_type": "markdown",
"source": "## Part of speech tagging and text chunking using NLTK\n\n•\tDatasets:\n•\tAnonymous data with textual features converted to machine readable format for (Train and Test)\n\nApproach to analyzing documentation – Creating taxonomy of terms\n\nText Document Collection; Raw text with multiple docs from multiple regulators - The document title indicating general subject area."
},
{
"metadata": {},
"cell_type": "markdown",
"source": "**1. Parts-of-speech tagging**\n\nis the process of converting a sentence in the form of a list of words, into a list of tuples, where each tuple is of the form (word, tag). The tag is a part-of-speech tag and signifies whether the word is a noun, adjective, verb and so on. This is a necessary step before chunking."
},
{
"metadata": {
"collapsed": false,
"trusted": true
},
"cell_type": "code",
"source": "#Pos tagger for python\nimport nltk\nimport string\nfrom collections import Counter\ndata = (\" Claim Number, Claim Closing Date, First Payment Date,Property Address,ZIP Code,Original Amount Disbursed\")\ntokens = nltk.word_tokenize(data)\npofs = nltk.pos_tag(data)\nprint (pofs)",
"execution_count": 1,
"outputs": [
{
"output_type": "stream",
"text": "[(' ', 'JJ'), ('C', 'NNP'), ('l', 'NN'), ('a', 'DT'), ('i', 'JJ'), ('m', 'NN'), (' ', 'NNP'), ('N', 'NNP'), ('u', 'JJ'), ('m', 'NN'), ('b', 'NN'), ('e', 'NN'), ('r', 'NN'), (',', ','), (' ', 'NNP'), ('C', 'NNP'), ('l', 'VBZ'), ('a', 'DT'), ('i', 'JJ'), ('m', 'NN'), (' ', 'NNP'), ('C', 'NNP'), ('l', 'NN'), ('o', 'NN'), ('s', 'NN'), ('i', 'NN'), ('n', 'VBP'), ('g', 'NN'), (' ', 'NNP'), ('D', 'NNP'), ('a', 'DT'), ('t', 'NN'), ('e', 'NN'), (',', ','), (' ', 'NNP'), ('F', 'NNP'), ('i', 'NN'), ('r', 'VBP'), ('s', 'NN'), ('t', 'NN'), (' ', 'NNP'), ('P', 'NNP'), ('a', 'DT'), ('y', 'NN'), ('m', 'NN'), ('e', 'NN'), ('n', 'JJ'), ('t', 'NN'), (' ', 'NNP'), ('D', 'NNP'), ('a', 'DT'), ('t', 'NN'), ('e', 'NN'), (',', ','), ('P', 'NNP'), ('r', 'NN'), ('o', 'NN'), ('p', 'NN'), ('e', 'NN'), ('r', 'NN'), ('t', 'NN'), ('y', 'NN'), (' ', 'VBZ'), ('A', 'NNP'), ('d', 'NN'), ('d', 'NN'), ('r', 'NN'), ('e', 'NN'), ('s', 'NN'), ('s', 'NN'), (',', ','), ('Z', 'NNP'), ('I', 'PRP'), ('P', 'NNP'), (' ', 'NNP'), ('C', 'NNP'), ('o', 'MD'), ('d', 'VB'), ('e', 'NN'), (',', ','), ('O', 'NNP'), ('r', 'NN'), ('i', 'NN'), ('g', 'VBP'), ('i', 'NN'), ('n', 'VBP'), ('a', 'DT'), ('l', 'NN'), (' ', 'VBZ'), ('A', 'NNP'), ('m', 'NN'), ('o', 'NN'), ('u', 'JJ'), ('n', 'JJ'), ('t', 'NN'), (' ', 'NNP'), ('D', 'NNP'), ('i', 'NN'), ('s', 'NN'), ('b', 'NN'), ('u', 'JJ'), ('r', 'NN'), ('s', 'NN'), ('e', 'NN'), ('d', 'NN')]\n",
"name": "stdout"
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "Pyspark Version"
},
{
"metadata": {
"collapsed": false,
"trusted": true
},
"cell_type": "code",
"source": "#pos tagger pyspark\n#data = sc.textFile('file path.txt')\n#import nltk \n#words = data.flatMap(lambda x: nltk.word_tokenize(x)) \n#print words.take(10) \n#pos_word = words.map(lambda x: nltk.pos_tag([x])) \n#print pos_word.take(5)",
"execution_count": 3,
"outputs": []
},
{
"metadata": {},
"cell_type": "markdown",
"source": "**1.1 Chunker**\n\nWith parts-of-speech tags, a chunker knows how to identify phrases based on tag patterns. These POS tags are used for grammar analysis and word sense disambiguation. Chunking is shallow parsing"
},
{
"metadata": {
"scrolled": true,
"collapsed": false,
"trusted": true
},
"cell_type": "code",
"source": "#Chunker Ver/Noun Phrase\nimport nltk \nfrom nltk.chunk.regexp import *\ntest = (\"Claim Number, Application Closing Date, First Payment Date,Property State,Property ZIP Code,Original Claim Amount Disbursed,Debt to Income Ratio, Back-End at Origination, Debt to Income (DTI), Credit Score, Origination Credit Bureau Score\")\n \ntest_pofs = nltk.pos_tag(nltk.word_tokenize(test))\n#verb phrase chunker\n#create the verb phrase rule. \n#-------------------------------------------------\n#Noun Phrase chunker\n#create the noun phrase \nrule_np = ChunkRule(r'(<DT>)?(<RB>?)?<JJ|CD>*(<JJ,CD><,>)*(<NN.*>)+', 'Chunk NPs')\n#create the parser for the verb phrase.\nparser_np = RegexpChunkParser([rule_np],chunk_label='NP')\nnp = parser_np.parse(test_pofs)\nprint np",
"execution_count": 4,
"outputs": [
{
"output_type": "stream",
"text": "(S\n (NP Claim/NNP Number/NNP)\n ,/,\n (NP Application/NNP Closing/NNP Date/NNP)\n ,/,\n (NP First/NNP Payment/NNP Date/NNP)\n ,/,\n (NP Property/NNP State/NNP)\n ,/,\n (NP Property/NNP ZIP/NNP Code/NNP)\n ,/,\n (NP Original/NNP Claim/NNP Amount/NNP)\n Disbursed/VBD\n ,/,\n (NP Debt/NNP)\n to/TO\n (NP Income/NNP Ratio/NNP)\n ,/,\n (NP Back-End/NNP)\n at/IN\n (NP Origination/NNP)\n ,/,\n (NP Debt/NNP)\n to/TO\n (NP Income/NNP)\n (/(\n (NP DTI/NNP)\n )/)\n ,/,\n (NP Credit/NNP Score/NNP)\n ,/,\n (NP Origination/NNP Credit/NNP Bureau/NNP Score/NNP))\n",
"name": "stdout"
}
]
},
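{
"metadata": {},
"cell_type": "markdown",
"source": "A verb-phrase chunker can be built the same way with ChunkRule and RegexpChunkParser. The cell below is a hedged sketch: the tag pattern (optional 'to', optional modal, one or more verbs, optional adverb) is an illustrative assumption, not a rule from the original notebook; it reuses test_pofs from the cell above."
},
{
"metadata": {
"collapsed": false,
"trusted": true
},
"cell_type": "code",
"source": "#Verb phrase chunker (illustrative sketch; this tag pattern is an assumption)\n#reuses test_pofs from the noun-phrase cell above\nrule_vp = ChunkRule(r'(<TO>)?<MD>?<VB.*>+(<RB.*>)?', 'Chunk VPs')\nparser_vp = RegexpChunkParser([rule_vp], chunk_label='VP')\nvp = parser_vp.parse(test_pofs)\nprint vp",
"execution_count": null,
"outputs": []
},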
{
"metadata": {},
"cell_type": "markdown",
"source": "**2.DictVectorizer - Transformation**\n\nSequenece Classifier i.e Chunker to extract part of speech tags (POS)\n\noutput:\n\nget_feature_names() -> sparsematrix -> feature output = [Pos+1 =PP (\"the\"), Pos-1=NN(\"cat\"), Pos-2=DT(\"sat\"), word+1 = \"on\", word+2 \"the\"+ Pos+1 \"mat\"]"
}
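,
{
"metadata": {},
"cell_type": "markdown",
"source": "A minimal sketch of the DictVectorizer step, assuming scikit-learn is available (this cell is not from the original notebook): each token is mapped to a dict of contextual features (the word, its POS tag, and those of its immediate neighbours), and DictVectorizer turns the list of dicts into a sparse feature matrix that a sequence classifier can be trained on. The feature names and the token_features helper below are illustrative assumptions."
},
{
"metadata": {
"collapsed": false,
"trusted": true
},
"cell_type": "code",
"source": "#DictVectorizer sketch: contextual POS features -> sparse matrix\n#(illustrative; token_features and its feature names are assumptions)\nfrom sklearn.feature_extraction import DictVectorizer\n\ndef token_features(tagged, i):\n    #features for token i of a POS-tagged sentence: the token itself,\n    #its tag, and the word/tag of its immediate neighbours\n    word, pos = tagged[i]\n    feats = {'word': word, 'pos': pos}\n    if i > 0:\n        feats['word-1'], feats['pos-1'] = tagged[i - 1]\n    if i < len(tagged) - 1:\n        feats['word+1'], feats['pos+1'] = tagged[i + 1]\n    return feats\n\ntagged = nltk.pos_tag(nltk.word_tokenize(\"the cat sat on the mat\"))\nX_dicts = [token_features(tagged, i) for i in range(len(tagged))]\nvec = DictVectorizer()\nX = vec.fit_transform(X_dicts)  #sparse matrix, one row per token\nprint vec.get_feature_names()\nprint X.shape",
"execution_count": null,
"outputs": []
}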
],
"metadata": {
"kernelspec": {
"name": "python2",
"display_name": "Python 2",
"language": "python"
},
"language_info": {
"mimetype": "text/x-python",
"nbconvert_exporter": "python",
"name": "python",
"pygments_lexer": "ipython2",
"version": "2.7.11",
"file_extension": ".py",
"codemirror_mode": {
"version": 2,
"name": "ipython"
}
},
"gist": {
"id": "",
"data": {
"description": "POS & Text Chunker using NLTK.ipynb",
"public": true
}
}
},
"nbformat": 4,
"nbformat_minor": 0
}