Created
May 11, 2017 15:08
-
-
Save JJRyan0/b0cb4015ba4e3e704f07e7ac84dffb14 to your computer and use it in GitHub Desktop.
POS & Text Chunker using NLTK.ipynb
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
{ | |
"cells": [ | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "## Part of speech tagging and text chunking using NLTK\n\n•\tDatasets:\n•\tAnonymous data with textual features converted to machine readable format for (Train and Test)\n\nApproach to analyzing documentation – Creating taxonomy of terms\n\nText Document Collection; Raw text with multiple docs from multiple regulators - The document title indicating general subject area." | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "**1. Parts-of-speech tagging**\n\nis the process of converting a sentence in the form of a list of words, into a list of tuples, where each tuple is of the form (word, tag). The tag is a part-of-speech tag and signifies whether the word is a noun, adjective, verb and so on. This is a necessary step before chunking." | |
}, | |
{ | |
"metadata": { | |
"collapsed": false, | |
"trusted": true | |
}, | |
"cell_type": "code", | |
"source": "#Pos tagger for python\nimport nltk\nimport string\nfrom collections import Counter\ndata = (\" Claim Number, Claim Closing Date, First Payment Date,Property Address,ZIP Code,Original Amount Disbursed\")\ntokens = nltk.word_tokenize(data)\npofs = nltk.pos_tag(data)\nprint (pofs)", | |
"execution_count": 1, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"text": "[(' ', 'JJ'), ('C', 'NNP'), ('l', 'NN'), ('a', 'DT'), ('i', 'JJ'), ('m', 'NN'), (' ', 'NNP'), ('N', 'NNP'), ('u', 'JJ'), ('m', 'NN'), ('b', 'NN'), ('e', 'NN'), ('r', 'NN'), (',', ','), (' ', 'NNP'), ('C', 'NNP'), ('l', 'VBZ'), ('a', 'DT'), ('i', 'JJ'), ('m', 'NN'), (' ', 'NNP'), ('C', 'NNP'), ('l', 'NN'), ('o', 'NN'), ('s', 'NN'), ('i', 'NN'), ('n', 'VBP'), ('g', 'NN'), (' ', 'NNP'), ('D', 'NNP'), ('a', 'DT'), ('t', 'NN'), ('e', 'NN'), (',', ','), (' ', 'NNP'), ('F', 'NNP'), ('i', 'NN'), ('r', 'VBP'), ('s', 'NN'), ('t', 'NN'), (' ', 'NNP'), ('P', 'NNP'), ('a', 'DT'), ('y', 'NN'), ('m', 'NN'), ('e', 'NN'), ('n', 'JJ'), ('t', 'NN'), (' ', 'NNP'), ('D', 'NNP'), ('a', 'DT'), ('t', 'NN'), ('e', 'NN'), (',', ','), ('P', 'NNP'), ('r', 'NN'), ('o', 'NN'), ('p', 'NN'), ('e', 'NN'), ('r', 'NN'), ('t', 'NN'), ('y', 'NN'), (' ', 'VBZ'), ('A', 'NNP'), ('d', 'NN'), ('d', 'NN'), ('r', 'NN'), ('e', 'NN'), ('s', 'NN'), ('s', 'NN'), (',', ','), ('Z', 'NNP'), ('I', 'PRP'), ('P', 'NNP'), (' ', 'NNP'), ('C', 'NNP'), ('o', 'MD'), ('d', 'VB'), ('e', 'NN'), (',', ','), ('O', 'NNP'), ('r', 'NN'), ('i', 'NN'), ('g', 'VBP'), ('i', 'NN'), ('n', 'VBP'), ('a', 'DT'), ('l', 'NN'), (' ', 'VBZ'), ('A', 'NNP'), ('m', 'NN'), ('o', 'NN'), ('u', 'JJ'), ('n', 'JJ'), ('t', 'NN'), (' ', 'NNP'), ('D', 'NNP'), ('i', 'NN'), ('s', 'NN'), ('b', 'NN'), ('u', 'JJ'), ('r', 'NN'), ('s', 'NN'), ('e', 'NN'), ('d', 'NN')]\n", | |
"name": "stdout" | |
} | |
] | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "Pyspark Version" | |
}, | |
{ | |
"metadata": { | |
"collapsed": false, | |
"trusted": true | |
}, | |
"cell_type": "code", | |
"source": "#pos tagger pyspark\n#data = sc.textFile('file path.txt')\n#import nltk \n#words = data.flatMap(lambda x: nltk.word_tokenize(x)) \n#print words.take(10) \n#pos_word = words.map(lambda x: nltk.pos_tag([x])) \n#print pos_word.take(5)", | |
"execution_count": 3, | |
"outputs": [] | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "**1.1 Chunker**\n\nWith parts-of-speech tags, a chunker knows how to identify phrases based on tag patterns. These POS tags are used for grammar analysis and word sense disambiguation. Chunking is shallow parsing" | |
}, | |
{ | |
"metadata": { | |
"scrolled": true, | |
"collapsed": false, | |
"trusted": true | |
}, | |
"cell_type": "code", | |
"source": "#Chunker Ver/Noun Phrase\nimport nltk \nfrom nltk.chunk.regexp import *\ntest = (\"Claim Number, Application Closing Date, First Payment Date,Property State,Property ZIP Code,Original Claim Amount Disbursed,Debt to Income Ratio, Back-End at Origination, Debt to Income (DTI), Credit Score, Origination Credit Bureau Score\")\n \ntest_pofs = nltk.pos_tag(nltk.word_tokenize(test))\n#verb phrase chunker\n#create the verb phrase rule. \n#-------------------------------------------------\n#Noun Phrase chunker\n#create the noun phrase \nrule_np = ChunkRule(r'(<DT>)?(<RB>?)?<JJ|CD>*(<JJ,CD><,>)*(<NN.*>)+', 'Chunk NPs')\n#create the parser for the verb phrase.\nparser_np = RegexpChunkParser([rule_np],chunk_label='NP')\nnp = parser_np.parse(test_pofs)\nprint np", | |
"execution_count": 4, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"text": "(S\n (NP Claim/NNP Number/NNP)\n ,/,\n (NP Application/NNP Closing/NNP Date/NNP)\n ,/,\n (NP First/NNP Payment/NNP Date/NNP)\n ,/,\n (NP Property/NNP State/NNP)\n ,/,\n (NP Property/NNP ZIP/NNP Code/NNP)\n ,/,\n (NP Original/NNP Claim/NNP Amount/NNP)\n Disbursed/VBD\n ,/,\n (NP Debt/NNP)\n to/TO\n (NP Income/NNP Ratio/NNP)\n ,/,\n (NP Back-End/NNP)\n at/IN\n (NP Origination/NNP)\n ,/,\n (NP Debt/NNP)\n to/TO\n (NP Income/NNP)\n (/(\n (NP DTI/NNP)\n )/)\n ,/,\n (NP Credit/NNP Score/NNP)\n ,/,\n (NP Origination/NNP Credit/NNP Bureau/NNP Score/NNP))\n", | |
"name": "stdout" | |
} | |
] | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "**2.DictVectorizer - Transformation**\n\nSequenece Classifier i.e Chunker to extract part of speech tags (POS)\n\noutput:\n\nget_feature_names() -> sparsematrix -> feature output = [Pos+1 =PP (\"the\"), Pos-1=NN(\"cat\"), Pos-2=DT(\"sat\"), word+1 = \"on\", word+2 \"the\"+ Pos+1 \"mat\"]" | |
} | |
], | |
"metadata": { | |
"kernelspec": { | |
"name": "python2", | |
"display_name": "Python 2", | |
"language": "python" | |
}, | |
"language_info": { | |
"mimetype": "text/x-python", | |
"nbconvert_exporter": "python", | |
"name": "python", | |
"pygments_lexer": "ipython2", | |
"version": "2.7.11", | |
"file_extension": ".py", | |
"codemirror_mode": { | |
"version": 2, | |
"name": "ipython" | |
} | |
}, | |
"gist": { | |
"id": "", | |
"data": { | |
"description": "POS & Text Chunker using NLTK.ipynb", | |
"public": true | |
} | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 0 | |
} |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment