## Part of speech tagging and text chunking using NLTK

• Datasets:
• Anonymized data with textual features converted to a machine-readable format (train and test sets)

Approach to analyzing documentation: creating a taxonomy of terms

Text document collection: raw text from multiple documents issued by multiple regulators, with each document title indicating the general subject area.

**1. Part-of-speech tagging**

Part-of-speech tagging converts a sentence, given as a list of words, into a list of tuples of the form (word, tag). The tag is a part-of-speech tag that indicates whether the word is a noun, adjective, verb, and so on. Tagging is a necessary step before chunking.
# POS tagger for Python
# Requires the NLTK tokenizer and tagger models (install them with nltk.download();
# the resource names vary by NLTK version)
import nltk
data = "Claim Number, Claim Closing Date, First Payment Date,Property Address,ZIP Code,Original Amount Disbursed"
# Split the raw string into word and punctuation tokens
tokens = nltk.word_tokenize(data)
# pos_tag expects a list of tokens. Passing the raw string `data` instead tags
# every character separately, e.g. [(' ', 'JJ'), ('C', 'NNP'), ('l', 'NN'), ...]
pofs = nltk.pos_tag(tokens)
# Prints a list of (word, tag) tuples
print(pofs)
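PySpark version
# POS tagger in PySpark (assumes an active SparkContext `sc`)
# import nltk
# data = sc.textFile('file path.txt')
# words = data.flatMap(lambda x: nltk.word_tokenize(x))
# print(words.take(10))
# Note: tagging each word on its own discards the sentence context
# pos_word = words.map(lambda x: nltk.pos_tag([x]))
# print(pos_word.take(5))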
**1.1 Chunker**

Given part-of-speech tags, a chunker identifies phrases based on tag patterns. POS tags are also used for grammar analysis and word sense disambiguation. Chunking is a form of shallow parsing.
# Verb/noun phrase chunker
import nltk
from nltk.chunk.regexp import ChunkRule, RegexpChunkParser
test = "Claim Number, Application Closing Date, First Payment Date,Property State,Property ZIP Code,Original Claim Amount Disbursed,Debt to Income Ratio, Back-End at Origination, Debt to Income (DTI), Credit Score, Origination Credit Bureau Score"

# Tokenize first, then tag the resulting list of tokens
test_pofs = nltk.pos_tag(nltk.word_tokenize(test))
# Verb phrase chunker
# Create the verb phrase rule (not shown)
# -------------------------------------------------
# Noun phrase chunker
# Create the noun phrase rule: optional determiner and adverb, any adjectives
# or cardinal numbers (optionally comma-separated), then one or more nouns
rule_np = ChunkRule(r'(<DT>)?(<RB>?)?<JJ|CD>*(<JJ|CD><,>)*(<NN.*>)+', 'Chunk NPs')
# Create the parser for the noun phrase
parser_np = RegexpChunkParser([rule_np], chunk_label='NP')
np_tree = parser_np.parse(test_pofs)
print(np_tree)
(S
  (NP Claim/NNP Number/NNP)
  ,/,
  (NP Application/NNP Closing/NNP Date/NNP)
  ,/,
  (NP First/NNP Payment/NNP Date/NNP)
  ,/,
  (NP Property/NNP State/NNP)
  ,/,
  (NP Property/NNP ZIP/NNP Code/NNP)
  ,/,
  (NP Original/NNP Claim/NNP Amount/NNP)
  Disbursed/VBD
  ,/,
  (NP Debt/NNP)
  to/TO
  (NP Income/NNP Ratio/NNP)
  ,/,
  (NP Back-End/NNP)
  at/IN
  (NP Origination/NNP)
  ,/,
  (NP Debt/NNP)
  to/TO
  (NP Income/NNP)
  (/(
  (NP DTI/NNP)
  )/)
  ,/,
  (NP Credit/NNP Score/NNP)
  ,/,
  (NP Origination/NNP Credit/NNP Bureau/NNP Score/NNP))
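
To turn chunks like the ones above into candidate taxonomy terms, the noun phrases can be collected and their words counted. The following is a minimal pure-Python sketch, not the NLTK chunker itself: it only mimics the `(<NN.*>)+` part of the rule by grouping consecutive noun tags. The names `tagged` and `noun_phrases` are hypothetical, and the tagged tokens are a subset copied from the output above.

```python
from collections import Counter

# A subset of the (word, tag) pairs shown in the chunker output above.
tagged = [
    ("Claim", "NNP"), ("Number", "NNP"), (",", ","),
    ("Application", "NNP"), ("Closing", "NNP"), ("Date", "NNP"), (",", ","),
    ("Original", "NNP"), ("Claim", "NNP"), ("Amount", "NNP"),
    ("Disbursed", "VBD"), (",", ","),
    ("Debt", "NNP"), ("to", "TO"), ("Income", "NNP"), ("Ratio", "NNP"),
]


def noun_phrases(tagged_tokens):
    """Join each maximal run of noun tags (NN, NNS, NNP, ...) into one phrase."""
    phrases, current = [], []
    for word, tag in tagged_tokens:
        if tag.startswith("NN"):
            current.append(word)
        elif current:
            # A non-noun tag closes the phrase collected so far.
            phrases.append(" ".join(current))
            current = []
    if current:
        phrases.append(" ".join(current))
    return phrases


phrases = noun_phrases(tagged)
# Count individual words across phrases to spot recurring terms.
word_counts = Counter(word for phrase in phrases for word in phrase.split())

print(phrases)
print(word_counts["Claim"])
```

Words that recur across phrases, such as "Claim" here, are natural candidates for top-level taxonomy entries, with the full phrases as their children.
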
**2. DictVectorizer - Transformation**

Sequence classifier, i.e. a chunker, to extract part-of-speech (POS) tags

output:

get_feature_names() -> sparse matrix -> feature output = [pos+1=PP ("the"), pos-1=NN ("cat"), pos-2=DT ("sat"), word+1="on", word+2="the" + pos+1 "mat"]
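
As a rough illustration of this transformation, the sketch below builds one feature dictionary per token from its neighbouring words and tags, then passes the dictionaries to scikit-learn's `DictVectorizer`. It assumes scikit-learn is installed. The sentence is tagged by hand with illustrative Penn Treebank tags, and `context_features` and the `<PAD>` placeholder are hypothetical names, not part of any library.

```python
from sklearn.feature_extraction import DictVectorizer

# Hand-tagged example sentence (tags are illustrative, not tagger output).
sentence = [("the", "DT"), ("cat", "NN"), ("sat", "VBD"),
            ("on", "IN"), ("the", "DT"), ("mat", "NN")]


def context_features(tagged, i):
    """Describe token i by its own word plus neighbouring words and POS tags."""
    def word(j):
        return tagged[j][0] if 0 <= j < len(tagged) else "<PAD>"

    def pos(j):
        return tagged[j][1] if 0 <= j < len(tagged) else "<PAD>"

    return {
        "word": word(i),
        "pos-2": pos(i - 2),
        "pos-1": pos(i - 1),
        "pos+1": pos(i + 1),
        "word+1": word(i + 1),
        "word+2": word(i + 2),
    }


features = [context_features(sentence, i) for i in range(len(sentence))]

# String-valued features are one-hot encoded as "name=value" columns.
vectorizer = DictVectorizer(sparse=True)
X = vectorizer.fit_transform(features)      # SciPy sparse matrix, one row per token
feature_names = vectorizer.feature_names_   # e.g. "pos-1=NN", "word+1=on"

print(features[2])   # feature dict for the token "sat"
print(X.shape[0])    # number of rows equals number of tokens
```

The `feature_names_` attribute is used here because it is available across scikit-learn versions; older releases also offer `get_feature_names()`, while newer ones provide `get_feature_names_out()` instead.
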
