JJRyan0/Semantic Text Analysis using NLTK.ipynb

## Semantic Text Analysis using NLTK.ipynb
{
  "cells": [
    {
      "metadata": {},
      "cell_type": "markdown",
      "source": "## Part of speech tagging and text chunking using NLTK\n\n•\tDatasets:\n•\tAnonymous data with textual features converted to machine readable format for  (Train and Test)\n\nApproach to analyzing documentation – Creating taxonomy of terms\n\nText Document Collection; Raw text with multiple docs from multiple regulators - The document title indicating general subject area."
    },
    {
      "metadata": {},
      "cell_type": "markdown",
      "source": "**1. Parts-of-speech tagging**\n\nis the process of converting a sentence in the form of a list of words, into a list of tuples, where each tuple is of the form (word, tag). The tag is a part-of-speech tag and signifies whether the word is a noun, adjective, verb and so on. This is a necessary step before chunking."
    },
    {
      "metadata": {
        "collapsed": false,
        "trusted": true
      },
      "cell_type": "code",
      "source": "#Pos tagger for python\nimport nltk\nimport string\nfrom collections import Counter\ndata = (\" Claim Number, Claim Closing Date, First Payment Date,Property Address,ZIP Code,Original Amount Disbursed\")\ntokens = nltk.word_tokenize(data)\npofs = nltk.pos_tag(data)\nprint (pofs)",
      "execution_count": 1,
      "outputs": [
        {
          "output_type": "stream",
          "text": "[(' ', 'JJ'), ('C', 'NNP'), ('l', 'NN'), ('a', 'DT'), ('i', 'JJ'), ('m', 'NN'), (' ', 'NNP'), ('N', 'NNP'), ('u', 'JJ'), ('m', 'NN'), ('b', 'NN'), ('e', 'NN'), ('r', 'NN'), (',', ','), (' ', 'NNP'), ('C', 'NNP'), ('l', 'VBZ'), ('a', 'DT'), ('i', 'JJ'), ('m', 'NN'), (' ', 'NNP'), ('C', 'NNP'), ('l', 'NN'), ('o', 'NN'), ('s', 'NN'), ('i', 'NN'), ('n', 'VBP'), ('g', 'NN'), (' ', 'NNP'), ('D', 'NNP'), ('a', 'DT'), ('t', 'NN'), ('e', 'NN'), (',', ','), (' ', 'NNP'), ('F', 'NNP'), ('i', 'NN'), ('r', 'VBP'), ('s', 'NN'), ('t', 'NN'), (' ', 'NNP'), ('P', 'NNP'), ('a', 'DT'), ('y', 'NN'), ('m', 'NN'), ('e', 'NN'), ('n', 'JJ'), ('t', 'NN'), (' ', 'NNP'), ('D', 'NNP'), ('a', 'DT'), ('t', 'NN'), ('e', 'NN'), (',', ','), ('P', 'NNP'), ('r', 'NN'), ('o', 'NN'), ('p', 'NN'), ('e', 'NN'), ('r', 'NN'), ('t', 'NN'), ('y', 'NN'), (' ', 'VBZ'), ('A', 'NNP'), ('d', 'NN'), ('d', 'NN'), ('r', 'NN'), ('e', 'NN'), ('s', 'NN'), ('s', 'NN'), (',', ','), ('Z', 'NNP'), ('I', 'PRP'), ('P', 'NNP'), (' ', 'NNP'), ('C', 'NNP'), ('o', 'MD'), ('d', 'VB'), ('e', 'NN'), (',', ','), ('O', 'NNP'), ('r', 'NN'), ('i', 'NN'), ('g', 'VBP'), ('i', 'NN'), ('n', 'VBP'), ('a', 'DT'), ('l', 'NN'), (' ', 'VBZ'), ('A', 'NNP'), ('m', 'NN'), ('o', 'NN'), ('u', 'JJ'), ('n', 'JJ'), ('t', 'NN'), (' ', 'NNP'), ('D', 'NNP'), ('i', 'NN'), ('s', 'NN'), ('b', 'NN'), ('u', 'JJ'), ('r', 'NN'), ('s', 'NN'), ('e', 'NN'), ('d', 'NN')]\n",
          "name": "stdout"
        }
      ]
    },
    {
      "metadata": {},
      "cell_type": "markdown",
      "source": "Pyspark Version"
    },
    {
      "metadata": {
        "collapsed": false,
        "trusted": true
      },
      "cell_type": "code",
      "source": "#pos tagger pyspark\n#data = sc.textFile('file path.txt')\n#import nltk      \n#words = data.flatMap(lambda x: nltk.word_tokenize(x))      \n#print words.take(10)     \n#pos_word = words.map(lambda x: nltk.pos_tag([x]))   \n#print pos_word.take(5)",
      "execution_count": 3,
      "outputs": []
    },
    {
      "metadata": {},
      "cell_type": "markdown",
      "source": "**1.1 Chunker**\n\nWith parts-of-speech tags, a chunker knows how to identify phrases based on tag patterns. These POS tags are used for grammar analysis and word sense disambiguation. Chunking is shallow parsing"
    },
    {
      "metadata": {
        "scrolled": true,
        "collapsed": false,
        "trusted": true
      },
      "cell_type": "code",
      "source": "#Chunker Ver/Noun Phrase\nimport nltk \nfrom nltk.chunk.regexp import *\ntest = (\"Claim Number, Application Closing Date, First Payment Date,Property State,Property ZIP Code,Original Claim Amount Disbursed,Debt to Income Ratio, Back-End at Origination, Debt to Income (DTI), Credit Score, Origination Credit Bureau Score\")\n        \ntest_pofs = nltk.pos_tag(nltk.word_tokenize(test))\n#verb phrase chunker\n#create the verb phrase rule. \n#-------------------------------------------------\n#Noun Phrase chunker\n#create the noun phrase \nrule_np = ChunkRule(r'(<DT>)?(<RB>?)?<JJ|CD>*(<JJ,CD><,>)*(<NN.*>)+', 'Chunk NPs')\n#create the parser for the verb phrase.\nparser_np = RegexpChunkParser([rule_np],chunk_label='NP')\nnp = parser_np.parse(test_pofs)\nprint np",
      "execution_count": 4,
      "outputs": [
        {
          "output_type": "stream",
          "text": "(S\n  (NP Claim/NNP Number/NNP)\n  ,/,\n  (NP Application/NNP Closing/NNP Date/NNP)\n  ,/,\n  (NP First/NNP Payment/NNP Date/NNP)\n  ,/,\n  (NP Property/NNP State/NNP)\n  ,/,\n  (NP Property/NNP ZIP/NNP Code/NNP)\n  ,/,\n  (NP Original/NNP Claim/NNP Amount/NNP)\n  Disbursed/VBD\n  ,/,\n  (NP Debt/NNP)\n  to/TO\n  (NP Income/NNP Ratio/NNP)\n  ,/,\n  (NP Back-End/NNP)\n  at/IN\n  (NP Origination/NNP)\n  ,/,\n  (NP Debt/NNP)\n  to/TO\n  (NP Income/NNP)\n  (/(\n  (NP DTI/NNP)\n  )/)\n  ,/,\n  (NP Credit/NNP Score/NNP)\n  ,/,\n  (NP Origination/NNP Credit/NNP Bureau/NNP Score/NNP))\n",
          "name": "stdout"
        }
      ]
    },
    {
      "metadata": {},
      "cell_type": "markdown",
      "source": "**2.DictVectorizer - Transformation**\n\nSequenece Classifier i.e Chunker to extract part of speech tags (POS)\n\noutput:\n\nget_feature_names() -> sparsematrix -> feature output = [Pos+1 =PP (\"the\"), Pos-1=NN(\"cat\"), Pos-2=DT(\"sat\"), word+1 = \"on\", word+2 \"the\"+ Pos+1 \"mat\"]"
    }
  ],
  "metadata": {
    "kernelspec": {
      "name": "python2",
      "display_name": "Python 2",
      "language": "python"
    },
    "language_info": {
      "mimetype": "text/x-python",
      "nbconvert_exporter": "python",
      "name": "python",
      "pygments_lexer": "ipython2",
      "version": "2.7.11",
      "file_extension": ".py",
      "codemirror_mode": {
        "version": 2,
        "name": "ipython"
      }
    },
    "gist": {
      "id": "",
      "data": {
        "description": "POS & Text Chunker using NLTK.ipynb",
        "public": true
      }
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}
	{
	"cells": [
	{
	"metadata": {},
	"cell_type": "markdown",
	"source": "## Part of speech tagging and text chunking using NLTK\n\n•\tDatasets:\n•\tAnonymous data with textual features converted to machine readable format for (Train and Test)\n\nApproach to analyzing documentation – Creating taxonomy of terms\n\nText Document Collection; Raw text with multiple docs from multiple regulators - The document title indicating general subject area."
	},
	{
	"metadata": {},
	"cell_type": "markdown",
	"source": "1. Parts-of-speech tagging\n\nis the process of converting a sentence in the form of a list of words, into a list of tuples, where each tuple is of the form (word, tag). The tag is a part-of-speech tag and signifies whether the word is a noun, adjective, verb and so on. This is a necessary step before chunking."
	},
	{
	"metadata": {
	"collapsed": false,
	"trusted": true
	},
	"cell_type": "code",
	"source": "#Pos tagger for python\nimport nltk\nimport string\nfrom collections import Counter\ndata = (\" Claim Number, Claim Closing Date, First Payment Date,Property Address,ZIP Code,Original Amount Disbursed\")\ntokens = nltk.word_tokenize(data)\npofs = nltk.pos_tag(data)\nprint (pofs)",
	"execution_count": 1,
	"outputs": [
	{
	"output_type": "stream",
	"text": "[(' ', 'JJ'), ('C', 'NNP'), ('l', 'NN'), ('a', 'DT'), ('i', 'JJ'), ('m', 'NN'), (' ', 'NNP'), ('N', 'NNP'), ('u', 'JJ'), ('m', 'NN'), ('b', 'NN'), ('e', 'NN'), ('r', 'NN'), (',', ','), (' ', 'NNP'), ('C', 'NNP'), ('l', 'VBZ'), ('a', 'DT'), ('i', 'JJ'), ('m', 'NN'), (' ', 'NNP'), ('C', 'NNP'), ('l', 'NN'), ('o', 'NN'), ('s', 'NN'), ('i', 'NN'), ('n', 'VBP'), ('g', 'NN'), (' ', 'NNP'), ('D', 'NNP'), ('a', 'DT'), ('t', 'NN'), ('e', 'NN'), (',', ','), (' ', 'NNP'), ('F', 'NNP'), ('i', 'NN'), ('r', 'VBP'), ('s', 'NN'), ('t', 'NN'), (' ', 'NNP'), ('P', 'NNP'), ('a', 'DT'), ('y', 'NN'), ('m', 'NN'), ('e', 'NN'), ('n', 'JJ'), ('t', 'NN'), (' ', 'NNP'), ('D', 'NNP'), ('a', 'DT'), ('t', 'NN'), ('e', 'NN'), (',', ','), ('P', 'NNP'), ('r', 'NN'), ('o', 'NN'), ('p', 'NN'), ('e', 'NN'), ('r', 'NN'), ('t', 'NN'), ('y', 'NN'), (' ', 'VBZ'), ('A', 'NNP'), ('d', 'NN'), ('d', 'NN'), ('r', 'NN'), ('e', 'NN'), ('s', 'NN'), ('s', 'NN'), (',', ','), ('Z', 'NNP'), ('I', 'PRP'), ('P', 'NNP'), (' ', 'NNP'), ('C', 'NNP'), ('o', 'MD'), ('d', 'VB'), ('e', 'NN'), (',', ','), ('O', 'NNP'), ('r', 'NN'), ('i', 'NN'), ('g', 'VBP'), ('i', 'NN'), ('n', 'VBP'), ('a', 'DT'), ('l', 'NN'), (' ', 'VBZ'), ('A', 'NNP'), ('m', 'NN'), ('o', 'NN'), ('u', 'JJ'), ('n', 'JJ'), ('t', 'NN'), (' ', 'NNP'), ('D', 'NNP'), ('i', 'NN'), ('s', 'NN'), ('b', 'NN'), ('u', 'JJ'), ('r', 'NN'), ('s', 'NN'), ('e', 'NN'), ('d', 'NN')]\n",
	"name": "stdout"
	}
	]
	},
	{
	"metadata": {},
	"cell_type": "markdown",
	"source": "Pyspark Version"
	},
	{
	"metadata": {
	"collapsed": false,
	"trusted": true
	},
	"cell_type": "code",
	"source": "#pos tagger pyspark\n#data = sc.textFile('file path.txt')\n#import nltk \n#words = data.flatMap(lambda x: nltk.word_tokenize(x)) \n#print words.take(10) \n#pos_word = words.map(lambda x: nltk.pos_tag([x])) \n#print pos_word.take(5)",
	"execution_count": 3,
	"outputs": []
	},
	{
	"metadata": {},
	"cell_type": "markdown",
	"source": "1.1 Chunker\n\nWith parts-of-speech tags, a chunker knows how to identify phrases based on tag patterns. These POS tags are used for grammar analysis and word sense disambiguation. Chunking is shallow parsing"
	},
	{
	"metadata": {
	"scrolled": true,
	"collapsed": false,
	"trusted": true
	},
	"cell_type": "code",
	"source": "#Chunker Ver/Noun Phrase\nimport nltk \nfrom nltk.chunk.regexp import \ntest = (\"Claim Number, Application Closing Date, First Payment Date,Property State,Property ZIP Code,Original Claim Amount Disbursed,Debt to Income Ratio, Back-End at Origination, Debt to Income (DTI), Credit Score, Origination Credit Bureau Score\")\n \ntest_pofs = nltk.pos_tag(nltk.word_tokenize(test))\n#verb phrase chunker\n#create the verb phrase rule. \n#-------------------------------------------------\n#Noun Phrase chunker\n#create the noun phrase \nrule_np = ChunkRule(r'(<DT>)?(<RB>?)?<JJ\|CD>(<JJ,CD><,>)(<NN.>)+', 'Chunk NPs')\n#create the parser for the verb phrase.\nparser_np = RegexpChunkParser([rule_np],chunk_label='NP')\nnp = parser_np.parse(test_pofs)\nprint np",
	"execution_count": 4,
	"outputs": [
	{
	"output_type": "stream",
	"text": "(S\n (NP Claim/NNP Number/NNP)\n ,/,\n (NP Application/NNP Closing/NNP Date/NNP)\n ,/,\n (NP First/NNP Payment/NNP Date/NNP)\n ,/,\n (NP Property/NNP State/NNP)\n ,/,\n (NP Property/NNP ZIP/NNP Code/NNP)\n ,/,\n (NP Original/NNP Claim/NNP Amount/NNP)\n Disbursed/VBD\n ,/,\n (NP Debt/NNP)\n to/TO\n (NP Income/NNP Ratio/NNP)\n ,/,\n (NP Back-End/NNP)\n at/IN\n (NP Origination/NNP)\n ,/,\n (NP Debt/NNP)\n to/TO\n (NP Income/NNP)\n (/(\n (NP DTI/NNP)\n )/)\n ,/,\n (NP Credit/NNP Score/NNP)\n ,/,\n (NP Origination/NNP Credit/NNP Bureau/NNP Score/NNP))\n",
	"name": "stdout"
	}
	]
	},
	{
	"metadata": {},
	"cell_type": "markdown",
	"source": "2.DictVectorizer - Transformation\n\nSequenece Classifier i.e Chunker to extract part of speech tags (POS)\n\noutput:\n\nget_feature_names() -> sparsematrix -> feature output = [Pos+1 =PP (\"the\"), Pos-1=NN(\"cat\"), Pos-2=DT(\"sat\"), word+1 = \"on\", word+2 \"the\"+ Pos+1 \"mat\"]"
	}
	],
	"metadata": {
	"kernelspec": {
	"name": "python2",
	"display_name": "Python 2",
	"language": "python"
	},
	"language_info": {
	"mimetype": "text/x-python",
	"nbconvert_exporter": "python",
	"name": "python",
	"pygments_lexer": "ipython2",
	"version": "2.7.11",
	"file_extension": ".py",
	"codemirror_mode": {
	"version": 2,
	"name": "ipython"
	}
	},
	"gist": {
	"id": "",
	"data": {
	"description": "POS & Text Chunker using NLTK.ipynb",
	"public": true
	}
	}
	},
	"nbformat": 4,
	"nbformat_minor": 0
	}