NLP-with-Julia
@aviks · Last active August 11, 2019
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Natural Language Processing in Julia\n",
"\n",
"Julia has a rich ecosystem of tools for working with natural language. In this notebook, we work through a few simple cases and provide pointers to other possibilities."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Loading Data\n",
"\n",
"The first step in working with a corpus of text is usually to load it into memory. Julia can process various source formats, such as PDF, XML, or CSV files. It can also load certain corpus-specific formats, such as Semcor or Semeval files. \n",
"\n",
"We will work with a corpus of Australian legal decisions, in PDF format. "
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"using Glob"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The data is delivered as a set of PDF files: 3,536 documents in all. "
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"3536-element Array{String,1}:\n",
" \"corpus/pdfs/06_100.pdf\" \n",
" \"corpus/pdfs/06_1001.pdf\"\n",
" \"corpus/pdfs/06_1004.pdf\"\n",
" \"corpus/pdfs/06_1005.pdf\"\n",
" \"corpus/pdfs/06_1006.pdf\"\n",
" \"corpus/pdfs/06_1015.pdf\"\n",
" \"corpus/pdfs/06_1017.pdf\"\n",
" \"corpus/pdfs/06_1018.pdf\"\n",
" \"corpus/pdfs/06_102.pdf\" \n",
" \"corpus/pdfs/06_1021.pdf\"\n",
" \"corpus/pdfs/06_1022.pdf\"\n",
" \"corpus/pdfs/06_1023.pdf\"\n",
" \"corpus/pdfs/06_1026.pdf\"\n",
" ⋮ \n",
" \"corpus/pdfs/09_976.pdf\" \n",
" \"corpus/pdfs/09_977.pdf\" \n",
" \"corpus/pdfs/09_978.pdf\" \n",
" \"corpus/pdfs/09_979.pdf\" \n",
" \"corpus/pdfs/09_980.pdf\" \n",
" \"corpus/pdfs/09_981.pdf\" \n",
" \"corpus/pdfs/09_983.pdf\" \n",
" \"corpus/pdfs/09_984.pdf\" \n",
" \"corpus/pdfs/09_985.pdf\" \n",
" \"corpus/pdfs/09_99.pdf\" \n",
" \"corpus/pdfs/09_992.pdf\" \n",
" \"corpus/pdfs/09_996.pdf\" "
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"files = Glob.glob(\"corpus/pdfs/*.pdf\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We look inside one of the files to see how the data is arranged."
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false,
"scrolled": false
},
"source": [
"![Example PDF](pdfimg.png)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `Taro` Julia package provides a robust PDF reader based on the Java `Apache Tika` library. We use it to load and parse all the PDF files in the directory."
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"using Taro\n",
"Taro.init()"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"meta, text = Taro.extract(files[1]);"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"Dict{String,String} with 20 entries:\n",
" \"access_permission:can_print\" => \"true\"\n",
" \"access_permission:fill_in_form\" => \"true\"\n",
" \"access_permission:modify_annotatio… => \"true\"\n",
" \"dc:format\" => \"application/pdf; version=1.4\"\n",
" \"dcterms:created\" => \"2017-08-31T10:42:17Z\"\n",
" \"xmpTPg:NPages\" => \"3\"\n",
" \"created\" => \"Thu Aug 31 11:42:17 BST 2017\"\n",
" \"Creation-Date\" => \"2017-08-31T10:42:17Z\"\n",
" \"meta:creation-date\" => \"2017-08-31T10:42:17Z\"\n",
" \"access_permission:assemble_documen… => \"true\"\n",
" \"X-Parsed-By\" => \"org.apache.tika.parser.DefaultParser\"\n",
" \"access_permission:can_print_degrad… => \"true\"\n",
" \"access_permission:can_modify\" => \"true\"\n",
" \"access_permission:extract_content\" => \"true\"\n",
" \"pdf:encrypted\" => \"false\"\n",
" \"producer\" => \"Apache FOP Version 2.0\"\n",
" \"pdf:PDFVersion\" => \"1.4\"\n",
" \"access_permission:extract_for_acce… => \"true\"\n",
" \"xmp:CreatorTool\" => \"Apache FOP Version 2.0\"\n",
" \"Content-Type\" => \"application/pdf\""
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"meta"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"\"\\nLawrance v Human Rights and Equal Opportunity\\nCommission [2006] FCA 100 (9 February 2006)\\n\\n1 These are two applications for orders of review under the Administrative Decisions\\n(Judicial Review) Act 1977 (Cth) (\\\"the AD(JR) Act\\\").\\nThey concern correspondence sent to the Human Rights and Equal Opportunity\\nCommission (\\\"the Commission\\\") by the applicant in late 2005.\\nIn a letter dated 26 September 2005, the applicant wrote to the Commission concerning\\nallegations of unlawful discrimination.\\nThe Commission replied in a letter dated 7 October 2005, in which it indicated that it\\nwas not able to assist her.\\nOn 13 October 2005, the applicant again wrote to the Commission.\\nThat letter addressed alleged breaches of human rights.\\nThe applicant wrote to the Commission a third time on 7 November 2005, this time\\nconcerning allegations of sexual harassment.\\nThe Commission did not respond to the second and third letters it received from the\\napplicant.\\n2 It is now accepted in these proceedings in which the Commonwealth Attorney-\\nGeneral has intervened to act as a contradictor, that the Commission either failed to\\nrespond or responded inappropriately to the applicant's correspondence.\\nAccordingly, it is accepted by the respondents and the Attorney-General that in these\\ntwo applications it is appropriate that orders be made.\\nIt was suggested by the second respondent and the Attorney-General that orders with\\nsubstantial operative effect should be made only in matter NSD 1957 of 2005, that is\\norders made setting aside what had been done by the Commission.\\nAs to NSD 2340 of 2005, it was proposed that the application be dismissed.\\nIn my opinion, however, the preferable course is for orders to be made in both matters,\\neven though this may involve duplication, in terms of the ultimate legal effect of such\\norders.\\n3 One point of difference between the applicant and the respondents and the Attorney-\\nGeneral as to 
what orders should be made concerns whether orders should be made\\nrequiring the President of the Commission (\\\"the President\\\") to exercise certain powers\\nin relation to some of the complaints.\\nThe statutory scheme involves the lodgement of certain complaints, relevantly, unlawful\\ndiscrimination and sexual harassment, and the referral of the complaints to the\\nPresident under s 46PD of the Human Rights and Equal Opportunity Commission Act\\n1986 (Cth) (\\\"the HREOC Act \\\").\\nThereafter the Act requires the President to inquire and attempt to conciliate.\\nThat obligation is found in s 46PF of the HREOC Act .\\nHowever, that general duty is qualified by the power conferred on the President by s\\n46PH to terminate a complaint for one or more of the nominated grounds.\\n4 The applicant has submitted that in addition to any order requiring the Commission\\nto refer some of her complaints to the President under s 46PD of the HREOC Act ,\\n\\n\\n\\nthere should be orders made requiring the President to inquire and attempt to conciliate\\nunder s 46PF and certain other orders concerning attendant powers of the President,\\nincluding holding a conference under s 46PJ.\\n5 In my opinion, it is sufficient that an order be made requiring the Commission to refer\\nthe matter to the President.\\nI am of that view for two reasons.\\nFirstly, although this is a subsidiary issue, the President is not presently a party to these\\nproceedings.\\nThe utility of an order requiring the President to do some act under s 46P and later\\nprovisions, in the absence of the President being a party, is questionable.\\nHowever, that procedural deficiency could, of course, be remedied by the joinder of\\nthe President.\\n6 The more significant reason why I do not propose to make the orders sought by the\\napplicant is that the HREOC Act itself directs the President to do certain things upon\\nthe referral to him or her of a complaint under s 46PD.\\nIt seems unnecessary and 
probably inappropriate for this Court to make an order\\nrequiring the President to do that which the Act requires the President to do.\\nIn addition, some of the orders sought by the applicant presuppose that the matter will\\nbe dealt with by the President in a particular way.\\nThey presuppose, for example, that the President will not exercise the power under\\ns 46PH.\\nThat power is a discretionary power available to the President and it would be\\ninappropriate to make an order that assumes it will not be exercised.\\n7 Accordingly, I propose to make an order only that the complaints be referred to the\\nPresident.\\nIt will then be a matter for the President to determine what powers are exercised and the\\nmanner in which they are exercised consistent with the statutory scheme which prima\\nfacie requires the President to inquire into and attempt to conciliate the complaint.\\n8 Insofar as the complaint involving an allegation of a breach of human rights is\\nconcerned, it is accepted by the respondents that an appropriate order is that the matter\\nbe remitted to the Commission to be dealt with in accordance with s 20(1)(b).\\nThat paragraph provides that the Commission is required to perform certain functions\\nidentified in s 11(1)(f) , namely to inquire into any act or practice that may be inconsistent\\nwith or contrary to any human right, and consequential conduct.\\n9 The order proposed by the Attorney-General on his behalf and on behalf of the\\nrespondents is sufficiently prescriptive to enliven the processes flowing from s 20(1) of\\nthe HREOC Act (which, subject to s 20(2), places an obligation on the Commission to\\nperform the functions referred to in s 11(f)).\\n10 Another matter is an application by the applicant that her name be suppressed.\\nAlso, the applicant requested that orders be made in matter NSD 2340 of 2005\\ndirected to the respondents concerning the matters to which the complaints relate and\\nrestraining their 
conduct.\\n\\n\\n\\n11 Dealing with that second matter, it seems inappropriate to make the orders sought.\\nThe power to do so is found in s 16(1)(d) of the AD(JR) Act which empowers the Court\\nto make an order relevantly directing a party to refrain from doing any act or thing.\\nThat power is enlivened where the Court considers it is necessary to do justice between\\nthe parties.\\nI am not affirmatively satisfied that the orders sought are orders that are necessary to\\ndo justice between the parties.\\n12 Insofar as the alleged conduct of the second respondent is concerned, that is the\\nmatter that is the subject of complaint and it is appropriate that the procedures under\\nthe HREOC Act and related legislation be allowed to operate.\\nAs to the other orders proposed by the applicant are concerned, I am not satisfied they\\nare necessary to do justice between the parties.\\n13 In relation to the proposed order suppressing the name of the applicant, the power\\nto make such an order is found in s 50 of the Federal Court of Australia Act 1976 (Cth)\\n(\\\"the FCA Act \\\").\\nSuch an order can be made by the Court if it is of the view it is necessary in order to\\nprevent prejudice to the administration of justice.\\nAs counsel for the second respondent and the Attorney-General said, the ordinary\\nposition is that such orders are not made and the C-ourt needs to be affirmatively\\nsatisfied that the suppression or non-disclosure order is necessary.\\n14 The applicant has elected to bring these applications, as she is entitled to do.\\nIt is true that, in effect, the Commission has accepted that her earlier complaints have\\nnot been dealt with in accordance with the legislative regime.\\nIn those circumstances, the applicant, has been required to avail herself of the Court's\\nprocedures and processes to vindicate her rights under the HREOC Act to have\\ncomplaints dealt with in accordance with the legislative scheme.\\nHowever, in my view, that is not a 
sufficient reason to make an order under s 50 of\\nthe FCA Act .\\nThe published reasons of this Court, which will reveal her name, are limited to a\\nbroad description of the various complaints she has sought to have the Commission\\ninvestigate.\\nI certify that the preceding fourteen (14) numbered paragraphs are a true copy of the\\nReasons for Judgment herein of the Honourable Justice Moore.\\nAssociate: Dated: 28 February 2006 The Applicant appeared in person Solicitor for\\nthe First Respondent: Human Rights and Equal Opportunity Commission Counsel for\\nthe Second Respondent and the Attorney-General intervening: K Eastman Solicitor for\\nthe Second Respondent and the Attorney-General intervening: Australian Government\\nSolicitor Date of Hearing: 9 February 2006 Date of Judgment: 9 February 2006\\nAustLII: Copyright Policy | Disclaimers | Privacy Policy | Feedback URL: http://\\nwww.austlii.edu.au/au/cases/cth/FCA/2006/100.html\\n\\n\\n\""
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"text"
]
},
{
"cell_type": "code",
"execution_count": 106,
"metadata": {
"collapsed": true
},
"outputs": [
{
"data": {
"text/plain": [
"getTitle (generic function with 1 method)"
]
},
"execution_count": 106,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Reassemble hard-wrapped lines into sentences: a blank line or terminal\n",
"# punctuation ends the current sentence; other lines are continuations.\n",
"function getSentences(t)\n",
"    s = strip.(split(t, '\\n', keep=true))\n",
"    c = s[1]\n",
"    results = String[]\n",
"    for i = 2:size(s, 1)\n",
"        if s[i] == \"\" || endswith(strip(c), ['.', '?', '!'])\n",
"            if c != \"\"; push!(results, strip(c)); end\n",
"            c = s[i]\n",
"        else\n",
"            c = c * \" \" * s[i]  # join a wrapped line with a space\n",
"        end\n",
"    end\n",
"    if c != \"\"; push!(results, strip(c)); end\n",
"    return results\n",
"end\n",
"\n",
"getTitle(t) = getSentences(t)[1]"
]
},
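{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick illustration (not part of the original analysis), we can run `getSentences` on a small, invented snippet of text to see how hard-wrapped lines are re-joined into sentences:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"getSentences(\"Title line\\n\\nThis sentence was\\nwrapped across two lines.\\nNext sentence.\")"
]
},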
{
"cell_type": "code",
"execution_count": 31,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"\"Lawrance v Human Rights and Equal Opportunity Commission [2006] FCA 100 (9 February 2006)\""
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"getTitle(text) "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We use the `TextAnalysis` Julia package for basic text analysis tasks."
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"using TextAnalysis\n",
"using Languages"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First, we extract the text from the PDF documents, and then create a `Corpus` from them (we use the first 1,000 documents here). "
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {
"collapsed": true
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"ERROR FlateFilter: stop reading corrupt stream due to a DataFormatException\n"
]
},
{
"data": {
"text/plain": [
"A Corpus"
]
},
"execution_count": 50,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"docs = Any[]\n",
"for i in 1:1000   # load the first 1,000 documents\n",
"    try\n",
"        meta, txt = Taro.extract(files[i])\n",
"        title = getTitle(txt)\n",
"        dm = TextAnalysis.DocumentMetadata(EnglishLanguage, title, \"\", meta[\"Creation-Date\"])\n",
"        doc = StringDocument(txt, dm)\n",
"        push!(docs, doc)\n",
"    catch e\n",
"        # Skip files that fail to parse (e.g. corrupt PDF streams)\n",
"        @show e\n",
"    end\n",
"end\n",
"\n",
"crps = Corpus(docs)"
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"A Corpus"
]
},
"execution_count": 52,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"crps"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `TextAnalysis` package contains basic utilities for pre-processing the documents: removing punctuation, stop words, and non-letter characters, and normalising case. Finally, we stem the words using the built-in `Porter2` stemmer. "
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"prepare!(crps, strip_non_letters | strip_punctuation | strip_case | strip_stopwords)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [],
"source": [
"stem!(crps)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Having normalised the words in the corpus, we can now start to process it. First, we generate the `lexicon` (i.e., the dictionary of all words that comprise the corpus), and then create an inverse index, so that for any word we can quickly find the documents it appears in. "
]
},
{
"cell_type": "code",
"execution_count": 55,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"update_lexicon!(crps)\n",
"update_inverse_index!(crps)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For example, we can see that the word \"_injustice_\" appears in 108 documents in the corpus. "
]
},
{
"cell_type": "code",
"execution_count": 59,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"108-element Array{Int64,1}:\n",
" 7\n",
" 22\n",
" 30\n",
" 41\n",
" 46\n",
" 48\n",
" 62\n",
" 67\n",
" 88\n",
" 115\n",
" 121\n",
" 122\n",
" 133\n",
" ⋮\n",
" 864\n",
" 875\n",
" 892\n",
" 901\n",
" 910\n",
" 923\n",
" 933\n",
" 935\n",
" 938\n",
" 956\n",
" 991\n",
" 999"
]
},
"execution_count": 59,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"crps[\"injustice\"]"
]
},
{
"cell_type": "code",
"execution_count": 60,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"A DocumentTermMatrix"
]
},
"execution_count": 60,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"m = DocumentTermMatrix(crps)"
]
},
{
"cell_type": "code",
"execution_count": 61,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"1000×34140 sparse matrix with 620593 Int64 nonzero entries:\n",
"\t[2 , 1] = 6\n",
"\t[5 , 1] = 1\n",
"\t[7 , 1] = 28\n",
"\t[9 , 1] = 15\n",
"\t[10 , 1] = 3\n",
"\t[11 , 1] = 7\n",
"\t[15 , 1] = 1\n",
"\t[16 , 1] = 4\n",
"\t[17 , 1] = 3\n",
"\t[18 , 1] = 1\n",
"\t⋮\n",
"\t[716 , 34132] = 1\n",
"\t[716 , 34133] = 12\n",
"\t[831 , 34133] = 1\n",
"\t[716 , 34134] = 1\n",
"\t[831 , 34134] = 1\n",
"\t[69 , 34135] = 1\n",
"\t[69 , 34136] = 45\n",
"\t[69 , 34137] = 1\n",
"\t[69 , 34138] = 1\n",
"\t[69 , 34139] = 1\n",
"\t[69 , 34140] = 2"
]
},
"execution_count": 61,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dt = dtm(m)"
]
},
{
"cell_type": "code",
"execution_count": 62,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"1000×34140 sparse matrix with 620593 Float64 nonzero entries:\n",
"\t[2 , 1] = 0.000700388\n",
"\t[5 , 1] = 0.000930835\n",
"\t[7 , 1] = 0.00535259\n",
"\t[9 , 1] = 0.00446855\n",
"\t[10 , 1] = 0.00324068\n",
"\t[11 , 1] = 0.00713382\n",
"\t[15 , 1] = 0.000198796\n",
"\t[16 , 1] = 0.00201311\n",
"\t[17 , 1] = 0.00151857\n",
"\t[18 , 1] = 0.000299652\n",
"\t⋮\n",
"\t[716 , 34132] = 0.00425616\n",
"\t[716 , 34133] = 0.045949\n",
"\t[831 , 34133] = 0.0013798\n",
"\t[716 , 34134] = 0.00382909\n",
"\t[831 , 34134] = 0.0013798\n",
"\t[69 , 34135] = 0.0021235\n",
"\t[69 , 34136] = 0.0955576\n",
"\t[69 , 34137] = 0.0021235\n",
"\t[69 , 34138] = 0.0021235\n",
"\t[69 , 34139] = 0.0021235\n",
"\t[69 , 34140] = 0.00424701"
]
},
"execution_count": 62,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tfidf = tf_idf(m)"
]
},
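{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the weighting concrete, here is a minimal, illustrative tf-idf computation on a toy count matrix. The `counts` matrix is invented for this sketch, and `TextAnalysis.tf_idf` may use a slightly different normalisation; the idea is simply term frequency scaled by the log-inverse document frequency:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"counts = [1 2 0; 0 1 1]           # toy corpus: 2 documents x 3 terms\n",
"n = size(counts, 1)               # number of documents\n",
"df = vec(sum(counts .> 0, 1))     # documents containing each term\n",
"tf = counts ./ sum(counts, 2)     # term frequency within each document\n",
"tf .* log(n ./ df)'               # tf-idf weights"
]
},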
{
"cell_type": "code",
"execution_count": 63,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"using Clustering"
]
},
{
"cell_type": "code",
"execution_count": 64,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"Clustering.KmeansResult{Float64}([0.000117358 0.00140057 … 0.0 0.00049715; 0.0 0.000386622 … 0.0 0.0; … ; 0.0 2.13632e-6 … 0.0 0.0; 0.0 4.27264e-6 … 0.0 0.0],[2,2,2,2,2,2,2,2,2,2 … 2,2,2,2,2,2,2,2,2,2],[0.0268403,0.0194125,0.0457738,0.032654,0.0241076,0.0401771,0.0461305,0.0108926,0.0296793,0.0112415 … 0.022854,0.0267481,0.117224,0.0156784,0.0416075,0.0453245,0.0933526,0.222898,0.0960297,0.0540498],[1,994,3,1,1],[1.0,994.0,3.0,1.0,1.0],39.85185508928952,2,true)"
]
},
"execution_count": 64,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cl = kmeans(full(tfidf'), 5)"
]
},
{
"cell_type": "code",
"execution_count": 65,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"5-element Array{Int64,1}:\n",
" 1\n",
" 994\n",
" 3\n",
" 1\n",
" 1"
]
},
"execution_count": 65,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cl.counts"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"m.terms[sortperm(cl.centers[:, 1]; rev=true)[1:20]]  # top 20 terms for cluster 1"
]
},
{
"cell_type": "code",
"execution_count": 66,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"WARNING: imported binding for beta overwritten in module Main\n"
]
},
{
"data": {
"text/plain": [
"2×34140 sparse matrix with 43015 Float64 nonzero entries:\n",
"\t[1 , 1] = 0.00138445\n",
"\t[2 , 1] = 0.00299285\n",
"\t[1 , 2] = 0.000170262\n",
"\t[2 , 2] = 0.000190974\n",
"\t[1 , 3] = 4.29954e-6\n",
"\t[2 , 3] = 1.95155e-5\n",
"\t[1 , 4] = 1.46184e-5\n",
"\t[2 , 4] = 1.53336e-5\n",
"\t[2 , 5] = 8.3638e-6\n",
"\t[1 , 6] = 6.53531e-5\n",
"\t⋮\n",
"\t[1 , 34132] = 8.59909e-7\n",
"\t[1 , 34133] = 1.11788e-5\n",
"\t[1 , 34134] = 8.59909e-7\n",
"\t[2 , 34134] = 1.39397e-6\n",
"\t[1 , 34135] = 8.59909e-7\n",
"\t[1 , 34136] = 3.86959e-5\n",
"\t[1 , 34137] = 8.59909e-7\n",
"\t[1 , 34138] = 8.59909e-7\n",
"\t[1 , 34139] = 8.59909e-7\n",
"\t[1 , 34140] = 8.59909e-7\n",
"\t[2 , 34140] = 1.39397e-6"
]
},
"execution_count": 66,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"k = 2             # number of topics\n",
"iteration = 1000  # number of Gibbs sampling iterations\n",
"alpha = 0.1       # hyperparameter (document-topic prior)\n",
"beta = 0.1        # hyperparameter (topic-word prior)\n",
"l = lda(m, k, iteration, alpha, beta)  # l is a k x word matrix; each entry is\n",
"                                       # the probability of a word occurring in a topic."
]
},
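{
"cell_type": "markdown",
"metadata": {},
"source": [
"To interpret the fitted topics, we can look at the highest-probability words in each row of `l`. This sketch assumes the `terms` field of the `DocumentTermMatrix` holds the vocabulary in column order:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"for t in 1:k\n",
"    println(\"Topic $t: \", join(m.terms[sortperm(vec(full(l[t, :])), rev=true)[1:10]], \", \"))\n",
"end"
]
},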
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Deep Learning\n",
"\n",
"As a simple demonstration, we train a linear model with the `Knet` package on the tf-idf features, using randomly generated binary labels (in a real application these would be genuine document labels)."
]
},
{
"cell_type": "code",
"execution_count": 68,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"(::gradfun) (generic function with 1 method)"
]
},
"execution_count": 68,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"using Knet\n",
"\n",
"# A linear model w[1]*x .+ w[2], with mean squared-error loss\n",
"predct(w, x) = w[1]*x .+ w[2]\n",
"\n",
"loss(w, x, y) = sumabs2(y - predct(w, x)) / size(y, 2)\n",
"\n",
"# Knet's grad returns a function computing the gradient of loss w.r.t. w\n",
"lossgradient = grad(loss)"
]
},
{
"cell_type": "code",
"execution_count": 69,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"train (generic function with 1 method)"
]
},
"execution_count": 69,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# One pass of gradient descent over (x, y) batches, with learning rate lr\n",
"function train(w, data; lr=.1)\n",
" for (x,y) in data\n",
" dw = lossgradient(w, x, y)\n",
" for i in 1:length(w)\n",
" w[i] -= lr * dw[i]\n",
" end\n",
" end\n",
" return w\n",
"end"
]
},
{
"cell_type": "code",
"execution_count": 70,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"1×1000 Array{Int64,2}:\n",
" 0 1 0 1 1 0 1 0 0 1 1 0 0 … 1 1 0 0 0 0 0 0 0 0 0 1"
]
},
"execution_count": 70,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"y=rand([0,1], 1, 1000)"
]
},
{
"cell_type": "code",
"execution_count": 75,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"2-element Array{Any,1}:\n",
" [0.0227769 -0.166226 … 0.00791624 0.121108]\n",
" 0.0 "
]
},
"execution_count": 75,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"w = Any[0.1*randn(1, 34140), 0.0]  # initial weight vector and bias"
]
},
{
"cell_type": "code",
"execution_count": 76,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.3928566451442594\n",
"0.3408238131065078\n",
"0.3075343812766715\n",
"0.28623603788225627\n",
"0.27260899215269474\n",
"0.26388965838130973\n",
"0.2583100328929715\n",
"0.25473903551610305\n",
"0.25245305782582805\n",
"0.25098917140583094\n"
]
}
],
"source": [
"for i=1:10; train(w, [(tfidf',y)]); println(loss(w,tfidf',y)); end"
]
},
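{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough, illustrative check (not in the original notebook), we can threshold the model's predictions at 0.5 and compute the accuracy against the (random) training labels:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"mean((predct(w, tfidf') .> 0.5) .== y)  # fraction of labels predicted correctly"
]
},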
{
"cell_type": "code",
"execution_count": 89,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"([0.413599],0)"
]
},
"execution_count": 89,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"predct(w, full(tfidf[3, :])), y[3]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Summarisation\n",
"\n",
"We build a simple extractive summariser for a single document: sentences are scored by running PageRank over a sentence-similarity graph (a TextRank-style approach), and the top-scoring sentences are selected."
]
},
{
"cell_type": "code",
"execution_count": 107,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"A Corpus"
]
},
"execution_count": 107,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"c = Corpus(Any[StringDocument(t) for t in getSentences(text)])"
]
},
{
"cell_type": "code",
"execution_count": 108,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"prepare!(c, strip_non_letters | strip_punctuation | strip_case | strip_stopwords)"
]
},
{
"cell_type": "code",
"execution_count": 109,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"stem!(c)"
]
},
{
"cell_type": "code",
"execution_count": 110,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"update_lexicon!(c)\n",
"update_inverse_index!(c)"
]
},
{
"cell_type": "code",
"execution_count": 111,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"54×252 sparse matrix with 532 Float64 nonzero entries:\n",
"\t[5 , 1] = 0.44322\n",
"\t[52 , 2] = 0.306845\n",
"\t[25 , 3] = 0.398898\n",
"\t[10 , 4] = 0.185906\n",
"\t[11 , 4] = 0.371813\n",
"\t[34 , 4] = 0.162668\n",
"\t[49 , 4] = 0.236608\n",
"\t[11 , 5] = 0.339935\n",
"\t[32 , 5] = 0.475909\n",
"\t[34 , 5] = 0.148722\n",
"\t⋮\n",
"\t[52 , 246] = 0.306845\n",
"\t[23 , 247] = 1.44519\n",
"\t[46 , 247] = 0.481729\n",
"\t[51 , 247] = 0.481729\n",
"\t[50 , 248] = 0.249312\n",
"\t[29 , 249] = 0.44322\n",
"\t[12 , 250] = 0.362635\n",
"\t[4 , 251] = 0.321152\n",
"\t[6 , 251] = 0.722593\n",
"\t[8 , 251] = 0.289037\n",
"\t[54 , 252] = 0.0814078"
]
},
"execution_count": 111,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tf = tf_idf(DocumentTermMatrix(c))"
]
},
{
"cell_type": "code",
"execution_count": 112,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"54×54 sparse matrix with 1438 Float64 nonzero entries:\n",
"\t[1 , 1] = 0.909875\n",
"\t[3 , 1] = 0.204116\n",
"\t[4 , 1] = 0.0167632\n",
"\t[5 , 1] = 0.0167632\n",
"\t[6 , 1] = 0.0377172\n",
"\t[7 , 1] = 0.173923\n",
"\t[8 , 1] = 0.0150869\n",
"\t[9 , 1] = 0.0251448\n",
"\t[10, 1] = 0.0107763\n",
"\t[12, 1] = 0.0137153\n",
"\t⋮\n",
"\t[43, 54] = 0.0108824\n",
"\t[44, 54] = 0.0035188\n",
"\t[45, 54] = 0.0215087\n",
"\t[47, 54] = 0.0629085\n",
"\t[48, 54] = 0.00985264\n",
"\t[49, 54] = 0.00223924\n",
"\t[50, 54] = 0.00686365\n",
"\t[51, 54] = 0.0230408\n",
"\t[52, 54] = 0.00189474\n",
"\t[53, 54] = 0.0170527\n",
"\t[54, 54] = 0.385053"
]
},
"execution_count": 112,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tf * tf'  # sentence-to-sentence similarity matrix"
]
},
{
"cell_type": "code",
"execution_count": 113,
"metadata": {
"collapsed": true
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"WARNING: Method definition pagerank(Any) in module Main at In[102]:2 overwritten at In[113]:2.\n",
"WARNING: Method definition #pagerank(Array{Any, 1}, Main.#pagerank, Any) in module Main overwritten.\n"
]
},
{
"data": {
"text/plain": [
"pagerank (generic function with 1 method)"
]
},
"execution_count": 113,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"function pagerank(A; Niter=20, damping=.15)\n",
"    Nmax = size(A, 1)\n",
"    r = rand(1, Nmax)          # random starting rank\n",
"    r = r ./ norm(r, 1)        # normalise to a probability vector\n",
"    a = (1 - damping) ./ Nmax  # uniform teleportation term\n",
"\n",
"    for i = 1:Niter\n",
"        s = r * A\n",
"        scale!(s, damping)\n",
"        r = s .+ (a * sum(r, 2))  # power-iteration update\n",
"    end\n",
"\n",
"    return r ./ norm(r, 1)\n",
"end"
]
},
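{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check (not part of the original analysis), we can apply `pagerank` to a small, hypothetical 3-node adjacency matrix; the result should be a probability vector, with the central node ranked highest:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"A = [0.0 1.0 0.0;\n",
"     1.0 0.0 1.0;\n",
"     0.0 1.0 0.0]\n",
"pagerank(A)"
]
},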
{
"cell_type": "code",
"execution_count": 114,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"1×54 Array{Float64,2}:\n",
" 0.482163 0.495855 0.48104 0.506664 … 0.508205 0.438006 0.426742"
]
},
"execution_count": 114,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"p=pagerank(tf * tf')"
]
},
{
"cell_type": "code",
"execution_count": 115,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"10-element Array{String,1}:\n",
" \"On 13 October 2005, the applicant again wrote to the Commission.\"\n",
" \"As to NSD 2340 of 2005, it was proposed that the application be dismissed.\"\n",
" \"I am of that view for two reasons.\"\n",
" \"10 Another matter is an application by the applicant that her name be suppressed.\"\n",
" \"11 Dealing with that second matter, it seems inappropriate to make the orders sought.\"\n",
" \"I am not affirmatively satisfied that the orders sought are orders that are necessary to do justice between the parties.\"\n",
" \"As to the other orders proposed by the applicant are concerned, I am not satisfied they are necessary to do justice between the parties.\"\n",
" \"Such an order can be made by the Court if it is of the view it is necessary in order to prevent prejudice to the administration of justice.\"\n",
" \"14 The applicant has elected to bring these applications, as she is entitled to do.\"\n",
" \"However, in my view, that is not a sufficient reason to make an order under s 50 of the FCA Act .\""
]
},
"execution_count": 115,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"getSentences(text)[sort(sortperm(vec(p), rev=true)[1:10])]"
]
}
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Julia 0.5.3-pre",
"language": "julia",
"name": "julia-0.5"
},
"language_info": {
"file_extension": ".jl",
"mimetype": "application/julia",
"name": "julia",
"version": "0.5.3"
}
},
"nbformat": 4,
"nbformat_minor": 1
}