mvaz/lm_example.ipynb

## lm_example.ipynb
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# The unreasonable effectiveness of Character-level Language Models\n",
    "## (and why RNNs are still cool)\n",
    "\n",
    "###[Yoav Goldberg](http://www.cs.biu.ac.il/~yogo)\n",
    "\n",
    "RNNs, LSTMs and Deep Learning are all the rage, and a recent [blog post](http://karpathy.github.io/2015/05/21/rnn-effectiveness/) by Andrej Karpathy is doing a great job explaining what these models are and how to train them.\n",
    "It also provides some very impressive results of what they are capable of.  This is a great post, and if you are interested in natural language, machine learning or neural networks you should definitely read it. \n",
    "\n",
    "Go read it now, then come back here. \n",
    "\n",
    "You're back? good. Impressive stuff, huh? How could the network learn to immitate the input like that?\n",
    "Indeed. I was quite impressed as well.\n",
    "\n",
    "However, it feels to me that most readers of the post are impressed by the wrong reasons.\n",
    "This is because they are not familiar with **unsmoothed maximum-liklihood character level language models** and their unreasonable effectiveness at generating rather convincing natural language outputs.\n",
    "\n",
    "In what follows I will briefly describe these character-level maximum-likelihood langauge models, which are much less magical than RNNs and LSTMs, and show that they too can produce a rather convincing Shakespearean prose. I will also show about 30 lines of python code that take care of both training the model and generating the output. Compared to this baseline, the RNNs may seem somehwat less impressive. So why was I impressed? I will explain this too, below."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Unsmoothed Maximum Likelihood Character Level Language Model \n",
    "\n",
    "The name is quite long, but the idea is very simple.  We want a model whose job is to guess the next character based on the previous $n$ letters. For example, having seen `ello`, the next characer is likely to be either a commma or space (if we assume is is the end of the word \"hello\"), or the letter `w` if we believe we are in the middle of the word \"mellow\". Humans are quite good at this, but of course seeing a larger history makes things easier (if we were to see 5 letters instead of 4, the choice between space and `w` would have been much easier).\n",
    "\n",
    "We will call $n$, the number of letters we need to guess based on, the _order_ of the language model.\n",
    "\n",
    "RNNs and LSTMs can potentially learn infinite-order language model (they guess the next character based on a \"state\" which supposedly encode all the previous history). We here will restrict ourselves to a fixed-order language model.\n",
    "\n",
    "So, we are seeing $n$ letters, and need to guess the $n+1$th one. We are also given a large-ish amount of text (say, all of Shakespear works) that we can use. How would we go about solving this task?\n",
    "\n",
    "Mathematiacally, we would like to learn a function $P(c | h)$. Here, $c$ is a character, $h$ is a $n$-letters history, and $P(c|h)$ stands for how likely is it to see $c$ after we've seen $h$.\n",
    "\n",
    "Perhaps the simplest approach would be to just count and divide (a.k.a **maximum likelihood estimates**). We will count the number of times each letter $c'$ appeared after $h$, and divide by the total numbers of letters appearing after $h$. The **unsmoothed** part means that if we did not see a given letter following $h$, we will just give it a probability of zero.\n",
    "\n",
    "And that's all there is to it.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Training Code\n",
    "Here is the code for training the model. `fname` is a file to read the characters from. `order` is the history size to consult. Note that we pad the data with leading `~` so that we also learn how to start.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from collections import *\n",
    "\n",
    "def train_char_lm(fname, order=4):\n",
    "    with open(fname, 'r') as f:\n",
    "        data = f.read()\n",
    "    lm = defaultdict(Counter)\n",
    "    pad = \"~\" * order\n",
    "    data = pad + data\n",
    "    for i in range(len(data)-order):\n",
    "        history, char = data[i:i+order], data[i+order]\n",
    "        lm[history][char]+=1\n",
    "    def normalize(counter):\n",
    "        s = float(sum(counter.values()))\n",
    "        return [(c,cnt/s) for c,cnt in counter.items()]\n",
    "    outlm = {hist:normalize(chars) for hist, chars in lm.items()}\n",
    "    return outlm"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's train it on Andrej's Shakespears's text:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "dict_items([])"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "c=defaultdict(Counter)\n",
    "c.items()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "!wget http://cs.stanford.edu/people/karpathy/char-rnn/shakespeare_input.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "lm = train_char_lm(\"shakespeare_input2.txt\", order=4)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Ok. Now let's do some queries:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('?', 0.0068143100511073255),\n",
       " (\"'\", 0.017035775127768313),\n",
       " (',', 0.027257240204429302),\n",
       " ('n', 0.0017035775127768314),\n",
       " ('.', 0.0068143100511073255),\n",
       " ('!', 0.0068143100511073255),\n",
       " (' ', 0.013628620102214651),\n",
       " ('w', 0.817717206132879),\n",
       " ('u', 0.03747870528109029),\n",
       " (':', 0.005110732538330494),\n",
       " ('r', 0.059625212947189095)]"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "lm['ello']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('t', 1.0)]"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "lm['Firs']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('i', 0.016853932584269662),\n",
       " ('c', 0.012841091492776886),\n",
       " ('b', 0.024879614767255216),\n",
       " ('e', 0.0032102728731942215),\n",
       " ('f', 0.011235955056179775),\n",
       " (\"'\", 0.0008025682182985554),\n",
       " ('E', 0.0016051364365971107),\n",
       " ('l', 0.01043338683788122),\n",
       " ('I', 0.009630818619582664),\n",
       " ('r', 0.0072231139646869984),\n",
       " ('t', 0.05377207062600321),\n",
       " ('C', 0.09550561797752809),\n",
       " ('W', 0.033707865168539325),\n",
       " ('n', 0.020064205457463884),\n",
       " ('y', 0.002407704654895666),\n",
       " ('O', 0.018459069020866775),\n",
       " ('R', 0.0008025682182985554),\n",
       " ('M', 0.0593900481540931),\n",
       " ('S', 0.16292134831460675),\n",
       " ('o', 0.030497592295345103),\n",
       " ('k', 0.0040128410914927765),\n",
       " ('a', 0.02247191011235955),\n",
       " ('L', 0.10674157303370786),\n",
       " ('v', 0.002407704654895666),\n",
       " ('N', 0.0008025682182985554),\n",
       " ('K', 0.008025682182985553),\n",
       " ('H', 0.0040128410914927765),\n",
       " ('w', 0.024077046548956663),\n",
       " ('u', 0.0016051364365971107),\n",
       " ('P', 0.014446227929373997),\n",
       " ('d', 0.015248796147672551),\n",
       " ('F', 0.012038523274478331),\n",
       " ('s', 0.03290529695024077),\n",
       " ('g', 0.011235955056179775),\n",
       " ('G', 0.0898876404494382),\n",
       " ('D', 0.0032102728731942215),\n",
       " ('h', 0.019261637239165328),\n",
       " ('p', 0.00882825040128411),\n",
       " ('A', 0.0056179775280898875),\n",
       " ('m', 0.02247191011235955),\n",
       " ('B', 0.009630818619582664),\n",
       " ('q', 0.0016051364365971107),\n",
       " ('T', 0.0032102728731942215)]"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "lm['rst ']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So `ello` is followed by either space, punctuation or `w` (or `r`, `u`, `n`), `Firs` is pretty much deterministic, and the word following `ist ` can start with pretty much every letter."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Generating from the model\n",
    "Generating is also very simple. To generate a letter, we will take the history, look at the last $order$ characteters, and then sample a random letter based on the corresponding distribution."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from random import random\n",
    "\n",
    "def generate_letter(lm, history, order):\n",
    "        history = history[-order:]\n",
    "        dist = lm[history]\n",
    "        x = random()\n",
    "        for c,v in dist:\n",
    "            x = x - v\n",
    "            if x <= 0: return c"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To generate a passage of $k$ characters, we just seed it with the initial history and run letter generation in a loop, updating the history at each turn."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def generate_text(lm, order, nletters=1000):\n",
    "    history = \"~\" * order\n",
    "    out = []\n",
    "    for i in range(nletters):\n",
    "        c = generate_letter(lm, history, order)\n",
    "        history = history[-order:] + c\n",
    "        out.append(c)\n",
    "    return \"\".join(out)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Generated Shakespeare from different order models\n",
    "\n",
    "Let's try to generate text based on different language-model orders. Let's start with something silly:\n",
    "\n",
    "### order 2:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Fir heannes. Nelf!\n",
      "\n",
      "Dive gods,\n",
      "iferk youres now ous.\n",
      "\n",
      "Elived.\n",
      "\n",
      "Sir dow my shatere, so thost duck youle suich,\n",
      "Thard sh don youre shat that me pacy, mazenhum!\n",
      "Alarcusbase wing youressellove muchat beight saw:\n",
      "Yould, a pried men, andere's my befte wer, hen CAER:\n",
      "For Sne is wourse ve withe ithe\n",
      "witit houll: diteave foold saw lint of Chrithatend\n",
      "smis abour that mais penceirtake redaut, for ever, ther eadstrunt ast to a daugle BOTESTH:\n",
      "A mendinclestruntervin froge'erkinsted duarvingerethsake yettend am Fra: widear\n",
      "faing.\n",
      "\n",
      "Commis do triegal that of bour thee, ingue wit dam\n",
      "Throurkin he stions, God ce sho'en-bogervere wid hur so bein,\n",
      "But, ing fain thess' th, mus, flood the to limeavell heat didee,\n",
      "If go mand nacky nave thers te!\n",
      "\n",
      "Fords for poin ithe ber, And sto eve nave-fus yout wit shave me jand If have, con of thro?\n",
      "\n",
      "Leasuid wearesers pood\n",
      "to us whaver of licephowl: hande lith and her cathe wer scom!'\n",
      "Cou withe earsent aletuall! Was a se, com thood gun rem myse a frinks,\n",
      "Yout bustere stio\n"
     ]
    }
   ],
   "source": [
    "lm = train_char_lm(\"shakespeare_input2.txt\", order=2)\n",
    "print(generate_text(lm, 2))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Not so great.. but what if we increase the order to 4?\n",
    "\n",
    "### order 4"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "First, not Lord:\n",
      "On, Bassion out, ever lives be wooed Maste only, my loves be mean of Corioli and me, I'll stands much world,\n",
      "To speaks meet not so? O, a most a let in flow the look you well about given in o'\n",
      "the villain,\n",
      "I methou may dost we hath dress' eart, how may good so, as a prison,\n",
      "Which he cleavenly guilty strouse,\n",
      "I shall the may executed our know man.\n",
      "\n",
      "SEBASTARD:\n",
      "My mine eyes as thing\n",
      "That it pains-mater.\n",
      "\n",
      "GRATIO:\n",
      "We hot deeply.\n",
      "\n",
      "LADY ANNE:\n",
      "Nay, I'll gifts.\n",
      "\n",
      "EARL OF SYRACUSE:\n",
      "To do not be, no; for thus come dotary\n",
      "Upon Satan, butch Timon, affectious crave thy ranslate it dost by Saintance?'\n",
      "And spread field have\n",
      "courable Banding. Talbot asides, and it end dying.\n",
      "\n",
      "REYNALDO:\n",
      "Now, my lord.\n",
      "\n",
      "HAMLET:\n",
      "He prouder strong wear\n",
      "A this;\n",
      "It makest kernell, for spirit\n",
      "Crave will nevery stories answer a ward to old telligence.\n",
      "\n",
      "CASSIUS:\n",
      "The blood, my friends. Who withal the musty noses;\n",
      "Kind all on me, and the peace,\n",
      "Art the strength that world: yoursed would to proper bluntlemen find Joh\n"
     ]
    }
   ],
   "source": [
    "lm = train_char_lm(\"shakespeare_input.txt\", order=4)\n",
    "print(generate_text(lm, 4))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This is already quite reasonable, and reads like English. Just 4 letters history! What if we increase it to 7?\n",
    "\n",
    "### order 7"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "First Clown:\n",
      "Fie, Joan, not forged him, he drink after; do it; and much at other side!\n",
      "How ceremony.\n",
      "\n",
      "CHATILLON:\n",
      "They lie dead:\n",
      "For every glad of it with my good Alexas?\n",
      "\n",
      "ALEXAS:\n",
      "Half all my father let me bid you tell'st shelter than my weakness\n",
      "To be revenge, now the people knows thy tent;\n",
      "The Duke of Norfolk and signs of war;\n",
      "and the strive them fought we bury back, foolish, rascals.\n",
      "\n",
      "FLAVIUS:\n",
      "Let me speaks well.\n",
      "That monster of his subject we old man.\n",
      "\n",
      "HECTOR:\n",
      "What! you are all in our maidenhead,\n",
      "And broke doth warrant\n",
      "him.\n",
      "\n",
      "ESCALUS:\n",
      "So. What if I do fear is open; and\n",
      "the princes, what nothing, it was folly: she\n",
      "will them: in their orbs, and return'd to all, my lord of Canterbury,\n",
      "I have beat him, and seek to touch the essential vesture wound heaves:\n",
      "You are dull, and cried, 'An angel whom thou hast a troop of sovereign liege, my very goodly treats his alter'd venom steep'd,\n",
      "'Gainst the stones out o' the time of darkness!\n",
      "If this more in your brother means:\n",
      "We will prevailed: I woul\n"
     ]
    }
   ],
   "source": [
    "lm = train_char_lm(\"shakespeare_input.txt\", order=7)\n",
    "print(generate_text(lm, 7))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### How about 10?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "First Citizen:\n",
      "Peace, ho! the moon sleeps will shed tears,\n",
      "And lowly words were music to my heart,\n",
      "When all the means of the earth, fearing since how it might please the king?\n",
      "\n",
      "KING:\n",
      "She hath been this day we shall,\n",
      "As I conceive me, countrymen!\n",
      "Then I, as one that will take pains to make\n",
      "a puppet of her.\n",
      "O my distress,\n",
      "But that you\n",
      "Have blown me full of idle dreamer, where the torch doth burn,' quoth I. 'No, sir,' quoth he, as if\n",
      "He had been wounded with the wind-swift Cupid wings.\n",
      "Now is the day that you seek?\n",
      "\n",
      "GLOUCESTER:\n",
      "I have almost forgot.\n",
      "\n",
      "MARK ANTONY:\n",
      "Favours, by Jove that the duke's trumpets! I know not.\n",
      "\n",
      "PARIS:\n",
      "You are one of thee,\n",
      "now shalt thou learn of noble birth.\n",
      "\n",
      "Shepherd:\n",
      "Out, out, though you were a mockery king of men,\n",
      "He's both the villanous house in all\n",
      "London road for fleas: I am stung like a testy babe, will scarcely more\n",
      "Than if not look'd on.\n",
      "Sir, as I told you a thing?\n",
      "\n",
      "Second Witch:\n",
      "Show!\n",
      "\n",
      "ALL:\n",
      "Show his eye aside,\n",
      "And bid him keep it better than well,\n",
      "They sh\n"
     ]
    }
   ],
   "source": [
    "lm = train_char_lm(\"shakespeare_input.txt\", order=10)\n",
    "print(generate_text(lm, 10))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### This works pretty well\n",
    "\n",
    "With an order of 4, we already get quite reasonable results. Increasing the order to 7 (~word and a half of history) or 10 (~two short words of history) already gets us quite passable Shakepearan text. I'd say it is on par with the examples in Andrej's post. And how simple and un-mystical the model is!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### So why am I impressed with the RNNs after all?\n",
    "\n",
    "Generating English a character at a time -- not so impressive in my view. The RNN needs to learn the previous $n$ letters, for a rather small $n$, and that's it. \n",
    "\n",
    "However, the code-generation example is very impressive. Why? because of the context awareness. Note that in all of the posted examples, the code is well indented, the braces and brackets are correctly nested, and even the comments start and end correctly. This is not something that can be achieved by simply looking at the previous $n$ letters. \n",
    "\n",
    "If the examples are not cherry-picked, and the output is generally that nice, then the LSTM did learn something not trivial at all.\n",
    "\n",
    "Just for the fun of it, let's see what our simple language model does with the linux-kernel code:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "!wget http://cs.stanford.edu/people/karpathy/char-rnn/linux_input.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "~/*\n",
      " * linux/kernel/time/clockevents_handle_noop;\n",
      "\t\ttd->evtdev)\n",
      "\t\thrtimer_forward() to expire or\n",
      "\t * if we support will be useful, but\n",
      " * WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU\n",
      " * General Public License for more detailed information stored from task_tick_numa().\n",
      " */\n",
      "void __sched wait_for_completion\t*detach_completion_interval_ms);\n",
      "}\n",
      "\n",
      "static struct k_itimer *posix_timers_register(&module_trace_bprintk_fmt != __start_syscall - restart loop. */\n",
      "\t\t\tgoto out;\n",
      "\n",
      "\tret = count;\n",
      "out:\n",
      "\tmutex_unlock(&rcu_tasks_cbs_head);\n",
      "\t\t}\n",
      "\t}\n",
      "\n",
      "\t/* Any online checks)\n",
      " *\n",
      " * In none of the task.\n",
      "\t */\n",
      "\tcase SYSLOG_ACTION_* buffer sealed and compressed data */\n",
      "\tcd.read_data[id].dev.id = id;\n",
      "\n",
      "\tthreads[id]))\n",
      "\t\treturn NULL;\n",
      "\n",
      "\treturn suspend_ops->enter();\n",
      "\t\t}\n",
      "\t}\n",
      "\n",
      "\tpr_devel(\"verify_signature *pks,\n",
      "\t\t\t\t const char *dict, u16 dict_len, &pad_len);\n",
      "\n",
      "\tmatch = futex_top_waiter(hb, &key);\n",
      "\tif (strlen(root->name))\n",
      "\t\treturn;\n",
      "\n",
      "\trefcount = atomic_r\n"
     ]
    }
   ],
   "source": [
    "lm = train_char_lm(\"linux_input.txt\", order=10)\n",
    "print(generate_text(lm, 10))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "~\n",
      " *     wait->flags &= ~WQ_FLAG_WOKEN;\t\tcondition = true;\n",
      " *     smp_mb() // B\t\t\t\tsmp_wmb(); // C\n",
      " *\t\t\t\t\t\twait->flags |= WQ_FLAG_WOKEN;\n",
      "\n",
      "\treturn default_wake_function(wait, mode, sync, key);\n",
      "}\n",
      "\n",
      "static bool init_nocb_callback_list(struct rcu_data *rdp)\n",
      "{\n",
      "\tunsigned long flags;\n",
      "\n",
      "\tif (!latencytop_enabled)\n",
      "\t\treturn;\n",
      "\n",
      "\t/* We're only interested __GFP_FS allocation is allowed for already existing hierarchies but we\n",
      "\t * can set the expiry time is replaced with cpu_online_mask));\n",
      "\twatchdog_running = 0;\n",
      "}\n",
      "\n",
      "static int cpu_cfs_period_write_u64,\n",
      "\t},\n",
      "#endif\n",
      "#ifdef CONFIG_SMP\n",
      "\t\tstruct sched_entity *last;\n",
      "\tunsigned long clone_flags,\n",
      "\t\t int __user *, stat_addr, int, options)\n",
      "{\n",
      "\treturn __ftrace_graph_ent *trace);\n",
      "void set_graph_array(tr);\n",
      "\tret = register_ftrace_function_check_pred(pred, 0);\n",
      "\t} else {\n",
      "\t\tdone = tu->filter.nr_systemwide++;\n",
      "\t}\n",
      "\twrite_unlock(&resource_lock);\n",
      "\treturn p;\n",
      "}\n",
      "\n",
      "/*\n",
      " * This is compatibility?  This can be done in libc so Alpha\n",
      " * and all newer ports shouldn't need it.\n",
      " */\n",
      "SYSCALL_DEF\n"
     ]
    }
   ],
   "source": [
    "lm = train_char_lm(\"linux_input.txt\", order=15)\n",
    "print(generate_text(lm, 15))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "/*\n",
      " * linux/kernel/itimer.c\n",
      " *\n",
      " * Copyright (C) 2008 Steven Noonan <steven@uplinklabs.net>\n",
      " *\n",
      " */\n",
      "\n",
      "#include <linux/kernel_stat.h>\n",
      "\n",
      "#include \"internals.h\"\n",
      "\n",
      "/*\n",
      " * Autodetection depends on the fact the task is active,\n",
      " * (it is on its rq) or has been removed from the hash list via\n",
      " * the rt_mutex code. See unqueue_me_pi().\n",
      " */\n",
      "struct futex_q {\n",
      "\tstruct plist_node list;\n",
      "\n",
      "\tstruct task_group *parent)\n",
      "{\n",
      "\tstruct pid *pgrp = task_pgrp(tsk);\n",
      "\tstruct task_struct *find_task_by_vpid(pid);\n",
      "\t\tif (!p)\n",
      "\t\t\tgoto out;\n",
      "\t\tgrp = task_pgrp(p);\n",
      "\t\tif (!grp)\n",
      "\t\t\tgoto out;\n",
      "\n",
      "\t\tret = expires;\n",
      "\n",
      "\t\t/*\n",
      "\t\t * nohz_stop_sched_tick can be called several times before\n",
      "\t\t * the nohz_restart_sched_tick(ts, ktime_get(), cpu);\n",
      "#endif\n",
      "}\n",
      "\n",
      "static int lockdep_stats_open(struct inode *inode, struct file *file)\n",
      "{\n",
      "\tstruct enable_trigger_data {\n",
      "\tstruct ftrace_event_file *file,\n",
      "\t\t   filter_func_t filter)\n",
      "{\n",
      "\tbool enabled = trace_probe_is_enabled(&tk->tp)) {\n",
      "\t\t\tret = -EBUSY;\n",
      "\t\t\tgoto out_unlock;\n",
      "\t}\n",
      "\n",
      "\t/*\n",
      "\t * Except for the root, child_subsys_m\n"
     ]
    }
   ],
   "source": [
    "lm = train_char_lm(\"linux_input.txt\", order=20)\n",
    "print(generate_text(lm, 20))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "/*\n",
      " * linux/kernel/irq/autoprobe.c\n",
      " *\n",
      " * Copyright (C) 2008 Red Hat, Inc. All Rights Reserved.\n",
      " * Copyright (c) 2009 Pavel Machek <pavel@ucw.cz>\n",
      " * Copyright (c) 2009 Rafael J. Wysocki, Novell Inc.\n",
      " * Copyright (C) 2002-2004 Timesys Corporation\n",
      " * Copyright (C) 2012 Bojan Smojver <bojan@rexursive.com>\n",
      " *\n",
      " * This file contains the jiffies based clocksource is the lowest common\n",
      " * denominator clock source which should function on\n",
      " * all systems. It has the same coarse resolution as\n",
      " * the timer interrupt handler.  The irq handler has\n",
      " * already updated our counts.  We need to check if any timers fire now.\n",
      " * Interrupts are disabled,\n",
      "\t * we do not want to preempt the current task with a newly woken task if needed:\n",
      " */\n",
      "static void check_preempt_curr_dl(struct rq *rq)\n",
      "{\n",
      "\tint prev_cpu = rq->rt.push_cpu;\n",
      "\tint cpu;\n",
      "\n",
      "\tcpu = raw_smp_processor_id();\n",
      "\tint cpu;\n",
      "\n",
      "\tfor_each_possible_cpu(i)\n",
      "\t\tzalloc_cpumask_var_node(&tbl[node], GFP_KERNEL,\n",
      "\t\t\t\tnode_online(node) ? node : NUMA_NO_NODE));\n",
      "\n",
      "\tfor_each_poss\n"
     ]
    }
   ],
   "source": [
    "print(generate_text(lm, 20))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "/*\n",
      " * linux/kernel/irq/pm.c\n",
      " *\n",
      " * Copyright (C) 2005-2006 Thomas Gleixner\n",
      " *\n",
      " * This code is licenced under the GPL version 2. For details see\n",
      " * kernel-base/COPYING.\n",
      " */\n",
      "#include <linux/perf_event.h>\n",
      "#include <linux/livepatch.h>\n",
      "\n",
      "/**\n",
      " * struct klp_ops - structure for tracking registered ftrace ops struct\n",
      " */\n",
      "struct klp_ops {\n",
      "\tstruct list_head\tlist;\n",
      "\tstruct kretprobe\trp;\t/* Use rp.kp for kprobe use */\n",
      "\tunsigned long cpumask[0]; /* iteration mask */\n",
      "};\n",
      "\n",
      "struct sched_group {\n",
      "\tstruct sched_group *group = sd->groups;\n",
      "\n",
      "\tcpumask_clear(groupmask);\n",
      "\n",
      "\tprintk(KERN_DEBUG \"Probe %d : %p\\n\", i, funcs[i].func);\n",
      "}\n",
      "\n",
      "static struct ftrace_ops control_ops = {\n",
      "\t.func\t= ftrace_ops_control_func,\n",
      "\t.flags\t= FTRACE_OPS_FL_RECURSION_SAFE | FTRACE_OPS_FL_INITIALIZED;\n",
      "\t}\n",
      "#endif\n",
      "}\n",
      "\n",
      "/**\n",
      " * ftrace_nr_registered_ops(void)\n",
      "{\n",
      "\tstruct ftrace_profile {\n",
      "\tstruct hlist_node\tnode;\n",
      "\tstruct ftrace_profile_stat, stat);\n",
      "\n",
      "\tif (!stat || !stat->start)\n",
      "\t\treturn NULL;\n",
      "\n",
      "\treturn this_cpu_ptr(ctx->pmu->pmu_cpu_context = find_pmu_context(pmu->task_ctx_nr);\n",
      "\tif (pmu->pmu_cpu_context)\n",
      "\t\tgoto got_cpu_context;\n",
      "\n",
      "\tret = -ENOMEM;\n",
      "\n",
      "\tfpid = kmalloc(sizeof(*fgd), GFP_KERNEL);\n",
      "\tif (unlikely(!watch))\n",
      "\t\treturn ERR_PTR(-ENOMEM);\n",
      "\n",
      "\twl->name = kstrndup(name,\n",
      "\t\t\t\t\t      MAX_CGROUP_ROOT_NAMELEN - 1,\n",
      "\t\t\t\t\t      GFP_KERNEL);\n",
      "\tif (ns == NULL)\n",
      "\t\tgoto out;\n",
      "\n",
      "\tq = blk_trace_get_queue(bdev);\n",
      "\tif (q == NULL)\n",
      "\t\tgoto out_bdput;\n",
      "\n",
      "\tmutex_lock(&bdev->bd_mutex);\n",
      "out_bdput:\n",
      "\tbdput(bdev);\n",
      "out:\n",
      "\treturn ret ? ret : count;\n",
      "}\n",
      "\n",
      "int blk_trace_setup(struct request_queue *q,\n",
      "\t\t\t\tstruct bio *bio, int rw)\n",
      "{\n",
      "\tif (bio)\n",
      "\t\tblk_add_trace_bio(q, bio, BLK_TA_BACKMERGE, 0);\n",
      "}\n",
      "\n",
      "static void blk_add_trace_plug(void *ignore, struct pt_regs *regs)\n",
      "{\n",
      "}\n",
      "\n",
      "static inline void perf_cgroup_sched_out(task, next);\n",
      "}\n",
      "\n",
      "static void ptrace_unfreeze_traced(child);\n",
      "\t}\n",
      "\n",
      " out_put_task_struct;\n",
      "\n",
      "\tret = arch_ptrace(child, request, addr, data);\n",
      "\t\t/*\n",
      "\t\t * Some architectures need to do book-keeping after\n",
      "\t\t * a ptrace attach.\n",
      "\t\t */\n",
      "\t\tif (!ret)\n",
      "\t\t\tarch_ptrace_attach(child);\n",
      "\t\tgoto out_put_task_struct:\n",
      "\tput_task_struct(next_task);\n",
      "\n",
      "\t/* find_lock_lowest_rq releases rq->lock\n",
      "\t\t * so it is possible that hit + missed will overflow and be zero */\n",
      "\t\tif (!(hit + missed)) {\n",
      "\t\t\ttrace_printk(\"hit + missed overflowed and totalled zero!\\n\");\n",
      "\t\t\thit--; /* make it non zero */\n",
      "\t\t}\n",
      "\n",
      "\t\t/* Caculate the average time in nanosecs */\n",
      "\t\tavg = NSEC_PER_MSEC / hit;\n",
      "\t\ttrace_printk(\"%ld ns per entry\\n\", avg);\n",
      "\t}\n",
      "}\n",
      "\n",
      "static void perf_event_task_output(struct perf_event *event);\n",
      "\n",
      "void __weak perf_event_print_debug(void)\t{ }\n",
      "\n",
      "extern __weak const char *perf_pmu_name(void)\n",
      "{\n",
      "\treturn \"pmu\";\n",
      "}\n",
      "\n",
      "static inline u64 rq_clock(struct rq *rq)\n",
      "{\n",
      "\tlockdep_assert_held(&css_set_rwsem);\n",
      "\n",
      "\t\tretval = cgroup_attach_task()\n",
      " */\n",
      "\n",
      "static int cgroup_rm_cftypes_locked(cfts);\n",
      "\n",
      "\tmutex_unlock(&show_mutex);\n",
      "\n",
      "\treturn 0;\n",
      "}\n",
      "\n",
      "int perf_trace_add(struct perf_event *bp, int flags)\n",
      "{\n",
      "\tarch_uninstall_hw_breakpoint(bp);\n",
      "}\n",
      "\n",
      "static void hw_breakpoint_stop(struct perf_event *event, *tmp;\n",
      "\tLIST_HEAD(events);\n",
      "\n",
      "\tsrc_ctx = &per_cpu_ptr(pmu->pmu_cpu_context)\n",
      "\t\tgoto got_cpu_context;\n",
      "\n",
      "\tret = -ENOMEM;\n",
      "\n",
      "\tfpid = kmalloc(sizeof(*rd), GFP_KERNEL);\n",
      "\tif (unlikely(!entry))\n",
      "\t\treturn NULL;\n",
      "\n",
      "\tif (hash->size_bits)\n",
      "\t\tkey = hash_long(ip, FTRACE_PROFILE_HASH_BITS);\n",
      "\thhd = &stat->hash[key];\n",
      "\n",
      "\tif (hlist_empty(hhd))\n",
      "\t\treturn;\n",
      "\n",
      "\t/*\n",
      "\t * Delay this task enough that another task of this mm will likely win\n",
      "\t * the next time around.\n",
      "\t */\n",
      "\tp->node_stamp += 2 * TICK_NSEC;\n",
      "\n",
      "\tstart = mm->numa_scan_offset = 0;\n",
      "}\n",
      "\n",
      "/*\n",
      " * The expensive part of numa migration is done from task_work context.\n",
      " * Triggered from task_tick_numa().\n",
      " */\n",
      "void task_numa_fault(int last_cpupid, int mem_node, int pages, int flags)\n",
      "{\n",
      "\tstruct page *page = pfn_to_page(fr_pfn);\n",
      "\n",
      "\t\tmemory_bm_clear_current(forbidden_pages_map, page_to_pfn(page));\n",
      "}\n",
      "\n",
      "static void swevent_hlist_put(struct perf_event *event;\n",
      "\tint can_add_hw = 1;\n",
      "\n",
      "\tlist_for_each_entry(task, &cset->mg_tasks, cg_list) {\n",
      "\t\t\tif (count++ > MAX_TASKS_SHOWN_PER_CSS)\n",
      "\t\t\t\tgoto overflow;\n",
      "\t\t\tseq_printf(seq, \"  task %d\\n\", task_pid_vnr(task));\n",
      "\t\t}\n",
      "\n",
      "\t\tlist_for_each_entry(wq, &workqueues, list) {\n",
      "\t\tmutex_lock(&wq->mutex);\n",
      "\n",
      "\tfor_each_pwq(pwq, wq) {\n",
      "\t\t\tWARN_ON_ONCE(cpumask_test_cpu(cpu, buffer->cpumask))\n",
      "\t\tgoto out;\n",
      "\n",
      "\t/* Add a dynamic probe */\n",
      "\tdyn_ops = kzalloc(sizeof(*ref), GFP_KERNEL);\n",
      "\t\tif (!image->cmdline_buf) {\n",
      "\t\t\tret = -ENOMEM;\n",
      "\t\t\t\tgoto out_err;\n",
      "\t\t\t}\n",
      "\t\t}\n",
      "\n",
      "\t\tget_online_cpus();\n",
      "\tfor_each_online_cpu(cpu)\n",
      "\t\tprint_cpu(NULL, cpu, now);\n",
      "\n",
      "#ifdef CONFIG_GENERIC_PENDING_IRQ\n",
      "\tfree_cpumask_var(desc->irq_data.affinity, gfp, node))\n",
      "\t\treturn -ENOMEM;\n",
      "\n",
      "\tif (type)\n",
      "\t\terr = cpumask_parse_user(buffer, count, new_value);\n",
      "\tif (err)\n",
      "\t\tgoto bad_unshare_cleanup_fd;\n",
      "\terr = unshare_nsproxy_namespaces(unshare_flags, current, user_ns,\n",
      "\t\t\t\t\t new_fs ? new_fs : current->fs);\n",
      "\tif (IS_ERR(*new_nsp)) {\n",
      "\t\terr = PTR_ERR(event);\n",
      "\t\tgoto err;\n",
      "\t}\n",
      "\n",
      "\t/* Mark owner so we could distinguish it from user events. */\n",
      "\tevent->owner = EVENT_OWNER_KERNEL;\n",
      "\n",
      "\taccount_event(event);\n",
      "\n",
      "\tif (event->pending_disable = 1;\n",
      "\t\tirq_work_queue(&event->pending);\n",
      "\t}\n",
      "\n",
      "\tif (event->overflow_handler\t= ov\n"
     ]
    }
   ],
   "source": [
    "print(generate_text(lm, 20, nletters=5000))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Order 10 is pretty much junk. In order 15 things sort-of make sense, but we jump abruptly between the \n",
    "and by order 20 we are doing quite nicely -- but are far from keeping good indentation and brackets. \n",
    "\n",
    "How could we? we do not have the memory, and these things are not modeled at all. While we could quite easily enrich our model to support also keeping track of brackets and indentation (by adding information such as \"have I seen ( but not )\" to the conditioning history), this requires extra work, non-trivial human reasoning, and will make the model significantly more complex. \n",
    "\n",
    "The LSTM, on the other hand, seemed to have just learn it on its own. And that's impressive."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.4.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}