organisciak/Lesson Draft.ipynb

## Lesson Draft.ipynb
{
 "nbformat_minor": 0,
 "nbformat": 4,
 "cells": [
  {
   "source": [
    "## Data Analysis in Python through the HTRC Feature Reader\n",
    "Summary: *Using the 4.8 million volume Extracted Features Dataset from the HathiTrust Research Center, we introduce you to the popular SciPy stack of data science tools in Python, particularly the Pandas library.*"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "source": [
    "The HathiTrust holds nearly 15 million digitized volumes from libraries around the world. In addition to their individual value, these works in aggregate are extremely valuable for historians. Spanning centuries and genres, a scholar can make inferences about cultural, linguistic, historic, and structural trends in the published word. To simplify access to this collection, the HathiTrust Research Center (HTRC) has released the Extracted Features dataset (Capitanu et al. 2015).\n",
    "\n",
    "In this lesson, I introduce a library, the HTRC Feature Reader, for working with the Extracted Features dataset using the Python programming language. The HTRC Feature Reader is structured to support work using popular data science libraries, particularly Pandas. In teaching analysis of HathiTrust materials through the HTRC Feature Reader, this tutorial teaches skills that will benefit general data analysis in Python.\n",
    "\n",
    "\n",
    "Today, you'll learn:\n",
    "\n",
    "- How to work with \"notebooks\", a useful, interactive way of data science in Python;\n",
    "- Methods to read and visualize text data for millions of books with the HTRC Feature Reader; and\n",
    "- Data malleability: selecting, slicing, and summarizing Extracted Feature data using the flexible \"DataFrame\" structure."
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "source": [
    "## The Parts\n",
    "\n",
    "The **HathiTrust Research Center** (**HTRC**) is the research arm of the HathiTrust, tasked with supporting research usage of the works held by the HathiTrust. Particularly, this support involves mediating large-scale access to material in a non-consumptive manner, which aims to allow research over a work without enabling that work to be traditionally enjoyed or read by a human reader.  Huge digital collections can be of public benefit by allowing scholars to make insights about history and culture, and the non-consumptive model allows for these uses to be sought within the restrictions of intellectual property law.\n",
    "\n",
    "As part of their mission, the HTRC has released the **Extracted Features** dataset, of features derived for every page of 4.8 million 'volumes' (a generalized term referring to the different types of materials in the HathiTrust collection, of which books are the most prevalent type).\n",
    "\n",
    "What is a feature? A **feature** is a quantifiable marker of something measurable, a datum. A computer cannot understand the meaning of a sentence implicitly, but it can understand the counts of various words and word forms, or the presence or absence of stylistic markers, from which it can be trained to better understand text. In other words, in text analytics a feature is text converted to the language of the algorithm. A dataset of features can be non-consumptive, assuming there isn't enough information to reconstruct the book text from the features, but is also likely the exact same information a researcher would traditionally have to extract themselves.\n",
    "\n",
    "Not all features are useful, and not all algorithms use the same features. The HTRC Extracted Features Dataset includes information such as counts of word occurrences tagged by part of speech, line and sentence counts, and counts of characters at the leftmost and rightmost sides of a page. No positional information is provided, so the data does not specify if 'brown' was followed by 'dog', though the information is shared for every single page, so you can at least infer how often 'brown' and 'dog' occurred in the same general vicinity within a text.\n",
    "\n",
    "With easy access and features already extracted, the Extracted Features dataset offers a great entry point to programmatic text analysis and text mining. To further simplify beginner usage, the HTRC has released the HTRC Feature Reader. The **HTRC Feature Reader** scaffolds use of the dataset with the Python programming language.\n",
    "\n",
    "This tutorial teaches the fundamentals of using the Extracted Features dataset with the HTRC Feature Reader. The Feature Reader is designed to make use of data structures from the most popular scientific tools in Python, so the skills taught here will apply to other settings of data analysis. In this way, the Extracted Features dataset is a particularly good use case, one useful to historians, for more general skills."
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "source": [
    "## Possibilities\n",
    "\n",
    "Historians have access to a huge, century-spanning dataset in the HTRC Extracted Features dataset. It provides extracted textual information for each page in 4 million scanned books. This dataset is useful for the access it provides to the published word, but in having data pre-extracted, it also reduces the number of steps a scholar needs to take in text analysis, and makes research easier for others to reproduce.\n",
    "\n",
    "Though it is relatively new, the Extracted Features dataset is already seeing use by scholars, as seen on a [page collected by the HTRC](https://wiki.htrc.illinois.edu/display/COM/Extracted+Features+in+the+Wild). [Mimno (2014)](http://mimno.infosci.cornell.edu/wordsim/nearest.html) has processed word co-occurrence tables per year, allowing others to view how correlations between topics change over time. [Underwood (2015)](https://sharc.hathitrust.org/genre) has identified genre (fiction, poetry, drama) for 178k books and released genre-specific word counts, against which [Forster (2015)](http://cforster.com/2015/09/gender-in-hathitrust-dataset/) was able to study author gender by genre.\n",
    "\n",
    "## Suggested Prior Skills\n",
    "\n",
    "This lesson aims to provide a gentle but technical introduction to text analysis in Python with the HTRC Feature Reader. Most of the code is provided, but is most useful if you are comfortable tinkering with it and seeing how outputs change when you do.\n",
    "\n",
    "We recommend a baseline knowledge of Python conventions, which can be learned with Turkel and Crymble's [series of Python lessions](http://programminghistorian.org/lessons/introduction-and-installation) on Programming Historian.\n",
    "\n",
    "The skills taught here are focused on flexibly accessing and working with already-computed text features. For a better understanding of the process of deriving word features, Programming Historian provides a lesson on [Counting Frequencies](http://programminghistorian.org/lessons/counting-frequencies), by Turkel and Crymble."
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "source": [
    "# Download the Lesson Files\n",
    "\n",
    "To follow along, download [lesson_files.zip](https://github.com/htrc/HTRC-Programming-Historian/releases/download/v.0.1/lesson_files.zip) and unzip it to any directory you choose.\n",
    "\n",
    "The lesson files include a sample of files from the HTRC Extracted Features dataset. After you learn to use the EF data in this lesson, you may want to work with the entirety of the dataset. The details on how to do this are described in [Appendix: rsync](#Appendix: rsync)."
   ],
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   }
  },
  {
   "source": [
    "## Installation\n",
    "\n",
    "For this lesson, you need to install the HTRC Feature Reader library for Python alongside the data science libraries that it depends on. \n",
    "\n",
    "For ease, this lesson will focus on installing Python through a scientific distribution called Anaconda. Anaconda is an easy-to-install Python distribution that already includes most of the dependencies for the HTRC Feature Reader.\n",
    "\n",
    "To install Anaconda, download the installer for your system from the [Anaconda download page](https://www.continuum.io/downloads) and follow their instructions for installation of either the Windows 64-bit Graphical Installer or the Mac OS X 64-bit Graphical Installer. You can choose either version of Python for this lesson. If you have followed earlier lessons on Python at the *Programming Historian*, you are using Python 2, but the HTRC Feature Reader also supports Python 3.\n",
    "\n",
    "<img src=\"https://github.com/programminghistorian/ph-submissions/raw/gh-pages/images/text-mining-with-extracted-features/conda-install.PNG\" width=\"400px\" alt=\"Conda Install\" />"
   ],
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   }
  },
  {
   "source": [
    "### Installing the HTRC Feature Reader\n",
    "\n",
    "The HTRC Feature Reader can be installed by command line. First open a terminal application:\n",
    "\n",
    "- *Windows*: Open 'Command Prompt' from the Start Menu and type: `activate`.\n",
    "- *Mac OX/Linux*: Open 'Terminal' from Applications and type `source activate`.\n",
    "\n",
    "If Anaconda was properly installed, you should see something similar to this:\n",
    "\n",
    "<img src=\"https://github.com/programminghistorian/ph-submissions/raw/gh-pages/images/text-mining-with-extracted-features/activating-env.PNG\" width=\"450px\" />\n",
    "\n",
    "Now, you need to type one command:\n",
    "\n",
    "```bash\n",
    "conda install -c organisciak htrc-feature-reader\n",
    "```\n",
    "\n",
    "This command installs the HTRC Feature Reader and its necessary dependencies. That's it! At this point you have everything necessary to start reading HTRC Feature Reader files.\n",
    "\n",
    "> *psst*, advanced users: You can install the HTRC Feature Reader *without* Anaconda with `pip install htrc-feature-reader`, though for this lesson you'll need to install two additional libraries `pip install matplotlib jupyter`. Also, note that not all manual installations are alike because of hard-to-configure system optimizations: this is why we recommend Anaconda. If you think your code is going slow, you should check that Numpy has access to [BLAS and LAPACK libraries](http://stackoverflow.com/a/19350234/233577) and install [Pandas recommended packages](http://pandas.pydata.org/pandas-docs/version/0.15.2/install.html#recommended-dependencies). The rest is up to you, advanced user!"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "source": [
    "## Start a Notebook\n",
    "\n",
    "Using Python the traditional way -- writing a script to a file and running it -- can become clunky for text analysis, where the ability to look at and interact with data is invaluable.\n",
    "This lesson uses an alternative approach: Jupyter notebooks.\n",
    "\n",
    "Jupyter gives you an interactive version of Python (called IPython) that you can access in a \"notebook\" format in your web browser. This format has many benefits. The interactivity means that you don't need to re-run an entire script each time: you can run or re-run blocks of code as you go along, without losing your enviroment (i.e. the variables and code that are already loaded). The notebook format makes it easier to examine bits of information as you go along, and allows for text blocks to intersperse a narrative.\n",
    "\n",
    "Jupyter was installed alongside Anaconda in the previous section, so it should be available to load now.\n",
    "\n",
    "<img alt=\"Jupyter Code Blocks\" src=\"https://github.com/programminghistorian/ph-submissions/raw/gh-pages/images/text-mining-with-extracted-features/notebook1.PNG\" width=\"350px\" />\n",
    "\n",
    "From the Start Menu (Windows) or Applications directory (Mac OS), open \"Jupyter notebook\". This will start Jupyter on your computer and open a browser window. Keep the console window in the background, the browser is where the magic happens.\n",
    "\n",
    "<img alt=\"Jupyter Code Blocks\" src=\"https://github.com/programminghistorian/ph-submissions/raw/gh-pages/images/text-mining-with-extracted-features/open-notebook.PNG\" width=\"250px\" />\n",
    "\n",
    "If your web browser does not open automatically, Jupyter can be accessed by going to the address \"localhost:8888\" - or a different port number, which is noted in the console (\"The Jupyter Notebook is running at...\"):\n",
    "\n",
    "<img width=\"500px\" src=\"https://github.com/programminghistorian/ph-submissions/raw/gh-pages/images/text-mining-with-extracted-features/notebook-start.png\" />\n",
    "\n",
    "Jupyter is now showing a directory structure from your home folder. Navigate to the lesson folder where you unzipped [lesson_files.zip](https://github.com/htrc/HTRC-Programming-Historian/releases/download/v.0.1/lesson_files.zip).\n",
    "\n",
    "In the lesson folder, open `Start Here.pynb`: your first notebook!\n",
    "\n",
    "<img width=\"500px\" src=\"https://github.com/programminghistorian/ph-submissions/raw/gh-pages/images/text-mining-with-extracted-features/notebook-hello-world.png\" />\n",
    "\n",
    "Here there are instructions for editing a cell of text or code, and running it. Try editing and running a cell, and notice that it only affects itself. Here are a few tips for using the notebook as the lesson continues:\n",
    "\n",
    "- New cells are created with the <i class=\"fa-plus fa\"> Plus</i> button in the toolbar. When not editing, this can be done by pressing 'b' on your keyboard.\n",
    "- New cells are \"code\" cells by default, but can be changed to \"Markdown\" (a type of text input) in a dropdown menu on the toolbar. In edit mode, you can paste in code from this lesson or type it yourself.\n",
    "- Switching a cell to edit mode is done by pressing Enter.\n",
    "- Running a cell is done by clicking <i class=\"fa-step-forward fa\"> Play</i> in the toolbar, or with `Ctrl+Enter` (`Cmd+Return` on Mac OS). To run a cell and immediately move forward, use `Shift+Enter` instead.\n",
    "\n",
    "> An example of a full-fledged notebook is included with the lesson files in `example/Lesson Draft.ipynb`.\n",
    "\n",
    "Before continuing, click on the title to change it to something more descriptive than \"Start Here\"."
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "source": [
    "## Reading your First Volume\n",
    "The HTRC Feature Reader library has three main objects: **FeatureReader**, **Volume**, and **Page**.\n",
    "\n",
    "The **FeatureReader** object is the interface for loading the dataset files and making sense of them. The files are originally in the JSON format and compressed, which FeatureReader decompresses and parses. It creates an iterator over files, allowing one-by-one access to the files as Volumes. A **Volume** is a representation of a single book or other work. This is where you access features about a work. Many features for a volume are collected from individual pages, to access Page information, you can use the **Page** object.\n",
    "\n",
    "Let's load two volumes to understand how the FeatureReader works."
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "execution_count": 1,
   "cell_type": "code",
   "source": [
    "from htrc_features import FeatureReader\n",
    "# Remember to use '/' for windows, else '\\'\n",
    "paths = ['data/sample-file1.basic.json.bz2', 'data/sample-file2.basic.json.bz2']\n",
    "fr = FeatureReader(paths)\n",
    "for vol in fr.volumes():\n",
    "    print(vol.title)"
   ],
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "June / by Edith Barnard Delano ; with illustrations.\n",
      "You never know your luck; being the story of a matrimonial deserter, by Gilbert Parker ... illustrated by W.L. Jacobs.\n"
     ]
    }
   ],
   "metadata": {
    "scrolled": true,
    "collapsed": false
   }
  },
  {
   "source": [
    "Here, the FeatureReader is imported and initialized with file paths pointing to two Extracted Features files. We wrote out the file paths directly here, though when preparing your code for multiple systems, there are better ways to [deal with paths](https://docs.python.org/2/library/os.path.html#os.path.join) in Python. An initialized FeatureReader can be iterated through with a `for` loop. The code in the loop is run for every single volume: the Volume() object for the first volume is assigned to the variable `vol`, the loop code is run, then the next volume is set to `vol`, and so on.\n",
    "\n",
    "You may recognize `for` loops from past experience iterating through what is known as a `list` in Python. However, it is important to note that `fr.volumes()` is *not* a list. If you try to access it directly, it won't print all the volumes; rather, it identifies itself as a different data structure known as a generator:\n",
    "\n",
    "<img src=\"https://github.com/programminghistorian/ph-submissions/raw/gh-pages/images/text-mining-with-extracted-features/generator.png\" width=\"500px\">"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "source": [
    "What is a generator, and why do we iterate over it?\n",
    "\n",
    "Generators are the key to working with lots of data. They allow you to iterate over a set of items that don't exist yet, preparing them only when it is their turn to be acted upon. \n",
    "\n",
    "Remember that there are 4.8 million volumes in the Extracted Features dataset. When coding at that scale, you need to be be mindful of two rules:\n",
    "\n",
    "1. Don't hold everything in memory: you can't. Use it, reduce it, and move on.\n",
    "2. Don't devote cycles to processing something before you need it.\n",
    "\n",
    "A generator simplifies such on-demand, short term usage. Think of it like a pizza shop making pizzas when a customer orders, versus one that prepares them beforehand. The traditional approach to iterating through data is akin to making *all* the pizzas for the day before opening. Doing so would make the buying process quicker, but also adds a huge upfront time cost, needs larger ovens, and necessitates the space to hold all the pizzas at once. An alternate approach is to make pizzas on-demand when customers buy them, allowing the pizza place to work with smaller capacities and without having pizzas laying around the shop. This is the type of approach that a generator allows.\n",
    "\n",
    "Volumes need to be prepared before you do anything with them, being read, decompressed and parsed. This 'initialization' of a volume is done when you ask for the volume, *not* when you create the FeatureReader. In the above code, after you run `fr = FeatureReader(paths)`, there are are still no `Volume` objects held behind the scenes: just the pointers to the file locations. The files are only read when their time comes in the loop on the generator `fr.volumes()`.\n",
    "\n",
    "If you tried to read hundreds of volumes when the FeatureReader is initialized (as you can try, against our advice, with `list(fr.volumes())`, that command would take a very long time, and if you had a bug in later actions, the waiting would be for naught. With enough volumes, it is also likely that your system would run out of RAM, because all the volumes would be held in memory at the same time. Using the generator, Python assumes that you only need the volume within its turn in the loop: after moving on to the next volume in the iterator, the earlier one is thrown away. Because of this one-by-one reading, the items of a generator cannot be accessed out of order (e.g. you cannot ask for the third item of `fr.volumes()` without going through the first two first."
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "source": [
    "## What's in a volume?"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "source": [
    "Let's take a closer look at what features are accessible for volumes. For clarity, we'll grab the first volume to focus on, which can conveniently be accessed with the `first()` method. Any code you write can easily be run later with a `for vol in fr.volumes()` loop."
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "execution_count": 2,
   "cell_type": "code",
   "source": [
    "# Reading a single volume\n",
    "vol = fr.first()\n",
    "vol"
   ],
   "outputs": [
    {
     "execution_count": 2,
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "<htrc_features.feature_reader.Volume at 0x1cf355a60f0>"
      ]
     },
     "metadata": {}
    }
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "source": [
    "While the majority of the HTRC Extracted Features dataset is *features*, quantitative abstractions of a book's written content, there is also a small amount of metadata included for each volume. We already saw `Volume.title` accessed earlier. Other metadata includes:\n",
    "\n",
    "- `Volume.id`: A unique identifier for the volume in the HathiTrust and the HathiTrust Research Center.\n",
    "- `Volume.year`: The publishing date of the volume.\n",
    "- `Volume.language`: The classified language of the volume.\n",
    "- `Volume.oclc`: The OCLC control number(s).\n",
    "\n",
    "The volume id can be used to pull more information from other sources. The scanned copy of the book can be found from the HathiTrust Digital Library, when available, by accessing `http://hdl.handle.net/2027/{VOLUME ID}`. For this volume, that would be: http://hdl.handle.net/2027/nyp.33433074811310."
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "execution_count": 3,
   "cell_type": "code",
   "source": [
    "print(\"http://hdl.handle.net/2027/%s\" % vol.id)"
   ],
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "http://hdl.handle.net/2027/nyp.33433074811310\n"
     ]
    }
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "source": [
    "<img alt=\"Digital copy of sample book\" src=\"https://github.com/programminghistorian/ph-submissions/raw/gh-pages/images/text-mining-with-extracted-features/June-cover.png\" width=\"250px\" />"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "source": [
    "Since the focus of EF is features, more in-depth metadata like genre and subject class needs to be grabbed from other sources. For example, the [HathiTrust Bibliographic API](https://www.hathitrust.org/bib_api) returns information about a book specified by its id; for our current example, that is http://catalog.hathitrust.org/api/volumes/full/htid/nyp.33433074811310.json. Another additional data source for metadata is the [HTRC Solr Proxy](https://wiki.htrc.illinois.edu/display/COM/Solr+Proxy+API+User+Guide), which allows searches for many books at a time.\n",
    "\n",
    "Data from the Solr Proxy is accessible for Public Domain volumes with `Volume.metadata`:"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "execution_count": 4,
   "cell_type": "code",
   "source": [
    "extra_meta = vol.metadata\n",
    "# Example field: call number\n",
    "extra_meta['callnosort']"
   ],
   "outputs": [
    {
     "execution_count": 4,
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "['PZ3.D3726 J']"
      ]
     },
     "metadata": {}
    }
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "source": [
    "Calling detailed metadata is useful for small data settings, but using it pings the HTRC servers and adds overhead, so an efficient large-scale algorithm should avoid `vol.metadata`."
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "source": [
    "## Our First Feature Access: Visualizing Words Per Page\n",
    "\n",
    "It's time to access the first features of `vol`: a table of total words for every single page. These can be accessed simply by calling `vol.tokens_per_page()`.\n",
    "\n",
 If you are using">
    "> If you are using a Jupyter notebook, returning this table at the end of a cell formats it nicely in the browser. Below, you'll see us append `.head()` to the `tokens` table, which allows us to inspect just the first 5 rows. Jupyter automatically displays the value of the last line of code in a cell."
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "execution_count": 5,
   "cell_type": "code",
   "source": [
    "tokens = vol.tokens_per_page()\n",
    "# Show just the first few rows, so we can look at what it looks like\n",
    "tokens.head()"
   ],
   "outputs": [
    {
     "execution_count": 5,
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "      count\n",
       "page       \n",
       "1         5\n",
       "2         0\n",
       "3         1\n",
       "4         0\n",
       "5         1"
      ],
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>count</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>page</th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ]
     },
     "metadata": {}
    }
   ],
   "metadata": {
    "scrolled": false,
    "collapsed": false
   }
  },
  {
   "source": [
    "This is a straightforward table of information, similar to what you would see in Excel or Google Spreadsheets. Listed in the table are page numbers and the count of words on each page. With only two dimensions, it is trivial to plot the number of words per page. The table structure holding the data has a `plot` method for data graphics. Without extra arguments, `tokens.plot()` will assume that you want a line chart with the page on the x-axis and word count on the y-axis."
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "execution_count": 6,
   "cell_type": "code",
   "source": [
    "%matplotlib inline\n",
    "tokens.plot()"
   ],
   "outputs": [
    {
     "execution_count": 6,
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "<matplotlib.axes._subplots.AxesSubplot at 0x1cf3636dc18>"
      ]
     },
     "metadata": {}
    },
    {
     "output_type": "display_data",
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXQAAAEPCAYAAABShj9RAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAIABJREFUeJztvXl4HNWZNX5eLba1W14kL7Jl2WCDDMQQbMKSoCTELIHA\nDAxb2CYkYSYk8A1fZoDJJOCEDwaehAwk4ZeFhMAkfOCQGbYEMOBPJIYQwAbbYA8Y27ItL5K8aLNk\na7u/P96+VHV1dXd1163uW9X3PI8etUpdVbdu3Xvq3PO+9xYJIWBgYGBgEH4U5bsABgYGBgZqYAjd\nwMDAICIwhG5gYGAQERhCNzAwMIgIDKEbGBgYRASG0A0MDAwigrSETkTjieivRPQ2Ea0notti22uJ\naAURvU9ELxBRjW2fW4loExFtJKKlQV6AgYGBgQGDvOShE1G5EGKAiIoBvArgBgAXAtgnhLiHiG4G\nUCuEuIWImgH8FsBiAA0AXgJwpDAJ7wYGBgaBwpPlIoQYiH0cD6AEgABwPoCHY9sfBnBB7PMXADwm\nhBgRQrQB2ARgiaoCGxgYGBi4wxOhE1EREb0NYA+AF4UQbwKoF0J0AIAQYg+AutjXZwLYYdt9Z2yb\ngYGBgUGA8KrQx4QQx4MtlCVEtBCs0uO+prpwBgYGBgbeUZLJl4UQvUTUCuAsAB1EVC+E6CCiaQA6\nY1/bCWCWbbeG2LY4EJF5ABgYGBhkASEEuW33kuUyRWawEFEZgM8B2AjgaQDXxL52NYCnYp+fBnAp\nEY0joiYARwB4I0mhCubntttuy3sZzHWbazbXHP7rTgUvCn06gIeJqAj8AHhcCPFHInodwHIi+hKA\nbQAujpH0BiJaDmADgGEAXxPpSmFgYGBg4BtpCV0IsR7ACS7b9wM4I8k+dwG4y3fpDAwMDAw8w8wU\nzRFaWlryXYS8oBCv21xz4UC36/Y0sSiQExMZJ8bAwMAgQxARRJKgaEZZLgYGBgbZYM6cOdi2bVu+\nixEqNDY2oq2tLaN9jEI3MDAIHDFVme9ihArJ6iyVQjceuoGBgUFEYAjdwMDAICIwhG5gYGAQERhC\nNzAwMIgIDKEbGBgY5BhNTU1YuXKl8uMaQjcwMDCICAyhGxgYFDza29tx4YUXoq6uDlOnTsUNN9wA\nIQTuuOMOzJkzB9OmTcM111yDvr4+AMArr7yCWbNmxR3DrrqXLVuGSy65BFdffTWqq6tx7LHHYs2a\nNQCAq666Ctu3b8d5552H6upqfP/731d2HYbQDQwMChpjY2M499xz0dTUhG3btmHnzp249NJL8etf\n/xqPPPIIXnnlFWzZsgV9fX24/vrrP9qPyDUV/CM888wzuPzyy9HT04Pzzjvvo30feeQRzJ49G88+\n+yx6e3vxzW9+U9m1GEI3MDDQAkT+f7LBG2+8gd27d+Oee+5BWVkZxo0bh1NOOQW//e1vcdNNN6Gx\nsRHl5eW466678Pjjj2NsbMzTcU877TSceeaZICJceeWVWLduXdz/g5hoZab+GxgYaIF8TSTdsWMH\nGhsbUVQUr2937dqFxsbGj/5ubGzE8PAwOjo6PB132rRpH30uLy/HoUOHMDY2lnAelTAK3cDAoKAx\na9YsbN++PUF5z5gxI279mW3btqG0tBT19fWoqKjAwMDAR/8bHR1FV1eX53Oms2uyhSF0AwODgsaS\nJUswffp03HLLLRgYGMDhw4fx2muv4bLLLsMPf/hDtLW1ob+/H9/61rdw6aWXoqioCPPnz8ehQ4fw\n3HPPYWRkBHfccQeGhoZSnsdusUybNg1btmxRfi2G0A0MDAoaRUVFeOaZZ7Bp0ybMnj0bs2bNwvLl\ny3HttdfiiiuuwKc+9SnMmzcP5eXluP/++wEA1dXVeOCBB3DttdeioaEBVVVVaGhoSHkeuyq/5ZZb\n8L3vfQ+TJk3Cvffeq+xazGqLBgYGgc
Ostpg5zGqLBgYGBgUMQ+gGBgYGEYEhdAMDA4OIwBC6gYGB\nQURgCN3AwMAgIjCEbmBgYBARmKn/BgYGgaOxsTGw2ZFRhX3ZAa8weegGBgYGIYLJQzcwMDAoABhC\nNzAwMIgI0hI6ETUQ0Uoieo+I1hPRN2LbbyOidiJaE/s5y7bPrUS0iYg2EtHSIC/AwMDAwIDhRaGP\nALhJCLEQwMkAvk5ER8X+d68Q4oTYz/MAQERHA7gYwNEAzgbwAJloiEEOcegQcOed+S6FgUHukZbQ\nhRB7hBDvxD73A9gIYGbs325EfT6Ax4QQI0KINgCbACxRU1wDg/To6gIULmBnYBAaZOShE9EcAIsA\n/DW26etE9A4RPUhENbFtMwHssO22E9YDwMAgcIyNAQcP5rsUBga5h2dCJ6JKAE8AuDGm1B8AMFcI\nsQjAHgA/CKaIBgaZYWyMbRePr340MIgMPE0sIqISMJn/pxDiKQAQQtjft/QLAM/EPu8EMMv2v4bY\ntgTcfvvtH31uaWlBS0uLx2IbGCSHnN4wMABUVua3LAYGftHa2orW1lZP3/U0sYiIHgGwVwhxk23b\nNCHEntjnfwKwWAhxORE1A/gtgJPAVsuLAI50ziIyE4sMgsKHHwJHHgl0dAB1dfkujYGBWqSaWJRW\noRPRqQC+CGA9Eb0NQAD4VwCXE9EiAGMA2gBcBwBCiA1EtBzABgDDAL5mmNsgl5BWi/HRDQoNBTP1\n/+MfB/7yF2DcuJyd0iBP+J//AY4+Gnj3XWDhwnyXxsBALQp+6v/oKLBmDdDXl++SGOQCRqEbFCoK\ngtAPHeLfpoMXBiShDwzk/twrVwKPPJL78xqED//8z0BPj9pjFgShDw7yb0PohYF8KvTXXwf+9V+B\nkZHcnztojI4CO3ak/55BeoyOAvfdxwF8lSgIQpdKzRB6YUCGZjK936+9BrS1+Tv34CCwcyfw7LP+\njqMjli8Hrroq36WIBtragOFhntWsEgVB6EahFxaytVzuuw944QV/5x4YABYtAn76U3/H0RHPPw/0\n9ua7FNHABx/wb0PoWUASen9/fsthkBtka7ns3Wu1lWwxOAh88YucUbV3r79j6QQhgBUrTB9SBUPo\nPmAUemEhW4WugtAHBoDJk4EzzwSefNLfsXTCu+9y/zGErgbvvw/U1xtCzwqG0AsLfhS6zIjKFoOD\nQFkZcNFF7DmraHO/+AXw//6fv2MIYcUWssHKlcA55xhCV4UPPgBOO80QelaIGqHv2wfs2ZPvUuiL\nbBS6EOosl/JyJr+tW4GaGuCtt/wdc8UK4E9/8neMH/6Qs2+yRVcXcNRRTOhm3rd/vP8+cOqphtCz\nQtSyXH7+c2DZsnyXQl9kk+XS3w8MDflX6AMDrNArK4FNm4DzzgPa2/0ds7MT2LbN3zF+8xt/5Rgc\n5GsaN85/HRU6Dh5k8fDxjxtCzwqqFPpNN3FKWr7R08NPeAN3ZKPQZQBThUIvK7P+njLFf6ft6AC2\nb89+/82bgbffBrq7sz+GvK7KyuxtFyEKY0nju+8GVq+O3/a//zeP2ABg925g+nRg2jRD6FlBFaH/\n4Q/WTcknenutKLlBIrLx0CWhZ6s+161jwhoYYMtFYsoU/9kunZ3+CP2JJ3hNmwMHEv83OOitLakg\n9OefBy6/PLt9w4SXXwbWrrX+FgL41a94jSGALdPJk4GpUw2hZ4XBQR4q+iX0nh49bJu+Ph4pmACV\nO/Kh0E8/nWdRuil0P4Q+NMQP8B07sveu33oLuOACd4X+u98B//AP6Y+hgtDb2tiGijp6euIfnjt2\ncN3LbZLQJ05kPjl8WN25C4bQp0yxyPhPfwJs79bwjN5efQgdUD9tOJcYGwtuenw2Cr2riwOY2RD6\nwYPcYXt71RP63r28pntFRaKaGx3ln3To7wdmz3Yn9Hfesfz5a69NXlYVhN7VVRhLB/T2Avv3W3+v\nW8
e/5TZJ6ERqRnB2FAShDwxYhD46CnzjG8Bjj2V2jOFhbtS6EPrEieH20R9+GLj55mCOPTbGBJip\n5TJrVnaWy65d/Lu315/l8t57ieeXL+mYPTsxMHrnncAdd6Q/7sAAMGOGO6GvW8ckOzoKPPpocgtA\nFaF3dUU/qOpU6GvXAkVF8Qp9yhT+rNp2KQhCHxzkiuvvZyKfMIGHf5koRLkqWi5X8HvhBWD9+sTt\nfX3ACSfo6aPv3w889FD67+3cmf1yxsuXp15zZWwMqKrK3HKZNSs7hS4Jva/Pn0L/x39MzDfv7LQI\n3emjP/usN3I9eJAnsRw+zMJEQgiLbN5+O/V7WFUROmDVV1ThVOhr1wLHH29t27uXFTpgCD0r2C2X\n114DrriCo8yZLMQk17BIp/oOHQLOP1/N2usPPww891zi9r4+4MQT9ST0tWs55zkdDhzIPuPhxz8G\nXnwx+f+FYOIJSqFfd108ucrMJ6nQsyX0rq5EskxG6Pv2AW++6c1yOXiQRywTJ8ar9N27+fexx1oP\nkqAJnch/GqfOGBnh+nYS+qc/nWi5AIbQs4Kd0HfuBGbOBBYsSG5ZvPZaYgBKKvR0JLFiBfD008Bv\nf2ttE4KPmSk6O91vdm8vP/E3b878mEHg9det+urt9bbGsx9C37Urvwp95cr4h6lUnHv3AiUl/COR\nCaHv3Zuc0Bsb4wn95Ze9pwEmI/S1a4GPfYyPnStCP+ooJvS//CWaKYxSyEl7ZWiIM+M+8YnEoCgA\n1Nb6Syd1ouAIfdcu9hPnz3dXuGNj/DR1RuNTKfT/+3+t7z/xBPA3fwM88IBFcu+8A5x1Vubl7uhw\nJ/S+PmDePH0Wfzr3XMvf7e1N3UD/4z+sIGI2HVoIfiinSh8dG8tOoTc0eCP03t74ut+1i22Ljo54\ndQ4A1dV8zKEh92Pt388jjtFR/pyM0Ovr+fgSL74IHHFEdgr9zjt5P0nos2cDf/4zf1cer78f+MlP\nrGOoIvTjj+dg/mc/q0cKsGpIMSPV+J49fP/q6uIVuvTQa2rUvuSiYAh96tR4hZ6M0Lu6uPPZ80iB\n1Ar93/4NaG1lj/LZZ4Ef/YjP+c47/P9Vq5iEM01PclPoQnAZ5szRg9CFYJKQ9dPbyz9uRCME8J3v\ncD5utgq9u5ttkXQKvbKSFbrXVD9J6F4sFyeh79wJNDUx4doDogBbDJMnJ79XL7zAE1HkA85Jlh0d\nTOZVVfE2Xlsb0NycuULfsgX41rd4//feY7tl9mzrvPJ4bW28nLCEX0IXgols0SKOYw0OBqfQh4fj\nYwWZYmAg+wys3l62c6Ual3wzaZK7QjeEngF27WLykFkuvb1MkNOmMaG7WS7S3/NK6Js3cyfp6ODZ\nYXPn8g1ctMhS7atW8e99+5KXddu2eJIaHWUScBL6wYMc1J08mTu4n4arAv39XFY7oQPuMYT9+3l7\nZ2f2hL5rF6vedIReUpLZNPXubh65pVPocnkAp0I/6ih3hQ6ktl1WrWIvu7OT/06m0J2E3tvLBJ2u\nDkdHucwTJvDwXs5g3LCBf5qbmdABHmXI442NxQsQv4Te3c0Pu7lzgY0brXOoxsAAcPLJ/taj/8pX\neHmNbNDTwxZWby9fn3QEamvdPXRD6B4gFc//+T/82W65TJoElJayh75hQ6KSbG9nInASem8v3xSn\nL7tiBZPHnj28b1MTb29o4L+F4E47cWJqRf3gg8D991t/79/PDcJJ6H193LmLiuIbSb4g7RUnobvZ\nLpKE/RD6zp2c4bN/v0W+XV3Arbda9Ts2xvVTUeHdR+/uZmWV7gEgSdWp0P0Q+uioRXJuhD51qjuh\n19amt1xkGiURt0G5UNi777LYOfpoi9AbG70RejYBf3kdDQ38N1EwhH7DDWzlbNmS/TFee41H3Nmg\np4fvS2Ulf3YSulwEzhB6BvjRj4Bf/hL4/e9Z/QwOsqorLeXKBbgR
z53LixbZ0d7Os/7cFPr06fxQ\nGBmxGvULLwCf/zx35p07reNLQpc+4aJFTDy33w689FJimQcH41VnRwcfw43Qq6v5s+pJCXa4TRN3\ngyRu+VsSulsjlXWRjNB/8hP+6e7mOIYbdu3i4OWsWRyvmDmT7+OPfmRNtJKEXl2dOoNgdJTLOzTE\nI52JE/l3KpKU1yfrXYh4he60XIDk96m7m4mnudlqb06y3LWLr7GqKp7svSp0abcA/P3Vq9nC+eMf\nmVSqqliENDTwdjuhS99/dJTrZfz47BV6VxcTelMT96Mjj8yM0OXiaakwMsIzX5cty37Npa4u3nfV\nquxm5vb2MklPmsQELi2XsjKguJjVuRBWOzGE7gHDw0ycBw+ycpbqoqLCIlwi4PvfZ//bPsxubwc+\n9SkmHDup9fbyvgcP8oPiy1/m7a++ClxyCZ9Hdj6AO8iOHcBf/8oR7qlTuVP/+c/us+UGB+ODRJ2d\nHPgcGYkvn1ToQHCEPjLCHc/LsTNV6MXFVg66s0M/+ij/rFzJCsmNOGQdz5nDE5NuvJHv2XHHWccT\nggn9jDN4/Z1keP55TmHt6eGORcTWRCqV7iT0/fu5bcmgZTKF3tXFxLhypbX9tdeAJUtYGa9dy+3T\nfs1DQ3yeadMSlbFXhe4k9N5eDmK/+io/SAAm9q1b+d64KfRDh7heiOIfLOvXJ09BHBy0Aq2AReh1\ndZytM25cZoT+v/4XizRZL08+mbj/unXc7447znuu+6ZN8UJq9Wpe1pYou6CtbEu1tcwfUqEDvO3D\nD7k9EPE2Q+geMDLClXf11ZZCl4QuCRcATjmFlZ49pbC9ndX7scdyg925k7MQ7Ap9925uCH191jsk\nOzrib96sWXysDRv4WJJ829vdG7Iz0NfZySThzFPt7Q2e0Ddt4uv1Mk1bPvTshF5U5E7oW7dyZ5Ox\nBXs9dHdzfa9bxxOHACtP2g45Cmpq4nNedx13Cqf/W1QE/N3fsYpPhp07+Rzd3Ux2ALeTdIReUWHV\n+29+wy8qqK5monNT6M3NnDP+yiuc3fGXv/D2zZtZ2c+cydc9Z048oe/ezW2guDjechkd5XZXXZ25\nQgeY0IWwCB1g29Beh0JYhG6fLCUVuhDANddYJOvESy/FrxEjCd3tXAcOsLhKhY0brXa5aBG/QOTd\ndzndUs7VWLWK78WMGd4J/d5746/hzTeBxYuBT34y/oHkFb29fF+cCh3gbZs2WXYLYAjdE4aHeUh5\n991cqb29iQpd4qST+CZKtLdbw899+zgT4N57udKlQt+3j8m3rY074fTpyS0XGXiSKi0ZoQ8OMrFI\nIpTBMCehp1PoMuskE5x9djx5yuG/lwkgbgp9xgz3RtrWxvUtg9H2eli5kpXRqacCjz/O1+3WKeVD\ns7mZCaWmhrcXFVlqVRK6TD996SX3Ounq4nvZ02OR3YQJqQOjPT1WyqiM09x1l3VP3BT63/wNz014\n9FH2/7/5Tb5P8mUYM2bww85J6LItAlYapsyEqaxkYkxH6AMDFqHX1vLvE05gkrETOpD4UJRrxbgR\n+ptvAmvWcPt2w4YN8baHndCd52prY9GUCps38/dWr+bruPRSLsODDwLPPMPfcRJ6KstEjrQ2b45P\nVpCEfvrpnLHmxOrVwJe+lPy4doW+f7+7QjeEniFGRthXrahg0tu3jzuOU6EDfPPsb5SRnUgOvaVy\nlmsYy8XpDxzgtMQ5c/imHD7MjUMeX5L8unUWoX/4IXewZIQOWCpdruGRKaE/8ADnFntFVxdbD86Z\nbYA3H7K7m4fPslH29PDoJJnlctJJ1lDWXg8rVgBLl/LPzJlAS4u7QpeWy403sm8u4bQLioo4ZnLz\nzWyPyXVjnnnG6uhdXRYxyweDF4Uu7ajHH+eHxjHHWHENN0JvaODg40MP8ZyFrVu5rUiilB1eEnpP\nD6tDO6EXF/N3BwYsFVhcnLnl
Mm4c199VV7G1aIeT0AG2N9wI/YEHgMsuY8Hjhvfe4+uQo4pXXuHR\nSLJzdXYmJ+D+fu4PbW2catzcbPXbVav4Xsjkg9NO475eVpY8YWD7du6f+/dzn7T3offe41HklVey\nLeV829Qbb6R+HaBdoTstl0mT+ME+f771fUPoHjA8bM3WmzaNf5eVcUW7Efqbb3IHuu8+7kQzZ8YT\nuhD8ZJ4+nTuUfKK3tnLnJmJF395u3bzSUibcrVs5ADRlipWXnsxyGT8+PnAoCX3DBrYPLrqIG7ad\n0O1k//LLnOediUJ/9dXEMsm1J7wS+uzZ8QrdbWU/IbhDLlmSSBoAd6yFC4EvfpGJ2m3YLATbQDNm\ncJ0X2VqvkyCkR/nP/8zv5Ny0icv2hS9Ylk9XFxNOV1e85ZJKoff28r0uKeFRxemn83ZJ6G6WC8D3\n77jjuDPLjCtJlLJNSkJ/+WVe18VO6IBlu0jSsF9zMhw8aJVp0iQWOsXFPLnITixudQiwUHES+vbt\nrF7vuYfv2/Awt9+vfc0iuw0b+B7s3MkP67a2+LXQnedKtfDdli1c51u38uhuwQLut888w2XZu5fr\npK+P6xBIbbv8/Ofcj1evtvaX6Oy0Yha33QZ8+9vx+37wAe8zNMR16ExFtiv07dutYDvA2zZsAG65\nxfq+jEl0dvJIzy8iR+hCsEIvLeW/p0/nhjVuHC8y/7nPxX9/3jy+CVddxdFxac3Iobfs3H198ZZL\nRQU3XtmA5OQPSbYAd8YjjmCinjLFWuA+mUKfPz8+tU966D/4AV/T5s2cVeOm0Ddv5g7z6KOZTXmX\nOfJOQj/nnOSE3t1t1cuBAxzUcxK6U3Xs2WPlIQNcz/Zzjo0x0dTXs0UxY0aiQn/3Xb6Ps2YllsnN\nQ5eYN4+JxzkvQD4Mt2yxFLo9KOo2QpBkOmUKE9Xixbw9leUCANdfbwVoy8v5HjkVelOT1bk3buT2\nYid0GRi1E7qbQh8asspuV+iSBJPBK6EfPMgPyoYG/vnwQ+CrX2WyvfJKzjTZuJGtnfZ2Jqo77rD6\nZLJzyVx8icFBFhYbNrAQGBlhhTx/Pvvoe/ZY9teePUzE8kE+Y4Z7+x0aYs+8pYXrQs73APi+Dw1Z\nD+cLL+RlLez44AMu79atwL//e/wKmLt3W/emsZH7ohQfAAu7m2+2OAPgNl9RweLg0UeT3xuvSEvo\nRNRARCuJ6D0iWk9EN8S21xLRCiJ6n4heIKIa2z63EtEmItpIREv9F9M7Rke5kmQlTp/OjZGIycS+\nzgbADevEE7mSX3mFJxUA8QpdYto07oh793JD27rVyjufNi3Rn29oYNUJMAHYPV4nDh3iYblU6HbL\nZdcuVgpnncVldEtb/P73WdUtXZrZWuOrVsXnBMv1RE45JTmh33abFUjq7nYndKdCX72a62zCBC7/\n5MmJhG4n4enTExXW737HoxR5b+1IReizZ3N9rl/PbcNJ6Js3Jyr0Q4f4YSy9Vgk7oR88yKob4Osq\nKUlO6PaU2fJySyy4KfTOTr6GZ59NrdDtNpOEEJy5M3cux37shF5UxNeUDF4sl+pqHkXdcAP/3dzM\ncYQ1a5gg77+fFWhtLbf97dvZtnAKKWfMA4hf2gBgS+Sdd4Cf/YyJu6mJs8bmz+c6XLiQH/579/K+\ncjQOcJ26KfS//IUFwSWXAP/1X9znZDvo6orPQJk8mevTbt28/z63p5dfjs/U2rGDy9XezuLgq1/l\nDLrLLrP2vfVW4LvfTSzTxIl8nSomCXpR6CMAbhJCLARwMoDriegoALcAeEkIsQDASgC3AgARNQO4\nGMDRAM4G8ACRWxcMBna7BbAIPRW++10mqGOP5UAqEE/ossPV1rLabm/nhwAQr9Cdds7s2fGELpFM\noTc3swLo72eFc9RR3EiXLGG188lPsmKSalCmQgKs5k47jRtjOttA4tAhJrmjjorvVDNnWkFdN/
T0\nWOft7uY66Onhso2NcV04CV0Gm2RduRF6cbH1t9uQ+Ykn2Lpwg50gZNqiREkJd2I5Z8BO6I2NTOh2\nD31wkMs7MJCY+2wn9OOO4/YAcL1XVye3XOyQXrgkyqlT+Thz5jABd3Rw+9uzJ340IofnqSyXH/+Y\nSfTNN/nBayf0dPCi0EtKOLNH/t3czAvR3XsvX/sXvsDtqrmZ29Cf/sRWjwzIpjqXU6FL/721lR9E\nsq/JUd7ddwNf/7ql0OvrrX2TWS7vvcfq/mMfY8Fy0klWbrgzcEtkje4Abgvt7fxw+q//Sqyn/n5+\nsMl785WvAN/7Xvzx3FBTw6PinBC6EGKPEOKd2Od+ABsBNAA4H8DDsa89DOCC2OcvAHhMCDEihGgD\nsAnAEv9F9Ybh4fihnRdCP+UUq5FI2An9hBN4W3U1d479+/mN3YCl0OvrExX6t7/NL4cFLEKvrExO\n6GefzQrkoYd4+vKkSRzNf/JJ/s7JJ1u5wPKYkljff9/yRCVhOOG0EHbv5gZcURGvloqLuTMmU+gD\nA/ETiqRCl5OeamsTLZe33rIegnV1fG3pFLq9vNu2cb0vSdKS3IKidsybx4T+t3/LHXfPHv69YAFb\nLvYsl0OH3K0oIJ7Q5QNKoqoqfVsDEi2XoiJ+gE+ezMTe1gZ85jP83XQK3Wm5/P73bB1KIbF7t1pC\nd+K007hOl8bG4SUlrN4/8Qku+x//6B6k90rosk9Jhd7UxLYbwCPWxka+Zx98EK/QnZbLxRez5SWz\nzo49lvvS0Ufz/nJZEDuhy/PKVU03b+YH7FFHWTNJnZlVgCUOvCKnhG4HEc0BsAjA6wDqhRAdAJM+\ngLrY12YCsGcw74xtywlGRhIVuhfV5ISd0I85hu2M0lI+1vjx3Ciqqizl8elPs/qzY/Jk6+aWl/PP\n7NnuvuehQ9wg/+7v2J+86CJrv+nT+XNtLZfFGRTt62NilZ1fDunt+N3vuDHaiVY2YDe7oraWFYlb\noGpw0DqO9NDlK9iqq/ma7QpdiHiFXleX3nJxKqzOTi5/MpWTynIBWOEdOMCd8eSTmdxLS635Ak7L\nRRK6817Jazz77MTRQnW1N0KX57ATpRQGlZX8gDnnHL5Wee+B9EHRoSF+cMoHf0MDP+j9ELrTcnHi\n3HMTc/1vvplHvXKmsx9CX7CARdEJJ7BCdwZyAe4H774br9Dtq1OuWMHt/9lnmdAXLuR6njePf6Qw\nkokIdtgJ/YMPuDzz5iXaRaOj3Mauuy5R2KVDTQ0/eFUQekn6rzCIqBLAEwBuFEL0E5EzySjjibK3\n217s2dLp9Ux+AAAgAElEQVTSgpaWlkwPkQCnQm9oiA9UeoVMXysu5s5z5528XaZCHnssK2lJMJ/9\nbPpj1tdzx02m0CdM4AfHr3/NL/V1w3/8h6W+KiuZeP/wBw64SBJzWi7t7ZyBMHGi5fEB1voa9qVs\nJRkSWSqntJQn+8jUP6dCl+uI79/PdTVxYvyDY/t2rkdpSX3zmzyM/fWvre84Sbimhh/OMufaPvnH\nDekIfd48/j1/Pj9YnnuOr92+pgbA90C+CKWszOq4W7Zw0EqS6YUXJpahqsqbeHAqdDsqKzmOcvrp\nnOJob8vpgqJr1nA7kDEWSeif/GT6MgHZKXQg+UNWCoxsCV1OopOTji68kC0SJ6ZMYSvF3gdraqz4\nx3e+w233D3/gB4zMv7//fn74SUJPptBlJtimTVy/si1NnRp/DcXF2S0KJttesrhXa2srWj0uLuOJ\n0ImoBEzm/ymEeCq2uYOI6oUQHUQ0DYC8HTsB2PMQGmLbEmAndFVwEvrixZZlkQmkQi8ujn9qV1Tw\nkK+01L1Tp8Kf/sQeZ7KgaFkZBw7b2+M9dzvkUBzgjrR0Kaf5LVhgbZeEIfHGG9xwBwf52PKBIBuw\nXBkOiCfDmTP5+/v2cWewE7rMGe7uZvukqoq/Kwm9u5s7wN
y5HBA98USr459yCiu/VApdqtPdu7kT\nHTiQGaE7SWbePL7W2lpuE/ffz+QuCd2u0Nes4fq3v5Jt40b2iefOtQjTCa8K3RkUtaOykkl02jQe\njdlh99BnzEgMiso8bImGBg6iB2m5pEKmhO4MitrnXAA8upULidkxZQpPHrMr9Opqi9Dffx946ile\nJ6ioyFLQZ59t7Z+K0B95hD/v38/fnTeP+1BtbWoR4RWS0JMpdKfYXbZsWdJjeS3CrwBsEELYVkjG\n0wCuiX2+GsBTtu2XEtE4ImoCcASANzyexzecloscemYKu+Vib8wVFfEzvTJBQ0Py3GF7p3EO+1Jh\n6VJWk/ahqFOhy6GiDHSuWcPB0FSWiyyvnB5vVw9SoctZi5LEd+ywLJe9e7lMzz3HEfxFi+LL7awH\ntw5ht126uxMDa87jufmZEiedxKmDAD9curvdFXpZGZPgSSfFe9R9ffxQWb8+OaFfcEHidbrBGRS1\no7KSyz5pUuJ+6SwXN0IfHs6M0J1Wgh9CnzKFg5bO+JQ8l1PdulkuXkbXU6bwddoJXSp0Ifj3pEl8\nT5ubEx/2qQj9iCMsy0UGmCsq2OIpLU28hmxQU8Oj3JxYLkR0KoAvAlhPRG+DrZV/BXA3gOVE9CUA\n28CZLRBCbCCi5QA2ABgG8DUhslm3LDs4FXq2sCv0CROs7eXl3OmyhRuhyxvpTKn0gjPO4AbqRug/\n/rH17tGTT+braG/nB8DEiVwWSeiyI4+Oxiv0nTvZPnESulwDXRJQTY1F6OPG8SzKVat4dLB2Lecn\np6oH+3kl7IHRdJaLXa06s1wAVry33WZ9bmiIJ3R7UPS99zhD4bXXrGPKKflDQ8mDXtddl7x8dqSz\nXOQ9caKqilVssqBoW1t8O5BCRoWHbu8DXkEUP5s31bnq6/0ROhAfFJUKvb+fy15ayjNj3TK3UhH6\nzJk8QpUTn+yWWjoR4RU1NTwK3bMnu/3tSEshQohXASR79pyRZJ+7ACiY95Q57JOK/EBOLCoqim/M\nfhQ64E7ozlFAJpgyhQNoMusGsAhDztD74ANeqGxsjIOTGzaw+q2q4ii/m1oCuDF/+CF3LDuhS7tg\n/35LNdfUcCaK/Pviizl4/NOfsl3hXHwpG4Xux0N3YvFid0KX9+HEExMVurSSkil0rygrYwJxI/Sq\nqnil6fzfhx8mV+hOAlRB6FKhp6r7bGA/1+gok7FzMTj7UtGpIAndzXKRMzcBjt24LcEr03/dCL2o\nyIoJOVNAM21zyXDkkcwBK1eyGPGT5B25maLOPPRsYbdcgib0bBWQxLPPWr44YCn0vj5WyXK69KxZ\n3Gk2bODAm2zAyVL+pEXjZrnYF+8HuNM89VT8OuYnnsgBpX37rEBSsnpwG7LaFXqmHnq6zvX3fw+c\neaZ1LyVxTJjA5Tj++Phj9vdz/jGRd4JMhnQeejLLTa6j0tPjHhTt7w+O0LMVHF7PNXUq32N7O7Ov\nLJoKU6dy+3P2Uyk67Jlmbm0olUIHrHpPRehuI0yvuOgiTjVV8dKPSBK6SsvFSei1tfFDu0yRjNBV\ndhhJGL29rMgPH2b10tDAaW0jI6zcZZpWMjKUlsuePYmETsTKXxL6d77DXvkll1jfmzmTO8MxxyQ2\n9nwr9PPO46yI+nru0PJhUlbGD8fy8kSF/vGP82xIv9Pk0lkuyQg93UzRvr54O1BOSgoybTFbOM81\nbhyLjrfftr6TieXiHNUUFXFd2FNSk6G+nj1xt7RFwCJ0+8qVbteQLaFLlJb699EjR+gqLRc3Qv/e\n96zlAbKB2/obfiwXN9hX5Zs0iTuKDA7LyTnDwxzsSRUUlYRuV+jyHZV1dZZ1A7CFYV9NT2Lx4vRZ\nDs7zStin/3sJiqbKckmGyZOt178BnFZ63nmJx5Tq1znXIBukC4p6JXSnQjx8ON7jra3l43udh+Ec\nqQG5U+hFRRzgf+EF6z
teCX3+fJ7I5ER1NY9I0030OessvvaBAXfyT6bQ001myxQqCF2BOaEXVFku\n9jx0O6Fnk9Nuh9v6G34tFyfslstZZ1nnkx184UIeYr77bmpCnzaNh6EjI5byk5174kQmwnRTB77y\nFfc686rQvQZF/QSo7Cmin/+8RdpOhe4nGG5HeTnbJkVFiW1Vzrh0Q1UVj7AmTeJ709Fhla+/n8nG\n/iAj4vVDvGZ55dNykYR+5528BgrgndAXLgQefjhxu1dCHzeOp/L/7GfuQsCL5eIny0XCELoLgshy\nUdmYVQdF3WC3XP7t3yx1LVV6czMHMDdtsrJU3LJc5BLAdstFvnS4poazQexLorrh3HPdt6u2XFSr\nJWcZnf60H8i1ut3ueaoJalOmcNDsySc54Owsn9sDx7n8ayrYjyfz0nJJ6J/6FM++lSMQr4SeDNXV\nPKnNy1T8urrkdWUn9CCyXCRKS70vqpcMkbRcggyK+kUuPPSyMu4Uhw/zcNQeMG1p4Yk9TU2szolS\nk+HMmewxSuUg37IzcSJ3lkynOUt4IfSaGj5vf39mQVG3tMVsYFfoyQgzG5SXc6A403ve1MQP1+OP\n57/thKLigZPMQ1f5MJNwa3Pl5WzPrV7N2yWxZwuvCj0dvCp0v22upMR46AkIOijqF0FkuThRXs4B\nnqqqxCHkz3/Ok1/mzLEi+qmGjjNncnDNqdAluTpXmPQKL3nocvkB+d7PTDx01YTuVy3aUV6eXKGn\ng/2hYidFFZZQMsvFL7F6OZe8X5WV1pLVKhS6CkKXsYugslwkTFDUBSoJXaaWBU3oQQRFOzpSd8L5\n8y1vNRUZJiN02UmCVOgAB0a3bOHPqe5DEITuxdLIBmVlTBB+25VqSygZoat8mCU7l7xfJSXqHqLV\n1fFrF2ULuYaOyXLJA4zlYhF6qs5wzjm8+JOzTM6GefLJ/OJmO6HLoGhlZfYdzkseOsAPjI0b06ee\nqfYzgWCDooD/e666fMksl1wq9JISbmtCqCH0w4fVEPq+fVw2O7fomOUSOUJXpdBLSrhRyanDqpBM\noau2XNIp9KKi+DfZJCPDK64A/uEf3C2XbO0WeU4vCr25mWe8piN0Z+dS8UqVIIOi9t/ZQvUIQgfL\nRRL64cO8Ta59ng1kmVUQekdHYj6/s9+oyHIxQVEHVBE6EZOsnPSgCm556EEo9M5O750wnRdob2gy\nKFpTk73d4jwnkJzQL7wQeP751P6583iFotBzERSVr1mTb2ZShWT3q7iY25oKm0cSuSpCd+bz6xgU\njVzaoirLBbBeZKzyBXq5ykMfHfXeIdINHWUnE8JS6Mcck/jG80zgldCPOYb9fi+WS1Aeulz/QxWp\njR9vvSrQD4IOio4fz/MVqqvV9gG3czkVutdp/6mQC4Wum+USOUJXpdABS6GrRK7y0IHsFLrb0LGo\nyPqOJPTTTotfqjVTeCV0Il7rQi5h6uV4qtMWVapzwCJz1ZaLaoUuFxFTbbe4ncsZFPW6MFcqqCT0\nzs7EEamOWS6RI3RVU/+B3BH64KBawpBE4bWDe1EaUjnJoKhfeCV0gF9BJt+dmup4qoOisowqM1wk\nysvVB0XdFpbKBM52MGEC17vqDBe3czkVugrLRSWhd3Xxqoh26KjQI+mhq7JcJkxQa4UAuQuKApkp\n9HRkaCf0bN7R6nZOr4ReW5vYmZwIYqaoXaGrJrUgFLoKy8XeDsrKmMhyqdClvefM+c4Gstx+yy9f\n7O4sTxBZLn6DopFT6GG0XIIIigLZWy7JCH142AqK+oWXiUXZHk81oeuq0IMOipaVMbHmQqFLm08K\nBxX9uLqa69nvceS9DzrLxcwUdYFqyyUXCj0oQs8mKJqMWHOh0P10CCdBqExbDGLquyrLJcigaKbC\nwM+5nB66iuSGujr3d5BmClmvQWe5GA/dBaotl1wFRfNtuWTioefacsn0eEFYLqoVehiC
orJN5prQ\nR0bUCLO6Ol5Azi9SKXRD6AEjDJZL0HnosiNmGxR1U8r5DIp6PZ70H1UHRQcH9VXoQc4UzXSkl+25\n7KNC6aGrSj9W0Q4kkacidJPlEhBGRtTlC8s8dJXIheUi0+JUKnQZsNFZoUty0z1tEdAzKCrJFMiv\n5aKS0FWgpIRFUi4UugmKOqC7QnebWGR/ka0qlJVlptB1znLxgiDXQy/UoGimIz0/53J66CqtUxWo\nrAw+y8XMFHWB7oTuptC7u9UTemWl9ze1e2mYktCDynLR2UNXHeMAeH2cOXP8HSPIoKgQ+Vfoqvqx\nCrgRuuosF2O5uEDlUC1XhN7T4518veLll/mFCJmWKR9ZLvLtOH4yU4IgdHnMIIb/LS3+j5GLmaJA\n7gl9YEAvywVgQjdZLnmAUeiMI47IrkxeFLoKtao6oBRE2qJU6KOjahdoUwVZvpERNWXMV1DUbWKR\njoRuslzygLBN/T90iBWq6iF9tmVKl+Vy+LCaoLOXc2Z6vCDWQw9KoauAfQShos3rkraoo4deU5NY\nD6pFiexjvo7hb3c9cPfdvHDOlVeqbQgNDTz8UwknoUu7RfVqdpmWyWtQdGhIjVpVrW7sfrKqLBdZ\nL6Oj/h84QUCWT2WKnw6ErqNC/+UvOa/dDh1fcKFRlWWPXbusDqfScrnuOjXHscNJ6EHYLZkik6n/\nQ0PBKHTVlovKoOjoqF7kIiGvcXhYzQPHWYelpfzwzsfiXLoFRd1e5qJ6VGgW54pBdjpAv4bghHNi\nURAB0UyRSZbL4cN6KvSgg6I6KnSAy6hqVOpWh1VVwbTPsHnoblBtG+aE0Inol0TUQUTrbNtuI6J2\nIloT+znL9r9biWgTEW0koqX+iucNdkLXzXtzwpmHrptCT5flokqhE7E1IoS+hG5X6DoT+uHDwRA6\nEbB2bfq3Rak4l+6Wixt0DIp6KcJDAM502X6vEOKE2M/zAEBERwO4GMDRAM4G8ABR8O7w2JhVsSot\nlyDgZrnkW6FnMlNUlUInskhdFaEHtR66rpYLwA+aoaHgFLqf98Zmei5A36CoG1QTuoqgaNoiCCFW\nATjg8i83oj4fwGNCiBEhRBuATQCW+CqhBzgtF50bQrKgaD7hZYKE6qCoPK98GOuctqi75TI0FIyH\nruKhmOm5dPXQ3aA6yyXfHvrXiegdInqQiKRpMBPADtt3dsa2BQqn5aJzQwhzUPTwYf6/qgemSkIP\ncuq/zpaLaoWuepST6lxh99CjlOXyAIDvCiEEEd0B4AcAvpzpQW6//faPPre0tKAly+lzYbZcdFDo\nXoOiAwOszlWZaHbCVKnQVS/OpbPlIhV6UJZLUIiKh56LLJfW1la0trZ6OkZWVSaE6LL9+QsAz8Q+\n7wQwy/a/htg2V9gJ3Q/CbLl0dwMLFuSvPEBi53KrP0noqlaytJ9X1cSiIIKiYchyCSoomi9CD6uH\nHlSWi1PsLlu2LHmZPJ6LYPPMiWia7X9/C+Dd2OenAVxKROOIqAnAEQDe8HiOrBF2yyXfCt1rlsvB\ng2qnwKv20IMIiuqu0KXlEkUPXdc6l9AxKJq2yojoUQAtACYT0XYAtwH4NBEtAjAGoA3AdQAghNhA\nRMsBbAAwDOBrQsill4KD3XLRPZjiloceFg9dWi6qz6t72qLuCl2V5RJEHCIZohYUDY2HLoS43GXz\nQym+fxeAu/wUKlOEPQ9dB4XuJcvl4MHgLBfVQVFV7xQdHi6soGi+CF3Wb5iCos6Rrd/y5jvLRRsY\ny8UfvAZFdbdcgpxYpCu5RDFtMSweuo5ZLpEg9LBZLs4slzBZLroq9CDISB6zUCyXIDKFvJzLHrcJ\nm4du1nIJAGGyXMJK6KWl4VHoQaQt6kroUbFcouCh+20jOZkpGgaE2XJROfMyW3jNcgkqKKoqD70Q\np/5HLQ89rB66UegKEWbLRQf154UMdc9DDyJDIwxZ
LmFOW3Rrc9JDN4SeZZn87a4Hwmy56ELo6YaO\nhRgUNQrd/zEzPZe0HXTvx4Ah9MAQNstFllWuNJhvQs9k6n9YgqKFtDhX1GaKhsVyke0DiMbiXNrA\nabno3BDc8qXz+fo5wHuWS6EpdHmvdBhFJYMJiuYPOs4UjQShS4UuRDgUuj0AqQNRRGWmaCFO/Q8y\nDz1IoZEqKBpWDz0UbywKA6SKkg0jSGXhF7oSerqho+4zRYNMW9TZcomqQjceepZl8re7HpBkHrZG\noIsKiYLlUqjroUc1KKpL30gFQ+gBQQ6LdbdbAD0VupMMk2W5jI4Go9BVr4euOm1Rd8tFZVA03y+4\nMITus0z+dtcDkhTCFkjRhdC9NExZr0EpdB3XQ5fHDIPlEsY89HQTi3Tvy6qzXExQNIawWi5hInRZ\nr7paLk51qTJtUXeFrqrd60DoZmKRzzL5210PyE4XlkZgf6qHjdBVWy72YLbfYxWih15cHM089LCJ\nM5PlohB2yyVMjUAXovCa5QLoq9DtcYBCynJRmbao2wsuwtSXjUJXCKnydB4aS9g7jS6E7nWmKBCO\ntMUgJhbp2q7CmuWSrM2FyUNXXV+yTuQxsyqTvyLoAbvlogNBpoIzbVGH8noZOuqu0IOyXHRvV1HM\nQw/jaFtFfRH5D4xGgtClitJZSUk4LRcdyhvFoKjqtEVdCT2sCj3ZueR2VdcUJFRnuQD+bZdIELq0\nXHRWUhK6eujGckl+TF0evG4IKm0xl28sst8vqVIPHdK3ziWCaHNyhJJ1mfwXIf+QKkoXgkwFXQnd\ny3roQHAvuFC9HnohrbYYdoXuVLfyWsJG6CraiLFcEG7LRQeiSNW5JMJguRSiQlc9UzRflou9HxQX\n8wNZh76RCkEp9IIndGO5+IOXLBeZcRAGy8W8UzQ76OChA3wtuj5A7TCEHhDslovuDUE2AiH0IQqT\n5ZL8mCMjwfvJfqAyD90QemawB0VVBuILntAlKYRBocsXWuhM6GENigbRuXTPtjAKPX/wYlVmCqPQ\nEa6gKGA1BF0eQPkOiuqq0FVmkASFIIOi+XjBBcD1rfukIsBkuQSGMAVFAf2CbVHIQw8iy6WoSP81\nRaISFA27QjdZLgoRpqAooDZdTwWcyxGE1XIpRIUeteVzgfASeigsFyL6JRF1ENE627ZaIlpBRO8T\n0QtEVGP7361EtImINhLR0uyL5h1hCooC+hF6vhW6inoIKiiqO6GrHEUEEYdIdS5D6InIhUJ/CMCZ\njm23AHhJCLEAwEoAtwIAETUDuBjA0QDOBvAAUfDvtJekYBS6v/IA6bNcdFbokoxUpi3qHhSV1xkl\ny6W4WO86lwhllosQYhWAA47N5wN4OPb5YQAXxD5/AcBjQogRIUQbgE0AlmRfPAs/+1nyNQ7sqy3q\nQJDpoDuhh9FDL1SFLssWJUIvKQlfUDTsWS51QogOABBC7AFQF9s+E8AO2/d2xrb5xre+BXR2Jm4X\nwkoBDJPlolNWjpehdtCvoFMdFC0UD11ep/HQcw8ds1xUVZvIZqfbb7/9o88tLS1oaWlJ+t2REV6w\nxwn7EzIsloskH13Ka9ZDT35M3S0XlQpdhxdcAHwt8n86I1dZLq2trWhtbfW2f5bn7SCieiFEBxFN\nAyC1804As2zfa4htc4Wd0NNhZITTs5ywK8swKXRdLZeorOWianGusCj0qFkuYST0oCwXp9hdtmxZ\n8jJ5PA/FfiSeBnBN7PPVAJ6ybb+UiMYRUROAIwC84fEcKTE87K7Q7esR66J40yGseehyaVPV59V5\npqgQetyjZIii5RKmiUWq25xfDz1tUyWiRwG0AJhMRNsB3Abg3wH8joi+BGAbOLMFQogNRLQcwAYA\nwwC+JoTIyo5xIplCtytLXQgyHXRW6MmGjuPHA+eeq3b2oP2VW7paLoAe9ygZohoUDQOCsKgCJ3Qh\nxOVJ/nVGku/f
BeCu7IuUCNnpUyl0nQgyHXQkdC9T/59+Wv15gyB0lWmL9t86IijLJV8vuADCQ+hB\nZLn4TVsMRdVJsknmoYfhRQR26EjouVJmbucN4gUXKhW6zgRjFHr+EOUsl0Ahn1jJslzGjTOWix/k\nMrvBDt2zXMKk0KPmoYcBucpyyahM/osQPOSEomSWS2mpXmmA6SAtDl3KG8TQMZPzqiAP6e0LUVge\nelSzXMISFNXNQw8FocsLTGa5lJYaha6iPEB4Fbo8npw1rCptEdC7TUXBcnF7p6jOdS6hY5ZLqAjd\nqdBffpkr0k7oOhBkOkiLQ5cHkLNh5qoOgyJ0QN3yuYDebSqqCl2HfpEOOma5hIrQ7Qp9ZAQ44wz+\nLW/+8LDenU/CKPT486r0vEdHrbdCqTgeoDe5qPT5g5ic5fVcYSR0k+WSJdwUuvTVDx/mSigq0n+a\ntoRuhB6FoKg83siIuvKHKSgadoVur+PiYr3rXELHLJfQKnR7oFQ2AN2naUvoRuhRUeiqCb3QLJcg\nAsvJEDWFbrJcMoBblotdoRcVGYWuojxAfrJcVNVDURG3C9UKXec2pbqM9sCyIfTU0DHLJQTVltxD\nl9uMQldTHiB/Cl2VZVCoCl1VGVWPmtKdBwj3xCLdslxCUXVePPSwEbpueehRyHKRASVVwbwweOhB\nKPR8E3pxcbABWVUwWS5ZIp2HbiwXfzBBUXdIUtG5Tan00AG1C6alQhQmFtljDibLJQOkUuhhDIrq\nloeey2CYHUHloask9KIivdtUVC2XXLVBv1BdXwWh0O32its2O6HrQJDpoJtCB3LXkYM8p+qgqDym\nzm0qipZLWIKiQHx9qcpyKcjFuZJZLroQZCroTOhhXctFHk+l5QLonxOt2nLJNaHLdwLbPfMJE8Kl\n0FVmBZmgaEiDojoSei7S1ZznNITuD2FX6JLM7YR+443hCIoC+lkuoSH08ePTpy0ePhyOoZquhK5y\n6JiPc8qAUiFZLkF56E7VrBry2G73q6YmuPOqhuogckF46CMjQGVl9CwXXdIWAWsdFCB36ki1zSMV\nuurX5Olyj9xQVGQFb1UdT+V6OOnOpfoBnGuobsN+s1xCUZWS0NNN/R8e1ltNSeiq0HPducJgueie\n5VJcrP7F3blqB1EidKPQM8DwMFBRES0PXbflfg2hu0M1YaqGakvIEHpm0C3LJRRVmU6hh81y0S0P\nHQgm5c/LOXUn9DAodJXlM4SeGeQ1AGosqoJQ6Kk8dBMUVYMoEHoQQVGj0INDFAhddZsrKEL3MrFI\nF4JMBR0JXTbMXJbHKHT/CIrQcxEYz4eIUA3Vba6gCN2u0O256dJyMUHR7BElD72QslxMUDS/UP1Q\nKpgsl4oKVuNC8Da3LBep1nWHIXTrnCoDSkEovjBYLsZDzx+MQs8Cw8M8HZjIutgoWC465aHnk9BV\n56EXkuUShELPlQ0SJUJX1UYKJsulpCR+tqhZPlctohIUVb0Wje6WS5iDokEEsXMN1f2mIBS6JPQJ\nEywf3U2hDw/r3fkkdMxDl/UXZkIPSqHrLBLCTOiq1W0+oFuWi6+mQERtAHoAjAEYFkIsIaJaAI8D\naATQBuBiIUSPn/N4JXRA784noWseelSCooWk0FWXL5eqOUqWiy6E7rcYYwBahBDHCyGWxLbdAuAl\nIcQCACsB3OrzHBgZ4TeYpLNcAL07n4SulksU0hYLbT30KCj0sBN6lLJcyOUY5wN4OPb5YQAX+DzH\nR+mIdoXuttoioA9BpoLOhB4FhV5IaYuG0POLqCl0AeBFInqTiL4c21YvhOgAACHEHgB1Ps/hOSgK\n6K2mJHQldOOhux9Tl3vkBjP1P7+Q/UaXLBe/9HeqEGI3EU0FsIKI3geTvB3OvzNGMg+9rCw+Dx3Q\nu/NJ6Jq2GHZCN1P/1RzPELp36KbQfTUFIcTu2O8uInoSwBIAHURUL4ToIKJpAD
qT7X/77bd/9Lml\npQUtLS2u30um0MvLwxkU1VGh5yOFLIiJRYWWtjh7NnDaaeqOZwg9M+Qiy6W1tRWtra3e9s/2xERU\nDqBICNFPRBUAlgJYBuBpANcAuBvA1QCeSnYMO6GnQjKFXl4OdHWZoKgKRGViURBBUV3ukRvmzgXu\nuUfd8YJ4KKY6V9gJPRcK3Sl2ly1blnx/H+euB/DfRCRix/mtEGIFEb0FYDkRfQnANgAX+zgHACvL\nJUqWi2556FHJcik0y0U1jELPDLpluWTdVIUQWwEsctm+H8AZ2RcpETLLpbIS6O+3tpWV8ecwWy66\nlDcKHnohBkVVwxB6ZtDNQw9FVUrLpbIS6OuztpWX8+ewWS72iUW6lDcKaYtS3ahOW9TloZsLGELP\nDLpluYSiKiWhV1XFK3RJ6GFW6LoQer6Dokah6wFD6JnBKPQsYCd0qdCTEXoYOp+OhG4sF3fonuWi\nGobQM4Nua7mEoirTEXrYLBdd89CNQk/EjTcCS5ak/15UIOswV28sCjuh66bQQ2BQMHmXliYSek0N\nfyb57x8AAAqnSURBVDaWi3/kO8tFVR66aoI4+2x1xwoDjELPDLpluYSiKu1B0XRZLroQZCroSuj5\nslx0fsFFoSGX7SAfbU41dFPooajKTCyXsCh03fLQoxAUjcILE/INo9AzQxBZLpG3XCShl5W5py2G\nWaHr8gCKioc+PJwb/zeqMISeGYJQ6AWVtpguyyUMjUPXPHST5WJgXnCRGVTXl3yNoshyScNQVKXb\nxCK7hy4tF13Ubjro6qFHQaGHnSDyDaPQM4PqayCySD2r8qgpRrCwZ7mkmlikCzmmg65pi7l+J6sh\ndP1gCD0zBDGy9eOjh6IqvVouRqFnjygodBMU9Q9D6JkhCEL3k7oYiqqUhC7XPx8ZcbdcdCHHdNCR\n0POd5aJrHnqhIR+ErksfyAZBXEPBKHQiKxc9CpaLToQeBYVeVATs2gVUVPg/VqHCKPTMEMQ1+Ml0\nCUVVSkIHUhN6mCyXkRGOZOtE6FGYWLRxI3DZZf6PVagwhJ4ZghjZRlKhj41ZqTtyPXTA8tFHRsJt\nuUjy1CVnOippi3PnAkuX+j9WoUJOejOE7g1BKfTIEfr11wP338+f5RuLAIvQw6zQi4tzn1GSDvnw\nM1UTelMT8C//Em6CyDeMQs8MumW5aEuBq1YBa9bwand2yyUVoetEkKmQjxTBdCgu5tf5hVmh/9M/\n+T9GoWPCBGDfPkPoXlFUpL7fRC7L5eBBYMsWYPNmYNu29B56GCcWDQ3pRehRCIoa+EddHbBnjyF0\nr9Aty0VLCnz7beCYY4BjjwV+//vkCt252qJOBJkKOip02TDHj8/tOQ2h64X6eib06dODP1cUCD2o\noGikslzefBNYvBj43OeAV19NJPTeXr5gQ+jqMGkSp/yFOQ/dwD+kQjcvuPAGExT1gDffBE48EWho\nYJJxZrkcOBAfBA2b5VJSAgwO6kViJ54IvPtubjtXaSnXg1Ho+qCuzgRFM4EhdA9YuxY44QRgxgwm\ndGeWy4ED/DcRX3zYFPqcOcCmTXqVd/Hi3E90WrCA1WBPT7g7dZRQV8e/DaF7g4yHGUJPAiGArVs5\nn3jGDGD37niimTyZSV4SfGlp+BR6czMTuk7lnTcPmDgx9wq9pUV9hzDIHvX1/NsQujccfTSwerXJ\nckmKzk6eul1ZyQG66mq+QOnpSXVrJ/SwKfSZM/n6dCovEdsuue5cchJQmDt1lDBxIguNXNyPceM4\nYy3M9/6rX2WRqUuWi3ZVuXUrTxCRmDEjXsk2NQEffBBuQidila5beRcvNoRe6CgqAqZOzc39WLIE\neOutcN/7CROA++7jjDxV8JPlotGgn9HWxipcYuZMzkmXaGzkLJeqKv47jJYLwIS+d2++SxGPK64A\ntm/P7TmPOAL4wQ/Mglo6oa4uNyS7dCmvnh
pmQgeA88/nH1WYPJljS9lAu6pMp9ArKrjBSYUexqAo\noKdCb24Gzjor9+e96abwd+ooob4+N/fjiCM4VmbufTyOO44TQ7JBYFVJRGcR0f8Q0QdEdLPX/ZwK\nfcYMi7wl5sxxt1zCpNAXLtSP0A0MgNwpdIBVuiH0eHzsY5oROhEVAfgxgDMBLARwGREd5fzewEDi\nvlu3JlouTqJuanLPctGZIFtbW+P+PvVUXkgq6nBedyEg7NecDaFne83XXw9cfHFWu2qBIO61doQO\nYAmATUKIbUKIYQCPAUhwmV54IXHHtrbUlguQXKGHidCrqoCrr85PWXKJsJNbNgj7NeeS0I85BvjM\nZ7LaVQsEca8bG1nsdnVlvm9QJsVMADtsf7eDST4Oy5dz6lJzM2d+/Pd/c1CusdH6jhuhOxV6GC0X\nAwNd8YlPcPqiQX5AxD76unXAZz+b2b55pcDly7nQHR3894UXAsuWWWu0APwE/+534/ebO9daRGrC\nBCZ1+WNgYOAPp5/OPwb5w6JFnCwwe3Zm+5GQrwVSCCL6BIDbhRBnxf6+BYAQQtxt+476ExsYGBgU\nAIQQrsunBUXoxQDeB/BZALsBvAHgMiHERuUnMzAwMDAAEJDlIoQYJaKvA1gBDrz+0pC5gYGBQbAI\nRKEbGBgYGOQeJqU/IBBRGxGtJaK3ieiN2LZaIlpBRO8T0QtEVJPvcvoBEf2SiDqIaJ1tW9JrJKJb\niWgTEW0koqX5KbV/JLnu24ionYjWxH7Osv0v9NdNRA1EtJKI3iOi9UR0Q2x7ZO+3yzV/I7Zd33st\nhDA/AfwA2AKg1rHtbgD/Evt8M4B/z3c5fV7jaQAWAViX7hoBNAN4G2zzzQHwIWIjxLD9JLnu2wDc\n5PLdo6Nw3QCmAVgU+1wJjpEdFeX7neKatb3XRqEHB0LiCOh8AA/HPj8M4IKclkgxhBCrABxwbE52\njV8A8JgQYkQI0QZgE1zmJoQBSa4b4HvuxPmIwHULIfYIId6Jfe4HsBFAAyJ8v5Nc88zYv7W814bQ\ng4MA8CIRvUlEX45tqxdCdADcWADU5a10waEuyTU6J5vthNU5ooKvE9E7RPSgzXqI3HUT0RzwCOV1\nJG/Tkbpu2zX/NbZJy3ttCD04nCqEOAHAOQCuJ6JPgknejkKISBfCNQLAAwDmCiEWAdgD4Ad5Lk8g\nIKJKAE8AuDGmWiPfpl2uWdt7bQg9IAghdsd+dwF4Ejz06iCiegAgomkAOvNXwsCQ7Bp3Aphl+15D\nbFskIIToEjEjFcAvYA21I3PdRFQCJrb/FEI8Fdsc6fvtds0632tD6AGAiMpjT3UQUQWApQDWA3ga\nwDWxr10N4CnXA4QLhHg/Mdk1Pg3gUiIaR0RNAI4ATzgLK+KuO0ZmEn8L4N3Y5yhd968AbBBC3Gfb\nFvX7nXDNWt/rfEeSo/gDoAnAO+CI93oAt8S2TwLwEjhavgLAxHyX1ed1PgpgF4DDALYD+HsAtcmu\nEcCt4Mj/RgBL811+xdf9CIB1sfv+JNhbjsx1AzgVwKitXa8BcFaqNh32605xzdreazOxyMDAwCAi\nMJaLgYGBQURgCN3AwMAgIjCEbmBgYBARGEI3MDAwiAgMoRsYGBhEBIbQDQwMDCICQ+gGBgYGEYEh\ndAMDA4OIwBC6QUGAiBpjLx34DRFtIKLlRFRGRN8mor8S0Toi+qnt+4tjLyhZQ0T3ENH62Pai2N9/\nja2295X8XZWBQTwMoRsUEhYA+LEQohlAH4B/BPAjIcRJQojjAJQT0edj3/0VgK8IXjFzFNYqgtcC\n6BZCnARelOmrRNSY06swMEgCQ+gGhYTtQojXY59/A+CTAD5DRK/HXif3aQALY+tbVwoh5MJKj9qO\nsRTAVUT0Nnht7EkAjsxN8Q0MUqMk3wUwMMgjBICfAPi4EGIXEd0GYELsf25vpJHbvyGEeDEXBTQw\nyARGoR
sUEmYT0Umxz5cD+HPs877YcscXAYAQogdALxEtjv3/UtsxXgDwtdg62SCiI4moLPiiGxik\nh1HoBoWE98Fvj3oIvIb1/we2TN4DsBvxa1d/GcCDRDQK4BUAPbHtD4JfALyGiAj8QodQvxvWIDow\ny+caFARigctnhRDHevx+hRDiYOzzzQCmCSH+KcgyGhj4hVHoBoWETNTL54noVnAfaYP1Vh4DA21h\nFLqBgYFBRGCCogYGBgYRgSF0AwMDg4jAELqBgYFBRGAI3cDAwCAiMIRuYGBgEBEYQjcwMDCICP5/\n/dsho7hEaiUAAAAASUVORK5CYII=\n",
      "text/plain": [
       "<matplotlib.figure.Figure at 0x1cf382b5470>"
      ]
     },
     "metadata": {}
    }
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "source": [
    "> `%matplotlib inline` tells Jupyter to show the plotted image directly in the notebook web page. It only needs to be called once, and isn't needed if you're not using notebooks.\n",
    "\n",
    "On some systems, this may take some time the first time. It is clear that pages at the start of a book have less words per page, after which the count is fairly steady except for occasional valleys.\n",
    "\n",
    "You may have some guesses for what these patterns mean; a look at the [scans](http://hdl.handle.net/2027/nyp.33433074811310) confirms that the large valleys are often illustration pages or blank pages, small valleys are chapter headings, and the upward pattern at the start is from front matter. \n",
    "\n",
    "Not all books will have the same patterns so we can't just codify these correlations for millions of books. However, looking at this plot makes clear an inportant assumption in text and data mining: that there are patterns underlying even the basic statistics derived from a text. The trick is to identify the real patterns and teach them to a computer."
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "source": [
    "### Understanding DataFrames\n",
    "\n",
    "Wait... how did we get here so quickly!? We went from a volume to a data visualization in two lines of code. The magic is in the data structure used to hold our table of data: a DataFrame.\n",
    "\n",
    "A **DataFrame** is a type of object provided by the data analysis library, Pandas. **Pandas** is very common for data analysis, making Python feasible for activities that in the past would have been much more appropriate in R or Matlab.\n",
    "\n",
    "In the first line, `vol.tokens_per_page()` returns a DataFrame, something that can be confirmed with `type(tokens)`. This means that _after setting `tokens`, we're no longer working with HTRC-specific code, just book data held in a common and very robust table-like construct from Pandas_. `tokens.head()` used a DataFrame method to look at the first few rows of the dataset, and `tokens.plot()` uses a method from Pandas to visualize data."
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "execution_count": 7,
   "cell_type": "code",
   "source": [
    "type(vol.tokens_per_page())"
   ],
   "outputs": [
    {
     "execution_count": 7,
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "pandas.core.frame.DataFrame"
      ]
     },
     "metadata": {}
    }
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "source": [
    "Many of the methods in the HTRC Feature Reader return DataFrames. The aim is to fit into the workflow of an experienced user, rather than requiring them to learn proprietary new formats. For new Python data mining users, learning to use the HTRC Feature Reader means learning many data mining skills that will translate to other uses."
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "source": [
    "## Loading a Token List\n",
    "\n",
    "The information contained in `vol.tokens_per_page()` is minimal, a sum of all words in the body of each page. \n",
    "\n",
    "The Extracted Features dataset also provides token counts with much more granularity: for every part of speech (e.g. noun, verb) of every occurring capitalization of every word of every section (i.e. header, footer, body) of every page of the volume. \n",
    "\n",
    "`tokens_per_page()` only kept the \"for every page\" grouping, to get section-,pos-, and word-specific details, you can use `vol.tokenlist()`:"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "execution_count": 8,
   "cell_type": "code",
   "source": [
    "tl = vol.tokenlist()\n",
    "# Let's look at some words deeper into the book:\n",
    "# from 1000th to 1100th row, skipping by 15 [1000:1100:10]\n",
    "tl[1000:1100:15]"
   ],
   "outputs": [
    {
     "execution_count": 8,
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "                           count\n",
       "page section token    pos       \n",
       "24   body    years    NNS      1\n",
       "25   body    For      IN       1\n",
       "             asked    VBD      1\n",
       "             emeralds NNS      1\n",
       "             him      PRP      2\n",
       "             live     VBP      1\n",
       "             n't      RB       1"
      ],
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th>count</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>page</th>\n",
       "      <th>section</th>\n",
       "      <th>token</th>\n",
       "      <th>pos</th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>24</th>\n",
       "      <th>body</th>\n",
       "      <th>years</th>\n",
       "      <th>NNS</th>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"6\" valign=\"top\">25</th>\n",
       "      <th rowspan=\"6\" valign=\"top\">body</th>\n",
       "      <th>For</th>\n",
       "      <th>IN</th>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>asked</th>\n",
       "      <th>VBD</th>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>emeralds</th>\n",
       "      <th>NNS</th>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>him</th>\n",
       "      <th>PRP</th>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>live</th>\n",
       "      <th>VBP</th>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>n't</th>\n",
       "      <th>RB</th>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ]
     },
     "metadata": {}
    }
   ],
   "metadata": {
    "scrolled": true,
    "collapsed": false
   }
  },
  {
   "source": [
    "As before, the data is returned as a Pandas DataFrame. This time, there is much more information. Consider a single row:\n",
    "\n",
    "<img src=\"https://github.com/programminghistorian/ph-submissions/raw/gh-pages/images/text-mining-with-extracted-features/single-row-tokencount.png\" width=\"300px\" />\n",
    "\n",
    "The columns in bold are an index. Unlike the typical one-dimensional index seen before, here there are four dimensions to the index: page, section, token, and pos. This row says that for the 24th page, in the body section (i.e. ignoring any words in the header or footer), the word 'years' occurs 1 time as an plural noun. The part-of-speech tag for a plural noun, `NNS`, follows the [Penn Treebank](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html) definition.\n",
    "\n",
 The \"words\" on the first page">
    "> The \"words\" on the first page seem to be OCR errors from the cover of the book. The HTRC Feature Reader treats a \"page\" as the $n^{th}$ scanned image of the volume, not the number printed on the page. This is why \"page 1\" for this example is the cover.\n",
    "\n",
    "Tokenlists can be retrieved with arguments that fold certain dimensions, such as `case`, `pos`, or `page`. You may also notice that, by default, only 'body' is returned, a default that can be overridden.\n",
    "\n",
    "Look at the following list of commands: can you guess what the output will look like? Try for yourself and observe how the output changes.\n",
    "\n",
    " - `vol.tokenlist(case=False)`\n",
    " - `vol.tokenlist(pos=False)`\n",
    " - `vol.tokenlist(pages=False, case=False, pos=False)`\n",
    " - `vol.tokenlist(section='header')`\n",
    " - `vol.tokenlist(section='group')`\n",
    "\n",
    "Details for what arguments are taken are in the [documentation](http://htrc.github.io/htrc-feature-reader/htrc_features/feature_reader.m.html#htrc_features.feature_reader.Volume.tokenlist) for the Feature Reader."
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "execution_count": 1,
   "cell_type": "code",
   "source": [
    "# Preparing where we were at the end of the last notebook\n",
    "from htrc_features import FeatureReader\n",
    "fr = FeatureReader(['data/sample-file1.basic.json.bz2', 'data/sample-file2.basic.json.bz2'])\n",
    "vol = fr.first()\n",
    "tokens = vol.tokens_per_page()\n",
    "tl = vol.tokenlist()"
   ],
   "outputs": [],
   "metadata": {
    "collapsed": true
   }
  },
  {
   "source": [
    "## Working with DataFrames\n",
    "\n",
    "The Pandas DataFrame type returned by the HTRC Feature Reader is very malleable. To work with the tokenlist that you retrieved earlier, three skills are particularily valuable:\n",
    "\n",
    "1. Selecting subsets by a condition\n",
    "2. Slicing by index\n",
    "3. Grouping and aggregating\n",
    "\n",
    "### Selecting Subsets of a DataFrame by a Condition\n",
    "\n",
    "Consider this example need: *I only want to look at tokens that occur more than ten times on a page.* \n",
    "Remembering that the table-like output from the HTRC Feature Reader is a Pandas DataFrame, the way to pursue this goal is to learn to filter and subset DataFrames. Knowing how to do so is important for working with just the data that you need.\n",
    "\n",
    "To subset individual rows of a DataFrame, you can provide a series of True/False values to the DataFrame, formatted in square brackets. Consider this fake example:\n",
    "\n",
    "```python\n",
    "to_keep = [True, False, False, ..., True]\n",
    "fake_dataframe[to_keep]\n",
    "```\n",
    "\n",
    "After receiving these boolean values, the DataFrame goes through every row and returns only the ones that match up to \"True\" in the given order. So, *the task of subsetting a DataFrame is a matter of figuring out the True/False values for which rows you want to keep.*\n",
    "\n",
    "Consider the example need in that context. To select just the tokens that occur more than 10 times on a page, we need to determine what rows match the criteria, i.e. *\"this token has a count which is greater than 10\"*. Let's try to convert that goal to code.\n",
    "\n",
    "First, \"this page has a count\" means that we are concerned specifically in the 'count' column, which can be singled out from our `tl` table with `tl['count']`. \"Greater than 10\" is formalized as `> 10` so try the following and see what you get:\n",
    "\n",
    "```python\n",
    "tl['count'] > 10\n",
    "```\n",
    "\n",
    "```\n",
    "page  section  token          pos\n",
    "1     body     0              CD     False\n",
    "...\n",
    "267   body     prince         NN     False\n",
    "               quite          RB     False\n",
    "               ran            VBD    False\n",
    "```\n",
    "\n",
    "It is a DataFrame of True/False values! Each value indicates whether the 'count' column in the row matches the criteria or not. We haven't selected a subset yet, we simply asked a question and were told for each row when the question was true or false.\n",
    "\n",
 You may wonder why page">
    "> You may wonder why page, section, token, and pos are still seen, even though 'count' was selected. This is because, as noted earlier, these are part of the DataFrame *index*, so they're part of the information about that row. You can convert the index to data columns with `reset_index()`. In this lesson we will keep the index intact, though there are circumstances where resetting it is beneficial.\n",
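 You may wonder why page">
    "\n",
    "For reference, here is a minimal sketch of what `reset_index()` does, using an invented two-row table with only two index levels rather than the real tokenlist:\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Hypothetical toy table: 'page' and 'token' form the index, 'count' is data\n",
    "toy = pd.DataFrame(\n",
    "    {'count': [1, 2]},\n",
    "    index=pd.MultiIndex.from_tuples([(24, 'years'), (25, 'him')], names=['page', 'token']),\n",
    ")\n",
    "flat = toy.reset_index()\n",
    "# 'page' and 'token' are now ordinary columns next to 'count'\n",
    "print(list(flat.columns))\n",
    "```\n",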
    "\n",
    "Armed with the True/False values of whether each value of 'count' is or isn't greater than 10, we can give those values to `tl` in square brackets."
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "execution_count": 2,
   "cell_type": "code",
   "source": [
    "matches = tl['count'] > 10\n",
    "tl[matches]"
   ],
   "outputs": [
    {
     "execution_count": 2,
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "                        count\n",
       "page section token pos       \n",
       "11   body    0     CD      35\n",
       "             \u00a9     NNP     13\n",
       "15   body    .     .       31\n",
       "20   body    ,     ,       19\n",
       "             the   DT      12\n",
       "22   body    ,     ,       23\n",
       "23   body    ,     ,       16\n",
       "24   body    ,     ,       22\n",
       "             I     PRP     11\n",
       "25   body    ,     ,       12\n",
       "26   body    ,     ,       20\n",
       "27   body    ,     ,       15\n",
       "28   body    ,     ,       17\n",
       "29   body    .     .       13\n",
       "30   body    the   DT      11\n",
       "31   body    ,     ,       13\n",
       "             a     DT      11\n",
       "32   body    ,     ,       15\n",
       "33   body    ,     ,       15\n",
       "34   body    ,     ,       16\n",
       "35   body    ,     ,       11\n",
       "36   body    ,     ,       13\n",
       "38   body    ,     ,       18\n",
       "             I     PRP     11\n",
       "39   body    ,     ,       11\n",
       "40   body    the   DT      12\n",
       "41   body    ,     ,       14\n",
       "42   body    ,     ,       12\n",
       "43   body    ,     ,       16\n",
       "44   body    ,     ,       18\n",
       "...                       ...\n",
       "240  body    ,     ,       14\n",
       "241  body    ,     ,       12\n",
       "242  body    I     PRP     11\n",
       "243  body    \"     ''      11\n",
       "             ,     ,       21\n",
       "244  body    ,     ,       20\n",
       "245  body    ,     ,       14\n",
       "246  body    the   DT      12\n",
       "247  body    ,     ,       11\n",
       "             the   DT      12\n",
       "248  body    ,     ,       13\n",
       "             the   DT      13\n",
       "249  body    ,     ,       16\n",
       "             the   DT      11\n",
       "250  body    ,     ,       11\n",
       "             the   DT      11\n",
       "253  body    ,     ,       11\n",
       "254  body    ,     ,       20\n",
       "255  body    !     .       12\n",
       "             ,     ,       16\n",
       "256  body    ,     ,       16\n",
       "257  body    ,     ,       12\n",
       "258  body    ,     ,       16\n",
       "259  body    ,     ,       17\n",
       "260  body    ,     ,       16\n",
       "261  body    ,     ,       14\n",
       "262  body    ,     ,       18\n",
       "263  body    ,     ,       20\n",
       "264  body    ,     ,       16\n",
       "266  body    ,     ,       14\n",
       "\n",
       "[258 rows x 1 columns]"
      ],
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th>count</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>page</th>\n",
       "      <th>section</th>\n",
       "      <th>token</th>\n",
       "      <th>pos</th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th rowspan=\"2\" valign=\"top\">11</th>\n",
       "      <th rowspan=\"2\" valign=\"top\">body</th>\n",
       "      <th>0</th>\n",
       "      <th>CD</th>\n",
       "      <td>35</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>\u00a9</th>\n",
       "      <th>NNP</th>\n",
       "      <td>13</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>15</th>\n",
       "      <th>body</th>\n",
       "      <th>.</th>\n",
       "      <th>.</th>\n",
       "      <td>31</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"2\" valign=\"top\">20</th>\n",
       "      <th rowspan=\"2\" valign=\"top\">body</th>\n",
       "      <th>,</th>\n",
       "      <th>,</th>\n",
       "      <td>19</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>the</th>\n",
       "      <th>DT</th>\n",
       "      <td>12</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>22</th>\n",
       "      <th>body</th>\n",
       "      <th>,</th>\n",
       "      <th>,</th>\n",
       "      <td>23</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>23</th>\n",
       "      <th>body</th>\n",
       "      <th>,</th>\n",
       "      <th>,</th>\n",
       "      <td>16</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"2\" valign=\"top\">24</th>\n",
       "      <th rowspan=\"2\" valign=\"top\">body</th>\n",
       "      <th>,</th>\n",
       "      <th>,</th>\n",
       "      <td>22</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>I</th>\n",
       "      <th>PRP</th>\n",
       "      <td>11</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25</th>\n",
       "      <th>body</th>\n",
       "      <th>,</th>\n",
       "      <th>,</th>\n",
       "      <td>12</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>26</th>\n",
       "      <th>body</th>\n",
       "      <th>,</th>\n",
       "      <th>,</th>\n",
       "      <td>20</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>27</th>\n",
       "      <th>body</th>\n",
       "      <th>,</th>\n",
       "      <th>,</th>\n",
       "      <td>15</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>28</th>\n",
       "      <th>body</th>\n",
       "      <th>,</th>\n",
       "      <th>,</th>\n",
       "      <td>17</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>29</th>\n",
       "      <th>body</th>\n",
       "      <th>.</th>\n",
       "      <th>.</th>\n",
       "      <td>13</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>30</th>\n",
       "      <th>body</th>\n",
       "      <th>the</th>\n",
       "      <th>DT</th>\n",
       "      <td>11</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"2\" valign=\"top\">31</th>\n",
       "      <th rowspan=\"2\" valign=\"top\">body</th>\n",
       "      <th>,</th>\n",
       "      <th>,</th>\n",
       "      <td>13</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>a</th>\n",
       "      <th>DT</th>\n",
       "      <td>11</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>32</th>\n",
       "      <th>body</th>\n",
       "      <th>,</th>\n",
       "      <th>,</th>\n",
       "      <td>15</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>33</th>\n",
       "      <th>body</th>\n",
       "      <th>,</th>\n",
       "      <th>,</th>\n",
       "      <td>15</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>34</th>\n",
       "      <th>body</th>\n",
       "      <th>,</th>\n",
       "      <th>,</th>\n",
       "      <td>16</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>35</th>\n",
       "      <th>body</th>\n",
       "      <th>,</th>\n",
       "      <th>,</th>\n",
       "      <td>11</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>36</th>\n",
       "      <th>body</th>\n",
       "      <th>,</th>\n",
       "      <th>,</th>\n",
       "      <td>13</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"2\" valign=\"top\">38</th>\n",
       "      <th rowspan=\"2\" valign=\"top\">body</th>\n",
       "      <th>,</th>\n",
       "      <th>,</th>\n",
       "      <td>18</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>I</th>\n",
       "      <th>PRP</th>\n",
       "      <td>11</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>39</th>\n",
       "      <th>body</th>\n",
       "      <th>,</th>\n",
       "      <th>,</th>\n",
       "      <td>11</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>40</th>\n",
       "      <th>body</th>\n",
       "      <th>the</th>\n",
       "      <th>DT</th>\n",
       "      <td>12</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>41</th>\n",
       "      <th>body</th>\n",
       "      <th>,</th>\n",
       "      <th>,</th>\n",
       "      <td>14</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>42</th>\n",
       "      <th>body</th>\n",
       "      <th>,</th>\n",
       "      <th>,</th>\n",
       "      <td>12</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>43</th>\n",
       "      <th>body</th>\n",
       "      <th>,</th>\n",
       "      <th>,</th>\n",
       "      <td>16</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>44</th>\n",
       "      <th>body</th>\n",
       "      <th>,</th>\n",
       "      <th>,</th>\n",
       "      <td>18</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <th>...</th>\n",
       "      <th>...</th>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>240</th>\n",
       "      <th>body</th>\n",
       "      <th>,</th>\n",
       "      <th>,</th>\n",
       "      <td>14</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>241</th>\n",
       "      <th>body</th>\n",
       "      <th>,</th>\n",
       "      <th>,</th>\n",
       "      <td>12</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>242</th>\n",
       "      <th>body</th>\n",
       "      <th>I</th>\n",
       "      <th>PRP</th>\n",
       "      <td>11</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"2\" valign=\"top\">243</th>\n",
       "      <th rowspan=\"2\" valign=\"top\">body</th>\n",
       "      <th>\"</th>\n",
       "      <th>''</th>\n",
       "      <td>11</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>,</th>\n",
       "      <th>,</th>\n",
       "      <td>21</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>244</th>\n",
       "      <th>body</th>\n",
       "      <th>,</th>\n",
       "      <th>,</th>\n",
       "      <td>20</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>245</th>\n",
       "      <th>body</th>\n",
       "      <th>,</th>\n",
       "      <th>,</th>\n",
       "      <td>14</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>246</th>\n",
       "      <th>body</th>\n",
       "      <th>the</th>\n",
       "      <th>DT</th>\n",
       "      <td>12</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"2\" valign=\"top\">247</th>\n",
       "      <th rowspan=\"2\" valign=\"top\">body</th>\n",
       "      <th>,</th>\n",
       "      <th>,</th>\n",
       "      <td>11</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>the</th>\n",
       "      <th>DT</th>\n",
       "      <td>12</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"2\" valign=\"top\">248</th>\n",
       "      <th rowspan=\"2\" valign=\"top\">body</th>\n",
       "      <th>,</th>\n",
       "      <th>,</th>\n",
       "      <td>13</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>the</th>\n",
       "      <th>DT</th>\n",
       "      <td>13</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"2\" valign=\"top\">249</th>\n",
       "      <th rowspan=\"2\" valign=\"top\">body</th>\n",
       "      <th>,</th>\n",
       "      <th>,</th>\n",
       "      <td>16</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>the</th>\n",
       "      <th>DT</th>\n",
       "      <td>11</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"2\" valign=\"top\">250</th>\n",
       "      <th rowspan=\"2\" valign=\"top\">body</th>\n",
       "      <th>,</th>\n",
       "      <th>,</th>\n",
       "      <td>11</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>the</th>\n",
       "      <th>DT</th>\n",
       "      <td>11</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>253</th>\n",
       "      <th>body</th>\n",
       "      <th>,</th>\n",
       "      <th>,</th>\n",
       "      <td>11</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>254</th>\n",
       "      <th>body</th>\n",
       "      <th>,</th>\n",
       "      <th>,</th>\n",
       "      <td>20</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"2\" valign=\"top\">255</th>\n",
       "      <th rowspan=\"2\" valign=\"top\">body</th>\n",
       "      <th>!</th>\n",
       "      <th>.</th>\n",
       "      <td>12</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>,</th>\n",
       "      <th>,</th>\n",
       "      <td>16</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>256</th>\n",
       "      <th>body</th>\n",
       "      <th>,</th>\n",
       "      <th>,</th>\n",
       "      <td>16</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>257</th>\n",
       "      <th>body</th>\n",
       "      <th>,</th>\n",
       "      <th>,</th>\n",
       "      <td>12</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>258</th>\n",
       "      <th>body</th>\n",
       "      <th>,</th>\n",
       "      <th>,</th>\n",
       "      <td>16</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>259</th>\n",
       "      <th>body</th>\n",
       "      <th>,</th>\n",
       "      <th>,</th>\n",
       "      <td>17</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>260</th>\n",
       "      <th>body</th>\n",
       "      <th>,</th>\n",
       "      <th>,</th>\n",
       "      <td>16</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>261</th>\n",
       "      <th>body</th>\n",
       "      <th>,</th>\n",
       "      <th>,</th>\n",
       "      <td>14</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>262</th>\n",
       "      <th>body</th>\n",
       "      <th>,</th>\n",
       "      <th>,</th>\n",
       "      <td>18</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>263</th>\n",
       "      <th>body</th>\n",
       "      <th>,</th>\n",
       "      <th>,</th>\n",
       "      <td>20</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>264</th>\n",
       "      <th>body</th>\n",
       "      <th>,</th>\n",
       "      <th>,</th>\n",
       "      <td>16</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>266</th>\n",
       "      <th>body</th>\n",
       "      <th>,</th>\n",
       "      <th>,</th>\n",
       "      <td>14</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>258 rows \u00d7 1 columns</p>\n",
       "</div>"
      ]
     },
     "metadata": {}
    }
   ],
   "metadata": {
    "scrolled": true,
    "collapsed": false
   }
  },
  {
   "source": [
    "You can move the comparison straight into the square brackets, the more conventional equivalent of the above:"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "execution_count": 3,
   "cell_type": "code",
   "source": [
    "tl[tl['count'] > 10]"
   ],
   "outputs": [
    {
     "execution_count": 3,
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "                        count\n",
       "page section token pos       \n",
       "11   body    0     CD      35\n",
       "             \u00a9     NNP     13\n",
       "15   body    .     .       31\n",
       "20   body    ,     ,       19\n",
       "             the   DT      12\n",
       "22   body    ,     ,       23\n",
       "23   body    ,     ,       16\n",
       "24   body    ,     ,       22\n",
       "             I     PRP     11\n",
       "25   body    ,     ,       12\n",
       "26   body    ,     ,       20\n",
       "27   body    ,     ,       15\n",
       "28   body    ,     ,       17\n",
       "29   body    .     .       13\n",
       "30   body    the   DT      11\n",
       "31   body    ,     ,       13\n",
       "             a     DT      11\n",
       "32   body    ,     ,       15\n",
       "33   body    ,     ,       15\n",
       "34   body    ,     ,       16\n",
       "35   body    ,     ,       11\n",
       "36   body    ,     ,       13\n",
       "38   body    ,     ,       18\n",
       "             I     PRP     11\n",
       "39   body    ,     ,       11\n",
       "40   body    the   DT      12\n",
       "41   body    ,     ,       14\n",
       "42   body    ,     ,       12\n",
       "43   body    ,     ,       16\n",
       "44   body    ,     ,       18\n",
       "...                       ...\n",
       "240  body    ,     ,       14\n",
       "241  body    ,     ,       12\n",
       "242  body    I     PRP     11\n",
       "243  body    \"     ''      11\n",
       "             ,     ,       21\n",
       "244  body    ,     ,       20\n",
       "245  body    ,     ,       14\n",
       "246  body    the   DT      12\n",
       "247  body    ,     ,       11\n",
       "             the   DT      12\n",
       "248  body    ,     ,       13\n",
       "             the   DT      13\n",
       "249  body    ,     ,       16\n",
       "             the   DT      11\n",
       "250  body    ,     ,       11\n",
       "             the   DT      11\n",
       "253  body    ,     ,       11\n",
       "254  body    ,     ,       20\n",
       "255  body    !     .       12\n",
       "             ,     ,       16\n",
       "256  body    ,     ,       16\n",
       "257  body    ,     ,       12\n",
       "258  body    ,     ,       16\n",
       "259  body    ,     ,       17\n",
       "260  body    ,     ,       16\n",
       "261  body    ,     ,       14\n",
       "262  body    ,     ,       18\n",
       "263  body    ,     ,       20\n",
       "264  body    ,     ,       16\n",
       "266  body    ,     ,       14\n",
       "\n",
       "[258 rows x 1 columns]"
      ],
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th>count</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>page</th>\n",
       "      <th>section</th>\n",
       "      <th>token</th>\n",
       "      <th>pos</th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th rowspan=\"2\" valign=\"top\">11</th>\n",
       "      <th rowspan=\"2\" valign=\"top\">body</th>\n",
       "      <th>0</th>\n",
       "      <th>CD</th>\n",
       "      <td>35</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>\u00a9</th>\n",
       "      <th>NNP</th>\n",
       "      <td>13</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>15</th>\n",
       "      <th>body</th>\n",
       "      <th>.</th>\n",
       "      <th>.</th>\n",
       "      <td>31</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"2\" valign=\"top\">20</th>\n",
       "      <th rowspan=\"2\" valign=\"top\">body</th>\n",
       "      <th>,</th>\n",
       "      <th>,</th>\n",
       "      <td>19</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>the</th>\n",
       "      <th>DT</th>\n",
       "      <td>12</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>22</th>\n",
       "      <th>body</th>\n",
       "      <th>,</th>\n",
       "      <th>,</th>\n",
       "      <td>23</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>23</th>\n",
       "      <th>body</th>\n",
       "      <th>,</th>\n",
       "      <th>,</th>\n",
       "      <td>16</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"2\" valign=\"top\">24</th>\n",
       "      <th rowspan=\"2\" valign=\"top\">body</th>\n",
       "      <th>,</th>\n",
       "      <th>,</th>\n",
       "      <td>22</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>I</th>\n",
       "      <th>PRP</th>\n",
       "      <td>11</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25</th>\n",
       "      <th>body</th>\n",
       "      <th>,</th>\n",
       "      <th>,</th>\n",
       "      <td>12</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>26</th>\n",
       "      <th>body</th>\n",
       "      <th>,</th>\n",
       "      <th>,</th>\n",
       "      <td>20</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>27</th>\n",
       "      <th>body</th>\n",
       "      <th>,</th>\n",
       "      <th>,</th>\n",
       "      <td>15</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>28</th>\n",
       "      <th>body</th>\n",
       "      <th>,</th>\n",
       "      <th>,</th>\n",
       "      <td>17</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>29</th>\n",
       "      <th>body</th>\n",
       "      <th>.</th>\n",
       "      <th>.</th>\n",
       "      <td>13</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>30</th>\n",
       "      <th>body</th>\n",
       "      <th>the</th>\n",
       "      <th>DT</th>\n",
       "      <td>11</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"2\" valign=\"top\">31</th>\n",
       "      <th rowspan=\"2\" valign=\"top\">body</th>\n",
       "      <th>,</th>\n",
       "      <th>,</th>\n",
       "      <td>13</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>a</th>\n",
       "      <th>DT</th>\n",
       "      <td>11</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>32</th>\n",
       "      <th>body</th>\n",
       "      <th>,</th>\n",
       "      <th>,</th>\n",
       "      <td>15</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>33</th>\n",
       "      <th>body</th>\n",
       "      <th>,</th>\n",
       "      <th>,</th>\n",
       "      <td>15</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>34</th>\n",
       "      <th>body</th>\n",
       "      <th>,</th>\n",
       "      <th>,</th>\n",
       "      <td>16</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>35</th>\n",
       "      <th>body</th>\n",
       "      <th>,</th>\n",
       "      <th>,</th>\n",
       "      <td>11</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>36</th>\n",
       "      <th>body</th>\n",
       "      <th>,</th>\n",
       "      <th>,</th>\n",
       "      <td>13</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"2\" valign=\"top\">38</th>\n",
       "      <th rowspan=\"2\" valign=\"top\">body</th>\n",
       "      <th>,</th>\n",
       "      <th>,</th>\n",
       "      <td>18</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>I</th>\n",
       "      <th>PRP</th>\n",
       "      <td>11</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>39</th>\n",
       "      <th>body</th>\n",
       "      <th>,</th>\n",
       "      <th>,</th>\n",
       "      <td>11</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>40</th>\n",
       "      <th>body</th>\n",
       "      <th>the</th>\n",
       "      <th>DT</th>\n",
       "      <td>12</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>41</th>\n",
       "      <th>body</th>\n",
       "      <th>,</th>\n",
       "      <th>,</th>\n",
       "      <td>14</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>42</th>\n",
       "      <th>body</th>\n",
       "      <th>,</th>\n",
       "      <th>,</th>\n",
       "      <td>12</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>43</th>\n",
       "      <th>body</th>\n",
       "      <th>,</th>\n",
       "      <th>,</th>\n",
       "      <td>16</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>44</th>\n",
       "      <th>body</th>\n",
       "      <th>,</th>\n",
       "      <th>,</th>\n",
       "      <td>18</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <th>...</th>\n",
       "      <th>...</th>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>240</th>\n",
       "      <th>body</th>\n",
       "      <th>,</th>\n",
       "      <th>,</th>\n",
       "      <td>14</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>241</th>\n",
       "      <th>body</th>\n",
       "      <th>,</th>\n",
       "      <th>,</th>\n",
       "      <td>12</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>242</th>\n",
       "      <th>body</th>\n",
       "      <th>I</th>\n",
       "      <th>PRP</th>\n",
       "      <td>11</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"2\" valign=\"top\">243</th>\n",
       "      <th rowspan=\"2\" valign=\"top\">body</th>\n",
       "      <th>\"</th>\n",
       "      <th>''</th>\n",
       "      <td>11</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>,</th>\n",
       "      <th>,</th>\n",
       "      <td>21</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>244</th>\n",
       "      <th>body</th>\n",
       "      <th>,</th>\n",
       "      <th>,</th>\n",
       "      <td>20</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>245</th>\n",
       "      <th>body</th>\n",
       "      <th>,</th>\n",
       "      <th>,</th>\n",
       "      <td>14</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>246</th>\n",
       "      <th>body</th>\n",
       "      <th>the</th>\n",
       "      <th>DT</th>\n",
       "      <td>12</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"2\" valign=\"top\">247</th>\n",
       "      <th rowspan=\"2\" valign=\"top\">body</th>\n",
       "      <th>,</th>\n",
       "      <th>,</th>\n",
       "      <td>11</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>the</th>\n",
       "      <th>DT</th>\n",
       "      <td>12</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"2\" valign=\"top\">248</th>\n",
       "      <th rowspan=\"2\" valign=\"top\">body</th>\n",
       "      <th>,</th>\n",
       "      <th>,</th>\n",
       "      <td>13</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>the</th>\n",
       "      <th>DT</th>\n",
       "      <td>13</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"2\" valign=\"top\">249</th>\n",
       "      <th rowspan=\"2\" valign=\"top\">body</th>\n",
       "      <th>,</th>\n",
       "      <th>,</th>\n",
       "      <td>16</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>the</th>\n",
       "      <th>DT</th>\n",
       "      <td>11</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"2\" valign=\"top\">250</th>\n",
       "      <th rowspan=\"2\" valign=\"top\">body</th>\n",
       "      <th>,</th>\n",
       "      <th>,</th>\n",
       "      <td>11</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>the</th>\n",
       "      <th>DT</th>\n",
       "      <td>11</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>253</th>\n",
       "      <th>body</th>\n",
       "      <th>,</th>\n",
       "      <th>,</th>\n",
       "      <td>11</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>254</th>\n",
       "      <th>body</th>\n",
       "      <th>,</th>\n",
       "      <th>,</th>\n",
       "      <td>20</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"2\" valign=\"top\">255</th>\n",
       "      <th rowspan=\"2\" valign=\"top\">body</th>\n",
       "      <th>!</th>\n",
       "      <th>.</th>\n",
       "      <td>12</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>,</th>\n",
       "      <th>,</th>\n",
       "      <td>16</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>256</th>\n",
       "      <th>body</th>\n",
       "      <th>,</th>\n",
       "      <th>,</th>\n",
       "      <td>16</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>257</th>\n",
       "      <th>body</th>\n",
       "      <th>,</th>\n",
       "      <th>,</th>\n",
       "      <td>12</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>258</th>\n",
       "      <th>body</th>\n",
       "      <th>,</th>\n",
       "      <th>,</th>\n",
       "      <td>16</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>259</th>\n",
       "      <th>body</th>\n",
       "      <th>,</th>\n",
       "      <th>,</th>\n",
       "      <td>17</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>260</th>\n",
       "      <th>body</th>\n",
       "      <th>,</th>\n",
       "      <th>,</th>\n",
       "      <td>16</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>261</th>\n",
       "      <th>body</th>\n",
       "      <th>,</th>\n",
       "      <th>,</th>\n",
       "      <td>14</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>262</th>\n",
       "      <th>body</th>\n",
       "      <th>,</th>\n",
       "      <th>,</th>\n",
       "      <td>18</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>263</th>\n",
       "      <th>body</th>\n",
       "      <th>,</th>\n",
       "      <th>,</th>\n",
       "      <td>20</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>264</th>\n",
       "      <th>body</th>\n",
       "      <th>,</th>\n",
       "      <th>,</th>\n",
       "      <td>16</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>266</th>\n",
       "      <th>body</th>\n",
       "      <th>,</th>\n",
       "      <th>,</th>\n",
       "      <td>14</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>258 rows \u00d7 1 columns</p>\n",
       "</div>"
      ]
     },
     "metadata": {}
    }
   ],
   "metadata": {
    "scrolled": true,
    "collapsed": false
   }
  },
  {
   "source": [
    "As might be expected, the tokens that occur very often on a single page are \"the\", \"a\", and various punctuation. The 'pos' column shows what part-of-speech the word is used in accordding to the [Penn Treebank tags](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html): `DT` is a determiner, `PRP` is a personal pronoun, etc. \n",
    "\n",
    "Multiple conditions can be chained with `&` (and) or `|` (or), using regular brackets so that Python known the order of operations. For example, words with a count greater than 3 *and* a count less than 7 are selected in this way:"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "execution_count": 4,
   "cell_type": "code",
   "source": [
    "tl[(tl['count'] > 3) & (tl['count'] < 7)].head()"
   ],
   "outputs": [
    {
     "execution_count": 4,
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "                        count\n",
       "page section token pos       \n",
       "9    body    .     .        4\n",
       "11   body    \u00a9     IN       6\n",
       "                   NNS      4\n",
       "12   body    ,     ,        6\n",
       "17   body    \"     ''       5"
      ],
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th>count</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>page</th>\n",
       "      <th>section</th>\n",
       "      <th>token</th>\n",
       "      <th>pos</th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <th>body</th>\n",
       "      <th>.</th>\n",
       "      <th>.</th>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"2\" valign=\"top\">11</th>\n",
       "      <th rowspan=\"2\" valign=\"top\">body</th>\n",
       "      <th rowspan=\"2\" valign=\"top\">\u00a9</th>\n",
       "      <th>IN</th>\n",
       "      <td>6</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>NNS</th>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <th>body</th>\n",
       "      <th>,</th>\n",
       "      <th>,</th>\n",
       "      <td>6</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17</th>\n",
       "      <th>body</th>\n",
       "      <th>\"</th>\n",
       "      <th>''</th>\n",
       "      <td>5</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ]
     },
     "metadata": {}
    }
   ],
   "metadata": {
    "collapsed": false
   }
  },
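  {
   "source": [
    "To see why each condition needs its own parentheses, it can help to build the masks one at a time. The following is a minimal sketch on an invented table: the `toy` DataFrame and its values are made up for illustration and are not part of the Extracted Features data. In Python, `&` and `|` bind more tightly than comparisons such as `>`, so without the parentheses the expression would not be read as two separate conditions.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Toy stand-in for a token list; the values are invented for illustration\n",
    "toy = pd.DataFrame({\n",
    "    'token': ['the', 'a', 'Anne', 'said', ','],\n",
    "    'count': [12, 5, 4, 2, 9],\n",
    "})\n",
    "\n",
    "# Each comparison yields a boolean Series with one True/False per row\n",
    "more_than_3 = toy['count'] > 3\n",
    "less_than_7 = toy['count'] < 7\n",
    "\n",
    "# & keeps rows where both masks are True\n",
    "both = toy[more_than_3 & less_than_7]\n",
    "\n",
    "# | keeps rows where at least one mask is True\n",
    "either = toy[(toy['count'] < 3) | (toy['count'] > 10)]\n",
    "\n",
    "print(both)\n",
    "print(either)\n",
    "```\n",
    "\n",
    "Here `both` keeps only the rows for 'a' and 'Anne', while `either` keeps the rows for 'the' and 'said'."
   ],
   "cell_type": "markdown",
   "metadata": {}
  },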
  {
   "execution_count": 5,
   "cell_type": "code",
   "source": [
    "tl.index.names"
   ],
   "outputs": [
    {
     "execution_count": 5,
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "FrozenList(['page', 'section', 'token', 'pos'])"
      ]
     },
     "metadata": {}
    }
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "source": [
    "### Slicing DataFrames\n",
    "\n",
    "Above, subsets of the DataFrame were selected based on matching criteria for columns. It is also possible to select a DataFrame subset by specifying the values of its index, a process called **slicing**. For example, you can ask, *\"give me all the verbs for pages 9-12\"*.\n",
    "\n",
    "In the DataFrame returned by `vol.tokenlist()`, page, section, token, and POS are part of the index (try the command `tl.index.names` to confirm). One can think of an index as the margin content of an Excel spreadsheet: the letters along the top and numbers along the left side are the indices. A cell can be referred to as A1, A2, B1... In Pandas, however, you can name these, so instead of 1, 2, 3, rows can be referred to by more descriptive names. You can also have multiple levels, so you're not bound by the two dimensions of a table format. With a multiindexed DataFrame, you can ask for `Page=24,section=Body, ...`.\n",
    "\n",
    "<img src=\"https://github.com/programminghistorian/ph-submissions/raw/gh-pages/images/text-mining-with-extracted-features/Excel.PNG\" width=\"300px\" />\n",
    "*One can think of an index as the margin notations in Excel (i.e. 1,2,3... and A,B,C,..), except it can be named and can have multiple levels.*\n",
    "    \n",
    "Slicing a DataFrame against a labelled index is done using `DataFrame.loc[]`. Try the following examples and see what is returned:\n",
    "\n",
    "- Select information from page 17: \n",
    "  - `tl.loc[(17),]`\n",
    "- Select 'body' section of page 17:\n",
    "  - `tl.loc[(17, 'body'),]`\n",
    "- Select counts of the word 'Anne' in the 'body' section of page 17:\n",
    "  - `tl.loc[(17, 'body', 'Anne'),]`\n",
    "\n",
    "The index labels are specified in a tuple, in order of index level: i.e. (1st_level_label, 2nd_level_label, 3rd_level_label). To skip specifying a label for a level -- that is, to select everything for that level -- `slice(None)` can be used as a placeholder:\n",
    "\n",
    "- Select counts of the word 'Anne' for all pages and all page sections\n",
    "  - `tl.loc[(slice(None), slice(None), \"Anne\"),]`\n",
    "  \n",
    "Finally, it is possible to select multiple labels per level of the multiindex, with a list of labels (e.g. `['label1', 'label2']`) or a sequence defined by a slice (e.g. `slice(start, end)`):\n",
    "\n",
    "- Select pages 37, 38, and 52\n",
    "  - `tl.loc[([37, 38, 52]),]`\n",
    "- Select all pages from 37 to 40\n",
    "  - `tl.loc[(slice(37, 40)),]`\n",
    "  \n",
    "> The comma in `tl.loc[(...),]` is there because columns can be selected in the same way after the comma. Pandas DataFrames can have a multiple-level index for columns, but the HTRC Feature Reader does not use this."
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
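  {
   "source": [
    "The slicing patterns above can be tried on a small invented table before running them against a full volume. This is a sketch under stated assumptions: `toy` is a made-up DataFrame with the same four index level names as the token list, and its rows are not real data. The column selection is written explicitly as `:` after the comma, and the index is sorted first, since slicing a MultiIndex by label generally expects a sorted index.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Invented rows: (page, section, token, pos, count)\n",
    "rows = [\n",
    "    (17, 'body', 'Anne', 'NNP', 2),\n",
    "    (17, 'body', 'said', 'VBD', 1),\n",
    "    (17, 'header', 'CHAPTER', 'NNP', 1),\n",
    "    (18, 'body', 'Anne', 'NNP', 3),\n",
    "    (19, 'body', 'said', 'VBD', 4),\n",
    "]\n",
    "columns = ['page', 'section', 'token', 'pos', 'count']\n",
    "toy = (pd.DataFrame(rows, columns=columns)\n",
    "         .set_index(['page', 'section', 'token', 'pos'])\n",
    "         .sort_index())\n",
    "\n",
    "# Labels for the first two levels: the 'body' section of page 17\n",
    "page_17_body = toy.loc[(17, 'body'), :]\n",
    "\n",
    "# slice(None) as a placeholder: 'Anne' on any page, in any section\n",
    "anne = toy.loc[(slice(None), slice(None), 'Anne'), :]\n",
    "\n",
    "# A list of labels for the first level: pages 17 and 19\n",
    "pages_17_19 = toy.loc[([17, 19], slice(None)), :]\n",
    "\n",
    "# A label slice for the first level: pages 17 to 18, both ends included\n",
    "pages_17_to_18 = toy.loc[(slice(17, 18), slice(None)), :]\n",
    "\n",
    "for result in (page_17_body, anne, pages_17_19, pages_17_to_18):\n",
    "    print(result, end='\\n\\n')\n",
    "```\n",
    "\n",
    "Printing each result is a quick way to check which rows were kept and which index levels remain."
   ],
   "cell_type": "markdown",
   "metadata": {}
  },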
  {
   "source": [
    "Knowing how to slice, let's try to find the word \"CHAPTER\" in this work and compare its locations to the earlier counts of tokens per page.\n",
    "\n",
    "The token list we previously set to `tl` only included body text. To include headers and footers in our search for `CHAPTER`, we'll grab a new token list with `section='all'` specified."
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "execution_count": 6,
   "cell_type": "code",
   "source": [
    "tl2 = vol.tokenlist(section='all')\n",
    "chapter_pages = tl2.loc[(slice(None), slice(None), \"CHAPTER\"),]\n",
    "chapter_pages"
   ],
   "outputs": [
    {
     "execution_count": 6,
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "                          count\n",
       "page section token   pos       \n",
       "19   header  CHAPTER NNP      1\n",
       "35   header  CHAPTER NNP      1\n",
       "56   header  CHAPTER NNP      1\n",
       "73   header  CHAPTER NNP      1\n",
       "91   header  CHAPTER NNP      1\n",
       "115  header  CHAPTER NNP      1\n",
       "141  header  CHAPTER NNP      1\n",
       "158  header  CHAPTER NNP      1\n",
       "174  header  CHAPTER NNP      1\n",
       "193  header  CHAPTER NNP      1\n",
       "217  body    CHAPTER NNP      1\n",
       "231  header  CHAPTER NNP      1\n",
       "246  header  CHAPTER NNP      1"
      ],
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th>count</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>page</th>\n",
       "      <th>section</th>\n",
       "      <th>token</th>\n",
       "      <th>pos</th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>19</th>\n",
       "      <th>header</th>\n",
       "      <th>CHAPTER</th>\n",
       "      <th>NNP</th>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>35</th>\n",
       "      <th>header</th>\n",
       "      <th>CHAPTER</th>\n",
       "      <th>NNP</th>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>56</th>\n",
       "      <th>header</th>\n",
       "      <th>CHAPTER</th>\n",
       "      <th>NNP</th>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>73</th>\n",
       "      <th>header</th>\n",
       "      <th>CHAPTER</th>\n",
       "      <th>NNP</th>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>91</th>\n",
       "      <th>header</th>\n",
       "      <th>CHAPTER</th>\n",
       "      <th>NNP</th>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>115</th>\n",
       "      <th>header</th>\n",
       "      <th>CHAPTER</th>\n",
       "      <th>NNP</th>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>141</th>\n",
       "      <th>header</th>\n",
       "      <th>CHAPTER</th>\n",
       "      <th>NNP</th>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>158</th>\n",
       "      <th>header</th>\n",
       "      <th>CHAPTER</th>\n",
       "      <th>NNP</th>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>174</th>\n",
       "      <th>header</th>\n",
       "      <th>CHAPTER</th>\n",
       "      <th>NNP</th>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>193</th>\n",
       "      <th>header</th>\n",
       "      <th>CHAPTER</th>\n",
       "      <th>NNP</th>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>217</th>\n",
       "      <th>body</th>\n",
       "      <th>CHAPTER</th>\n",
       "      <th>NNP</th>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>231</th>\n",
       "      <th>header</th>\n",
       "      <th>CHAPTER</th>\n",
       "      <th>NNP</th>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>246</th>\n",
       "      <th>header</th>\n",
       "      <th>CHAPTER</th>\n",
       "      <th>NNP</th>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ]
     },
     "metadata": {}
    }
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "source": [
    "Earlier, token counts were visualized using `tokens.plot()`, a built-in function of DataFrames that uses the Matplotlib visualization library. \n",
    "\n",
    "We can add to the earlier visualization by using Matplotlib directly. Without dwelling too much on the specifics, try the following code which simply goes through every page number in the earlier search for 'CHAPTER' and adds a red vertical line at the place in the chart with `matplotlib.pyplot.axvline()`:"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "execution_count": 7,
   "cell_type": "code",
   "source": [
    "# Get just the page numbers from the search for \"CHAPTER\"\n",
    "page_numbers = chapter_pages.index.get_level_values('page')\n",
    "\n",
    "# Visualize the tokens-per-page from before\n",
    "tokens.plot()\n",
    "\n",
    "# Add vertical lines for pages with \"CHAPTER\"\n",
    "import matplotlib.pyplot as plt\n",
    "for page_number in page_numbers:\n",
    "    plt.axvline(x=page_number, color='red')"
   ],
   "outputs": [],
   "metadata": {
    "scrolled": true,
    "collapsed": false
   }
  },
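  {
   "source": [
    "The `get_level_values()` call used in the cell above can also be tried in isolation. The sketch below builds a small invented search result (`toy_hits`, a hypothetical stand-in for `chapter_pages`) and pulls out one index level, first by name and then by position.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Invented rows, indexed with the same level names as chapter_pages\n",
    "idx = pd.MultiIndex.from_tuples(\n",
    "    [(19, 'header', 'CHAPTER', 'NNP'),\n",
    "     (35, 'header', 'CHAPTER', 'NNP'),\n",
    "     (56, 'body', 'CHAPTER', 'NNP')],\n",
    "    names=['page', 'section', 'token', 'pos'])\n",
    "toy_hits = pd.DataFrame({'count': [1, 1, 1]}, index=idx)\n",
    "\n",
    "# One page label per row, in row order\n",
    "toy_pages = toy_hits.index.get_level_values('page')\n",
    "print(toy_pages.tolist())  # [19, 35, 56]\n",
    "\n",
    "# A level can also be requested by position; 0 is the outermost level\n",
    "same_pages = toy_hits.index.get_level_values(0)\n",
    "print(same_pages.tolist())  # [19, 35, 56]\n",
    "```\n",
    "\n",
    "The result can be looped over like a list, which is what the plotting code does when it draws one vertical line per page."
   ],
   "cell_type": "markdown",
   "metadata": {}
  },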
  {
   "source": [
    "### Grouping DataFrames\n",
    "\n",
    "Up to this point, the token count DataFrames have been subsetted, but not modified from the way they were returned by the HTRC Feature Reader. There are many cases where one may want to perform aggregation or transformation based on subsets of data. To do this, Pandas supports the 'split-apply-combine' pattern (Wickham 2011).\n",
    "\n",
    "Split-apply-combine refers to the process of dividing a dataset into groups (*split*), performing some activity for each of those groups (*apply*), and joining the new groups back together into a single DataFrame (*combine*).\n",
    "\n",
    "<img src=\"https://github.com/programminghistorian/ph-submissions/raw/gh-pages/images/text-mining-with-extracted-features/split-apply-combine.png\" width=\"500px\"/>\n",
    "\n",
    "<img src=\"https://github.com/programminghistorian/ph-submissions/raw/gh-pages/images/text-mining-with-extracted-features/example-split-apply-combine.png\" width=\"600px\"/>\n",
    "**Figure: Example of Split-Apply-Combine, averaging movie grosses by director.**\n",
    "\n",
    "Split-apply-combine processes are supported on DataFrames with `groupby()`, which tells Pandas to split by some criteria. From there, it is possible to apply some change to each group individually, after which Pandas combines the affected groups into a single DataFrame again.\n",
    "\n",
    "Try the following. Can you tell what happens?\n",
    "\n",
    "```\n",
    "tl.groupby(level=[\"pos\"]).sum()\n",
    "```\n",
    "\n",
    "The output is a count of how often each part-of-speech tag (\"pos\") occurs in the entire book.\n",
    "\n",
    "- *Split* with `groupby()`: We took the token count DataFrame that is set to `tl` and grouped by the part-of-speech (`pos`) level of the index. This means that rather than thinking in terms of rows, Pandas is now thinking of the `tl` DataFrame as a series of smaller groups, each sharing a common value for part of speech. So, all the personal pronouns (\"PRP\") are in one group, all the adverbs (\"RB\") are in another, and so on.\n",
    "- *Apply* with `sum()`: These groups were sent to an apply function, `sum()`. Sum is an aggregation function, so it sums all the information in the 'count' column for each group. For example, all the rows of data in the adverb group are summed up into a single count of all adverbs.\n",
    "- *Combine*: The combine step is implicit: the DataFrame knows from the `groupby` pattern to take everything that the apply function gives back (in the case of 'sum', just one row for every group) and stick it together.\n",
    "\n",
    "`sum()` is one of many convenient functions [built-in](http://pandas.pydata.org/pandas-docs/stable/groupby.html) to Pandas. Other useful functions are `mean()`, `count()`, and `max()`. It is also possible to send your groups to any function that you write with `apply()`.\n",
    "\n",
    "> `groupby()` can be used on data columns or an index. To run against an index, as above, use `level=[index_level_name]`. To group against columns, use `by=[column_name]`.\n",
    "\n",
    "Below are some examples of grouping token counts.\n",
    "\n",
    "- Find most common tokens in the entire volume (sorting by most to least occurrences)\n",
    "  - `tl.groupby(level=\"token\").sum().sort_values(\"count\", ascending=False)`\n",
    "- Count how many pages each token/pos combination occurs on\n",
    "  - `tl.groupby(level=[\"token\", \"pos\"]).count()`\n",
    "  \n",
    "Remember from earlier that certain information can be called by sending arguments to `vol.tokenlist()`, so you don't always have to do the grouping yourself.\n",
    "\n",
    "Transformations can also be done, where the process returns the same number of rows as the input groups, but with values changed based on some grouping. Here is an example of more advanced usage, a [TF\\*IDF](https://porganized.com/2016/03/09/term-weighting-for-humanists/) function:"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "execution_count": 10,
   "cell_type": "code",
   "source": [
    "import numpy as np # For the log function\n",
    "idf_scores = tl.groupby(level=[\"token\"]).transform(lambda x: x * np.log(1+vol.page_count / x.count()) )\n",
    "idf_scores[1000:1100:30]"
   ],
   "outputs": [
    {
     "execution_count": 10,
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "                           count\n",
       "page section token pos          \n",
       "24   body    years NNS  2.315830\n",
       "25   body    asked VBD  1.730605\n",
       "             him   PRP  2.994040\n",
       "             n't   RB   1.250162"
      ],
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th>count</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>page</th>\n",
       "      <th>section</th>\n",
       "      <th>token</th>\n",
       "      <th>pos</th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>24</th>\n",
       "      <th>body</th>\n",
       "      <th>years</th>\n",
       "      <th>NNS</th>\n",
       "      <td>2.315830</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"3\" valign=\"top\">25</th>\n",
       "      <th rowspan=\"3\" valign=\"top\">body</th>\n",
       "      <th>asked</th>\n",
       "      <th>VBD</th>\n",
       "      <td>1.730605</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>him</th>\n",
       "      <th>PRP</th>\n",
       "      <td>2.994040</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>n't</th>\n",
       "      <th>RB</th>\n",
       "      <td>1.250162</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ]
     },
     "metadata": {}
    }
   ],
   "metadata": {
    "scrolled": true,
    "collapsed": false
   }
  },
  {
   "source": [
    "Compare the function in `transform()` above with the equation:\n",
    "\n",
    "$ IDF_w = \\log(1 + \\frac{N}{df_w}) $\n",
    "\n",
    "Here $N$ is the number of pages in the volume (`vol.page_count`), and each page is treated as a document. Document frequency, $df_w$, is just 'how many pages (docs) does the word occur on?' Can you modify the above to use corpus frequency, which is 'how many times does the word occur overall in the corpus (i.e. across all pages)?'"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
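  {
   "source": [
    "To see the difference between aggregating and transforming on a smaller scale, here is a sketch on an invented token table. The `toy` DataFrame, its values, and `n_pages` (standing in for `vol.page_count`) are assumptions made for illustration. It groups once by an index level with `level=`, once by a column with `by=`, and then applies the same document-frequency weighting as above with `transform()`.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "# Invented rows for a hypothetical 4-page volume: (page, token, pos, count)\n",
    "rows = [\n",
    "    (1, 'the', 'DT', 5),\n",
    "    (1, 'Anne', 'NNP', 2),\n",
    "    (2, 'the', 'DT', 3),\n",
    "    (3, 'the', 'DT', 4),\n",
    "    (3, 'Anne', 'NNP', 1),\n",
    "    (4, 'the', 'DT', 6),\n",
    "]\n",
    "toy = (pd.DataFrame(rows, columns=['page', 'token', 'pos', 'count'])\n",
    "         .set_index(['page', 'token', 'pos']))\n",
    "n_pages = 4  # stands in for vol.page_count\n",
    "\n",
    "# Aggregation against an index level: one row per part-of-speech tag\n",
    "per_pos = toy.groupby(level='pos').sum()\n",
    "\n",
    "# The same grouping against a column, after moving the index into columns\n",
    "per_pos_by_column = toy.reset_index().groupby(by='pos')['count'].sum()\n",
    "\n",
    "# Transformation: one output row per input row, weighted per token\n",
    "weighted = toy.groupby(level='token').transform(\n",
    "    lambda x: x * np.log(1 + n_pages / x.count()))\n",
    "\n",
    "print(per_pos)\n",
    "print(per_pos_by_column)\n",
    "print(weighted)\n",
    "```\n",
    "\n",
    "The aggregated results have one row per group, while `weighted` has as many rows as `toy` and keeps its index."
   ],
   "cell_type": "markdown",
   "metadata": {}
  },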
  {
   "execution_count": null,
   "cell_type": "code",
   "source": [
    "# Preparing where we were at the end of the last notebook\n",
    "%matplotlib inline\n",
    "import matplotlib.pyplot as plt\n",
    "from htrc_features import FeatureReader\n",
    "fr = FeatureReader(['data/sample-file1.basic.json.bz2', 'data/sample-file2.basic.json.bz2'])\n",
    "vol = fr.first()"
   ],
   "outputs": [],
   "metadata": {
    "collapsed": true
   }
  },
  {
   "source": [
    "# More Features in the HTRC Extracted Features Dataset\n",
    "\n",
    "So far we have mainly used token-counting features, accessed through `Volume.tokenlist()`. The HTRC Extracted Features Dataset provides more than token counts. Here are other features that are available from the `Volume` object; try them and see what the output is:\n",
    "\n",
    "- `Volume.line_counts()`: How many vertically spaced lines of text, a measure related to the physical format of the page.\n",
    "- `Volume.sentence_counts()`: How many sentences of text, a measure related to the content on a page.\n",
    "- `Volume.empty_line_counts()`: How many larger vertical spaces are there on the page between lines of text? This can be used as a proxy for paragraph count. This is based on what software was used to OCR, so there are inconsistencies: not all scans in the HathiTrust are OCR'd identically.\n",
    "- `Volume.begin_line_chars()`, `Volume.end_line_chars()`: The count of different characters along the left-most and right-most sides of a page. This can tell you about what kind of page it is: for example, a table of contents might have a lot of numbers or roman numerals at the end of each line.\n",
    "\n",
    "Earlier, we saw that the number of words on a page gave some indication of whether it was a page of the novel or a different kind of page. We can see that line count is another contextual 'hint' that could help a researcher focus only on the real content of a page:"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "execution_count": null,
   "cell_type": "code",
   "source": [
    "line_counts = vol.line_counts()\n",
    "plt.plot(line_counts)"
   ],
   "outputs": [
    {
     "execution_count": 2,
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "[<matplotlib.lines.Line2D at 0x1136a5ac8>]"
      ]
     },
     "metadata": {}
    },
    {
     "output_type": "display_data",
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXcAAAEACAYAAABI5zaHAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAIABJREFUeJztnXmYFNW5/7/vzDDIJiKyuuAKiLtRf0RjMgooVwXMnhuT\naGLWJ4k+N5tgkismN15NjL/rksSbxCT8jF6vmmsEogEJzs2qJoobRNAYiQYZZGBAQGCYeX9/vH3S\nNT29VHed7j5dfD/P0093V1edOufUqW996z3nVIuqghBCSLpoqncGCCGE+IfiTgghKYTiTgghKYTi\nTgghKYTiTgghKYTiTgghKSSWuIvIcBG5R0T+LCIrReT/iMgIEVkqIqtFZImIDK92ZgkhhMQjrnO/\nEcADqno0gBMAPAdgLoBlqjoJwHIA86qTRUIIIeUipSYxici+AFao6hE5y58D8DZV7RCRsQDaVXVy\n9bJKCCEkLnGc+2EANorIj0XkCRH5vogMBjBGVTsAQFXXAxhdzYwSQgiJTxxxbwFwMoDvqOrJALbD\nQjK5lp/PMSCEkEBoibHOKwBeVtU/Zb7/DCbuHSIyJhKW2ZBvYxGh6BNCSAWoqlS6bUnnngm9vCwi\nEzOLpgFYCWAhgEsyyy4GcH+RNFL7uuqqq+qeB5aPZWP50vdKShznDgCXAbhDRAYAeBHAhwE0A7hb\nRD4CYC2A9yTODSGEEC/EEndVfQrAqXl+mu43O4QQQnzAGaoJaWtrq3cWqkqay5fmsgEs395OyXHu\niXcgotXeByGEpA0RgVazQ5UQQkjjQXEnhJAUQnEnhJAUQnEnhJAUQnEnhJAUQnEnhJAUQnEnhJAU\nQnEnhJAUQnEnhJAUQnEnhJAUQnEnhJAUQnEnhJAUQnEnhJAUQnEnhJAUQnEnhJAUQnEnhJAUQnEn\nhJAUQnEnhJAUQnEnhJAUQnEnhJAUQnEnhJAUQnEnhJAUQnEnhJAUQnEnhJAUQnEnhJAUQnEnhJAU\nEkvcReQlEXlKRFaIyGOZZSNEZKmIrBaRJSIyPElGOjqAVauSpEAIIcQR17n3AmhT1ZNU9bTMsrkA\nlqnqJADLAcxLkpGFC4EbbkiSAiGEEEdccZc8684BsCDzeQGAC5NkpLfXXoQQQpITV9wVwEMi8kcR\n+Whm2RhV7QAAVV0PYHSSjKhS3AkhxBctMdc7Q1VfFZFRAJaKyGqY4EfJ/V4WdO6EEOKPWOKuqq9m\n3l8TkZ8DOA1Ah4iMUdUOERkLYEOh7efPn/+Pz21tbWhra+u3DsWdELI3097ejvb2dm/piWpxwy0i\ngwE0qeo2ERkCYCmAqwFMA7BJVa8TkSsAjFDVuXm211L7AICbbwb+8AfgzjsrKQYhhKQLEYGqSqXb\nx3HuYwDcJyKaWf8OVV0qIn8CcLeIfATAWgDvqTQTAJ07IYT4pKS4q+pfAZyYZ/kmANN9ZUTVXoQQ\nQpITzAxVOndCCPFHMOLOoZCEEOKPYMSdzp0QQvwRjLjTuRNCiD+CEXc6d0II8Ucw4k7nTggh/ghG\n3OncCSHEH8GIO8e5E0KIP4IRdzp3QgjxRzDizpg7IYT4Ixhxp3MnhBB/UNwJISSFBCPuDMsQQog/\nghF3OndCCPFHMOJO504IIf4IRtx7eznOnRBCfBGMuNO5E0KIP4IRd8bcCSHEH8GIO507IYT4Ixhx\np3MnhBB/BCPudO6EEOKPYMSdzp0QQvwRjLjTuRNCiD+CEXeOcyeEEH8EI+507oQQ4o9gxJ0xd0II\n8Ucw4k7nTggh/ghG3OncCSHEHxR3QghJIbHFXUSaROQJEVmY+T5CRJaKyGoRWSIiw5NkhGEZQgjx\nRznO/XIAqyLf5wJYpqqTACwHMC9JRujcCSHEH7HEXUQOAnAegB9GFs8BsCDzeQGAC5NkRJXj3Akh\nxBdxnfv/BfBFAFH5HaOqHQCgqusBjE
6SETp3QgjxR0upFUTkfAAdqvqkiLQVWbWg754/f/4/Pre1\ntaGtrX8yjLkTQvZm2tvb0d7e7i090RKxEBG5BsAHAOwBMAjAMAD3ATgFQJuqdojIWAAPq+rRebbX\nUvsAgPe/H3joIeC118ovBCGEpA0RgapKpduXDMuo6pWqeoiqHg7gfQCWq+oHASwCcElmtYsB3F9p\nJmw/dO6EEOKLJOPcrwUwQ0RWA5iW+V4xjLkTQog/Ssbco6jq/wL438znTQCm+8oInTshhPiDM1QJ\nISSFBCPudO6EEOKPYMSdf9ZBCCH+CEbc6dwJIcQfwYg7Y+6EEOIPijshhKSQYMSdYRlCCPFHMOLu\nOlTZqUoIIckJRtydqFPcCSEkOcGIuwvJMDRDCCHJCUbc6dwJIcQfwYg7nTshhPgjGHF3jp3iTggh\nyQlG3OncCSHEH8GIO507IYT4Ixhxp3MnhBB/BCPudO6EEOKPYMSdzp0QQvwRjLhznDshhPgjGHGn\ncyeEEH9Q3AkhJIUEI+7sUCWEEH8EI+507oQQ4o9gxJ3OnRBC/BGMuNO5E0KIP4IRdzp3QgjxRzDi\n7kSd49wJISQ5wYg7nTshhPijpLiLyEAReVREVojIShG5JrN8hIgsFZHVIrJERIYnyQhj7oQQ4o+S\n4q6quwCcpaonATgewNkicgaAuQCWqeokAMsBzEuSETp3QgjxR6ywjKruyHwcmNlmM4A5ABZkli8A\ncGGSjPT2As3NFHdCCPFBLHEXkSYRWQFgPYB2VV0FYIyqdgCAqq4HMDpJRlSBlhaKOyGE+KAlzkqq\n2gvgJBHZF8ASEWkDkDuupeA4l/nz5//jc1tbG9ra2vqtQ+dOCNmbaW9vR3t7u7f0RMsceygiXwXw\nBoBLAbSpaoeIjAXwsKoenWd9jbOPww8HOjuB9nbgpJPKyhIhhKQOEYGqSqXbxxktc4AbCSMigwDM\nALACwEIAl2RWuxjA/ZVmAsg6d45zJ4SQ5MQJy4wDsEBEBHYxuF1Vf5WJwd8tIh8BsBbAe5JkhGEZ\nQgjxR0lxV9VnAJycZ/kmANN9ZYQdqoQQ4o9gZqjSuRNCiD+CEXc6d0II8Ucw4k7nTggh/ghG3Onc\nCSHEH8GIO507IYT4IxhxV+U4d0II8UUw4k7nTggh/ghG3BlzJ4QQfwQj7nTuhBDij2DEnc6dEFIt\nXn8dePTReueitgQj7r29FHdCSHVYvBj4zGfqnYvaEoy4u9EyFHdCiG/WrLHX3jQaLxhxp3MnhFSL\n1auBrVuBjo5656R2BCXuSca533QT8MorfvOUBn79a2DWLOCHP+z/27e/DaxdGy+du+8GHn88/29X\nXw1s2xYvnVtu6b/PTZuAd7wD+PSn7fhffz2wYUO89H70I+Cpp+Kt+9BD9irG3XcDv/1t/t+uuspi\nt3H43veAF1+Mt24u27ZZnRajq6v/Oj/4AfD885Xt0xfr1gH/8R+Vb3/PPcAf/1h8nR/8AFi1Kl56\nS5cCy5aZax88GHjmGeBf/9Xa2b/9G7BxY+V5DR5VrerLdlGaAQNUzz9f9Wc/i7V6H7ZvV91nH9V/\n//fyt007s2aptrWpvvOd/X+bMEH1mmvipTN5suqHPtR/+Zo1qoDqPfeUTuONN1QHD1b92tf6Ln/q\nKcvLhAmqy5erNjerfuc78fJ1+umqn/lMvHWnTbNXMU48UfWLX+y/fM8ea6O/+U28fR17rOrixfHW\nzeWee6xO16wpvM5jj6mKqG7YkM3fyJGqX/pSZfv0xbe+pdraqvr665VtP2WK6kUXFf69p0d13DjV\nu+6Kl95ZZ6nOmKE6bJjqnDmq73qX1a1rZ3/4Q2X5rAUZ7axYe4Nz7pWEZX71K2DgQOs0IVneeMP+\ntvDii4Hdu/v/9re/xauzF14wR/bAA0BPT9/fFi8Ghg+Pl057OzBgQP91u7uBkSOBOXOAyy8Hhg4F\nFi
0qnR5gt9uLFpW+49u61UZLPPaYfc7HK68ATz5paeby0kuWz3y/5dLTYw56z57S6+YjTp1u2WJl\nfuAB+/6HP9j+6n0OLFoEDBpkbrlcXnzRjsGDDxauuxUrgFdfjVe3XV3An/5kd2IDBwJTpwL/8z9W\nt5dfbsfpjTfKz2ejEIy4JxkKuWgRcMUVdsuV6tusMlm+3P6PdsyY/uL+wgvAYYcBK1cCr71WPJ3F\ni4F3vxsYPx545JG+vy1aBHz96/mFP5dFi4AvfMFukdevzy7v7jbRnzXLjuFXvwr87nelQz2dnbat\nCPDss8XXXbIEeMtb7LVkSeFynnCC5S8Xtyzfb7m8/DKwa1dl4t7TY3X59a8Xv8B1ddn54sR88WIL\na3V2Vh4OSsqmTXZxnDcv/sU5yuLFwDvfCRxyiF2s8uHSLdXWADvOb30rcOaZwMSJ9urttXDMM8/Y\nOjt2lJ/PhiGJ7Y/zQsywDKD67ner/td/FV/vwQft9rilJfsaOlT1hRdUL7lEtamp728tLapHHaXa\n3W238C++aOncfHP/9eK8pkyxW0NV1dmzbVnu7ff27XZbvmuX6jnnqD77rOoVV6jefrvqfff1z3++\n15e/bGlt3qy6776l13/LW7L7P/dcW9bUpPqf/6m6bJndnqpa3o85RvXHP7bb1A9/OH+d5b4eekj1\nm9+0dT//eUuru1t14EDVbdts/83Ntu6Pf2y/r1xp4TKXxpAhqqtXq37sY5bO975n6/3616pnnGH1\nddxxquvWWRjJpZfvtf/+dmt+6qmqn/iE6o03WlrPP58tq+Nd77L93X676k9/Wri8zc2q995rZeru\ntm3XrrW83XCDhQPmzFH91Kf6b3vmmZb/U0+1fQDZ0MHChbbO7NmqXV2qp5xibWTWLNUnnlCdP1/1\n+9+3tt3aanW5fbvqmDEWxnr1VUvnlltUb73VPv/whxbGHD7c9jtliuojj6j+y79Y+b7xjWz5t25V\nPfDAwnX5kY9k1z30UFvW2qr66KO27J57Cm87cKDqL39p7ay5WfWDH7RjcNBB2TTf/vb459eSJarX\nX29l+PSnbfvdu1UPOCC7vylTVH/0I/vt4Yez59PkyRaeOuWUbPu/807V225Tvewy04gTT7T0jjtO\nderUeOHEeoGEYZkgxL2313Lyvvep3nFH8XVvu81iv7t3Z1/uROzt7bvcvU4+2RofYOKqagf7+uvz\nr1/sdcwxFqfbsMFE973vzTY0x9q1tq9bb7X3+fPtRJ0xQ/Ud77AyFNvHM8+oHnywlWfVKrs4FVt/\n1SrVI47I7v/YY1X/9Cf7TdXE04l/Z6fl6U1vsthyoTorVL8/+Uk2Jrptm+qgQfa5p8fWveqq7IVp\nyRLVs8/On85XvmL1oqr6q19Zv4D7LZpeodfFF1sZLrpIde7cbN/B0qVWvpdeytbHwQeb4DiKpatq\nsX+3/k03ZevrYx+zet5vP9WXX85us3On6tix2eP9pjfZ++23Wxo33GAXoH33tb4EwC5sgOV9wgQ7\nPhddZL8787Bnjx37556z73PnZi+s119vQj51qu13zBjbrrfXRPmoo7Ll/dnPVKdPz1/e++5TPe+8\n7LouXj5zZta0XHGF9ZPk2/7mm03QTz9d9Re/sP339NhFaetW237KFLuIxTm/HL/+tV0oVVXXr1cd\nNSrbhi691C6Gqqr//d92Tu3erXr88XaeT5hgF7xoern09qp+4AOqCxYUXqfeJBX3IMIyLl4aJ+b+\nxhvAkCF2G+9eLZl/ghXpu9y9Zs2yW0WR7G31xo3A6NH51y/2mjXLbg0feACYPt3yknv73dVl71de\nCRx9tI3kGTzYbjWXLbM0iu3jmGOA1lbg6actn6NGFV9/4MC+edizx+KeAwbY99bWbFjGha0efxyY\nNKlwnRWq39bW7L727Mn+1tRk644end1HVxcwYkTh4xRNx+VVpG96
hV6zZ2fL0NKSTcvt292+b99u\nYafDDsvWT7F0AUvTtZPFi4HJk21f558P/OUvdnt/0EF96//88+14u3VduVyeDjwQmDbNQk6TJwNf\n/rK9f//71uafftr2NXu2lR2w8yFatp6ebNvassVix7Nm2X7PP9+2EwFOPdXK7foHFi2ydAu1nWiI\nw7Wd6HFevdracaHjsHChhcWmTbP9NzUBRx3V91wbOzbe+eWYMiU7Ln3jRuCAA7JtqLk5m+c9eyyv\n0fN81qzsskKI2DnJmHuVUc02zFLivnMnsM8+5aV/wQUWD5w9O9vgOzutwZTLBRdYp8wdd9jnlpb+\n8b8tW0zUNm2y+F5PD/CudwGnnw4ce6yJdTFELO1FiyyfI0cWXz83D1GxBPqKe2enxc4BE6lyie4r\nKu6OkSNtH0BWgEql42Lu5TBjhpVr4sS+AujKd9ddFv994QXgiCNMEOIycaK1k9dftwvyDTfY8hNO\nsHjwBRf032bWLDveN91kx2/8+L55OuCAbDu85RZ7/+pXLe9z5gBnnQUcfrhdNArV0549VqdAtm5d\nmtE8ufbz3e/aYIMHHrD85aO5OZtPVTv/3IW1u9uWr1lTuK0ccghw6KEm7AMHZpe7C6Sq5a9UG85l\n5Egr+4YN/c+B3DpxbcfVRaGy5jJoULpj7i2lV6k+vb3ZK36pUQ9vvGEHpRxOPhn4/OfNaX/jG7Ys\njmjmY+pU4JRTbMTFnDk2JjfXuW/ZYusdcwwwcybwta8B55xjo1PiNqazzwZuvRUYN650PqMnKNBf\ndHPF/cQTgfe+18SqXKL7iiPu++1XOp1KxH3YMHNpp59uAu7qtbPTyrZ2rXWmzZtnQlMORx5paT7+\nOHD88XYh+eQngYMPBj7xCUs/lxkzrLN4+nTgS1/qO1rGtbVp04C//tXer7jCxGjrVuCMM2ydfGP7\no/UUde5dXVa3xx0HfPGL1r6ifPKTlo9nnwXe8x4T4HxEhbKnx/bn7qy6u23Ziy+aEy/EvHnmzKO4\nC+TWrWbGWlsLb1+IiRPtApF7rhZqg6edBnzuc8Db3hYv/cGDKe5Vxzn3pqbqOPemJpsY8+qrfZ17\nJeLe3Azcfnv2ez7n7k68b33Lvl92mb1Pnhx/PwcdZMMPK3XuxcR95MisGy2Xcpx7V1d8556bThzm\nz8+mFRXSI46w8n30o8DNN9tQ0HIYPx74zW+s/g8+2NL/3vfstyuvzL/N4MHZ433ttfYck1xxHznS\nRsG4dQAT4WLkim+ucxcBvvnN/tuddFLpCVtA/xCHOw5O3NeutVBbMUOV72I3caINaaz0PHNpuPM1\nV9zz5bmpySbmxYVhmRrgbgXjiHslzt0xdqxtv3mzxfEqbXRRcl0zUDwcEZfx4+OLe7nOPUm5Q3Hu\nUXLF3ZVv1iy7oJcbfnJ1v25dNoTlK0/lklvfueKelNz0XfiqpcWOy5o15d/5ALbN6tXJyu5CO7nn\naimDEZe0h2WCEHfVbFimGs7dIWIn+sqVJvI+To5CMfdCohaXUaPsIrR+fem+gXKce9KLWqkT64AD\nyo+55/YRVJKnaOelK9/06RYHLlecxo2zi8K6dfbZZ54qSSfq3HPDMknJTT/q3PfsMYGupG8mGlKp\npG+rWBqlDEZc6NxrQK2cO2An+iOPWIenG5mRhHzOvVg4opx0R4+2mGkjOffBg+19x47i9VBN5+6E\nYMgQ6/w+5ZTy0ouKe1LnrurPuecLyySlVFims7P0AIB87LeftYU47bcQhdx/oTyXC517DaiVcwfM\nDfz+935CMkD1nDtgwrJqlf+Ye6VOKndfhU6skSPNrRarh6SjZXLTKhQCOe+88jvzBg2y18qVyZ37\njh3Wrt1Fr5J0ovW9a5edA9UOyzhx7+6urDMUSH6uHXGEdUB3dFQnLEPnXgNq7dx///tkAhelWjF3\nwIRl587kzn3AgNo5dyAbdy9W
D7nOvdITFPAX347iLqxJnbvP+naC1tXlr40VC8t0d1u7qZe4Dxpk\n/WSPPx5vtEwl6e/Vzl1EDhKR5SKyUkSeEZHLMstHiMhSEVktIktEpOKm5oZCxh3nnkTcJ07s7wSS\nUGi0jI8TzwlLUufuxN1NCKlmzN3lt7OzvNEyPpz77t128d9338rTcowbZ20xqXNPKu654gtYuGjg\nwGR15igklK5DNcmxmTQp+bmWLw1fYZm0D4WM49z3APicqh4D4M0APi0ikwHMBbBMVScBWA5gXqWZ\niA6FjDPOPUlYxo3X9SXuhZy7r7AMUDqvbkZjb6/VX/T22uXR5dO3c8934rtO1bijZZJ2qLrOv85O\nYP/9/fSljB9vMfthwyrb3ol70otpbj0BNjzRh3lw6UeFMhqWcRfMJM4dSHaXnC8NhmXiUVLcVXW9\nqj6Z+bwNwJ8BHARgDoAFmdUWALiw0kxEJzFV27nvu2+8iUFxKRRz9xWWGTYs3snlxCQa4orS2mrx\nWp9OslTMfevWwi66Gs7dV0gGsLofN67yC0U1nfvf/ubHPORLPzcsk9S5A8mdO2AXbQfDMvEoK+Yu\nIocCOBHAIwDGqGoHYBcAAKMrzUQ5k5iSOnfA3EA1nbuvYWrjx8fPp3NghRp7a6sNrWxuTnZxjBtz\nf+klc0aFTrxqjJbxKe7jx1ceb/eZp9yYe0uLf+ee73j6iLkffrid00nKP3GilTXajnyGZdLs3GNX\ni4gMBXAvgMtVdZuI5AZQCgZU5ruphADa2trQ1tbW5/daOnfApmofd1yyNBzVdO5TpsSfSu3EpKmp\nsLh3dNgQ0CTEce4HHpj9U4Q46XR3JzumruzuQWU+OOmk0s+5L4YLayTNU25977+/PWRszJjK0yyW\nfr7RMpVeeAcOtOc5TZhQef5OOMHmK+TmOY3Ovb29He3t7d7Si1UtItICE/bbVfX+zOIOERmjqh0i\nMhZAwX+9jIp7Pmrt3AtNIa+EXOfe3W3hjyFDkqd92GHAT34SPx89Pdk/Gs+ltdWEZujQZHmK49zP\nOcem3xd73EKuc0/SCepO9l27+j68Kglvfau96p2nXOc+ciTw8MP2WAUfRF1wNCzjOlSTOHcAuO++\nZPkbMwa4996+y9LaoZprfK8u9Ue6JYgblvkRgFWqemNk2UIAl2Q+Xwzg/tyN4lLOUEgfzt0nuc49\n+syPWudjz57iYZktW5LXXRznfsQRdjtdLDRVjRmqScM7PonmKYk45sbER460cyTfkykroVhYJrQ6\ndficxJTmsEycoZBnALgIwNkiskJEnhCRmQCuAzBDRFYDmAbg2kozUc4kJh/O3Se5zt1XSKaSfJSK\nuXd1VT6ZJrqfOLfEs2YVr4dqxNxDEqLo8Mwkecqt7/33zz562Fc+i4Vlkjr3auArLLPPPnZnVclf\nezYCJatFVX8HoNDTsKcXWF4WzrnXYpy7b3Kd++bN/uK+5ebDNfh6O3cA+PjH7c+M46STZnH37dzP\nPbf/M9+TUCgs4yPmXi18OfemJhP4nTuTm54QSTAv0B9R515snLsb6pdkNqNvcp27zxEb5eajpyf7\nR+O51Nq5H3mkveKk42uGakhCVA3n3tNjrv2MM/zkMTf9fJOYQnTuvsQdyHaqUtyrRNyYu3PttY5n\nFyPXuddL3KMPqirm3JM2Yl8TSOjc46eTL2zii2JhmdDq1OErLAOkezhkEOIeN+YeWrwdCM+5R2+t\nowwYYM49aVjG1wQSnzNUfUyV9021nLvvu9ZSYZnQnXvSu77QhkP6pKEeHBZavB3o79x9/QlIJfmI\nM1omzc49qZD6JJonnzF33849ziSmUOrU4TMsE9pwSJ8EIe5pc+6+njhZbj5CGi1TTjo+Y+6huExf\noaJCj+T1RfS5RIUmMYVSpw6fYZk0D4cMQtzjOvekj/utBqHF3EuJe61Gy5STTppj7r6du++wjE
j+\nkF60QzWUOnXQuccjGHGP49yT/lFHNYg2NKD+MfdC8WtfYZlqOfc0iruPfoDcmLtv5x7dR6FJTKE5\n91xxT1K/dO5Vxj1+oNQ491CdezQsE3LM3VeHqm/nnuYZqj6de1KXWmofhSYxhVKnDp9hmei/lKWN\nIMQ96tyLjXOncy+dj2LivnVrcucevQjTuRemGjH3ajr33LDM3jCJCbDzorvbT75CIwhxL/XgsI0b\n7SA2gnMPOebe2+tnskapfcVNw9dwtpDFPfTRMkDhsEyoQyF93s1E/4IybQQh7qU6VD/+cWDp0vCd\n+65d1lAq/fceH/koJu6An4tjqX2VkwaQ7qGQvkfL1CosE+LcAYfPOkmzc2+ISUzbt1uPdujO3bn2\nesygjePcAb/Ovbu78sfZRustzWEZoDGce6GwTNL+kGrgMyxD515lSjl35yBCd+716kyN5mNvdu4h\njexotJi7MwbRDtUdO6wcIT3uA/AblqFzrzKlnLtreI3i3OuVj2KC6wQmpJi7z8cPhOrce3oaZ7RM\nrnPfsSOci2UUn2EZOvcqE9e579oVXmOLOtBt25L/01GSfNQqLFMN557WDtVGce65x3PAAAuHhlKf\nUThaJh5BiLtz7oXGuTtxD+m22xF1oLt3+/ubt0ryUauwjG/nnlQA3RT6XbvCEaNGHS2T26Ea2vkG\n+B3nTudeZaLOPd8496i4h3LyOnIdaL1OhkZ37kmPa0uLhe1CaR+NOFomNywTfQ8JOvd4BCHupWLu\nIYt7rnOvl7jvzc7dpReiuCcdnlnL0TK5YRkgTOfO0TLxCELcS8XcnZCEKO7RhlZPcW805+6Otaqf\n4XYtLdYBGEr78DWCxx1Xd140VeGMLTRaJvoeEr4fP0DnXkXo3P3ko5Rzb272U38+nHv0aYRJO1QB\nK1dozt3HOPFSx9UH+cIyTuTp3BuXIMQ97miZEMW9kZy7r78o9OHco+n4DMuEIkYuP0nHibvjWq2Q\njNtH7vEUsc+hnW8AY+5xobgnpJGcu68/AY4696SudPduO+ZJhSu0mLuvsEbUVVdT3HPDMoDlPZSL\nZRSOlolHEOIeNywT4kOMGsm5+xJ3n8595047wZLeUYQm7u7pmUnbQ6nj6oN8YRnA6jKU+oxC5x6P\nIMTdOfdC49xD7lANzbkXil+7sIyvffkQnJaWrLj7yFNIDw4D/IQ1auXc812sQ3XuLr+uMz5JvdC5\nV5moc4+Oc3/qKXsPOSyT69zrlb9SDm/AgHCduw9HmjuELwRaWvw593qFZUKqT0f00Q4unFspdO5V\nplDM/aypic3iAAANvElEQVSz7GFcIYt7aM69kOAefTRw6aX+9hWicwfCah8+nXs9wjI+Lk7VwJe5\nAOjcq06hmPvOnfZy4YYQxb1RYu6jRwOf+pS/ffly7r7i5KGKe6M490JhmZDq0+FT3Oncq0wh575r\nlz28CAhX3F0oqbe3vo8fqIXDi+4rVOcektNspJh7I46WoXMvTklxF5HbRKRDRJ6OLBshIktFZLWI\nLBGR4Ukykc+59/TYZyfuoXaoRifjhOLcq11H1Rgtk5S0O3eOlsniYuw++rj2duf+YwDn5iybC2CZ\nqk4CsBzAvCSZyOfc3dU06txDGw3h8PUEwKR5aFTnnuYO1UZx7o0UlgEsz7t20bkXo6S4q+pvAWzO\nWTwHwILM5wUALkySiXyP/N21y95zwzIh3iaG5tyrLe6hxtyTjpzwTSPF3POFZULtUAX8GYO93bnn\nY7SqdgCAqq4HMDpJJoo5923b7D3UmDuQPTnqLe61cu6+LiS+wzKhtQ0fearFcW20sAxA5x4HXz4n\nz1PYy9hYs7FrN6ywkHMPsbG5k6PeYZlaOXdfISCfs0pDFfek7SE6ppuTmLL4Evc0O/dKq6ZDRMao\naoeIjAWwodjK8+fP/8fntrY2tLW19fndOfdRo4DXXrNl+W
LuoYo7nXvl6dC5F8cdV05i6ouvsExI\nzr29vR3t7e3e0otbNZJ5ORYCuATAdQAuBnB/sY2j4p4PVRP38eOBdetsWa64hzpaBqBzT5KOzw7V\n0NqGT+fOSUx9SaNzzzW+V199daL04gyFvBPA7wFMFJG/iciHAVwLYIaIrAYwLfO9Ynp7LSwzahTQ\n1WUi2UhhGTr3ytPx6dxDE6JGcu6NOFombc7dNyWrRlXfX+Cn6b4yEe1QHTMGWL++sTpUo869Xvmr\n9VBIX859xw4/opx2585JTH1paUmfc/dNlWUgHq5DFciGZlyFb99ujSzkce6hOPdaDoX05dy3bgWG\nDEmep1DFnaNlqgNHy5QmiFHBzrkDwLhxwKuv9o25DxoU9jh3d3LsTY8f8OXct2xJt7g3inNvxNEy\nHOdenCDEPZ9zzxX3kDtU6dwrT4fOvTj1HC0TYp06fIVlmptNf9zD/9JEEGGZqHMfP96c+4EH2vft\n2+055I0Sc6dzLy+dzZvTLe6NPFrm0kuz52Fo+ArLANmwb7UunvUiOOc+blxf575tm4n7zp3hTS93\n7K3OvdC/PpWTDp17ceo5WubMM4HDD6/OPpPi849eWlvTGXcPQirzOffoUMhBg2xURWgnr2NvdO7d\n3cn/2Jox93hp1Gu0TMj4CssAWeeeNoIQ92LO3Ym7GzUTInujc9+1y96T/LG1b+ceWuefL+fe21uf\nsEzI+AzL0LlXkahz328/c3NRcR882BpeqOLu3FUI49yThkri7svHidXSkv6wTNILjvufg927ax+W\nCRmfYRk69yriHj8A2Im+fXvWGTpxB8I7eR3ORTQ11e+2ttbO3ceJ5VxpmsXdR56id0rVoFHDMoy5\nFycIcXePHwCy4r57NzBsWLZDFQjvttvhnm5Yz/zVOubuy7kD6RZ3X7NvfYUgCqW/N4dl6NyrSNS5\nDxqU/WPsYcOswe2zj/0W2snraG6uf4dvozp3wI+4hzibcsIEP0MJa+Hc9+awTFqdexCHMurcm5pM\n4Lu6TNwBO2lDdGYOOvfK0wHS69yvuspPOq6+GZbJ4upk332Tp0XnXkWizh2wMMymTcDQofY9dHF3\nfxdXT3Gvh3P3MRIESK+4+8JnCCIfe3tYJq3OvW7i/txzJupAX+cO2Mm+eXNf5x7ibbcjNOde7Xqi\nc68tLS0cLZOL61D1cczp3D0zcybw5JP2OToUErCTfdOmxhF3F3MPwblv22ZhrWrva8uW5Pvx6dyn\nTgXOPTd5OiHC0TL98dUGATp377z2GrB6tX2OTmIC+jt358pCFfdQnPumTdboXb1Vc18vv2yziZOm\nA2RHQyVh6lTgwguTpxMiHC3Tn+Zm4JVXkrdBgM7dKzt3mtN14p7PuTdSWCYU575unc3wrdW+kp5Y\nzc32CnWIayjUwrm7x0mE+OymfLS02GNKfLR3OnePdHba+5o19p7PueeGZUKcXu4Ixbn39PhxMnH3\nlfTEammxY53kEQZ7A7WIubv0G+VYuAlwdO6FqZu4NzVlxT2fc3fj3AE697h5AGoj7r721dzsJ96e\ndmoxWqaa6VcDn+2dzt0jnZ3AscdaWEY1v3MH+g6FDFncQ3HuQG3CMm5fPmLuFPfS1GKcezXTrwY+\n2zudu0c2bgSOPBIYOBDo6Mjv3IHG6VDdW5170hOLzj0e1RbfRnXuw4f76Yync/dIZycwciQwaRKw\ncmX/SUy54k7nHi8PAJ17Gqm2+Ibw+IxyaW7219ZbW7P/H5Em6ibuBxwAnH028Mtf5p/EBDSOuIdw\nctC5p5dajJb5y1+AQw6pTvrVoKXFX1sfNQrYsMFPWiFRV+c+axawaFE85x7yDEQ3W25vcu777Zd8\nAgmdezzcaJlqdqh2ddmddKPg07m7f39LG3WLuY8cCbzpTTbLbPXq4h2qjRBz37p174q5+9gPnXs8\nauHcAWDixOqkXw18tU
HA0lm3zk9aIVETcd++ve9359ybmoALLgB+8Yv8zt29O2EPdZz70UdbYzv2\n2PrlYdAgYM6c6s9OBez2febM5OlMnAicdlrydNLO8OHmLKst7o3k3I8/Hnjzm/2kNW4cnXvFvPBC\n3+8u5g5YaGbjxvzOfeBAe4Uec7/kEivj5z5XvzwMGAD8/Oe12dekScC3v508nenTgcsuS55O2jnn\nHIsJVzMsAzSWc7/0UuDtb/eTFp17AtxkJYdz7oCd4Pvsk9+5t7baK3RxJ6SaXHCBvTMsUx1Gj7YZ\n8Wkb655I3EVkpog8JyJrROSKQuvdeSfwxBM2KuanPwXWr8+K++DBNmqmlHMPuUOVkGpy2GHAMcdU\nV9zHjvXzxxeNSHOzjZjp6Oi7/N57gVtuAV5/vT75SkrF4i4iTQBuAXAugGMA/LOITM637rBhwIc+\nBPzud8CVVwKf/SwwYkT29698BTj//Oz3XOcecodqe3t7vbNQVdJcvkYq27e+BZx5ZnnbxC3fpEnA\nddeVn6d64/P45YZm1qwBPvlJ+9+JRnX0SZz7aQCeV9W1qtoN4C4Ac/Kt+JOf2CN+b7rJ4tPXXNM3\nDPPmNwOnnJL93khhmUYSiEpIc/kaqWz/9E/AUUeVt03c8jnz1Wj4PH7jxvUV90WLgHe+05z7/vt7\n201NSSLuBwJ4OfL9lcyy/jtpMmd+773Z+GExhgyxbVpaGqNDlRDS2OSOdV+82AZ7NDI1e5rE7NnA\ngw/2deiFGDo0O0Fm0CAT+H32sRchhPjmwAOBG28EHnjAvj/xBDBtWn3zlBRR90em5W4oMhXAfFWd\nmfk+F4Cq6nU561W2A0II2ctR1YqfsJ9E3JsBrAYwDcCrAB4D8M+q+udKM0MIIcQPFYdlVLVHRD4D\nYCksdn8bhZ0QQsKgYudOCCEkXKo2QzXuBKdGQkReEpGnRGSFiDyWWTZCRJaKyGoRWSIiw+udz7iI\nyG0i0iEiT0eWFSyPiMwTkedF5M8ick59ch2fAuW7SkReEZEnMq+Zkd8apnwicpCILBeRlSLyjIhc\nllmeiuOXp3yfzSxPy/EbKCKPZrRkpYhck1nu7/ipqvcX7KLxAoAJAAYAeBLA5Grsq5YvAC8CGJGz\n7DoAX8p8vgLAtfXOZxnleQuAEwE8Xao8AKYAWAEL5R2aOb5S7zJUUL6rAHwuz7pHN1L5AIwFcGLm\n81BY/9fktBy/IuVLxfHL5Hlw5r0ZwCMAzvB5/Krl3GNPcGowBP3vduYAWJD5vADAhTXNUQJU9bcA\nNucsLlSe2QDuUtU9qvoSgOdhxzlYCpQPsOOYyxw0UPlUdb2qPpn5vA3AnwEchJQcvwLlc/NoGv74\nAYCq7sh8HAjTlc3wePyqJe6xJzg1GArgIRH5o4h8NLNsjKp2ANYgAYyuW+78MLpAeXKP6d/RuMf0\nMyLypIj8MHLb27DlE5FDYXcoj6Bwe0xD+R7NLErF8RORJhFZAWA9gHZVXQWPx68uf9bRwJyhqicD\nOA/Ap0XkTJjgR0lbD3XayvNdAIer6omwk8rDw4vrh4gMBXAvgMszDjdV7TFP+VJz/FS1V1VPgt1x\nnSkibfB4/Kol7n8HEP1HxoMyyxoaVX018/4agJ/Dbos6RGQMAIjIWACN/m+MhcrzdwAHR9ZryGOq\nqq9pJogJ4AfI3to2XPlEpAUmfLer6v2Zxak5fvnKl6bj51DVrQAeAHAKPB6/aon7HwEcKSITRKQV\nwPsALKzSvmqCiAzOuAiIyBAA5wB4BlauSzKrXQzg/rwJhIugbwyzUHkWAnifiLSKyGEAjoRNXAud\nPuXLnDCOdwB4NvO5Ecv3IwCrVPXGyLI0Hb9+5UvL8RORA1xISUQGAZgB6zD1d/yq2BM8E9bD/TyA\nufXumfZQnsNgo35WwER9bmb5/gCWZcq6FMB+9c5rGWW6E8A6ALsA/A3AhwGMKFQeAPNg
vfR/BnBO\nvfNfYfn+H4CnM8fy57AYZ8OVDzayoifSJp/InHMF22NKypeW43dcpkwrADwF4AuZ5d6OHycxEUJI\nCmGHKiGEpBCKOyGEpBCKOyGEpBCKOyGEpBCKOyGEpBCKOyGEpBCKOyGEpBCKOyGEpJD/D12AN+as\nlJGgAAAAAElFTkSuQmCC\n",
      "text/plain": [
       "<matplotlib.figure.Figure at 0x11363c828>"
      ]
     },
     "metadata": {}
    }
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "source": [
    "## Page Level\n",
    "\n",
    "If you open the raw dataset file for an HTRC EF volume on your computer, you may notice that features are provided for each page. While this lesson has focused on volumes, most of the features that we have seen can also be accessed for a single page; e.g. `Page.tokenlist()` instead of `Volume.tokenlist()`. The methods to access the features are named the same, with the exception that `line_count`, `empty_line_count`, and `sentence_count` are not pluralized.\n",
    "\n",
    "Just as you can iterate over `FeatureReader.volumes()` to get Volume objects, you can iterate across the pages of a volume with `Volume.pages()`."
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "source": [
    "# Next Steps\n",
    "\n",
    "Now that you know the basics of the HTRC Feature Reader, you can learn more about the [Extracted Features dataset](https://analytics.hathitrust.org/features). The [Feature Reader home page](https://github.com/htrc/htrc-feature-reader/blob/master/README.ipynb) contains a lesson similar to this one but for more advanced users (that's you now!), and the [code documentation](http://htrc.github.io/htrc-feature-reader/htrc_features/feature_reader.m.html) gives exact information about what types of information can be called.\n",
    "\n",
    "Underwood (2015) has released a [custom subset of the HTRC EF Dataset](https://analytics.hathitrust.org/genre), comprising volumes classified by genre: fiction, poetry, and drama. Though many historians will be interested in other corners of the dataset, fiction is a good place to tinker with text mining ideas because of its expressiveness.\n",
    "\n",
    "Finally, the repository for the HTRC Feature Reader has [advanced tutorial notebooks](https://github.com/htrc/htrc-feature-reader/tree/master/examples) showing how to use the library further. For example, one advanced tutorial shows how to [derive 'plot arcs' for a text](https://github.com/htrc/htrc-feature-reader/blob/master/examples/Within-Book%20Sentiment%20Trends.ipynb), a process popularized by Jockers (2015).\n",
    "\n",
    "![Plot Arc Example](https://github.com/programminghistorian/ph-submissions/raw/gh-pages/images/text-mining-with-extracted-features/plot-arc.png)\n",
    "<center>**Plot Arc Example** </center>"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "source": [
    "# References\n",
    "\n",
    "Boris Capitanu, Ted Underwood, Peter Organisciak, Sayan Bhattacharyya, Loretta Auvil, Colleen Fallaw, J. Stephen Downie (2015). \"Extracted Feature Dataset from 4.8 Million HathiTrust Digital Library Public Domain Volumes\" (0.2) [Dataset]. *HathiTrust Research Center*. doi:10.13012/j8td9v7m.\n",
    "\n",
    "Matthew L. Jockers (Feb 2015). \"Revealing Sentiment and Plot Arcs with the Syuzhet Package\". *Matthew L. Jockers*. Blog. http://www.matthewjockers.net/2015/02/02/syuzhet/.\n",
    "\n",
    "Ted Underwood, Boris Capitanu, Peter Organisciak, Sayan Bhattacharyya, Loretta Auvil, Colleen Fallaw, J. Stephen Downie (2015). \"Word Frequencies in English-Language Literature, 1700-1922\" (0.2) [Dataset]. *HathiTrust Research Center*. doi:10.13012/J8JW8BSJ.\n",
    "\n",
    "Hadley Wickham (2011). \"The split-apply-combine strategy for data analysis\". *Journal of Statistical Software*, 40(1), 1-29."
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "source": [
    "# Appendix: Downloading custom files via rsync\n",
    "\n",
    "The HTRC Extracted Features (EF) dataset is accessible using *rsync*, a Unix command line program for syncing files. It comes preinstalled on Linux and Mac OS. Windows users can get *rsync* by installing a program such as [Cygwin](https://cygwin.com/), which provides a Unix-like command line environment in Windows.\n",
    "\n",
    "To download all *1.3 TB* comprising the EF dataset, you can use this command (be aware the full transfer will take a very long time):\n",
    "\n",
    "```bash\n",
    "rsync -rv data.sharc.hathitrust.org::pd-features/basic/ .\n",
    "```\n",
    "\n",
    "This recurses (the `-r` flag) through all the folders on the HTRC server and syncs all the files to a location on your system (in this case `.`, meaning \"the current folder\"). The `-v` flag means `--verbose`, which simply gives you more information.\n",
    "You can also sync individual files by inputting a full file path. A list of all file paths is available, and can be downloaded with:\n",
    "\n",
    "```bash\n",
    "rsync -azv data.sharc.hathitrust.org::pd-features/listing/pd-basic-file-listing.txt .\n",
    "```\n",
    "\n",
    " It is also possible to sync a subset of files defined in a text file. The Feature Reader Library [has a document describing steps to compile such a list](https://github.com/htrc/htrc-feature-reader/blob/master/examples/ID_to_Rsync_Link.ipynb)."
   ],
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   }
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "name": "python3",
   "language": "python"
  },
  "language_info": {
   "mimetype": "text/x-python",
   "nbconvert_exporter": "python",
   "name": "python",
   "file_extension": ".py",
   "version": "3.5.1",
   "pygments_lexer": "ipython3",
   "codemirror_mode": {
    "version": 3,
    "name": "ipython"
   }
  }
 }
}