@jasonost
Created September 12, 2014 21:28
{
"worksheets": [
{
"cells": [
{
"metadata": {},
"cell_type": "markdown",
"source": "This notebook has answers to the question for the 9/15 assignment on tokenizing your text collection"
},
{
"metadata": {},
"cell_type": "markdown",
"source": "#### Describe your text collection\n\nMy text comes from the \"detailed description\" field of the ClinicalTrials.gov database compiled and hosted by the [Clinical Trials Transformation Initiative](http://www.ctti-clinicaltrials.org/what-we-do/analysis-dissemination/state-clinical-trials/aact-database) (CTTI). In the database documentation, this field is described as\n\n> Extended description of the protocol, including more technical information (as compared to the Brief Summary) if desired. Do not include the entire protocol; do not duplicate information recorded in other data elements, such as eligibility criteria or outcome measures.\n\nThe original database entries contained line breaks in this field, and while cleaning the data for our information visualization project in Spring 2013, we replaced all line breaks with spaces and replaced all sequences of two or more spaces with a single space. There is no additional markup, although there are special characters (such as bullet points) in the text.\n\nThere are over 90,000 records in the database with a detailed description, totalling over 21,000,000 tokens. For this class, I randomly selected 2,000 descriptions between 100 and 1,000 characters in length, producing a corpus that is 3.5MB."
},
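{
"metadata": {},
"cell_type": "markdown",
"source": "For reference, the whitespace cleanup described above might look something like the sketch below. This is only an illustration: `clean_description` is a hypothetical helper, not the code actually run on the database in Spring 2013."
},
{
"metadata": {},
"cell_type": "code",
"input": "# hypothetical sketch of the cleanup described above (not the original code)\nimport re\n\ndef clean_description(text):\n    # replace line breaks with spaces, then collapse runs of two or more spaces\n    text = text.replace('\\r', ' ').replace('\\n', ' ')\n    return re.sub(' {2,}', ' ', text)\n\n# toy example showing the effect of the cleanup\nprint clean_description('Background:\\n\\nThis  is a   toy description.')",
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
},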
{
"metadata": {},
"cell_type": "markdown",
"source": "#### Initial computation\n\nLoad the text and report the length of the string"
},
{
"metadata": {},
"cell_type": "code",
"input": "raw = open('ct_desc_jason.txt').read()\nprint len(raw)",
"prompt_number": 1,
"outputs": [
{
"output_type": "stream",
"text": "3718448\n",
"stream": "stdout"
}
],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "markdown",
"source": "Split the string on spaces and report the number of tokens when using this method"
},
{
"metadata": {},
"cell_type": "code",
"input": "tokens = raw.split(' ')\nprint len(tokens)",
"prompt_number": 2,
"outputs": [
{
"output_type": "stream",
"text": "567276\n",
"stream": "stdout"
}
],
"language": "python",
"trusted": true,
"collapsed": false
}
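,
{
"metadata": {},
"cell_type": "markdown",
"source": "For comparison only (not part of the count reported above): calling `str.split()` with no argument splits on any run of whitespace and discards empty strings, so it can give a slightly different token count than splitting on single spaces."
},
{
"metadata": {},
"cell_type": "code",
"input": "# sketch: split on any whitespace run instead of single spaces (comparison only)\nws_tokens = raw.split()\nprint len(ws_tokens)",
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
}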
],
"metadata": {}
}
],
"metadata": {
"name": "",
"signature": "sha256:d7a739b653943ff3b0853a500c684d03cae55c23940687bcf71db3adee1a90e2"
},
"nbformat": 3
}