@jasonost
Created September 12, 2014 21:28
{
"worksheets": [
{
"cells": [
{
"metadata": {},
"cell_type": "markdown",
"source": "This notebook has answers to the question for the 9/15 assignment on tokenizing your text collection"
},
{
"metadata": {},
"cell_type": "markdown",
"source": "#### Describe your text collection\n\nMy text comes from the \"detailed description\" field of the ClinicalTrials.gov database compiled and hosted by the [Clinical Trials Transformation Initiative](http://www.ctti-clinicaltrials.org/what-we-do/analysis-dissemination/state-clinical-trials/aact-database) (CTTI). In the database documentation, this field is described as\n\n> Extended description of the protocol, including more technical information (as compared to the Brief Summary) if desired. Do not include the entire protocol; do not duplicate information recorded in other data elements, such as eligibility criteria or outcome measures.\n\nThe original database entries contained line breaks in this field, and while cleaning the data for our information visualization project in Spring 2013, we replaced all line breaks with spaces and replaced all sequences of two or more spaces with a single space. There is no additional markup, although there are special characters (such as bullet points) in the text.\n\nThere are over 90,000 records in the database with a detailed description, totalling over 21,000,000 tokens. For this class, I randomly selected 2,000 descriptions between 100 and 1,000 characters in length, producing a corpus that is 3.5MB."
},
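{
"metadata": {},
"cell_type": "markdown",
"source": "For reference, the whitespace cleanup described above might look something like the sketch below. This is only an illustration: `clean_description` is a hypothetical helper, not the code actually run on the database in Spring 2013."
},
{
"metadata": {},
"cell_type": "code",
"input": "# hypothetical sketch of the cleanup described above (not the original code)\nimport re\n\ndef clean_description(text):\n    # replace line breaks with spaces, then collapse runs of two or more spaces\n    text = text.replace('\\r', ' ').replace('\\n', ' ')\n    return re.sub(' {2,}', ' ', text)\n\n# toy example showing the effect of the cleanup\nprint clean_description('Background:\\n\\nThis  is a   toy description.')",
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
},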
{
"metadata": {},
"cell_type": "markdown",
"source": "#### Initial computation\n\nLoad the text and report the length of the string"
},
{
"metadata": {},
"cell_type": "code",
"input": "raw = open('ct_desc_jason.txt').read()\nprint len(raw)",
"prompt_number": 1,
"outputs": [
{
"output_type": "stream",
"text": "3718448\n",
"stream": "stdout"
}
],
"language": "python",
"trusted": true,
"collapsed": false
},
{
"metadata": {},
"cell_type": "markdown",
"source": "Split the string on spaces and report the number of tokens when using this method"
},
{
"metadata": {},
"cell_type": "code",
"input": "tokens = raw.split(' ')\nprint len(tokens)",
"prompt_number": 2,
"outputs": [
{
"output_type": "stream",
"text": "567276\n",
"stream": "stdout"
}
],
"language": "python",
"trusted": true,
"collapsed": false
}
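,
{
"metadata": {},
"cell_type": "markdown",
"source": "For comparison only (not part of the count reported above): calling `str.split()` with no argument splits on any run of whitespace and discards empty strings, so it can give a slightly different token count than splitting on single spaces."
},
{
"metadata": {},
"cell_type": "code",
"input": "# sketch: split on any whitespace run instead of single spaces (comparison only)\nws_tokens = raw.split()\nprint len(ws_tokens)",
"outputs": [],
"language": "python",
"trusted": true,
"collapsed": false
}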
],
"metadata": {}
}
],
"metadata": {
"name": "",
"signature": "sha256:d7a739b653943ff3b0853a500c684d03cae55c23940687bcf71db3adee1a90e2"
},
"nbformat": 3
}