Skip to content

Instantly share code, notes, and snippets.

Show Gist options
  • Save juanshishido/261eede4fc846e96af57 to your computer and use it in GitHub Desktop.
Save juanshishido/261eede4fc846e96af57 to your computer and use it in GitHub Desktop.
Display the source blob
Display the rendered blob
Raw
{
"cells": [
{
"metadata": {},
"cell_type": "markdown",
"source": "# Imports"
},
{
"metadata": {
"trusted": true,
"collapsed": false
},
"cell_type": "code",
"source": "%pprint\n\nimport re\n\nfrom nltk.book import *\nfrom collections import defaultdict",
"execution_count": 1,
"outputs": [
{
"output_type": "stream",
"text": "Pretty printing has been turned OFF\n*** Introductory Examples for the NLTK Book ***\nLoading text1, ..., text9 and sent1, ..., sent9\nType the name of the text or sentence to view it.\nType: 'texts()' or 'sents()' to list the materials.\ntext1: Moby Dick by Herman Melville 1851\ntext2: Sense and Sensibility by Jane Austen 1811\ntext3: The Book of Genesis\ntext4: Inaugural Address Corpus\ntext5: Chat Corpus\ntext6: Monty Python and the Holy Grail\ntext7: Wall Street Journal\ntext8: Personals Corpus\ntext9: The Man Who Was Thursday by G . K . Chesterton 1908\n",
"name": "stdout"
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "# Normalize"
},
{
"metadata": {
"trusted": true,
"collapsed": false
},
"cell_type": "code",
"source": "text4[:50]",
"execution_count": 2,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": "['Fellow', '-', 'Citizens', 'of', 'the', 'Senate', 'and', 'of', 'the', 'House', 'of', 'Representatives', ':', 'Among', 'the', 'vicissitudes', 'incident', 'to', 'life', 'no', 'event', 'could', 'have', 'filled', 'me', 'with', 'greater', 'anxieties', 'than', 'that', 'of', 'which', 'the', 'notification', 'was', 'transmitted', 'by', 'your', 'order', ',', 'and', 'received', 'on', 'the', '14th', 'day', 'of', 'the', 'present', 'month']"
},
"metadata": {},
"execution_count": 2
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "## Lowercase and Remove Punctuation"
},
{
"metadata": {
"trusted": true,
"collapsed": false
},
"cell_type": "code",
"source": "text4_clean = [re.sub('[^A-Za-z]+', '', w.lower()) for w in text4]\ntext4_clean = [w for w in text4_clean if w != '']",
"execution_count": 3,
"outputs": []
},
{
"metadata": {},
"cell_type": "markdown",
"source": "# Questions"
},
{
"metadata": {},
"cell_type": "markdown",
"source": "1. Number of words\n2. Number of unique words\n3. Average word length\n4. Longest word"
},
{
"metadata": {},
"cell_type": "markdown",
"source": "**Number of words**"
},
{
"metadata": {
"trusted": true,
"collapsed": false
},
"cell_type": "code",
"source": "len(text4_clean)",
"execution_count": 4,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": "132147"
},
"metadata": {},
"execution_count": 4
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "**Number of unique words**"
},
{
"metadata": {
"trusted": true,
"collapsed": false
},
"cell_type": "code",
"source": "len(set(text4_clean))",
"execution_count": 5,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": "8970"
},
"metadata": {},
"execution_count": 5
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "**Average word length**"
},
{
"metadata": {
"trusted": true,
"collapsed": false
},
"cell_type": "code",
"source": "word_lengths = defaultdict(int)\nfor w in set(text4_clean):\n word_lengths[w] = len(w)",
"execution_count": 6,
"outputs": []
},
{
"metadata": {
"trusted": true,
"collapsed": false
},
"cell_type": "code",
"source": "sum(word_lengths.values()) / len(word_lengths)",
"execution_count": 7,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": "7.822408026755853"
},
"metadata": {},
"execution_count": 7
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "**Longest word(s)**"
},
{
"metadata": {
"trusted": true,
"collapsed": false
},
"cell_type": "code",
"source": "max_word_len = max(word_lengths.values())",
"execution_count": 8,
"outputs": []
},
{
"metadata": {
"trusted": true,
"collapsed": false
},
"cell_type": "code",
"source": "print(max_word_len, ':', [k for k, v in word_lengths.items() if v == max_word_len])",
"execution_count": 9,
"outputs": [
{
"output_type": "stream",
"text": "17 : ['instrumentalities', 'misrepresentation', 'antiphilosophists', 'contradistinction']\n",
"name": "stdout"
}
]
}
],
"metadata": {
"kernelspec": {
"name": "python3",
"display_name": "Python 3",
"language": "python"
},
"language_info": {
"codemirror_mode": {
"version": 3,
"name": "ipython"
},
"mimetype": "text/x-python",
"version": "3.4.2",
"file_extension": ".py",
"nbconvert_exporter": "python",
"name": "python",
"pygments_lexer": "ipython3"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment