Some ideas for word_count.
{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "from nltk.tokenize import TweetTokenizer\n",
    "from nltk.tokenize import word_tokenize\n",
    "from nltk.tokenize import sent_tokenize\n",
    "import re"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This would be the minimal approach to the issue of punctuation: simply not counting tokens that are *entirely* punctuation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Note that I added \"?\" to the end of this expression.\n",
    "punctuation_regex = r'[.,\\/#!$%\\'\\^&\\*;:{}=\\-_`~()?]'\n",
    "\n",
    "def _is_punctuation(token):\n",
    "    match = re.match(r'^' + punctuation_regex + r'$', token)\n",
    "    return match is not None\n",
    "\n",
    "def word_count(sent):\n",
    "    words = word_tokenize(sent)\n",
    "    words_clean = [t for t in words if not _is_punctuation(t)]\n",
    "    print(words_clean)\n",
    "    return len(words_clean)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "I think we need to create a corpus of sample texts for testing: examples from the ABCL monograph, examples from prior papers on fog, and other test cases. This corpus would be how we evaluate alternative approaches; a rough sketch of what that evaluation might look like follows the first counting example below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "sample_texts = [\n",
    "    \"Tom's idea is crazy, isn't it? Bob has a better plan.\",\n",
    "    \"We're currently in the process of planning the next locations while taking state regulations into consideration.\"\n",
    "]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "def text_word_count(text):\n",
    "    sents = sent_tokenize(text)\n",
    "    return sum([word_count(sent) for sent in sents])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['Tom', \"'s\", 'idea', 'is', 'crazy', 'is', \"n't\", 'it']\n",
      "['Bob', 'has', 'a', 'better', 'plan']\n",
      "13\n",
      "['We', \"'re\", 'currently', 'in', 'the', 'process', 'of', 'planning', 'the', 'next', 'locations', 'while', 'taking', 'state', 'regulations', 'into', 'consideration']\n",
      "17\n"
     ]
    }
   ],
   "source": [
    "for text in sample_texts:\n",
    "    print(text_word_count(text))"
   ]
  },
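  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A rough sketch of how the evaluation corpus might be used (this cell has not been run): pair each sample text with the word count we *want* an approach to produce, then compare approaches against those targets. The expected counts of 11 and 16 below are only my guesses at the intuitive counts for the two sample texts; they are assumptions to be revised, not settled answers."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Hypothetical expected counts for sample_texts (my guesses; revise as needed).\n",
    "expected_counts = [11, 16]\n",
    "\n",
    "def evaluate(counter, texts, expected):\n",
    "    # Run a candidate word-counting function over the corpus and report\n",
    "    # whether it matches the expected count for each text.\n",
    "    for text, want in zip(texts, expected):\n",
    "        got = counter(text)\n",
    "        status = 'OK' if got == want else 'MISMATCH'\n",
    "        print(status, 'got:', got, 'want:', want)\n",
    "\n",
    "evaluate(text_word_count, sample_texts, expected_counts)"
   ]
  },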
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In general, NLTK will view `we're` as two words, because grammatically this is equivalent to `we are`, which is a pronoun followed by a verb. But for the purpose of counting words, this grammatical approach makes less sense. It also makes less sense for `Tom's` above, which is better understood as one word (though in some languages the equivalent of `'s` *would* be a separate \"word\", e.g., `di` in Italian or `的` in Chinese).\n",
    "\n",
    "So in the following, I create a \"clean\" sentence without punctuation and then count words in that."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['Toms', 'idea', 'is', 'crazy', 'isnt', 'it']\n",
      "['Bob', 'has', 'a', 'better', 'plan']\n",
      "11\n",
      "['Were', 'currently', 'in', 'the', 'process', 'of', 'planning', 'the', 'next', 'locations', 'while', 'taking', 'state', 'regulations', 'into', 'consideration']\n",
      "16\n"
     ]
    }
   ],
   "source": [
    "def clean_sent(sent):\n",
    "    return re.sub(punctuation_regex, \"\", sent)\n",
    "\n",
    "def text_word_count(text):\n",
    "    sents = sent_tokenize(text)\n",
    "    clean_sents = [clean_sent(sent) for sent in sents]\n",
    "    return sum([word_count(sent) for sent in clean_sents])\n",
    "\n",
    "for text in sample_texts:\n",
    "    print(text_word_count(text))"
   ]
  },
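  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One more idea worth testing (this cell has not been run): `TweetTokenizer` was imported at the top but not used above. My understanding is that it keeps contractions such as `isn't` and possessives such as `Tom's` as single tokens, so it might give the intuitive counts without stripping punctuation from the sentence first; this should be verified against the test corpus. A minimal sketch, reusing `_is_punctuation` from above:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "tweet_tokenizer = TweetTokenizer()\n",
    "\n",
    "def word_count_tweet(sent):\n",
    "    # Same as word_count above, but with TweetTokenizer in place of word_tokenize.\n",
    "    words = tweet_tokenizer.tokenize(sent)\n",
    "    words_clean = [t for t in words if not _is_punctuation(t)]\n",
    "    print(words_clean)\n",
    "    return len(words_clean)\n",
    "\n",
    "def text_word_count_tweet(text):\n",
    "    sents = sent_tokenize(text)\n",
    "    return sum([word_count_tweet(sent) for sent in sents])\n",
    "\n",
    "for text in sample_texts:\n",
    "    print(text_word_count_tweet(text))"
   ]
  }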
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}