@iangow
Created April 11, 2021 13:12
Some ideas for word_count.
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"from nltk.tokenize import TweetTokenizer\n",
"from nltk.tokenize import word_tokenize\n",
"from nltk.tokenize import sent_tokenize\n",
"import re"
]
},
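{
"cell_type": "markdown",
"metadata": {},
"source": [
"(A setup note: `word_tokenize` and `sent_tokenize` rely on NLTK's `punkt` tokenizer models. On a fresh NLTK install, something like the following should fetch them.)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# One-time download of the 'punkt' sentence/word tokenizer models;\n",
"# safe to re-run, as nltk reports when the data is already up to date.\n",
"import nltk\n",
"nltk.download('punkt')"
]
},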
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This would be the minimal approach to the issue of punctuation: simply not counting tokens that consist *entirely* of punctuation."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"# Note that I added \"?\" to the end of this expression.\n",
"punctuation_regex = r'[.,\\/#!$%\\'\\^&\\*;:{}=\\-_`~()?]'\n",
"\n",
"def _is_punctuation(token):\n",
" match = re.fullmatch(punctuation_regex, token)\n",
" return match is not None\n",
"\n",
"def word_count(sent):\n",
" words = word_tokenize(sent)\n",
" words_clean = [t for t in words if not _is_punctuation(t)]\n",
" print(words_clean)\n",
" return len(words_clean)"
]
},
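{
"cell_type": "markdown",
"metadata": {},
"source": [
"One caveat worth adding to the test corpus: because `punctuation_regex` matches a *single* character, multi-character punctuation tokens that `word_tokenize` keeps whole (e.g., `...` or `--`) would slip past `_is_punctuation` and inflate the count. (The sample sentence below is made up for illustration.)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# The '--' token survives the filter because the anchored pattern\n",
"# matches exactly one character. Allowing repeats, e.g.\n",
"# re.match(r'^' + punctuation_regex + r'+$', token),\n",
"# would catch runs of punctuation as well.\n",
"word_count(\"Wait -- is that really the plan?\")"
]
},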
{
"cell_type": "markdown",
"metadata": {},
"source": [
"I think we need to create a corpus of sample texts to test: examples from the ABCL monograph, examples from prior papers on fog, and other test cases. This corpus will be how we evaluate alternative approaches."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"sample_texts = [\n",
" \"Tom's idea is crazy, isn't it? Bob has a better plan.\",\n",
" \"We're currently in the process of planning the next locations while taking state regulations into consideration.\"\n",
"]"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"def text_word_count(text):\n",
" sents = sent_tokenize(text)\n",
" return sum([word_count(sent) for sent in sents])"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['Tom', \"'s\", 'idea', 'is', 'crazy', 'is', \"n't\", 'it']\n",
"['Bob', 'has', 'a', 'better', 'plan']\n",
"13\n",
"['We', \"'re\", 'currently', 'in', 'the', 'process', 'of', 'planning', 'the', 'next', 'locations', 'while', 'taking', 'state', 'regulations', 'into', 'consideration']\n",
"17\n"
]
}
],
"source": [
"for text in sample_texts:\n",
" print(text_word_count(text))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In general, NLTK treats `we're` as two words because grammatically it is equivalent to `we are`, a pronoun followed by a verb. But for the purpose of counting words, this grammatical approach makes less sense. It also makes less sense for `Tom's` above, which is better understood as one word (though in some languages `'s` *would* be a separate \"word\", e.g., `di` in Italian or `的` in Chinese).\n",
"\n",
"So in the following, I create a \"clean\" sentence without punctuation and then count words in that."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['Toms', 'idea', 'is', 'crazy', 'isnt', 'it']\n",
"['Bob', 'has', 'a', 'better', 'plan']\n",
"11\n",
"['Were', 'currently', 'in', 'the', 'process', 'of', 'planning', 'the', 'next', 'locations', 'while', 'taking', 'state', 'regulations', 'into', 'consideration']\n",
"16\n"
]
}
],
"source": [
"def clean_sent(sent):\n",
" return re.sub(punctuation_regex, \"\", sent)\n",
" \n",
"def text_word_count(text):\n",
" sents = sent_tokenize(text)\n",
" clean_sents = [clean_sent(sent) for sent in sents]\n",
" return sum([word_count(sent) for sent in clean_sents])\n",
"\n",
"for text in sample_texts:\n",
" print(text_word_count(text))"
]
}
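,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since the redefinition above shadowed the first `text_word_count`, comparing the two approaches over the corpus requires the earlier version under a distinct name (`text_word_count_raw` is a made-up name here):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Re-create the first (token-filtering) counter so the two\n",
"# approaches can be compared side by side on sample_texts.\n",
"def text_word_count_raw(text):\n",
"    sents = sent_tokenize(text)\n",
"    return sum(word_count(sent) for sent in sents)\n",
"\n",
"for text in sample_texts:\n",
"    print(text_word_count_raw(text), text_word_count(text))"
]
}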
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.2"
}
},
"nbformat": 4,
"nbformat_minor": 4
}