Some ideas for word_count.
{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "from nltk.tokenize import TweetTokenizer\n",
    "from nltk.tokenize import word_tokenize\n",
    "from nltk.tokenize import sent_tokenize\n",
    "import re"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This would be the minimal approach to the issue of punctuation: simply not counting tokens that are *entirely* punctuation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Note that I added \"?\" to the end of this expression.\n",
    "punctuation_regex = r'[.,\\/#!$%\\'\\^&\\*;:{}=\\-_`~()?]'\n",
    "\n",
    "def _is_punctuation(token):\n",
    "    match = re.match(r'^' + punctuation_regex + r'$', token)\n",
    "    return match is not None\n",
    "\n",
    "def word_count(sent):\n",
    "    words = word_tokenize(sent)\n",
    "    words_clean = [t for t in words if not _is_punctuation(t)]\n",
    "    print(words_clean)\n",
    "    return len(words_clean)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "I think we need to create a corpus of sample texts for testing: examples from the ABCL monograph, examples from prior papers on fog, and other test cases. This corpus would be how we evaluate alternative approaches; a rough sketch of what that evaluation might look like follows the first counting example below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "sample_texts = [\n",
    "    \"Tom's idea is crazy, isn't it? Bob has a better plan.\",\n",
    "    \"We're currently in the process of planning the next locations while taking state regulations into consideration.\"\n",
    "]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "def text_word_count(text):\n",
    "    sents = sent_tokenize(text)\n",
    "    return sum([word_count(sent) for sent in sents])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['Tom', \"'s\", 'idea', 'is', 'crazy', 'is', \"n't\", 'it']\n",
      "['Bob', 'has', 'a', 'better', 'plan']\n",
      "13\n",
      "['We', \"'re\", 'currently', 'in', 'the', 'process', 'of', 'planning', 'the', 'next', 'locations', 'while', 'taking', 'state', 'regulations', 'into', 'consideration']\n",
      "17\n"
     ]
    }
   ],
   "source": [
    "for text in sample_texts:\n",
    "    print(text_word_count(text))"
   ]
  },
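  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A rough sketch of how the evaluation corpus might be used (this cell has not been run): pair each sample text with the word count we *want* an approach to produce, then compare approaches against those targets. The expected counts of 11 and 16 below are only my guesses at the intuitive counts for the two sample texts; they are assumptions to be revised, not settled answers."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Hypothetical expected counts for sample_texts (my guesses; revise as needed).\n",
    "expected_counts = [11, 16]\n",
    "\n",
    "def evaluate(counter, texts, expected):\n",
    "    # Run a candidate word-counting function over the corpus and report\n",
    "    # whether it matches the expected count for each text.\n",
    "    for text, want in zip(texts, expected):\n",
    "        got = counter(text)\n",
    "        status = 'OK' if got == want else 'MISMATCH'\n",
    "        print(status, 'got:', got, 'want:', want)\n",
    "\n",
    "evaluate(text_word_count, sample_texts, expected_counts)"
   ]
  },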
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In general, NLTK will view `we're` as two words, because grammatically this is equivalent to `we are`, which is a pronoun followed by a verb. But for the purpose of counting words, this grammatical approach makes less sense. It also makes less sense for `Tom's` above, which is better understood as one word (though in some languages the equivalent of `'s` *would* be a separate \"word\", e.g., `di` in Italian or `的` in Chinese).\n",
    "\n",
    "So in the following, I create a \"clean\" sentence without punctuation and then count words in that."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['Toms', 'idea', 'is', 'crazy', 'isnt', 'it']\n",
      "['Bob', 'has', 'a', 'better', 'plan']\n",
      "11\n",
      "['Were', 'currently', 'in', 'the', 'process', 'of', 'planning', 'the', 'next', 'locations', 'while', 'taking', 'state', 'regulations', 'into', 'consideration']\n",
      "16\n"
     ]
    }
   ],
   "source": [
    "def clean_sent(sent):\n",
    "    return re.sub(punctuation_regex, \"\", sent)\n",
    "\n",
    "def text_word_count(text):\n",
    "    sents = sent_tokenize(text)\n",
    "    clean_sents = [clean_sent(sent) for sent in sents]\n",
    "    return sum([word_count(sent) for sent in clean_sents])\n",
    "\n",
    "for text in sample_texts:\n",
    "    print(text_word_count(text))"
   ]
  },
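  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One more idea worth testing (this cell has not been run): `TweetTokenizer` was imported at the top but not used above. My understanding is that it keeps contractions such as `isn't` and possessives such as `Tom's` as single tokens, so it might give the intuitive counts without stripping punctuation from the sentence first; this should be verified against the test corpus. A minimal sketch, reusing `_is_punctuation` from above:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "tweet_tokenizer = TweetTokenizer()\n",
    "\n",
    "def word_count_tweet(sent):\n",
    "    # Same as word_count above, but with TweetTokenizer in place of word_tokenize.\n",
    "    words = tweet_tokenizer.tokenize(sent)\n",
    "    words_clean = [t for t in words if not _is_punctuation(t)]\n",
    "    print(words_clean)\n",
    "    return len(words_clean)\n",
    "\n",
    "def text_word_count_tweet(text):\n",
    "    sents = sent_tokenize(text)\n",
    "    return sum([word_count_tweet(sent) for sent in sents])\n",
    "\n",
    "for text in sample_texts:\n",
    "    print(text_word_count_tweet(text))"
   ]
  }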
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}