Quick word counts with Counter. Code examples released under CC0 https://creativecommons.org/choose/zero/, other text released under CC BY 4.0 https://creativecommons.org/licenses/by/4.0/
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Quick word counts with Counter objects\n",
"\n",
"By [Allison Parrish](http://www.decontextualize.com/)\n",
"\n",
"We discussed how to count the number of times a word occurs in a text the \"old fashioned\" way, using dictionaries. But counting how many times something occurs is a very common task in programming, so Python includes a special kind of object—a `Counter`—to make the task easier.\n",
"\n",
"To use the `Counter` object, you need to import it from the `collections` module."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from collections import Counter"
]
},
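{
"cell_type": "markdown",
"metadata": {},
"source": [
"To get a feel for what a `Counter` does, here's what it makes of a tiny made-up list of strings (the fruit names and the `demo` variable are just an illustration, not part of the text we'll count below):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# a tiny made-up list, just to show what Counter produces\n",
"demo = Counter([\"apple\", \"orange\", \"apple\", \"pear\", \"apple\"])\n",
"demo"
]
},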
{
"cell_type": "markdown",
"metadata": {},
"source": [
"What are we going to count? Let's read in the contents of the first chapter of Genesis from the King James Version of the Bible. ([Download it from here](http://rwet.decontextualize.com/texts/genesis.txt) and make sure it's in the same directory as your Python script.)"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"text = open(\"genesis.txt\").read() # read the entire file in as a string\n",
"words = text.split() # split it up into words"
]
},
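{
"cell_type": "markdown",
"metadata": {},
"source": [
"It can be worth a quick sanity check that the file loaded as expected; for example, you might look at how many words we ended up with and what the first few look like:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"print len(words) # how many whitespace-separated words did we get?\n",
"words[:8] # ... and a peek at the first few"
]
},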
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, create a `Counter` object by calling `Counter()` with the list of things you want to count:"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"count = Counter(words)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"That's all you need to do! The `Counter` object has counted up all of the items in the list and stored them with their frequencies. Evaluating the entire object shows its contents: "
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"Counter({'And': 33,\n",
" 'Be': 2,\n",
" 'Behold,': 1,\n",
" 'Day,': 1,\n",
" 'Earth;': 1,\n",
" 'God': 32,\n",
" 'Heaven.': 1,\n",
" 'I': 2,\n",
" 'In': 1,\n",
" 'Let': 8,\n",
" 'Night.': 1,\n",
" 'Seas:': 1,\n",
" 'So': 1,\n",
" 'Spirit': 1,\n",
" 'a': 2,\n",
" 'above': 2,\n",
" 'abundantly': 1,\n",
" 'abundantly,': 1,\n",
" 'after': 11,\n",
" 'air,': 3,\n",
" 'all': 2,\n",
" 'also.': 1,\n",
" 'and': 63,\n",
" 'and,': 1,\n",
" 'appear:': 1,\n",
" 'be': 7,\n",
" 'bearing': 1,\n",
" 'beast': 3,\n",
" 'beginning': 1,\n",
" 'behold,': 1,\n",
" 'blessed': 2,\n",
" 'bring': 3,\n",
" 'brought': 2,\n",
" 'called': 5,\n",
" 'cattle': 1,\n",
" 'cattle,': 2,\n",
" 'created': 5,\n",
" 'creature': 3,\n",
" 'creepeth': 3,\n",
" 'creeping': 2,\n",
" 'darkness': 2,\n",
" 'darkness.': 1,\n",
" 'darkness:': 1,\n",
" 'day': 2,\n",
" 'day,': 1,\n",
" 'day.': 6,\n",
" 'days,': 1,\n",
" 'deep.': 1,\n",
" 'divide': 3,\n",
" 'divided': 2,\n",
" 'dominion': 2,\n",
" 'dry': 2,\n",
" 'earth': 8,\n",
" 'earth,': 6,\n",
" 'earth.': 4,\n",
" 'earth:': 2,\n",
" 'evening': 6,\n",
" 'every': 12,\n",
" 'face': 3,\n",
" 'female': 1,\n",
" 'fifth': 1,\n",
" 'fill': 1,\n",
" 'firmament': 7,\n",
" 'firmament,': 1,\n",
" 'firmament:': 1,\n",
" 'first': 1,\n",
" 'fish': 2,\n",
" 'fly': 1,\n",
" 'for': 6,\n",
" 'form,': 1,\n",
" 'forth': 5,\n",
" 'fourth': 1,\n",
" 'fowl': 6,\n",
" 'from': 5,\n",
" 'fruit': 3,\n",
" 'fruit,': 1,\n",
" 'fruitful,': 2,\n",
" 'gathered': 1,\n",
" 'gathering': 1,\n",
" 'give': 2,\n",
" 'given': 2,\n",
" 'good.': 6,\n",
" 'good:': 1,\n",
" 'grass,': 2,\n",
" 'great': 2,\n",
" 'greater': 1,\n",
" 'green': 1,\n",
" 'had': 1,\n",
" 'hath': 1,\n",
" 'have': 4,\n",
" 'he': 6,\n",
" 'heaven': 5,\n",
" 'heaven.': 1,\n",
" 'herb': 4,\n",
" 'him;': 1,\n",
" 'his': 9,\n",
" 'image': 1,\n",
" 'image,': 2,\n",
" 'in': 13,\n",
" 'is': 4,\n",
" 'it': 15,\n",
" 'it:': 1,\n",
" 'itself,': 2,\n",
" 'kind,': 6,\n",
" 'kind:': 4,\n",
" 'land': 2,\n",
" 'lesser': 1,\n",
" 'let': 6,\n",
" 'life,': 2,\n",
" 'light': 7,\n",
" 'light,': 1,\n",
" 'light.': 1,\n",
" 'light:': 1,\n",
" 'lights': 2,\n",
" 'lights;': 1,\n",
" 'likeness:': 1,\n",
" 'living': 3,\n",
" 'made': 4,\n",
" 'made,': 1,\n",
" 'make': 1,\n",
" 'male': 1,\n",
" 'man': 2,\n",
" 'may': 1,\n",
" 'meat.': 1,\n",
" 'meat:': 1,\n",
" 'midst': 1,\n",
" 'morning': 6,\n",
" 'moved': 1,\n",
" 'moveth': 1,\n",
" 'moveth,': 1,\n",
" 'moving': 1,\n",
" 'multiply': 1,\n",
" 'multiply,': 2,\n",
" 'night,': 1,\n",
" 'night:': 1,\n",
" 'night;': 1,\n",
" 'of': 20,\n",
" 'one': 1,\n",
" 'open': 1,\n",
" 'our': 2,\n",
" 'over': 10,\n",
" 'own': 1,\n",
" 'place,': 1,\n",
" 'replenish': 1,\n",
" 'rule': 3,\n",
" 'said': 1,\n",
" 'said,': 9,\n",
" 'saw': 7,\n",
" 'saying,': 1,\n",
" 'sea,': 2,\n",
" 'seas,': 1,\n",
" 'seasons,': 1,\n",
" 'second': 1,\n",
" 'seed': 3,\n",
" 'seed,': 2,\n",
" 'seed;': 1,\n",
" 'set': 1,\n",
" 'shall': 1,\n",
" 'signs,': 1,\n",
" 'sixth': 1,\n",
" 'so.': 6,\n",
" 'stars': 1,\n",
" 'subdue': 1,\n",
" 'that': 14,\n",
" 'the': 108,\n",
" 'their': 2,\n",
" 'them': 4,\n",
" 'them,': 3,\n",
" 'them.': 1,\n",
" 'there': 5,\n",
" 'thing': 5,\n",
" 'thing,': 1,\n",
" 'third': 1,\n",
" 'to': 11,\n",
" 'together': 2,\n",
" 'tree': 3,\n",
" 'tree,': 1,\n",
" 'two': 1,\n",
" 'under': 2,\n",
" 'unto': 2,\n",
" 'upon': 10,\n",
" 'us': 1,\n",
" 'very': 1,\n",
" 'void;': 1,\n",
" 'was': 17,\n",
" 'waters': 8,\n",
" 'waters,': 1,\n",
" 'waters.': 2,\n",
" 'were': 8,\n",
" 'whales,': 1,\n",
" 'wherein': 1,\n",
" 'which': 5,\n",
" 'whose': 2,\n",
" 'winged': 1,\n",
" 'without': 1,\n",
" 'years:': 1,\n",
" 'yielding': 5,\n",
" 'you': 2})"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"count"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can use the object as a dictionary to get the count for a particular value:"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"8"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"count['earth']"
]
},
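{
"cell_type": "markdown",
"metadata": {},
"source": [
"Unlike a regular dictionary, a `Counter` doesn't raise an error when you ask about something it has never seen; it simply reports a count of zero. (The word below is just an example of something that doesn't occur in this chapter.)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"count['dinosaur'] # a word that doesn't appear in the text, so the count is 0"
]
},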
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `Counter` object makes it easy to get only the most common values using the `.most_common()` method. The parameter you pass to the method determines how many items it returns, from most to least common:"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"[('the', 108), ('and', 63), ('And', 33), ('God', 32), ('of', 20)]"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"count.most_common(5)"
]
},
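{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you call `.most_common()` with no argument, it returns *every* item, ordered from most to least common. Since the result is just a list, you can slice it however you like; for example, the end of the list holds the least common items:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# with no argument, .most_common() returns the whole vocabulary, most frequent first;\n",
"# slicing from the end of that list gives the least common items\n",
"count.most_common()[-5:]"
]
},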
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can iterate over this using a `for` loop to print out just the words:"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"the\n",
"and\n",
"And\n",
"God\n",
"of\n",
"was\n",
"it\n",
"that\n",
"in\n",
"every\n"
]
}
],
"source": [
"for word, number in count.most_common(10):\n",
" print word"
]
},
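{
"cell_type": "markdown",
"metadata": {},
"source": [
"Each item in that list is a (word, count) tuple, which is why the loop unpacks two variables. If you want to see the counts as well, you can print both values:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"for word, number in count.most_common(10):\n",
"    print word, number"
]
},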
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Improving the counts\n",
"\n",
"The word counts returned from this procedure are a little bit weird, in that (a) they count instances of the same word with different cases separately (i.e., \"And\" and \"and\"); and (b) words with punctuation at the end are counted separately from words with no punctuation (i.e., \"day\" and \"day,\").\n",
"\n",
"To fix this problem, we want to \"clean\" our list of words. In this case, \"cleaning\" will consist of:\n",
"\n",
"* Convert all the words to lower case\n",
"* Remove punctuation from the end of the string.\n",
"\n",
"On their own, these are easy operations. The `.lower()` method of a string returns a copy of the string with all letters in lower case:"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"'hello there, bob'"
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"\"Hello there, Bob\".lower()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And the \"strip\" method can be used to remove punctuation from the end of a string, like so:"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"'Okay'"
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"\"Okay,\".strip(\",.;:\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Combine them into a single expression like so:"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"'okay'"
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"\"Okay,\".lower().strip(\",.;:\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"What we need to do is take the list of words and create a new list of words with these transformations applied. You can write this very succinctly with a list comprehension, but let's do it \"long hand\" so it's easier to understand:"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"clean_words = []\n",
"for item in words:\n",
" cleaned = item.lower().strip(\",.;:\")\n",
" clean_words.append(cleaned)"
]
},
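{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference, here's the same transformation written as a list comprehension; it builds the same `clean_words` list in a single line:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# the one-line equivalent of the loop above\n",
"clean_words = [item.lower().strip(\",.;:\") for item in words]"
]
},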
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Pass these to a new `Counter` object and here's the result:"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"count = Counter(clean_words)"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"Counter({'a': 2,\n",
" 'above': 2,\n",
" 'abundantly': 2,\n",
" 'after': 11,\n",
" 'air': 3,\n",
" 'all': 2,\n",
" 'also': 1,\n",
" 'and': 97,\n",
" 'appear': 1,\n",
" 'be': 9,\n",
" 'bearing': 1,\n",
" 'beast': 3,\n",
" 'beginning': 1,\n",
" 'behold': 2,\n",
" 'blessed': 2,\n",
" 'bring': 3,\n",
" 'brought': 2,\n",
" 'called': 5,\n",
" 'cattle': 3,\n",
" 'created': 5,\n",
" 'creature': 3,\n",
" 'creepeth': 3,\n",
" 'creeping': 2,\n",
" 'darkness': 4,\n",
" 'day': 10,\n",
" 'days': 1,\n",
" 'deep': 1,\n",
" 'divide': 3,\n",
" 'divided': 2,\n",
" 'dominion': 2,\n",
" 'dry': 2,\n",
" 'earth': 21,\n",
" 'evening': 6,\n",
" 'every': 12,\n",
" 'face': 3,\n",
" 'female': 1,\n",
" 'fifth': 1,\n",
" 'fill': 1,\n",
" 'firmament': 9,\n",
" 'first': 1,\n",
" 'fish': 2,\n",
" 'fly': 1,\n",
" 'for': 6,\n",
" 'form': 1,\n",
" 'forth': 5,\n",
" 'fourth': 1,\n",
" 'fowl': 6,\n",
" 'from': 5,\n",
" 'fruit': 4,\n",
" 'fruitful': 2,\n",
" 'gathered': 1,\n",
" 'gathering': 1,\n",
" 'give': 2,\n",
" 'given': 2,\n",
" 'god': 32,\n",
" 'good': 7,\n",
" 'grass': 2,\n",
" 'great': 2,\n",
" 'greater': 1,\n",
" 'green': 1,\n",
" 'had': 1,\n",
" 'hath': 1,\n",
" 'have': 4,\n",
" 'he': 6,\n",
" 'heaven': 7,\n",
" 'herb': 4,\n",
" 'him': 1,\n",
" 'his': 9,\n",
" 'i': 2,\n",
" 'image': 3,\n",
" 'in': 14,\n",
" 'is': 4,\n",
" 'it': 16,\n",
" 'itself': 2,\n",
" 'kind': 10,\n",
" 'land': 2,\n",
" 'lesser': 1,\n",
" 'let': 14,\n",
" 'life': 2,\n",
" 'light': 10,\n",
" 'lights': 3,\n",
" 'likeness': 1,\n",
" 'living': 3,\n",
" 'made': 5,\n",
" 'make': 1,\n",
" 'male': 1,\n",
" 'man': 2,\n",
" 'may': 1,\n",
" 'meat': 2,\n",
" 'midst': 1,\n",
" 'morning': 6,\n",
" 'moved': 1,\n",
" 'moveth': 2,\n",
" 'moving': 1,\n",
" 'multiply': 3,\n",
" 'night': 4,\n",
" 'of': 20,\n",
" 'one': 1,\n",
" 'open': 1,\n",
" 'our': 2,\n",
" 'over': 10,\n",
" 'own': 1,\n",
" 'place': 1,\n",
" 'replenish': 1,\n",
" 'rule': 3,\n",
" 'said': 10,\n",
" 'saw': 7,\n",
" 'saying': 1,\n",
" 'sea': 2,\n",
" 'seas': 2,\n",
" 'seasons': 1,\n",
" 'second': 1,\n",
" 'seed': 6,\n",
" 'set': 1,\n",
" 'shall': 1,\n",
" 'signs': 1,\n",
" 'sixth': 1,\n",
" 'so': 7,\n",
" 'spirit': 1,\n",
" 'stars': 1,\n",
" 'subdue': 1,\n",
" 'that': 14,\n",
" 'the': 108,\n",
" 'their': 2,\n",
" 'them': 8,\n",
" 'there': 5,\n",
" 'thing': 6,\n",
" 'third': 1,\n",
" 'to': 11,\n",
" 'together': 2,\n",
" 'tree': 4,\n",
" 'two': 1,\n",
" 'under': 2,\n",
" 'unto': 2,\n",
" 'upon': 10,\n",
" 'us': 1,\n",
" 'very': 1,\n",
" 'void': 1,\n",
" 'was': 17,\n",
" 'waters': 11,\n",
" 'were': 8,\n",
" 'whales': 1,\n",
" 'wherein': 1,\n",
" 'which': 5,\n",
" 'whose': 2,\n",
" 'winged': 1,\n",
" 'without': 1,\n",
" 'years': 1,\n",
" 'yielding': 5,\n",
" 'you': 2})"
]
},
"execution_count": 43,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"count"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Removing stopwords\n",
"\n",
"The ten most common words are, according to the `Counter` object:"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"[('the', 108),\n",
" ('and', 97),\n",
" ('god', 32),\n",
" ('earth', 21),\n",
" ('of', 20),\n",
" ('was', 17),\n",
" ('it', 16),\n",
" ('let', 14),\n",
" ('that', 14),\n",
" ('in', 14)]"
]
},
"execution_count": 45,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"count.most_common(10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Intuitively, it seems strange to count these words like \"the\" and \"and\" among the \"most common,\" because words like these are presumably common across *all* texts, not just this text in particular. To solve this problem, we can use \"stopwords\": a list of commonly-occurring English words that shouldn't be counted for the purpose of word frequency. No one exactly agrees on what this list should be, but here's one attempt (from [here](https://gist.github.com/sebleier/554280)):"
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"stopwords = [\n",
" \"i\",\n",
" \"me\",\n",
" \"my\",\n",
" \"myself\",\n",
" \"we\",\n",
" \"our\",\n",
" \"ours\",\n",
" \"ourselves\",\n",
" \"you\",\n",
" \"your\",\n",
" \"yours\",\n",
" \"yourself\",\n",
" \"yourselves\",\n",
" \"he\",\n",
" \"him\",\n",
" \"his\",\n",
" \"himself\",\n",
" \"she\",\n",
" \"her\",\n",
" \"hers\",\n",
" \"herself\",\n",
" \"it\",\n",
" \"its\",\n",
" \"itself\",\n",
" \"they\",\n",
" \"them\",\n",
" \"their\",\n",
" \"theirs\",\n",
" \"themselves\",\n",
" \"what\",\n",
" \"which\",\n",
" \"who\",\n",
" \"whom\",\n",
" \"this\",\n",
" \"that\",\n",
" \"these\",\n",
" \"those\",\n",
" \"am\",\n",
" \"is\",\n",
" \"are\",\n",
" \"was\",\n",
" \"were\",\n",
" \"be\",\n",
" \"been\",\n",
" \"being\",\n",
" \"have\",\n",
" \"has\",\n",
" \"had\",\n",
" \"having\",\n",
" \"do\",\n",
" \"does\",\n",
" \"did\",\n",
" \"doing\",\n",
" \"a\",\n",
" \"an\",\n",
" \"the\",\n",
" \"and\",\n",
" \"but\",\n",
" \"if\",\n",
" \"or\",\n",
" \"because\",\n",
" \"as\",\n",
" \"until\",\n",
" \"while\",\n",
" \"of\",\n",
" \"at\",\n",
" \"by\",\n",
" \"for\",\n",
" \"with\",\n",
" \"about\",\n",
" \"against\",\n",
" \"between\",\n",
" \"into\",\n",
" \"through\",\n",
" \"during\",\n",
" \"before\",\n",
" \"after\",\n",
" \"above\",\n",
" \"below\",\n",
" \"to\",\n",
" \"from\",\n",
" \"up\",\n",
" \"down\",\n",
" \"in\",\n",
" \"out\",\n",
" \"on\",\n",
" \"off\",\n",
" \"over\",\n",
" \"under\",\n",
" \"again\",\n",
" \"further\",\n",
" \"then\",\n",
" \"once\",\n",
" \"here\",\n",
" \"there\",\n",
" \"when\",\n",
" \"where\",\n",
" \"why\",\n",
" \"how\",\n",
" \"all\",\n",
" \"any\",\n",
" \"both\",\n",
" \"each\",\n",
" \"few\",\n",
" \"more\",\n",
" \"most\",\n",
" \"other\",\n",
" \"some\",\n",
" \"such\",\n",
" \"no\",\n",
" \"nor\",\n",
" \"not\",\n",
" \"only\",\n",
" \"own\",\n",
" \"same\",\n",
" \"so\",\n",
" \"than\",\n",
" \"too\",\n",
" \"very\",\n",
" \"s\",\n",
" \"t\",\n",
" \"can\",\n",
" \"will\",\n",
" \"just\",\n",
" \"don\",\n",
" \"should\",\n",
" \"now\"\n",
"]"
]
},
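{
"cell_type": "markdown",
"metadata": {},
"source": [
"(A quick aside: checking whether a word is `in` a long list scans the whole list each time. That's fine for a short text like this one, but for much larger texts you might first turn the list into a `set`, since membership tests on sets are much faster. The name `stopword_set` below is just for illustration; the code that follows sticks with the plain list.)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# optional: membership tests with 'in' are much faster on a set than on a list\n",
"stopword_set = set(stopwords)\n",
"\"the\" in stopword_set"
]
},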
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make use of this list, we'll revise our loop from before. In addition to converting the strings to lower case and removing punctuation, we'll only add a word to the list if it isn't present in the `stopwords` list. Like so:"
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"clean_words = []\n",
"for item in words:\n",
" cleaned = item.lower().strip(\",.;:\")\n",
" if cleaned not in stopwords:\n",
" clean_words.append(cleaned)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's see what the most common words are:"
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"[('god', 32),\n",
" ('earth', 21),\n",
" ('let', 14),\n",
" ('every', 12),\n",
" ('waters', 11),\n",
" ('day', 10),\n",
" ('said', 10),\n",
" ('kind', 10),\n",
" ('upon', 10),\n",
" ('light', 10)]"
]
},
"execution_count": 51,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"count = Counter(clean_words)\n",
"count.most_common(10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Much better!"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.6"
}
},
"nbformat": 4,
"nbformat_minor": 0
}