revised activity for Chapter 2: replication and extension of Michel et al 2011

[very hard, data collection, requires coding, my favorite] In a widely discussed paper, Michel and colleagues [-@michel_quantitative_2011] analyzed the content of more than 5 million digitized books in an attempt to identify long-term cultural trends. The data that they used has now been released as the Google NGrams dataset, and so we can use the data to replicate and extend some of their work.

In one of the many results in the paper, Michel and colleagues argue that we are forgetting faster and faster. For a particular year, say "1883", they calculated the proportion of 1-grams published in each year between 1875 and 1975 that were "1883". They reasoned that this proportion is a measure of the interest in events that happened in that year. In Fig 3a they plot the usage trajectories for three years: 1883, 1910, and 1950. These three years share a common pattern: little use before that year, then a spike, then decay. Next, to quantify the rate of decay for each year, Michel and colleagues calculate the "half-life" of each year for all years between 1875 and 1975. In Fig 3a (inset) they show that the half-life of each year is decreasing, and they argue that this means that we are forgetting the past faster and faster. Michel and colleagues used Version 1 of the English corpus, but Google has subsequently released a second version of the corpus. Please read all the parts of the question before you begin coding.

This activity will give you practice writing reusable code, interpreting results, and data wrangling (such as working with awkward files and handling missing data). This activity will also help you get up and running with a rich and interesting dataset.

a) Get the raw data from the Google Books NGram Viewer website. In particular, you should use version 2 of the English language corpus, which was released on July 1, 2012. Uncompressed, this file is 1.4GB.
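Because the file is large, it can help to stream it and keep only the rows you need rather than loading everything into memory. The sketch below is one possible approach, not a reference solution. It assumes each line has the tab-separated layout `ngram, year, match_count, volume_count`; check this against the documentation on the download page. The function names and the sample rows are made up for illustration.

```python
import gzip
import io
from collections import defaultdict


def read_year_mentions(lines, targets):
    """Collect {target_year: {publication_year: match_count}} from 1-gram lines.

    Assumed line layout: ngram <TAB> year <TAB> match_count <TAB> volume_count
    """
    wanted = {str(t) for t in targets}
    counts = defaultdict(dict)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Keep only exact matches such as "1883"; skip variants like "1883_NUM".
        if len(fields) < 3 or fields[0] not in wanted:
            continue
        counts[int(fields[0])][int(fields[1])] = int(fields[2])
    return dict(counts)


def read_year_mentions_from_gz(path, targets):
    """Same as above, but streams a gzip-compressed file at a local path."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return read_year_mentions(handle, targets)


# Made-up rows in the assumed layout, for illustration only.
sample = io.StringIO(
    "1883\t1900\t120\t80\n"
    "1883\t1901\t100\t70\n"
    "1883_NUM\t1900\t5\t5\n"
    "1950\t1960\t300\t200\n"
)
mentions = read_year_mentions(sample, [1883, 1950])
print(mentions)
```

For part (f) you would pass every year from 1875 to 1975 as `targets`, so the file only has to be read once.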

b) Recreate the main part of Fig 3a of @michel_quantitative_2011. To recreate this figure you will need two files: the one you downloaded in part (a) and the "total counts" file, which you can use to convert the raw counts into proportions. Note that the total counts file has a structure that may make it a bit hard to read in. Does Version 2 of the NGram data produce similar results to those presented in @michel_quantitative_2011, which are based on Version 1 data?
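One way to handle the total counts file is sketched below. It assumes the file is a single long line of tab-separated records, each of the form `year,match_count,page_count,volume_count`; inspect your copy of the file to confirm this before relying on it. The function names and sample values are hypothetical.

```python
def parse_total_counts(text):
    """Return {year: total_match_count}.

    Assumed layout: tab-separated records on one line, each record being
    year,match_count,page_count,volume_count.
    """
    totals = {}
    for record in text.split("\t"):
        record = record.strip()
        if not record:  # skip empty pieces at the start or end of the line
            continue
        year, match_count = record.split(",")[:2]
        totals[int(year)] = int(match_count)
    return totals


def to_proportions(counts, totals):
    """Divide raw counts by yearly totals; years with no mentions become 0.0."""
    return {
        year: counts.get(year, 0) / totals[year]
        for year in sorted(totals)
        if totals[year] > 0
    }


# Made-up values in the assumed layout, for illustration only.
sample_text = " \t1900,1000,10,5\t1901,2000,20,8\t\n"
totals = parse_total_counts(sample_text)
proportions = to_proportions({1900: 10}, totals)
print(totals)
print(proportions)
```

Treating a missing year as zero mentions is a design choice. You may prefer to leave such years as missing values and decide later how to plot them.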

c) Now check your graph against the graph created by the NGram Viewer.

d) Recreate Fig 3a (main figure) but change the y-axis to be the raw mention count (not the rate of mentions).

e) Does the difference between (b) and (d) lead you to reevaluate any of the results of Michel et al. (2011)? Why or why not?

f) Now, using the proportion of mentions, replicate the inset of Fig 3a. That is, for each year between 1875 and 1975, calculate the half-life of that year. The half-life is defined to be the number of years that pass before the proportion of mentions reaches half its peak value. Note that @michel_quantitative_2011 do something more complicated to estimate the half-life---see Sec III.6 of the Supporting Online Information---but they claim that both approaches produce similar results. Does Version 2 of the NGram data produce similar results to those presented in @michel_quantitative_2011, which are based on Version 1 data? (Hint: Don't be surprised if it doesn't.)
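The simple half-life definition above can be sketched as a small reusable function. This version rests on two assumptions that the definition leaves open: the clock starts at the peak year, and the half-life is the first later year in which the proportion is at or below half the peak. It is not the more elaborate procedure from the Supporting Online Information. The trajectory values are made up.

```python
def half_life(proportions):
    """Years from the peak until the proportion first falls to half the peak.

    `proportions` maps publication year -> proportion of mentions.
    Returns None if the series never drops to half its peak.
    """
    years = sorted(proportions)
    peak_year = max(years, key=lambda y: proportions[y])
    threshold = proportions[peak_year] / 2
    for year in years:
        if year > peak_year and proportions[year] <= threshold:
            return year - peak_year
    return None


# Made-up trajectory: the peak is in 1884 and half the peak is 2.0.
trajectory = {1883: 2.0, 1884: 4.0, 1885: 3.0, 1886: 2.5, 1887: 1.9, 1888: 1.0}
result = half_life(trajectory)
print(result)
```

Returning `None` when the threshold is never reached makes truncated series easy to spot, which matters for years near 1975 whose decay may extend past the end of the data.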

g) Were there any years that were outliers such as years that were forgotten particularly quickly or particularly slowly? Briefly speculate about possible reasons for that pattern and explain how you identified the outliers.
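There are many reasonable ways to identify outliers, and the choice is part of the exercise. As one illustration, the sketch below fits a straight-line trend to half-life by year using ordinary least squares and flags years whose residual is more than `k` residual standard deviations from the line. The linear trend, the threshold `k=2.0`, and the function name are all assumptions, and the data are made up.

```python
from statistics import mean, pstdev


def flag_outliers(half_lives, k=2.0):
    """Return {year: residual} for years far from a least-squares trend line.

    `half_lives` maps year -> half-life in years.
    """
    years = sorted(half_lives)
    x_bar = mean(years)
    y_bar = mean(half_lives[y] for y in years)
    sxx = sum((y - x_bar) ** 2 for y in years)
    slope = sum((y - x_bar) * (half_lives[y] - y_bar) for y in years) / sxx
    intercept = y_bar - slope * x_bar
    residuals = {y: half_lives[y] - (intercept + slope * y) for y in years}
    spread = pstdev(residuals.values())
    return {
        y: r for y, r in residuals.items() if spread > 0 and abs(r) > k * spread
    }


# Made-up data: a steady decline, with 1905 remembered unusually long.
example = {year: 30 - (year - 1900) for year in range(1900, 1911)}
example[1905] = 60
outliers = flag_outliers(example)
print(sorted(outliers))
```

A positive residual marks a year that was forgotten more slowly than the trend suggests, and a negative residual marks one forgotten more quickly.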

h) Now replicate this result for version 2 of the NGrams data in Chinese, French, German, Hebrew, Italian, Russian and Spanish.

i) Comparing across all languages, were there any years that were outliers, such as years that were forgotten particularly quickly or particularly slowly? Briefly speculate about possible reasons for that pattern.
