shawngraham/gist:7efd64c08a94c39a593f

## gistfile1.rmd
---
title: "Topic Modeling the Colonial Newspaper Database"
author: "Shawn Graham"
date: "February 17, 2015"
output: html_document
---

In [Module 3](https://github.com/hist3907b-winter2015/module3-wranglingdata), we used TEI to mark up primary documents. Melodee Beals has been using TEI to markup newspaper articles, creating the [Colonial Newspapers Database](https://github.com/mhbeals/Colonial-Newspaper-Database) (which she shared on github). We then used Github Pages and an XLST stylesheet to convert that database into a table of comma-separated values <https://raw.githubusercontent.com/shawngraham/exercise/gh-pages/CND.csv>. We are now going to topic model the text of those newspaper articles, to see what patterns of discourse may lie within.

# Getting Started

First we need to set up our workspace. On my machine, I've created a folder on my desktop called 'Beals' to keep all of this neat and tidy.

```{r}
setwd("C:\\Users\\Shawn Graham\\Desktop\\Beals")
# give yourself as much memory as you've got
options(java.parameters = "-Xmx5120m")
#install.packages('rJava')
library(rJava)
## from http://cran.r-project.org/web/packages/mallet/mallet.pdf
#install.packages('mallet')
library(mallet)
```

The first line sets our working directory to that new folder. The next line increases the available memory for java (since our topic modeling tool, MALLET, uses Java). Then we tell R Studio to use the rJava and the Mallet libraries. *If* you were running that code above for the first time, you would remove the # in front of the 'install.packages' command (use ctrl+f to find every line with 'install' on it, and remove the #). Copy the code above into a new R script document in R studio. Put the cursor in the first line; hit ctrl+enter to run the line through the console.

# Getting the data

Now we want to tell R Studio to grab our data from our github page. The thing is, R Studio can easily grab materials from websites where the url is http; but when it is https (as it is with github), things get a bit more fussy. So what we do is use a special package to grab the data, and then shove it into a variable that we can then tease apart for our analysis.

```{r}
#install.packages('RCurl')
library(RCurl)
x <- getURL("https://raw.githubusercontent.com/shawngraham/exercise/gh-pages/CND.csv", .opts = list(ssl.verifypeer = FALSE))

#on a mac, the final bit of that line beginning with .opts was not necessary

documents <- read.csv(text = x, col.names=c("Article_ID", "Newspaper Title", "Newspaper City", "Newspaper Province", "Newspaper Country", "Year", "Month", "Day", "Article Type", "Text", "Keywords"), colClasses=rep("character", 3), sep=",", quote="")
```

Remember how we used 'curl' in one of our scripts in Module 3 to grab data from the Canadiana.org api? RCurl is its doppleganger within R. So, we create a variable called 'x' and we pass to it the Colonial Newspaper Database as csv. Then, we read the contents of that csv file and tell R what the structure of the data is. All of this gets passed to a variable called 'documents'. Incidentally, if we then wanted to look only at the keywords in this table, we would do that by querying 'documents$Keywords'. The $ is quite handy!

Now, let's take a look at our data.

```{r}
counts <- table(documents$Newspaper.City)
barplot(counts, main="Cities", xlab="Number of Articles")

```

Clearly, we're getting an Edinburgh/Glasgow perspective on things. And somewhere in our data, there's a mispelled 'Edinbugh'. Do you see any other error(s) in the plot? How would you correct it(them)?

```{r}
years <- table(documents$Year)
barplot(years, main="Publication Year", xlab="Year", ylab="Number of Articles")
```

There's a lot of material in 1789, another peak around 1819, againg in the late 1830s. We can ask ourselves now: is this an artefact of the data, or of our collection methods? This would be a question a reviewer would want answers to. Let's assume for now that these two plots are 'true' - that, for whatever reasons, only Edinburgh and Glasgow were concerned with these colonial reports, and that they were _particulary_ interested during those three periods. This is already an interesting question that we as historians would want to explore. Try making some more visualizations like this of other aspects of the data. What other patterns do you see that are worth investigating? Here's some [extra help on plotting in R](http://www.statmethods.net/advgraphs/ggplot2.html).

This quick distant look now necessitates some close reading - and back again!

# A distant read

Let's look at the words themselves in our data (and, as an aside, you might read [this by David Mimno](http://www.mimno.org/articles/phrases/))

```{r}
mallet.instances <- mallet.import(documents$Article_ID, documents$Text, "C:\\Mallet\\stoplists\\en.txt", token.regexp = "\\p{L}[\\p{L}\\p{P}]+\\p{L}")

#when you direct to a path on a windows machine, you need to use \\ rather than one \.
```

That line above passes the article ID and the _text_ of our newspaper articles to the Mallet routine. Note also the path to my *stoplist*. That's a generic one I use all the time, when I'm first playing with topic models. You might want to create your own, one for each project given the particulars of your project. Here's a stoplist you can use <http://bridge.library.wisc.edu/jockersStopList.txt> Download that, and save it into your working directory. Then, in the block above where I've got the location of my stoplist, just put "jockersStopList.txt" and R Studio will look for it in your working directory. Note that Jockers compiled this stoplist for his research in literary history of the 19th century. Your mileage may vary! Finally, the last bit after 'token.regexp' applies a regular expression against our newspaper articles, cleaning them up.

```{r}
#set the number of desired topics
num.topics <- 50
topic.model <- MalletLDA(num.topics)
```

Now we've told Mallet how many topics to search for; this is a number you'd want to fiddle with, to find the 'best' number of topics. The next line creates a variable 'topic.model' which will eventually be filled by Mallet using the LDA approach, for 50 topics. Let's get some info on our topic model, on our distribution of words in these newspaper articles.

```{r}
topic.model$loadDocuments(mallet.instances)
## Get the vocabulary, and some statistics about word frequencies.
## These may be useful in further curating the stopword list.
vocabulary <- topic.model$getVocabulary()
word.freqs <- mallet.word.freqs(topic.model)
head(word.freqs)
write.csv(word.freqs, "cnd-word-freqs.csv" )
```

As the comment indicates, if we take a look at word frequencies right now, before we generate the model, we can see if there are words that would be better added to our stopword list. The final line writes these word frequencies to a new csv file. Do you see how you might create a bar plot of word frequencies?

Now we do the heavy lifting: generating a topic model:

```{r}
## Optimize hyperparameters every 20 iterations,
## after 50 burn-in iterations.
topic.model$setAlphaOptimization(20, 50)
## Now train a model. Note that hyperparameter optimization is on, by default.
## We can specify the number of iterations. Here we'll use a large-ish round number.
topic.model$train(1000)
## Run through a few iterations where we pick the best topic for each token,
## rather than sampling from the posterior distribution.
topic.model$maximize(10)
## Get the probability of topics in documents and the probability of words in topics.
## By default, these functions return raw word counts. Here we want probabilities,
## so we normalize, and add "smoothing" so that nothing has exactly 0 probability.
doc.topics <- mallet.doc.topics(topic.model, smoothed=T, normalized=T)
topic.words <- mallet.topic.words(topic.model, smoothed=T, normalized=T)
```

Let's look at some of our topics.

```{r}
## What are the top words in topic 7?
## Notice that R indexes from 1, so this will be the topic that mallet called topic 6.
mallet.top.words(topic.model, topic.words[7,])
```

Now we'll write the distribution of the topics by document (ie newspaper article) to a csv file that we could explore/visualize with other tools. Then, we'll take a look at the key words describing each topic.

```{r}
topic.docs <- t(doc.topics)
topic.docs <- topic.docs / rowSums(topic.docs)
write.csv(topic.docs, "cnd-topics-docs.csv" )
## Get a vector containing short names for the topics
topics.labels <- rep("", num.topics)
for (topic in 1:num.topics) topics.labels[topic] <- paste(mallet.top.words(topic.model, topic.words[topic,], num.top.words=5)$words, collapse=" ")
# have a look at keywords for each topic
topics.labels
write.csv(topics.labels, "cnd-topics-labels.csv")
```

Some interesting patterns suggest themselves already! But a list of words doesn't capture the relative importance of particular words in particular topics. A word might appear in more than one topic, for instance, but *really* dominate one rather than the other. Word-clouds, that much-maligned technique, is actually rather useful in this regard. Let's visualize these topic words.

```{r}
### do word clouds of the topics
#install.packages('wordcloud')
library(wordcloud)
for(i in 1:num.topics){
  topic.top.words <- mallet.top.words(topic.model,
                                      topic.words[i,], 25)
  print(wordcloud(topic.top.words$words,
                  topic.top.words$weights,
                  c(4,.8), rot.per=0,
                  random.order=F))
}
```

Note that when you run all this in RStudio, your plots will appear in the 'plots' window in the bottom right quadrant of the interface, and you can export them as png or pdf. If you export as pdf, you can then open them in a vector graphics program like [Inkscape](https://inkscape.org/en/) or Illustrator, and modify the individual elements to make a more pleasing visual.

Finally, let's do some clustering. We can cluster topics based on how they share words in your corpus:

```{r}
## cluster based on shared words
plot(hclust(dist(topic.words)), labels=topics.labels)
```

The plot is a bit crowded; in RStudio you can open it in a new window to see the dendrogram more clearly.

Now, if we want to get really fancy, we can make a network visualization of how topics interlink due to their distribution in documents. The next bit of code does that, and saves in .graphml format, which packages like Gephi <http://gephi.org> can read.

```{r}
topic_docs <- data.frame(topic.docs)
names(topic_docs) <- documents$article_id

#install.packages("cluster")
library(cluster)
topic_df_dist <- as.matrix(daisy(t(topic_docs), metric = "euclidean", stand = TRUE))
# Change row values to zero if less than row minimum plus row standard deviation
# keep only closely related documents and avoid a dense spagetti diagram
# that's difficult to interpret (hat-tip: http://stackoverflow.com/a/16047196/1036500)
topic_df_dist[ sweep(topic_df_dist, 1, (apply(topic_df_dist,1,min) + apply(topic_df_dist,1,sd) )) > 0 ] <- 0

#install.packages("igraph")
library(igraph)
g <- as.undirected(graph.adjacency(topic_df_dist))
layout1 <- layout.fruchterman.reingold(g, niter=500)
plot(g, layout=layout1, edge.curved = TRUE, vertex.size = 1, vertex.color= "grey", edge.arrow.size = 0, vertex.label.dist=0.5, vertex.label = NA)

write.graph(g, file="cnd.graphml", format="graphml")
```

# Conclusion

There are many ways of visualizing and transforming our data. This document only captures a small fraction of the kinds of things you could do. Another good exploration is at <http://bridge.library.wisc.edu/hw1a-Rcoding-Jockers.html>. Ben Marwick does really fun things with the Day of Archaeology blog posts <https://github.com/benmarwick/dayofarchaeology> and indeed, some of the code above comes from Marwick's explorations. Keep your R scripts in your open notebook, and somebody might come along and use them, cite them, improve them, share them! Keep also all your data. Here's an example from my own work <https://github.com/shawngraham/ferguson>.
	---
	title: "Topic Modeling the Colonial Newspaper Database"
	author: "Shawn Graham"
	date: "February 17, 2015"
	output: html_document
	---

	In [Module 3](https://github.com/hist3907b-winter2015/module3-wranglingdata), we used TEI to mark up primary documents. Melodee Beals has been using TEI to markup newspaper articles, creating the [Colonial Newspapers Database](https://github.com/mhbeals/Colonial-Newspaper-Database) (which she shared on github). We then used Github Pages and an XLST stylesheet to convert that database into a table of comma-separated values <https://raw.githubusercontent.com/shawngraham/exercise/gh-pages/CND.csv>. We are now going to topic model the text of those newspaper articles, to see what patterns of discourse may lie within.

	# Getting Started

	First we need to set up our workspace. On my machine, I've created a folder on my desktop called 'Beals' to keep all of this neat and tidy.

	```{r}
	setwd("C:\\Users\\Shawn Graham\\Desktop\\Beals")
	# give yourself as much memory as you've got
	options(java.parameters = "-Xmx5120m")
	#install.packages('rJava')
	library(rJava)
	## from http://cran.r-project.org/web/packages/mallet/mallet.pdf
	#install.packages('mallet')
	library(mallet)
	```

	The first line sets our working directory to that new folder. The next line increases the available memory for java (since our topic modeling tool, MALLET, uses Java). Then we tell R Studio to use the rJava and the Mallet libraries. If you were running that code above for the first time, you would remove the # in front of the 'install.packages' command (use ctrl+f to find every line with 'install' on it, and remove the #). Copy the code above into a new R script document in R studio. Put the cursor in the first line; hit ctrl+enter to run the line through the console.

	# Getting the data

	Now we want to tell R Studio to grab our data from our github page. The thing is, R Studio can easily grab materials from websites where the url is http; but when it is https (as it is with github), things get a bit more fussy. So what we do is use a special package to grab the data, and then shove it into a variable that we can then tease apart for our analysis.

	```{r}
	#install.packages('RCurl')
	library(RCurl)
	x <- getURL("https://raw.githubusercontent.com/shawngraham/exercise/gh-pages/CND.csv", .opts = list(ssl.verifypeer = FALSE))

	#on a mac, the final bit of that line beginning with .opts was not necessary

	documents <- read.csv(text = x, col.names=c("Article_ID", "Newspaper Title", "Newspaper City", "Newspaper Province", "Newspaper Country", "Year", "Month", "Day", "Article Type", "Text", "Keywords"), colClasses=rep("character", 3), sep=",", quote="")
	```

	Remember how we used 'curl' in one of our scripts in Module 3 to grab data from the Canadiana.org api? RCurl is its doppleganger within R. So, we create a variable called 'x' and we pass to it the Colonial Newspaper Database as csv. Then, we read the contents of that csv file and tell R what the structure of the data is. All of this gets passed to a variable called 'documents'. Incidentally, if we then wanted to look only at the keywords in this table, we would do that by querying 'documents$Keywords'. The $ is quite handy!

	Now, let's take a look at our data.

	```{r}
	counts <- table(documents$Newspaper.City)
	barplot(counts, main="Cities", xlab="Number of Articles")

	```

	Clearly, we're getting an Edinburgh/Glasgow perspective on things. And somewhere in our data, there's a mispelled 'Edinbugh'. Do you see any other error(s) in the plot? How would you correct it(them)?

	```{r}
	years <- table(documents$Year)
	barplot(years, main="Publication Year", xlab="Year", ylab="Number of Articles")
	```

	There's a lot of material in 1789, another peak around 1819, againg in the late 1830s. We can ask ourselves now: is this an artefact of the data, or of our collection methods? This would be a question a reviewer would want answers to. Let's assume for now that these two plots are 'true' - that, for whatever reasons, only Edinburgh and Glasgow were concerned with these colonial reports, and that they were _particulary_ interested during those three periods. This is already an interesting question that we as historians would want to explore. Try making some more visualizations like this of other aspects of the data. What other patterns do you see that are worth investigating? Here's some [extra help on plotting in R](http://www.statmethods.net/advgraphs/ggplot2.html).

	This quick distant look now necessitates some close reading - and back again!

	# A distant read

	Let's look at the words themselves in our data (and, as an aside, you might read [this by David Mimno](http://www.mimno.org/articles/phrases/))

	```{r}
	mallet.instances <- mallet.import(documents$Article_ID, documents$Text, "C:\\Mallet\\stoplists\\en.txt", token.regexp = "\\p{L}[\\p{L}\\p{P}]+\\p{L}")

	#when you direct to a path on a windows machine, you need to use \\ rather than one \.
	```

	That line above passes the article ID and the _text_ of our newspaper articles to the Mallet routine. Note also the path to my stoplist. That's a generic one I use all the time, when I'm first playing with topic models. You might want to create your own, one for each project given the particulars of your project. Here's a stoplist you can use <http://bridge.library.wisc.edu/jockersStopList.txt> Download that, and save it into your working directory. Then, in the block above where I've got the location of my stoplist, just put "jockersStopList.txt" and R Studio will look for it in your working directory. Note that Jockers compiled this stoplist for his research in literary history of the 19th century. Your mileage may vary! Finally, the last bit after 'token.regexp' applies a regular expression against our newspaper articles, cleaning them up.

	```{r}
	#set the number of desired topics
	num.topics <- 50
	topic.model <- MalletLDA(num.topics)
	```

	Now we've told Mallet how many topics to search for; this is a number you'd want to fiddle with, to find the 'best' number of topics. The next line creates a variable 'topic.model' which will eventually be filled by Mallet using the LDA approach, for 50 topics. Let's get some info on our topic model, on our distribution of words in these newspaper articles.

	```{r}
	topic.model$loadDocuments(mallet.instances)
	## Get the vocabulary, and some statistics about word frequencies.
	## These may be useful in further curating the stopword list.
	vocabulary <- topic.model$getVocabulary()
	word.freqs <- mallet.word.freqs(topic.model)
	head(word.freqs)
	write.csv(word.freqs, "cnd-word-freqs.csv" )
	```

	As the comment indicates, if we take a look at word frequencies right now, before we generate the model, we can see if there are words that would be better added to our stopword list. The final line writes these word frequencies to a new csv file. Do you see how you might create a bar plot of word frequencies?

	Now we do the heavy lifting: generating a topic model:

	```{r}
	## Optimize hyperparameters every 20 iterations,
	## after 50 burn-in iterations.
	topic.model$setAlphaOptimization(20, 50)
	## Now train a model. Note that hyperparameter optimization is on, by default.
	## We can specify the number of iterations. Here we'll use a large-ish round number.
	topic.model$train(1000)
	## Run through a few iterations where we pick the best topic for each token,
	## rather than sampling from the posterior distribution.
	topic.model$maximize(10)
	## Get the probability of topics in documents and the probability of words in topics.
	## By default, these functions return raw word counts. Here we want probabilities,
	## so we normalize, and add "smoothing" so that nothing has exactly 0 probability.
	doc.topics <- mallet.doc.topics(topic.model, smoothed=T, normalized=T)
	topic.words <- mallet.topic.words(topic.model, smoothed=T, normalized=T)
	```

	Let's look at some of our topics.

	```{r}
	## What are the top words in topic 7?
	## Notice that R indexes from 1, so this will be the topic that mallet called topic 6.
	mallet.top.words(topic.model, topic.words[7,])
	```

	Now we'll write the distribution of the topics by document (ie newspaper article) to a csv file that we could explore/visualize with other tools. Then, we'll take a look at the key words describing each topic.

	```{r}
	topic.docs <- t(doc.topics)
	topic.docs <- topic.docs / rowSums(topic.docs)
	write.csv(topic.docs, "cnd-topics-docs.csv" )
	## Get a vector containing short names for the topics
	topics.labels <- rep("", num.topics)
	for (topic in 1:num.topics) topics.labels[topic] <- paste(mallet.top.words(topic.model, topic.words[topic,], num.top.words=5)$words, collapse=" ")
	# have a look at keywords for each topic
	topics.labels
	write.csv(topics.labels, "cnd-topics-labels.csv")
	```

	Some interesting patterns suggest themselves already! But a list of words doesn't capture the relative importance of particular words in particular topics. A word might appear in more than one topic, for instance, but really dominate one rather than the other. Word-clouds, that much-maligned technique, is actually rather useful in this regard. Let's visualize these topic words.

	```{r}
	### do word clouds of the topics
	#install.packages('wordcloud')
	library(wordcloud)
	for(i in 1:num.topics){
	topic.top.words <- mallet.top.words(topic.model,
	topic.words[i,], 25)
	print(wordcloud(topic.top.words$words,
	topic.top.words$weights,
	c(4,.8), rot.per=0,
	random.order=F))
	}
	```

	Note that when you run all this in RStudio, your plots will appear in the 'plots' window in the bottom right quadrant of the interface, and you can export them as png or pdf. If you export as pdf, you can then open them in a vector graphics program like [Inkscape](https://inkscape.org/en/) or Illustrator, and modify the individual elements to make a more pleasing visual.

	Finally, let's do some clustering. We can cluster topics based on how they share words in your corpus:

	```{r}
	## cluster based on shared words
	plot(hclust(dist(topic.words)), labels=topics.labels)
	```

	The plot is a bit crowded; in RStudio you can open it in a new window to see the dendrogram more clearly.

	Now, if we want to get really fancy, we can make a network visualization of how topics interlink due to their distribution in documents. The next bit of code does that, and saves in .graphml format, which packages like Gephi <http://gephi.org> can read.

	```{r}
	topic_docs <- data.frame(topic.docs)
	names(topic_docs) <- documents$article_id

	#install.packages("cluster")
	library(cluster)
	topic_df_dist <- as.matrix(daisy(t(topic_docs), metric = "euclidean", stand = TRUE))
	# Change row values to zero if less than row minimum plus row standard deviation
	# keep only closely related documents and avoid a dense spagetti diagram
	# that's difficult to interpret (hat-tip: http://stackoverflow.com/a/16047196/1036500)
	topic_df_dist[ sweep(topic_df_dist, 1, (apply(topic_df_dist,1,min) + apply(topic_df_dist,1,sd) )) > 0 ] <- 0

	#install.packages("igraph")
	library(igraph)
	g <- as.undirected(graph.adjacency(topic_df_dist))
	layout1 <- layout.fruchterman.reingold(g, niter=500)
	plot(g, layout=layout1, edge.curved = TRUE, vertex.size = 1, vertex.color= "grey", edge.arrow.size = 0, vertex.label.dist=0.5, vertex.label = NA)

	write.graph(g, file="cnd.graphml", format="graphml")
	```

	# Conclusion

	There are many ways of visualizing and transforming our data. This document only captures a small fraction of the kinds of things you could do. Another good exploration is at <http://bridge.library.wisc.edu/hw1a-Rcoding-Jockers.html>. Ben Marwick does really fun things with the Day of Archaeology blog posts <https://github.com/benmarwick/dayofarchaeology> and indeed, some of the code above comes from Marwick's explorations. Keep your R scripts in your open notebook, and somebody might come along and use them, cite them, improve them, share them! Keep also all your data. Here's an example from my own work <https://github.com/shawngraham/ferguson>.