shawngraham/comparing-2016-and-2011.rmd

## comparing-2016-and-2011.rmd
---
title: "Distantly Reading Digital Archaeology"
subtitle: "Part III - Evolution"
author: "Shawn Graham"
date: "`r Sys.Date()`"
output:
  tufte::tufte_html: default
  tufte::tufte_handout:
    citation_package: natbib
    latex_engine: xelatex
  tufte::tufte_book:
    citation_package: natbib
    latex_engine: xelatex
bibliography: skeleton.bib
link-citations: yes
---

```{r setup, include=FALSE}
library(tufte)
# invalidate cache when the tufte version changes
knitr::opts_chunk$set(tidy = FALSE, cache.extra = packageVersion('tufte'))
options(htmltools.dir.version = FALSE)

options(java.parameters = "-Xmx5120m")
require(mallet)
require(rJava)
library(mallet)
setwd("/Users/shawngraham/experiments/mob-past/both")
```

# Introduction

A year or two ago, there was a workshop in the Boston area on Digital Archaeology^[Mobilizing the Past for a Digital Future: the Potential of Digital Archaeology [conference site](http://uwm.edu/mobilizing-the-past/)]. The proceedings from that workshop are now available from the Digital Press at the University of North Dakota^[See the press blurb [here](https://thedigitalpress.org/mobilizing-the-past-for-a-digital-future/)]. I've already generated, and discussed, both a topic model and a word vector model of Mobilizing the Past^[See [Electric Archaeology, October 10 2016](https://electricarchaeology.ca/2016/10/08/distantly-reading-digital-archaeology-mobilizing-the-past-for-a-digital-future/)]. Now, I'm ready to generate a topic model across *both* volumes to see how this slippery beast 'digital archaeology' has changed since  _Archaeology 2.0_ came out in 2011.^[Available [here](http://escholarship.org/uc/item/1r6137tb)]

I split both volumes into 500 line chunks, and then, re-using a script by Ben Marwick first developed to explore change over time in what archaeologists write about during the Day of Archaeology^[[Day of Archaeology](http://dayofarchaeology.com); [Ben's script and analysis](https://github.com/benmarwick/dayofarchaeology)], I fit the topic model and see how the topics break down by volume.

## Generate the model

```{r, include=FALSE}
mobtext <- read.csv("biggercsv.csv", stringsAsFactors = FALSE)
documents <- data.frame(text = mobtext$text,
                        id = make.unique(mobtext$id),
                        class = mobtext$year,
                        author = mobtext$author,
                        stringsAsFactors=FALSE)
mallet.instances <- mallet.import(documents$id, documents$text, "en.txt", token.regexp = "\\p{L}[\\p{L}\\p{P}]+\\p{L}")
n.topics <- 30
topic.model <- MalletLDA(n.topics)
topic.model$loadDocuments(mallet.instances)
vocabulary <- topic.model$getVocabulary()
word.freqs <- mallet.word.freqs(topic.model)
topic.model$setAlphaOptimization(20, 50)
topic.model$train(1000)
topic.model$maximize(10)
doc.topics <- mallet.doc.topics(topic.model, smoothed=T, normalized=T)
topic.words <- mallet.topic.words(topic.model, smoothed=T, normalized=T)
topic.docs <- t(doc.topics)
topic.docs <- topic.docs / rowSums(topic.docs)
topics.labels <- rep("", n.topics)
for (topic in 1:n.topics) topics.labels[topic] <- paste(mallet.top.words(topic.model, topic.words[topic,], num.top.words=5)$words, collapse=" ")

```

Again, if you check the code for this r-markdown document, you'll see exactly how to do this on your own materials. I fed R a csv where each line was a 500 line bin from the texts, and where there were columns indicating the author, the chapter, and the year for that text chunk. This material had been made all lowercase beforehand (using the `tr` command at the terminal prompt). Then, we use a standard stop-words file to remove the 'if and but' etc, all those wee words that (in this case) get in the way. The topic model is set to generate 30 topics, using the MALLET package.^[```documents <- data.frame(text = mobtext$text,
                        id = make.unique(mobtext$id),
                        class = mobtext$year,
                        author = mobtext$author,
                        stringsAsFactors=FALSE)
mallet.instances <- mallet.import(documents$id, documents$text, "en.txt", token.regexp = "\\p{L}[\\p{L}\\p{P}]+\\p{L}")```]

etc. Let's take a look at what we find:

```{r}
topics.labels
```

## Check to make sure things make a bit of sense
Only one author writes in both volumes, Eric Kansa. Eric is well known for his advocacy of Open Access approaches in archaeology, and in particular, to data. Looking at the titles of his works in these volumes, the first is very much about building infrastructure and the second about the unintended consequences of techno-utopianism in archaeology. So we'll get R to show us the topic breakdown in the chunks (remember, 500 line bins) that Eric wrote.

```{r, include=FALSE}
topic_docs <- data.frame(topic.docs)
names(topic_docs) <- documents$author

# find top n topics for a certain author
df1 <- t(topic_docs[,grep("kansa", names(topic_docs))])
colnames(df1) <- topics.labels
require(reshape2)
topic.proportions.df <- melt(cbind(data.frame(df1),
                                   document=factor(1:nrow(df1))),
                             variable.name="topic",
                             id.vars = "document")
```
```{r fig-fullwidth1, fig.fullwidth = TRUE, fig.height= 15, warning = FALSE, cache = TRUE}
library(ggplot2)
ggplot(topic.proportions.df, aes(topic, value, fill=document)) +
  geom_bar(stat="identity") +
  ylab("proportion") +
  theme(axis.text.x = element_text(angle=90, hjust=1)) +
  coord_flip() +
  facet_wrap(~ document, ncol=5)
```

And happily, we see what we expected to see, which gives us reassurance that the topics we see in the pieces we *haven't* read are sensible.^[Obviously, this isn't a full-proof approach, but it's good enough for this quick and dirty distant read.]

## And now, a dendrogram!
Dendrograms are handy for topic models because they give you a good sense of how the topics relate to one another. Keep in mind that we have *both* volumes in here, so you _should_ see a bit of difference in these topics than in the previous parts of this little series.

```{r fig-fullwidth2, fig.fullwidth = TRUE, fig.height = 15, fig.cap = "A Dendrogram of 30 Topics in _Mobilizing the Past_ and _Archaeology 2.0_", warning = FALSE, cache=TRUE}

plot(hclust(dist(topic.words)), labels=topics.labels)

```

## But wait, there's more! 2011 versus 2016

```{r, include=FALSE}
topic_docs_t <- data.frame(t(topic_docs))
topic_docs_t$year <- documents$class
df3 <- aggregate(topic_docs_t, by=list(topic_docs_t$year), FUN=mean)
years <- sort(as.character(na.omit(as.numeric(as.character(unique(topic_docs_t$year))))))
df3 <- data.frame(t(df3[(df3$Group.1 %in% years),-length(df3)]), stringsAsFactors = FALSE)
names(df3) <- unname(sapply(years, function(i) paste0("y",i)))
df3 <- df3[-1,]
df3 <- data.frame(apply(df3, 2, as.numeric, as.character))
df3$topic <- 1:n.topics
df3$diff <- df3[,1] - df3[,2]
df3[with(df3, order(-abs(diff))), ]
```

```{r fig-fullwidth3, fig.fullwidth = TRUE, fig.cap = "Topics in 2011 v 2016", warning = FALSE, cache=TRUE}
library(reshape2)
df3m <- melt(df3[,-4], id = 3)
ggplot(df3m, aes(fill = as.factor(topic), topic, value)) +
  geom_bar(stat="identity") +
  coord_flip()  +
  facet_wrap(~ variable)
```

There are some pretty clear differences. I'll leave it up to you to explore those differences for yourself! Clearly though, one main difference is that _Mobilizing the Past_ has more case studies, more site-specific explorations of particular issues in digital archaeology (in keeping with the editors' original wish to generate a kind of handbook perhaps?) where has _Archaeology2.0_ is more about laying the groundwork for understanding what 'digital archaeology' was (at that time, in that (North American) place). While I would've thought that perhaps both volumes would not be so clearly distinct - they're both about digital archaeology, both covering the same thing, the field is sorted, right? - that's not the case. The digital moves fast, but I think what we're seeing here is that we are _not_ reinventing the wheel. These two volumes do not compete, so much as they complement.

And so we progress.


```{r bib, include=FALSE}
# create a bib file for the R packages used in this document
knitr::write_bib(c('base', 'rmarkdown'), file = 'skeleton.bib')
```
	---
	title: "Distantly Reading Digital Archaeology"
	subtitle: "Part III - Evolution"
	author: "Shawn Graham"
	date: "`r Sys.Date()`"
	output:
	tufte::tufte_html: default
	tufte::tufte_handout:
	citation_package: natbib
	latex_engine: xelatex
	tufte::tufte_book:
	citation_package: natbib
	latex_engine: xelatex
	bibliography: skeleton.bib
	link-citations: yes
	---

	```{r setup, include=FALSE}
	library(tufte)
	# invalidate cache when the tufte version changes
	knitr::opts_chunk$set(tidy = FALSE, cache.extra = packageVersion('tufte'))
	options(htmltools.dir.version = FALSE)

	options(java.parameters = "-Xmx5120m")
	require(mallet)
	require(rJava)
	library(mallet)
	setwd("/Users/shawngraham/experiments/mob-past/both")
	```

	# Introduction

	A year or two ago, there was a workshop in the Boston area on Digital Archaeology^[Mobilizing the Past for a Digital Future: the Potential of Digital Archaeology [conference site](http://uwm.edu/mobilizing-the-past/)]. The proceedings from that workshop are now available from the Digital Press at the University of North Dakota^[See the press blurb [here](https://thedigitalpress.org/mobilizing-the-past-for-a-digital-future/)]. I've already generated, and discussed, both a topic model and a word vector model of Mobilizing the Past^[See [Electric Archaeology, October 10 2016](https://electricarchaeology.ca/2016/10/08/distantly-reading-digital-archaeology-mobilizing-the-past-for-a-digital-future/)]. Now, I'm ready to generate a topic model across both volumes to see how this slippery beast 'digital archaeology' has changed since _Archaeology 2.0_ came out in 2011.^[Available [here](http://escholarship.org/uc/item/1r6137tb)]

	I split both volumes into 500 line chunks, and then, re-using a script by Ben Marwick first developed to explore change over time in what archaeologists write about during the Day of Archaeology^[[Day of Archaeology](http://dayofarchaeology.com); [Ben's script and analysis](https://github.com/benmarwick/dayofarchaeology)], I fit the topic model and see how the topics break down by volume.

	## Generate the model

	```{r, include=FALSE}
	mobtext <- read.csv("biggercsv.csv", stringsAsFactors = FALSE)
	documents <- data.frame(text = mobtext$text,
	id = make.unique(mobtext$id),
	class = mobtext$year,
	author = mobtext$author,
	stringsAsFactors=FALSE)
	mallet.instances <- mallet.import(documents$id, documents$text, "en.txt", token.regexp = "\\p{L}[\\p{L}\\p{P}]+\\p{L}")
	n.topics <- 30
	topic.model <- MalletLDA(n.topics)
	topic.model$loadDocuments(mallet.instances)
	vocabulary <- topic.model$getVocabulary()
	word.freqs <- mallet.word.freqs(topic.model)
	topic.model$setAlphaOptimization(20, 50)
	topic.model$train(1000)
	topic.model$maximize(10)
	doc.topics <- mallet.doc.topics(topic.model, smoothed=T, normalized=T)
	topic.words <- mallet.topic.words(topic.model, smoothed=T, normalized=T)
	topic.docs <- t(doc.topics)
	topic.docs <- topic.docs / rowSums(topic.docs)
	topics.labels <- rep("", n.topics)
	for (topic in 1:n.topics) topics.labels[topic] <- paste(mallet.top.words(topic.model, topic.words[topic,], num.top.words=5)$words, collapse=" ")

	```

	Again, if you check the code for this r-markdown document, you'll see exactly how to do this on your own materials. I fed R a csv where each line was a 500 line bin from the texts, and where there were columns indicating the author, the chapter, and the year for that text chunk. This material had been made all lowercase beforehand (using the `tr` command at the terminal prompt). Then, we use a standard stop-words file to remove the 'if and but' etc, all those wee words that (in this case) get in the way. The topic model is set to generate 30 topics, using the MALLET package.^[```documents <- data.frame(text = mobtext$text,
	id = make.unique(mobtext$id),
	class = mobtext$year,
	author = mobtext$author,
	stringsAsFactors=FALSE)
	mallet.instances <- mallet.import(documents$id, documents$text, "en.txt", token.regexp = "\\p{L}[\\p{L}\\p{P}]+\\p{L}")```]

	etc. Let's take a look at what we find:

	```{r}
	topics.labels
	```

	## Check to make sure things make a bit of sense
	Only one author writes in both volumes, Eric Kansa. Eric is well known for his advocacy of Open Access approaches in archaeology, and in particular, to data. Looking at the titles of his works in these volumes, the first is very much about building infrastructure and the second about the unintended consequences of techno-utopianism in archaeology. So we'll get R to show us the topic breakdown in the chunks (remember, 500 line bins) that Eric wrote.

	```{r, include=FALSE}
	topic_docs <- data.frame(topic.docs)
	names(topic_docs) <- documents$author

	# find top n topics for a certain author
	df1 <- t(topic_docs[,grep("kansa", names(topic_docs))])
	colnames(df1) <- topics.labels
	require(reshape2)
	topic.proportions.df <- melt(cbind(data.frame(df1),
	document=factor(1:nrow(df1))),
	variable.name="topic",
	id.vars = "document")
	```
	```{r fig-fullwidth1, fig.fullwidth = TRUE, fig.height= 15, warning = FALSE, cache = TRUE}
	library(ggplot2)
	ggplot(topic.proportions.df, aes(topic, value, fill=document)) +
	geom_bar(stat="identity") +
	ylab("proportion") +
	theme(axis.text.x = element_text(angle=90, hjust=1)) +
	coord_flip() +
	facet_wrap(~ document, ncol=5)
	```

	And happily, we see what we expected to see, which gives us reassurance that the topics we see in the pieces we haven't read are sensible.^[Obviously, this isn't a full-proof approach, but it's good enough for this quick and dirty distant read.]

	## And now, a dendrogram!
	Dendrograms are handy for topic models because they give you a good sense of how the topics relate to one another. Keep in mind that we have both volumes in here, so you _should_ see a bit of difference in these topics than in the previous parts of this little series.

	```{r fig-fullwidth2, fig.fullwidth = TRUE, fig.height = 15, fig.cap = "A Dendrogram of 30 Topics in _Mobilizing the Past_ and _Archaeology 2.0_", warning = FALSE, cache=TRUE}

	plot(hclust(dist(topic.words)), labels=topics.labels)

	```

	## But wait, there's more! 2011 versus 2016

	```{r, include=FALSE}
	topic_docs_t <- data.frame(t(topic_docs))
	topic_docs_t$year <- documents$class
	df3 <- aggregate(topic_docs_t, by=list(topic_docs_t$year), FUN=mean)
	years <- sort(as.character(na.omit(as.numeric(as.character(unique(topic_docs_t$year))))))
	df3 <- data.frame(t(df3[(df3$Group.1 %in% years),-length(df3)]), stringsAsFactors = FALSE)
	names(df3) <- unname(sapply(years, function(i) paste0("y",i)))
	df3 <- df3[-1,]
	df3 <- data.frame(apply(df3, 2, as.numeric, as.character))
	df3$topic <- 1:n.topics
	df3$diff <- df3[,1] - df3[,2]
	df3[with(df3, order(-abs(diff))), ]
	```

	```{r fig-fullwidth3, fig.fullwidth = TRUE, fig.cap = "Topics in 2011 v 2016", warning = FALSE, cache=TRUE}
	library(reshape2)
	df3m <- melt(df3[,-4], id = 3)
	ggplot(df3m, aes(fill = as.factor(topic), topic, value)) +
	geom_bar(stat="identity") +
	coord_flip() +
	facet_wrap(~ variable)
	```

	There are some pretty clear differences. I'll leave it up to you to explore those differences for yourself! Clearly though, one main difference is that _Mobilizing the Past_ has more case studies, more site-specific explorations of particular issues in digital archaeology (in keeping with the editors' original wish to generate a kind of handbook perhaps?) where has _Archaeology2.0_ is more about laying the groundwork for understanding what 'digital archaeology' was (at that time, in that (North American) place). While I would've thought that perhaps both volumes would not be so clearly distinct - they're both about digital archaeology, both covering the same thing, the field is sorted, right? - that's not the case. The digital moves fast, but I think what we're seeing here is that we are _not_ reinventing the wheel. These two volumes do not compete, so much as they complement.

	And so we progress.


	```{r bib, include=FALSE}
	# create a bib file for the R packages used in this document
	knitr::write_bib(c('base', 'rmarkdown'), file = 'skeleton.bib')
	```