shawngraham/tm-mobpast.rmd

## tm-mobpast.rmd
---
title: "A Topic Model of Mobilizing the Past"
subtitle: "or, Distantly Reading Digital Archaeology"
author: "Shawn Graham"
date: "`r Sys.Date()`"
output:
  tufte::tufte_html: default
  tufte::tufte_handout:
    citation_package: natbib
    latex_engine: xelatex
  tufte::tufte_book:
    citation_package: natbib
    latex_engine: xelatex
bibliography: skeleton.bib
link-citations: yes
---

```{r setup, include=FALSE}
library(tufte)
# invalidate cache when the tufte version changes
knitr::opts_chunk$set(tidy = FALSE, cache.extra = packageVersion('tufte'))
options(htmltools.dir.version = FALSE)

options(java.parameters = "-Xmx5120m")
require(mallet)
require(rJava)
library(mallet)
setwd("/Users/shawngraham/experiments/mob-past/fulltext-500lines")
```

# Introduction

A year or two ago, there was a workshop in the Boston area on Digital Archaeology^[Mobilizing the Past for a Digital Future: the Potential of Digital Archaeology [conference site](http://uwm.edu/mobilizing-the-past/)]. The proceedings from that workshop are now available from the Digital Press at the University of North Dakota^[See the press blurb [here](https://thedigitalpress.org/mobilizing-the-past-for-a-digital-future/)]. I thought it would be interesting to review that work by reading it distantly, and then see how it compares with an earlier important work on digital archaeology, _Archaeology 2.0_^[Available [here](http://escholarship.org/uc/item/1r6137tb). Apparently, there's an undiscovered 'easter egg' in that volume too.] This current essay is step one in that program: generating, and reflecting, on a topic model.

# Generating A Topic Model

There are any number of ways to generate, or fit, a topic model to a collection of materials. Since I was allowed to have an advance copy of the volume, and it arrived as a pdf, I extracted the text from the volume into a single text file. Since the two volumes are different lengths, and I want things to be roughly comparable, I instead broked each one into 500 line chunks, and ingested them.^[I used the command `split -l 500 completevolume.txt` at the terminal to achieve this.] The actual code for the topic model (and especially, for comparing topic distribution within and across the volumes) was repurposed from Ben Marwick's 'Day of Archaeology' code.^['A Distant Reading of a Day of Archaeology [https://github.com/benmarwick/dayofarchaeology](https://github.com/benmarwick/dayofarchaeology)].

I then twiddled various knobs and dials and eventually decided that 30 topics was about right. Here's what I found:

```{r, include=FALSE}
library(mallet)
mobtext <- read.csv("bigcsv.csv", stringsAsFactors = FALSE)

documents <- data.frame(text = mobtext$text,
                        id = make.unique(mobtext$id),
                        class = mobtext$chapter,
                        stringsAsFactors=FALSE)
mallet.instances <- mallet.import(documents$id, documents$text, "en.txt", token.regexp = "\\p{L}[\\p{L}\\p{P}]+\\p{L}")

n.topics <- 30
topic.model <- MalletLDA(n.topics)

topic.model$loadDocuments(mallet.instances)

vocabulary <- topic.model$getVocabulary()
word.freqs <- mallet.word.freqs(topic.model)

topic.model$setAlphaOptimization(20, 50)

topic.model$train(1000)

topic.model$maximize(10)

doc.topics <- mallet.doc.topics(topic.model, smoothed=T, normalized=T)
topic.words <- mallet.topic.words(topic.model, smoothed=T, normalized=T)

topic.docs <- t(doc.topics)
topic.docs <- topic.docs / rowSums(topic.docs)

topics.labels <- rep("", n.topics)
for (topic in 1:n.topics) topics.labels[topic] <- paste(mallet.top.words(topic.model, topic.words[topic,], num.top.words=5)$words, collapse=" ")
```

```{r}
topics.labels
```

## Check that these topics make sense

We know that Bill Caraher's piece in _Mobilizing the Past_ makes the case for a 'slow' archaeology.^[Bill has written a lot about slow archaeology; start [here](https://mediterraneanworld.wordpress.com/2014/02/17/toward-a-slow-archaeology-part-1/)] Let's see what topics are present in his chapter:
```{r, include=FALSE}
topic_docs <- data.frame(topic.docs)
names(topic_docs) <- documents$id
df1 <- t(topic_docs[,grep("xbd", names(topic_docs))])

#8963295 is a person who has 'for sale' in her post
colnames(df1) <- topics.labels
require(reshape2)
topic.proportions.df <- melt(cbind(data.frame(df1),
                                   document=factor(1:nrow(df1))),
                             variable.name="topic",
                             id.vars = "document")
```

```{r fig-fullwidthq, fig.fullwidth = TRUE, warning = FALSE, cache = TRUE}
library(ggplot2)
ggplot(topic.proportions.df, aes(topic, value, fill=document)) +
  geom_bar(stat="identity") +
  ylab("proportion") +
  theme(axis.text.x = element_text(angle=90, hjust=1)) +
  coord_flip() +
  facet_wrap(~ document, ncol=5)
```

That makes a lot of sense, knowing what we know already about Bill's work. What about Eric's piece? (I feel safe in drawing attention to Bill and Eric's pieces because they have already been shared online, in various ways)

```{r, include=FALSE}
topic_docs <- data.frame(topic.docs)
names(topic_docs) <- documents$id
df1 <- t(topic_docs[,grep("xbf", names(topic_docs))])

#8963295 is a person who has 'for sale' in her post
colnames(df1) <- topics.labels
require(reshape2)
topic.proportions.df <- melt(cbind(data.frame(df1),
                                   document=factor(1:nrow(df1))),
                             variable.name="topic",
                             id.vars = "document")
```

```{r fig-fullwidtheric, fig.fullwidth = TRUE, warning = FALSE, cache = TRUE}
library(ggplot2)
ggplot(topic.proportions.df, aes(topic, value, fill=document)) +
  geom_bar(stat="identity") +
  ylab("proportion") +
  theme(axis.text.x = element_text(angle=90, hjust=1)) +
  coord_flip() +
  facet_wrap(~ document, ncol=5)
```

Finally, let's look at the introduction to the volume:

```{r, include=FALSE}
topic_docs <- data.frame(topic.docs)
names(topic_docs) <- documents$id
df1 <- t(topic_docs[,grep("xac", names(topic_docs))])

#8963295 is a person who has 'for sale' in her post
colnames(df1) <- topics.labels
require(reshape2)
topic.proportions.df <- melt(cbind(data.frame(df1),
                                   document=factor(1:nrow(df1))),
                             variable.name="topic",
                             id.vars = "document")
```

```{r fig-fullwidthpreface, fig.fullwidth = TRUE, warning = FALSE, cache = TRUE}
library(ggplot2)
ggplot(topic.proportions.df, aes(topic, value, fill=document)) +
  geom_bar(stat="identity") +
  ylab("proportion") +
  theme(axis.text.x = element_text(angle=90, hjust=1)) +
  coord_flip() +
  facet_wrap(~ document, ncol=5)
```

The preface (which can be read online, see the link above in the sidenote) discusses how the organizers had hoped that some sort of handbook or how-to for digital archaeology might emerge from their meeting; the distribution of topics here reflect that concern for method. So, this wee check reassures us that the topics that we found are making sense, given what we might naturally expect knowing the context.

## Topics across the entire volume
Let's see how topics play out across the entire volume. First, let's generate a dendrogram to see how the topics relate to one another. This can help us work out which topics are likely subtopics of one another, or otherwise close to one another in the semantic space of this volume. (My apologies if this dendrogram spreads down the page. Can't quite get the hang of this layout package.^[[Tufte](http://rstudio.github.io/tufte/envisioned/)])

```{r fig-fullwidth2, fig.fullwidth = TRUE, fig.height = 20, fig.cap = "A Dendrogram of 30 Topics in _Mobilizing the Past_", warning = FALSE, cache=TRUE}

plot(hclust(dist(topic.words)), labels=topics.labels)
```

We can also visualize the average proportions of each topic by chapter, as well as by section.

```{r, include=FALSE}
topic_docs_t <- data.frame(t(topic_docs))
topic_docs_t$year <- documents$class

df3 <- aggregate(topic_docs_t, by=list(topic_docs_t$year), FUN=mean)

years <- sort(as.character(na.omit(as.numeric(as.character(unique(topic_docs_t$year))))))
df3 <- data.frame(t(df3[(df3$Group.1 %in% years),-length(df3)]), stringsAsFactors = FALSE)

names(df3) <- unname(sapply(years, function(i) paste0("y",i)))

df3 <- df3[-1,]

df3 <- data.frame(apply(df3, 2, as.numeric, as.character))
df3$topic <- 1:n.topics

df3$Part_One <- df3[,1,2] + df3[,3,4] + df3[,5,6] + df3[,7,8] + df3[,9]
df3[with(df3, order(Part_One)),]

df3$Part_Two <- df3[,10,11] + df3[,12,13] + df3[,14]
df3[with(df3, order(Part_Two)),]

df3$Part_Three <- df3[,15,16] + df3[,17,18]
df3[with(df3, order(Part_Three)),]

df3$Part_Four <- df3[,19,20]
df3[with(df3, order(Part_Four)),]

df3$Part_Five <- df3[,21,22]
df3[with(df3, order(Part_Five)),]

df3$diff2_1 <- df3[,25] - df3[,21] # difference b/w part two and part one
df3[with(df3, order(-abs(diff2_1))), ]
library(reshape2)

df3m <- melt(df3[,-29], id = 23) # the first number is the diff you want, the second is the topics
```

```{r fig-fullwidth3, fig.width = 10, fig.height = 30, fig.fullwidth = TRUE, fig.cap = "Topics over Chapters", warning = FALSE, cache=TRUE}

ggplot(df3m, aes(fill = as.factor(topic), topic, value)) +
  geom_bar(stat="identity") +
  coord_flip()  +
  facet_wrap(~ variable)
```

It's quite clear here that most authors stick to one or two main topics in their chapters - which makes sense; a chapter shouldn't be all over the map, one intuits, if it is to be of interest and utility. What is a bit more interesting is the way the book chapters, in aggregate, cover different aspects of things digital. I take this as a healthy sign that the field hasn't congealed yet, that there is a wide possibility-space of ways to be digital and archaeological. The first section is thematically more varied. Section two seems more concerned with the mobile web and public reception (which the titles of the papers would not lead you to expect; is my topic model not fine-grained enough?). Section three: all workflows. Section four: ethics. The book seems to wind up with a consideration of the institutional contexts.

# So what have we learned?
I can hear you now: _Couldn't you, Shawn, have said the same thing by perusing the table of contexts?_ Well, yes, no doubt, to a degree. But here we can see how topics group, and knowing how they average across the volume gives us a real sense of where the field - as these practitioners see it - is and where it's heading.

_And we still haven't read a damned thing_. Keep that in mind: this is just the result of the computer counting words. Speaking of which, in a subsequent post, I'll look at word use in particular. Topic models give us a top down look at things; word vectors give us the bottom up.^[For more on word vectors, see [Ben Schmidt](http://bookworm.benschmidt.org/posts/2015-10-25-Word-Embeddings.html).] This will give us a sense of whether the 'brogrammer' culture^[ugh] has crept into archaeology, how digital work in archaeology is gendered^['Some of us are brave'[JDH](http://journalofdigitalhumanities.org/1-1/all-the-digital-humanists-are-white-all-the-nerds-are-men-but-some-of-us-are-brave-by-moya-z-bailey/)] and so on.

And of course, as archaeologists, it's change over time that really piques our interest. I will eventually compare this volume with the earlier volume edited by Kansa, Kansa, and Watrall, with topic models and word vectors. But I think at this point, you can spot interesting patterns in what's going on, and I invite you to use [hypothesis](http://hypothes.is) to annotate this little public experiment.

```{r bib, include=FALSE}
# create a bib file for the R packages used in this document
knitr::write_bib(c('base', 'rmarkdown'), file = 'skeleton.bib')
```
	---
	title: "A Topic Model of Mobilizing the Past"
	subtitle: "or, Distantly Reading Digital Archaeology"
	author: "Shawn Graham"
	date: "`r Sys.Date()`"
	output:
	tufte::tufte_html: default
	tufte::tufte_handout:
	citation_package: natbib
	latex_engine: xelatex
	tufte::tufte_book:
	citation_package: natbib
	latex_engine: xelatex
	bibliography: skeleton.bib
	link-citations: yes
	---

	```{r setup, include=FALSE}
	library(tufte)
	# invalidate cache when the tufte version changes
	knitr::opts_chunk$set(tidy = FALSE, cache.extra = packageVersion('tufte'))
	options(htmltools.dir.version = FALSE)

	options(java.parameters = "-Xmx5120m")
	require(mallet)
	require(rJava)
	library(mallet)
	setwd("/Users/shawngraham/experiments/mob-past/fulltext-500lines")
	```

	# Introduction

	A year or two ago, there was a workshop in the Boston area on Digital Archaeology^[Mobilizing the Past for a Digital Future: the Potential of Digital Archaeology [conference site](http://uwm.edu/mobilizing-the-past/)]. The proceedings from that workshop are now available from the Digital Press at the University of North Dakota^[See the press blurb [here](https://thedigitalpress.org/mobilizing-the-past-for-a-digital-future/)]. I thought it would be interesting to review that work by reading it distantly, and then see how it compares with an earlier important work on digital archaeology, _Archaeology 2.0_^[Available [here](http://escholarship.org/uc/item/1r6137tb). Apparently, there's an undiscovered 'easter egg' in that volume too.] This current essay is step one in that program: generating, and reflecting, on a topic model.

	# Generating A Topic Model

	There are any number of ways to generate, or fit, a topic model to a collection of materials. Since I was allowed to have an advance copy of the volume, and it arrived as a pdf, I extracted the text from the volume into a single text file. Since the two volumes are different lengths, and I want things to be roughly comparable, I instead broked each one into 500 line chunks, and ingested them.^[I used the command `split -l 500 completevolume.txt` at the terminal to achieve this.] The actual code for the topic model (and especially, for comparing topic distribution within and across the volumes) was repurposed from Ben Marwick's 'Day of Archaeology' code.^['A Distant Reading of a Day of Archaeology [https://github.com/benmarwick/dayofarchaeology](https://github.com/benmarwick/dayofarchaeology)].

	I then twiddled various knobs and dials and eventually decided that 30 topics was about right. Here's what I found:

	```{r, include=FALSE}
	library(mallet)
	mobtext <- read.csv("bigcsv.csv", stringsAsFactors = FALSE)

	documents <- data.frame(text = mobtext$text,
	id = make.unique(mobtext$id),
	class = mobtext$chapter,
	stringsAsFactors=FALSE)
	mallet.instances <- mallet.import(documents$id, documents$text, "en.txt", token.regexp = "\\p{L}[\\p{L}\\p{P}]+\\p{L}")

	n.topics <- 30
	topic.model <- MalletLDA(n.topics)

	topic.model$loadDocuments(mallet.instances)

	vocabulary <- topic.model$getVocabulary()
	word.freqs <- mallet.word.freqs(topic.model)

	topic.model$setAlphaOptimization(20, 50)

	topic.model$train(1000)

	topic.model$maximize(10)

	doc.topics <- mallet.doc.topics(topic.model, smoothed=T, normalized=T)
	topic.words <- mallet.topic.words(topic.model, smoothed=T, normalized=T)

	topic.docs <- t(doc.topics)
	topic.docs <- topic.docs / rowSums(topic.docs)

	topics.labels <- rep("", n.topics)
	for (topic in 1:n.topics) topics.labels[topic] <- paste(mallet.top.words(topic.model, topic.words[topic,], num.top.words=5)$words, collapse=" ")
	```

	```{r}
	topics.labels
	```

	## Check that these topics make sense

	We know that Bill Caraher's piece in _Mobilizing the Past_ makes the case for a 'slow' archaeology.^[Bill has written a lot about slow archaeology; start [here](https://mediterraneanworld.wordpress.com/2014/02/17/toward-a-slow-archaeology-part-1/)] Let's see what topics are present in his chapter:
	```{r, include=FALSE}
	topic_docs <- data.frame(topic.docs)
	names(topic_docs) <- documents$id
	df1 <- t(topic_docs[,grep("xbd", names(topic_docs))])

	#8963295 is a person who has 'for sale' in her post
	colnames(df1) <- topics.labels
	require(reshape2)
	topic.proportions.df <- melt(cbind(data.frame(df1),
	document=factor(1:nrow(df1))),
	variable.name="topic",
	id.vars = "document")
	```

	```{r fig-fullwidthq, fig.fullwidth = TRUE, warning = FALSE, cache = TRUE}
	library(ggplot2)
	ggplot(topic.proportions.df, aes(topic, value, fill=document)) +
	geom_bar(stat="identity") +
	ylab("proportion") +
	theme(axis.text.x = element_text(angle=90, hjust=1)) +
	coord_flip() +
	facet_wrap(~ document, ncol=5)
	```

	That makes a lot of sense, knowing what we know already about Bill's work. What about Eric's piece? (I feel safe in drawing attention to Bill and Eric's pieces because they have already been shared online, in various ways)

	```{r, include=FALSE}
	topic_docs <- data.frame(topic.docs)
	names(topic_docs) <- documents$id
	df1 <- t(topic_docs[,grep("xbf", names(topic_docs))])

	#8963295 is a person who has 'for sale' in her post
	colnames(df1) <- topics.labels
	require(reshape2)
	topic.proportions.df <- melt(cbind(data.frame(df1),
	document=factor(1:nrow(df1))),
	variable.name="topic",
	id.vars = "document")
	```

	```{r fig-fullwidtheric, fig.fullwidth = TRUE, warning = FALSE, cache = TRUE}
	library(ggplot2)
	ggplot(topic.proportions.df, aes(topic, value, fill=document)) +
	geom_bar(stat="identity") +
	ylab("proportion") +
	theme(axis.text.x = element_text(angle=90, hjust=1)) +
	coord_flip() +
	facet_wrap(~ document, ncol=5)
	```

	Finally, let's look at the introduction to the volume:

	```{r, include=FALSE}
	topic_docs <- data.frame(topic.docs)
	names(topic_docs) <- documents$id
	df1 <- t(topic_docs[,grep("xac", names(topic_docs))])

	#8963295 is a person who has 'for sale' in her post
	colnames(df1) <- topics.labels
	require(reshape2)
	topic.proportions.df <- melt(cbind(data.frame(df1),
	document=factor(1:nrow(df1))),
	variable.name="topic",
	id.vars = "document")
	```

	```{r fig-fullwidthpreface, fig.fullwidth = TRUE, warning = FALSE, cache = TRUE}
	library(ggplot2)
	ggplot(topic.proportions.df, aes(topic, value, fill=document)) +
	geom_bar(stat="identity") +
	ylab("proportion") +
	theme(axis.text.x = element_text(angle=90, hjust=1)) +
	coord_flip() +
	facet_wrap(~ document, ncol=5)
	```

	The preface (which can be read online, see the link above in the sidenote) discusses how the organizers had hoped that some sort of handbook or how-to for digital archaeology might emerge from their meeting; the distribution of topics here reflect that concern for method. So, this wee check reassures us that the topics that we found are making sense, given what we might naturally expect knowing the context.

	## Topics across the entire volume
	Let's see how topics play out across the entire volume. First, let's generate a dendrogram to see how the topics relate to one another. This can help us work out which topics are likely subtopics of one another, or otherwise close to one another in the semantic space of this volume. (My apologies if this dendrogram spreads down the page. Can't quite get the hang of this layout package.^[[Tufte](http://rstudio.github.io/tufte/envisioned/)])

	```{r fig-fullwidth2, fig.fullwidth = TRUE, fig.height = 20, fig.cap = "A Dendrogram of 30 Topics in _Mobilizing the Past_", warning = FALSE, cache=TRUE}

	plot(hclust(dist(topic.words)), labels=topics.labels)
	```

	We can also visualize the average proportions of each topic by chapter, as well as by section.

	```{r, include=FALSE}
	topic_docs_t <- data.frame(t(topic_docs))
	topic_docs_t$year <- documents$class

	df3 <- aggregate(topic_docs_t, by=list(topic_docs_t$year), FUN=mean)

	years <- sort(as.character(na.omit(as.numeric(as.character(unique(topic_docs_t$year))))))
	df3 <- data.frame(t(df3[(df3$Group.1 %in% years),-length(df3)]), stringsAsFactors = FALSE)

	names(df3) <- unname(sapply(years, function(i) paste0("y",i)))

	df3 <- df3[-1,]

	df3 <- data.frame(apply(df3, 2, as.numeric, as.character))
	df3$topic <- 1:n.topics

	df3$Part_One <- df3[,1,2] + df3[,3,4] + df3[,5,6] + df3[,7,8] + df3[,9]
	df3[with(df3, order(Part_One)),]

	df3$Part_Two <- df3[,10,11] + df3[,12,13] + df3[,14]
	df3[with(df3, order(Part_Two)),]

	df3$Part_Three <- df3[,15,16] + df3[,17,18]
	df3[with(df3, order(Part_Three)),]

	df3$Part_Four <- df3[,19,20]
	df3[with(df3, order(Part_Four)),]

	df3$Part_Five <- df3[,21,22]
	df3[with(df3, order(Part_Five)),]

	df3$diff2_1 <- df3[,25] - df3[,21] # difference b/w part two and part one
	df3[with(df3, order(-abs(diff2_1))), ]
	library(reshape2)

	df3m <- melt(df3[,-29], id = 23) # the first number is the diff you want, the second is the topics
	```

	```{r fig-fullwidth3, fig.width = 10, fig.height = 30, fig.fullwidth = TRUE, fig.cap = "Topics over Chapters", warning = FALSE, cache=TRUE}

	ggplot(df3m, aes(fill = as.factor(topic), topic, value)) +
	geom_bar(stat="identity") +
	coord_flip() +
	facet_wrap(~ variable)
	```

	It's quite clear here that most authors stick to one or two main topics in their chapters - which makes sense; a chapter shouldn't be all over the map, one intuits, if it is to be of interest and utility. What is a bit more interesting is the way the book chapters, in aggregate, cover different aspects of things digital. I take this as a healthy sign that the field hasn't congealed yet, that there is a wide possibility-space of ways to be digital and archaeological. The first section is thematically more varied. Section two seems more concerned with the mobile web and public reception (which the titles of the papers would not lead you to expect; is my topic model not fine-grained enough?). Section three: all workflows. Section four: ethics. The book seems to wind up with a consideration of the institutional contexts.

	# So what have we learned?
	I can hear you now: _Couldn't you, Shawn, have said the same thing by perusing the table of contexts?_ Well, yes, no doubt, to a degree. But here we can see how topics group, and knowing how they average across the volume gives us a real sense of where the field - as these practitioners see it - is and where it's heading.

	_And we still haven't read a damned thing_. Keep that in mind: this is just the result of the computer counting words. Speaking of which, in a subsequent post, I'll look at word use in particular. Topic models give us a top down look at things; word vectors give us the bottom up.^[For more on word vectors, see [Ben Schmidt](http://bookworm.benschmidt.org/posts/2015-10-25-Word-Embeddings.html).] This will give us a sense of whether the 'brogrammer' culture^[ugh] has crept into archaeology, how digital work in archaeology is gendered^['Some of us are brave'[JDH](http://journalofdigitalhumanities.org/1-1/all-the-digital-humanists-are-white-all-the-nerds-are-men-but-some-of-us-are-brave-by-moya-z-bailey/)] and so on.

	And of course, as archaeologists, it's change over time that really piques our interest. I will eventually compare this volume with the earlier volume edited by Kansa, Kansa, and Watrall, with topic models and word vectors. But I think at this point, you can spot interesting patterns in what's going on, and I invite you to use [hypothesis](http://hypothes.is) to annotate this little public experiment.

	```{r bib, include=FALSE}
	# create a bib file for the R packages used in this document
	knitr::write_bib(c('base', 'rmarkdown'), file = 'skeleton.bib')
	```