Calculations going into a blog post about the Three Percent translation counts
As long promised, here are some links to the data on U.S. literary translation that I showed in a table during our discussion of Casanova.
By kind permission of Chad Post, I can make available an aggregate data file of all the literature translations catalogued by Three Percent. I've decided to put the data file, together with some scripts and information about the munging, in a [github repository](http://github.com/agoldst/threepercent). The data consists of a single CSV file with one line for each title: [all_titles.csv](https://github.com/agoldst/threepercent/blob/master/all_titles.csv) ([Wikipedia on CSV format](http://en.wikipedia.org/wiki/Comma-separated_values)).
I produced this by exporting the first "sheet" of each of the five yearly spreadsheets available at [the Three Percent Translation Database](http://www.rochester.edu/College/translation/threepercent/index.php?s=database) and then combining the files. According to Chad Post, updated data will be available soon, at which point I can regenerate the aggregate file.
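If you want to redo that combining step yourself, it is short work once the sheets are exported to CSV. Here is a rough sketch, with placeholder file names standing in for wherever you save the five exports:
```{r eval=FALSE,echo=TRUE}
# placeholder names for the five exported first sheets (2008-2012)
yearly.files <- paste0("translations", 2008:2012, ".csv")
# rbind assumes the yearly sheets share the same columns;
# in practice some hand-tidying of headers may be needed first
all.titles <- do.call(rbind, lapply(yearly.files, read.csv, stringsAsFactors=FALSE))
write.csv(all.titles, file="all_titles.csv", row.names=FALSE)
```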
The data of most immediate interest are the year of publication, the language of origin, the country of origin, and the genre (fiction vs. poetry).
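If you would rather poke at those fields without going through my analysis.R, loading the aggregate file is enough. A quick sketch (analysis.R uses the column names Country and Year; I am assuming the language and genre columns are named analogously, so check `names(tx)` against the CSV header):
```{r eval=FALSE,echo=TRUE}
# one row per translated title
tx <- read.csv("all_titles.csv", stringsAsFactors=FALSE)
table(tx$Year)                                       # titles per year
head(sort(table(tx$Country), decreasing=TRUE), 20)   # most-translated countries
head(sort(table(tx$Language), decreasing=TRUE), 20)  # most-translated languages (assumed column name)
table(tx$Genre, tx$Year)                             # fiction vs. poetry by year (assumed column name)
```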
Here are the top 20 countries by number of translations (labeled "Freq") in 2012:
```{r setup,include=FALSE,cache=FALSE}
opts_chunk$set(echo=FALSE,warning=FALSE,prompt=FALSE,comment="")
options(width=70)
```
```{r initChunk}
setwd("~/Developer/threepercent")
source("~/Developer/threepercent/analysis.R")
```
```{r dependson="initChunk"}
print(subset(top.in.year(countries.df, year=2012, n=20, per="Freq"),
             select=c(Country, Year, Freq)),
      row.names=FALSE)
```
and by language:
```{r dependson="initChunk"}
print(top.in.year(languages.df, year=2012, per="Freq", n=20),
      row.names=FALSE)
```
I became curious about what it would look like to "normalize" these data so that the raw counts could be compared on some reasonable scale. This turned out to be kind of a challenge to get started on, but anyway here's what I managed.
One possibility would be to think about translations per capita of the "home" country. The WHO has recent population figures. One interesting challenge turned out to be matching Three Percent's "country" names with the international organization's. This is not a politically innocent question! But for a first exploration, I did not fill in the population figures for such countries as Palestine, Taiwan, and Quebec (!). The per-capita numbers are quite amusing:
```{r dependson="initChunk"}
cat("2012 top Countries\nby U.S. Translations per million population")
print(subset(top.in.year(countries.df, year=2012, n=20, per="per.million.pop"),
             select=c(Country, per.million.pop, Freq)),
      row.names=FALSE)
cat("2008 top Countries\nby U.S. Translations per million population")
print(subset(top.in.year(countries.df, year=2008, n=20, per="per.million.pop"),
             select=c(Country, per.million.pop, Freq)),
      row.names=FALSE)
```
Scandinavia is way out ahead on this one, even on the early side of the Nordic noir boom. Djibouti, with its small population, gets a big bonus for being the source of a single translated book in 2012.
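For the record, the per-capita normalization itself is just a merge and a division. Here is a minimal sketch of that step (the working code is in analysis.R; `who.pop` here is a stand-in for a hand-matched table of WHO population figures, in millions, keyed by Three Percent's country names):
```{r eval=FALSE,echo=TRUE}
# who.pop: hypothetical data frame with columns Country and pop.millions,
# built by matching WHO country names to Three Percent's by hand
countries.df <- merge(countries.df, who.pop, by="Country", all.x=TRUE)
countries.df$per.million.pop <- countries.df$Freq / countries.df$pop.millions
```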
UNESCO makes some data on book production available. They have counts of the number of "Literature" titles produced in some (not all) countries from 1995 to 1999, as well as counts of total book production. So we can get a rough sense of the proportion of each country's literary output that is translated in the US. Ideally these ratios would give a sense of the way translation reshapes the image of global literary production for US publishing. One could take all this even further by considering print runs, etc. Unfortunately even these numbers are very spotty and hard to interpret---many of the raw counts are quite low. In order to get a bit more data, where 1999 numbers were missing I filled in numbers from the latest year available. (The process is automated in the [analysis.R](https://github.com/agoldst/threepercent/blob/master/analysis.R) script.)
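The fill-in amounts to taking, for each country, the 1999 count when it exists and otherwise the latest non-missing year. A rough sketch of that logic (the working code is in analysis.R; `unesco.lit` here is a stand-in for the UNESCO "Literature" counts, one row per country with rownames, columns for the years 1995 to 1999, NA where a figure is missing):
```{r eval=FALSE,echo=TRUE}
# take the rightmost (most recent) non-missing count in a row of yearly figures
latest.available <- function(counts) {
  ok <- which(!is.na(counts))
  if (length(ok) == 0) NA else counts[max(ok)]
}
# named vector of home literary production, keyed by country
country.lit.prod <- apply(unesco.lit, 1, latest.available)
# U.S. translations per home literary title
countries.df$per.lit <- countries.df$Freq /
  country.lit.prod[as.character(countries.df$Country)]
```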
```{r dependson="initChunk"}
cat("2012 top Countries\nby U.S. Translations\nper home literary titles produced (UNESCO)")
top.per.lit <- subset(top.in.year(countries.df, year=2012, n=20, per="per.lit"),
                      select=c(Country, per.lit, Freq))
print(top.per.lit,row.names=FALSE)
cat("Number of home literary titles (UNESCO)")
country.lit.prod[as.character(top.per.lit$Country)]
```
One can also think about the most *under*-represented countries in this sense. Here are the countries for which the number of home literary titles per U.S. translation is highest:
```{r dependson="initChunk"}
# countries with at least one translation in 2012
nonzero <- subset(countries.df, subset=(Year == 2012 & Freq > 0))
# rank by home literary titles per U.S. translation (the reciprocal of per.lit)
under.rep <- nonzero[order(1 / nonzero$per.lit, decreasing=TRUE)[1:20], ]
print(under.rep,row.names=FALSE)
cat("Number of home literary titles (UNESCO)")
country.lit.prod[as.character(under.rep$Country)]
```
That India tops this metric is indeed interesting, though by no means independently significant.
Anyway, these are simply starting points. The main utility of looking at these numbers should be to put under pressure any notion of national or linguistic "representativeness" one might initially bring to the problem of world literature. Mufti's remark on diversity being a colonial and Orientalist problematic was particularly brought home to me as I found myself turning to data gathered by global organizations in order to transform Three Percent's raw counts into *comparable* numbers.
In the interests of completeness, here's a [gist](https://gist.github.com/agoldst/5184034) with the R markdown source for this post showing the calculations that produce the tables here.
[*Edit* 3/18/2013: removed `##` comment characters from R output. I should have said that I produced the [handout from class](https://sakai.rutgers.edu/access/content/group/f4c20120-44b2-4309-802a-543ef7f6a4cb/public/countries2008-2012.pdf) by pasting together the counts included in the separate sheets of the original spreadsheet, but they can also be derived from the raw listing of titles I was working on in this post:
```{r dependson="initChunk",eval=FALSE,echo=TRUE}
# this is the definition of country.table in analysis.R
country.table <- table(tx$Country,tx$Year)
# to save to a file in the format of the class handout
write.csv(as.matrix(country.table),file="countries2008-2012.csv")
```
Fun.]