benmarwick/gdoc-revisions-analysis.R

## gdoc-revisions-analysis.R
# related to my question here: https://github.com/tidyverse/googledrive/issues/218

# How to get access to specific revisions of a google drive document? Yes, we can do that.

# How to get the username for each revision, and the size of the document at each revision (bytes or words, assuming we are specifically talking about docs and not sheets or other types of files) Yes, we can do that.

# Goal is to measure the contributions of each author to a collaboratively-authored google doc with as little work as possible
# How to get access to specific revisions of a google drive document?

# Getting the revisions for a Google Doc

# Limitations:
# - When many users edit at the same time, we only get the name of the first editor active in that session. The others are not captured.
# - When many edits happen in a short period of time, these are combined by Google and we cannot see them individually. We do not have good temporal resolution.
# - Older edits are deleted by Google to save space. We do not know what the rules are for this.


library(googledrive)
library(tidyverse)

# replace this with the ID of your google doc
fileId <- "1s0CPFXnMQjZNts6gYAnkcGXGSAgugTupzMf8YeoCbps"

# Get the name of the file and some other metadata
file <- build_request(
  path = "drive/v3/files/{fileId}",
  method = "GET",
  params = list(
    fileId = fileId,
    fields = "*"
  ),
  token = drive_token()
)
file_ret <-  process_response(make_request(file))

# Now for this doc, query the Drive API to get get URLs and other meta-data for all the revisions available to us

req2 <- build_request(
  path = "drive/v2/files/{fileId}/revisions",
  method = "GET",
  params = list(
    fileId = fileId
  ),
  token = drive_token()
)
revs2 <-  process_response(make_request(req2))

# See
# https://developers.google.com/drive/api/v2/reference/revisions#resource
# for an explanation of each variable that we have here

# tidy revisions into a dataframe
revs2_df <-
  map_df(
    revs2$items,
    `[`,
    c(
      "kind",
      "etag" ,
      "id",
      "selfLink"   ,
      "mimeType"     ,
      "modifiedDate",
      "published"   ,
      "lastModifyingUserName"
    )
  )
# get exportLinks URLs out of its nest
revs2_export_url <- map_df(revs2$items, "exportLinks")
# bind together
revs2_df_bind <- bind_cols(revs2_df, revs2_export_url)

# Now we have a dataframe of edits that includes the date of the edit, the person who did the edit, and URLs to download the document as it was at that edit.

# Here are the download URLs, eg for plain text:

paths <- revs2_df_bind$`text/plain`
rev_ids <- revs2_df_bind$id

# I have found only one way to get the contents of these versions, and that is to download them using browseURL. This is very bad because we can't control the download location or the destination filename. And our system is jammed up while the broswer is busy. Furthermore, we get blocked if we download too fast, so we have to sleep for 1 sec in between. Even more annoying is that Chrome stops autonumerbing downloads with the same name at 100. That's a real pain if there are >100 revisions! Safari can do >100, but uses a different file-naming system. There must be a better way!
# Note that we need to be logged into to Google Drive in the browser before we attempt to download these files.

# this will download the plain text file for each revision
map(paths, function(x) {browseURL(x); Sys.sleep(1)})

# read in the files that we got from the versions

rev_file_names <- list.files("~/Downloads", # not ideal...
                             pattern = file_ret$name,
                             full.names = TRUE)

# get files in a useful order...
# get the rev idx, this depends on the browser, Chrome does (n).txt,
# Safari does -n.txt. Also we need to deal with the
# first download which has no number


#------ broswer specific actions ---------
# if using chrome...
rev_file_names_indx <-
map(rev_file_names,
    ~gsub("\\(([^()]+)\\)",
          "\\1",
          str_extract_all(.x, "\\(([^()]+)\\)")[[1]])) %>%
  map_dbl(., ~ifelse(length(.x) == 0,
                               0, as.numeric(.x)))

# if using safari
rev_file_names_indx <-
map(rev_file_names,
    ~.x %>% str_extract_all("-\\d*") %>%
      str_remove_all("-")) %>%
  map_dbl(., ~ifelse(identical(.x, "character(0)"),
                               1, as.numeric(.x)))
#------ end of broswer specific actions ---------

# into a data frame
rev_file_names_df <-
  data_frame(file_name = rev_file_names,
             idx = rev_file_names_indx) %>%
  arrange(idx)

# name of user for each revision is in, only get as many names as
# we could download files, in case that was rate limited:

user_names <- revs2_df_bind$lastModifyingUserName[1:nrow(rev_file_names_df)]
modifiedDate <- revs2_df_bind$modifiedDate[1:nrow(rev_file_names_df)]

# how to quantify difference between revisions? Word count seems
# resonable for simple text documents

rev_file_names_df_word_counts <-
rev_file_names_df %>%
  nest(-idx) %>%
  mutate(word_count = map_int(data,
                          ~readr::read_file(.x$file_name) %>%
                            str_count(., '\\w+'))) %>%
  select(-data) %>%
  mutate(diff_count =  word_count - lag(word_count)) %>%
  mutate(lastModifyingUserName = user_names,
         modifiedDate = modifiedDate) %>%
  separate(modifiedDate,
           into= c("ymd", "time"),
           sep = "T") %>%
  mutate(ymd = as.POSIXct(ymd)) %>%
  mutate(date_time = as.POSIXct(paste(ymd,
                                      time),
                                format="%Y-%m-%d %H:%M:%S"))

# clean up by deleting all the downloads, now that we've read them in
unlink(rev_file_names)

# now we've got a table of the revisions, with a user
# name for each revision, and a word count diff
#-------------------------------------------------------------------

# EDA

# Let's explore it a little with some plots

# total contribution by each author
rev_file_names_df_word_counts %>%
  group_by(lastModifyingUserName) %>%
  summarise(total_words_added = sum(diff_count, na.rm = TRUE)) %>%
  ggplot(aes(reorder(lastModifyingUserName,
                     -total_words_added),
             total_words_added)) +
  geom_col() +
  xlab("Author") +
  ylab("Total number of words contributed to the document") +
  theme_minimal()

# change in document size over time
library(lubridate)
rev_file_names_df_word_counts %>%
  ggplot(aes(ymd,
             diff_count,
             fill = lastModifyingUserName)) +
  geom_col() +
  scale_x_datetime(date_labels = "%b %Y") +
  xlab("Date") +
  ylab("Change in the number of words") +
  theme_minimal()


# Sniffing the web traffic
# when using the gdocs 'view revision history' shows that 'end' and 'ar'
# are GET variables that refers to the revision ID. 'end' is the revision
# ID of the rev that we want to look at when we click a rev on the list
# on the RHS. Can we make a GET request that uses this 'end' variable with
# googledrive?


## googledrive-article-revisions.Rmd
---
title: "Inspecting revision histories"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{File Identification}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

One of the most useful features of Google Docs is collaborative writing where multiple people can edit the same document at the same time. Data about the changes made to a document are captured by Google, and some of this is available to use via the API. This can be useful for identifying who has contributed to a document, how much the contributed, and when.

Before we get into the details, it's useful to know the limitations of the API. Our experiments indicate that when many users edit at the same time, we only get the name of the first editor active in that session. The others are not captured. This means we cannot expect very fine-grained revision data from documents that had many simultanoues edits. Similarly, when many edits happen in a short period of time, these are combined by Google and we cannot see them individually via the API. So the temporal resolution of the data provided by the API is limited. We have not yet investigated the exact resolution. Finally, older edits are deleted by Google to save space. We do not know what their rules are for this.

## Get revision IDs for a Google Doc

To get started, let's connect to our Google Doc and get a list of revisions. We'll need the ID of the Doc, which we can see in the URL of the Doc. We'll use some of the low-level functions in the googledrive package that make programming with the API simpler. Our `build_request` function will prompt for authentication in our browser, we'll need to paste the authorization code at the R console to continue. For a given Google Doc (in the native gdoc format, not a binary file), we can get a list of IDs of revisions like this:

```{r}
library(googledrive)
library(tidyverse)

# replace this with the ID of your google doc
fileId <- "1s0CPFXnMQjZNts6gYAnkcGXGSAgugTupzMf8YeoCbps"

request <- generate_request(
  endpoint = "drive.revisions.list",
  params = list(
    fileId = fileId
  ),
  token = drive_token()
)
revisions <- process_response(make_request(request))
```

The `revisions` object is a list where each item contains metadata about a specific revision of the document. In this document we have `length(revisions$revisions)` revisions. To get information about the meaning of each metadata field, take a look at <https://developers.google.com/drive/api/v3/reference/revisions#resource>.

Now we want a vector of the revision ids so we can iterate over them to get the full content of the file at each revision:

```{r}
all_revision_ids <- map_chr(revisions$revisions, "id") %>% as.numeric()
```

## Export the contents of the Google Doc at a given revision

Now we can write a little custom function to export the contents of the Google Doc at a specific revision. We choose to export the doc as a plain text file for simplicity.

```{r}
library(httr)

get_content_of_revisions <- function(fileId, revisionId){

url <- modify_url(
  url = "https://docs.google.com/feeds/download/documents/export/Export",
  query = list(
    id = fileId,
    revision = revisionId,
    exportFormat = "txt"
  )
)
x <- GET(url, verbose(), drive_token())
}
```

## Export the full text all of the available revisions of a Google Doc

We can use this function to contact the Google Drive API and get the content of each revision of the Google Doc. This will take a few moments, and a lot of text will scroll by in the R console:

```{r}
all_revisions_contents <- map2(fileId, all_revision_ids, get_content_of_revisions)
```

And finally we can convert the responses from the API into plain text, and tidy it a little bit, ready for some exploratory data analysis, etc.

```{r}
library(rvest)

all_revisions_contents_text <-
  map(all_revisions_contents,
      ~read_html(.x) %>%
        html_text() %>%
        str_squish())
```

The output here is a list where each item is a character vector with a length of one. Each item in the list holds the text of the google doc at a specific revision.

## Analyse the revisions of a Google Doc


--


## zzz-observations.md

      
    Raw
  

              zzz-observations.md
            
          
    When I look at the web traffic while navigating between revisions in a doc using File -> Version history -> See version history, I see that the main URL that delivers the content for a version looks like this:
https://docs.google.com/document/d/1s0CPFXnMQjZNts6gYAnkcGXGSAgugTupzMf8YeoCbps/showrevision?id=1s0CPFXnMQjZNts6gYAnkcGXGSAgugTupzMf8YeoCbps&end=53983&start=53898&smv=4&srfn=false&ern=false&token=AC4w5VgBbwN5o4jfoEoUNDwrkoTtT_b4vQ%3A1528698882690&ouid=102337699239108482617&includes_info_params=true
I can recognise the file ID, 1s0CPFXnMQjZNts6gYAnkcGXGSAgugTupzMf8YeoCbps, and I see start= and end= that refer to revision IDs that I recognise from the API. I do not know what token= and ouid= are, but I guess something to do with authentication. The URL will work fine without them, e.g. this will still get the JSON in my browser:
https://docs.google.com/document/d/1s0CPFXnMQjZNts6gYAnkcGXGSAgugTupzMf8YeoCbps/showrevision?id=1s0CPFXnMQjZNts6gYAnkcGXGSAgugTupzMf8YeoCbps&end=53983&start=53898
When I run this URL in my browser I get a JSON file that is that specific revision, it matches what I see on the Google Docs revision page. Are there any clues in this URL that hint at a way to use googledrive to download these revision files? Because that would be heaps better than using browseURL to get them.
	# related to my question here: https://github.com/tidyverse/googledrive/issues/218

	# How to get access to specific revisions of a google drive document? Yes, we can do that.

	# How to get the username for each revision, and the size of the document at each revision (bytes or words, assuming we are specifically talking about docs and not sheets or other types of files) Yes, we can do that.

	# Goal is to measure the contributions of each author to a collaboratively-authored google doc with as little work as possible
	# How to get access to specific revisions of a google drive document?

	# Getting the revisions for a Google Doc

	# Limitations:
	# - When many users edit at the same time, we only get the name of the first editor active in that session. The others are not captured.
	# - When many edits happen in a short period of time, these are combined by Google and we cannot see them individually. We do not have good temporal resolution.
	# - Older edits are deleted by Google to save space. We do not know what the rules are for this.


	library(googledrive)
	library(tidyverse)

	# replace this with the ID of your google doc
	fileId <- "1s0CPFXnMQjZNts6gYAnkcGXGSAgugTupzMf8YeoCbps"

	# Get the name of the file and some other metadata
	file <- build_request(
	path = "drive/v3/files/{fileId}",
	method = "GET",
	params = list(
	fileId = fileId,
	fields = "*"
	),
	token = drive_token()
	)
	file_ret <- process_response(make_request(file))

	# Now for this doc, query the Drive API to get get URLs and other meta-data for all the revisions available to us

	req2 <- build_request(
	path = "drive/v2/files/{fileId}/revisions",
	method = "GET",
	params = list(
	fileId = fileId
	),
	token = drive_token()
	)
	revs2 <- process_response(make_request(req2))

	# See
	# https://developers.google.com/drive/api/v2/reference/revisions#resource
	# for an explanation of each variable that we have here

	# tidy revisions into a dataframe
	revs2_df <-
	map_df(
	revs2$items,
	`[`,
	c(
	"kind",
	"etag" ,
	"id",
	"selfLink" ,
	"mimeType" ,
	"modifiedDate",
	"published" ,
	"lastModifyingUserName"
	)
	)
	# get exportLinks URLs out of its nest
	revs2_export_url <- map_df(revs2$items, "exportLinks")
	# bind together
	revs2_df_bind <- bind_cols(revs2_df, revs2_export_url)

	# Now we have a dataframe of edits that includes the date of the edit, the person who did the edit, and URLs to download the document as it was at that edit.

	# Here are the download URLs, eg for plain text:

	paths <- revs2_df_bind$`text/plain`
	rev_ids <- revs2_df_bind$id

	# I have found only one way to get the contents of these versions, and that is to download them using browseURL. This is very bad because we can't control the download location or the destination filename. And our system is jammed up while the broswer is busy. Furthermore, we get blocked if we download too fast, so we have to sleep for 1 sec in between. Even more annoying is that Chrome stops autonumerbing downloads with the same name at 100. That's a real pain if there are >100 revisions! Safari can do >100, but uses a different file-naming system. There must be a better way!
	# Note that we need to be logged into to Google Drive in the browser before we attempt to download these files.

	# this will download the plain text file for each revision
	map(paths, function(x) {browseURL(x); Sys.sleep(1)})

	# read in the files that we got from the versions

	rev_file_names <- list.files("~/Downloads", # not ideal...
	pattern = file_ret$name,
	full.names = TRUE)

	# get files in a useful order...
	# get the rev idx, this depends on the browser, Chrome does (n).txt,
	# Safari does -n.txt. Also we need to deal with the
	# first download which has no number


	#------ broswer specific actions ---------
	# if using chrome...
	rev_file_names_indx <-
	map(rev_file_names,
	~gsub("\\(([^()]+)\\)",
	"\\1",
	str_extract_all(.x, "\\(([^()]+)\\)")[[1]])) %>%
	map_dbl(., ~ifelse(length(.x) == 0,
	0, as.numeric(.x)))

	# if using safari
	rev_file_names_indx <-
	map(rev_file_names,
	~.x %>% str_extract_all("-\\d*") %>%
	str_remove_all("-")) %>%
	map_dbl(., ~ifelse(identical(.x, "character(0)"),
	1, as.numeric(.x)))
	#------ end of broswer specific actions ---------

	# into a data frame
	rev_file_names_df <-
	data_frame(file_name = rev_file_names,
	idx = rev_file_names_indx) %>%
	arrange(idx)

	# name of user for each revision is in, only get as many names as
	# we could download files, in case that was rate limited:

	user_names <- revs2_df_bind$lastModifyingUserName[1:nrow(rev_file_names_df)]
	modifiedDate <- revs2_df_bind$modifiedDate[1:nrow(rev_file_names_df)]

	# how to quantify difference between revisions? Word count seems
	# resonable for simple text documents

	rev_file_names_df_word_counts <-
	rev_file_names_df %>%
	nest(-idx) %>%
	mutate(word_count = map_int(data,
	~readr::read_file(.x$file_name) %>%
	str_count(., '\\w+'))) %>%
	select(-data) %>%
	mutate(diff_count = word_count - lag(word_count)) %>%
	mutate(lastModifyingUserName = user_names,
	modifiedDate = modifiedDate) %>%
	separate(modifiedDate,
	into= c("ymd", "time"),
	sep = "T") %>%
	mutate(ymd = as.POSIXct(ymd)) %>%
	mutate(date_time = as.POSIXct(paste(ymd,
	time),
	format="%Y-%m-%d %H:%M:%S"))

	# clean up by deleting all the downloads, now that we've read them in
	unlink(rev_file_names)

	# now we've got a table of the revisions, with a user
	# name for each revision, and a word count diff
	#-------------------------------------------------------------------

	# EDA

	# Let's explore it a little with some plots

	# total contribution by each author
	rev_file_names_df_word_counts %>%
	group_by(lastModifyingUserName) %>%
	summarise(total_words_added = sum(diff_count, na.rm = TRUE)) %>%
	ggplot(aes(reorder(lastModifyingUserName,
	-total_words_added),
	total_words_added)) +
	geom_col() +
	xlab("Author") +
	ylab("Total number of words contributed to the document") +
	theme_minimal()

	# change in document size over time
	library(lubridate)
	rev_file_names_df_word_counts %>%
	ggplot(aes(ymd,
	diff_count,
	fill = lastModifyingUserName)) +
	geom_col() +
	scale_x_datetime(date_labels = "%b %Y") +
	xlab("Date") +
	ylab("Change in the number of words") +
	theme_minimal()



	# Sniffing the web traffic
	# when using the gdocs 'view revision history' shows that 'end' and 'ar'
	# are GET variables that refers to the revision ID. 'end' is the revision
	# ID of the rev that we want to look at when we click a rev on the list
	# on the RHS. Can we make a GET request that uses this 'end' variable with
	# googledrive?
	---
	title: "Inspecting revision histories"
	output: rmarkdown::html_vignette
	vignette: >
	%\VignetteIndexEntry{File Identification}
	%\VignetteEngine{knitr::rmarkdown}
	%\VignetteEncoding{UTF-8}
	---

	```{r setup, include = FALSE}
	knitr::opts_chunk$set(
	collapse = TRUE,
	comment = "#>"
	)
	```

	One of the most useful features of Google Docs is collaborative writing where multiple people can edit the same document at the same time. Data about the changes made to a document are captured by Google, and some of this is available to use via the API. This can be useful for identifying who has contributed to a document, how much the contributed, and when.

	Before we get into the details, it's useful to know the limitations of the API. Our experiments indicate that when many users edit at the same time, we only get the name of the first editor active in that session. The others are not captured. This means we cannot expect very fine-grained revision data from documents that had many simultanoues edits. Similarly, when many edits happen in a short period of time, these are combined by Google and we cannot see them individually via the API. So the temporal resolution of the data provided by the API is limited. We have not yet investigated the exact resolution. Finally, older edits are deleted by Google to save space. We do not know what their rules are for this.

	## Get revision IDs for a Google Doc

	To get started, let's connect to our Google Doc and get a list of revisions. We'll need the ID of the Doc, which we can see in the URL of the Doc. We'll use some of the low-level functions in the googledrive package that make programming with the API simpler. Our `build_request` function will prompt for authentication in our browser, we'll need to paste the authorization code at the R console to continue. For a given Google Doc (in the native gdoc format, not a binary file), we can get a list of IDs of revisions like this:

	```{r}
	library(googledrive)
	library(tidyverse)

	# replace this with the ID of your google doc
	fileId <- "1s0CPFXnMQjZNts6gYAnkcGXGSAgugTupzMf8YeoCbps"

	request <- generate_request(
	endpoint = "drive.revisions.list",
	params = list(
	fileId = fileId
	),
	token = drive_token()
	)
	revisions <- process_response(make_request(request))
	```

	The `revisions` object is a list where each item contains metadata about a specific revision of the document. In this document we have `length(revisions$revisions)` revisions. To get information about the meaning of each metadata field, take a look at <https://developers.google.com/drive/api/v3/reference/revisions#resource>.

	Now we want a vector of the revision ids so we can iterate over them to get the full content of the file at each revision:

	```{r}
	all_revision_ids <- map_chr(revisions$revisions, "id") %>% as.numeric()
	```

	## Export the contents of the Google Doc at a given revision

	Now we can write a little custom function to export the contents of the Google Doc at a specific revision. We choose to export the doc as a plain text file for simplicity.

	```{r}
	library(httr)

	get_content_of_revisions <- function(fileId, revisionId){

	url <- modify_url(
	url = "https://docs.google.com/feeds/download/documents/export/Export",
	query = list(
	id = fileId,
	revision = revisionId,
	exportFormat = "txt"
	)
	)
	x <- GET(url, verbose(), drive_token())
	}
	```

	## Export the full text all of the available revisions of a Google Doc

	We can use this function to contact the Google Drive API and get the content of each revision of the Google Doc. This will take a few moments, and a lot of text will scroll by in the R console:

	```{r}
	all_revisions_contents <- map2(fileId, all_revision_ids, get_content_of_revisions)
	```

	And finally we can convert the responses from the API into plain text, and tidy it a little bit, ready for some exploratory data analysis, etc.

	```{r}
	library(rvest)

	all_revisions_contents_text <-
	map(all_revisions_contents,
	~read_html(.x) %>%
	html_text() %>%
	str_squish())
	```

	The output here is a list where each item is a character vector with a length of one. Each item in the list holds the text of the google doc at a specific revision.

	## Analyse the revisions of a Google Doc




	--