
PAA 2019 Web Scraping

Chad Pickering 5/29/2019

library(knitr)
opts_chunk$set(eval=FALSE)

Background

Accompanying slides are here: https://ucla.box.com/v/slides-webscrape-2019

We will be scraping information from the Population Association of America's 2019 annual meeting website.

See http://paa2019.populationassociation.org/days. Here, we see that there are multiple papers within each topic, each paper with multiple authors, most of whom have a corresponding university/institution. We want to create a dataset with all papers across all sessions for all three days of the conference, where a row represents one author on one particular paper. We also want to keep track of when and where each paper is showcased. In summary, here are the variables we want to have in our dataset:

  • Session number
  • Session title
  • Paper name
  • Author name
  • Institution name
  • Location
  • Date
  • Time
  • Poster? (binary variable)

We can use this dataset to understand the representation and frequency of institutions, authors, and topics at the conference.

Scraping

Install rvest, a package in Hadley Wickham's tidyverse. See more here: https://rvest.tidyverse.org/

Alternatives include scrapeR and RCurl for R, and BeautifulSoup for Python.

# install.packages("rvest")
library(rvest)
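
Before touching the live site, here is a minimal sketch of the extract-then-text workflow on an inline HTML string (a toy snippet we made up, not the PAA site):

# Parse a small HTML string and pull the text out of a selected node
toy <- read_html("<div class='daytime'>Thursday / 8:00 AM • Room 101</div>")
toy %>%
  html_nodes("div.daytime") %>%
  html_text()
# [1] "Thursday / 8:00 AM • Room 101"

The same two calls, html_nodes() and html_text(), do almost all of the work in this tutorial.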

Work Through One Example Result

Choose a session number - we choose 2 for now because it contains all of the information we want to target (session 1 leaves out paper names because its participants are all speakers). Assume all sessions are structured identically until we have evidence to the contrary. Assign the URL string to a variable and pass it to the function read_html(), which reads the webpage into R as XML (the content/document and nodes).

If we view the object paa_html, we can see that it is an xml_document that contains two main nodes, <head> and <body>. Nodes are like signposts that show us where to look in the ocean of HTML text. As we'll see, nodes will be instrumental in showing us where specific pieces of information are that we would like to extract.
paa_url <- "http://paa2019.populationassociation.org/sessions/2"
paa_html <- read_html(paa_url)
paa_html
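
As a quick aside, you can peek at the top-level structure directly from R; a minimal sketch using rvest's html_children():

# List the immediate child nodes of the document (here, <head> and <body>)
paa_html %>%
  html_children()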

The first piece of information we would like is the location, date, and time, since they seem to be very close together and we can probably extract them all at once. Looking at the website itself using right click -> Inspect, we navigate our way to that block of text, clicking on the dropdowns and seeing what highlights on the site itself. We narrow it down to the node <div class="daytime">. Use the function `html_nodes()` to extract pieces out of HTML documents - the notation used here is conventional for CSS selectors. Then use the `html_text()` function to see the text. Note that the pipe (`%>%`) just feeds the result of one statement to the next.
paa_html %>%
  html_nodes("div.daytime") %>%
  html_text()

This looks like a mess, but it's exactly what we want - the date, time, and location are all in the string, with the date and time separated by a forward slash and the location set off by a dot (•). We can easily use regular expressions to clean up this string and later separate these fields into three separate columns in a data frame.

The \r\n characters that you see are line separators, which can be removed using the regular expression ^\\s+|\\s+$, which targets any run of whitespace characters at the beginning or end of the string. Implement this using the gsub() function, where the first argument is the regular expression to match, the second is the replacement string, and the third, here denoted with a ., is the input string - in our case, the entire string we extracted.
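
As a toy illustration (the string below is made up), the same pattern strips the padding; base R's trimws() happens to do the equivalent job:

x <- "\r\n   Thursday, April 11 / 8:00 AM • Hilton   \r\n"
gsub("^\\s+|\\s+$", "", x)   # "Thursday, April 11 / 8:00 AM • Hilton"
trimws(x)                    # same result via the built-in helper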

### Day/time/place raw column 
paa_html %>%
  html_nodes("div.daytime") %>%
  html_text() %>%
  gsub("^\\s+|\\s+$", "", .)

day_time_place <- paa_html %>%
  html_nodes("div.daytime") %>%
  html_text() %>%
  gsub("^\\s+|\\s+$", "", .)

Now we try to extract the session number and session title.

paa_html %>%
  html_nodes("table") %>%
  html_text()

Can we be a little more specific?

paa_html %>%
  html_nodes("table") %>%
  html_nodes("h2") %>%
  html_text() 

We now have what we wanted, but we need to figure out a way to separate the session number and title with no obvious delimiter. First name the string session_raw.

session_raw <- paa_html %>%
  html_nodes("table") %>%
  html_nodes("h2") %>%
  html_text() 

The regexpr() function returns the position of the first substring matched by a regular expression in each string, along with the length of that match. To learn more about regular expressions (powerful for defining symbolic patterns), try this intro guide at your own leisurely pace: https://www.oreilly.com/ideas/an-introduction-to-regular-expressions

regexpr("^Session ([0-9]{1,3})", session_raw)

The regmatches() function takes in the position and length of the substring and extracts the substring(s) of interest. It is very common to use regexpr() and regmatches() together like this.

regmatches(session_raw, regexpr("^Session ([0-9]{1,3})", session_raw))

We only want the session number, so use gsub() again to get rid of any letters and spaces in the substring to leave only the number.

session_no_index <- regmatches(session_raw, regexpr("^Session ([0-9]{1,3})", session_raw))
session_no <- gsub("[A-Za-z]+ ", "", session_no_index)
session_no
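
For what it's worth, the capture group also enables a one-step alternative; a sketch using sub() and a backreference (equivalent result, assuming the string starts with "Session <number>"):

# Replace the whole string with only what ([0-9]{1,3}) captured
sub("^Session ([0-9]{1,3}).*$", "\\1", session_raw)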

The title is a bit easier, since we can just remove the substring we extracted earlier from the larger string.

session_title <- gsub("Session [0-9]{1,3}", "", session_raw)
session_title

Lastly, we need the paper names with corresponding authors and institutions. We see that <div id="papers" class="translate"> is the object with all of the papers, names, and institutions. Notice that instead of using the dot in the form div.class_name, we now use a hash symbol - div#id_name - to select by id rather than class.

paa_html %>% 
  html_nodes("div#papers") %>%
  html_text()

We see that all of the papers are separated by the number of the paper and a lot of surrounding whitespace. Let's use this to our advantage to split this very long string into substrings, one for each paper (we will deal with splitting up authors later). We use the function strsplit() with a delimiter in the form of a regular expression targeting numbers followed by periods, while also swallowing the whitespace on either side.

papers_raw <- paa_html %>% 
  html_nodes("div#papers") %>%
  html_text()
  
strsplit(papers_raw, "\\s+[0-9]+[.]\\s+")

Using unlist() "unravels" the outer layer of the list so now we have 7 elements instead of 1 element with 7 sub-elements. After unlisting, we get rid of the first element because it is an empty string.

unlist(strsplit(papers_raw, "\\s+[0-9]+[.]\\s+"))

unlist(strsplit(papers_raw, "\\s+[0-9]+[.]\\s+"))[-1]

all_papers_raw <- unlist(strsplit(papers_raw, "\\s+[0-9]+[.]\\s+"))[-1]
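
To see what the split is doing, here is the same call on a short hypothetical string shaped like the page text:

x <- "  1.   First Paper Title • Ada Author  2.   Second Paper Title • Bo Author"
strsplit(x, "\\s+[0-9]+[.]\\s+")
# yields "", "First Paper Title • Ada Author", "Second Paper Title • Bo Author"
# the empty first element is why we dropped it with [-1] above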

Now we need to split up those papers into their author/institution components. Let's do this for the first paper and then make a loop to automate the process for each paper.

The first element of all_papers_raw is a string with all of the information for paper 1: paper name, authors, institutions. We use the dot (•) as a delimiter and see that the result is another "nested" list, so we unlist it. It contains two elements, the first being the paper name, the second being the authors/institutions.

unlist(strsplit(all_papers_raw[1], "•"))

split_by_dot <- unlist(strsplit(all_papers_raw[1], "•"))

We expect the length of split_by_dot to be 2, but if it's 1, we set the paper title to NA because none exists. The length would be 1 if the particular session contained no papers and only speakers from various institutions. If length is not 1 (so, 2), we strip out any whitespace at the end of the first substring (split_by_dot[1]) and assign it to be paper_title.

ifelse(length(split_by_dot)==1, NA, gsub("[[:space:]]*$", "", split_by_dot[1]))

paper_title <- ifelse(length(split_by_dot)==1, NA, gsub("[[:space:]]*$", "", split_by_dot[1]))

Now we deal with the second element split_by_dot[2], the one with the authors and institutions. If we are dealing with a session with no paper titles, the authors (speakers) will be in the first element, so use strsplit() with a semi-colon delimiter to create one string per author/institution. Otherwise, authors are in the second element.

The result is a "nested" list with one element that contains one sub-element per author/institution. Unlist this as usual.

ifelse(length(split_by_dot)==1, strsplit(split_by_dot[1], ";"), 
                                 strsplit(split_by_dot[2], ";"))

split_by_semicolon <- ifelse(length(split_by_dot)==1, strsplit(split_by_dot[1], ";"), 
                                 strsplit(split_by_dot[2], ";"))

unlist(split_by_semicolon)

split_by_semicolon <- unlist(split_by_semicolon)

Now we can strip away all whitespace at the beginning of each author string, and assign to a variable author_institution_vector.

gsub("^\\s+", "", split_by_semicolon)

author_institution_vector <- gsub("^\\s+", "", split_by_semicolon)

Lastly, we want to create a vector that repeats the title of the paper as many times as there are authors using the rep() function.

rep(paper_title, length(author_institution_vector))

paper_col <- rep(paper_title, length(author_institution_vector))

What we end up with is a data frame with a paper column and an author/institution column.

data.frame(cbind(paper_col, author_institution_vector))

Iterate

Iterate Across the Papers, for 1 Session

We want to do that process for each paper in our session. Time to use a for loop.

We need the total number of papers in a session so that we can tell our for loop how many iterations to complete. To do this, we count the span.rank nodes inside the papers object straight from the website. The number of nodes in this nodeset always equals the number of papers in a session, so take this length and assign it to the variable npapers.

paa_html %>%
  html_nodes("div#papers") %>%
  html_nodes("span.rank")
paa_html %>%
  html_nodes("div#papers") %>%
  html_nodes("span.rank") %>%
  length()

npapers <- paa_html %>%
  html_nodes("div#papers") %>%
  html_nodes("span.rank") %>%
  length()

Construct the for loop. Substitute the element number for p, which will iterate from 1 to the number of papers in the session. All of the code within the loop is what we just wrote, except for the last part, an if-else statement. If we are on the first iteration of the loop, we generate the initial data frame, so we assign that to session_df, the name of the data frame that will eventually contain all of the information for the current session. With each subsequent iteration of the for loop, we append our newly created data frame from the current paper, new_df, to the existing and ever-growing session_df using the function rbind().

for(p in 1:npapers){
  split_by_dot <- unlist(strsplit(all_papers_raw[p], "•")) 
  
  paper_title <- ifelse(length(split_by_dot)==1, NA, gsub("[[:space:]]*$", "", split_by_dot[1]))
    
  split_by_semicolon <- ifelse(length(split_by_dot)==1, strsplit(split_by_dot[1], ";"), 
                                 strsplit(split_by_dot[2], ";"))
  split_by_semicolon <- unlist(split_by_semicolon)
  author_institution_vector <- gsub("^\\s+", "", split_by_semicolon)
    
  paper_col <- rep(paper_title, length(author_institution_vector))
  new_df <- data.frame(cbind(paper_col, author_institution_vector))
    
  if(p == 1){
    session_df <- new_df
  } else {
    session_df <- rbind(session_df, new_df)
  }
}

session_df
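
As an aside, growing session_df with rbind() inside the loop works but copies the data frame on every iteration. A sketch of an equivalent, more idiomatic pattern (same parsing logic, wrapped in a helper; parse_paper is a name we introduce here, not from the tutorial):

# Parse one raw paper string into a small data frame
parse_paper <- function(raw) {
  split_by_dot <- unlist(strsplit(raw, "•"))
  paper_title <- ifelse(length(split_by_dot)==1, NA,
                        gsub("[[:space:]]*$", "", split_by_dot[1]))
  idx <- if(length(split_by_dot)==1) 1 else 2
  authors <- gsub("^\\s+", "", unlist(strsplit(split_by_dot[idx], ";")))
  data.frame(paper_col = rep(paper_title, length(authors)),
             author_institution_vector = authors)
}

# Bind all per-paper frames at once instead of growing row by row
do.call(rbind, lapply(all_papers_raw, parse_paper))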

Iterate Across the Sessions, for 1 Website

Now, we can fit everything together into another OUTER for loop to create our final data frame.

The variable paa_df is an empty data frame that we will begin filling with every iteration of the for loop. We hard-code the number of sessions (not recommended) - the full conference has 252, but for illustration we scrape only 5. We paste together the main portion of the URL for a session and its number, which is the value of the current iteration of the loop, then read in the HTML. All of the code is as we've already demonstrated.

The day/time/place string, session number string, and session title string are all repeated the number of times equal to the number of rows in the session-specific data frame session_df. The session_df data frame is an object that is created in every iteration of the loop and appended to the main data frame paa_df.

Do NOT spam/flood/attack the website

  • For illustration, we only iterate across 5 sessions - respect the website's resources.
  • For each iteration, build in a wait time (two reasons; a pacing sketch follows below):
  1. Respect the website's resources.
  2. And ??? (What if you have a slow internet connection?)
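For reference, one way to fill in the Sys.sleep() blank inside the loop below (the 2-5 second range is an arbitrary choice on our part, not a site requirement):

# Pause a random 2-5 seconds between requests to pace the scraper
Sys.sleep(runif(1, min = 2, max = 5))
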
paa_df <- data.frame()

# swap to 5 for illustration, do not spam/flood/attack the website

nsessions <- 5 # ideally needs to be automated

for(sess in 1:nsessions){
  
  # IMPORTANT
  stop("Don't be scum; respect the website. Intentionally build in wait time - see ?Sys.sleep")
  
  # Comment out the `stop()` line above
  # AND
  # Fill in Sys.sleep() Code Here
  
  
  paa_url <- paste0("http://paa2019.populationassociation.org/sessions/", sess) 
  paa_html <- read_html(paa_url)
  
  ### Day/time/place raw column (will be split later)
  day_time_place <- paa_html %>%
    html_nodes("div.daytime") %>%
    html_text() %>%
    gsub("^\\s+|\\s+$", "", .)
  
  ### Session number & session title:
  session_raw <- paa_html %>%
    html_nodes("table") %>%
    html_nodes("h2") %>%
    html_text() 
  
  session_no_index <- regmatches(session_raw, regexpr("^Session ([0-9]{1,3})", session_raw))
  session_no <- gsub("[A-Za-z]+ ", "", session_no_index)
  
  session_title <- gsub("Session [0-9]{1,3}", "", session_raw)
  
  ### Paper name/author name/institution name:
  papers_raw <- paa_html %>% # raw form
    html_nodes("div#papers") %>%
    html_text()
  
  all_papers_raw <- unlist(strsplit(papers_raw, "\\s+[0-9]+[.]\\s+"))[-1]
  
  # session_df <- data.frame()
  
  # total number of papers for a session:
  npapers <- paa_html %>%
    html_nodes("div#papers") %>%
    html_nodes("span.rank") %>%
    length()
  
  for(p in 1:npapers){
    split_by_dot <- unlist(strsplit(all_papers_raw[p], "•")) # first element is paper name, second element is author/institution
    paper_title <- ifelse(length(split_by_dot)==1, NA, gsub("[[:space:]]*$", "", split_by_dot[1]))
    
    split_by_semicolon <- ifelse(length(split_by_dot)==1, strsplit(split_by_dot[1], ";"), 
                                 strsplit(split_by_dot[2], ";"))
    split_by_semicolon <- unlist(split_by_semicolon)
    author_institution_vector <- gsub("^\\s+", "", split_by_semicolon)
    
    paper_col <- rep(paper_title, length(author_institution_vector))
    new_df <- data.frame(cbind(paper_col, author_institution_vector))
    
    if(p == 1){
      session_df <- new_df
    } else {
      session_df <- rbind(session_df, new_df)
    }
  }
  
  date_col <- rep(day_time_place, nrow(session_df))
  session_no_col <- rep(session_no, nrow(session_df))
  session_title_col <- rep(session_title, nrow(session_df))
  
  session_df <- cbind(session_no_col, session_title_col, session_df, date_col)
  
  if(sess == 1){
    paa_df <- session_df
  } else {
    paa_df <- rbind(paa_df, session_df)
  }
  
}

View(paa_df)
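
A defensive variant worth considering (a sketch, not part of the original tutorial): wrap each request in tryCatch() so that one failed page doesn't abort the whole loop.

# Inside the session loop, replace the plain read_html() call with:
paa_html <- tryCatch(read_html(paa_url), error = function(e) NULL)
if(is.null(paa_html)) next  # skip this session and keep going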

We then do essentially the same thing to scrape the poster sessions - there are 11 in total, but again for illustration we only iterate across 3. Note that the URL has the same form, with the poster session number pasted to the end (prefixed with P).

ps_df <- data.frame()

npsessions <- 3 # ideally needs to be automated

for(psess in 1:npsessions){
  
  # IMPORTANT
  stop("Don't be scum; respect the website. Intentionally build in wait time - see ?Sys.sleep")
  
  # Comment out the `stop()` line above
  # AND
  # Fill in Sys.sleep() Code Here
  
  
  
  ps_url <- paste0("http://paa2019.populationassociation.org/sessions/P", psess)
  ps_html <- read_html(ps_url)
  
  ### Day/time/place raw column (will be split later)
  day_time_place <- ps_html %>%
    html_nodes("div.daytime") %>%
    html_text() %>%
    gsub("^\\s+|\\s+$", "", .)
  
  ### Session number & session title:
  session_raw <- ps_html %>%
    html_nodes("table") %>%
    html_nodes("h2") %>%
    html_text() 
  
  session_no_index <- regmatches(session_raw, regexpr("^Poster Session\\s ([0-9]+)", session_raw))
  session_no <- gsub("[A-Za-z ]+ ", "", session_no_index)
  
  session_title <- gsub("Poster Session\\s [0-9]+", "", session_raw)
  
  ### Poster name/author name/institution name:
  papers_raw <- ps_html %>% # raw form
    html_nodes("div#papers") %>%
    html_text()
  
  all_papers_raw <- unlist(strsplit(papers_raw, "\\s+[0-9]{1,3}[.]\\s+"))[-1]
  
  # session_df <- data.frame()
  
  # total number of papers for a session:
  nposters <- ps_html %>%
    html_nodes("div#papers") %>%
    html_nodes("span.rank") %>%
    length()
  
  for(p in 1:nposters){
    split_by_dot <- unlist(strsplit(all_papers_raw[p], "•")) # first element is paper name, second element is author/institution
    paper_title <- ifelse(length(split_by_dot)==1, NA, gsub("[[:space:]]*$", "", split_by_dot[1]))
    
    split_by_semicolon <- ifelse(length(split_by_dot)==1, strsplit(split_by_dot[1], ";"), 
                                 strsplit(split_by_dot[2], ";"))
    split_by_semicolon <- unlist(split_by_semicolon)
    author_institution_vector <- gsub("^\\s+", "", split_by_semicolon)
    
    paper_col <- rep(paper_title, length(author_institution_vector))
    new_df <- data.frame(cbind(paper_col, author_institution_vector))
    
    if(p == 1){
      session_df <- new_df
    } else {
      session_df <- rbind(session_df, new_df)
    }
  }
  
  date_col <- rep(day_time_place, nrow(session_df))
  session_no_col <- rep(session_no, nrow(session_df))
  session_title_col <- rep(session_title, nrow(session_df))
  
  session_df <- cbind(session_no_col, session_title_col, session_df, date_col)
  
  if(psess == 1){
    ps_df <- session_df
  } else {
    ps_df <- rbind(ps_df, session_df)
  }
  
}

View(ps_df)

Both data frames need to have the same column names before combining. The final data frame is called sessions.

# Rename variables and combine data frames:
colnames(paa_df) <- c("session_no", "session_title", "project_name", "author_institution", "date_place")
colnames(ps_df) <- colnames(paa_df)

sessions <- rbind(paa_df, ps_df)

View(sessions)

It is useful to discriminate between those rows that describe a poster and those that do not since session numbers 1-11 are used twice, once for papers and once for posters. Make a binary variable that is set to 1 if a row is associated with a poster and 0 otherwise.

# Binary column - is this a poster or not?
sessions$poster <- c(rep(0, nrow(paa_df)), rep(1, nrow(ps_df)))

We need the columns we'll be cleaning up to be of the character class, so let's coerce those to strings so that they are no longer factors.

sessions$project_name <- as.character(sessions$project_name)
sessions$author_institution <- as.character(sessions$author_institution)
sessions$date_place <- as.character(sessions$date_place)
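
Alternatively (a sketch), you can sidestep factors entirely by building each per-paper frame without cbind() and with stringsAsFactors = FALSE (the default from R 4.0 onward):

# Inside the paper loops, instead of data.frame(cbind(...)):
new_df <- data.frame(paper_col, author_institution_vector,
                     stringsAsFactors = FALSE)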

Let's first split author and institution into separate columns.

Using regexpr(), we see that the location of the first comma in each string is printed in a vector as long as the column itself, followed by a vector of the lengths of these matches (all 1, except for strings that do not contain a comma, which are denoted by length -1).

Use regmatches() to extract the substrings just like we did before. However, notice that our goal is not to extract the commas but instead to extract everything else. For this, we use the argument invert = TRUE to output both halves of the string that was split at the comma. Notice that each pair of substrings is grouped as its own list.

regexpr(",", sessions$author_institution)

regmatches(sessions$author_institution, 
                                regexpr(",", sessions$author_institution), invert = TRUE)

split_author_inst <- regmatches(sessions$author_institution, 
                                regexpr(",", sessions$author_institution), invert = TRUE)

Then, extract the first sub-element of each list of 2 - these are our authors. We'll need to use unlist() to remove the outer list structure.

unlist(lapply(split_author_inst, `[[`, 1))

sessions$author <- unlist(lapply(split_author_inst, `[[`, 1))

Lastly, use gsub() to clean things up. We notice that there is sometimes a space after the name followed by an optional period followed by even more whitespace - we replace this with an empty string to remove it. Reassign this cleaned vector to the column.

gsub("\\s*[.]?\\s*$", "", sessions$author)

sessions$author <- gsub("\\s*[.]?\\s*$", "", sessions$author)

Institution is a little trickier because sometimes there isn't one (those lists of 2 above are sometimes lists of 1). For each row, let's see how many elements are in each list - this will tell us whether we can in fact extract an institution or whether we should put NA in that cell instead. The inst vector stores the second element in the list if there is one to store, or an NA if not. We make this vector our new institution column.

Here, why don't we need to wait using Sys.sleep()?

sapply(split_author_inst, length)

author_inst_length <- sapply(split_author_inst, length)

inst <- c()
for(i in 1:length(author_inst_length)){
  inst[i] <- ifelse(author_inst_length[i]==2, split_author_inst[[i]][2], NA)
}

sessions$institution <- inst
inst
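
The same result can be had without an explicit loop; a sketch using sapply():

# For each author/institution pair, take the second piece if present, else NA
inst <- sapply(split_author_inst, function(x) if(length(x)==2) x[2] else NA)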

Finally, we clean the column. Notice that there is a space at the beginning of each string and perhaps a period or some more whitespace at the end - remove with gsub().

sessions$institution <- gsub("^\\s*|[.]?\\s*[.]?$", "", sessions$institution)

For the place column, we use strsplit() to split on the dot (•) delimiter so that we have a list structure we recognize. Let's take the second sub-elements in each list and make those our location column, then clean any whitespace at the beginning of the strings.

strsplit(sessions$date_place, "•")

split_place <- strsplit(sessions$date_place, "•")

sapply(split_place, '[[', 2)

sessions$location <- sapply(split_place, '[[', 2)

gsub("^\\s*", "", sessions$location)

sessions$location <- gsub("^\\s*", "", sessions$location)

Take the first sub-elements of the split_place list above and make another vector called date_time. Then use strsplit() on the / delimiter to split the strings in two.

sapply(split_place, '[[', 1)

date_time <- sapply(split_place, '[[', 1)

strsplit(date_time, "/")

split_date_time <- strsplit(date_time, "/")

And finally, like we've been doing all along, make the first sub-element in the list the date column and the second sub-element the time column. Remove whitespace at the beginning and end of each string with gsub().

sapply(split_date_time, '[[', 1)

sessions$date <- sapply(split_date_time, '[[', 1)

gsub("^\\s*|\\s*$", "", sessions$date)

sessions$date <- gsub("^\\s*|\\s*$", "", sessions$date)

sapply(split_date_time, '[[', 2)

sessions$time <- sapply(split_date_time, '[[', 2)

gsub("^\\s*|\\s*$", "", sessions$time)

sessions$time <- gsub("^\\s*|\\s*$", "", sessions$time)

Subset the columns in the order desired to create the final version of the data frame.

sessions <- sessions[,c(1:3, 7:11, 6)]

View(sessions)

If you want to export this data frame or subsets thereof to a CSV file, use the write.csv() function.

# Save as csv:
write.csv(sessions[sessions$poster==0,], file="paa2019_sessions.csv")
write.csv(sessions[sessions$poster==1,], file="paa2019_posters.csv")