R: scrape multiple pages with XML and readHTMLTable
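# Scrape a paginated HTML table and stack the pages into one data frame.
library(XML)   # htmlParse, getNodeSet, xmlSize, readHTMLTable
library(plyr)  # ldply

base.url <- "http://www.ttmeiju.com/meiju/Movie.html?page"

# Fetch one results page and return its table as a data frame.
GetTable <- function(page.number) {
  # Build the page URL by joining base.url and the page number with "="
  full.url <- paste(base.url, page.number, sep = "=")
  # Parse the page as GBK-encoded HTML
  doc <- htmlParse(full.url, encoding = "GBK")
  # Take the second <table> node on the page
  node <- getNodeSet(doc, "//table")[[2]]
  last.row <- xmlSize(node) - 1
  # Skip row 1 and row last.row; keep text columns as character
  readHTMLTable(node,
                header = TRUE,
                skip.rows = c(1, last.row),
                trim = TRUE,
                stringsAsFactors = FALSE)
}

# Scrape pages 1 to 10 and row-bind the results.
ttm.tables <- ldply(1:10, GetTable, .progress = "text")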