@thirdwing
Created December 21, 2015 20:03
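# Fetch the Wikipedia article on n-grams and extract its paragraph text.
library(RCurl)
library(XML)
# Download the raw HTML as a single string.
webpage <- getURL("https://en.wikipedia.org/wiki/N-gram")
# Split the HTML into lines via a text connection.
tc <- textConnection(webpage)
webpage <- readLines(tc)
close(tc)
# Parse the HTML, silencing parser errors; internal nodes are needed for XPath queries.
pagetree <- htmlTreeParse(webpage, error = function(...) {}, useInternalNodes = TRUE)
# Extract the text of every <p> element.
x <- xpathSApply(pagetree, "//p", xmlValue)
# One element per line, tabs removed, surrounding whitespace trimmed.
x <- unlist(strsplit(x, "\n"))
x <- gsub("\t", "", x)
x <- sub("^[[:space:]]*(.*?)[[:space:]]*$", "\\1", x, perl = TRUE)
# Drop empty strings and stray "|" separators.
x <- x[!(x %in% c("", "|"))]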