@voltek62
Created February 5, 2019 19:56
Extract only the main textual content from an HTML page
# auto-install any missing packages
packages <- c("rJava", "boilerpipeR", "httr")
missing_packages <- setdiff(packages, rownames(installed.packages()))
if (length(missing_packages) > 0) {
  install.packages(missing_packages)
}
# Enjoy learning? https://dataseolabs.com
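# point JAVA_HOME at your local JRE before loading rJava
# (the path below is an example for a 64-bit Windows install; adjust it to your machine)
Sys.setenv(JAVA_HOME='C:\\Program Files\\Java\\jre1.8.0_181')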
# load your libraries
library(rJava)
library(boilerpipeR)
library(httr)
# your url
url <- "https://en.wikipedia.org/wiki/Application_programming_interface"
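# fetch the page with an HTTP GET request
req <- GET(url)
# stop with an error if the server returned an HTTP error status (4xx/5xx)
stop_for_status(req)
# extract the raw HTML as a UTF-8 string
html <- content(req, as = "text", encoding = "UTF-8")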
# extract the main article text, dropping navigation, menus and other boilerplate
txt <- ArticleSentencesExtractor(html)
print(txt)
# write the extracted text to a plain-text file
writeLines(txt, "result.txt")