@tts
Created July 5, 2012 12:46
Parsing an HTML page to get a list of Twitter screen names
########################################################################################
#
# Make a list of Twitter screen names from
# http://yle.fi/extrem/artikel/musiknoje/44988-Har-ar-svenskfinlands-basta-twittrare
#
# AFAIK there is no Twitter list, so we need HTML parsing
#
# Tuija Sonkkila
# 2012-07-05
library(RCurl)
library(RJSONIO)
library(XML)
url <- "http://yle.fi/extrem/artikel/musiknoje/44988-Har-ar-svenskfinlands-basta-twittrare"
d <- getURL(url)
doc <- htmlParse(d)
# Store into a list (src) all href attributes of those link elements which refer to Twitter
# (there are no extra, unrelated links on the page)
src <- xpathApply(doc, "//a[starts-with(@href, 'https://twitter.com/')]", xmlGetAttr, "href")
# Apply a find&replace function over all list elements, to get the screen names only
ppl <- lapply(src, function(x) gsub("https://twitter.com/", "", x))
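# A minimal, self-contained sketch of the same extraction on an inline HTML snippet,
# handy for checking the XPath and the prefix removal without fetching the page.
# Assumptions (not from the original page): the sample markup and the names
# sample_html, sample_doc, sample_src and sample_ppl are made up for illustration.
library(XML)
sample_html <- paste0(
  "<html><body>",
  "<a href='https://twitter.com/alice'>Alice</a>",
  "<a href='http://example.com/'>Unrelated</a>",
  "<a href='https://twitter.com/bob'>Bob</a>",
  "</body></html>")
sample_doc <- htmlParse(sample_html, asText = TRUE)
# xpathSApply simplifies the result to a character vector instead of a list
sample_src <- xpathSApply(sample_doc, "//a[starts-with(@href, 'https://twitter.com/')]", xmlGetAttr, "href")
# fixed = TRUE treats the prefix as a literal string, so the dots are not regex wildcards
sample_ppl <- sub("https://twitter.com/", "", sample_src, fixed = TRUE)
# Only the two Twitter links should remain, reduced to their screen names
print(sample_ppl)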
# During the text mining phase later on, I got errors that I assumed were related to
# the character encoding of the corpus (latin1). This conversion helped:
#
# tweets.utf8 <- iconv(tweets, "latin1", "UTF-8")
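#
# A small sketch of that conversion on a made-up string. The tweets vector from the
# text mining phase is not part of this script, so tweets.sample is hypothetical.
tweets.sample <- "caf\xe9"   # "cafe" with an accented e, as raw latin1 bytes
tweets.sample.utf8 <- iconv(tweets.sample, "latin1", "UTF-8")
# validUTF8() (base R 3.3.0 or later) reports whether each string is valid UTF-8
print(validUTF8(tweets.sample.utf8))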