Created
July 5, 2012 12:46
Parsing an HTML page to get a list of Twitter screen names
########################################################################################
#
# Make a list of Twitter screen names from
# http://yle.fi/extrem/artikel/musiknoje/44988-Har-ar-svenskfinlands-basta-twittrare
#
# AFAIK there is no Twitter list, so we need HTML parsing
#
# Tuija Sonkkila
# 2012-07-05
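library(RCurl)
library(RJSONIO)
library(XML)

url <- "http://yle.fi/extrem/artikel/musiknoje/44988-Har-ar-svenskfinlands-basta-twittrare"

# Fetch the raw HTML as a single string
d <- getURL(url)

# Parse the string into an HTML document tree (asText marks d as content, not a file name)
doc <- htmlParse(d, asText = TRUE)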

# Store into a list (src) all href attributes of those link elements which refer to Twitter
# (there are no extra, unrelated links on the page)
src <- xpathApply(doc, "//a[starts-with(@href, 'https://twitter.com/')]", xmlGetAttr, "href")

# Apply a find&replace function over all list elements, to get the screen names only
ppl <- lapply(src, function(x) gsub("https://twitter.com/", "", x))
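
# A minimal sketch, not part of the original script: flatten the list of screen
# names into a plain character vector and tidy it up. The helper name clean.names
# and the clean-up steps (cutting anything after a "/", "?" or "#", dropping empty
# strings and duplicates) are assumptions about what the links on the page may
# look like, not something the page is known to need.
clean.names <- function(x) {
  x <- sub("[/?#].*$", "", unlist(x))  # keep only the part before any /, ? or #
  unique(x[nzchar(x)])                 # drop empty strings, then duplicates
}
# Possible usage with the list built above:
# screen.names <- clean.names(ppl)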

# During the text mining phase later on, I got errors that I assumed were related to
# the character encoding of the corpus (latin1). This conversion helped:
#
# tweets.utf8 <- iconv(tweets, "latin1", "UTF-8")
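#
# A minimal sketch of that conversion on a made-up string, assuming the input
# bytes really are latin1. The variable names and the sample text are
# hypothetical; in the real workflow the input would be the tweet texts.
tweets.sample <- "H\xe4r \xe4r svenskfinlands b\xe4sta twittrare"  # latin1 bytes for "ä"
tweets.sample.utf8 <- iconv(tweets.sample, from = "latin1", to = "UTF-8")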