@vu3jej
Created May 5, 2015 19:35
R Script to Parse, Clean and Write TRAI emails to a CSV file
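## Libraries (library() stops with an error if a package is missing,
## whereas require() only returns FALSE)
library(XML)      # htmlTreeParse, getNodeSet, xmlValue
library(stringr)  # str_split_fixed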
## Read in the file
doc <- htmlTreeParse(file = '27 March to 10 April OTT.mbox.html', useInternalNodes = TRUE)
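## Extract the text of every second table cell, which holds "Name <address>"
mail.list <- sapply(getNodeSet(doc, "//td[2]"), xmlValue)
## Split each entry at the first ' <' into a name part and an address part
mail.list.matrix <- str_split_fixed(string = mail.list, pattern = ' <', n = 2)
mail.list.df <- as.data.frame(x = mail.list.matrix, stringsAsFactors = FALSE)
## Drop the first two rows (presumably header rows rather than mail entries)
mail.list.df <- mail.list.df[-c(1, 2),]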
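## Cleaning up the emails: the addresses are obfuscated with lowercase tokens in
## parentheses. The code assumes the first token stands in for "@" and every
## later one for ".", so the second pass uses gsub() to catch multi-dot domains.
mail.list.df$V2 <- sub(pattern = "\\([a-z]+\\)", replacement = "@", x = mail.list.df$V2)
mail.list.df$V2 <- gsub(pattern = "\\([a-z]+\\)", replacement = ".", x = mail.list.df$V2)
## Drop the closing angle bracket
mail.list.df$V2 <- sub(pattern = ">", replacement = "", x = mail.list.df$V2)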
## Naming the columns properly
colnames(mail.list.df) <- c("name", "email")
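## Write the names and emails to a csv file. If the file already exists, append
## to it without repeating the header row; otherwise create it with a header.
## Note: quote = FALSE leaves fields unquoted, so a name containing a comma
## will spill into an extra column.
out.file <- "TRAI-Email-list.csv"
append.to.existing <- file.exists(out.file)
write.table(x = mail.list.df, file = out.file, sep = ',', quote = FALSE, row.names = FALSE,
            col.names = !append.to.existing, append = append.to.existing)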
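## Example (a sketch, not part of the original script): the email clean-up
## wrapped in a helper and tried on a made-up address. The name clean.email and
## the "(at)"/"(dot)" obfuscation format are assumptions for illustration only.
clean.email <- function(x) {
  x <- sub(pattern = "\\([a-z]+\\)", replacement = "@", x = x)   # first token becomes "@"
  x <- gsub(pattern = "\\([a-z]+\\)", replacement = ".", x = x)  # remaining tokens become "."
  sub(pattern = ">", replacement = "", x = x)                    # drop the closing ">"
}
clean.email("someone(at)example(dot)co(dot)in>")  # "someone@example.co.in"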