@talegari
Last active September 12, 2017 05:43
Read 20 Newsgroups data in R as a data.table (data frame)
# Read 20 Newsgroups data as a data.table (data frame)
# Author: Srikanth KS
# license: GPL-3
#
# download data from here:
# https://archive.ics.uci.edu/ml/machine-learning-databases/20newsgroups-mld/20_newsgroups.tar.gz
# extract it and set `baseDir` below to the location of the extracted folder
baseDir = "Downloads/20_newsgroups"
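# full paths of the newsgroup directories, one per group
newsGroupNames = list.files(baseDir, full.names = TRUE)
# read every file in one newsgroup directory into a data.table,
# collapsing the lines of each file into a single string
readText = function(directory) {
  textFileNames = list.files(directory, full.names = TRUE)
  text = vapply(textFileNames
                , function(x) paste(readLines(x), collapse = " ")
                , character(1)
                )
  data.table::data.table(newsgroup = basename(directory)
                         , fileName = basename(textFileNames)
                         , text = text
                         )
}
# stack the per-group tables into one table with one row per file
news20 = data.table::rbindlist(lapply(newsGroupNames, readText))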
# to convert news20 into a dataframe, run: `data.table::setDF(news20)`
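#
# A quick way to sanity-check the reader without the real download: a minimal
# sketch that builds a throwaway directory tree and runs the same logic on it.
# The names `demoDir`, `readGroup`, `demoNews`, `demoCounts`, the folder names
# and the file contents are all made up for illustration; `readGroup` only
# repeats the body of `readText` so this part runs on its own.
demoDir = file.path(tempdir(), "demo_newsgroups")
for (group in c("alt.demo", "sci.demo")) {
  dir.create(file.path(demoDir, group), recursive = TRUE, showWarnings = FALSE)
  for (i in 1:2) {
    # two-line file: a header-like line and a body line
    writeLines(c("Subject: test", paste("body", i))
               , file.path(demoDir, group, as.character(i))
               )
  }
}
readGroup = function(directory) {
  textFileNames = list.files(directory, full.names = TRUE)
  text = vapply(textFileNames
                , function(x) paste(readLines(x), collapse = " ")
                , character(1)
                )
  data.table::data.table(newsgroup = basename(directory)
                         , fileName = basename(textFileNames)
                         , text = text
                         )
}
demoNews = data.table::rbindlist(lapply(list.files(demoDir, full.names = TRUE), readGroup))
# number of files read per newsgroup; the same check applies to `news20`
demoCounts = demoNews[, .N, by = newsgroup]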