@kbenoit
Last active February 15, 2023 10:10
Create a quanteda dictionary from the LIWC dictionary poster pdf
# you need Xpdf for this: https://www.xpdfreader.com/download.html
# and the current working directory should be the location of the file
#
# also requires quanteda and stringi
#' Create a dictionary from a LIWC dictionary poster pdf
#'
#' Creates a \pkg{quanteda} dictionary from a LIWC dictionary output from the
#' LIWC software as a "dictionary poster" pdf. Currently tested with LIWC 2015
#' v1.5.0, available for purchase from \url{http://liwc.wpengine.com/}.
#' The poster files can be output using Dictionary -> Export Internal
#' Dictionaries from the LIWC2015 application menu.
#'
#' Currently only works with the \code{LIWC2015 dictionary poster.pdf} file.
#' @param file the filename of the LIWC dictionary poster pdf to be read
#' @return a \pkg{quanteda} \link[quanteda]{dictionary}
#' @export
#' @examples
#' \dontrun{
#' data_dictionary_liwc2015eng <- readliwc("~/Desktop/LIWC2015 dictionary poster.pdf")
#' names(data_dictionary_liwc2015eng)
#' data_dictionary_liwc2015eng["Assent"]
#' data_dictionary_liwc2015eng["Netspeak"]
#' }
readliwc <- function(file) {
    # quote the path so that spaces survive the shell call on any platform;
    # path.expand() resolves a leading "~" before the path is quoted
    file <- shQuote(path.expand(file))
    dict <- system2("pdftotext", args = c("-layout", "-r 600", "-nopgbrk", file, "-"), stdout = TRUE)
    # get category names
    cats <- as.character(quanteda::tokens(dict[4]))
    # remove the first four lines (everything up to and including the category names)
    dict <- dict[-c(1:4)]
    # get fixed column locations
    columns <- as.data.frame(stringi::stri_locate_all_regex(dict, "\\s[\\w\\p{P}]")[[1]])
    colwidths <- c(columns$end, max(nchar(dict)) + 1) - c(1, columns$end)
    dicttable <- utils::read.fwf(textConnection(dict), widths = colwidths, stringsAsFactors = FALSE)
    # number of poster columns spanned by each category, hard-coded for the
    # LIWC2015 poster (73 entries, one per category)
    colwidths <- c(4, rep(1, 14), 7, 5, 3, rep(1, 3), 10, rep(5, 2), 1, 2, 1, 5, rep(1, 4), 6,
                   2, rep(1, 2), 2, rep(1, 2), 3, rep(1, 3), 5, rep(2, 2), 1, 2, 8, rep(2, 2), 4, rep(1, 2),
                   rep(3, 2), 1, 7, 3, 3, 2, 3, 2, 1, 2, 2, 1, 3, 1, 2, rep(1, 3))
    liwclist <- wrapcols(dicttable, cats, colwidths)
    quanteda::dictionary(liwclist)
}
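wrapcols <- function(input, keynames, colwidths) {
    # one column count is needed per category name
    stopifnot(length(keynames) == length(colwidths))
    # flatten the table column by column and trim padding from each cell
    tmp <- as.vector(as.matrix(input))
    tmp <- stringi::stri_trim_both(tmp)
    # assign each cell to the category whose columns it falls under
    output <- split(tmp, rep(keynames, colwidths * nrow(input)))
    # drop empty and missing cells
    output <- lapply(output, function(x) x[x != "" & !is.na(x)])
    output
}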
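To make the column-to-category step easier to follow, here is a minimal sketch on a made-up table. `wrapcols_sketch` is a hypothetical base-R restatement of `wrapcols()` (it swaps `stringi::stri_trim_both()` for `trimws()` so the example runs without extra packages), and the words and category names are invented for illustration; none of this comes from the LIWC poster itself.

```r
# Hypothetical base-R restatement of wrapcols(), for illustration only.
wrapcols_sketch <- function(input, keynames, colwidths) {
    stopifnot(length(keynames) == length(colwidths))
    # as.matrix() + as.vector() flattens the table column by column
    tmp <- trimws(as.vector(as.matrix(input)))
    # each category owns colwidths[i] columns, i.e. colwidths[i] * nrow cells
    output <- split(tmp, rep(keynames, colwidths * nrow(input)))
    lapply(output, function(x) x[x != "" & !is.na(x)])
}

# Toy table: three columns, two rows. "posemo" spans the first two columns,
# "negemo" the last one. All entries are made up.
toy <- data.frame(V1 = c("happy ", "joy*"),
                  V2 = c("love", ""),
                  V3 = c("sad", NA),
                  stringsAsFactors = FALSE)

toy_dict <- wrapcols_sketch(toy, keynames = c("posemo", "negemo"),
                            colwidths = c(2, 1))
str(toy_dict)

# Supplying fewer names than widths trips the stopifnot() guard; the error
# is captured here so that the script keeps running.
mismatch <- tryCatch(
    wrapcols_sketch(toy, keynames = "posemo", colwidths = c(2, 1)),
    error = function(e) conditionMessage(e)
)
print(mismatch)
```

Because the column counts in `readliwc()` are hard-coded for one specific poster, a poster whose header line yields a different number of category names would presumably hit the same guard.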
@grayakagr

Hi, thanks for the code, but it didn't run for me: I received the error message 'length(keynames) == length(colwidths) is not TRUE'. I read in the pdf file using the tm package and realized there may be something wrong with the original file (some of the text can't be selected, so only fragments were read in). Could that be the problem in my case, and if so, how can I get a usable pdf file? Thank you very much!

@chumpblocckami

Thanks for sharing. It worked for me, but it parsed only 73 classes instead of the full 125.
