Skip to content

Instantly share code, notes, and snippets.

@benmarwick
Last active July 18, 2022 03:48
Show Gist options
  • Save benmarwick/11333467 to your computer and use it in GitHub Desktop.
Save benmarwick/11333467 to your computer and use it in GitHub Desktop.
Convert PDFs to text files or CSV files (DfR format) with R
# Here are a few methods for getting text from PDF files. Do read through
# the instructions carefully! NOte that this code is written for Windows 7,
# slight adjustments may be needed for other OSs
# Tell R what folder contains your 1000s of PDFs
dest <- "G:/somehere/with/many/PDFs"
# make a vector of PDF file names
myfiles <- list.files(path = dest, pattern = "pdf", full.names = TRUE)
# now there are a few options...
############### PDF (image of text format) to TXT ##########
# This is for is your PDF is an image of text, this is the case
# if you open the PDF in a PDF viewer and you cannot select
# words or lines with your cursor.
##### Wait! #####
# Before proceeding, make sure you have a copy of Tesseract
# on your computer! Details & download:
# https://code.google.com/p/tesseract-ocr/
# and a copy of ImageMagick: http://www.imagemagick.org/
# and a copy of pdftoppm on your computer!
# Download: http://www.foolabs.com/xpdf/download.html
# And then after installing those three, restart to
# ensure R can find them on your path.
# And note that this process can be quite slow...
# PDF filenames can't have spaces in them for these operations
# so let's get rid of the spaces in the filenames
sapply(myfiles, FUN = function(i){
file.rename(from = i, to = paste0(dirname(i), "/", gsub(" ", "", basename(i))))
})
# get the PDF file names without spaces
myfiles <- list.files(path = dest, pattern = "pdf", full.names = TRUE)
# Now we can do the OCR to the renamed PDF files. Don't worry
# if you get messages like 'Config Error: No display
# font for...' it's nothing to worry about
lapply(myfiles, function(i){
# convert pdf to ppm (an image format), just pages 1-10 of the PDF
# but you can change that easily, just remove or edit the
# -f 1 -l 10 bit in the line below
shell(shQuote(paste0("pdftoppm ", i, " -f 1 -l 10 -r 600 ocrbook")))
# convert ppm to tif ready for tesseract
shell(shQuote(paste0("convert *.ppm ", i, ".tif")))
# convert tif to text file
shell(shQuote(paste0("tesseract ", i, ".tif ", i, " -l eng")))
# delete tif file
file.remove(paste0(i, ".tif" ))
})
# where are the txt files you just made?
dest # in this folder
# And now you're ready to do some text mining on the text files
############### PDF (text format) to TXT ###################
##### Wait! #####
# Before proceeding, make sure you have a copy of pdf2text
# on your computer! Details: https://en.wikipedia.org/wiki/Pdftotext
# Download: http://www.foolabs.com/xpdf/download.html
# If you have a PDF with text, ie you can open the PDF in a
# PDF viewer and select text with your curser, then use these
# lines to convert each PDF file that is named in the vector
# into text file is created in the same directory as the PDFs
# note that my pdftotext.exe is in a different location to yours
lapply(myfiles, function(i) system(paste('"C:/Program Files/xpdf/bin64/pdftotext.exe"', paste0('"', i, '"')), wait = FALSE) )
# where are the txt files you just made?
dest # in this folder
# And now you're ready to do some text mining on the text files
############### PDF to CSV (DfR format) ####################
# or if you want DFR-style csv files...
# read txt files into R
mytxtfiles <- list.files(path = dest, pattern = "txt", full.names = TRUE)
library(tm)
mycorpus <- Corpus(DirSource(dest, pattern = "txt"))
# warnings may appear after you run the previous line, they
# can be ignored
mycorpus <- tm_map(mycorpus, removeNumbers)
mycorpus <- tm_map(mycorpus, removePunctuation)
mycorpus <- tm_map(mycorpus, stripWhitespace)
mydtm <- DocumentTermMatrix(mycorpus)
# remove some OCR weirdness
# words with more than 2 consecutive characters
mydtm <- mydtm[,!grepl("(.)\\1{2,}", mydtm$dimnames$Terms)]
# get each doc as a csv with words and counts
for(i in 1:nrow(mydtm)){
# get word counts
counts <- as.vector(as.matrix(mydtm[1,]))
# get words
words <- mydtm$dimnames$Terms
# combine into data frame
df <- data.frame(word = words, count = counts,stringsAsFactors = FALSE)
# exclude words with count of zero
df <- df[df$count != 0,]
# write to CSV with original txt filename
write.csv(df, paste0(mydtm$dimnames$Docs[i],".csv"), row.names = FALSE)
}
# and now you're ready to work with the csv files
############### PDF to TXT (all text between two words) ####
## Below is about splitting the text files at certain characters
## can be skipped...
# if you just want the abstracts, we can use regex to extract that part of
# each txt file, Assumes that the abstract is always between the words 'Abstract'
# and 'Introduction'
abstracts <- lapply(mytxtfiles, function(i) {
j <- paste0(scan(i, what = character()), collapse = " ")
regmatches(j, gregexpr("(?<=Abstract).*?(?=Introduction)", j, perl=TRUE))
})
# Write abstracts into separate txt files...
# write abstracts as txt files
# (or use them in the list for whatever you want to do next)
lapply(1:length(abstracts), function(i) write.table(abstracts[i], file=paste(mytxtfiles[i], "abstract", "txt", sep="."), quote = FALSE, row.names = FALSE, col.names = FALSE, eol = " " ))
# And now you're ready to do some text mining on the txt
# originally on http://stackoverflow.com/a/21449040/1036500
@Patrikios
Copy link

Could you possibly include the OCR procedure in Linux OS as the function shell() seems not working in Linux for me. R returns:
Error in FUN(c("directory", :
could not find function "shell"

@CabritoIncognito
Copy link

Ben, Your code works great for the pdftotext on my machine! However, when I use the pdftoppm code for image-like PDFs, I'm getting an error that identifies 'pdftoppm' is not recognized as an internal or external command. Did you have to make any changes to the config.xpdfrc config file for pdftoppm to work or do you have any other recommendations on the issue I'm encountering?
Thanks,
Justin

@whisere
Copy link

whisere commented Sep 22, 2015

Can we add weight to the csv files? the current DfR file has word, count and weight?
Thanks!

@felixhaass
Copy link

@CabritoIncognito: Make sure you provide the full path to the pdftoppm.exe in the shell() command. Also, the path to pdftopmm.exemay not contain any spaces (such as in C:/Program Files/). Then it should work fine.

@randNormal
Copy link

Thx for this awesome script. I am struggling to get the conversion process running.

Please see my SO question here:

http://stackoverflow.com/questions/34038507/cmd-running-a-program-gives-me-only-the-help-page?noredirect=1#comment55832796_34038507

What is the parameter ocrbook for in pdftopmm?

Thx in advance for your reply!

@jvillacampageoblink
Copy link

Patrikios in Unix yo can use sytem instead of shell( shQuote(...
Please, be carrefull with the parenthesis

ORIGINAL
shell(shQuote(paste0("pdftoppm ", i, " -f 1 -l 10 -r 600 ocrbook")))

convert ppm to tif ready for tesseract

shell(shQuote(paste0("convert *.ppm ", i, ".tif")))

convert tif to text file

shell(shQuote(paste0("tesseract ", i, ".tif ", i, " -l eng")))

delete tif file

file.remove(paste0(i, ".tif" ))

UNIX
system( paste0("pdftoppm ", i, " -f 1 -l 10 -r 600 ocrbook") )

convert ppm to tif ready for tesseract

system(paste0("convert *.ppm ", i, ".tif"))

convert tif to text file

system(paste0("tesseract ", i, ".tif ", i, " -l spa"))

delete tif file

@jeanoesque
Copy link

Is it possible to use this for other languages? Specifically I am working with Korean and Chinese (traditional).
Thank you!

@ashishmbi
Copy link

hi,
I have some data in excels and i am having one pdf file. so i want to paste the some text from excel to pdf. because in pdf file some text are missing like patient name, dob. how can we do,,any idea,,i am trying to do this by reading pdf file in txt formate and trying to some lookup function. but its not working,,

@saeedeasadi
Copy link

Hi. Thanks for your helpful code. Is it possible to use the first approach, even when the pdf is not an image of text format? I mean I have a text format pdf but I want to act with it like an image of text and use the first approach.

@edselpogi
Copy link

Hi. Thank you for your code. I am new to R and I am getting an error
'pdftoppmC:' is not recognized as an internal or external command,
operable program or batch file.
Invalid Parameter - C:
'tesseractC:' is not recognized as an internal or external command,

Hope you could help. Thanks!

@koka0901
Copy link

Hi. The code worked great. I used the "PDF (text format) to TXT".

Issue: I am not able to see the newly created txt files in my directory (dest) or anywhere in the folders.

@mbgalloway
Copy link

this doesn't work. Naming it OCRBOOK makes the file name OCRBOOK-000001.ppm and then when ImageMagick needs to do its thing, it looks for the original file name.ppm which obviously doesn't exist. How to fix?

Also, the xPDF related syntax appears to be wrong, the options need to go right after the exe is called.
shell(shQuote(paste0("pdftopng -f 1 -l 10 -r 600 -gray ", i, " whatever")))

@cidemarg
Copy link

Hi. The code worked great. I used the "PDF (text format) to TXT".

Issue: I am not able to see the newly created txt files in my directory (dest) or anywhere in the folders.

Dear koka0901;
I also have the same problem. Have you solved the issue ?

@regmin
Copy link

regmin commented Aug 8, 2020

I am getting the following error when trying to convert from pdf to txt, Does anyone know why? Thanks in advance..
Syntax Error (4021433): Unknown filter 'DCTDecode'
Syntax Error (3525089): Unknown filter 'DCTDecode'
Syntax Error (2849935): Unknown filter 'DCTDecode'
Syntax Error (2187540): Unknown filter 'DCTDecode'
Syntax Error (1489650): Unknown filter 'DCTDecode'
Invalid Parameter - C:
'tesseract' is not recognized as an internal or external command,
operable program or batch file.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment