@harigopalakrishna
Created September 26, 2014 16:28
Extracting text from PDF file using iText and qualifying the document for OCR processing
/*
Extracts text from a PDF using the iText library.
If no text is found, the page probably contains only images, or the PDF may be a scan.
NOTE: Only page 1 is checked, so this logic works for a SINGLE-PAGE PDF.
*/
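import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;

public class PdfText {
    public static void main(String[] args) {
        PdfReader reader = null;
        try {
            reader = new PdfReader("/path/pdffilename.pdf");
            // Extract the text layer of page 1 (iText page numbers are 1-based).
            String text = PdfTextExtractor.getTextFromPage(reader, 1);
            System.out.println(text);
            // No extractable text suggests a scanned or image-only page.
            // trim() makes whitespace-only output count as empty too.
            if (text.trim().isEmpty()) {
                System.out.println("Eligible for Ocr");
            } else {
                System.out.println("Not Eligible for Ocr");
            }
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            // Release the file handle even when extraction fails.
            if (reader != null) {
                reader.close();
            }
        }
    }
}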
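
The header comment notes that the check covers a single page only. One possible way to extend it to multi-page documents is sketched below; it is not part of the original gist. The class `OcrQualifier` and its methods are hypothetical names, and the sketch assumes the per-page strings were already collected by calling `PdfTextExtractor.getTextFromPage(reader, i)` for `i` from 1 to `reader.getNumberOfPages()`. The decision logic itself is kept free of iText so it can be run on its own.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: decides which pages lack a text layer.
class OcrQualifier {

    // A page counts as blank when extraction returned null or only whitespace.
    static boolean isBlank(String pageText) {
        return pageText == null || pageText.trim().isEmpty();
    }

    // Returns the 1-based numbers of the pages that yielded no text.
    // pageTexts.get(0) is assumed to hold the text of page 1, and so on.
    static List<Integer> pagesNeedingOcr(List<String> pageTexts) {
        List<Integer> pages = new ArrayList<>();
        for (int i = 0; i < pageTexts.size(); i++) {
            if (isBlank(pageTexts.get(i))) {
                pages.add(i + 1);
            }
        }
        return pages;
    }

    public static void main(String[] args) {
        // Stand-in data; in practice these strings would come from iText.
        List<String> pageTexts = Arrays.asList("Invoice 42", "", "  \n");
        System.out.println("Pages eligible for OCR: " + pagesNeedingOcr(pageTexts));
        // prints "Pages eligible for OCR: [2, 3]"
    }
}
```

Returning page numbers rather than a single yes/no lets a caller send only the image-only pages to OCR when a document mixes typed and scanned pages.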