Last active June 15, 2024 15:45
Uses Apache Tika to extract text from the following formats: DOC, DOCX, PPT, PPTX, XLS, XLSX, PDF, JPG, PNG, TXT. Note: Tesseract must be installed for JPG and PNG extraction to work.
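import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.ocr.TesseractOCRConfig;
import org.apache.tika.parser.pdf.PDFParserConfig;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

/**
 * Uses Tika's {@link AutoDetectParser} to extract the text of a file.
 * Assumes the enclosing class declares a {@code logger} field.
 *
 * @param file the file to extract text from
 * @return the text content of the file (may be empty)
 */
public String extractTextOfDocument(File file) throws Exception {
    Parser parser = new AutoDetectParser();
    Metadata metadata = new Metadata();
    // Integer.MAX_VALUE effectively lifts the handler's default write limit
    BodyContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);
    TesseractOCRConfig config = new TesseractOCRConfig();
    PDFParserConfig pdfConfig = new PDFParserConfig();

    // These context entries are needed to parse images embedded in files
    ParseContext parseContext = new ParseContext();
    parseContext.set(TesseractOCRConfig.class, config);
    parseContext.set(PDFParserConfig.class, pdfConfig);
    // Registering the parser itself makes sure recursive parsing happens
    parseContext.set(Parser.class, parser);

    // try-with-resources closes the stream even if parsing fails
    try (InputStream fileStream = new FileInputStream(file)) {
        parser.parse(fileStream, handler, metadata, parseContext);
        String text = handler.toString();
        if (text.trim().isEmpty()) {
            logger.warn("Could not extract text of '" + file.getName() + "'");
        } else {
            logger.debug("Successfully extracted the text of '" + file.getName() + "'");
        }
        return text;
    } catch (IOException | SAXException | TikaException e) {
        throw new Exception("Tika was not able to extract text of file '" + file.getName() + "'", e);
    }
}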
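
Since JPG and PNG extraction only works when Tesseract is installed, it can help to check for the binary before parsing images. The following is a minimal sketch, not part of the original gist: the class name `TesseractCheck` and both method names are hypothetical, and it assumes the `tesseract` executable is expected on the `PATH` (a custom install location would need a different command). It uses only the JDK and requires Java 9 or later for `ProcessBuilder.Redirect.DISCARD`.

```java
import java.io.IOException;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of the original gist.
class TesseractCheck {

    /**
     * Returns true if the command can be started and exits with code 0
     * within 10 seconds; returns false otherwise.
     */
    static boolean isCommandAvailable(String... command) {
        try {
            Process process = new ProcessBuilder(command)
                    .redirectErrorStream(true)                       // merge stderr into stdout
                    .redirectOutput(ProcessBuilder.Redirect.DISCARD) // output is not needed
                    .start();
            if (!process.waitFor(10, TimeUnit.SECONDS)) {
                process.destroyForcibly(); // treat a hanging process as unavailable
                return false;
            }
            return process.exitValue() == 0;
        } catch (IOException e) {
            // Thrown when the executable cannot be found or started
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            return false;
        }
    }

    /** Assumption: Tesseract is invoked as "tesseract" from the PATH. */
    static boolean isTesseractInstalled() {
        return isCommandAvailable("tesseract", "--version");
    }

    public static void main(String[] args) {
        System.out.println("Tesseract available: " + isTesseractInstalled());
    }
}
```

A caller could log a warning when `isTesseractInstalled()` returns false, so that empty results for image files are easier to diagnose.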