@johnmiedema
Last active August 29, 2015 13:59
Use Apache Tika to extract metadata and convert different content types into plain text
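//Use Apache Tika to extract metadata and convert different content types into plain text
//'Whatson' blog series at johnmiedema.com
//http://johnmiedema.com/?tag=whatson
import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.tika.metadata.DublinCore;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;

//class name and main() are placeholders added so the snippet compiles; the original shows only the calls and the method
public class TikaExtract {
    public static void main(String[] args) {
        //source documents include different content types
        processDocument("resources/mobydick.htm");
        processDocument("resources/robinsoncrusoe.txt");
        processDocument("resources/callofthewild.pdf");
    }

    private static void processDocument(String pathfilename) {
        //try-with-resources closes the stream even if parsing fails
        try (InputStream input = new FileInputStream(pathfilename)) {
            //Apache Tika
            //raise the handler's write limit to 10*1024*1024 characters of extracted text
            ContentHandler textHandler = new BodyContentHandler(10*1024*1024);
            Metadata meta = new Metadata();
            //detects the document type and delegates to the matching parser
            Parser parser = new AutoDetectParser();
            ParseContext context = new ParseContext();
            parser.parse(input, textHandler, meta, context);
            //extract metadata (prints "null" if the document has no title)
            System.out.println("Title: " + meta.get(DublinCore.TITLE));
            //content is plain text
            System.out.println("Body: " + textHandler.toString());
        }
        catch (Exception ex) {
            //report which file failed; getMessage() alone can be null
            System.err.println("Failed to process " + pathfilename + ": " + ex);
        }
    }
}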