@atomotic
Created November 13, 2021 12:11
indexing EPUB content into Solr

Solr schema

  • one Solr document per chapter, then collapse the results by book at query time
  • or one document per book with multivalued fields chapter_title and chapter_text, keeping the values in chapter order
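The two layouts above can be sketched as plain Python dictionaries, ready to be posted to Solr as JSON. This is a minimal sketch, not a definitive schema: the field names `book_id` and `chapter_index` and the `id` scheme are assumptions added for illustration; only `chapter_title` and `chapter_text` come from the notes. The `{!collapse field=...}` filter is Solr's collapsing query parser, which as far as I know expects a single-valued field.

```python
def chapter_docs(book_id, chapters):
    """One Solr document per chapter.

    `chapters` is an ordered list of (title, text) pairs.
    `book_id` and `chapter_index` are hypothetical field names.
    """
    return [
        {
            "id": f"{book_id}_{index}",
            "book_id": book_id,          # field to collapse on
            "chapter_index": index,      # preserves reading order
            "chapter_title": title,
            "chapter_text": text,
        }
        for index, (title, text) in enumerate(chapters)
    ]


def book_doc(book_id, chapters):
    """One Solr document per book, with parallel multivalued fields.

    Position i in chapter_title matches position i in chapter_text.
    """
    return {
        "id": book_id,
        "chapter_title": [title for title, _ in chapters],
        "chapter_text": [text for _, text in chapters],
    }


def collapse_params(query):
    """Query parameters that return one hit per book for the per-chapter layout."""
    return {
        "q": query,
        "fq": "{!collapse field=book_id}",
        "expand": "true",  # also return the other matching chapters per book
    }
```

The per-chapter layout makes it easy to point a reader at the matching chapter; the multivalued layout keeps one hit per book without collapsing, at the cost of losing which chapter matched unless highlighting is used.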

text extraction

options for extracting structured text from an EPUB:

  • Apache Tika
  • pandoc, converting EPUB to TEI or EPUB to DocBook
  • a custom EPUB reader
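For the custom reader option, a minimal sketch using only the Python standard library might look like the following. It relies on the EPUB container layout: `META-INF/container.xml` points at the OPF package file, whose manifest maps item ids to files and whose spine lists them in reading order. The function name `read_chapters`, the helper class, and the rule "chapter title = first h1/h2/h3 in the file" are assumptions for illustration; real books often need the navigation document or NCX for reliable titles.

```python
import posixpath
import zipfile
import xml.etree.ElementTree as ET
from html.parser import HTMLParser
from urllib.parse import unquote

CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}
OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}
SKIP_TAGS = {"head", "script", "style"}
HEADING_TAGS = {"h1", "h2", "h3"}
BLOCK_TAGS = HEADING_TAGS | {"p", "div", "li", "br", "tr", "section", "blockquote"}


class _TextExtractor(HTMLParser):
    """Collects visible text and the first heading of an XHTML file."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self.title_parts = []
        self._skip = 0
        self._in_heading = False
        self._seen_heading = False

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag in HEADING_TAGS and not self._seen_heading:
            self._in_heading = True

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip = max(0, self._skip - 1)
            return
        if tag in HEADING_TAGS and self._in_heading:
            self._in_heading = False
            self._seen_heading = True
        if tag in BLOCK_TAGS and not self._skip:
            self.parts.append(" ")  # keep block elements from running together

    def handle_data(self, data):
        if self._skip:
            return
        if self._in_heading:
            self.title_parts.append(data)
        self.parts.append(data)


def _squash(parts):
    """Join text fragments and normalise whitespace."""
    return " ".join("".join(parts).split())


def read_chapters(source):
    """Return [(title, text), ...] in spine (reading) order.

    `source` is a path or a binary file-like object holding the EPUB.
    """
    chapters = []
    with zipfile.ZipFile(source) as zf:
        container = ET.fromstring(zf.read("META-INF/container.xml"))
        opf_path = container.find(".//c:rootfile", CONTAINER_NS).attrib["full-path"]
        opf_dir = posixpath.dirname(opf_path)
        opf = ET.fromstring(zf.read(opf_path))
        manifest = {
            item.attrib["id"]: item.attrib["href"]
            for item in opf.findall("opf:manifest/opf:item", OPF_NS)
        }
        for itemref in opf.findall("opf:spine/opf:itemref", OPF_NS):
            href = manifest.get(itemref.attrib["idref"])
            if href is None:
                continue
            # hrefs are relative to the OPF file and may be percent-encoded
            path = posixpath.normpath(
                posixpath.join(opf_dir, unquote(href.split("#")[0]))
            )
            parser = _TextExtractor()
            parser.feed(zf.read(path).decode("utf-8", errors="replace"))
            parser.close()
            chapters.append((_squash(parser.title_parts), _squash(parser.parts)))
    return chapters
```

The returned list of (title, text) pairs keeps the reading order, so it can feed either a one-document-per-chapter layout or ordered multivalued fields. One spine item is treated as one chapter here, which is an approximation: some books split a chapter across several files or put several chapters in one.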