@obahareth
Last active January 3, 2021 13:47
Using Tika through Yomu to turn a PDF into nicely formatted HTML, parsing the HTML using Nokogiri, and then using Algolia to get the PDF contents indexed and searchable. See https://medium.com/@obahareth/indexing-pdf-or-other-file-contents-for-searching-b2499c23568f
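require "nokogiri"
require "yomu"
require "algoliasearch"

# Returns true for paragraphs that are empty or hold only a space and/or a newline.
def invalid_paragraph?(str)
  disallowed_strings = ["", " ", "\n", " \n"]
  disallowed_strings.include?(str)
end

# Extracts every non-blank paragraph from the PDF, tagged with the index of its page.
def get_pdf_paragraphs(filename)
  yomu = Yomu.new(filename)
  paragraphs = []

  # Yomu (via Tika) renders the PDF as HTML; this code expects each page
  # to be wrapped in an element with the class "page".
  doc = Nokogiri::HTML(yomu.html)

  page = 0 # zero-based page index
  doc.css('.page').each do |node|
    node.css('p').each do |paragraph|
      paragraph_text = paragraph.inner_text
      next if invalid_paragraph?(paragraph_text)

      paragraphs << { text: paragraph_text, page: page }
    end
    page += 1
  end

  paragraphs
end

paragraphs = get_pdf_paragraphs("dracula-shortened.pdf")

# Replace 'xxxx' with your Algolia application ID and an API key that can write to the index.
Algolia.init(application_id: 'xxxx', api_key: 'xxxx')
index = Algolia::Index.new("books")

# Upload one record per paragraph, then make only the paragraph text searchable.
index.add_objects(paragraphs)
index.set_settings({ "searchableAttributes" => ["text"] })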
@obahareth
Author

invalid_paragraph? could be written in a more performant way using regexes, but I wanted something easy for all readers to understand.
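
For readers curious about the regex route mentioned above, here is a minimal sketch. The names `BLANK_PARAGRAPH` and `blank_paragraph?` are hypothetical and not part of the gist. Note that it is not a drop-in equivalent: it rejects any string made up only of whitespace (tabs and non-breaking spaces included), whereas `invalid_paragraph?` rejects only the four exact strings it lists. It assumes Ruby 2.4 or later for `String#match?`.

```ruby
# Hypothetical alternative to invalid_paragraph?; not the gist author's code.
# \A and \z anchor the whole string; [[:space:]] also covers Unicode
# whitespace such as the non-breaking space.
BLANK_PARAGRAPH = /\A[[:space:]]*\z/

def blank_paragraph?(str)
  str.match?(BLANK_PARAGRAPH)
end

samples = ["", " \n", "\t", "Jonathan Harker's Journal"]
kept = samples.reject { |s| blank_paragraph?(s) }
puts kept.inspect # prints ["Jonathan Harker's Journal"]
```

Inside `get_pdf_paragraphs`, the guard would then presumably read `next if blank_paragraph?(paragraph_text)`.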