Last active
January 3, 2021 13:47
Uses Tika (through Yomu) to turn a PDF into nicely formatted HTML, parses that HTML with Nokogiri, and then pushes the paragraphs to Algolia so the PDF contents are indexed and searchable. See https://medium.com/@obahareth/indexing-pdf-or-other-file-contents-for-searching-b2499c23568f
require "nokogiri"
require "yomu"
require "algoliasearch"

# True for empty or whitespace-only strings that should not be indexed.
def invalid_paragraph?(str)
  disallowed_strings = [ "", " ", "\n", " \n" ]
  disallowed_strings.include?(str)
end

# Extracts the paragraphs of a PDF as an array of { text:, page: } hashes.
def get_pdf_paragraphs(filename)
  yomu = Yomu.new(filename)
  paragraphs = []

  # Tika's HTML output wraps each PDF page in an element with class "page".
  doc = Nokogiri::HTML(yomu.html)

  # Page index is zero-based.
  page = 0

  doc.css('.page').each do |node|
    node.css('p').each do |paragraph|
      paragraph_text = paragraph.inner_text
      next if invalid_paragraph?(paragraph_text)

      paragraphs << { text: paragraph_text, page: page }
    end

    page += 1
  end

  paragraphs
end

paragraphs = get_pdf_paragraphs("dracula-shortened.pdf")

# Replace the placeholders with your own Algolia credentials.
Algolia.init(application_id: 'xxxx', api_key: 'xxxx')
index = Algolia::Index.new("books")

# Upload one record per paragraph and make only the text searchable.
index.add_objects(paragraphs)
index.set_settings({ "searchableAttributes" => ["text"] })
invalid_paragraph? could be written in a more performant way using a regex, but I wanted something easy for all readers to understand.
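For readers curious about the regex route, here is a minimal sketch of what it might look like. The names `EXACT_BLANK`, `invalid_paragraph_regex?`, and `blank_paragraph?` are hypothetical and not part of the gist. The first method is meant to reject exactly the same four strings as `invalid_paragraph?`; the second is a broader variant that is an assumption on my part, not the gist's behavior. Whether either is actually faster would need measuring.

```ruby
# Hypothetical regex-based variant of invalid_paragraph?.
# Matches exactly the four strings the original rejects:
# "", " ", "\n", and " \n" (an optional space, then an optional newline).
EXACT_BLANK = /\A ?\n?\z/

def invalid_paragraph_regex?(str)
  EXACT_BLANK.match?(str)
end

# Broader variant (an assumption, not the original behavior):
# treats any string made up only of whitespace as invalid.
def blank_paragraph?(str)
  str.match?(/\A\s*\z/)
end
```

The broader variant also skips paragraphs that contain only tabs or several spaces, which the original list would let through to the index.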