This script is a basic proof-of-concept implementation of a parsing tool that "unrolls" (flattens) an HTML document into a simple JSON representation. It is inspired by the ProseMirror document model.
There are two things which allow for the simplified output:
- We don't care about every HTML element, just a limited "whitelist"
- Text-level elements are represented as a linear sequence (in a real-world version, each text item would need some kind of attributes or properties field to indicate links, images, bold + italic, superscript/subscript, etc.)
Requirements:
- Ruby (any recent version)
- Nokogiri gem
To see some output, run this script with a path to an HTML document as an argument:
ruby parser.rb "book/OEBPS/Badi_9781781680308_epub_c04_r1.htm"
This will give you a giant string of JSON, which can be pasted into a JSON viewer for inspection. Or load the script into a pry session to peek under the hood.