This script is a basic proof-of-concept implementation of a parsing tool that "unrolls", or flattens, an HTML document into a simple JSON representation. It is inspired by the ProseMirror document model.
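As a rough illustration of the flattening idea, here is a minimal sketch using only Python's standard-library `html.parser`. It is not the script's actual code: the `BLOCK_TAGS` set, the `Flattener` class, the `flatten` function, and the `type`/`content`/`text` keys are all hypothetical names chosen for this example.

```python
from html.parser import HTMLParser

# Hypothetical set of block-level tags to keep; the real script's list may differ.
BLOCK_TAGS = {"p", "h1", "h2", "li"}


class Flattener(HTMLParser):
    """Collects kept block elements, each holding a flat list of text items."""

    def __init__(self):
        super().__init__()
        self.blocks = []     # finished output, one dict per kept block element
        self.current = None  # block currently being filled, if any

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.current = {"type": tag, "content": []}
        # Any other tag (e.g. <div>, <em>) is ignored; its text still flows through.

    def handle_endtag(self, tag):
        if self.current is not None and tag == self.current["type"]:
            self.blocks.append(self.current)
            self.current = None

    def handle_data(self, data):
        # Keep non-blank text that appears inside a kept block.
        if self.current is not None and data.strip():
            self.current["content"].append({"text": data})


def flatten(html):
    """Return a JSON-serializable list of flattened blocks (assumed output shape)."""
    parser = Flattener()
    parser.feed(html)
    parser.close()
    return parser.blocks
```

For example, `flatten("<div><h1>Title</h1><p>Hello <em>world</em>!</p></div>")` yields one `h1` block and one `p` block, and the `p` block's content is the flat sequence `Hello `, `world`, `!`. The sketch does not handle kept block tags nested inside one another, and the result can be passed to `json.dumps` for serialization.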
Two things allow for the simplified output:
- We don't care about every HTML element, just a limited "whitelist"
- Text-level elements are represented as a linear sequence (in a real-world version, each text item would need some kind of properties field to indicate