@greim
Last active June 22, 2020 22:05
HTML Parsing Primer

How HTML Parsing Works

An HTML parser scans through an input string, starting at the beginning:

<div>hello<br>world</div>
│
└─ scanning begins here

While scanning, it sees parts of the HTML structure, like opening tags, attributes, closing tags, and bits of text.

<div data-foo="bar">hello<br>world</div>
└────────┬─────────┘
         └─ open div { 'data-foo': 'bar' }
<div data-foo="bar">hello<br>world</div>
                    └─┬─┘
                      └─ text "hello"
<div data-foo="bar">hello<br>world</div>
                         └┬─┘
                          └─ open br, close br
<div data-foo="bar">hello<br>world</div>
                             └─┬─┘
                               └─ text "world"
<div data-foo="bar">hello<br>world</div>
                                  └─┬──┘
                                    └─ close div

The parser encapsulates the details of knowing what's what inside the input string. It then produces a token stream, which is simply a series of objects reflecting what it finds during the scan, as shown above.
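To make the idea of a token stream concrete, here is a minimal sketch of a scanner written as a generator. It is a toy, not how any real parser is implemented: it assumes plain tags, double-quoted attributes, and text only, and the token field names (`name`, `attributes`, `text`) and the `VOID_ELEMENTS` list are made up for illustration.

```javascript
// Toy scanner for illustration only: handles plain tags, double-quoted
// attributes, and text. No comments, entities, or error recovery.
const VOID_ELEMENTS = new Set(['br', 'hr', 'img', 'input']);

function* tokenize(html) {
  // Matches either a tag (<name ...> or </name>) or a run of text.
  const pattern = /<(\/?)([a-zA-Z][\w-]*)([^>]*)>|([^<]+)/g;
  for (const match of html.matchAll(pattern)) {
    const [, slash, name, rawAttrs, text] = match;
    if (text !== undefined) {
      yield { type: 'text', text };
    } else if (slash) {
      yield { type: 'close', name };
    } else {
      const attributes = {};
      for (const [, key, value] of rawAttrs.matchAll(/([\w-]+)="([^"]*)"/g)) {
        attributes[key] = value;
      }
      yield { type: 'open', name, attributes };
      // Void elements such as <br> have no closing tag, so close them here.
      if (VOID_ELEMENTS.has(name)) yield { type: 'close', name };
    }
  }
}

const tokens = [...tokenize('<div data-foo="bar">hello<br>world</div>')];
console.log(tokens);
```

Run against the example input, this yields six tokens in the same order as the diagram above: open div, text, open br, close br, text, close div.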

NOTE: If the input is malformed, it's the parser's job to handle that somehow, either by throwing or by correcting the problem in the token stream so that the stream represents a well-formed HTML document, for example by making sure each opening tag is matched by a closing tag.
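As a rough sketch of the "correcting" option, the generator below repairs a token stream so that every open token gets a matching close. This is one simple strategy, not the recovery algorithm real HTML parsers follow, and `balance` and the token shape (`type`, `name`, `text`) are hypothetical names chosen for the example.

```javascript
// Hypothetical helper: repairs a token stream so that every 'open'
// token is eventually followed by a matching 'close' token.
function* balance(tokens) {
  const stack = []; // names of the elements that are currently open
  for (const token of tokens) {
    if (token.type === 'open') {
      stack.push(token.name);
      yield token;
    } else if (token.type === 'close') {
      const index = stack.lastIndexOf(token.name);
      if (index === -1) continue; // stray closing tag: drop it
      // Close anything left open inside the element being closed,
      // then the element itself.
      while (stack.length > index) {
        yield { type: 'close', name: stack.pop() };
      }
    } else {
      yield token;
    }
  }
  // End of input: close whatever is still open.
  while (stack.length > 0) {
    yield { type: 'close', name: stack.pop() };
  }
}

// Tokens for the malformed input <div><p>hello</div></span>
const malformed = [
  { type: 'open', name: 'div' },
  { type: 'open', name: 'p' },
  { type: 'text', text: 'hello' },
  { type: 'close', name: 'div' },
  { type: 'close', name: 'span' },
];
const repaired = [...balance(malformed)];
console.log(repaired);
```

Here the unclosed p is closed just before its parent div, and the stray closing span is dropped.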

This is called event-based or stream-based parsing. Because it doesn't build any kind of DOM or AST data structure in memory, it keeps memory use low, which makes it a good fit when you only need a single pass over the input. In JavaScript, the iterator protocol (or the async iterator protocol) provides a straightforward mechanism to expose this kind of API to your program. For example, html-tokenizer/parser provides a synchronous iterator API:

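import Parser from 'html-tokenizer/parser';

const html = '<div data-foo="bar">hello<br>world</div>';

for (const token of Parser.parse(html)) {
  // Each case ends in `break` so it doesn't fall through to the next one.
  switch (token.type) {
    case 'open': break; // parser saw opening tag and attrs
    case 'text': break; // parser saw text node
    case 'close': break; // parser saw closing tag
    case 'comment': break; // parser saw comment
  }
}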
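To show what working without a DOM or AST looks like in practice, here is a small sketch of a consumer that makes one pass over a token stream and keeps only a counter and a string in memory. The `summarize` function and the token fields (`type`, `name`, `text`) are assumptions for the example, and the hard-coded array stands in for whatever iterable a real parser would hand you.

```javascript
// Single pass over any iterable of tokens. Nothing is retained except
// the running depth, the deepest nesting seen, and the collected text.
function summarize(tokens) {
  let depth = 0;
  let maxDepth = 0;
  let text = '';
  for (const token of tokens) {
    if (token.type === 'open') {
      depth += 1;
      maxDepth = Math.max(maxDepth, depth);
    } else if (token.type === 'close') {
      depth -= 1;
    } else if (token.type === 'text') {
      text += token.text;
    }
  }
  return { maxDepth, text };
}

// Stand-in for the tokens of <div data-foo="bar">hello<br>world</div>
const sampleTokens = [
  { type: 'open', name: 'div' },
  { type: 'text', text: 'hello' },
  { type: 'open', name: 'br' },
  { type: 'close', name: 'br' },
  { type: 'text', text: 'world' },
  { type: 'close', name: 'div' },
];
const summary = summarize(sampleTokens);
console.log(summary);
```

Since `summarize` only relies on `for...of`, it accepts an array, a generator, or any other object that implements the iterator protocol.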