An HTML parser scans through an input string, starting at the beginning:
<div>hello<br>world</div>
│
└─ scanning begins here
While scanning, it sees parts of the HTML structure, like opening tags, attributes, closing tags, and bits of text.
<div data-foo="bar">hello<br>world</div>
└────────┬─────────┘
         └─ open div { 'data-foo': 'bar' }
<div data-foo="bar">hello<br>world</div>
                    └─┬─┘
                      └─ text "hello"
<div data-foo="bar">hello<br>world</div>
                         └┬─┘
                          └─ open br, close br
<div data-foo="bar">hello<br>world</div>
                             └─┬─┘
                               └─ text "world"
<div data-foo="bar">hello<br>world</div>
                                  └─┬──┘
                                    └─ close div
The parser encapsulates the details of recognizing these structures within the input string. It then produces a token stream: a series of objects describing what it finds during the scan, as shown above.
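For the example input above, the token stream might look something like this. (The object shapes here are illustrative; exact property names vary from library to library.)

```javascript
// Illustrative token stream for '<div data-foo="bar">hello<br>world</div>'.
// Token shapes are hypothetical, not any particular library's API.
const tokens = [
  { type: 'open', name: 'div', attributes: { 'data-foo': 'bar' } },
  { type: 'text', text: 'hello' },
  { type: 'open', name: 'br', attributes: {} },
  { type: 'close', name: 'br' },
  { type: 'text', text: 'world' },
  { type: 'close', name: 'div' },
];
```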
NOTE: If the input is malformed, it's the parser's job to handle that somehow, either by throwing an error or by correcting the problem in the token stream, such that the stream represents a well-formed HTML document: each opening tag matched by a closing tag, and so on.
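As a rough sketch of the "correcting" strategy, a tolerant parser can synthesize close tokens for any tags still open at the end of input. The function and token shapes below are hypothetical illustrations, not a real library's API, and this toy version ignores mismatched closes entirely:

```javascript
// Sketch: append synthesized close tokens so the stream always balances.
// Toy version: assumes closes match opens; real parsers do much more.
function balance(tokens) {
  const open = [];
  const out = [];
  for (const token of tokens) {
    if (token.type === 'open') open.push(token.name);
    if (token.type === 'close') open.pop();
    out.push(token);
  }
  // Close any tags left open at end of input, innermost first.
  while (open.length > 0) {
    out.push({ type: 'close', name: open.pop(), synthesized: true });
  }
  return out;
}

// '<div>hello' is missing its closing tag:
const fixed = balance([
  { type: 'open', name: 'div' },
  { type: 'text', text: 'hello' },
]);
// fixed now ends with { type: 'close', name: 'div', synthesized: true }
```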
This is called event-based or stream-based parsing. Since it doesn't build a DOM or AST data structure in memory, it can be very performant for certain applications. In JavaScript, the iterator protocol (or the async iterator protocol) provides a straightforward mechanism for exposing this kind of API to your program. For example, html-tokenizer/parser provides a synchronous iterator API:
import Parser from 'html-tokenizer/parser';

for (const token of Parser.parse(html)) {
  switch (token.type) {
    case 'open':    // parser saw an opening tag and its attrs
      break;
    case 'text':    // parser saw a text node
      break;
    case 'close':   // parser saw a closing tag
      break;
    case 'comment': // parser saw a comment
      break;
  }
}
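To see how the iterator protocol and stream-based parsing fit together, here is a minimal generator-based sketch. This is a toy tokenizer, not html-tokenizer itself: it handles only plain tags and text (no attributes, comments, or void-element handling), but it shows how a parse can be exposed as a lazy stream of tokens:

```javascript
// Toy sketch of an event-based tokenizer exposed via the iterator
// protocol. Handles only simple open/close tags and text.
function* tokenize(html) {
  const re = /<(\/?)([a-zA-Z][a-zA-Z0-9-]*)[^>]*>|([^<]+)/g;
  let m;
  while ((m = re.exec(html)) !== null) {
    if (m[3] !== undefined) {
      yield { type: 'text', text: m[3] };
    } else if (m[1] === '/') {
      yield { type: 'close', name: m[2].toLowerCase() };
    } else {
      yield { type: 'open', name: m[2].toLowerCase() };
    }
  }
}

for (const token of tokenize('<div>hello<br>world</div>')) {
  console.log(token.type, token.name ?? token.text);
}
```

Because the generator yields tokens one at a time, the consumer can stop early (say, after finding the first `<title>`) without the cost of tokenizing the rest of the input.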