Skip to content

Instantly share code, notes, and snippets.

@errzey
Last active August 29, 2015 14:01
Show Gist options
  • Save errzey/abdc64f1183a741fb255 to your computer and use it in GitHub Desktop.
Save errzey/abdc64f1183a741fb255 to your computer and use it in GitHub Desktop.
Writing efficient parsers for streaming data in C
C is known as one of the worst languages to write any type of parsers.
Developers usually take one of three roads for parsing data:
1. Lexical Analysis & LALR (lex / yacc) utilizing a BNF syntax input.
2. Gobs of strstr, strtok, strcmp, and loads of conditionals.
3. Regular expressions, which usually involve the same logic in #2
Utilizing Lex & Yacc can be one of the more strict and specific methods of
parsing. Though it has some shortcomings; for example, the generated code is
usually very complex, and in most cases, not thread-safe. While the BNF language
is sane and logical, the act of creating it can be daunting, not only for the
experienced, but for the next developer having to maintain your code.
Using various C str functions is by far the worst method, for not only performance, but
readability and maintainability. There is nothing worse than seeing a function
with fifty strtok's and 100 conditionals.
Regular expressions double the complexity and maintainability of a parser due to
the fact that the regular expression language itself can be impossible to read,
and in addition, usually incorporates a large number of C str functions and
conditionals around matched data.
In almost all cases, a developer will write parsers that deal with a single
block of data that can be fully parsed in one call. When dealing with streamed
data (e.g., network), this can cause problems. The work-around is usually in the
form of buffer-until-ready. That is, keep reading until a block of data can
be fully parsed.
In this presentation I will discuss how to write parsers which are resilient to
complexity, resource utilization, and overall maintainability. We will discuss
the initial thought process, from specification to expectations, prototyping,
to finalized implementation. Using these techniques, little to no system
resources are utilized, and parsing can be paused and resumed at any time
without interrupting the finalized structure.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment