wraithan/streaming-parser.md

Streaming Parsers in Strongly Typed Languages

I've been building a number of parsers in Rust lately
while studying or doing code challenges. One of my side projects that involved
parsing is weechat-notifier. I
set about building the parser in the most intuitive way to me as a primarily
JavaScript developer these days.
Before we get into the mistakes I made, let's talk about the protocol I'm
parsing. The WeeChat relay protocol is an interesting one. It has a set of
primitives and uses those to dynamically build up types, so you can parse it
without any knowledge of the possible data structures. That lets libraries be
built robustly enough to support future versions of WeeChat without having to
update themselves!
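To make "parse it without any knowledge of the possible data structures" a bit more concrete, here is a minimal sketch of tag-driven parsing. It assumes, going by my reading of the relay protocol docs, that an object is announced by a three-letter type tag, that `int` is a 4-byte big-endian value, and that `str` is a 4-byte length followed by the bytes, with a length of -1 meaning NULL. `Primitive`, `parse_object` and `split` are names made up for this example, not part of weechat-notifier, and the sketch ignores containers entirely.

```rust
/// A tiny subset of the primitives, just enough to show tag-driven parsing.
#[derive(Debug, PartialEq)]
pub enum Primitive {
    Int(i32),
    /// `None` models the protocol's NULL string.
    Str(Option<String>),
}

/// Splits off the first `n` bytes, or returns None if the input is too short.
fn split(input: &[u8], n: usize) -> Option<(&[u8], &[u8])> {
    if input.len() < n {
        None
    } else {
        Some(input.split_at(n))
    }
}

/// Reads a 4-byte big-endian signed integer from the front of `input`.
fn read_i32(input: &[u8]) -> Option<(i32, &[u8])> {
    let (raw, rest) = split(input, 4)?;
    Some((i32::from_be_bytes([raw[0], raw[1], raw[2], raw[3]]), rest))
}

/// Reads one tagged object and returns it with the unparsed remainder.
/// Returns None on truncated input or a tag this sketch does not know.
pub fn parse_object(input: &[u8]) -> Option<(Primitive, &[u8])> {
    let (tag, rest) = split(input, 3)?;
    match tag {
        b"int" => {
            let (value, rest) = read_i32(rest)?;
            Some((Primitive::Int(value), rest))
        }
        b"str" => {
            let (len, rest) = read_i32(rest)?;
            if len < 0 {
                // Assumed NULL marker: a negative length and no payload.
                return Some((Primitive::Str(None), rest));
            }
            let (bytes, rest) = split(rest, len as usize)?;
            let text = String::from_utf8(bytes.to_vec()).ok()?;
            Some((Primitive::Str(Some(text)), rest))
        }
        _ => None,
    }
}
```

The point is that the parser never needs to know which message it is looking at: the tag on the wire decides what comes next, and the caller just keeps feeding the remainder back in.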
There are many other protocols that carry their data structure metadata on the
wire with them, but this was the first time I'd built a parser for one of them
in a typed language. I set about building up an enum of the primitives and
getting tests passing for them, then realized that from there the parsing was
nearly done. Since the data was positional in the structures, I could very
simply throw it in a Vec and be done!
These are the types I came up with and have a working implementation of:
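```rust
use std::collections::HashMap;

/// A parsed message: its id plus the values it carried, stored positionally.
#[derive(Debug)]
pub struct WeechatMessage {
    pub id: String,
    pub data: Vec<WeechatData>,
}

/// One variant per protocol primitive; Array and Hdata nest further values.
#[derive(PartialEq, Eq, Clone, Debug)]
pub enum WeechatData {
    Char(char),
    Int(i32),
    Long(i64),
    String(String),
    StringNull,
    Buffer(String),
    BufferNull,
    Pointer(String),
    Time(String),
    Array(Vec<WeechatData>),
    Hdata(String, Vec<WeechatData>, Vec<HashMap<String, WeechatData>>),
}
```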
Pretty simple types! WeechatMessage is just a simple struct and WeechatData
has the minimal set of types to represent the primitives. Unfortunately, the use
of multiple Vecs and a HashMap means a lot of checked access in Rust. Code
using the resulting data structures was very cumbersome to write: it required a
lot of double-checking against the protocol, and the type system didn't really
help me at all.
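To show what "a lot of checked access" looks like in practice, here is a rough sketch of reading a single string field out of the dynamic structure. The enum is a trimmed copy of the one above so the example compiles on its own. The `first_message` helper and the `"message"` field name are assumptions made for illustration, not code from the project.

```rust
use std::collections::HashMap;

// Trimmed copy of WeechatData so this example stands alone.
#[derive(PartialEq, Eq, Clone, Debug)]
pub enum WeechatData {
    Int(i32),
    String(String),
    StringNull,
    Hdata(String, Vec<WeechatData>, Vec<HashMap<String, WeechatData>>),
}

/// Hypothetical helper: fetch the "message" field of the first hdata item.
/// Every step can fail, so every step needs its own check.
pub fn first_message(data: &[WeechatData]) -> Option<&str> {
    // Check 1: is there any data at all?
    let first = data.first()?;
    // Check 2: is the first value actually an Hdata?
    let items = match first {
        WeechatData::Hdata(_, _, items) => items,
        _ => return None,
    };
    // Check 3: does the hdata contain at least one item?
    let item = items.first()?;
    // Check 4: does that item have the key we expect?
    // Check 5: is the value under that key a non-null string?
    match item.get("message")? {
        WeechatData::String(text) => Some(text.as_str()),
        _ => None,
    }
}
```

That is five runtime checks for one field, and the compiler can't tell me whether `"message"` is even a key the protocol would send. With a concrete struct, most of this would collapse into a plain field access.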
I'm sure the more experienced typed programmers are shaking their heads
knowingly, or hissing at the dynamic kids on their lawns or whatever they do for
fun. Honestly this kinda killed the project for me for a couple months. I had
built up a whole parser, and now I needed to throw away a lot of that code and
rebuild it around concrete types so users wouldn't be so burdened. It also made
me sad that I'd have to give up future compatibility.
The thought came that I could have an Unknown type that uses the dynamic
structure, while concrete types are emitted for everything the library already
understands. This idea got shattered when I realized I'd be bumping the major
version every time I moved a type from Unknown into a concrete type. I didn't
want to trade one burden on my users for a different, more treadmill-like one
either.
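Here is a small sketch of why moving a type out of Unknown is a breaking change. Everything in it is hypothetical: `Message`, its variants, and the `"buffer_opened"` name only stand in for what a "version 1" of the library and a user's code against it might look like.

```rust
/// Hypothetical "version 1" of the library's output type.
#[derive(Debug, PartialEq)]
pub enum Message {
    /// A message type the library already models concretely.
    Nicklist(Vec<String>),
    /// Everything else falls back to the dynamic representation.
    Unknown { name: String, fields: Vec<String> },
}

/// Downstream code written against version 1.
pub fn describe(msg: &Message) -> String {
    match msg {
        Message::Nicklist(nicks) => format!("{} nicks", nicks.len()),
        // This arm relies on "buffer_opened" arriving as Unknown.
        Message::Unknown { name, .. } if name.as_str() == "buffer_opened" => {
            "buffer opened".to_string()
        }
        Message::Unknown { name, .. } => format!("ignored {}", name),
    }
}
```

If a later release added a `BufferOpened` variant, this `match` would stop compiling because it is no longer exhaustive. Even if the user had a catch-all arm, the guarded `Unknown` arm would quietly stop firing, since the parser would no longer emit Unknown for that name. Either way the user's code breaks, which is what forces the major version bump.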
This morning, as I sat down to my normal Saturday hacking session at my local
cafe, I realized I had a better solution. Since the messages all have names, I
could have the parser be instantiated with an optional list of message names to
be parsed in the dynamic style. This means users who opt into message types
that aren't fully supported yet don't get burned when I update the library.
Thinking about this more, the parser could take two lists, a concrete one and a
dynamic one, and parse and emit only the message types specified. This also
means I get to keep my dynamic parser and just build up a concrete parser
alongside it, sharing the lower-level parts.
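A rough sketch of the two-list idea, to check that it hangs together. `ParserConfig`, `Mode` and `mode_for` are placeholder names for this example, and the sketch only covers the routing decision. The actual concrete and dynamic parsers would sit behind it.

```rust
use std::collections::HashSet;

/// How a given message should be parsed.
#[derive(Debug, PartialEq)]
pub enum Mode {
    Concrete,
    Dynamic,
}

/// Hypothetical parser configuration holding the two opt-in lists.
pub struct ParserConfig {
    concrete: HashSet<String>,
    dynamic: HashSet<String>,
}

impl ParserConfig {
    pub fn new(concrete: &[&str], dynamic: &[&str]) -> Self {
        ParserConfig {
            concrete: concrete.iter().map(|s| s.to_string()).collect(),
            dynamic: dynamic.iter().map(|s| s.to_string()).collect(),
        }
    }

    /// Decides how a message with this name should be handled.
    /// None means the caller never asked for it, so it is skipped.
    /// If a name appears in both lists, the concrete list wins.
    pub fn mode_for(&self, name: &str) -> Option<Mode> {
        if self.concrete.contains(name) {
            Some(Mode::Concrete)
        } else if self.dynamic.contains(name) {
            Some(Mode::Dynamic)
        } else {
            None
        }
    }
}
```

Because the user names each message type explicitly, a later release that adds a concrete type for something they listed as dynamic changes nothing for them until they move that name to the other list themselves.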
Thoughts and comments are very welcome! I'm still learning to let go of habits
built up from years of Python and JavaScript development and could always use
pointers.