Skip to content

Instantly share code, notes, and snippets.

@wraithan
Last active January 11, 2016 14:49
Show Gist options
  • Star 1 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save wraithan/354d666f731271b66cec to your computer and use it in GitHub Desktop.
Save wraithan/354d666f731271b66cec to your computer and use it in GitHub Desktop.
Thoughts on streaming (protocol) parsers in strongly typed languages

Streaming Parsers in Strongly Typed Languages

I've been building a number of parsers in Rust lately while studying or doing code challenges. One of my side projects that involved parsing is weechat-notifier. I set about building the parser in the most intuitive way to me as a primarily JavaScript developer these days.

Before we get into the mistakes I made, lets talk about the protocol I'm parsing. The WeeChat relay protocol is an interesting one. It has a set of primitives and uses those to dynamically build up types. This means you can parse it without any knowledge of the possible data structures. This means libraries can be built in robust ways to support future versions of WeeChat without having to update themselves!

There are many other protocols that carry their data structure metadata on the wire with them, but this was the first time I'd built a parser for one of them in a typed language. I set about building up an enum of the primitives, getting tests passing for them, then realizing from there the parsing was nearly done. Since the data was positional in the structures, I could very simply throw it in a Vec and be done!

These are the types I came up with and have a working implementation of:

#[derive(Debug)]
pub struct WeechatMessage {
    pub id: String,
    pub data: Vec<WeechatData>,
}


#[derive(PartialEq, Eq, Clone, Debug)]
pub enum WeechatData {
    Char(char),
    Int(i32),
    Long(i64),
    String(String),
    StringNull,
    Buffer(String),
    BufferNull,
    Pointer(String),
    Time(String),
    Array(Vec<WeechatData>),
    Hdata(String, Vec<WeechatData>, Vec<HashMap<String, WeechatData>>),
}

Pretty simple types! WeechatMessage is just simple struct and WeechatData has the minimal set of types to represent the primitives. Unfortunately the use of multiple Vec and a HashMap means a lot of checked access in Rust. Code using the resulting data structures was very cumbersome to write, requiring a lot of double checking the protocol and the type system didn't really help me at all.

I'm sure the more experienced typed programmers are shaking their head knowingly, or hissing at the dynamic kids on their lawns or whatever they do for fun. Honestly this kinda killed the project for me for a couple months. I built up a whole parser and I needed to throw away so much code and build it to use concrete types so users wouldn't be so burdened. It also made me sad that I'd have to give up future compatibility.

The thought came that I could have an Unknown type and have it use the dynamic structure while having concrete types being emitted. This idea got shattered when I realized I'd be bumping major version every time I moved a type from Unknown into a concrete type. I didn't want to place a different more treadmill like burden on my users either.

This morning as I sat down to my normal Saturday hacking sessions at my local cafe I realized I had a better solution. Since the messages all had names, I could have the parser be instantiated with an optional list of message names to be parsed in the dynamic style. This means users who opt into messages types that aren't fully supported yet don't get burned when I update the library.

Thinking about this more, the parser could take two lists, a concrete and a dynamic. Parsing and emitting only the message types specified. Also this means I get to keep my dynamic parser and just build up a concrete parser along side of it sharing in the lower level parts.

Thoughts and comments are very welcome! I'm still learning to let go of habits built up from years of Python and JavaScript development and could always use pointers.

@steveklabnik
Copy link

Code using the resulting data structures was very cumbersome to write, requiring a lot of double checking the protocol and the type system didn't really help me at all.

I'd be interested in hearing more about the specifics here. Did you know about entry? It can sometimes help.

@wraithan
Copy link
Author

The problem here was more that it was all about "3rd value is channel, 4th value is message" etc etc. Where I either used array indexing which is could easily throw an exception if something went wrong, or I have to do a if let Some(value) = message.data.nth(3) {... then if was a WeechatData::Hdata I have to match on that, then index into Vec of HashMap, then index into the HashMap.

Basically because I was dynamically looking up all the fields and I wanted to build the client library in a safe way for users, I had to do a lot of checking (at the very least I'd have to throw a lot of try! on everything. If instead of I had more concrete structs for each message type, then the client KNOWS that every field has been instantiated, making it able to be indexed into without doing safety checks everywhere.

@steveklabnik
Copy link

Ah, I see. that makes sense 👍

@wraithan
Copy link
Author

What I meant about the type system not helping me, was that I had to go look up every message type in the docs, if I was off by one I'd still get data, but (hopefully) the wrong datatype so I'd get an error. If it was a struct there would be no ambiguity whether I accessed the right field and what datatype that field is supposed to have. I can just have one check when the parser emits the parsed object if it is the type I expect and not do it everywhere throughout all my code.

@experquisite
Copy link

I am investigating a similar problem; I want the final results to be in properly typed structs, but I don't want to write the codec code for each struct separately; I want it derived from the struct declaration. I am hoping to find or invent something as convenient as https://github.com/scodec/scodec or https://github.com/travisbrown/circe (though those are two quite different projects).

@wraithan
Copy link
Author

@experquisite is https://github.com/serde-rs/serde closer to what you are looking for?

@RAOF
Copy link

RAOF commented Jan 10, 2016

Compile-time reflection (as implemented in Rust via proceedural macros, IIRC; available in unstable form on nightly) is pretty much what you're after there.

@the-kenny
Copy link

Hey! I'm very interested in this approach of a parser. I recently wrote a rust implementation of weechat-relay myself, but it's a rather dumb and static implementation without using parser combinators etc.

@wraithan
Copy link
Author

Neat @the-kenny! I'm hoping to have the static types side of my parser done in the next weekend or so (don't have much energy for coding after work on weekdays). After that, since I'll have experienced writing a parser from scratch for a binary protocol in a dynamic and a static way, I'll probably move over to https://github.com/Geal/nom . This is very much a learning project for me, but I should be able to 1.0 the parser before I finish moving over to nom, since it should just be internals changing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment