Parser combinator in Rust (https://git.io/xml-parser-combinator)
#![allow(dead_code)]
#[derive(Clone, Debug, PartialEq, Eq)]
struct Element {
    name: String,
    attributes: Vec<(String, String)>,
    children: Vec<Element>,
}
/// Time For A Trait
/// https://bodil.lol/parser-combinators/#time-for-a-trait
///
/// You might have noticed by now that we keep repeating the shape of the parser
/// type signature:
///
///     Fn(&str) -> Result<(&str, Output), &str>
///
/// You may be getting as sick of reading it written out in full like that as I'm
/// getting of writing it, so I think it's time to introduce a trait, to make
/// things a little more readable, and to let us add some extensibility to our
/// parsers.
///
/// But first of all, let's make a type alias for that return type we keep
/// using:
type ParserResult<'a, Output> = Result<(&'a str, Output), &'a str>;
/// So that now, instead of typing that monstrosity out all the time, we can
/// just type `ParserResult<String>` or similar. We've added a lifetime there,
/// because the type declaration requires it, but a lot of the time the Rust
/// compiler should be able to infer it for you. As a rule, try leaving the
/// lifetime out and see if rustc gets upset, then just put it in if it does.
///
/// The lifetime 'a, in this case, refers specifically to the lifetime of the
/// input.
///
/// Now, for the trait. We need to put the lifetime in here as well, and when
/// you're using the trait the lifetime is almost always required. It's a bit
/// of extra typing, but it beats the previous version.
trait Parser<'a, Output> {
    fn parse(&self, input: &'a str) -> ParserResult<'a, Output>;
    fn map<F, NewOutput>(self, map_fn: F) -> BoxedParser<'a, NewOutput>
    where
        Self: Sized + 'a,
        Output: 'a,
        NewOutput: 'a,
        F: Fn(Output) -> NewOutput + 'a,
    {
        BoxedParser::new(map(self, map_fn))
    }
    fn pred<F>(self, pred_fn: F) -> BoxedParser<'a, Output>
    where
        Self: Sized + 'a,
        Output: 'a,
        F: Fn(&Output) -> bool + 'a,
    {
        BoxedParser::new(pred(self, pred_fn))
    }
    /// Are You Going To Say The M Word Or Do I Have To?
    /// https://bodil.lol/parser-combinators/#are-you-going-to-say-the-m-word-or-do-i-have-to
    ///
    /// Remember we talked about how the `map` pattern is called a "functor" on
    /// Planet Haskell?
    ///
    /// The `and_then` pattern is another thing you see a lot in Rust, generally
    /// in the same places as `map`. It's called `flat_map` on Iterator, but
    /// it's the same pattern as the rest.
    ///
    /// The fancy word for it is "monad." If you've got a thing Thing<A>, and
    /// you have an and_then function available that you can pass a function
    /// from A to Thing<B> into, so that now you have a new Thing<B> instead,
    /// that's a monad.
    ///
    /// The function might get called instantly, like when you have an
    /// Option<A>: we already know if it's a Some(A) or a None, so we apply the
    /// function directly if it's a Some(A), giving us a Some(B).
    ///
    /// It might also be called lazily. For instance, if you have a Future<A>
    /// that is still waiting to resolve, instead of and_then immediately
    /// calling the function to create a Future<B>, instead it creates a new
    /// Future<B> which contains both the Future<A> and the function, and which
    /// then waits for Future<A> to finish. When it does, it calls the function
    /// with the result of the Future<A>, and Bob's your uncle, you get your
    /// Future<B> back. In other words, in the case of a Future you can think of
    /// the function you pass to and_then as a callback function, because it
    /// gets called with the result of the original future when it completes.
    /// It's also a little more interesting than that, because it returns a new
    /// Future, which may or may not have already been resolved, so it's a way
    /// to chain futures together.
    ///
    /// As with functors, though, Rust's type system isn't currently capable of
    /// expressing monads, so let's only note that this pattern is called a
    /// monad, and that, rather disappointingly, it's got nothing at all to do
    /// with burritos, contrary to what they say on the internets, and move on.
    fn and_then<F, NextParser, NewOutput>(self, f: F) -> BoxedParser<'a, NewOutput>
    where
        Self: Sized + 'a,
        Output: 'a,
        NewOutput: 'a,
        NextParser: Parser<'a, NewOutput> + 'a,
        F: Fn(Output) -> NextParser + 'a,
    {
        BoxedParser::new(and_then(self, f))
    }
}
/// To make this even easier, we can actually implement this trait for any
/// function that matches the signature of a parser:
impl<'a, F, Output> Parser<'a, Output> for F
where
    F: Fn(&'a str) -> ParserResult<Output>,
{
    fn parse(&self, input: &'a str) -> ParserResult<'a, Output> {
        self(input)
    }
}
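// Illustrative test, not part of the original gist: thanks to the blanket impl
// above, any plain function with the right signature already counts as a
// Parser and can be called through the trait's `parse` method.
#[test]
fn function_is_a_parser() {
    fn the_letter_a(input: &str) -> ParserResult<()> {
        match input.chars().next() {
            Some('a') => Ok((&input[1..], ())),
            _ => Err(input),
        }
    }
    assert_eq!(Ok(("bc", ())), the_letter_a.parse("abc"));
    assert_eq!(Err("xyz"), the_letter_a.parse("xyz"));
}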
/// To Infinity And Beyond
/// https://bodil.lol/parser-combinators/#to-infinity-and-beyond
///
/// enum List<A> {
///     Cons(A, List<A>),
///     Nil,
/// }
///
/// To which rustc will, very sensibly, object that your recursive type List<A>
/// has an infinite size, because inside every List::<A>::Cons is another
/// List<A>, and that means it's List<A>s all the way down into infinity. As far
/// as rustc is concerned, we're asking for an infinite list, and we're asking it
/// to be able to allocate an infinite list.
///
/// In many languages, an infinite list isn't a problem in principle for the type
/// system, and it's actually not for Rust either. The problem is that in Rust,
/// as mentioned, we need to be able to allocate it, or, rather, we need to be
/// able to determine the size of a type up front when we construct it, and when
/// the type is infinite, that means the size must be infinite too.
///
/// The solution is to employ a bit of indirection. Instead of our List::Cons
/// being an element of A and another list of A, instead we make it an element of
/// A and a pointer to a list of A. We know the size of a pointer, and it's the
/// same no matter what it points to, and so our List::Cons now has a fixed and
/// predictable size no matter the size of the list. And the way to turn an owned
/// thing into a pointer to an owned thing on the heap, in Rust, is to Box it.
///
/// enum List<A> {
///     Cons(A, Box<List<A>>),
///     Nil,
/// }
///
/// Another interesting feature of Box is that the type inside it can be
/// abstract. This means that instead of our by now incredibly complicated
/// parser function types, we can let the type checker deal with a very succinct
/// Box<dyn Parser<'a, A>> instead.
///
/// That sounds great. What's the downside? Well, we might be losing a cycle or
/// two to having to follow that pointer, and it could be that the compiler
/// loses some opportunities to optimise our parser. But recall Knuth's
/// admonition about premature optimisation: it's going to be fine. You can
/// afford those cycles.
///
/// So let's proceed to implement Parser for a boxed parser function in addition
/// to the bare functions we've been using so far.
struct BoxedParser<'a, Output> {
    parser: Box<dyn Parser<'a, Output> + 'a>,
}
impl<'a, Output> BoxedParser<'a, Output> {
    fn new<P>(parser: P) -> Self
    where
        P: Parser<'a, Output> + 'a,
    {
        BoxedParser {
            parser: Box::new(parser),
        }
    }
}
impl<'a, Output> Parser<'a, Output> for BoxedParser<'a, Output> {
    fn parse(&self, input: &'a str) -> ParserResult<'a, Output> {
        self.parser.parse(input)
    }
}
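// Illustrative test, not part of the original gist: boxing a parser erases its
// concrete type but leaves its behaviour unchanged, which is what lets the
// trait methods above all return the single concrete type `BoxedParser`.
#[test]
fn boxed_parser_combinator() {
    let boxed = BoxedParser::new(match_literal("ha"));
    assert_eq!(Ok(("ha", ())), boxed.parse("haha"));
    assert_eq!(Err("oops"), boxed.parse("oops"));
}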
/*
fn match_literal(expected: &'static str)
    -> impl Fn(&str) -> Result<(&str, ()), &str>
{
    move |input| match input.get(0..expected.len()) {
        Some(next) if next == expected => {
            Ok((&input[expected.len()..], ()))
        }
        _ => Err(input),
    }
}
*/
fn match_literal<'a>(expected: &'static str) -> impl Parser<'a, ()> {
    move |input: &'a str| match input.get(0..expected.len()) {
        Some(next) if next == expected => Ok((&input[expected.len()..], ())),
        _ => Err(input),
    }
}
#[test]
fn literal_parser() {
    let parse_joe = match_literal("Hello Joe!");
    assert_eq!(Ok(("", ())), parse_joe.parse("Hello Joe!"));
    assert_eq!(
        Ok((" Hello Robert!", ())),
        parse_joe.parse("Hello Joe! Hello Robert!")
    );
    assert_eq!(Err("Hello Mike!"), parse_joe.parse("Hello Mike!"));
}
// fn identifier(input: &str) -> Result<(&str, String), &str> {
fn identifier(input: &str) -> ParserResult<String> {
    let mut matched = String::new();
    let mut chars = input.chars();
    match chars.next() {
        Some(next) if next.is_alphabetic() => matched.push(next),
        _ => return Err(input),
    }
    while let Some(next) = chars.next() {
        if next.is_alphanumeric() || next == '-' {
            matched.push(next);
        } else {
            break;
        }
    }
    let next_index = matched.len();
    Ok((&input[next_index..], matched))
}
#[test]
fn identifier_parser() {
    assert_eq!(
        Ok(("", "i-am-an-identifier".to_string())),
        identifier("i-am-an-identifier")
    );
    assert_eq!(
        Ok((" entirely an identifier", "not".to_string())),
        identifier("not entirely an identifier")
    );
    assert_eq!(
        Err("!not at all an identifier"),
        identifier("!not at all an identifier")
    );
}
/*
fn pair<P1, P2, R1, R2>(parser1: P1, parser2: P2)
    -> impl Fn(&str) -> Result<(&str, (R1, R2)), &str>
where
    P1: Fn(&str) -> Result<(&str, R1), &str>,
    P2: Fn(&str) -> Result<(&str, R2), &str>,
{
    move |input| match parser1(input) {
        Ok((next_input, result1)) => match parser2(next_input) {
            Ok((final_input, result2)) => Ok((final_input, (result1, result2))),
            Err(err) => Err(err),
        },
        Err(err) => Err(err),
    }
}

fn pair<'a, P1, P2, R1, R2>(parser1: P1, parser2: P2)
    -> impl Parser<'a, (R1, R2)>
where
    P1: Parser<'a, R1>,
    P2: Parser<'a, R2>,
{
    move |input| match parser1.parse(input) {
        Ok((next_input, result1)) => match parser2.parse(next_input) {
            Ok((final_input, result2)) => Ok((final_input, (result1, result2))),
            Err(err) => Err(err),
        },
        Err(err) => Err(err),
    }
}
*/
fn pair<'a, P1, P2, R1, R2>(parser1: P1, parser2: P2) -> impl Parser<'a, (R1, R2)>
where
    P1: Parser<'a, R1>,
    P2: Parser<'a, R2>,
{
    move |input| {
        parser1.parse(input).and_then(|(next_input, result1)| {
            parser2
                .parse(next_input)
                .map(|(final_input, result2)| (final_input, (result1, result2)))
        })
    }
}
/// It looks very neat, but there's a problem: parser2.map() consumes parser2 to
/// create the wrapped parser, and the function is a Fn, not a FnOnce, so it's
/// not allowed to consume parser2, just take a reference to it. Rust Problems,
/// in other words.
///
/// fn pair<'a, P1, P2, R1, R2>(parser1: P1, parser2: P2)
///     -> impl Parser<'a, (R1, R2)>
/// where
///     P1: Parser<'a, R1> + 'a,
///     P2: Parser<'a, R2> + 'a,
///     R1: 'a + Clone,
///     R2: 'a,
/// {
///     parser1.and_then(move |result1| parser2.map(move |result2| {
///         (result1.clone(), result2)
///     }))
/// }
#[test]
fn pair_combinator() {
    let tag_opener = pair(match_literal("<"), identifier);
    assert_eq!(
        Ok(("/>", ((), "my-first-element".to_string()))),
        tag_opener.parse("<my-first-element/>")
    );
    assert_eq!(Err("oops"), tag_opener.parse("oops"));
    assert_eq!(Err("!oops"), tag_opener.parse("<!oops"));
}
/// Enter The Functor
/// https://bodil.lol/parser-combinators/#enter-the-functor
///
/// Actually, let's indulge ourselves and shorten this function a bit, because
/// this map thing turns out to be a common pattern that Result actually
/// implements too.
///
/// This pattern is what's called a "functor" in Haskell and its mathematical
/// sibling, category theory. If you've got a thing with a type A in it, and you
/// have a map function available that you can pass a function from A to B into
/// to turn it into the same kind of thing but with the type B in it instead,
/// that's a functor. You see this a lot of places in Rust, such as in Option,
/// Result, Iterator and even Future, without it being explicitly named as such.
/// And there's a good reason for that: you can't really express a functor as a
/// generalised thing in Rust's type system, because it lacks higher kinded
/// types, but that's another story, so let's just note that these are functors,
/// and you just need to look for the map function to spot one.
///
/// fn map<P, F, A, B>(parser: P, map_fn: F)
///     -> impl Fn(&str) -> Result<(&str, B), &str>
/// where
///     P: Fn(&str) -> Result<(&str, A), &str>,
///     F: Fn(A) -> B,
/// {
///     move |input| match parser(input) {
///         Ok((next_input, result)) => Ok((next_input, map_fn(result))),
///         Err(err) => Err(err),
///     }
/// }
///
/// fn map<P, F, A, B>(parser: P, map_fn: F)
///     -> impl Fn(&str) -> Result<(&str, B), &str>
/// where
///     P: Fn(&str) -> Result<(&str, A), &str>,
///     F: Fn(A) -> B,
/// {
///     move |input|
///         parser(input)
///             .map(|(next_input, result)| (next_input, map_fn(result)))
/// }
fn map<'a, P, F, A, B>(parser: P, map_fn: F) -> impl Parser<'a, B>
where
    P: Parser<'a, A>,
    F: Fn(A) -> B,
{
    move |input| {
        parser
            .parse(input)
            .map(|(next_input, result)| (next_input, map_fn(result)))
    }
}
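// Illustrative test, not part of the original gist: `map` only transforms the
// output value, it never changes how much input the underlying parser consumes.
#[test]
fn map_combinator() {
    let parser = map(match_literal("1"), |_| 1);
    assert_eq!(Ok(("", 1)), parser.parse("1"));
    assert_eq!(Err("2"), parser.parse("2"));
}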
fn left<'a, P1, P2, R1, R2>(parser1: P1, parser2: P2) -> impl Parser<'a, R1>
where
    P1: Parser<'a, R1> + 'a,
    P2: Parser<'a, R2> + 'a,
    R1: 'a,
    R2: 'a,
{
    // map(pair(parser1, parser2), |(left, _right)| left)
    pair(parser1, parser2).map(|(left, _right)| left)
}
fn right<'a, P1, P2, R1, R2>(parser1: P1, parser2: P2) -> impl Parser<'a, R2>
where
    P1: Parser<'a, R1> + 'a,
    P2: Parser<'a, R2> + 'a,
    R1: 'a,
    R2: 'a,
{
    // map(pair(parser1, parser2), |(_left, right)| right)
    pair(parser1, parser2).map(|(_left, right)| right)
}
#[test]
fn right_combinator() {
    let tag_opener = right(match_literal("<"), identifier);
    assert_eq!(
        Ok(("/>", "my-first-element".to_string())),
        tag_opener.parse("<my-first-element/>")
    );
    assert_eq!(Err("oops"), tag_opener.parse("oops"));
    assert_eq!(Err("!oops"), tag_opener.parse("<!oops"));
}
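// Illustrative test, not part of the original gist: `left` is the mirror image
// of `right`, discarding the second result instead of the first.
#[test]
fn left_combinator() {
    let tag_opener = left(match_literal("<"), identifier);
    assert_eq!(Ok(("/>", ())), tag_opener.parse("<my-first-element/>"));
    assert_eq!(Err("oops"), tag_opener.parse("oops"));
}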
fn zero_or_more<'a, P, A>(parser: P) -> impl Parser<'a, Vec<A>>
where
    P: Parser<'a, A>,
{
    move |mut input| {
        let mut result = Vec::new();
        while let Ok((next_input, next_item)) = parser.parse(input) {
            input = next_input;
            result.push(next_item);
        }
        Ok((input, result))
    }
}
#[test]
fn zero_or_more_combinator() {
    let parser = zero_or_more(match_literal("ha"));
    assert_eq!(Ok(("", vec![(), (), ()])), parser.parse("hahaha"));
    assert_eq!(Ok(("ahah", vec![])), parser.parse("ahah"));
    assert_eq!(Ok(("", vec![])), parser.parse(""));
}
fn one_or_more<'a, P, A>(parser: P) -> impl Parser<'a, Vec<A>>
where
    P: Parser<'a, A>,
{
    move |mut input| {
        let mut result = Vec::new();
        if let Ok((next_input, first_item)) = parser.parse(input) {
            input = next_input;
            result.push(first_item);
        } else {
            return Err(input);
        }
        while let Ok((next_input, next_item)) = parser.parse(input) {
            input = next_input;
            result.push(next_item);
        }
        Ok((input, result))
    }
}
/// Ownership problem
/// https://bodil.lol/parser-combinators/#one-or-more
///
/// Here, we run into Rust Problems, and I don't even mean the problem of not
/// having a cons method for Vec, but I know every Lisp programmer reading that
/// bit of code was thinking it. No, it's worse than that: it's ownership.
///
/// We own that parser, so we can't go passing it as an argument twice, the
/// compiler will start shouting at you that you're trying to move an already
/// moved value. So can we make our combinators take references instead? No, it
/// turns out, not without running into another whole set of borrow checker
/// troubles - and we're not going to even go there right now. And because these
/// parsers are functions, they don't do anything so straightforward as to
/// implement Clone, which would have saved the day very tidily, so we're stuck
/// with a constraint that we can't repeat our parsers easily in combinators.
///
/// fn one_or_more<'a, P, A>(parser: P) -> impl Parser<'a, Vec<A>>
/// where
///     P: Parser<'a, A>,
/// {
///     map(pair(parser, zero_or_more(parser)), |(head, mut tail)| {
///         tail.insert(0, head);
///         tail
///     })
/// }
#[test]
fn one_or_more_combinator() {
    let parser = one_or_more(match_literal("ha"));
    assert_eq!(Ok(("", vec![(), (), ()])), parser.parse("hahaha"));
    assert_eq!(Err("ahah"), parser.parse("ahah"));
    assert_eq!(Err(""), parser.parse(""));
}
fn any_char(input: &str) -> ParserResult<char> {
    match input.chars().next() {
        Some(next) => Ok((&input[next.len_utf8()..], next)),
        _ => Err(input),
    }
}
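// Illustrative test, not part of the original gist: `any_char` advances by the
// matched character's UTF-8 length, so multi-byte characters slice cleanly.
#[test]
fn any_char_parser() {
    assert_eq!(Ok(("mg", 'o')), any_char("omg"));
    assert_eq!(Ok(("", 'é')), any_char("é"));
    assert_eq!(Err(""), any_char(""));
}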
fn pred<'a, P, A, F>(parser: P, predicate: F) -> impl Parser<'a, A>
where
    P: Parser<'a, A>,
    F: Fn(&A) -> bool,
{
    move |input| {
        if let Ok((next_input, value)) = parser.parse(input) {
            if predicate(&value) {
                return Ok((next_input, value));
            }
        }
        Err(input)
    }
}
#[test]
fn predicate_combinator() {
    let parser = pred(any_char, |c| *c == 'o');
    assert_eq!(Ok(("mg", 'o')), parser.parse("omg"));
    assert_eq!(Err("lol"), parser.parse("lol"));
}
fn whitespace_char<'a>() -> impl Parser<'a, char> {
    // pred(any_char, |c| c.is_whitespace())
    any_char.pred(|c| c.is_whitespace())
}
fn space1<'a>() -> impl Parser<'a, Vec<char>> {
    one_or_more(whitespace_char())
}
fn space0<'a>() -> impl Parser<'a, Vec<char>> {
    zero_or_more(whitespace_char())
}
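// Illustrative test, not part of the original gist: `space1` demands at least
// one whitespace character, while `space0` is happy with none.
#[test]
fn whitespace_parsers() {
    assert_eq!(Ok(("x", vec![' ', ' '])), space1().parse("  x"));
    assert_eq!(Err("x"), space1().parse("x"));
    assert_eq!(Ok(("x", vec![])), space0().parse("x"));
}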
fn quoted_string<'a>() -> impl Parser<'a, String> {
    right(
        match_literal("\""),
        left(
            zero_or_more(any_char.pred(|c| *c != '"')),
            match_literal("\""),
        ),
    )
    .map(|chars| chars.into_iter().collect())
}
#[test]
fn quoted_string_parser() {
    assert_eq!(
        Ok(("", "Hello Joe!".to_string())),
        quoted_string().parse("\"Hello Joe!\"")
    );
}
fn attribute_pair<'a>() -> impl Parser<'a, (String, String)> {
    pair(identifier, right(match_literal("="), quoted_string()))
}
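// Illustrative test, not part of the original gist: a single attribute pair is
// an identifier, an `=`, and a quoted string, with no surrounding whitespace.
#[test]
fn attribute_pair_parser() {
    assert_eq!(
        Ok(("", ("one".to_string(), "1".to_string()))),
        attribute_pair().parse("one=\"1\"")
    );
}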
fn attributes<'a>() -> impl Parser<'a, Vec<(String, String)>> {
    zero_or_more(right(space1(), attribute_pair()))
}
#[test]
fn attribute_parser() {
    assert_eq!(
        Ok((
            "",
            vec![
                ("one".to_string(), "1".to_string()),
                ("two".to_string(), "2".to_string())
            ]
        )),
        attributes().parse(" one=\"1\" two=\"2\"")
    );
}
fn element_start<'a>() -> impl Parser<'a, (String, Vec<(String, String)>)> {
    right(match_literal("<"), pair(identifier, attributes()))
}
fn single_element<'a>() -> impl Parser<'a, Element> {
    // map(
    //     left(element_start(), match_literal("/>")),
    //     |(name, attributes)| Element {
    //         name,
    //         attributes,
    //         children: vec![],
    //     },
    // )
    left(element_start(), match_literal("/>")).map(|(name, attributes)| Element {
        name,
        attributes,
        children: vec![],
    })
}
#[test]
fn single_element_parser() {
    assert_eq!(
        Ok((
            "",
            Element {
                name: "div".to_string(),
                attributes: vec![("class".to_string(), "float".to_string())],
                children: vec![]
            }
        )),
        single_element().parse("<div class=\"float\"/>")
    );
}
fn open_element<'a>() -> impl Parser<'a, Element> {
    left(element_start(), match_literal(">")).map(|(name, attributes)| Element {
        name,
        attributes,
        children: vec![],
    })
}
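// Illustrative test, not part of the original gist: `open_element` stops right
// after the `>`, leaving any children and the closing tag for later parsers.
#[test]
fn open_element_parser() {
    assert_eq!(
        Ok((
            "<child/></div>",
            Element {
                name: "div".to_string(),
                attributes: vec![("class".to_string(), "float".to_string())],
                children: vec![]
            }
        )),
        open_element().parse("<div class=\"float\"><child/></div>")
    );
}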
fn either<'a, P1, P2, A>(parser1: P1, parser2: P2) -> impl Parser<'a, A>
where
    P1: Parser<'a, A>,
    P2: Parser<'a, A>,
{
    move |input| match parser1.parse(input) {
        ok @ Ok(_) => ok,
        Err(_) => parser2.parse(input),
    }
}
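// Illustrative test, not part of the original gist: `either` tries the first
// parser and only falls back to the second one if the first fails.
#[test]
fn either_combinator() {
    let parser = either(match_literal("yes"), match_literal("no"));
    assert_eq!(Ok(("", ())), parser.parse("yes"));
    assert_eq!(Ok(("", ())), parser.parse("no"));
    assert_eq!(Err("maybe"), parser.parse("maybe"));
}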
fn element<'a>() -> impl Parser<'a, Element> {
    whitespace_wrap(either(single_element(), parent_element()))
}
fn close_element<'a>(expected_name: String) -> impl Parser<'a, String> {
    right(match_literal("</"), left(identifier, match_literal(">")))
        .pred(move |name| name == &expected_name)
}
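// Illustrative test, not part of the original gist: `close_element` only
// accepts a closing tag whose name matches the one we opened with.
#[test]
fn close_element_parser() {
    assert_eq!(
        Ok(("", "middle".to_string())),
        close_element("middle".to_string()).parse("</middle>")
    );
    assert_eq!(
        Err("</middle>"),
        close_element("top".to_string()).parse("</middle>")
    );
}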
/// Remember I said we'd get back to and_then later? Well, later is here. The
/// combinator we need is, in fact, and_then: we need something that takes a
/// parser, and a function that takes the result of a parser and returns a new
/// parser, which we'll then run. It's a bit like pair, except instead of just
/// collecting both results in a tuple, we thread them through a function. It's
/// also just how and_then works with Results and Options, except it's a bit
/// easier to follow because Results and Options don't really do anything,
/// they're just things that hold some data (or not, as the case may be).
///
/// So let's try writing an implementation for it.
fn and_then<'a, P, F, A, B, NextP>(parser: P, f: F) -> impl Parser<'a, B>
where
    P: Parser<'a, A>,
    NextP: Parser<'a, B>,
    F: Fn(A) -> NextP,
{
    move |input| match parser.parse(input) {
        Ok((next_input, result)) => f(result).parse(next_input),
        Err(err) => Err(err),
    }
}
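// Illustrative test, not part of the original gist: with `and_then`, the second
// parser can be built from the first parser's result. Here we read one
// character and then require the very next character to be the same one.
#[test]
fn and_then_combinator() {
    let doubled = any_char.and_then(|first| any_char.pred(move |next| *next == first));
    assert_eq!(Ok(("c", 'b')), doubled.parse("bbc"));
    assert_eq!(Err("bc"), doubled.parse("abc"));
}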
fn parent_element<'a>() -> impl Parser<'a, Element> {
    open_element().and_then(|el| {
        left(zero_or_more(element()), close_element(el.name.clone())).map(move |children| {
            let mut el = el.clone();
            el.children = children;
            el
        })
    })
}
fn whitespace_wrap<'a, P, A>(parser: P) -> impl Parser<'a, A>
where
    P: Parser<'a, A> + 'a,
    A: 'a,
{
    right(space0(), left(parser, space0()))
}
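// Illustrative test, not part of the original gist: `whitespace_wrap` trims
// optional whitespace on both sides of whatever the wrapped parser matches.
#[test]
fn whitespace_wrap_combinator() {
    let parser = whitespace_wrap(match_literal("x"));
    assert_eq!(Ok(("", ())), parser.parse("  x  "));
    assert_eq!(Ok(("", ())), parser.parse("x"));
}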
#[test]
fn xml_parser() {
    let doc = r#"
        <top label="Top">
            <semi-bottom label="Bottom"/>
            <middle>
                <bottom label="Another bottom"/>
            </middle>
        </top>"#;
    let parsed_doc = Element {
        name: "top".to_string(),
        attributes: vec![("label".to_string(), "Top".to_string())],
        children: vec![
            Element {
                name: "semi-bottom".to_string(),
                attributes: vec![("label".to_string(), "Bottom".to_string())],
                children: vec![],
            },
            Element {
                name: "middle".to_string(),
                attributes: vec![],
                children: vec![Element {
                    name: "bottom".to_string(),
                    attributes: vec![("label".to_string(), "Another bottom".to_string())],
                    children: vec![],
                }],
            },
        ],
    };
    assert_eq!(Ok(("", parsed_doc)), element().parse(doc));
}
#[test]
fn mismatched_closing_tag() {
    let doc = r#"
        <top>
            <bottom/>
        </middle>"#;
    assert_eq!(Err("</middle>"), element().parse(doc));
}