October 24th 2020
Kirk Shillingford
Tl;dr This is a short write up on how I used the elm-parser library to solve a coding challenge on exercism.io. It took me a lot more time than I anticipated because there's not many resources on parsing in Elm at the moment so I'm hoping this article can add to the available resources for others. To skip directly to the implementation, click here to see the breakdown of how I did it. To see the full code and tests, click here.
So this weekend I tried the Word-Count problem on exercism as part of a weekly challenge I'm doing with some friends. We've just started, but it's been fun so far as a chance to sharpen my Elm skills, but also to introduce others to the language. I highly recommend it.
The details of the challenge are below, but essentially, it requires extracting some information out of a string of text.
My initial and immediate thought was to use regular expressions since that's the way I've always done pattern matching on strings before. Imagine my surprise though when I looked up the elm-regex library to see the following line as the first thing in the library Readme.md
:
"Generally speaking, it will be easier and nicer to use a parsing library like elm/parser instead of this."
What kind of library tells you to go use another library?
Well, the Elm community is very intent on two things; trying to get things right, but also being practical. And sometimes that means you have to provide a solution while acknowledging that it's not the best solution. The author goes on to explain why he thinks elm-regex shouldn't be your first choice for most things and it was certainly an interesting argument. It's definitely worth a read in the original doc but a crude summary would be:
- In general, regex is difficult to understand, maintain, and test
- The relative ease of use of regex means it is frequently used for applications it was never designed to handle.
- He'd rather the elm ecosystem evolve to have specific parser libraries for handling common validation like passwords, forms, etc.
I think he raises a valid point. Regex is simple, fast, and powerful, but there are disadvantages to using a string of text to validate other strings of text.
Can you guess what this regex does?
/^[a-z0-9_-]{6,18}$/
It's a common pattern for checking for valid passwords. The valiation is described as:
We begin by telling the parser to find the beginning of the string (^), followed by any lowercase letter (a-z), number (0-9), an underscore, or a hyphen. Next, {6,18} makes sure that are at least 6 of those characters, but no more than 18. Finally, we want the end of the string ($).
And that's... fine? But is that obvious from the regex expression? Maybe, if you're familiar. But if you're not familiar with regex, can you tell if you've typed the string correctly? If you are familiar with regex and you make a mistake, would you be able to spot it?
Regex can't tell you if it's wrong or right. So we're left to unit tests as the only real safeguard against error and that can be unwieldy.
So considering all this I thought, why not give elm-parser a try. Surely it can't be too bad.*
*It was actually quite bad 😬, but that's not really the library's fault. The Elm-parser library is pretty reasonable, once you know what you're doing, and I did not. It took me several hours to get a working solution and that involves innumerable backtracks, edits, and reading, re-reading, and re-re-readings of this article by Alex Korban (bless you Alex) and the docs because I was attempting to both learn a library and learn the concept of parsing and solve a specific problem. Maybe not the best idea but here we are.
The rest of this article will be an attempt to explain my methodology had I known what I was doing from the start, because if I tried to take you through my actual path, this would become 50 pages long.
Let's examine the actual program requirements.
Given a phrase, count the occurrences of each word in that phrase.
For the purposes of this exercise you can expect that a word will always be one of:
- A number composed of one or more ASCII digits (ie "0" or "1234") OR
- A simple word composed of one or more ASCII letters (ie "a" or "they") OR
- A contraction of two simple words joined by a single apostrophe (ie "it's" or "they're")
When counting words you can assume the following rules:
- The count is case insensitive (ie "You", "you", and "YOU" are 3 uses of the same word)
- The count is unordered; the tests will ignore how words and counts are ordered
- Other than the apostrophe in a contraction all forms of punctuation are ignored
- The words can be separated by any form of whitespace (ie "\t", "\n", " ")
At the end, we expose a single function WordCount
with the type signature:
WordCount : String -> Dict String Int
which means that the function will accept some string of text and return a dictionary with keys of strings and values of integers, representing our words and the frequency of their appearances respectively. How we implement this is left entirely up to us.
Let's examine the problem. Our program needs to create a function that accepts a string of characters, and returns a Dictionary with the words from the string as keys, and the frequency of their appearances as values.
We can break down our program into two distinct steps:
- Parse the string into a list of valid words
- Transform the list of words into our dictionary
A parser can be defined as "a computer program that breaks down text into recognized strings of characters for further analysis". Parsing can be seen as a alternative control flow to validating. Instead of checking whether data is in a state than can be utilized by your program/process, parsing attempts to transform the incoming data into that valid state; if the data cannot be transformed, the data was invalid and the parsing fails.
This is a powerful technique because it ensures that after the parsing step you either have:
- Data that is exactly the shape you need for your business logic (and you dont need to validate it without your program logic again)
- A respresentation of an incompatible object that your program can deal with before it can cause errors in your internal logic.
Parsers can be broken broken down into three distinct steps/parts:
A tokenizer breaks a stream of text into tokens, usually by looking for whitespace (tabs, spaces, new lines).
A lexer is basically a tokenizer, but it usually attaches extra context to the tokens -- this token is a number, that token is a string literal, this other token is an equality operator.
A parser takes the stream of tokens from the lexer and turns it into an abstract syntax tree representing the (usually) program represented by the original text.
For this example, our main focus is the tokenizing and lexing as we're not actually constructing an abstract syntax tree. We're just directly consuming the list of valid words returned by our lexer.
So like, how do we do that?
type alias Contraction =
Maybe String
type ValidWord
= Numeric String
| Alphabetic String Contraction
First, let's use the type system to define the type of tokens we want to returns from our parser. Accroding to the instructions, we can expect two types of valid words. One is a sequence of all numeric characters, and one is a sequence of all alphabetic characters, which may or may not have a single apostraphe. Our type ValidWord
will represent these two options. Note how the Alphabetic
branch takes two arguments; the first is a string representing a sequence of characters. The second, Contraction
is a Maybe String
. Words with apostrophes will therefore have contractions of Just somestring
while words without will have a contraction of Nothing
.
isIgnorable : Char -> Bool
isIgnorable =
not << Char.isAlphaNum
numericWord : Parser ValidWord
numericWord =
succeed Numeric
|. chompWhile isIgnorable
|= (getChompedString <|
succeed identity
|. chompIf Char.isDigit
|. chompWhile Char.isDigit
)
|. chompWhile isIgnorable
Okay, I know this is a lot. I'm going to try my best to explain what's going on here.
isIgnorable
is just a helper function that returns true if a character isn't alpahbetic or numeric. We're going to use it when we need to check for spaces and random symbols between wordsnumericWord
is our ValidWord Parser. Specifically it tries to create theNumeric
branch of theValidWord
type. It does not handle theAlphabetic
case (we'll have another parser for that).- succeed is the first function we see from the parser library and for now we can just treat it as function that helps us start off a definition of a sequence of parsing steps
- |. is helper function from the parsing library that lets you parse part of the stream without passing back the contents. It's great for when you want to ensure something exists, but you don't want to pass it to your token constructors
- chompWhile does exactly what it sounds like. It will consume a stream of text as long as the condition passed into it is satisfied. Combined with (|.) we're essentially saying "there could be some non alphanumeric characters at the beginning, we don't care about them
- |= is the opposite of
|.
It does pass whatever is the result of the parser that comes after it - getChompedString - typically chompers don't return anything. They just consume and check.
getChompedString
turns the result of whatever you've chomped as aString
. (Note that it took me forever to figure out that this was the best way to get back weirdly defined strings like this. It's not the only way - see Parser.variable - but it is the best way for this particular situation). We passgetChompedString
another subcategory of parsers that is actually what we want to consume to make our Numeric tokenidentity
is a common helper function in functional programming that just returns whatever is passed into it.- chompIf - chomps only the char matches the following function exactly. In this case we want to chomp only if the character is a digit.
- We follow it up with a
chompWhile
to say "after identifying at least one digit, you can a sequence of as many more digits as you like
- Finally, we add one more check for non alphanumeric characters.
- Altogether we can read this Parser as:
"Hey, check if there are some nonalphanumeric characters at the start of this thing. I don't care how many. Then, you should see at least one digit. Grab that sequence of digits. Then you should see some more nonalphanumeric characters. The End" ❤️
That was a lot! But essentially now we have a Parser than can handle numeric valid words.
apostrophe : Char
apostrophe =
Char.fromCode 39
contraction : Parser (Maybe String)
contraction =
(getChompedString <|
succeed identity
|. chompIf ((==) apostrophe)
|. chompIf Char.isAlpha
|. chompWhile Char.isAlpha
)
|> Parser.map Just
alphabeticWord : Parser ValidWord
alphabeticWord =
succeed Alphabetic
|. chompWhile isIgnorable
|= (getChompedString <|
succeed identity
|. chompIf Char.isAlphaNum
|. chompWhile Char.isAlpha
)
|= oneOf [ backtrackable contraction, succeed Nothing ]
|. chompWhile isIgnorable
"Oh no, it's even more complicated than the numberic one." If you're thinking that, you are right. But the basic technique is the same. The main difference is, Alphabetic ValidWords accept two pieces of data; a first string for alphabetic text before an apostrophe, and a second Maybe String
that represents that second piece of a conjuction. Let's go over the things here that are different from the Numeric version.
apostrophe
- First, we can see we've defined a constant called apostrophe, which represents the apostrople character. I did this because Elm represents characters using apostophes/single quotes. So"n"
is theString
n but'n'
is theChar
n. That's typically fine, but in the case of the apostrophe itself we see that it'sChar
representation would have to be'''
which the compiler didn't seem to like. 😬 So we used the code point reference number from the unicode library to create an apostrophe character. There may be other ways to do this, but this seemed clear and reasonable at the time.contraction
is the Parser for the contraction part of our valid word. Note that the first thing it checks is if the first character is an apostrophe.- Then we check if the apostrophe is followed by an alphabetic character. This means that trailing apostrophes are rejected.
- Parser.map is a helper function with the type signature
Parser.map : (a -> b) -> Parser a -> Parser b
. It allows us to tranform values inside a parser and return a new Parser of the transformed value. In this case, we're wrapping theString
fromgetChompedString
into aMaybe
value. - Altogether we can interpret contraction as: "Hey, check if there's an apostrophe, followed by at least one alphabetic character. Wrap this in a Maybe."
alphabeticWord
is the main function for producing a parser forAlphabetic
branch of theValidWord
type. Its structure is similar tonumericWord
, checking for some whitespace, then looking for a sequence of alphabetic characters- oneOf is a function we haven't seen before.
oneOf
applies a list of parsers to a sequence, trying them one by one until one is accepted.- backtrackable is a helper function in the parser library that says that if a parser fails, jump back to where it began for the next part of the parser.
- altogther, oneOf tries our contraction parser and if it fails (because no contraction exists), it returns a Parser of the other case of our maybe
Nothing
.
- Note that alphabetic word has two
|=
operators, each one taking an input forAlphabetic
constructor. We can read the entire parser as: "Hey, check if there's some nonalphanumeric characters. Then, grab a sequence of at least one alphabetic charcter. Then if you see a sequence of an apostrophe followed by more alphabetic characters, returnJust
that, otherwise, returnNothing
."
- oneOf is a function we haven't seen before.
To be honest, I feel like it's hard to explain these concepts if you're not starting from first principles, so I can only hope that the examples plus the write up have enough context to help people along.
Before we tie the knot on our Parsers, lets jump over for a second to talk about converting a List of Valid Words to Dictionary of String keys and Integers
validWordToString : ValidWord -> String
validWordToString vw =
case vw of
Numeric word ->
word
Alphabetic word contraction_ ->
contraction_
|> Maybe.withDefault ""
|> String.append word
dictUpdate : String -> Dict String Int -> Dict String Int
dictUpdate key dict =
case Dict.get key dict of
Just count ->
Dict.insert key (count + 1) dict
Nothing ->
Dict.insert key 1 dict
To do this we have to define two helper functions, validWordToString
and dictUpdate
.
validWordToString
does exactly what it says, and can convert any instance of ValidWord
into a simple String
. We use case statements to determine which type of ValidWord
we're dealing with. Note that while the Numeric
case is fairly straightforward pattern decomposition, the Alphabetic
case has that pesky contraction, which is a Maybe String
. We use Maybe.withDefault here to provide a default value in case the contraction is Nothing
(Our default is the empty string ""
) and we use String.append to attach it to the first part of the string.
dictUpdate
is function that takes a key and a Dictionary and does the following:
- if the key already exist in the dictionary, replace the value at that key with itself plus one more
- if the key does not exist, insert the key into the dictionary with a value of 1.
These two functions will be combined with the last few functions we have to put this all together.
validWordHelper : List ValidWord -> Parser (Step (List ValidWord) (Dict String Int))
validWordHelper revValidWords =
let
continueLooping : ValidWord -> Step (List ValidWord) (Dict String Int)
continueLooping vw =
Loop (vw :: revValidWords)
stopLooping : a -> Step (List ValidWord) (Dict String Int)
stopLooping _ =
revValidWords
|> List.map (validWordToString >> String.toLower)
|> List.foldl dictUpdate Dict.empty
|> Done
in
oneOf
[ succeed continueLooping
|= oneOf [ backtrackable alphabeticWord, numericWord ]
, succeed () |> Parser.map stopLooping
]
validWord : Parser (Dict String Int)
validWord =
loop [] validWordHelper
We're almost at the finish line of our parser. We've created two parsers than can handle either case of our valid tokens, numericWord
and alphabeticWord
, but they both only produce one token. We still need to define a way to turn a string into a sequence of tokens.
Enter the Parser.loop function. This is by far the gnarliest part of this application.
loop
has the type signature
loop : state -> (state -> Parser (Step state a)) -> Parser a
If you find this signature a little intimidating, don't worry about it. I literally only understood it while writing this document. 😅 It also helps if you're used to folding functions for sequences.
What can help is if you notice that the loop
function references another Parser library type we haven't seen yet, Step.
Step is adefined by:
type Step state a
= Loop state
| Done a
Essentially, loop
lets you traverse a sequence, applying a parser(s) and updating some "state" every time the parser succeeds. Once the parser succeeds, the Loop
branch of the Step type defines what to do next, and the Done
branch defines what to do one there are no more tokens to be captured.
In our case, our "state" that we're updating is a List ValidWords
and our "a", the shape of the data once the loop terminates, is the Dict String Int
we want as our final product.
As we chomp through our string, we bring along this constantly updating state and then finally when we stopLooping
, we use the validWordToString
and dictUpdate
function we defined earlier to transform our
List ValidWord
state to a List String
and then we "fold" that list into a dictionary using our dictUpdate
function to handle adding each string in our list to the dictionary correctly.
If this is still confusing to you, that's okay. It's defintely not the most intuitive thing if you haven't worked with functional parsers before. I wrote this article so people could explore this code at their leisure, while using the docs and other resources for reference.
Last but not least we define the actual function that defines our API wordCount
.
wordCount : String -> Dict String Int
wordCount sentence =
case Parser.run validWord sentence of
Ok parsed ->
parsed
Err _ ->
Dict.empty
Up until now we've just been constructing Parsers.
Parser.run is the function that can actually take a Parser a
and return a value. To be more specific, Parser.run
has the signature:
run : Parser a -> String -> Result (List DeadEnd) a
Here we see the another infamous functional programming type, the Result type!
Result
has the signature
type Result error value
= Ok value
| Err error
Like Maybe
, Result
is used to encapsulate data from operations that may succeed or fail. Unlike Maybe
which simply returns Nothing
, Result
has an Err a
case where that a
can be any data type we want, which you can use to hold extra context for your error.
In the can of the run
function, it's associated error type is a List DeadEnd
, while is basically a list of all the ways your parser failed.
All this to say, when we run our parser in wordCount
we have to handle both cases of Result
, the case where it succeeds (in which case we return the successfully parsed Dictionary), and the case where it fails (we chose to ignore the error type and just return an empty Dictionary)
- We needed to create a function that accepted a string and converted it to a dictionary of key words and the number of times they occur
- valid words could be either a sequence of numbers, letters, or letters with an apostrophe embedded in them - We created a "token" type
ValidWords
to represent the two ways of producing a valid word - We defined a parser for each case of
ValidWord
, designed to chomp if from a string of text - We defined a loop that would keep trying to chomp a
ValidWord
from our string of text - We defined some helper functions to convert between or major types,
ValidWord
,String
,List
, andDict
I spent way too much time on this project but I really enjoyed it. Parsing is a foundational technique of software development and the elm-parser library is actually extremely robust. I could spend a lot of time tweaking this solution for better performance, or more strict enforcement of the type rules, but I think I can leave it in this space for now. I won't presume to think this is the perfect article. But perfect is the enemy of the good.
- Here's a gist of the full solution, and the corresponding tests
- Here's a link to an online Elm editor where I've made a very very crude interface for trying out some of the test samples.
If you can think of anything I could do to improve this article, please reach out to me at tkshillinz@icloud.com. I'm also in the Elm
and F#
slack groups as @Kirk Shillingford
Thank you for your time. 🙏🏾