A standalone functional parser for CommonMark
CommonMark is a standard, unambiguous syntax specification for Markdown, which is being developed by a committee of Markdown hackers; its goal is to obviate some of the technical shortcomings of the original Markdown specs and remedy the painful lack of Markdown standardisation. This summer, I intend to write a standalone, extensible, pure-Haskell library for parsing CommonMark.
Relevant ticket and discussion
Markdown is a markup language with a simple, plain-text formatting syntax, primarily meant for generating HTML. Since its inception in 2004, it has enjoyed tremendous success and has become ubiquitous on the Web (Stack Exchange, GitHub wikis, static blogs, etc.).
Unfortunately, syntax ambiguities in the original specification have led to numerous divergent Markdown implementations. Because of the variety of different Markdown flavours floating around, content authors cannot expect their Markdown source to reliably translate to the same output everywhere (see this example). In short, Markdown has turned into the Web's Tower of Babel. Moreover, syntactic infelicities can make Markdown difficult to parse (e.g. by requiring backtracking).
But take heart! To remedy the situation, a standardisation committee of high-profile Markdown users/hackers has produced CommonMark, an updated Markdown specification, which, as a result of its unambiguous and friendlier syntax, is much easier to parse than previous incarnations of Markdown are.
In the framework of this year's GSoC, I intend to write a standalone, extensible, pure-Haskell library for parsing CommonMark.
Although a fine-grained design of my CommonMark parser remains to be worked out, my approach would involve a monadic (or, possibly, applicative) parser combinator, and would rely on the parsing strategy used in the CommonMark reference implementations. Whether an existing library or a custom parser combinator would be used for the task is still unknown at this early stage.
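To give a feel for the parser-combinator style mentioned above, here is a minimal sketch written from scratch on top of `base` only. It is not taken from any existing library or from the CommonMark reference implementations: the `Parser` type, the `satisfy` and `char` helpers, the `Inline` type and `parseInlines` are all hypothetical names chosen for illustration, and the grammar covers only plain text and `*emphasis*`, which is a tiny fraction of what CommonMark requires.

```haskell
import Control.Applicative (Alternative (..))

-- A parser consumes a prefix of the input and either fails (Nothing)
-- or returns a result together with the remaining input.
newtype Parser a = Parser { runParser :: String -> Maybe (a, String) }

instance Functor Parser where
  fmap f (Parser p) = Parser $ \s -> case p s of
    Nothing        -> Nothing
    Just (a, rest) -> Just (f a, rest)

instance Applicative Parser where
  pure a = Parser $ \s -> Just (a, s)
  Parser pf <*> Parser pa = Parser $ \s -> case pf s of
    Nothing        -> Nothing
    Just (f, rest) -> case pa rest of
      Nothing         -> Nothing
      Just (a, rest') -> Just (f a, rest')

instance Monad Parser where
  Parser p >>= f = Parser $ \s -> case p s of
    Nothing        -> Nothing
    Just (a, rest) -> runParser (f a) rest

-- Choice: try the left parser; if it fails, run the right one
-- on the same, untouched input.
instance Alternative Parser where
  empty = Parser (const Nothing)
  Parser p <|> Parser q = Parser $ \s -> case p s of
    Nothing -> q s
    result  -> result

-- Accept one character satisfying the predicate.
satisfy :: (Char -> Bool) -> Parser Char
satisfy ok = Parser $ \s -> case s of
  (c : cs) | ok c -> Just (c, cs)
  _               -> Nothing

char :: Char -> Parser Char
char c = satisfy (== c)

-- Hypothetical inline AST, far smaller than what CommonMark needs.
data Inline = Str String | Emph [Inline]
  deriving (Eq, Show)

-- Applicative style: a non-empty run of characters other than '*'.
str :: Parser Inline
str = Str <$> some (satisfy (/= '*'))

-- Monadic style: '*', some text, then a closing '*'.
emph :: Parser Inline
emph = do
  _    <- char '*'
  body <- some str
  _    <- char '*'
  pure (Emph body)

-- Parse a whole line; fail if any input is left over.
parseInlines :: String -> Maybe [Inline]
parseInlines input = case runParser (many (emph <|> str)) input of
  Just (inlines, "") -> Just inlines
  _                  -> Nothing
```

With these definitions, `parseInlines "hello *world*!"` evaluates to `Just [Str "hello ",Emph [Str "world"],Str "!"]`. An unclosed delimiter such as `"*unclosed"` makes this toy parser return `Nothing`, whereas CommonMark would treat the asterisk as literal text; handling such cases properly is exactly the kind of design question the real library would have to settle.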
In any case, I wouldn't be starting from scratch. Despite its limitations, the Markdown reader currently used in Pandoc (the popular document converter and one of Haskell's flagship projects) would be a good starting point; so would John MacFarlane's cheapskate, a parsing library for (pre-CommonMark) Markdown. Besides, John reports that he has [...]
(John MacFarlane, Google-Group discussion, February 20, 2015)
A standalone, strictly CommonMark-conformant parser, written as a pure Haskell library, with a focus on simplicity, extensibility, and performance, to be released under a very liberal license on Hackage
Optional (time permitting)
- Integration into Pandoc
- Integration into popular Haskell web frameworks (Yesod, Snap, etc.)
Beyond GSoC 2015
- Recommendations (in the form of a paper and/or as a series of blogposts) about an alternative Haddock syntax based on CommonMark
- Integration into Haddock, the Haskell documentation tool
- A prototypical Haskell static blogging framework (a la Octopress or yst) using CommonMark
Benefits to the Haskell community
This project is expected to benefit the wider Haskell community in many ways. Two, in particular, come to mind.
Servicing open-source Haskell-based tools
If the number of followers of the CommonMark project on GitHub is any indication, CommonMark is likely to be widely adopted as the new Markdown lingua franca in the wake of its 1.0 release. A performant, CommonMark-conformant, pure-Haskell parsing library will then be in high demand.
The current Markdown parser used by Pandoc needs a rewrite (see item 8 in this task list); according to John MacFarlane, the CommonMark syntax is such that a well-crafted CommonMark parser could bring a tenfold speedup over the current Pandoc Markdown reader.
However, the CommonMark parser resulting from this GSoC project wouldn't just benefit Pandoc. Instead of being tied to the latter, it would be released as a standalone library, under a liberal license (yet to be determined), which would allow for its integration into a variety of tools, including Haskell web frameworks (Yesod, Snap, etc.).
Paving the way for a CommonMark syntax for Haddock
The two syntaxes currently understood by Haddock, the Haskell documentation tool, suffer from several limitations and can prove distracting to the human reader. As a result, a growing number of increasingly frustrated Haskellers have been clamouring for an alternative, easier, more expressive Haddock syntax, possibly based on Markdown:
Haddock's current markup language leaves something to be desired once you want to write more serious documentation (e.g. several paragraphs of introductory text at the top of the module doc). Several features are lacking (bold text, links that render as text instead of URLs, inline HTML).
I suggest that we implement an alternative haddock syntax that's a superset of Markdown. [...]
(Johan Tibell, Haskell Cafe, April 4, 2013)
I don't particularly care if the format we get is markdown.
I simply want some way to express the formatting. That could be embedded html fragments in callouts, embedded markdown under some markdown block incantation, pretty much anything!
Right now I can't put a labeled hyperlink to a document in my documentation. I can't do any formatting whatsoever and it is quite frankly rather pathetic.
Give me anything that lets me put attributes on images, do an embedded callout to a nicer markup, or even just format a link as something other than its naked text and I'll use it.
(Edward Kmett, Haskell subreddit, August 31, 2013)
The strongest objection to the addition of such a syntax has been the lack of Markdown standardisation. However, the advent of CommonMark offers a unique opportunity for Haddock to take the leap, while adopting as principled a stance towards Markdown support as Haskell's stance towards side effects.
In the longer term, this GSoC project is expected to pave the way for the addition of a CommonMark syntax to Haddock. Moreover, although the release of CommonMark 1.0 is on the horizon, the specs are still in a state of flux; if any syntactic infelicity (in using CommonMark for literate Haskell) were identified over the course of this GSoC project, the opportunity would be there to effect changes in the CommonMark specs so as to mitigate or eliminate it.
April: Getting up to speed
- Deepen my understanding of functional-parsing theory (literature) and practice (tools and their implementation)
- Get intimately familiar with the CommonMark specification
- Study existing approaches to parsing (pre-CommonMark) Markdown
April 27th: Happy days!
- Celebrate acceptance to GSoC 2015
April 28th - May 28th: "On your Mark"(down)
- Bond with mentor
- Agree on a library name and a licensing model
- Set up project's GitHub repo
- Investigation, design, prototyping
May 29th - May 31st: Attend ZuriHac2015
- Soak in the wisdom of high-profile Haskellers
- Get my hands dirty with GHC / library development
- Network with fellow Haskellers
June 1st - June 5th: Design validation
- Validate prospective implementation with mentor
June 6th - July 3rd: Implementation
July 4th - July 25th: Testing and benchmarking
- Make sure the parser passes the CommonMark validation test suite
- Run benchmarks against CommonMark parsers written in other languages
- Tune performance
July 26th - August 17th: Dotting the i's and crossing the t's
- Write documentation
- Prepare submission on Hackage
I'm an electrical-engineering PhD student at University College Cork, Ireland, where I also teach a course on numerical methods. Although I don't have a Computer-Science background per se, I have, purely out of interest, taken multiple CS courses, including compilers, algorithms, and functional programming (in which I got top marks).
In the process, I've developed a keen interest in compilation, and I got such a kick out of learning Haskell that I'm now determined to get a job in which I build cool things with this wonderful language. I keep a close eye on Stack-Overflow questions about Haskell, and I've recently started contributing answers in that tag.
Finally, despite having limited open-source experience, I have an intimate knowledge of Git: I am the second most active user in the Git tag on Stack Overflow (at the time of writing this proposal), I push to GitHub on a daily basis, and I've also organised a couple of Git tutorials for undergraduate students.
Name: Julien Cretel
Phone: +353 (0)85 818 0441
Stack Overflow: http://stackoverflow.com/users/2541573/jubobs