@jubobs
Last active August 29, 2015 14:17
Haskell GSoC 2015 (draft) proposal - A standalone functional parser for CommonMark

A standalone functional parser for CommonMark

Abstract

CommonMark is a standard, unambiguous syntax specification for Markdown, which is being developed by a committee of Markdown hackers; its goal is to obviate some of the technical shortcomings of the original Markdown specs and remedy the painful lack of Markdown standardisation. This summer, I intend to write a standalone, extensible, pure-Haskell library for parsing CommonMark.

Prospective mentor

John MacFarlane, author of Pandoc and member of the CommonMark committee

Relevant ticket and discussion

Introduction

Markdown is a markup language with a simple, plain-text formatting syntax, primarily meant for generating HTML. Since its inception in 2004, it has enjoyed tremendous success and has become ubiquitous on the Web (Stack Exchange, GitHub wikis, static blogs, etc.).

Unfortunately, syntax ambiguities in the original specification have led to numerous divergent Markdown implementations. Because of the variety of Markdown flavours floating around, content authors cannot expect their Markdown source to reliably translate to the same output everywhere (see this example). In short, Markdown has turned into the Web's Tower of Babel. Moreover, syntactic infelicities can make Markdown difficult to parse (e.g. by requiring backtracking).

But take heart! To remedy the situation, a standardisation committee of high-profile Markdown users/hackers has produced CommonMark, an updated Markdown specification, which, as a result of its unambiguous and friendlier syntax, is much easier to parse than previous incarnations of Markdown are.

Shortly after this new standard was unveiled, CommonMark parsers written in languages such as C, JavaScript, PHP, and Python started to appear, but one written in Haskell has so far remained elusive.

In the framework of this year's GSoC, I intend to write a standalone, extensible, pure-Haskell library for parsing CommonMark.

Outline

Although a fine-grained design of my CommonMark parser remains to be worked out, my approach would involve a monadic (or, possibly, applicative) parser-combinator library, and would rely on the parsing strategy used in the CommonMark reference implementations (commonmark.js and cmark). Whether I would use an existing library (parsec, attoparsec, trifecta, etc.) or write custom parser combinators is still undecided at this early stage.
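To make the combinator idea concrete, here is a minimal sketch of an applicative parser built on base alone, applied to a deliberately simplified ATX heading. It is an illustration only, not the design of the proposed library: the names `Parser`, `satisfy`, `char`, `Heading`, `atxHeading`, and `parseHeading` are hypothetical, and the heading rule ignores parts of the CommonMark specification (leading indentation, closing `#` sequences, tabs, headings with no space after the `#` run).

```haskell
import Control.Applicative (Alternative (..))

-- A parser consumes a prefix of the input and, on success, returns a value
-- together with the remaining input. (Hypothetical type, for illustration.)
newtype Parser a = Parser { runParser :: String -> Maybe (a, String) }

instance Functor Parser where
  fmap f (Parser p) = Parser $ \s -> case p s of
    Nothing        -> Nothing
    Just (a, rest) -> Just (f a, rest)

instance Applicative Parser where
  pure a = Parser $ \s -> Just (a, s)
  Parser pf <*> Parser pa = Parser $ \s -> case pf s of
    Nothing        -> Nothing
    Just (f, rest) -> case pa rest of
      Nothing         -> Nothing
      Just (a, rest') -> Just (f a, rest')

instance Alternative Parser where
  empty = Parser (const Nothing)
  -- Try the left parser; fall back to the right one only if it fails.
  Parser p <|> Parser q = Parser $ \s -> case p s of
    Nothing -> q s
    result  -> result

-- Accept one character that satisfies the predicate.
satisfy :: (Char -> Bool) -> Parser Char
satisfy ok = Parser $ \s -> case s of
  (c : cs) | ok c -> Just (c, cs)
  _               -> Nothing

char :: Char -> Parser Char
char = satisfy . (==)

-- Hypothetical result type: heading level and raw heading text.
data Heading = Heading Int String
  deriving (Eq, Show)

-- Simplified ATX heading: a run of '#', one or more spaces, rest of the line.
atxHeading :: Parser Heading
atxHeading =
  mk <$> some (char '#') <* some (char ' ') <*> many (satisfy (/= '\n'))
  where
    mk hashes = Heading (length hashes)

-- Parse a single line; reject leftover input and levels above 6.
parseHeading :: String -> Maybe Heading
parseHeading line = case runParser atxHeading line of
  Just (h@(Heading level _), "") | level <= 6 -> Just h
  _                                           -> Nothing
```

For instance, `parseHeading "## Hello"` yields `Just (Heading 2 "Hello")`, whereas `parseHeading "#Hello"` yields `Nothing` because the sketch requires a space after the `#` run. Only the `Applicative` and `Alternative` interfaces are used here; a `Monad` instance would be needed as soon as later parsing decisions depend on earlier results.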

In any case, I wouldn't be starting from scratch. Despite its limitations, the Markdown reader currently used in Pandoc (the popular document converter and one of Haskell's flagship projects) would be a good starting point; so would John MacFarlane's cheapskate, a parsing library for (pre-CommonMark) Markdown. Besides, John reports that he has

[...] already developed algorithms for parsing CommonMark efficiently, without backtracking. They are so much more efficient than what pandoc currently does that even the JavaScript implementation of commonmark is 3-4 times faster than pandoc, and the C implementation is 30-40 times faster.

(John MacFarlane, Google-Group discussion, February 20, 2015)

Deliverables

Required

A standalone, strictly CommonMark-conformant parser, written as a pure Haskell library, with a focus on simplicity, extensibility, and performance, to be released under a very liberal license on Hackage

Optional (time permitting)

  • Integration into Pandoc
  • Integration into popular Haskell web frameworks (Yesod, Snap, etc.)

Beyond GSoC 2015

  • Recommendations (in the form of a paper and/or as a series of blogposts) about an alternative Haddock syntax based on CommonMark
  • Integration into Haddock, the Haskell documentation tool
  • A prototypical Haskell static blogging framework (a la Octopress or yst) using CommonMark

Benefits to the Haskell community

This project is expected to benefit the wider Haskell community in many ways. Two, in particular, come to mind.

Servicing open-source Haskell-based tools

If the number of followers of the CommonMark project on GitHub is any indication, CommonMark is likely to be widely adopted as the new Markdown lingua franca in the wake of its 1.0 release. A performant, CommonMark-conformant, pure-Haskell parsing library will then be in high demand.

The current Markdown parser used by Pandoc needs a rewrite (see item 8 in this task list); according to John MacFarlane, the CommonMark syntax is such that a well-crafted CommonMark parser could bring a tenfold speedup over the current Pandoc Markdown reader.

However, the CommonMark parser resulting from this GSoC project wouldn't just benefit Pandoc. Instead of being tied to the latter, it would be released as a standalone library, under a liberal license (yet to be determined), which would allow for its integration into a variety of tools, including Haskell web frameworks (Yesod, Snap, etc.).

Paving the way for a CommonMark syntax for Haddock

The two syntaxes currently understood by Haddock, the Haskell documentation tool, suffer from several limitations and can prove distracting to the human reader. As a result, a growing number of increasingly frustrated Haskellers have been clamouring for an alternative, easier, more expressive Haddock syntax, possibly based on Markdown:

Haddock's current markup language leaves something to be desired once you want to write more serious documentation (e.g. several paragraphs of introductory text at the top of the module doc). Several features are lacking (bold text, links that render as text instead of URLs, inline HTML).

I suggest that we implement an alternative haddock syntax that's a superset of Markdown. [...]

(Johan Tibell, Haskell Cafe, April 4, 2013)

I don't particularly care if the format we get is markdown.

I simply want some way to express the formatting. That could be embedded html fragments in callouts, embedded markdown under some markdown block incantation, pretty much anything!

Right now I can't put a labeled hyperlink to a document in my documentation. I can't do any formatting whatsoever and it is quite frankly rather pathetic.

Give me anything that lets me put attributes on images, do an embedded callout to a nicer markup, or even just format a link as something other than its naked text and I'll use it.

(Edward Kmett, Haskell subreddit, August 31, 2013)

The strongest objection to the addition of such a syntax has been the lack of Markdown standardisation. However, the advent of CommonMark offers a unique opportunity for Haddock to take the leap, while adopting as principled a stance towards Markdown support as Haskell's stance towards side effects.

In the longer term, this GSoC project is expected to pave the way for the addition of a CommonMark syntax to Haddock. Moreover, although the release of CommonMark 1.0 is on the horizon, the specs are still in a state of flux; if any syntactic infelicity (in using CommonMark for literate Haskell) were identified over the course of this GSoC project, the opportunity would be there to effect changes in the CommonMark specs so as to mitigate / obliterate it.

Roadmap

April: Getting up to speed

  • Deepen my understanding of functional-parsing theory (literature) and practice (tools and their implementation)
  • Get intimately familiar with the CommonMark specification
  • Study existing approaches to parsing (pre-CommonMark) Markdown

April 27th: Happy days!

  • Celebrate acceptance to GSoC 2015 :p

April 28th - May 28th: "On your Mark"(down)

  • Bond with mentor
  • Agree on a library name and a licensing model
  • Set up project's GitHub repo
  • Investigation, design, prototyping

May 29th - May 31st: Attend ZuriHac2015

  • Soak in the wisdom of high-profile Haskellers
  • Get my hands dirty with GHC / library development
  • Network with fellow Haskellers

June 1st - June 5th: Design validation

  • Validate prospective implementation with mentor

June 6th - July 3rd: Implementation

July 4th - July 25th: Testing and benchmarking

  • Make sure the parser passes the CommonMark validation test suite
  • Run benchmarks against CommonMark parsers written in other languages
  • Tune performance

July 26th - August 17th: Dotting the i's and crossing the t's

  • Write documentation
  • Prepare submission on Hackage

About me

Profile

I'm an electrical-engineering PhD student at University College Cork, Ireland, where I also teach a course on numerical methods. Although I don't have a Computer-Science background per se, I have, purely out of interest, taken multiple CS courses, including compilers, algorithms, and functional programming (in which I got top marks).

In the process, I've developed a keen interest in compilation, and I got such a kick out of learning Haskell that I'm now determined to get a job in which I build cool things with this wonderful language. I keep a close eye on Stack-Overflow questions about Haskell, and I've recently started contributing answers in that tag.

I am also involved in the TeX world: I've contributed a couple of LaTeX packages to CTAN, and I'm one of the top users of TeX.SE, where I remain particularly active in the listings tag.

Finally, despite having limited open-source experience, I have an intimate knowledge of Git: I am the second most active user in the Git tag on Stack Overflow (at the time of writing this proposal), I push to GitHub on a daily basis, and I've also organised a couple of Git tutorials for undergraduate students.

Contact information

Name: Julien Cretel

Phone: +353 (0)85 818 0441

Email: j.cretel@umail.ucc.ie

GitHub: https://github.com/jubobs

Stack Overflow: http://stackoverflow.com/users/2541573/jubobs

Twitter: https://twitter.com/_jubobs_
