A standalone functional parser for CommonMark

Abstract

CommonMark is a standard, unambiguous syntax specification for Markdown, which is being developed by
a committee of Markdown hackers; its goal is to address some of the technical shortcomings of the original Markdown specs
and remedy the painful lack of Markdown standardisation. This summer, I intend to write a standalone, extensible,
pure-Haskell library for parsing CommonMark.
Prospective mentor

John MacFarlane, author of Pandoc
and member of the CommonMark committee
Relevant ticket and discussion


https://ghc.haskell.org/trac/summer-of-code/ticket/1660
https://groups.google.com/forum/#!topic/pandoc-discuss/xZrf-dL0ZPs

Introduction

Markdown is a markup language with a simple, plain-text formatting syntax,
primarily meant for generating HTML. Since its inception in 2004, it has enjoyed tremendous success and has become
ubiquitous on the Web (Stack Exchange, GitHub wikis, static blogs, etc.).
Unfortunately, syntax ambiguities in the original specification have led to numerous divergent Markdown implementations.
Because of the variety of different Markdown flavours floating around, content authors cannot expect their Markdown source
to reliably translate to the same output everywhere (see this example).
In short, Markdown has turned into the Web's Tower of Babel.
Moreover, syntactic infelicities can make Markdown difficult to parse (e.g. by requiring backtracking).
But take heart! To remedy the situation, a standardisation committee of high-profile Markdown users/hackers has produced
CommonMark, an updated Markdown specification, which, as a result of its unambiguous and
friendlier syntax, is much easier to parse than previous incarnations of Markdown are.
Shortly after this new standard was unveiled, CommonMark parsers written in languages such as
C, JavaScript, PHP, and Python started to appear, but one written in Haskell has so far remained elusive.
In the framework of this year's GSoC, I intend to write a standalone, extensible,
pure-Haskell library for parsing CommonMark.
Outline

Although a fine-grained design of my CommonMark parser remains to be worked out, my approach would involve a monadic
(or, possibly, applicative) parser combinator, and would rely on the parsing strategy used in the CommonMark reference
implementations (commonmark.js and cmark).
Whether an existing library (parsec, attoparsec, trifecta, etc.) or a custom parser combinator
would be used for the task is still undecided at this early stage.
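
As a rough illustration of the monadic parser-combinator style mentioned above (and not the design of the eventual library, which remains to be worked out), here is a minimal hand-rolled sketch that uses only `base`. The names `Parser`, `satisfy`, `char`, `heading`, `paragraph`, `parseLine`, and the `Block` type are all hypothetical. The heading rule is a deliberate simplification of CommonMark's ATX headings: it ignores closing `#` sequences, empty headings, and leading indentation. Note also that `<|>` backtracks here, which the strategy of the reference implementations avoids.

```haskell
import Control.Applicative (Alternative (..))
import Control.Monad (guard)

-- A parser consumes a prefix of the input and, on success,
-- returns a value together with the unconsumed remainder.
newtype Parser a = Parser { runParser :: String -> Maybe (a, String) }

instance Functor Parser where
  fmap f (Parser p) = Parser $ \s -> case p s of
    Nothing        -> Nothing
    Just (a, rest) -> Just (f a, rest)

instance Applicative Parser where
  pure a = Parser $ \s -> Just (a, s)
  Parser pf <*> Parser pa = Parser $ \s -> case pf s of
    Nothing        -> Nothing
    Just (f, rest) -> case pa rest of
      Nothing         -> Nothing
      Just (a, rest') -> Just (f a, rest')

instance Monad Parser where
  Parser p >>= f = Parser $ \s -> case p s of
    Nothing        -> Nothing
    Just (a, rest) -> runParser (f a) rest

instance Alternative Parser where
  empty = Parser (const Nothing)
  -- Try the left parser; on failure, rerun the right one on the same input.
  Parser p <|> Parser q = Parser $ \s -> case p s of
    Nothing -> q s
    result  -> result

-- Accept one character that satisfies the predicate.
satisfy :: (Char -> Bool) -> Parser Char
satisfy ok = Parser $ \s -> case s of
  (c:cs) | ok c -> Just (c, cs)
  _             -> Nothing

char :: Char -> Parser Char
char = satisfy . (==)

-- Hypothetical, heavily reduced block-level AST.
data Block = Heading Int String | Paragraph String
  deriving (Eq, Show)

-- Simplified ATX heading: one to six '#', at least one space, then text.
heading :: Parser Block
heading = do
  hashes <- some (char '#')
  guard (length hashes <= 6)
  _      <- some (char ' ')
  text   <- many (satisfy (/= '\n'))
  pure (Heading (length hashes) text)

-- Any non-empty run of characters up to the end of the line.
paragraph :: Parser Block
paragraph = Paragraph <$> some (satisfy (/= '\n'))

-- Parse a single line, requiring that all of it be consumed.
parseLine :: String -> Maybe Block
parseLine s = case runParser (heading <|> paragraph) s of
  Just (b, "") -> Just b
  _            -> Nothing
```

In this sketch, `parseLine "# Hello"` yields `Just (Heading 1 "Hello")`, whereas `parseLine "#NoSpace"` falls through to `Just (Paragraph "#NoSpace")`, because a heading marker must be followed by a space. An existing library such as parsec or attoparsec would provide the same combinators with better error reporting and performance.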
In any case, I wouldn't be starting from scratch. Despite its limitations, the Markdown reader currently used
in Pandoc (the popular document converter and one of Haskell's flagship projects) would be a good starting point;
so would John MacFarlane's cheapskate, a parsing library for
(pre-CommonMark) Markdown. Besides, John reports that he has

[...] already developed algorithms for parsing CommonMark efficiently,
without backtracking. They are so much more efficient than what pandoc
currently does that even the JavaScript implementation of commonmark is
3-4 times faster than pandoc, and the C implementation is 30-40 times
faster.

(John MacFarlane, Google-Group discussion,
February 20, 2015)
Deliverables

Required

A standalone, strictly CommonMark-conformant parser, written as a pure Haskell library, with a focus on
simplicity, extensibility, and performance, to be released under a very liberal license on
Hackage
Optional (time permitting)


Integration into Pandoc
Integration into popular Haskell web frameworks (Yesod, Snap, etc.)

Beyond GSoC 2015


Recommendations (in the form of a paper and/or as a series of blog posts)
about an alternative Haddock syntax based on CommonMark
Integration into Haddock, the Haskell documentation tool
A prototypical Haskell static blogging framework (a la Octopress or
yst) using CommonMark

Benefits to the Haskell community

This project is expected to benefit the wider Haskell community in many ways.
Two, in particular, come to mind.
Servicing open-source Haskell-based tools

If the number of followers of the CommonMark project on GitHub is any indication,
CommonMark is likely to be widely adopted as the new Markdown lingua franca in the wake of its 1.0 release.
A performant, CommonMark-conformant, pure-Haskell parsing library will then be in high demand.
The current Markdown parser used by Pandoc needs a rewrite (see item 8 in this task list);
according to John MacFarlane, the CommonMark syntax is such that a well crafted CommonMark parser could
bring a tenfold speedup
over the current Pandoc Markdown reader.
However, the CommonMark parser resulting from this GSoC project wouldn't just benefit Pandoc.
Instead of being tied to the latter, it would be released as a standalone library, under a liberal license
(yet to be determined),
which would allow for its integration into a variety of tools, including Haskell web frameworks (Yesod, Snap, etc.).
Paving the way for a CommonMark syntax for Haddock

The two syntaxes currently understood by Haddock, the Haskell documentation tool,
suffer from several limitations and can prove distracting to the human reader.
As a result, a growing number of increasingly frustrated Haskellers have been clamouring for an alternate, easier,
more expressive Haddock syntax, possibly based on Markdown:

Haddock's current markup language leaves something to be desired once
you want to write more serious documentation (e.g. several paragraphs
of introductory text at the top of the module doc). Several features
are lacking (bold text, links that render as text instead of URLs,
inline HTML).


I suggest that we implement an alternative haddock syntax that's a
superset of Markdown. [...]

(Johan Tibell, Haskell Cafe, April 4, 2013)

I don't particularly care if the format we get is markdown.


I simply want some way to express the formatting. That could be embedded html fragments in callouts,
embedded markdown under some markdown block incantation, pretty much anything!


Right now I can't put a labeled hyperlink to a document in my documentation.
I can't do any formatting whatsoever and it is quite frankly rather pathetic.


Give me anything that lets me put attributes on images, do an embedded callout to a nicer markup,
or even just format a link as something other than its naked text and I'll use it.

(Edward Kmett, Haskell subreddit,
August 31, 2013)
The strongest objection to
the addition of such a syntax has been the lack of Markdown standardisation.
However, the advent of CommonMark offers a unique opportunity for Haddock to take the leap,
while adopting as principled a stance towards Markdown support as Haskell's stance towards side effects.
In the longer term, this GSoC project is expected to pave the way for the addition of a CommonMark syntax to Haddock.
Moreover, although the release of CommonMark 1.0 is on the horizon, the specs are still in a state of flux;
if any syntactic infelicity (in using CommonMark for literate Haskell) were identified over the course of this GSoC project,
the opportunity would be there to effect changes in the CommonMark specs so as to mitigate or eliminate it.
Roadmap

April: Getting up to speed

Deepen my understanding of functional-parsing theory (literature) and practice (tools and their implementation)
Get intimately familiar with the CommonMark specification
Study existing approaches to parsing (pre-CommonMark) Markdown

April 27th: Happy days!

Celebrate acceptance to GSoC 2015 :p

April 28th - May 28th: "On your Mark"(down)

Bond with mentor
Agree on a library name and a licensing model
Set up project's GitHub repo
Investigation, design, prototyping

May 29th - May 31st: Attend ZuriHac 2015

Soak in the wisdom of high-profile Haskellers
Get my hands dirty with GHC / library development
Network with fellow Haskellers

June 1st - June 5th: Design validation

Validate prospective implementation with mentor

June 6th - July 3rd: Implementation

July 4th - July 25th: Testing and benchmarking

Make sure the parser passes the CommonMark validation test suite
Run benchmarks against CommonMark parsers written in other languages
Tune performance
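
The CommonMark specification embeds examples that pair a Markdown input with the HTML it is expected to produce, so conformance testing largely comes down to comparing rendered output against those pairs. The following is a minimal sketch of such a check, assuming the examples have already been loaded into a list (extraction from the spec is omitted). The names `SpecExample`, `Report`, and `runSuite` are hypothetical, and the `render` argument stands in for whatever Markdown-to-HTML function the library would end up exposing.

```haskell
import Data.List (partition)

-- One example from the spec: Markdown source and the expected HTML.
data SpecExample = SpecExample
  { exampleNumber :: Int
  , markdownInput :: String
  , expectedHtml  :: String
  } deriving (Eq, Show)

-- Number of passing examples, plus the examples that failed.
data Report = Report
  { passed   :: Int
  , failures :: [SpecExample]
  } deriving (Eq, Show)

-- Run a candidate renderer over every example and compare its
-- output with the expected HTML by exact string equality.
runSuite :: (String -> String) -> [SpecExample] -> Report
runSuite render examples = Report (length ok) bad
  where
    (ok, bad)   = partition conforms examples
    conforms ex = render (markdownInput ex) == expectedHtml ex
```

Exact string comparison is the simplest possible criterion; a real harness may need to normalise insignificant HTML differences before comparing.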

July 26th - August 17th: Dotting the i's and crossing the t's

Write documentation
Prepare submission on Hackage

About me

Profile

I'm an electrical-engineering PhD student at University College Cork, Ireland,
where I also teach a course on numerical methods.
Although I don't have a Computer-Science background per se, I have, purely out of interest,
taken multiple CS courses, including compilers, algorithms, and functional programming (in which I got top marks).
In the process, I've developed a keen interest in compilation, and I got such a kick out of learning Haskell
that I'm now determined to get a job in which I build cool things with this wonderful language.
I keep a close eye on Stack Overflow questions about Haskell, and I've recently started contributing answers in that tag.
I am also involved in the TeX world: I've contributed a couple of LaTeX packages to CTAN,
and I'm one of the top users of TeX.SE,
where I remain particularly active in the listings tag.
Finally, despite having limited open-source experience, I have an intimate knowledge of Git:
I am the second most active user in the Git tag on Stack Overflow
(at the time of writing this proposal), I push to GitHub on a daily basis,
and I've also organised a couple of Git tutorials for undergraduate students.
Contact information

Name: Julien Cretel
Phone: +353 (0)85 818 0441
Email: j.cretel@umail.ucc.ie
GitHub: https://github.com/jubobs
Stack Overflow: http://stackoverflow.com/users/2541573/jubobs
Twitter: https://twitter.com/_jubobs_