This document is background information for a series of projects I am currently (as of early 2020) working on. I am making it public in an attempt to solicit useful comments and advice on the ideas documented here; to the best of my knowledge and belief, it contains no information proprietary to either my present employer or any other commercial, non-BSD-license-using organisation.
Markdown makes a far better standard document format for hand-editable content than HTML. Previous forays into HTML-to-Markdown conversion several years ago were frustrating due to the immaturity of both the available Markdown spec(s) and of the tools then available to do the conversion (notably, now-ancient versions of Pandoc).
Converting with Pandoc
Note that there is a maintained Ruby-wrapper Gem for Pandoc on GitHub. That Gem has apparent limitations, however, as noted below.
Pandoc supports several variants of Markdown for input and output:
- commonmark -- a vendor-neutral, "standardised" Markdown;
- gfm -- GitHub-Flavoured Markdown;
- markdown_github -- a deprecated but more versatile GFM parser (highly deprecated for output);
- markdown -- Pandoc's extended Markdown, comparable but not identical to GFM;
- markdown_mmd -- MultiMarkdown, an extended, longer-lived set of extensions built on top of the historical Markdown core; and
- markdown_strict -- John Gruber's original Markdown, ca. 2004.
Not all variants of Markdown support all features of modern HTML, either because they predate the current language (e.g., markdown_strict) or because the variant doesn't support a feature supported by other Markdown variants, such as footnotes.
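One way to check which features a given variant supports is to ask Pandoc itself. Pandoc 2.x can list a format's extensions, one per line, each prefixed with `+` (enabled by default) or `-` (disabled by default). The following is only a minimal sketch: `parse_extensions` and `default_extensions` are hypothetical helper names of my own, not part of Pandoc or of any Gem, and the sketch assumes that the `pandoc` executable can be found.

```ruby
require 'open3'

# Parse the listing printed by `pandoc --list-extensions=FORMAT`:
# one extension per line, prefixed with "+" (enabled by default)
# or "-" (disabled by default). Lines in any other shape are skipped.
def parse_extensions(listing)
  listing.each_line.with_object({}) do |line, table|
    match = line.strip.match(/\A([+-])(\S+)\z/)
    table[match[2]] = (match[1] == '+') if match
  end
end

# Ask the installed pandoc which extensions a format enables by default.
# (Hypothetical helper; assumes `pandoc` can be found and run.)
def default_extensions(format)
  listing, status = Open3.capture2('pandoc', "--list-extensions=#{format}")
  raise "pandoc exited with status #{status.exitstatus}" unless status.success?

  parse_extensions(listing)
end
```

With Pandoc installed, `default_extensions('markdown_strict')['footnotes']` would then report whether that variant enables footnotes by default.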
Converting from HTML to Markdown
Note that valid HTML is, by definition, valid Markdown, although it is not native Markdown; i.e., it has neither the clear legibility nor the easy authoring and modification that Markdown offers. For example,
<h1>A SQL walked into a bar...</h1>
<p>...walked up to two tables and asked, "May I <a href="https://www.databasestar.com/sql-joins/">JOIN</a> you?"</p>
can be rendered in native Markdown as
# A SQL walked into a bar...

...walked up to two tables and asked, "May I [JOIN](https://www.databasestar.com/sql-joins/) you?"
Which is easier for you to read and write? (Note that the document you are reading now was authored and is maintained in GitHub-Flavoured Markdown.)
Any current Markdown-to-HTML converter will take the above Markdown and reproduce the above HTML from it.
To convert from hand-coded "rich" HTML, possibly including links, code blocks, etc., we want to use the html input format and the markdown_github output format. Note that this output format is deprecated as of the Pandoc release current at the time of writing (and likely somewhat earlier), but it is the most resilient GFM-compliant Markdown output supported by Pandoc. However, the pandoc-ruby Gem does not support, and apparently has never supported, the markdown_github output format. A means to pass format names directly to Pandoc was previously supported (e.g., pandoc json for JSON); a quick read of the commit history does not make clear whether that has been removed. If it has not, then supplying an output format of 'pandoc markdown_github' may work; that should be investigated further.
To convert from non-generated HTML to Markdown, one would use a command line such as
$ pandoc -f html -t markdown_github -o output.md input.html
or, with the pandoc-ruby Gem,
# at some likely initialisation point
PandocRuby::WRITERS['markdown_github'] = 'pandoc markdown_github'
PandocRuby::READERS['markdown_github'] = 'pandoc markdown_github'
PandocRuby::STRING_WRITERS['markdown_github'] = 'pandoc markdown_github'
# ...
content = PandocRuby.new(['/path/to/input.html'], from: 'html').to_markdown_github
# ... or ...
content = PandocRuby.new(html_content_str, from: 'html').to_markdown_github
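If the Gem turns out not to accept the markdown_github format at all, one fallback is to bypass it and invoke the CLI directly. The sketch below simply rebuilds the command line shown above as an argument list; `pandoc_argv` and `convert_file` are hypothetical names of my own, and I have not tested this against every Pandoc release.

```ruby
# Build the argument list for the pandoc command line shown above.
# The defaults mirror `-f html -t markdown_github`.
def pandoc_argv(input_path, output_path, from: 'html', to: 'markdown_github')
  ['pandoc', '-f', from, '-t', to, '-o', output_path, input_path]
end

# Run pandoc without going through a shell. Kernel#system returns true on
# a zero exit status, false on a non-zero one, and nil if the command
# could not be started at all (e.g. pandoc is not installed).
def convert_file(input_path, output_path, **formats)
  system(*pandoc_argv(input_path, output_path, **formats))
end
```

Passing the arguments as separate strings, rather than as one interpolated command string, means that file names containing spaces or shell metacharacters need no quoting.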
Properly Generated HTML
For machine-generated or validated HTML, we can instead use the
html (HTML5/HTML4) output (or, indeed, any of the supported writers from
content = PandocRuby.new(['/path/to/valid_input.html'], from: 'html').to_gfm
# ... or ...
content = PandocRuby.new(valid_html_content_str, from: 'html').to_commonmark
This should probably be configurable in some way, since the pandoc-ruby Gem returns error messages as though they were output content, which thus needs to be inspected appropriately. Worse, input that should be converted using markdown_github will return more-or-less mangled Markdown when using one of the other format specifiers, making it even more difficult for inspection to determine correctly whether the output has been mangled.
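One possible way to avoid having to inspect the content for error text is to run the converter so that its output, its error messages and its exit status arrive separately. The following is a sketch only, using Ruby's standard Open3 library rather than the Gem; `ConversionResult` and `run_converter` are hypothetical names. Note that it does nothing to detect mangled-but-successful output, which would still need inspection.

```ruby
require 'open3'

# Keeps the converted content, any error text, and the success flag apart.
ConversionResult = Struct.new(:content, :errors, :success)

# Run a converter command (given as an argument list), feeding `input` to
# its standard input, and capture stdout, stderr and exit status separately.
def run_converter(argv, input)
  stdout, stderr, status = Open3.capture3(*argv, stdin_data: input)
  ConversionResult.new(stdout, stderr, status.success?)
end
```

Assuming Pandoc is installed, something like `run_converter(%w[pandoc -f html -t gfm], html_content_str)` would then give a result whose `content` can be trusted to be converter output only when `success` is true, with any diagnostics in `errors`.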
Note also that the pandoc-ruby Gem requires that the pandoc CLI executable be in the current