@jdickey
Last active May 4, 2020 17:20
Converting HTML to Markdown for use as a project-standard document markup

This document is background information for a series of projects I am currently (as of early 2020) working on. I am making it public in an attempt to solicit useful comments and advice on the ideas documented here; to the best of my knowledge and belief, it contains no information proprietary to either my present employer or any other commercial, non-BSD-license-using organisation.

Markdown makes a far better standard document format for hand-editable content than HTML. Previous forays into HTML-to-Markdown conversion several years ago were frustrating due to the immaturity of both the available Markdown spec(s) and the tools then available to do the conversion (notably, now-ancient versions of Pandoc).

Converting with Pandoc

Pandoc is a highly capable document converter. It can convert from dozens of formats, to dozens of formats.

Note that there is a maintained Ruby-wrapper Gem for Pandoc on GitHub. That Gem has apparent limitations, however, as noted below.

Pandoc has several variants of Markdown that it supports for input and output.

  • commonmark -- a vendor-neutral, "standardised" Markdown;
  • gfm -- GitHub-Flavoured Markdown;
  • markdown_github -- a deprecated but more versatile GFM parser (highly deprecated for output);
  • markdown -- Pandoc's extended Markdown, comparable but not identical to GFM;
  • markdown_mmd -- MultiMarkdown, an extended, longer-lived set of extensions built on top of the historical Markdown core; and
  • markdown_strict -- John Gruber's original Markdown, ca. 2004.

Not all variants of Markdown support all features of modern HTML, either because they predate the current language (e.g., markdown_strict) or because the variant doesn't support a feature supported by other Markdown variants, such as footnotes.

Converting from HTML to Markdown

Note that valid HTML is, by definition, valid Markdown, although it is not native Markdown; i.e., it has neither the clear legibility nor the ease of authoring and modification that Markdown has. For example,

<h1>A SQL walked into a bar...</h1>
<p>...walked up to two tables and asked, "May I <a href="https://www.databasestar.com/sql-joins/">JOIN</a> you?"</p>

can be rendered in native Markdown as

# A SQL walked into a bar...

...walked up to two tables and asked, "May I [JOIN](https://www.databasestar.com/sql-joins/) you?"

Which is easier for you to read and write? (Note that the document you are reading now was authored and is maintained in GitHub-Flavoured Markdown.)

Any current Markdown-to-HTML converter will take the above Markdown and reproduce the above HTML from it.

Non-Generated HTML

To convert from hand-coded "rich" HTML, possibly including links, code blocks, etc., we want to use the html input format and the markdown_github output format. Note that this output format is deprecated as of Pandoc version 2.9.2.1 (and likely somewhat earlier), but it is the most resilient GFM-compliant Markdown output supported by Pandoc. However, the pandoc-ruby Gem does not support, and apparently never has supported, the markdown_github output format. A means to pass format names directly to Pandoc was previously supported (e.g., pandoc json for JSON); a quick read of the commit history does not make it obvious whether that has since been removed. If it has not, then supplying an output format of 'pandoc markdown_github' may work; that should be investigated further.

To convert from non-generated HTML to Markdown, one would use a command line such as

$ pandoc -f html -t markdown_github -o output.md input.html

or, with the pandoc-ruby Gem,

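    # at some likely initialisation point

    require 'pandoc-ruby'

    # Untested assumption: the Gem picks up format names added to these hashes.
    # If to_markdown_github turns out not to be defined, passing the format
    # through explicitly, e.g.
    #   PandocRuby.convert(html, from: :html, to: :markdown_github)
    # is worth trying instead.
    PandocRuby::WRITERS['markdown_github'] = 'pandoc markdown_github'
    PandocRuby::READERS['markdown_github'] = 'pandoc markdown_github'
    PandocRuby::STRING_WRITERS['markdown_github'] = 'pandoc markdown_github'

    # ...

    # convert a file, given as an array of paths
    content = PandocRuby.new(['/path/to/input.html'], from: 'html').to_markdown_github

    # ... or ...

    # convert a string of HTML
    content = PandocRuby.new(html_content_str, from: 'html').to_markdown_github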
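Until the Gem question is settled, one fallback is to call the pandoc executable directly from Ruby, using the same -f and -t flags as the command line above. The following is a minimal sketch rather than a tested integration: the method names pandoc_args and html_to_markdown are hypothetical, and it assumes pandoc reads its input from standard input when no input file is named.

```ruby
require 'open3'

# Hypothetical helper: builds the pandoc argument list using only the
# -f and -t flags shown in the command line above.
def pandoc_args(from: 'html', to: 'markdown_github')
  ['pandoc', '-f', from, '-t', to]
end

# Hypothetical helper: pipes `html` through pandoc and returns the converted
# text. Raises if pandoc exits with a non-zero status, so that an error
# message is never mistaken for converted content.
def html_to_markdown(html, to: 'markdown_github')
  stdout, stderr, status = Open3.capture3(*pandoc_args(to: to), stdin_data: html)
  raise "pandoc failed (#{status.exitstatus}): #{stderr.strip}" unless status.success?

  stdout
end
```

With this in place, something like content = html_to_markdown(html_content_str) would stand in for the to_markdown_github call, at the cost of bypassing the Gem entirely.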

Properly Generated HTML

For machine-generated or validated HTML, we can instead use the gfm (preferred), commonmark, or html (HTML5/HTML4) output formats, or indeed any of the writers supported by pandoc-ruby; e.g.,

    content = PandocRuby.new(['/path/to/valid_input.html'], from: 'html').to_gfm

    # ... or ...

    content = PandocRuby.new(valid_html_content_str, from: 'html').to_commonmark

(CommonMark is, if you didn't know, a "strongly defined, highly compatible specification of Markdown". It supports most commonly-used features of GFM.)

Other Notes

This should probably be configurable in some way, since the pandoc-ruby Gem returns error messages as though they were output content, which thus needs to be inspected appropriately. Worse, input that should be converted using markdown_github will come back as more-or-less mangled Markdown when one of the other format specifiers is used, making it even more difficult for inspection to determine correctly whether the output has been mangled.
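One possible shape for that configurability is sketched below. It is an illustration only: the class name MarkdownOutputConfig, the source kinds :hand_coded and :generated, and the defaults are all assumptions made for this example, not part of pandoc-ruby. The defaults simply mirror the choices described earlier (markdown_github for hand-coded HTML, gfm for generated HTML).

```ruby
# Hypothetical configuration object; names and defaults are assumptions
# for illustration and do not come from pandoc-ruby.
class MarkdownOutputConfig
  DEFAULTS = { hand_coded: 'markdown_github', generated: 'gfm' }.freeze

  # `overrides` replaces individual defaults, e.g. { generated: 'commonmark' }.
  def initialize(overrides = {})
    @formats = DEFAULTS.merge(overrides)
  end

  # Returns the Pandoc output format name for the given kind of HTML source,
  # raising ArgumentError for a kind that has not been configured.
  def format_for(source_kind)
    @formats.fetch(source_kind) do
      raise ArgumentError, "unknown HTML source kind: #{source_kind.inspect}"
    end
  end
end
```

A caller would then ask the configuration for a format name, for example MarkdownOutputConfig.new(generated: 'commonmark').format_for(:generated), and hand the result to whichever conversion path is in use. Failing loudly on an unknown kind avoids silently falling back to a format that mangles the output.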

Note also that the pandoc-ruby Gem requires that the pandoc CLI executable be in the current PATH.
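Given that requirement, it may be worth checking for the executable up front rather than discovering its absence at conversion time. A minimal sketch using only the Ruby standard library follows; the method name find_executable_in_path is hypothetical, and the sketch assumes the executable is named exactly as given (on Windows it would need to account for file extensions such as .exe).

```ruby
# Hypothetical helper: returns the full path of the first executable file
# called `name` found in the PATH-style string `path`, or nil if none exists.
def find_executable_in_path(name, path = ENV.fetch('PATH', ''))
  path.split(File::PATH_SEPARATOR).each do |dir|
    next if dir.empty?

    candidate = File.join(dir, name)
    return candidate if File.file?(candidate) && File.executable?(candidate)
  end
  nil
end
```

At start-up, a guard such as raise 'pandoc not found in PATH' unless find_executable_in_path('pandoc') would then fail early with a clear message.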
