@wbamberg
Last active April 26, 2021 17:00

Specifying MDN to Markdown conversion

This document describes how we'll convert MDN's HTML content into Markdown. It focuses on the JavaScript docs (https://developer.mozilla.org/en-US/docs/Web/JavaScript) because converting them is our immediate goal, but it should also be useful for converting other doc sets in the future.

It tries to take a systematic approach to conversion by listing:

  • every HTML element
  • every HTML attribute
  • every value for the class attribute encountered

and deciding what the conversion process needs to do when it sees that item.
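As a rough illustration of how such an inventory could be gathered, here is a minimal sketch using Python's standard html.parser module. The names InventoryParser and inventory are made up for this example; this is not the tooling actually used to build the spreadsheet.

```python
from collections import Counter
from html.parser import HTMLParser


class InventoryParser(HTMLParser):
    """Counts every element, attribute, and class value seen in some HTML."""

    def __init__(self):
        super().__init__()
        self.elements = Counter()
        self.attributes = Counter()
        self.class_values = Counter()

    def handle_starttag(self, tag, attrs):
        self.elements[tag] += 1
        for name, value in attrs:
            self.attributes[name] += 1
            if name == "class" and value:
                # A class attribute can hold several space-separated values.
                self.class_values.update(value.split())


def inventory(html):
    """Parse an HTML string and return the populated parser (hypothetical helper)."""
    parser = InventoryParser()
    parser.feed(html)
    parser.close()
    return parser
```

Running inventory over every page in a doc set and merging the counters would give the three lists that the conversion plan needs to cover.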

The full details are in this Google sheet: https://docs.google.com/spreadsheets/d/1Nb-WUHveeUfi5YV0-pzVyHI1vR1IC8xF40IdkiceyQQ/edit#gid=0 . This document describes the spreadsheet, summarises the results, and provides more details on the choices it lists.

The spreadsheet has one page for elements, one for attributes, and one for values of the class attribute. Each page has four columns:

  • Name: the name of the element/attribute/class in question
  • Conversion: what the conversion process should do when it encounters this element/attribute/class
  • In JS docs?: whether this element/attribute/class even occurs in the JS docs. Note that this column is redundant for class values: because a class value can be anything, I've only listed values which actually occur in the JS docs.
  • Issue: link to the GitHub issue where we are discussing what to do about this item

Conversion categories

Cells in the "Conversion" column list one of a few different categories, which we'll describe here.

  • GFM: This is the easy category, where an item has a direct representation in GFM (GitHub Flavored Markdown). This applies to things like <p>, <li>, <img src=...>, <pre class="js"> and so on. In the sheet I've highlighted these in a soothing avocado colour.

  • Error: This means: if we see this item, the content is not yet ready for conversion. It needs people to fix the content so this item no longer appears in the source. So the conversion process needs to log an error and we need to address it.

    We should choose this category when we don't want to support this item in our Markdown source, but we can't just remove it automatically, because doing so would probably break the content. The style attribute is a good example.

  • Strip tags/strip attribute: This means: throw away the tag/attribute, but keep the contents of the tag.

    For example, a <span> element with no attributes isn't adding anything that we want to capture in the converted markup. Sometimes these choices remove semantics from the markup: for example the sheet recommends that we discard <abbr> tags. Making this choice means we accept that loss.

    It's often hard to choose between this category and "Error": it's a matter of judgement whether someone should decide manually what to do about a tag, or whether the converter should silently remove it.

  • Keep original: This means: don't convert the source, just transfer the tag and its contents as-is.

    We should choose this when we do want to keep the original feature, but don't have a sensible way to represent it in Markdown. MathML and SVG are good examples here.

  • GFM XYZ: This means: treat this as a different but related element XYZ that has a GFM representation, and emit the GFM representation of that related element.

    Compared with just "GFM", this is a bit dodgy, because we're generally throwing away some semantics. But we have no way to represent these semantics anyway in our target format, so we don't have a better option.

    Something like <dfn> is an example of this: we might choose to use the GFM syntax for <em>, because that matches how browsers typically render <dfn>. But we lose the semantics.

For attributes especially, we sometimes combine these categories, because the resolution depends on which element the attribute is attached to. For example src can be converted to GFM when it's attached to an <img>, but is an error otherwise.
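To make the categories concrete, here is a minimal sketch of how a converter might dispatch on them. Everything in it is an assumption for illustration: the Category enum, the rule table, and the function names are invented, the real decisions live in the spreadsheet, and the <font> entry in particular is a made-up example of an "Error" element.

```python
from enum import Enum


class Category(Enum):
    GFM = "GFM"
    ERROR = "Error"
    STRIP = "Strip tags"
    KEEP = "Keep original"
    GFM_AS = "GFM XYZ"


class ConversionError(Exception):
    """Raised when the content needs a manual fix before it can be converted."""


# Hypothetical rules: (category, related element for "GFM XYZ" or None).
ELEMENT_RULES = {
    "em": (Category.GFM, None),
    "span": (Category.STRIP, None),
    "abbr": (Category.STRIP, None),
    "svg": (Category.KEEP, None),
    "dfn": (Category.GFM_AS, "em"),
    "font": (Category.ERROR, None),  # invented example, not from the sheet
}

# How to write each GFM-representable element; only <em> is sketched here.
GFM_WRITERS = {"em": lambda inner: f"*{inner}*"}


def convert_element(tag, inner, original):
    """Convert one element given its already-converted contents and original HTML."""
    category, related = ELEMENT_RULES[tag]
    if category is Category.GFM:
        return GFM_WRITERS[tag](inner)
    if category is Category.GFM_AS:
        # Emit the GFM form of the related element, losing the semantics.
        return GFM_WRITERS[related](inner)
    if category is Category.STRIP:
        return inner
    if category is Category.KEEP:
        return original
    raise ConversionError(f"<{tag}> must be fixed in the source before conversion")


def attribute_category(attr, tag):
    """Combined rule: the category of an attribute can depend on its element."""
    if attr == "src":
        return Category.GFM if tag == "img" else Category.ERROR
    if attr == "style":
        return Category.ERROR
    raise KeyError(attr)
```

With these rules, convert_element("dfn", "term", "<dfn>term</dfn>") gives the same output as it would for <em>, and attribute_category("src", "script") reports an error while attribute_category("src", "img") does not.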

Unresolved elements/attributes/class values

Not all items have a category selected. This means we're still not sure what to do about them. All these items should have a link to a GitHub issue in which we can work out what to do with the item.

Once the issue has reached a consensus we can assign a category to the corresponding items and close the issue. The resolution might of course be to invent a new category: for example we might have "Extend GFM" for something like <dl>.

Currently the following groups of items are unresolved:

Once all items are resolved here, we will have a complete plan for converting the source into Markdown.

As a practical matter, we only need to have a resolution now for items that appear in the JS docs.
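One way to find the items that need a resolution now is to filter the sheet for rows that have no category but do occur in the JS docs. The sketch below assumes a hypothetical SheetRow type mirroring the four columns; the sample rows are invented and do not reflect the real contents of the spreadsheet.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SheetRow:
    name: str
    conversion: Optional[str]  # None while the item is still unresolved
    in_js_docs: bool
    issue: Optional[str]  # link to the GitHub issue, if there is one


def blocking_items(rows):
    """Names of unresolved items that occur in the JS docs."""
    return [row.name for row in rows if row.conversion is None and row.in_js_docs]


# Invented sample rows, for illustration only.
rows = [
    SheetRow("p", "GFM", True, None),
    SheetRow("dl", None, True, None),
    SheetRow("marquee", None, False, None),
]

print(blocking_items(rows))
```

Unresolved items that never appear in the JS docs, like the third sample row, can wait until a later doc set is converted.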
