This document describes how we'll convert MDN's HTML content into Markdown. It focuses on the JavaScript docs (https://developer.mozilla.org/en-US/docs/Web/JavaScript) because converting them is our immediate goal, but it should also be useful for converting more doc sets in the future.
It tries to take a systematic approach to conversion by listing:
- every HTML element
- every HTML attribute
- every value for the class attribute encountered
and deciding what the conversion process needs to do when it sees that item.
The full details are in this Google sheet: https://docs.google.com/spreadsheets/d/1Nb-WUHveeUfi5YV0-pzVyHI1vR1IC8xF40IdkiceyQQ/edit#gid=0 . This document describes the spreadsheet, summarises the results, and provides more details on the choices it lists.
The spreadsheet has one page for elements, one for attributes, and one for values of the class attribute. Each page has four columns:
- Name: the name of the element/attribute/class in question
- Conversion: what the conversion process should do when it encounters this element/attribute/class
- In JS docs?: whether this element/attribute/class even occurs in the JS docs. Note that this is redundant for class: because this can be anything, I've only listed values which actually occur in the JS docs.
- Issue: link to the GitHub issue where we are discussing what to do about this item
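As a rough illustration of the four columns, a row could be held in memory as shown below. This is only a sketch: the `SheetRow` and `parse_row` names are hypothetical, and the assumption that the "In JS docs?" cell holds a yes/no style value is mine, not something the sheet guarantees.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SheetRow:
    """One row of a spreadsheet page (hypothetical in-memory shape)."""
    name: str                  # element, attribute, or class name
    conversion: Optional[str]  # "Conversion" column; None if not yet decided
    in_js_docs: bool           # "In JS docs?" column
    issue: Optional[str]       # GitHub issue reference, if any


def parse_row(cells: list[str]) -> SheetRow:
    """Build a SheetRow from four cell strings, given in column order."""
    name, conversion, in_js_docs, issue = (cell.strip() for cell in cells)
    return SheetRow(
        name=name,
        conversion=conversion or None,
        # Assumed cell format: anything other than these values counts as "no".
        in_js_docs=in_js_docs.lower() in {"yes", "y", "true"},
        issue=issue or None,
    )
```

An empty "Conversion" cell becomes `None`, which makes undecided items easy to find later.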
Cells in the "Conversion" column list one of a few different categories, which we'll describe here.
- GFM: This is the easy category, where an item has a direct representation in GFM. This applies to things like <p>, <li>, <img src=...>, <pre class="js"> and so on. In the sheet I've highlighted these in a soothing avocado colour.
- Error: This means: if we see this item, the content is not yet ready for conversion. It needs people to fix the content so this item no longer appears in the source. So the conversion process needs to log an error and we need to address it.
  We should choose this category when we don't want to support this item in our Markdown source, but we can't just remove it automatically, because this will probably break the content. The style attribute is a good example.
- Strip tags/strip attribute: This means: throw away the tag/attribute, but keep the contents of the tag.
  For example, a <span> element with no attributes isn't adding anything that we want to capture in the converted markup. Sometimes these choices remove semantics from the markup: for example the sheet recommends that we discard <abbr> tags. So to make this choice means we accept that loss. Often it's hard to choose between this choice and "Error": it's a matter of judgement whether we should make it a manual change to decide what to do about a tag rather than silently remove it.
- Keep original: This means: don't convert the source, just transfer the tag and its contents as-is.
  We should choose this when we do want to keep the original feature, but don't have a sensible way to represent it in Markdown. MathML and SVG are good examples here.
- GFM XYZ: This means: treat this as a different but related element XYZ, that has a GFM representation, and emit the GFM representation of that related element.
  Compared with just "GFM", this is a bit dodgy, because we're generally throwing away some semantics. But we have no way to represent these semantics anyway in our target format, so we don't have a better option.
  Something like <dfn> is an example of this: we might choose to use the GFM syntax for <em>, because that matches how browsers typically render <dfn>. But we lose the semantics.
For attributes especially, we sometimes combine these, because the resolution is different depending on what the element is. For example src can be converted to GFM when it's attached to an <img>, but is an error otherwise.
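To make the categories and the element-dependent attribute rule more concrete, here is a minimal sketch of how a converter might look them up. The rule tables are a small illustrative excerpt based on the examples in this document, not the contents of the sheet. All names (`Category`, `ELEMENT_RULES`, `ATTRIBUTE_RULES`, `classify_attribute`, `gfm_element_for`) are hypothetical, and the <dfn> to <em> mapping is only a possibility mentioned above, not a settled decision.

```python
from enum import Enum
from typing import Optional


class Category(Enum):
    GFM = "GFM"
    ERROR = "Error"
    STRIP = "Strip tags/strip attribute"
    KEEP_ORIGINAL = "Keep original"
    GFM_AS = "GFM XYZ"


# Illustrative excerpt only; the real choices live in the spreadsheet.
# Each element maps to (category, stand-in element for "GFM XYZ" or None).
ELEMENT_RULES = {
    "p": (Category.GFM, None),
    "li": (Category.GFM, None),
    "abbr": (Category.STRIP, None),
    "svg": (Category.KEEP_ORIGINAL, None),
    "dfn": (Category.GFM_AS, "em"),  # a possible choice, not a settled one
}

# Attribute rules are keyed by attribute, then by element; "*" is the fallback.
ATTRIBUTE_RULES = {
    "src": {"img": Category.GFM, "*": Category.ERROR},
    "style": {"*": Category.ERROR},
}


def classify_attribute(element: str, attribute: str) -> Optional[Category]:
    """Return the category for an attribute on an element, or None if unlisted."""
    per_element = ATTRIBUTE_RULES.get(attribute)
    if per_element is None:
        return None
    return per_element.get(element, per_element.get("*"))


def gfm_element_for(element: str) -> Optional[str]:
    """Return the element whose GFM syntax should be emitted, or None if none applies."""
    category, stand_in = ELEMENT_RULES.get(element, (None, None))
    if category is Category.GFM:
        return element
    if category is Category.GFM_AS:
        return stand_in
    return None
```

With these tables, `classify_attribute("img", "src")` gives `Category.GFM`, while the same attribute on any other element falls through to the `"*"` entry and gives `Category.ERROR`.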
Not all items have a category selected. This means we're still not sure what to do about them. All these items should have a link to a GitHub issue in which we can work out what to do with the item.
Once the issue has reached a consensus we can assign a category to the corresponding items and close the issue. The resolution might of course be to invent a new category: for example we might have "Extend GFM" for something like <dl>.
Currently the following groups of items are unresolved:
- elements:
  - <table> and related items: tracked by mdn/content#4325
  - <dl> and related items: tracked by mdn/content#4367
  - <sup> and <sub>: not yet tracked by an issue
- classes:
  - hidden: mdn/content#3694
  - note, warning, notecard: mdn/content#3483
  - callout, button: mdn/content#3927
  - summary, seoSummary: mdn/content#3923
  - fullwidth-table, standard-table: not yet tracked by an issue
Once all items here are resolved, we will have a complete plan for converting the source into Markdown.
As a practical matter, we only need to have a resolution now for items that appear in the JS docs.
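The two bookkeeping rules above (every undecided item should link to an issue, and only undecided items that occur in the JS docs block us right now) could be checked mechanically. The following is a sketch under assumptions: the `Item` shape and both function names are hypothetical, and the data would come from the sheet rather than being hard-coded.

```python
from typing import NamedTuple, Optional


class Item(NamedTuple):
    name: str
    category: Optional[str]  # None means no category has been chosen yet
    in_js_docs: bool
    issue: Optional[str]     # GitHub issue reference, if any


def blocking_items(items: list[Item]) -> list[Item]:
    """Undecided items that occur in the JS docs, so they block the conversion."""
    return [item for item in items if item.in_js_docs and item.category is None]


def untracked_items(items: list[Item]) -> list[Item]:
    """Undecided items with no issue, which breaks the 'every open item has an issue' rule."""
    return [item for item in items if item.category is None and item.issue is None]
```

The conversion plan for the JS docs is complete when `blocking_items` returns an empty list, even if other doc sets still have open items.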