@gabetax
Created January 30, 2012 05:42
XHTML to Markdown XSLT Translation
<html>
<body>
<h1>Our Navigation</h1>
<p>I'm writing an example xhtml document to get converted into markdown!</p>
<h2>Examples</h2>
<h3>Text formatting</h3>
<p>Sometimes with longer <em>paragraphs</em><br/>we just want a new line <strong>immediately</strong>.</p>
<div>Divs are block elements too, and people don't always put their text in p tags.</div>
<h3>Quotey things</h3>
<pre>
This is some documentation that should be block quoted.
  It has some indentation.
And another sentence.
</pre>
<blockquote>
All right, brain. You don't like me and I don't like you, but let's just do this and I can get back to killing you with beer.
--Homer Simpson
</blockquote>
<h3>Images</h3>
<img src="http://www.google.com/intl/en_com/images/srpr/logo3w.png" alt="Google" />
<br />
<hr />
<h3>List test!</h3>
<ul>
<li>About</li>
<li>Services
<ol>
<li>Programming</li>
<li>Design</li>
<li>Marketing</li>
</ol>
</li>
<li>Team
<ol>
<li>All</li>
<li>Executive</li>
<li>Client Services</li>
<li>Programming
<ol>
<li><a href="http://rubyonrails.org">Ruby <em>on</em> Rails</a></li>
<li>PHP</li>
<li>.NET</li>
<li>Orbeon</li>
</ol>
</li>
<li>Design</li>
<li>Marketing</li>
</ol>
</li>
<li>Blog</li>
<li>Contact</li>
</ul>
</body>
</html>
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:functx="http://www.functx.com"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
version="2.0">
<xsl:output method="text" />
<xsl:strip-space elements="*" />
<!-- Required for li indenting -->
<xsl:function name="functx:repeat-string" as="xs:string">
<xsl:param name="stringToRepeat" as="xs:string?"/>
<xsl:param name="count" as="xs:integer"/>
<xsl:sequence select="string-join((for $i in 1 to $count return $stringToRepeat), '')"/>
</xsl:function>
<xsl:template match="/html/body">
<xsl:apply-templates select="*" />
</xsl:template>
<xsl:template match="li">
<!-- Four spaces per nesting level, so Markdown treats the item as nested -->
<xsl:value-of select="functx:repeat-string('    ', count(ancestor::li))"/>
<xsl:choose>
<xsl:when test="name(..) = 'ol'">
<xsl:value-of select="position()" />
<xsl:text>. </xsl:text>
</xsl:when>
<xsl:otherwise>
<xsl:text>* </xsl:text>
</xsl:otherwise>
</xsl:choose>
<xsl:value-of select="normalize-space(text())" />
<xsl:apply-templates select="* except (ul|ol)" />
<xsl:text>&#xa;&#xa;</xsl:text>
<xsl:apply-templates select="ul|ol" />
</xsl:template>
<!-- Don't process text() nodes for these - prevents unnecessary whitespace -->
<xsl:template match="ul|ol">
<xsl:apply-templates select="* except text()" />
</xsl:template>
<xsl:template match="a">
<xsl:text>[</xsl:text>
<xsl:apply-templates select="node()|text()" />
<xsl:text>](</xsl:text>
<xsl:value-of select="@href" />
<xsl:text>)</xsl:text>
</xsl:template>
<xsl:template match="img">
<xsl:text>![</xsl:text>
<xsl:value-of select="@alt" />
<xsl:text>](</xsl:text>
<xsl:value-of select="@src" />
<xsl:text>)</xsl:text>
</xsl:template>
<xsl:template match="strong|b">
<xsl:text>**</xsl:text>
<xsl:value-of select="." />
<xsl:text>**</xsl:text>
</xsl:template>
<xsl:template match="em|i">
<xsl:text>*</xsl:text>
<xsl:value-of select="." />
<xsl:text>*</xsl:text>
</xsl:template>
<xsl:template match="code">
<!-- todo: skip the ` if inside a pre -->
<xsl:text>`</xsl:text>
<xsl:value-of select="." />
<xsl:text>`</xsl:text>
</xsl:template>
<xsl:template match="br">
<!-- Two trailing spaces before the newline form a Markdown hard line break -->
<xsl:text>  &#xa;</xsl:text>
</xsl:template>
<!-- Block elements -->
<xsl:template match="hr">
<xsl:text>----&#xa;&#xa;</xsl:text>
</xsl:template>
<xsl:template match="p|div">
<xsl:apply-templates select="*|text()" />
<xsl:text>&#xa;&#xa;</xsl:text> <!-- Block element -->
</xsl:template>
<!-- Anchored so that only h1 to h6 match, never a longer name containing e.g. "h1" -->
<xsl:template match="*[matches(name(), '^h[1-6]$')]">
<xsl:value-of select="functx:repeat-string('#', xs:integer(substring(name(), 2)))" />
<xsl:text> </xsl:text>
<xsl:apply-templates select="*|text()" />
<xsl:text>&#xa;&#xa;</xsl:text> <!-- Block element -->
</xsl:template>
<xsl:template match="pre">
<!-- Markdown code blocks need a four-space indent on every line -->
<xsl:text>    </xsl:text>
<xsl:value-of select="replace(text(), '&#xa;', '&#xa;    ')" />
<xsl:text>&#xa;&#xa;</xsl:text> <!-- Block element -->
</xsl:template>
<xsl:template match="blockquote">
<xsl:text>&gt; </xsl:text>
<xsl:value-of select="replace(text(), '&#xa;', '&#xa;&gt; ')" />
<xsl:text>&#xa;&#xa;</xsl:text> <!-- Block element -->
</xsl:template>
</xsl:stylesheet>

Our Navigation

I'm writing an example xhtml document to get converted into markdown!

Examples

Text formatting

Sometimes with longer paragraphs
we just want a new line immediately.

Divs are block elements too, and people don't always put their text in p tags.

Quotey things

This is some documentation that should be block quoted.
  It has some indentation.
And another sentence.
All right, brain. You don't like me and I don't like you, but let's just do this and I can get back to killing you with beer.

--Homer Simpson

Images

Google

List test!

  • About
  • Services
    1. Programming
    2. Design
    3. Marketing
  • Team
    1. All
    2. Executive
    3. Client Services
    4. Programming
      1. Ruby on Rails
      2. PHP
      3. .NET
      4. Orbeon
    5. Design
    6. Marketing
  • Blog
  • Contact
@gabetax
Author

gabetax commented Jan 30, 2012

I wrote this as a personal XSLT learning project.

To run XSLT on OSX:

$ brew install saxon
$ saxon example.html html-to-markdown.xsl

This would benefit from some additions to better handle block elements.

@SmileGobo

In the Programming list you probably have a mistake: Orbeon -> Oberon

@gabetax
Author

gabetax commented Jun 29, 2022

"Orbeon" here refers to https://doc.orbeon.com, which is an XML-based web framework that uses XSLT to implement business logic (and was my main motivator for this experiment with XSLT).

@namedgraph

namedgraph commented Nov 22, 2022

@gabetax what about any elements you don't have templates for, such as <table>? Shouldn't there be a catch-all template copying them as-is?

@namedgraph

Using version="3.0" it should be enough to add this:

<xsl:mode on-no-match="deep-copy"/>
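
One caveat worth checking before relying on this: with method="text" the serializer writes only the text nodes of the result, so a deep-copied <table> would most likely come out as its text content without tags. A minimal sketch of one way to keep the markup, assuming an XSLT 3.0 processor that supports the XPath 3.1 serialize() function (for example a recent Saxon release). The body//* pattern and priority="-1" are illustrative choices, not part of the original stylesheet, and the p template is a stand-in for the gist's real templates:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    version="3.0">

  <xsl:output method="text" />
  <xsl:strip-space elements="*" />

  <!-- Stand-in for the gist's own templates (p, li, a, ...) -->
  <xsl:template match="p">
    <xsl:apply-templates />
    <xsl:text>&#xa;&#xa;</xsl:text>
  </xsl:template>

  <!-- Hypothetical fallback: any element below body that has no template
       of its own is turned into a string of literal markup, which the
       text output method then writes out unchanged.
       priority="-1" keeps this rule below every ordinary template. -->
  <xsl:template match="body//*" priority="-1">
    <xsl:value-of select="serialize(., map {
        'method': 'xml',
        'omit-xml-declaration': true()
      })" />
    <xsl:text>&#xa;&#xa;</xsl:text>
  </xsl:template>

</xsl:stylesheet>
```

Restricting the pattern to descendants of body leaves html and body to the existing rules, so the fallback never swallows the whole document. The explicit priority matters because a multi-step pattern such as body//* otherwise gets a default priority of 0.5 and would override name-only templates like li.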

@namedgraph

It looks like your stylesheet is also missing xpath-default-namespace="http://www.w3.org/1999/xhtml" (or xhtml:-prefixed names in template match patterns).

My full prologue now looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:functx="http://www.functx.com"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
xpath-default-namespace="http://www.w3.org/1999/xhtml"
version="3.0">
   
 <xsl:output method="text" />
 <xsl:strip-space elements="*" />
 <xsl:mode on-no-match="deep-copy"/>
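
The unprefixed match patterns only stop matching when the input actually declares the XHTML namespace (the example.html above does not). The prefixed alternative mentioned above can be sketched as follows, shown for a single template. The xhtml prefix name is an arbitrary choice for illustration:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xhtml="http://www.w3.org/1999/xhtml"
    version="2.0">

  <xsl:output method="text" />
  <xsl:strip-space elements="*" />

  <!-- Matches <strong> and <b> only when they are in the XHTML namespace,
       e.g. below <html xmlns="http://www.w3.org/1999/xhtml"> -->
  <xsl:template match="xhtml:strong | xhtml:b">
    <xsl:text>**</xsl:text>
    <xsl:value-of select="." />
    <xsl:text>**</xsl:text>
  </xsl:template>

</xsl:stylesheet>
```

Either approach has to be applied to every template in the stylesheet. Tests such as name(..) = 'ol' compare the lexical name, so they keep working for unprefixed input but would miss a prefixed <xhtml:ol>; local-name(..) is the more robust choice there.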
