Convert Datasette RST changelog to GFM for releases - for https://github.com/simonw/datasette/issues/680
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Convert Datasette RST changelog to GFM for releases\n",
"\n",
"For https://github.com/simonw/datasette/issues/680\n",
"\n",
"Question is: can I automatically take the most recent section from https://datasette.readthedocs.io/en/latest/changelog.html and convert it into Markdown suitable for automatically posting a GitHub release?\n",
"\n",
"Pandoc looks useful but isn't pure Python. https://pypi.org/project/pypandoc/ includes a prebuilt binary wheel for OS X though, so I'll start by playing with that.\n",
"\n",
" ~ $ jupyter-venv/bin/pip install pypandoc"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"from pypandoc.pandoc_download import download_pandoc"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"* Downloading pandoc from https://github.com/jgm/pandoc/releases/download/2.9.2/pandoc-2.9.2-macOS.pkg ...\n",
"* Unpacking pandoc-2.9.2-macOS.pkg to tempfolder...\n",
"* Copying pandoc to /Users/simonw/Applications/pandoc ...\n",
"* Making /Users/simonw/Applications/pandoc/pandoc executeable...\n",
"* Copying pandoc-citeproc to /Users/simonw/Applications/pandoc ...\n",
"* Making /Users/simonw/Applications/pandoc/pandoc-citeproc executeable...\n",
"* Done.\n"
]
}
],
"source": [
"download_pandoc()"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"import pypandoc"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"some title\n",
"==========\n",
"\n"
]
}
],
"source": [
"print(pypandoc.convert_text('# some title', 'rst', format='md'))"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"changelog = '''.. _v0_36:\n",
"\n",
"0.36 (2020-02-21)\n",
"-----------------\n",
"\n",
"* The ``datasette`` object passed to plugins now has API documentation: :ref:`datasette`. (`#576 <https://github.com/simonw/datasette/issues/576>`__)\n",
"* New methods on ``datasette``: ``.add_database()`` and ``.remove_database()`` - :ref:`documentation <datasette_add_database>`. (`#671 <https://github.com/simonw/datasette/issues/671>`__)\n",
"* ``prepare_connection()`` plugin hook now takes optional ``datasette`` and ``database`` arguments - :ref:`plugin_hook_prepare_connection`. (`#678 <https://github.com/simonw/datasette/issues/678>`__)\n",
"* Added three new plugins and one new conversion tool to the :ref:`ecosystem`.'''"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
".. _v0_36:\n",
"\n",
"0.36 (2020-02-21)\n",
"-----------------\n",
"\n",
"* The ``datasette`` object passed to plugins now has API documentation: :ref:`datasette`. (`#576 <https://github.com/simonw/datasette/issues/576>`__)\n",
"* New methods on ``datasette``: ``.add_database()`` and ``.remove_database()`` - :ref:`documentation <datasette_add_database>`. (`#671 <https://github.com/simonw/datasette/issues/671>`__)\n",
"* ``prepare_connection()`` plugin hook now takes optional ``datasette`` and ``database`` arguments - :ref:`plugin_hook_prepare_connection`. (`#678 <https://github.com/simonw/datasette/issues/678>`__)\n",
"* Added three new plugins and one new conversion tool to the :ref:`ecosystem`.\n"
]
}
],
"source": [
"print(changelog)"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"# 0.36 (2020-02-21)\n",
"\n",
" - The `datasette` object passed to plugins now has API documentation:\n",
" `datasette`.\n",
" ([\\#576](https://github.com/simonw/datasette/issues/576))\n",
" - New methods on `datasette`: `.add_database()` and\n",
" `.remove_database()` - `documentation <datasette_add_database>`.\n",
" ([\\#671](https://github.com/simonw/datasette/issues/671))\n",
" - `prepare_connection()` plugin hook now takes optional `datasette`\n",
" and `database` arguments - `plugin_hook_prepare_connection`.\n",
" ([\\#678](https://github.com/simonw/datasette/issues/678))\n",
" - Added three new plugins and one new conversion tool to the\n",
" `ecosystem`.\n",
"\n"
]
}
],
"source": [
"print(pypandoc.convert_text(changelog, 'gfm', format='rst'))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This isn't a bad start, but has one big flaw: the `:ref:datasette` references were understandable not resolved, becausee I didn't provide enough information for that to happen.\n",
"\n",
"One option: use rendered HTML as input instead of RST."
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [],
"source": [
"changelog_html = \"\"\"\n",
"<div class=\"section\" id=\"v0-36\">\n",
"<span id=\"id2\"></span><h2>0.36 (2020-02-21)<a class=\"headerlink\" href=\"#v0-36\" title=\"Permalink to this headline\">¶</a></h2>\n",
"<ul class=\"simple\">\n",
"<li>The <code class=\"docutils literal notranslate\"><span class=\"pre\">datasette</span></code> object passed to plugins now has API documentation: <a class=\"reference internal\" href=\"datasette.html#datasette\"><span class=\"std std-ref\">Datasette class</span></a>. (<a class=\"reference external\" href=\"https://github.com/simonw/datasette/issues/576\">#576</a>)</li>\n",
"<li>New methods on <code class=\"docutils literal notranslate\"><span class=\"pre\">datasette</span></code>: <code class=\"docutils literal notranslate\"><span class=\"pre\">.add_database()</span></code> and <code class=\"docutils literal notranslate\"><span class=\"pre\">.remove_database()</span></code> - <a class=\"reference internal\" href=\"datasette.html#datasette-add-database\"><span class=\"std std-ref\">documentation</span></a>. (<a class=\"reference external\" href=\"https://github.com/simonw/datasette/issues/671\">#671</a>)</li>\n",
"<li><code class=\"docutils literal notranslate\"><span class=\"pre\">prepare_connection()</span></code> plugin hook now takes optional <code class=\"docutils literal notranslate\"><span class=\"pre\">datasette</span></code> and <code class=\"docutils literal notranslate\"><span class=\"pre\">database</span></code> arguments - <a class=\"reference internal\" href=\"plugins.html#plugin-hook-prepare-connection\"><span class=\"std std-ref\">prepare_connection(conn, database, datasette)</span></a>. (<a class=\"reference external\" href=\"https://github.com/simonw/datasette/issues/678\">#678</a>)</li>\n",
"<li>Added three new plugins and one new conversion tool to the <a class=\"reference internal\" href=\"ecosystem.html#ecosystem\"><span class=\"std std-ref\">The Datasette Ecosystem</span></a>.</li>\n",
"</ul>\n",
"</div>\n",
"\"\"\""
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<div id=\"v0-36\" class=\"section\">\n",
"\n",
"<span id=\"id2\"></span>\n",
"\n",
"## 0.36 (2020-02-21)[¶](#v0-36 \"Permalink to this headline\")\n",
"\n",
" - The `datasette` object passed to plugins now has API documentation:\n",
" [<span class=\"std std-ref\">Datasette\n",
" class</span>](datasette.html#datasette).\n",
" ([\\#576](https://github.com/simonw/datasette/issues/576))\n",
" - New methods on `datasette`: `.add_database()` and\n",
" `.remove_database()` -\n",
" [<span class=\"std std-ref\">documentation</span>](datasette.html#datasette-add-database).\n",
" ([\\#671](https://github.com/simonw/datasette/issues/671))\n",
" - `prepare_connection()` plugin hook now takes optional `datasette`\n",
" and `database` arguments -\n",
" [<span class=\"std std-ref\">prepare\\_connection(conn, database,\n",
" datasette)</span>](plugins.html#plugin-hook-prepare-connection).\n",
" ([\\#678](https://github.com/simonw/datasette/issues/678))\n",
" - Added three new plugins and one new conversion tool to the\n",
" [<span class=\"std std-ref\">The Datasette\n",
" Ecosystem</span>](ecosystem.html#ecosystem).\n",
"\n",
"</div>\n",
"\n"
]
}
],
"source": [
"print(pypandoc.convert_text(changelog_html, 'gfm', format='html'))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"That's pretty great! A couple of improvements:\n",
"\n",
"1. I can strip out the `Permalink to this headline` elements before running the transformation\n",
"2. I should resolve the relative path links to full URLs\n",
"3. I can probably strip out the `<span>` and `<div>` elements entirely\n",
"\n",
"It might be possible to do some of this using advanced Pandoc options - https://pandoc.org/MANUAL.html - but I'm going to play it safe and do it with BeautifulSoup instead.\n",
"\n",
"But first... let's try it against a more complex example that isn't just a single bulleted list."
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [],
"source": [
"bigger_html = '''<div class=\"section\" id=\"v0-31\">\n",
"<span id=\"id9\"></span><h2>0.31 (2019-11-11)<a class=\"headerlink\" href=\"#v0-31\" title=\"Permalink to this headline\">¶</a></h2>\n",
"<p>This version adds compatibility with Python 3.8 and breaks compatibility with Python 3.5.</p>\n",
"<p>If you are still running Python 3.5 you should stick with <code class=\"docutils literal notranslate\"><span class=\"pre\">0.30.2</span></code>, which you can install like this:</p>\n",
"<div class=\"highlight-default notranslate\"><div class=\"highlight\"><pre><span></span><span class=\"n\">pip</span> <span class=\"n\">install</span> <span class=\"n\">datasette</span><span class=\"o\">==</span><span class=\"mf\">0.30</span><span class=\"o\">.</span><span class=\"mi\">2</span>\n",
"</pre></div>\n",
"</div>\n",
"<ul class=\"simple\">\n",
"<li>Format SQL button now works with read-only SQL queries - thanks, Tobias Kunze (<a class=\"reference external\" href=\"https://github.com/simonw/datasette/pull/602\">#602</a>)</li>\n",
"<li>New <code class=\"docutils literal notranslate\"><span class=\"pre\">?column__notin=x,y,z</span></code> filter for table views (<a class=\"reference external\" href=\"https://github.com/simonw/datasette/issues/614\">#614</a>)</li>\n",
"<li>Table view now uses <code class=\"docutils literal notranslate\"><span class=\"pre\">select</span> <span class=\"pre\">col1,</span> <span class=\"pre\">col2,</span> <span class=\"pre\">col3</span></code> instead of <code class=\"docutils literal notranslate\"><span class=\"pre\">select</span> <span class=\"pre\">*</span></code></li>\n",
"<li>Database filenames can now contain spaces - thanks, Tobias Kunze (<a class=\"reference external\" href=\"https://github.com/simonw/datasette/pull/590\">#590</a>)</li>\n",
"<li>Removed obsolete <code class=\"docutils literal notranslate\"><span class=\"pre\">?_group_count=col</span></code> feature (<a class=\"reference external\" href=\"https://github.com/simonw/datasette/issues/504\">#504</a>)</li>\n",
"<li>Improved user interface and documentation for <code class=\"docutils literal notranslate\"><span class=\"pre\">datasette</span> <span class=\"pre\">publish</span> <span class=\"pre\">cloudrun</span></code> (<a class=\"reference external\" href=\"https://github.com/simonw/datasette/issues/608\">#608</a>)</li>\n",
"<li>Tables with indexes now show the <code class=\"docutils literal notranslate\"><span class=\"pre\">CREATE</span> <span class=\"pre\">INDEX</span></code> statements on the table page (<a class=\"reference external\" href=\"https://github.com/simonw/datasette/issues/618\">#618</a>)</li>\n",
"<li>Current version of <a class=\"reference external\" href=\"https://www.uvicorn.org/\">uvicorn</a> is now shown on <code class=\"docutils literal notranslate\"><span class=\"pre\">/-/versions</span></code></li>\n",
"<li>Python 3.8 is now supported! (<a class=\"reference external\" href=\"https://github.com/simonw/datasette/issues/622\">#622</a>)</li>\n",
"<li>Python 3.5 is no longer supported.</li>\n",
"</ul>\n",
"</div>'''"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<div id=\"v0-31\" class=\"section\">\n",
"\n",
"<span id=\"id9\"></span>\n",
"\n",
"## 0.31 (2019-11-11)[¶](#v0-31 \"Permalink to this headline\")\n",
"\n",
"This version adds compatibility with Python 3.8 and breaks compatibility\n",
"with Python 3.5.\n",
"\n",
"If you are still running Python 3.5 you should stick with `0.30.2`,\n",
"which you can install like this:\n",
"\n",
"<div class=\"highlight-default notranslate\">\n",
"\n",
"<div class=\"highlight\">\n",
"\n",
" pip install datasette==0.30.2\n",
"\n",
"</div>\n",
"\n",
"</div>\n",
"\n",
" - Format SQL button now works with read-only SQL queries - thanks,\n",
" Tobias Kunze ([\\#602](https://github.com/simonw/datasette/pull/602))\n",
" - New `?column__notin=x,y,z` filter for table views\n",
" ([\\#614](https://github.com/simonw/datasette/issues/614))\n",
" - Table view now uses `select col1, col2, col3` instead of `select *`\n",
" - Database filenames can now contain spaces - thanks, Tobias Kunze\n",
" ([\\#590](https://github.com/simonw/datasette/pull/590))\n",
" - Removed obsolete `?_group_count=col` feature\n",
" ([\\#504](https://github.com/simonw/datasette/issues/504))\n",
" - Improved user interface and documentation for `datasette publish\n",
" cloudrun` ([\\#608](https://github.com/simonw/datasette/issues/608))\n",
" - Tables with indexes now show the `CREATE INDEX` statements on the\n",
" table page ([\\#618](https://github.com/simonw/datasette/issues/618))\n",
" - Current version of [uvicorn](https://www.uvicorn.org/) is now shown\n",
" on `/-/versions`\n",
" - Python 3.8 is now supported\\!\n",
" ([\\#622](https://github.com/simonw/datasette/issues/622))\n",
" - Python 3.5 is no longer supported.\n",
"\n",
"</div>\n",
"\n"
]
}
],
"source": [
"print(pypandoc.convert_text(bigger_html, 'gfm', format='html'))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This looks pretty great rendered! https://gist.github.com/simonw/bec7efcee94b4fe7215ddb95ee6a3ec1#v0-31\n",
"\n",
"I can strip out the `<h2>` (including the permalink thing) entirely since that will turn into the title of the release.\n",
"\n",
"So all I really need to do is resolve those links into full URLs."
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [],
"source": [
"from bs4 import BeautifulSoup as Soup"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [],
"source": [
"soup = Soup(changelog_html)"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[<a class=\"headerlink\" href=\"#v0-36\" title=\"Permalink to this headline\">¶</a>,\n",
" <a class=\"reference internal\" href=\"datasette.html#datasette\"><span class=\"std std-ref\">Datasette class</span></a>,\n",
" <a class=\"reference internal\" href=\"datasette.html#datasette-add-database\"><span class=\"std std-ref\">documentation</span></a>,\n",
" <a class=\"reference internal\" href=\"plugins.html#plugin-hook-prepare-connection\"><span class=\"std std-ref\">prepare_connection(conn, database, datasette)</span></a>,\n",
" <a class=\"reference internal\" href=\"ecosystem.html#ecosystem\"><span class=\"std std-ref\">The Datasette Ecosystem</span></a>]"
]
},
"execution_count": 38,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"relative_links = [\n",
" a for a in soup.findAll(\"a\")\n",
" if not (a['href'].startswith('https://') or a['href'].startswith('http://'))\n",
"]\n",
"relative_links"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [],
"source": [
"for a in relative_links:\n",
" a['href'] = 'https://datasette.readthedocs.io/en/latest/' + a['href']"
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<html><body><div class=\"section\" id=\"v0-36\">\n",
"<span id=\"id2\"></span><h2>0.36 (2020-02-21)<a class=\"headerlink\" href=\"https://datasette.readthedocs.io/en/latest/#v0-36\" title=\"Permalink to this headline\">¶</a></h2>\n",
"<ul class=\"simple\">\n",
"<li>The <code class=\"docutils literal notranslate\"><span class=\"pre\">datasette</span></code> object passed to plugins now has API documentation: <a class=\"reference internal\" href=\"https://datasette.readthedocs.io/en/latest/datasette.html#datasette\"><span class=\"std std-ref\">Datasette class</span></a>. (<a class=\"reference external\" href=\"https://github.com/simonw/datasette/issues/576\">#576</a>)</li>\n",
"<li>New methods on <code class=\"docutils literal notranslate\"><span class=\"pre\">datasette</span></code>: <code class=\"docutils literal notranslate\"><span class=\"pre\">.add_database()</span></code> and <code class=\"docutils literal notranslate\"><span class=\"pre\">.remove_database()</span></code> - <a class=\"reference internal\" href=\"https://datasette.readthedocs.io/en/latest/datasette.html#datasette-add-database\"><span class=\"std std-ref\">documentation</span></a>. (<a class=\"reference external\" href=\"https://github.com/simonw/datasette/issues/671\">#671</a>)</li>\n",
"<li><code class=\"docutils literal notranslate\"><span class=\"pre\">prepare_connection()</span></code> plugin hook now takes optional <code class=\"docutils literal notranslate\"><span class=\"pre\">datasette</span></code> and <code class=\"docutils literal notranslate\"><span class=\"pre\">database</span></code> arguments - <a class=\"reference internal\" href=\"https://datasette.readthedocs.io/en/latest/plugins.html#plugin-hook-prepare-connection\"><span class=\"std std-ref\">prepare_connection(conn, database, datasette)</span></a>. (<a class=\"reference external\" href=\"https://github.com/simonw/datasette/issues/678\">#678</a>)</li>\n",
"<li>Added three new plugins and one new conversion tool to the <a class=\"reference internal\" href=\"https://datasette.readthedocs.io/en/latest/ecosystem.html#ecosystem\"><span class=\"std std-ref\">The Datasette Ecosystem</span></a>.</li>\n",
"</ul>\n",
"</div>\n",
"</body></html>\n"
]
}
],
"source": [
"print(soup)"
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'0.36 (2020-02-21)¶'"
]
},
"execution_count": 49,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Now strip out that h2\n",
"h2 = soup.find(\"h2\")\n",
"title = h2.text\n",
"h2.extract()\n",
"title"
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<html><body><div class=\"section\" id=\"v0-36\">\n",
"<span id=\"id2\"></span>\n",
"<ul class=\"simple\">\n",
"<li>The <code class=\"docutils literal notranslate\"><span class=\"pre\">datasette</span></code> object passed to plugins now has API documentation: <a class=\"reference internal\" href=\"https://datasette.readthedocs.io/en/latest/datasette.html#datasette\"><span class=\"std std-ref\">Datasette class</span></a>. (<a class=\"reference external\" href=\"https://github.com/simonw/datasette/issues/576\">#576</a>)</li>\n",
"<li>New methods on <code class=\"docutils literal notranslate\"><span class=\"pre\">datasette</span></code>: <code class=\"docutils literal notranslate\"><span class=\"pre\">.add_database()</span></code> and <code class=\"docutils literal notranslate\"><span class=\"pre\">.remove_database()</span></code> - <a class=\"reference internal\" href=\"https://datasette.readthedocs.io/en/latest/datasette.html#datasette-add-database\"><span class=\"std std-ref\">documentation</span></a>. (<a class=\"reference external\" href=\"https://github.com/simonw/datasette/issues/671\">#671</a>)</li>\n",
"<li><code class=\"docutils literal notranslate\"><span class=\"pre\">prepare_connection()</span></code> plugin hook now takes optional <code class=\"docutils literal notranslate\"><span class=\"pre\">datasette</span></code> and <code class=\"docutils literal notranslate\"><span class=\"pre\">database</span></code> arguments - <a class=\"reference internal\" href=\"https://datasette.readthedocs.io/en/latest/plugins.html#plugin-hook-prepare-connection\"><span class=\"std std-ref\">prepare_connection(conn, database, datasette)</span></a>. (<a class=\"reference external\" href=\"https://github.com/simonw/datasette/issues/678\">#678</a>)</li>\n",
"<li>Added three new plugins and one new conversion tool to the <a class=\"reference internal\" href=\"https://datasette.readthedocs.io/en/latest/ecosystem.html#ecosystem\"><span class=\"std std-ref\">The Datasette Ecosystem</span></a>.</li>\n",
"</ul>\n",
"</div>\n",
"</body></html>\n"
]
}
],
"source": [
"print(soup)"
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<div id=\"v0-36\" class=\"section\">\n",
"\n",
"<span id=\"id2\"></span>\n",
"\n",
" - The `datasette` object passed to plugins now has API documentation:\n",
" [<span class=\"std std-ref\">Datasette\n",
" class</span>](https://datasette.readthedocs.io/en/latest/datasette.html#datasette).\n",
" ([\\#576](https://github.com/simonw/datasette/issues/576))\n",
" - New methods on `datasette`: `.add_database()` and\n",
" `.remove_database()` -\n",
" [<span class=\"std std-ref\">documentation</span>](https://datasette.readthedocs.io/en/latest/datasette.html#datasette-add-database).\n",
" ([\\#671](https://github.com/simonw/datasette/issues/671))\n",
" - `prepare_connection()` plugin hook now takes optional `datasette`\n",
" and `database` arguments -\n",
" [<span class=\"std std-ref\">prepare\\_connection(conn, database,\n",
" datasette)</span>](https://datasette.readthedocs.io/en/latest/plugins.html#plugin-hook-prepare-connection).\n",
" ([\\#678](https://github.com/simonw/datasette/issues/678))\n",
" - Added three new plugins and one new conversion tool to the\n",
" [<span class=\"std std-ref\">The Datasette\n",
" Ecosystem</span>](https://datasette.readthedocs.io/en/latest/ecosystem.html#ecosystem).\n",
"\n",
"</div>\n",
"\n"
]
}
],
"source": [
"print(pypandoc.convert_text(str(soup), 'gfm', format='html'))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"I think this is good enough: https://gist.github.com/simonw/bec7efcee94b4fe7215ddb95ee6a3ec1#file-final-md\n",
"\n",
"I wonder if stripping id= attributes will get rid of those div and spans?"
]
},
{
"cell_type": "code",
"execution_count": 58,
"metadata": {},
"outputs": [],
"source": [
"for el in soup.findAll():\n",
" if \"id\" in el.attrs:\n",
" del el.attrs[\"id\"]"
]
},
{
"cell_type": "code",
"execution_count": 59,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<div class=\"section\">\n",
"\n",
"<span></span>\n",
"\n",
" - The `datasette` object passed to plugins now has API documentation:\n",
" [<span class=\"std std-ref\">Datasette\n",
" class</span>](https://datasette.readthedocs.io/en/latest/datasette.html#datasette).\n",
" ([\\#576](https://github.com/simonw/datasette/issues/576))\n",
" - New methods on `datasette`: `.add_database()` and\n",
" `.remove_database()` -\n",
" [<span class=\"std std-ref\">documentation</span>](https://datasette.readthedocs.io/en/latest/datasette.html#datasette-add-database).\n",
" ([\\#671](https://github.com/simonw/datasette/issues/671))\n",
" - `prepare_connection()` plugin hook now takes optional `datasette`\n",
" and `database` arguments -\n",
" [<span class=\"std std-ref\">prepare\\_connection(conn, database,\n",
" datasette)</span>](https://datasette.readthedocs.io/en/latest/plugins.html#plugin-hook-prepare-connection).\n",
" ([\\#678](https://github.com/simonw/datasette/issues/678))\n",
" - Added three new plugins and one new conversion tool to the\n",
" [<span class=\"std std-ref\">The Datasette\n",
" Ecosystem</span>](https://datasette.readthedocs.io/en/latest/ecosystem.html#ecosystem).\n",
"\n",
"</div>\n",
"\n"
]
}
],
"source": [
"print(pypandoc.convert_text(str(soup), 'gfm', format='html'))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"I think I want to disable the `raw_html` option which is included in the `gfm` bundle."
]
},
{
"cell_type": "code",
"execution_count": 61,
"metadata": {},
"outputs": [
{
"ename": "RuntimeError",
"evalue": "Pandoc died with exitcode \"6\" during conversion: b'Unknown option --raw_html.\\nTry pandoc --help for more information.\\n'",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mRuntimeError\u001b[0m Traceback (most recent call last)",
"\u001b[0;32m<ipython-input-61-5acb1858a48a>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mprint\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mpypandoc\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mconvert_text\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mstr\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0msoup\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'gfm'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mformat\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m'html'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mextra_args\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m\"--raw_html\"\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
"\u001b[0;32m~/jupyter-venv/lib/python3.7/site-packages/pypandoc/__init__.py\u001b[0m in \u001b[0;36mconvert_text\u001b[0;34m(source, to, format, extra_args, encoding, outputfile, filters)\u001b[0m\n\u001b[1;32m 101\u001b[0m \u001b[0msource\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0m_as_unicode\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0msource\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mencoding\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 102\u001b[0m return _convert_input(source, format, 'string', to, extra_args=extra_args,\n\u001b[0;32m--> 103\u001b[0;31m outputfile=outputfile, filters=filters)\n\u001b[0m\u001b[1;32m 104\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 105\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m~/jupyter-venv/lib/python3.7/site-packages/pypandoc/__init__.py\u001b[0m in \u001b[0;36m_convert_input\u001b[0;34m(source, format, input_type, to, extra_args, outputfile, filters)\u001b[0m\n\u001b[1;32m 323\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mreturncode\u001b[0m \u001b[0;34m!=\u001b[0m \u001b[0;36m0\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 324\u001b[0m raise RuntimeError(\n\u001b[0;32m--> 325\u001b[0;31m \u001b[0;34m'Pandoc died with exitcode \"%s\" during conversion: %s'\u001b[0m \u001b[0;34m%\u001b[0m \u001b[0;34m(\u001b[0m\u001b[0mp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mreturncode\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mstderr\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 326\u001b[0m )\n\u001b[1;32m 327\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;31mRuntimeError\u001b[0m: Pandoc died with exitcode \"6\" during conversion: b'Unknown option --raw_html.\\nTry pandoc --help for more information.\\n'"
]
}
],
"source": [
"print(pypandoc.convert_text(str(soup), 'gfm', format='html', extra_args=[\"--raw_html\"]))"
]
},
{
"cell_type": "code",
"execution_count": 62,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'/Users/simonw/Applications/pandoc/pandoc'"
]
},
"execution_count": 62,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pypandoc.get_pandoc_path()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" ~ $ /Users/simonw/Applications/pandoc/pandoc --list-extensions=gfm\n",
" +all_symbols_escapable\n",
" -ascii_identifiers\n",
" +auto_identifiers\n",
" +autolink_bare_uris\n",
" +backtick_code_blocks\n",
" -east_asian_line_breaks\n",
" +emoji\n",
" +fenced_code_blocks\n",
" +gfm_auto_identifiers\n",
" -hard_line_breaks\n",
" +intraword_underscores\n",
" +lists_without_preceding_blankline\n",
" +pipe_tables\n",
" +raw_html\n",
" -raw_tex\n",
" +shortcut_reference_links\n",
" -smart\n",
" +space_in_atx_header\n",
" +strikeout\n",
" +task_lists\n",
"\n",
"https://pandoc.org/MANUAL.html#extensions\n",
"\n",
"> An extension can be enabled by adding +EXTENSION to the format name and disabled by adding -EXTENSION. For example, --from markdown_strict+footnotes is strict Markdown with footnotes enabled, while --from markdown-footnotes-pipe_tables is pandoc’s Markdown without footnotes or pipe tables."
]
},
{
"cell_type": "code",
"execution_count": 63,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" - The `datasette` object passed to plugins now has API documentation:\n",
" [Datasette\n",
" class](https://datasette.readthedocs.io/en/latest/datasette.html#datasette).\n",
" ([\\#576](https://github.com/simonw/datasette/issues/576))\n",
" - New methods on `datasette`: `.add_database()` and\n",
" `.remove_database()` -\n",
" [documentation](https://datasette.readthedocs.io/en/latest/datasette.html#datasette-add-database).\n",
" ([\\#671](https://github.com/simonw/datasette/issues/671))\n",
" - `prepare_connection()` plugin hook now takes optional `datasette`\n",
" and `database` arguments - [prepare\\_connection(conn, database,\n",
" datasette)](https://datasette.readthedocs.io/en/latest/plugins.html#plugin-hook-prepare-connection).\n",
" ([\\#678](https://github.com/simonw/datasette/issues/678))\n",
" - Added three new plugins and one new conversion tool to the [The\n",
" Datasette\n",
" Ecosystem](https://datasette.readthedocs.io/en/latest/ecosystem.html#ecosystem).\n",
"\n"
]
}
],
"source": [
"print(pypandoc.convert_text(str(soup), 'gfm-raw_html', format='html'))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"https://gist.github.com/simonw/bec7efcee94b4fe7215ddb95ee6a3ec1#file-gfm-raw_html-md\n",
"\n",
"This is **great**! I'd like to ditch the line wrapping though."
]
},
{
"cell_type": "code",
"execution_count": 64,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" - The `datasette` object passed to plugins now has API documentation: [Datasette class](https://datasette.readthedocs.io/en/latest/datasette.html#datasette). ([\\#576](https://github.com/simonw/datasette/issues/576))\n",
" - New methods on `datasette`: `.add_database()` and `.remove_database()` - [documentation](https://datasette.readthedocs.io/en/latest/datasette.html#datasette-add-database). ([\\#671](https://github.com/simonw/datasette/issues/671))\n",
" - `prepare_connection()` plugin hook now takes optional `datasette` and `database` arguments - [prepare\\_connection(conn, database, datasette)](https://datasette.readthedocs.io/en/latest/plugins.html#plugin-hook-prepare-connection). ([\\#678](https://github.com/simonw/datasette/issues/678))\n",
" - Added three new plugins and one new conversion tool to the [The Datasette Ecosystem](https://datasette.readthedocs.io/en/latest/ecosystem.html#ecosystem).\n",
"\n"
]
}
],
"source": [
"print(pypandoc.convert_text(str(soup), 'gfm-raw_html', format='html', extra_args=[\"--wrap=none\"]))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, try it again on the larger example."
]
},
{
"cell_type": "code",
"execution_count": 65,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"## 0.31 (2019-11-11)[¶](#v0-31 \"Permalink to this headline\")\n",
"\n",
"This version adds compatibility with Python 3.8 and breaks compatibility with Python 3.5.\n",
"\n",
"If you are still running Python 3.5 you should stick with `0.30.2`, which you can install like this:\n",
"\n",
" pip install datasette==0.30.2\n",
"\n",
" - Format SQL button now works with read-only SQL queries - thanks, Tobias Kunze ([\\#602](https://github.com/simonw/datasette/pull/602))\n",
" - New `?column__notin=x,y,z` filter for table views ([\\#614](https://github.com/simonw/datasette/issues/614))\n",
" - Table view now uses `select col1, col2, col3` instead of `select *`\n",
" - Database filenames can now contain spaces - thanks, Tobias Kunze ([\\#590](https://github.com/simonw/datasette/pull/590))\n",
" - Removed obsolete `?_group_count=col` feature ([\\#504](https://github.com/simonw/datasette/issues/504))\n",
" - Improved user interface and documentation for `datasette publish cloudrun` ([\\#608](https://github.com/simonw/datasette/issues/608))\n",
" - Tables with indexes now show the `CREATE INDEX` statements on the table page ([\\#618](https://github.com/simonw/datasette/issues/618))\n",
" - Current version of [uvicorn](https://www.uvicorn.org/) is now shown on `/-/versions`\n",
" - Python 3.8 is now supported\\! ([\\#622](https://github.com/simonw/datasette/issues/622))\n",
" - Python 3.5 is no longer supported.\n",
"\n"
]
}
],
"source": [
"print(pypandoc.convert_text(bigger_html, 'gfm-raw_html', format='html', extra_args=[\"--wrap=none\"]))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Nailed it!"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.5"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
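
A note on the link-rewriting cell in the notebook above: prefixing every relative `href` with the docs root turns the permalink `#v0-36` into `.../en/latest/#v0-36`, which points at the docs index rather than the changelog page. That is harmless there because the `<h2>` is removed afterwards, but a fragment-only link inside the body would be affected. A minimal sketch of one alternative, assuming the section was scraped from `changelog.html` (the names `CHANGELOG_URL` and `absolutize` are hypothetical and not part of the notebook), is to resolve each `href` against the page URL with the standard library's `urljoin`:

```python
from urllib.parse import urljoin

# Assumption: the changelog section was scraped from this page.
CHANGELOG_URL = "https://datasette.readthedocs.io/en/latest/changelog.html"


def absolutize(href, page_url=CHANGELOG_URL):
    """Resolve href against the page it appeared on, as a browser would."""
    return urljoin(page_url, href)


# hrefs taken from the changelog HTML used in the notebook
hrefs = [
    "#v0-36",
    "datasette.html#datasette",
    "plugins.html#plugin-hook-prepare-connection",
    "https://github.com/simonw/datasette/issues/576",
]
resolved = [absolutize(h) for h in hrefs]
for url in resolved:
    print(url)
```

Because `urljoin` leaves absolute URLs unchanged, the `http://` / `https://` filter is not needed with this approach; with BeautifulSoup it would amount to `a["href"] = absolutize(a["href"])` for every `<a>` element.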