@tomo-makes
Last active February 16, 2024 02:33
Converting a Word file (.docx) to Markdown
$ pandoc -s <input>.docx --wrap=none --reference-links --extract-media=media -t gfm --filter ./despan.py -o <output>.md

After trying a number of variations, this command worked best, for the following reasons.

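  • --wrap=none stops pandoc from wrapping lines on its own (by default it wraps)
  • --reference-links emits reference-style links instead of inline links
  • --extract-media=media extracts PNGs and other media embedded in the docx
  • -t gfm outputs GitHub-flavored Markdown (with pandoc's default markdown, tables come out in a different format)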
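When several files need converting, the options above can be assembled in a small script. The following is a minimal sketch, assuming pandoc is on the PATH and despan.py sits in the current directory; `build_pandoc_command` and `convert` are hypothetical helper names, not part of pandoc.

```python
import subprocess
from pathlib import Path


def build_pandoc_command(docx_path, filter_path="./despan.py", media_dir="media"):
    """Return the pandoc argument list for one .docx file.

    The output file is assumed to be the input name with a .md suffix.
    """
    src = Path(docx_path)
    return [
        "pandoc", "-s", str(src),
        "--wrap=none",                    # no automatic line wrapping
        "--reference-links",              # reference-style links
        f"--extract-media={media_dir}",   # dump embedded images here
        "-t", "gfm",                      # GitHub-flavored Markdown output
        "--filter", filter_path,          # strip anchor spans
        "-o", str(src.with_suffix(".md")),
    ]


def convert(docx_path):
    """Run pandoc on one file; raises CalledProcessError if pandoc fails."""
    subprocess.run(build_pandoc_command(docx_path), check=True)
```

Calling `convert(p)` for each `p` in `Path(".").glob("*.docx")` would convert every Word file in the current directory.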

ref: Pandoc - Pandoc User’s Guide

ref: How to remove title anchor when converting docx to markdown? · Issue #1893 · jgm/pandoc

  • Use a filter to remove the title anchors
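#!/usr/bin/env python
# copied from https://github.com/jgm/pandoc/issues/1893
# requires the pandocfilters package (pip install pandocfilters)
"""
despan.py
Pandoc filter that removes every Span element, which strips the
anchor spans pandoc emits in headings when converting a .docx file.
"""
from pandocfilters import toJSONFilter


def despan(key, value, format, meta):
    # Returning an empty list deletes the element, including any inline contents.
    if key == 'Span':
        return []
    # Falling through returns None, which leaves every other element unchanged.


if __name__ == "__main__":
    toJSONFilter(despan)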
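One caveat: because the filter returns an empty list for every Span, a Span that wraps text is removed together with that text. If that turns out to matter for a given document, the spans can be unwrapped instead of deleted. The following is a sketch using only the standard library, assuming the JSON AST layout pandoc uses from version 1.18 on (elements as `{"t": ..., "c": ...}` objects, with a Span's `c` being `[attr, inlines]`); `unwrap_spans` and `main` are hypothetical names.

```python
import json
import sys


def unwrap_spans(node):
    """Recursively replace every Span with its inline contents."""
    if isinstance(node, list):
        out = []
        for item in node:
            if isinstance(item, dict) and item.get("t") == "Span":
                # item["c"] is [attr, inlines]; keep only the inlines.
                out.extend(unwrap_spans(item["c"][1]))
            else:
                out.append(unwrap_spans(item))
        return out
    if isinstance(node, dict):
        return {key: unwrap_spans(value) for key, value in node.items()}
    return node  # strings, numbers, etc. pass through unchanged


def main():
    # pandoc sends the document as UTF-8 JSON on stdin and reads JSON from stdout.
    json.dump(unwrap_spans(json.load(sys.stdin.buffer)), sys.stdout)
```

To use this as a filter, call `main()` at the end of the script and pass the file to `--filter`. With pandocfilters, returning `value[1]` instead of `[]` in `despan` should have the same effect.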