@DirkR
Created February 19, 2014 11:12
convert_to_markdown.sh converts the wget dump of a website into a collection of Markdown files.
#!/bin/bash
IN_DIR=~/Downloads/website_raw
OUT_DIR=content/pages
TITLE_STRING_SEP=' | '
XIDEL="xidel --input-format html"
PANDOC="pandoc -f html -t markdown_mmd"
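# Create the output directory if it does not exist yet.
[ -d "$OUT_DIR" ] || mkdir -p "$OUT_DIR"
# Walk the English and German pages, skipping anything under /node/.
# Paths containing whitespace are not supported by this loop.
for file in $(cd "$IN_DIR" && find en de -name '*.html' | grep -v '/node/')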
do
TITLE=$($XIDEL --extract '//title' "$IN_DIR/$file" | sed -e "s/$TITLE_STRING_SEP.*$//")
TEXT=$($XIDEL --extract "//article" --output-format html "$IN_DIR/$file" | $PANDOC)
#eval $($XIDEL -e 'TITLE:=//title' --output-format bash)
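# Use PAGE_LANG rather than LANG so the locale variable LANG is not overwritten.
PAGE_LANG=${file:0:2}
# Strip only the trailing .html extension.
FILE_BASENAME=${file%.html}
SLUG=${FILE_BASENAME:3}
MD_FILE=${FILE_BASENAME}.md
CONTAINER=$(dirname "$OUT_DIR/$MD_FILE")
[ -d "$CONTAINER" ] || mkdir -p "$CONTAINER"
cat <<-EOT > "$OUT_DIR/$MD_FILE"
title: $TITLE
slug: $SLUG
lang: $PAGE_LANG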
$TEXT
EOT
done
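
The title and path handling inside the loop can be tried on their own, without xidel or pandoc. The following is a minimal sketch that uses POSIX parameter expansion, so it also runs under plain sh. The function names `strip_title_suffix` and `derive_page_fields`, and the sample title and path, are hypothetical and not part of the script above.

```shell
#!/bin/sh
# Sketch only: these function names are hypothetical, not part of the gist.

# Drop everything from the first ' | ' separator onward,
# mirroring the sed call used for TITLE in the loop.
strip_title_suffix() {
    printf '%s\n' "$1" | sed -e 's/ | .*$//'
}

# Split a relative path such as "en/about/team.html" into the values
# the loop writes out. The body runs in a subshell, so no variables leak.
derive_page_fields() (
    file=$1
    base=${file%.html}       # remove the trailing .html only
    page_lang=${file%%/*}    # first path component, e.g. "en" or "de"
    slug=${base#*/}          # everything after the language directory
    printf 'lang=%s\nslug=%s\nmd_file=%s\n' "$page_lang" "$slug" "$base.md"
)

strip_title_suffix "About us | Example Site"
derive_page_fields "en/about/team.html"
```

Under these assumptions the sample calls print `About us`, then `lang=en`, `slug=about/team` and `md_file=en/about/team.md`. Taking the language from the first path component, rather than from the first two characters, would also cope with language codes that are not exactly two letters long.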