Created November 25, 2011 18:29
WordPress.com (XML export) to Octopress importer. See for change notes.
# coding: utf-8
# Original File:
# Modified by Yu-Cheng Chuang <>
# Licensed under MIT License (same as the original file)
# This version of wordpressdotcom.rb is compatible
# with real-world WordPress.com export files. It:
# - Makes paragraphs (<p>) and line breaks (<br>)
#   with simple_format, borrowed from Ruby on Rails' ActionPack
#   (WordPress.com does not actually store <p> tags)
# - Removes <br> in <pre>, which is usually unnecessary
# - Decodes percent-encoded slugs (permalink_title) to avoid
#   double-encoding of non-ASCII characters,
#   e.g. if you have a post titled "café", WordPress.com
#   may have already escaped the slug to "caf%C3%A9".
#   In this case, if you don't decode it to the original form,
#   the filename will be double-encoded to "caf%25C3%25A9",
#   and so will the post URL (if you have :title in the URL format).
# - Disables Disqus comments for a post if commenting was disabled on that post.
# But it does not support:
# - [sourcecode language='blahblah'] blocks; please grep them out yourself.
# - Converting HTML to Markdown
require 'rubygems'
require 'hpricot'
require 'fileutils'
require 'psych'
require 'time'
require 'uri' # URI.decode is used when building post filenames
module Jekyll
  # This importer takes a wordpress.xml file, which can be exported from your
  # blog (/wp-admin/export.php).
  module WordpressDotCom
    # From ActionPack of Ruby on Rails
    def self.simple_format(text)
      text = '' if text.nil?
      start_tag = "<p>"
      text = text.to_str.dup # work on a copy so the caller's string is not mutated
      text.gsub!(/\r\n?/, "\n") # \r\n and \r -> \n
      text.gsub!(/\n\n+/, "</p>\n\n#{start_tag}") # 2+ newline -> paragraph
      text.gsub!(/([^>])(\n)([^\n<])/, '\1<br>\2\3') # 1 newline -> br, except right after a tag or before one
      text.insert 0, start_tag
      text << "</p>" # close the last paragraph
    end

    # Strips the <br> tags that simple_format adds inside <pre> blocks
    def self.remove_br_in_pre(text)
      doc = Hpricot(text)
      (doc/"pre br").remove
      doc.to_html
    end

    def self.process(filename = "wordpress.xml")
      import_count = Hash.new(0) # per-type counter, defaults to 0
      doc = Hpricot::XML(File.read(filename))

      (doc/:channel/:item).each do |item|
        title = item.at(:title).inner_text
        permalink_title = item.at('wp:post_name').inner_text
        # Fallback to "prettified" title if post_name is empty (can happen)
        if permalink_title == ""
          permalink_title = title.downcase.split.join('-')
        end

        date = Time.parse(item.at('wp:post_date').inner_text)
        status = item.at('wp:status').inner_text
        if status == "publish"
          published = true
        else
          published = false
        end

        comment_status = item.at('wp:comment_status').inner_text
        if comment_status == "open"
          comments = true
        else
          comments = false
        end

        type = item.at('wp:post_type').inner_text
        categories = (item/"category[@domain=category]").map{|c| c.inner_text}.reject{|c| c == 'Uncategorized'}.uniq
        tags = (item/"category[@domain=post_tag]").map{|t| t.inner_text}.uniq

        # Decode the slug so an already-escaped name is not escaped again.
        # Note: URI.decode was removed in Ruby 3.0.
        name = "#{date.strftime('%Y-%m-%d')}-#{URI.decode(permalink_title)}.html"
        header = {
          'layout' => type,
          'title' => title,
          'categories' => categories,
          'tags' => tags,
          'published' => published,
          'comments' => comments
        }

        FileUtils.mkdir_p "source/_#{type}s"
        File.open("source/_#{type}s/#{name}", "w") do |f|
          f.puts header.to_yaml
          f.puts '---'
          f.puts remove_br_in_pre(simple_format(item.at('content:encoded').inner_text))
        end

        import_count[type] += 1
      end

      import_count.each do |key, value|
        puts "Imported #{value} #{key}s"
      end
    end
  end
end
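
The double-encoding problem described in the header comment can be reproduced in isolation. The snippet below is only a sketch, not part of the importer: it uses `URI.encode_www_form_component` and `URI.decode_www_form_component` from Ruby's standard library as stand-ins (the importer itself calls `URI.decode`, which was removed in Ruby 3.0), and the variable names are made up for illustration.

```ruby
require 'uri'

# A slug as it may appear in wp:post_name for a post titled "café"
slug = "caf%C3%A9"

# Escaping the already-escaped slug turns every "%" into "%25"
double = URI.encode_www_form_component(slug)

# Decoding first restores the original characters...
decoded = URI.decode_www_form_component(slug)

# ...so a later escaping pass produces the single-encoded form again
single = URI.encode_www_form_component(decoded)

puts double  # caf%25C3%25A9
puts decoded # café
puts single  # caf%C3%A9
```

Note that `URI.decode_www_form_component` also turns `+` into a space, which the older `URI.decode` did not, so the two are not exact equivalents for slugs that contain a literal `+`.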