Last active October 23, 2016 00:43
This quick and dirty script imports posts and images exported by the Posterous backup feature into Octopress. Requires the escape_utils and nokogiri gems. Doesn't import comments. See comments below the gist for more instructions.
#!/usr/bin/env ruby
# This quick and dirty script imports posts and images exported by the
# Posterous backup feature into Octopress. Requires the escape_utils and
# nokogiri gems. Doesn't import comments.
# Videos and images are copied into a post-specific image directory used
# by my customized Octopress setup. Encoded videos are downloaded from
# Posterous. Images will probably need to be compressed/optimized afterward.
# The script tries to convert links to other posts in the same import. You will
# need to edit the generate_* functions below if your permalink format is
# different from /:year/:month/:day/:title/.
# Links, images, videos, special characters/question marks, etc. should be
# verified after running this script.
# Posterous seems to have broken any UTF-8 characters in the exported
# wordpress_export_1.xml, but you can work around this by concatenating all the
# *.xml files under posts/ and replacing all <item> tags in
# wordpress_export_1.xml with the concatenated <item> tags from posts/*.xml.
# You may also want to remove all CR characters from the .xml file first.
# Run from the base directory of your Octopress setup.
# Usage:
# cd [octopress_base_dir]
# ./posterous_import.rb /path/to/wordpress_export_1.xml [base_path]
# ./posterous_import.rb --links /path/to/wordpress_export_1.xml [base_path]
# base_path is the base path of your blog's URLs (e.g. '/' or '/blog').
# The --links invocation generates a directory and index.html under source/ for
# each Posterous permalink, allowing an old Posterous domain to be set up with
# 301 redirects to new post locations. The --links invocation does not import
# any posts. This is useful if you use a permalink format that differs from
# Posterous's (which is the default behavior).
# This script is not guaranteed to work with any Posterous archive other than
# my own. Do what you want with this script; attribution is appreciated, but
# optional. Comments and corrections are welcome.
# In hindsight it may have been easier to fix up the archived HTML posts or
# individual XML files instead of using the RSS feed.
# Created 2013 by Mike Bourgeous - Released under CC0
require 'rss'
require 'yaml'
require 'fileutils'
require 'escape_utils'
require 'nokogiri'
# Fixes references to Posterous in document tags of the given type. Only
# attributes that appear to contain a Posterous URL will be processed.
# If no block is given, tries to find a file matching the tag's attribute under
# [srcdir], or if [srcdir] is nil, downloads the URI contained in [attr]. The
# matching file, if one is found, will be copied into [destdir], and the tag's
# [attr] attribute changed to point at [serverdir]/filename. Posterous image
# name abbreviation is taken into account, but this has not been tested with a
# wide variety of names.
# If a block is given, the block will be called once for each matching tag and
# the contents of its [attr] attribute, and the return value of the block used
# to replace the tag's [attr] attribute.
# After the attribute is updated, an immediately surrounding <a> tag linking to
# Posterous, if one exists, will be removed.
# doc - The parsed Nokogiri document.
# srcdir - The directory in which to find replacement files, or nil to download
# the originals.
# destdir - The directory to which to copy replacement files.
# serverdir - The name of destdir on the server (used for updating image tags).
# tag - The name of the tags to update.
# attr - The attribute of the tags to update.
def fix_sources doc, srcdir, destdir, serverdir, tag='img', attr='src', &bl
  puts "\tFixing #{tag} tags' #{attr} attribute"
  tags = doc.css(tag)
  # Assumption: Posterous-hosted URLs are identified by a posterous.com host
  postregex = %r{https?://[^/]*posterous\.com}
  tags.each do |img|
    next unless img[attr] =~ postregex
    shortname = img[attr].split('/').last.split('.scaled').first
    ext = shortname.split('.').last.downcase
    puts "\t#{tag}: #{shortname}"
    if block_given?
      img[attr] = yield img, img[attr]
    else
      FileUtils.mkdir_p(destdir)
      if srcdir == nil
        # Download the file (assumption: a plain Net::HTTP GET is sufficient)
        require 'net/http'
        puts "\t\tDownloading #{shortname}"
        File.open(File.join(destdir, shortname), "w") do |file|
          file.write(Net::HTTP.get(URI.parse(img[attr])))
        end
        in_img = shortname
      else
        # Find matching files
        matches = Dir.entries(srcdir).select {|imgfile|
          imgfile.downcase.end_with?(ext) &&
            imgfile.gsub(/\s+/, '_').include?(shortname.split('.').first)
        }
        if matches.length == 0
          # Retry without requiring the extension to match
          matches = Dir.entries(srcdir).select {|imgfile|
            imgfile.gsub(/\s+/, '_').include?(shortname.split('.').first)
          }
        end
        if matches.length == 0
          puts "\n\n\n########\nNo match found for #{img[attr]} in #{srcdir}\n########\n\n"
          next
        end
        if matches.length > 1
          # Assumption: prefer names that end with the exact short name
          reduced = matches.select {|imgfile|
            imgfile.gsub(/\s+/, '_').end_with?(shortname)
          }
          if reduced.length == 1
            matches = reduced
          else
            puts "\n\n\n########\nMore than one match found for #{shortname}:"
            puts matches
            puts "You will need to double-check #{tag} tags in the generated post\n\n"
          end
        end
        in_img = matches.first
        puts "\t\tUsing #{in_img} for #{shortname}"
        # Copy the file into the destination directory
        FileUtils.cp(File.join(srcdir, in_img), destdir)
      end
      # Update the tag's attribute
      img[attr] = EscapeUtils.escape_uri(File.join(serverdir, in_img))
    end
    # Remove a link wrapping the image, if one exists
    parent = img.parent
    if parent.node_name == 'a' && parent['href'] =~ postregex
      puts "\t\tRemoving parent link: #{parent['href']}"
      parent.replace(img)
    end
  end
end
# Writes each item from the given RSS feed into ./source/_posts (use Dir.chdir
# to change directories first if necessary). Posts will be marked as
# unpublished if the post's link starts with '/private/'.
# rss_file - The path to the RSS feed file. The images will be found relative to
# the feed.
# basedir - The server directory in which the blog's posts and images/
# directory reside.
def generate_posts rss_file, basedir='/'
  basedir = "/#{basedir}" unless basedir.start_with? '/'
  basedir = "#{basedir}/" unless basedir.end_with? '/'
  dir = File.dirname(File.expand_path(rss_file))
  rss = File.read(rss_file)
  feed = RSS::Parser.parse(rss, false)
  # Map each post's permalink slug to its feed item and output filename
  item_map = Hash[*feed.items.map{|item|
    link = item.link.split('/').last
    [link, {:item => item, :filename => item.pubDate.strftime("source/_posts/%Y-%m-%d-#{link}.html")}]
  }.flatten(1)]
  feed.items.each do |item|
    post_uri = URI.parse(item.link)
    permalink = item.link.split('/').last
    filename = item_map[permalink][:filename]
    date = item.pubDate
    header = {
      'layout' => "post",
      'title' => item.title,
      'date' => date,
      'comments' => true,
      'categories' => item.categories.select{|cat| cat.domain == "tag"}.map{|cat| cat.content},
      'published' => !post_uri.path.start_with?('/private/')
    }
    puts "Generating #{filename}#{header['published'] ? '' : ' (unpublished)'}"
    imgdir = "source/images/#{date.strftime('%Y/%m/%d')}/#{permalink}/"
    serverdir = '/' + imgdir.split('/', 2).last
    outfile = File.open(filename, "w")
    outfile.puts header.to_yaml
    outfile.puts "---"
    # Fix up images and video
    html = Nokogiri::HTML("<div id=\"import_#{permalink}\">#{EscapeUtils.unescape_html(item.content_encoded)}</div>")
    images = html.css('img')
    fix_sources html, date.strftime("#{dir}/image/%Y/%m"), imgdir, serverdir
    fix_sources html, nil, imgdir, serverdir, 'source'
    fix_sources html, nil, nil, nil, 'video', 'poster' do nil end
    # Fix up links to other posts
    fix_sources html, nil, nil, nil, 'a', 'href' do |tag, href|
      link_uri = URI.parse(href)
      # Leave links to other hosts untouched
      next href unless link_uri.host == post_uri.host
      link_shortname = href.split('/').last.split('#').first
      if item_map.include? link_shortname
        link = item_map[link_shortname][:item]
        href = link.pubDate.strftime("#{basedir}%Y/%m/%d/#{link_shortname}/")
        href += "##{link_uri.fragment}" if link_uri.fragment
        puts "\t\tUsing #{link.title} (#{href})"
      else
        puts "\t######## No match found for #{href}"
      end
      href
    end
    outfile.puts html.css("div#import_#{permalink}").map{|node| node.to_html}.join
    outfile.close
  end
end
# Generates a redirecting link from the permalink of each item from the given
# RSS feed to the corresponding post generated by generate_posts().
# rss_file - The path to the RSS feed file.
# basedir - The server directory in which the blog's posts and images/
# directory reside.
def generate_links rss_file, basedir='/'
  basedir = "/#{basedir}" unless basedir.start_with? '/'
  basedir = "#{basedir}/" unless basedir.end_with? '/'
  dir = File.dirname(File.expand_path(rss_file))
  rss = File.read(rss_file)
  feed = RSS::Parser.parse(rss, false)
  # Map each post's permalink slug to its feed item and redirect page filename
  item_map = Hash[*feed.items.map{|item|
    link = item.link.split('/').last
    [link, {:item => item, :filename => item.pubDate.strftime("source/#{link}/index.html")}]
  }.flatten(1)]
  feed.items.each do |item|
    post_uri = URI.parse(item.link)
    permalink = item.link.split('/').last
    filename = item_map[permalink][:filename]
    dirname = File.dirname(filename)
    href = item.pubDate.strftime("#{basedir}%Y/%m/%d/#{permalink}/")
    title = item.title
    # Create the per-permalink directory before writing its index.html
    FileUtils.mkdir_p(dirname)
    outfile = File.open(filename, "w")
    outfile.write <<-HTML
<!DOCTYPE html>
<meta http-equiv="Refresh" content="0; url=#{href}">
<link href="#{basedir}stylesheets/screen.css" rel="stylesheet" type="text/css">
<a style="color: inherit; text-decoration: none" href="#{href}">#{title}</a>
    HTML
    outfile.close
  end
end
if __FILE__ == $0
  # ARGV is always defined; $ARGV only exists after require 'English'
  raise 'No RSS feed given' unless ARGV.length > 0
  if ARGV[0] == '--links'
    raise 'No RSS feed given' unless ARGV.length > 1
    generate_links ARGV[1], ARGV[2] || '/'
  else
    generate_posts ARGV[0], ARGV[1] || '/'
  end
end
A bit of manual work still has to be done before and after running my script, such as gathering the XML files and minifying images. To generate the single XML file my script needs (since wordpress_export_1.xml has all UTF-8 characters replaced with question marks), I would do something like this:

cd /path/to/space-[numbers, name, etc.]
cat head.xml posts/*.xml > fixed_export.xml
echo '</channel></rss>' >> fixed_export.xml
cd /path/to/new/blog
./posterous_import.rb /path/to/space-[numbers, name, etc.]/fixed_export.xml
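
The shell steps above could also be done in Ruby. This is only a minimal sketch and not part of the original script: `build_fixed_export` is a hypothetical helper name, and it assumes `head.xml` already holds everything from wordpress_export_1.xml that comes before the first <item>. It also strips CR characters, as the script's header comment suggests.

```ruby
# Hypothetical helper (not in the original script): concatenates head.xml and
# posts/*.xml from a Posterous backup directory into one feed file, dropping
# CR characters along the way. Files are read and written as raw bytes so the
# UTF-8 content of the per-post files is passed through untouched.
def build_fixed_export(space_dir, out_name = 'fixed_export.xml')
  posts_dir = File.join(space_dir, 'posts')
  # Sort explicitly so the order matches what the shell glob would produce
  post_files = Dir.glob('*.xml', base: posts_dir).sort.map { |name| File.join(posts_dir, name) }
  parts = [File.join(space_dir, 'head.xml')] + post_files
  out_path = File.join(space_dir, out_name)
  File.open(out_path, 'wb') do |out|
    parts.each { |part| out.write(File.binread(part).delete("\r")) }
    # Close the elements that head.xml is assumed to leave open
    out.puts '</channel></rss>'
  end
  out_path
end
```

Calling `build_fixed_export('/path/to/space-[numbers, name, etc.]')` would write fixed_export.xml next to head.xml and return its path, which can then be passed to posterous_import.rb. The `base:` option of `Dir.glob` needs Ruby 2.5 or newer.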

Thanks a lot! This great script saved me a lot of time!

This seems to be giving me a problem with Posterous posts that were archived. What seems to be happening is it is reading the wordpress_export_1.xml file, and that is referencing a post in 2010-05, but the earliest date in the images directory is 2010-07.

Not quite sure how to approach this.


OK, here is something else I have learned. This is a snippet from one of my posts:

<h3>Know when to change tables - by Tony Hsieh (CEO of Zappos)</h3>
<div class='post_info'>
<span class='post_time'>June 21 2010, 11:46 PM</span>
<span class='author'>&nbsp;by Marc Gayle</span>
<div class='post_body'><p><div class='p_embed p_image_embed'>
<img src='../../../image/2010/07/11605730-media_httpfarm3static_mAyIi.jpg'>

The filename of the image is also specified in fixed_export.xml, as can be seen here:

<content:encoded><![CDATA[<p><div class='p_embed p_image_embed'>
<img alt="Media_httpfarm3static_mayii" height="375" src="" width="500" />

This is the error that parsing this file generated:

Generating source/_posts/2010-06-22-know-when-to-change-tables-by-tony-hsieh-ceo-of-zappos.html
    Fixing img tags' src attribute
    img: media_httpfarm3static_mAyIi.jpg
/Dropbox/My Blog/posterous_import.rb:101:in `open': No such file or directory - /Dropbox/My Blog/Marc Gayle/image/2010/06 (Errno::ENOENT)
    from /Dropbox/My Blog/posterous_import.rb:101:in `entries'
    from /Dropbox/My Blog/posterous_import.rb:101:in `block in fix_sources'

So the trick is, when the image is not found at the default image/year/month path, to either search the directory structure for the filename, or to actually find the path within the individual html file included in the archive - in this case <img src='../../../image/2010/07/11605730-media_httpfarm3static_mAyIi.jpg'>.

Any thoughts on the best way to approach this?
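
One possible approach to the first option, searching the whole image directory for the filename, is sketched below. This is an assumption-laden sketch rather than what the script or any fork actually does: `find_image_anywhere` is a hypothetical helper, and it reuses the script's loose matching rule (same extension, and the name contains the short name's stem).

```ruby
# Hypothetical fallback (not in the original script): look for an exported
# image anywhere under image_root instead of only in image/<year>/<month>.
# Returns the full paths of every file whose extension matches and whose
# name (with whitespace replaced by underscores) contains the stem of shortname.
def find_image_anywhere(image_root, shortname)
  stem = File.basename(shortname, '.*')
  ext = File.extname(shortname).downcase
  # base: keeps special characters in image_root out of the glob pattern
  candidates = Dir.glob(File.join('**', '*'), base: image_root).sort
  candidates.map { |rel| File.join(image_root, rel) }.select do |path|
    File.file?(path) &&
      File.extname(path).downcase == ext &&
      File.basename(path).gsub(/\s+/, '_').include?(stem)
  end
end
```

With the archive from the example above, `find_image_anywhere('/path/to/archive/image', 'media_httpfarm3static_mAyIi.jpg')` should pick up the copy stored under image/2010/07 even though the post is dated 2010-06. If it returns more than one path, the caller still has to choose between them. The `base:` option needs Ruby 2.5 or newer.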

For what it's worth, I have forked this and updated it to fix the issues I was having.

