Last active October 23, 2016 00:43
This quick and dirty script imports posts and images exported by the Posterous backup feature into Octopress. Requires the escape_utils and nokogiri gems. Doesn't import comments. See comments below the gist for more instructions.
#!/usr/bin/env ruby
# This quick and dirty script imports posts and images exported by the
# Posterous backup feature into Octopress. Requires the escape_utils and
# nokogiri gems. Doesn't import comments.
# Videos and images are copied into a post-specific image directory used
# by my customized Octopress setup. Encoded videos are downloaded from
# Posterous. Images will probably need to be compressed/optimized afterward.
# The script tries to convert links to other posts in the same import. You
# will need to edit the generate_* functions below if your permalink format is
# different from /:year/:month/:day/:title/.
# Links, images, videos, special characters/question marks, etc. should be
# verified after running this script.
# Posterous seems to have broken any UTF-8 characters in the exported
# wordpress_export_1.xml, but you can work around this by concatenating all the
# *.xml files under posts/ and replacing all <item> tags in
# wordpress_export_1.xml with the concatenated <item> tags from posts/*.xml.
# You may also want to remove all CR characters from the .xml file first.
# Run from the base directory of your Octopress setup.
# Usage:
# cd [octopress_base_dir]
# ./posterous_import.rb /path/to/wordpress_export_1.xml [base_path]
# ./posterous_import.rb --links /path/to/wordpress_export_1.xml [base_path]
# base_path is the base path of your blog's URLs (e.g. '/' or '/blog').
# The --links invocation generates a directory and index.html under source/ for
# each Posterous permalink, allowing an old Posterous domain to be set up with
# 301 redirects to new post locations. The --links invocation does not import
# any posts. This is useful if your permalink format differs from Posterous's
# (as it does with the default settings).
# This script is not guaranteed to work with any Posterous archive other than
# my own. Do what you want with this script; attribution is appreciated, but
# optional. Comments and corrections are welcome.
# In hindsight it may have been easier to fix up the archived HTML posts or
# individual XML files instead of using the RSS feed.
# Created 2013 by Mike Bourgeous - Released under CC0
require 'rss'
require 'yaml'
require 'fileutils'
require 'escape_utils'
require 'nokogiri'
# Fixes references to Posterous in document tags of the given type. Only
# attributes that appear to contain a Posterous URL will be processed.
# If no block is given, tries to find a file matching the tag's attribute under
# [srcdir], or if [srcdir] is nil, downloads the URI contained in [attr]. The
# matching file, if one is found, will be copied into [destdir], and the tag's
# [attr] attribute changed to point at [serverdir]/filename. Posterous image
# name abbreviation is taken into account, but this has not been tested with a
# wide variety of names.
# If a block is given, the block will be called once for each matching tag and
# the contents of its [attr] attribute, and the return value of the block used
# to replace the tag's [attr] attribute.
# After the attribute is updated, an immediately surrounding <a> tag linking to
# Posterous, if one exists, will be removed.
# doc - The parsed Nokogiri document.
# srcdir - The directory in which to find replacement files, or nil to download
# the originals.
# destdir - The directory to which to copy replacement files.
# serverdir - The name of destdir on the server (used for updating image tags).
# tag - The name of the tags to update.
# attr - The attribute of the tags to update.
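def fix_sources doc, srcdir, destdir, serverdir, tag='img', attr='src', &bl
  puts "\tFixing #{tag} tags' #{attr} attribute"
  tags = doc.css(tag)
  postregex = %r{https?://[^/]*}
  tags.each do |img|
    next unless img[attr] =~ postregex
    shortname = img[attr].split('/').last.split('.scaled').first
    ext = shortname.split('.').last.downcase
    puts "\t#{tag}: #{shortname}"
    if block_given?
      img[attr] = yield img, img[attr]
    else
      # Make sure the destination directory exists before writing into it
      FileUtils.mkdir_p(destdir)
      if srcdir == nil
        # Download the file
        puts "\t\tDownloading #{shortname}"
        File.open(File.join(destdir, shortname), "wb") do |file|
          # ASSUMPTION: the download call is missing from this copy of the
          # script; Net::HTTP.get is used here as a stand-in.
          require 'net/http'
          file.write(Net::HTTP.get(URI.parse(img[attr])))
        end
        in_img = shortname
      else
        # Find matching files
        matches = Dir.entries(srcdir).select {|imgfile|
          imgfile.downcase.end_with?(ext) &&
            imgfile.gsub(/\s+/, '_').include?(shortname.split('.').first)
        }
        if matches.length == 0
          # Retry without requiring the extension to match
          matches = Dir.entries(srcdir).select {|imgfile|
            imgfile.gsub(/\s+/, '_').include?(shortname.split('.').first)
          }
        end
        if matches.length == 0
          puts "\n\n\n########\nNo match found for #{img[attr]} in #{srcdir}\n########\n\n"
          next # Nothing to copy; leave this tag unchanged
        end
        if matches.length > 1
          # ASSUMPTION: the original narrowing test is missing from this copy;
          # here a candidate is kept only if its name ends with the full short name.
          reduced = matches.select {|imgfile|
            imgfile.gsub(/\s+/, '_').end_with?(shortname)
          }
          if reduced.length == 1
            matches = reduced
          else
            puts "\n\n\n########\nMore than one match found for #{shortname}:"
            puts matches
            puts "You will need to double-check #{tag} tags in this post\n\n"
          end
        end
        in_img = matches.first
        puts "\t\tUsing #{in_img} for #{shortname}"
        # Copy the file into the destination directory
        FileUtils.cp(File.join(srcdir, in_img), destdir)
      end
      # Update the tag's attribute
      img[attr] = EscapeUtils.escape_uri(File.join(serverdir, in_img))
    end
    # Remove a link wrapping the image, if one exists
    parent = img.parent
    if parent.node_name == 'a' && parent['href'] =~ postregex
      puts "\t\tRemoving parent link: #{parent['href']}"
      # ASSUMPTION: the removal call is missing from this copy; replacing the
      # <a> with the tag it wraps matches the behavior described above.
      parent.replace(img)
    end
  end
end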
# Writes each item from the given RSS feed into ./source/_posts (use Dir.chdir
# to change directories first if necessary). Posts will be marked as
# unpublished if the post's link starts with '/private/'.
# rss_file - The path to the file containing the RSS feed. The images will be
# found relative to the feed.
# basedir - The server directory in which the blog's posts and images/
# directory reside.
def generate_posts rss_file, basedir='/'
  basedir = "/#{basedir}" unless basedir.start_with? '/'
  basedir = "#{basedir}/" unless basedir.end_with? '/'
  dir = File.dirname(File.expand_path(rss_file))
  rss = File.read(rss_file)
  feed = RSS::Parser.parse(rss, false)
  # Map each post's short name (last path segment of its link) to the item
  # and the file it will be written to
  item_map = Hash[feed.items.map {|item|
    link = item.link.split('/').last
    [link, {:item => item, :filename => item.pubDate.strftime("source/_posts/%Y-%m-%d-#{link}.html")}]
  }]
  feed.items.each do |item|
    post_uri = URI.parse(item.link)
    permalink = item.link.split('/').last
    filename = item_map[permalink][:filename]
    date = item.pubDate
    header = {
      'layout' => "post",
      'title' => item.title,
      'date' => date,
      'comments' => true,
      'categories' => item.categories.select{|cat| cat.domain == "tag"}.map{|cat| cat.content},
      'published' => !post_uri.path.start_with?('/private/')
    }
    puts "Generating #{filename}#{header['published'] ? '' : ' (unpublished)'}"
    imgdir = "source/images/#{date.strftime('%Y/%m/%d')}/#{permalink}/"
    serverdir = '/' + imgdir.split('/', 2).last
    outfile = File.open(filename, "w")
    outfile.puts header.to_yaml
    outfile.puts "---"
    # Fix up images and video
    html = Nokogiri::HTML("<div id=\"import_#{permalink}\">#{EscapeUtils.unescape_html(item.content_encoded)}</div>")
    images = html.css('img')
    fix_sources html, date.strftime("#{dir}/image/%Y/%m"), imgdir, serverdir
    fix_sources html, nil, imgdir, serverdir, 'source'
    fix_sources html, nil, nil, nil, 'video', 'poster' do nil end
    # Fix up links to other posts
    fix_sources html, nil, nil, nil, 'a', 'href' do |tag, href|
      link_uri = URI.parse(href)
      # ASSUMPTION: the operands of this comparison are missing from this copy;
      # comparing hosts limits rewriting to links within the same blog.
      # Returning href leaves other links unchanged.
      next href unless link_uri.host == post_uri.host
      link_shortname = href.split('/').last.split('#').first
      if item_map.include? link_shortname
        link = item_map[link_shortname][:item]
        href = link.pubDate.strftime("#{basedir}%Y/%m/%d/#{link_shortname}/")
        href += "##{link_uri.fragment}" if link_uri.fragment
        puts "\t\tUsing #{link.title} (#{href})"
      else
        puts "\t######## No match found for #{href}"
      end
      href
    end
    outfile.puts html.css("div#import_#{permalink}").map{|node| node.to_html}.join
    outfile.close
  end
end
# Generates a redirecting link from the permalink of each item from the given
# RSS feed to the corresponding post generated by generate_posts().
# rss_file - The path to the file containing the RSS feed.
# basedir - The server directory in which the blog's posts and images/
# directory reside.
def generate_links rss_file, basedir='/'
  basedir = "/#{basedir}" unless basedir.start_with? '/'
  basedir = "#{basedir}/" unless basedir.end_with? '/'
  dir = File.dirname(File.expand_path(rss_file))
  rss = File.read(rss_file)
  feed = RSS::Parser.parse(rss, false)
  item_map = Hash[feed.items.map {|item|
    link = item.link.split('/').last
    [link, {:item => item, :filename => item.pubDate.strftime("source/#{link}/index.html")}]
  }]
  feed.items.each do |item|
    post_uri = URI.parse(item.link)
    permalink = item.link.split('/').last
    filename = item_map[permalink][:filename]
    dirname = File.dirname(filename)
    href = item.pubDate.strftime("#{basedir}%Y/%m/%d/#{permalink}/")
    title = item.title
    # Create source/<permalink>/ before writing its index.html
    FileUtils.mkdir_p(dirname)
    outfile = File.open(filename, "w")
    outfile.write <<-HTML
<!DOCTYPE html>
<meta http-equiv="Refresh" content="0; url=#{href}">
<link href="#{basedir}stylesheets/screen.css" rel="stylesheet" type="text/css">
<a style="color: inherit; text-decoration: none" href="#{href}">#{title}</a>
    HTML
    outfile.close
  end
end
if __FILE__ == $0
  # ARGV is used instead of $ARGV, which is only defined after require 'English'
  raise 'No RSS feed given' unless ARGV.length > 0
  if ARGV[0] == '--links'
    raise 'No RSS feed given' unless ARGV.length > 1
    generate_links ARGV[1], ARGV[2] || '/'
  else
    generate_posts ARGV[0], ARGV[1] || '/'
  end
end
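
The UTF-8 workaround described in the script's header comments (replace the `<item>` tags in wordpress_export_1.xml with the ones from posts/*.xml, and strip CR characters) could be sketched roughly as follows. This is not part of the original script: `merge_items` is a hypothetical helper, and it assumes every item is written as a plain `<item>...</item>` element with no attributes, which may not hold for every archive.

```ruby
# Hypothetical helper, not part of the original script.
# export_xml - contents of wordpress_export_1.xml as a String
# post_xmls  - array of Strings, one per posts/*.xml file
# Returns the export with its <item> elements replaced by those found in
# post_xmls, with all CR characters removed.
def merge_items(export_xml, post_xmls)
  # Collect every <item>...</item> element from the per-post files
  items = post_xmls.map { |xml| xml.delete("\r").scan(%r{<item>.*?</item>}m) }.flatten

  cleaned = export_xml.delete("\r")
  first = cleaned.index('<item>')
  last = cleaned.rindex('</item>')
  # Leave the export alone if it contains no items to replace
  return cleaned if first.nil? || last.nil?

  # Keep everything before the first item and after the last one
  cleaned[0...first] + items.join("\n") + cleaned[(last + '</item>'.length)..-1]
end

# Possible usage, run from the archive directory (paths are assumptions):
#   posts = Dir.glob('posts/*.xml').sort.map { |f| File.read(f) }
#   fixed = merge_items(File.read('wordpress_export_1.xml'), posts)
#   File.write('fixed_export.xml', fixed)
```

The result can then be passed to the import script in place of the original export. A regular expression is used here only to keep the sketch short; an XML parser such as Nokogiri would be more robust.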
Thanks a lot! This great script saved me a lot of time!

This seems to be giving me a problem with Posterous posts that were archived. What seems to be happening is it is reading the wordpress_export_1.xml file, and that is referencing a post in 2010-05, but the earliest date in the images directory is 2010-07.

Not quite sure how to approach this.


OK, here is something else I have learned. This is a snippet from one of my posts:

<h3>Know when to change tables - by Tony Hsieh (CEO of Zappos)</h3>
<div class='post_info'>
<span class='post_time'>June 21 2010, 11:46 PM</span>
<span class='author'>&nbsp;by Marc Gayle</span>
<div class='post_body'><p><div class='p_embed p_image_embed'>
<img src='../../../image/2010/07/11605730-media_httpfarm3static_mAyIi.jpg'>

The filename of the image is also specified in fixed_exports.xml, as can be seen here:

<content:encoded><![CDATA[<p><div class='p_embed p_image_embed'>
<img alt="Media_httpfarm3static_mayii" height="375" src="" width="500" />

This is the error that parsing this file generated:

Generating source/_posts/2010-06-22-know-when-to-change-tables-by-tony-hsieh-ceo-of-zappos.html
    Fixing img tags' src attribute
    img: media_httpfarm3static_mAyIi.jpg
/Dropbox/My Blog/posterous_import.rb:101:in `open': No such file or directory - /Dropbox/My Blog/Marc Gayle/image/2010/06 (Errno::ENOENT)
    from /Dropbox/My Blog/posterous_import.rb:101:in `entries'
    from /Dropbox/My Blog/posterous_import.rb:101:in `block in fix_sources'

So the trick is, when the image is not found at the default image/year/month path, to either search the directory structure for the filename, or to find the path in the individual HTML file included in the archive (in this case <img src='../../../image/2010/07/11605730-media_httpfarm3static_mAyIi.jpg'>).

Any thoughts on the best way to approach this?
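
One way the first option (searching the directory structure for the filename) might look is sketched below. `find_image` is a hypothetical helper, not something the script provides; it reuses the script's matching idea (same extension, and the file name contains the short name with whitespace turned into underscores) but looks in every subdirectory of the archive's image/ directory instead of a single year/month folder.

```ruby
# Hypothetical helper, not part of the original script.
# image_root - the archive's image/ directory
# shortname  - the short image name taken from the post, e.g.
#              'media_httpfarm3static_mAyIi.jpg'
# Returns the path of the first matching file, or nil if none is found.
# Note: image_root is passed to Dir.glob, so glob characters in it
# (such as [ ] { } *) would need escaping.
def find_image(image_root, shortname)
  stem = File.basename(shortname, '.*')
  ext = File.extname(shortname).downcase
  Dir.glob(File.join(image_root, '**', '*')).sort.find do |path|
    File.file?(path) &&
      File.extname(path).downcase == ext &&
      File.basename(path).gsub(/\s+/, '_').include?(stem)
  end
end

# Possible usage (the archive path is an assumption):
#   find_image('/path/to/archive/image', 'media_httpfarm3static_mAyIi.jpg')
```

In fix_sources this could serve as a fallback when the expected year/month directory is missing or contains no match.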


For what it's worth, I have forked this and updated it to fix the issues I was having.

