Downloading stuff off of tumblr

TL;DR: skip to the summary for the bash one-liner, if so inclined :-)

Why & wherefore

I have an old tumblr that I want to back up. I don't use it regularly, but I want to have a backup of the whole site, design as well as content, just in case; it's not just the pictures I love, but the whole thing. I'm not particularly worried about downloading the streaming audio--can't help you there, gentle reader.

This is the story of a man, a man page, and a page. And lots of other pages.

Downloading just one page (and all the images, fonts, JS, CSS)

Before trying to download the whole site, I started with just one page. I want all the assets, not just the images or underlying HTML, and I want the links converted, so I can look at the files offline.

I found this swell incantation in the wget man page:

wget -E -H -k -K -p http://<site>/<document> 
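
For reference, here's the same command with the short flags spelled out as long options (these long names are also what show up later in the wgetrc file):

wget --adjust-extension --span-hosts --convert-links --backup-converted \
    --page-requisites http://<site>/<document>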

It worked, but not perfectly:

  • Unfortunately, the -E (--adjust-extension) option renamed my webfonts from Blah.eot to Blah.eot.html. Not cool, bro. The image names came out goofy as well, so it didn't really help at all.

  • This didn't do any rate-limiting. AWS hosts all of tumblr's images, so I'm sure they'd throttle or block an IP that goes bananas with downloading. We can tell wget to pause between downloads using --wait=, randomize the pauses using --random-wait, and limit overall download speed using --limit-rate. This is all 'good citizen' stuff, and it keeps my IP from being throttled or blocked by AWS or tumblr :-)

  • By default, wget is ridiculously verbose. The -nv option makes it less so, but not totally silent. (You can use -q to make it perfectly quiet.)

Adjusting the options (but keeping that shit alphabetical, so I don't go crazy trying to find options as this gets longer), we have:

wget -H -k -K --limit-rate=500k -nv -p --random-wait --wait=0.5 http://<site>/<document>

That's better. Rate limiting and waiting increased download time by a factor of three, from 6 seconds to 18 seconds. I can deal with that. So far, so good.

It might be a little more fun to let 'em know I'm not just some Chinese botnet instance of wget, but a friendly hacker taking care of business. So maybe a custom user-agent with a nod to Firefox and a reference to some made-up thing called Sunflower (with the golden ratio as its version number ;-).

--user-agent="Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.8.2; en-US) Wget/1.13.4 Sunflower/1.61803398"

and a custom header:

--header="Love-You-Guys: but the instagram thing has me worried. just grabbing my stuff."

and, of course, a From header:

--header="From: Jared 'the dragon' Hirsch <ohai@6a68.net>"

At this point, the command line is getting ridiculously long. Some of wget's options can be stuffed in a config file--happily, all the ones I care about:

# moving stuff into wgetrc
header = From: 'Jared "the dragon" Hirsch' <ohai@6a68.net>
header = Love-You-Guys: but the instagram thing has me worried. just grabbing my stuff.
user_agent = Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.8.2; en-US) Wget/1.13.4 Sunflower/1.61803398
# same as -H  
span_hosts = on         
# same as -k
convert_links = on      
# same as -K
backup_converted = on   
limit_rate = 500k       
# same as -nv
verbose = off           
# same as -p
page_requisites = on    
random_wait = on
wait = 0.5

And calling wget is as simple as

wget --config=my-wgetrc some-site.tumblr.com

A few pages

What I want next is to try downloading a few linked pages, without leaving the site and going to some other site (like the links to Bill Israel's site, which I left in the page, as I modified his design).

Let's start with "--recursive" and "--level=2", to go two clicks deep.

Also, I'm sick of downloading tracking JS code, so let's exclude quantcast with a little "--exclude-domains=quantserve.com". Bill Israel is a great guy (I imagine anyway), but I don't want his stuff: exclude cubicle17.com as well.

I'm going to try these things at the command line, at first, and promote them to the config file when they seem to be really working.

wget --config=my-wgetrc --recursive --level=2 \
    --exclude-domains=quantserve.com,cubicle17.com some-site.tumblr.com

Well, yes and no. Recursion is hard. It followed a bunch of links off of my website, which is linked from my tumblr. Dammit.

What I actually want is not a recursive download. What I want is to download some-site.tumblr.com/page/1 up to /page/104. So let's just do that.

One page plus some bash

We need a little for-loop at the bash prompt to get this done.

A bit of googling turns up the one-liner for running a command n times in a for loop:

for i in $(seq 1 n); do some-command $i; done

In my case, then, where n goes up to 104, it should be something like:

for i in $(seq 1 104); do wget --config=my-wgetrc some-site.tumblr.com/page/$i; done

And that should do it. This isn't going to download flash streaming music, but I can live with that.

Summary

If you want to download some-site.tumblr.com/page/1 to some-site.tumblr.com/page/N, you can set up a wget config file:

# stuff in my-wgetrc
# same as -H  
span_hosts = on         
# same as -k
convert_links = on      
# same as -K
backup_converted = on   
limit_rate = 500k       
# same as -nv
verbose = off           
# same as -p
page_requisites = on    
random_wait = on
wait = 0.5

Then, at a Linux/Mac command line, do:

for i in $(seq 1 N); do wget --config=my-wgetrc some-site.tumblr.com/page/$i; done

And that'll fetch your precious.

ghost commented Aug 1, 2016

I found that wget honors robots.txt, which may lead to a truncated file tree. -e robots=off (or the same setting in the rc file) fixes that.
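
In wgetrc terms, that fix would look something like this (the rc-file equivalent of -e robots=off), dropped in alongside the other settings:

# same as -e robots=off
robots = off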
