Last active January 22, 2021 20:55
Downloading stuff off of tumblr

TL;DR skip to the summary for the bash one-liner if so inclined :-)

Why & wherefore

I have an old tumblr that I want to back up. I don't use it regularly, but I want to have a backup of the whole site, design as well as content, just in case; it's not just the pictures I love, but the whole thing. I'm not particularly worried about downloading the streaming audio--can't help you there, gentle reader.

This is the story of a man, a man page, and a page. And lots of other pages.

Downloading just one page (and all the images, fonts, JS, CSS)

Before trying to download the whole site, I started with just one page. I want all the assets, not just the images or underlying HTML, and I want the links converted, so I can look at the files offline.

I found this swell incantation in the wget man page:

wget -E -H -k -K -p http://<site>/<document> 

It worked, but not perfectly:

  • Unfortunately, the -E (--adjust-extension) option renamed my webfonts from Blah.eot to Blah.eot.html. Not cool, bro. The image names came out goofy as well, so it didn't really help at all.

  • This didn't do any rate-limiting. AWS hosts all of tumblr's images, so I'm sure they block an IP that goes bananas with downloading. We can tell wget to pause between downloads using --wait=, randomize the pauses using --random-wait (each pause then varies between 0.5 and 1.5 times the --wait value), and limit overall download speed using --limit-rate. This is all 'good citizen' stuff, and keeps my IP from being throttled or blocked by AWS or tumblr :-)

  • By default, wget is ridiculously verbose. The -nv option makes it less so, but not totally silent. (You can use -q to make it perfectly quiet.)

Adjusting the options (but keeping that shit alphabetical, so I don't go crazy trying to find options as this gets longer) we have:

wget -H -k -K --limit-rate=500k -nv -p --random-wait --wait=0.5 http://<site>/<document>

That's better. Rate limiting and waiting increased download time by a factor of three, from 6 seconds to 18 seconds. I can deal with that. So far, so good.

It might be a little more fun to let 'em know I'm not just some Chinese botnet instance of wget, but a friendly hacker taking care of business. So maybe a custom user-agent with a nod to Firefox and a reference to some made-up thing called Sunflower (with the golden ratio as its version number ;-):

--user-agent="Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.8.2; en-US) Wget/1.13.4 Sunflower/1.61803398"

and a custom header:

--header="Love-You-Guys: but the instagram thing has me worried. just grabbing my stuff."

and, of course, a From header:

--header="From: Jared 'the dragon' Hirsch <>"

At this point, the command line is getting ridiculously long. Some of wget's options can be stuffed in a config file--happily, all the ones I care about:

# moving stuff into wgetrc
header = From: 'Jared "the dragon" Hirsch' <>
header = Love-You-Guys: but the instagram thing has me worried. just grabbing my stuff.
user_agent = Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.8.2; en-US) Wget/1.13.4 Sunflower/1.61803398
# same as -H  
span_hosts = on         
# same as -k
convert_links = on      
# same as -K
backup_converted = on   
limit_rate = 500k       
# same as -nv
verbose = off           
# same as -p
page_requisites = on    
random_wait = on
wait = 0.5

And calling wget is as simple as

wget --config=my-wgetrc http://<site>/<document>
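While an option is still experimental and not yet promoted to the config file, a bash array is one way to keep the command line readable. This is only a sketch: the header text, user-agent string, and URL are placeholders I made up, and it prints the command instead of running it.

```shell
#!/usr/bin/env bash
# Experimental options live in an array; stable ones stay in my-wgetrc.
# The header, user-agent, and URL values are placeholders for illustration.
opts=(
  --config=my-wgetrc
  --header="X-Example: just grabbing my own stuff"
  --user-agent="Wget Sunflower/1.61803398"
)
url="http://example.com/"

# Dry run: print a shell-quoted version of the command.
# To actually download, run:  wget "${opts[@]}" "$url"
printf '%q ' wget "${opts[@]}" "$url"
echo
```

Quoting each array element keeps values with spaces intact as a single argument, which is the part that gets fiddly when everything is typed on one line.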

A few pages

What I want next is to try downloading a few linked pages, without leaving the site and going to some other site (like the links to Bill Israel's site, which I left in the page, as I modified his design).

Let's start with "--recursive" and "--level 2", to go two clicks deep.

Also, I'm sick of downloading tracking JS code, so let's exclude quantcast with a little "". Bill Israel is a great guy (I imagine anyway), but I don't want his stuff: exclude as well.

I'm going to try these things at the command line, at first, and promote them to the config file when they seem to be really working.

wget --config=my-wgetrc --recursive --level=2 \,

Well, yes and no. Recursion is hard. It followed a bunch of links off of my website, which is linked from my tumblr. Dammit.
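For reference, the domain exclusions described above can be sketched with wget's --exclude-domains option, which takes a comma-separated list of hosts not to follow. The domain names and start URL here are made-up placeholders, and the script only prints the command. Note that this blocks the listed hosts only, so by itself it won't stop wget from wandering off to other sites.

```shell
#!/usr/bin/env bash
# Placeholder domains: substitute the hosts you actually want to skip.
excluded=(tracker.example.com designer.example.net)

# Join the array with commas, the list format --exclude-domains expects.
excluded_csv=$(IFS=,; echo "${excluded[*]}")

# Dry run: print the command. Remove "echo" to perform the download.
echo wget --config=my-wgetrc --recursive --level=2 \
  --exclude-domains="$excluded_csv" http://example.com/
```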

What I actually want is not a recursive download. What I want is to download up to /page/104. So let's just do that.

One page plus some bash

We need a little for-loop at the bash prompt to get this done.

A bit of googling turns up the one-liner to iterate over a for-loop n times:

for i in $(seq 1 n); do <command> $i; done

In my case, then, where n goes up to 104, it should be something like:

for i in $(seq 1 104); do wget http://<site>/page/$i; done

And that should do it. This isn't going to download flash streaming music, but I can live with that.
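Before letting the loop loose on a hundred-odd pages, it can help to print the URLs it would fetch. A minimal sketch, assuming the pages live at /page/1 through /page/N as described above; page_urls is a hypothetical helper name and example.com is a placeholder site.

```shell
#!/usr/bin/env bash
# Print one URL per page, 1 through N, for a given base URL.
# page_urls is a hypothetical helper, not part of wget.
page_urls() {
  local base=$1 last=$2 i
  for i in $(seq 1 "$last"); do
    printf '%s/page/%s\n' "$base" "$i"
  done
}

# Dry run with a placeholder site: check the list before fetching anything.
page_urls http://example.com 3

# To fetch for real (assumes my-wgetrc exists in the current directory):
#   page_urls http://example.com 104 | while read -r url; do
#     wget --config=my-wgetrc "$url"
#   done
```

Separating "which URLs" from "fetch them" makes it easy to eyeball the list, or to feed it to something else later.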


Summary

If you want to download http://<site>/page/1 through http://<site>/page/N, you can set up a wget config file:

# stuff in my-wgetrc
# same as -H  
span_hosts = on         
# same as -k
convert_links = on      
# same as -K
backup_converted = on   
limit_rate = 500k       
# same as -nv
verbose = off           
# same as -p
page_requisites = on    
random_wait = on
wait = 0.5

Then at a linux/mac command line, do:

for i in $(seq 1 N); do wget --config=my-wgetrc http://<site>/page/$i; done

And that'll fetch your precious.


ghost commented Aug 1, 2016

I found that wget honors robots.txt, which may lead to a truncated file tree. -e robots=off (or robots = off in the rc file) fixes that.
