Downloading stuff off of tumblr

TL;DR: skip to the summary for the bash one-liner, if so inclined :-)

Why & wherefore

I have an old tumblr that I want to back up. I don't use it regularly, but I want to have a backup of the whole site, design as well as content, just in case; it's not just the pictures I love, but the whole thing. I'm not particularly worried about downloading the streaming audio--can't help you there, gentle reader.

This is the story of a man, a man page, and a page. And lots of other pages.

Downloading just one page (and all the images, fonts, JS, CSS)

Before trying to download the whole site, I started with just one page. I want all the assets, not just the images or underlying HTML, and I want the links converted, so I can look at the files offline.

I found this swell incantation in the wget man page:

wget -E -H -k -K -p http://<site>/<document> 
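
For reference, here's the same command with the short flags spelled out as long options (these long names are also what show up later in the wgetrc file):

wget --adjust-extension --span-hosts --convert-links --backup-converted \
    --page-requisites http://<site>/<document>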

It worked, but not perfectly:

  • Unfortunately, the -E (--adjust-extension) option renamed my webfonts from Blah.eot to Blah.eot.html. Not cool, bro. The image names came out goofy as well, so it didn't really help at all.

  • This didn't do any rate-limiting. AWS hosts all of tumblr's images, so I'm sure they'd throttle or block an IP that goes bananas with downloading. We can tell wget to pause between downloads using --wait=, randomize the pauses using --random-wait, and limit overall download speed using --limit-rate. This is all 'good citizen' stuff, and it keeps my IP from being throttled or blocked by AWS or tumblr :-)

  • By default, wget is ridiculously verbose. The -nv option makes it less so, but not totally silent. (You can use -q to make it perfectly quiet.)

Adjusting the options (but keeping that shit alphabetical, so I don't go crazy trying to find options as this gets longer), we have:

wget -H -k -K --limit-rate=500k -nv -p --random-wait --wait=0.5 http://<site>/<document>

That's better. Rate limiting and waiting increased download time by a factor of three, from 6 seconds to 18 seconds. I can deal with that. So far, so good.

It might be a little more fun to let 'em know I'm not just some Chinese botnet instance of wget, but a friendly hacker taking care of business. So maybe a custom user-agent with a nod to Firefox and a reference to some made-up thing called Sunflower (with the golden ratio as its version number ;-).

--user-agent="Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.8.2; en-US) Wget/1.13.4 Sunflower/1.61803398"

and a custom header:

--header="Love-You-Guys: but the instagram thing has me worried. just grabbing my stuff."

and, of course, a From header:

--header="From: Jared 'the dragon' Hirsch <ohai@6a68.net>"

At this point, the command line is getting ridiculously long. Some of wget's options can be stuffed in a config file--happily, all the ones I care about:

# moving stuff into wgetrc
header = From: 'Jared "the dragon" Hirsch' <ohai@6a68.net>
header = Love-You-Guys: but the instagram thing has me worried. just grabbing my stuff.
user_agent = Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.8.2; en-US) Wget/1.13.4 Sunflower/1.61803398
# same as -H  
span_hosts = on         
# same as -k
convert_links = on      
# same as -K
backup_converted = on   
limit_rate = 500k       
# same as -nv
verbose = off           
# same as -p
page_requisites = on    
random_wait = on
wait = 0.5

And calling wget is as simple as

wget --config=my-wgetrc some-site.tumblr.com

A few pages

What I want next is to try downloading a few linked pages, without leaving the site and going to some other site (like the links to Bill Israel's site, which I left in the page, as I modified his design).

Let's start with "--recursive" and "--level=2", to go two clicks deep.

Also, I'm sick of downloading tracking JS code, so let's exclude quantcast with a little "--exclude-domains=quantserve.com". Bill Israel is a great guy (I imagine anyway), but I don't want his stuff: exclude cubicle17.com as well.

I'm going to try these things at the command line, at first, and promote them to the config file when they seem to be really working.

wget --config=my-wgetrc --recursive --level=2 \
    --exclude-domains=quantserve.com,cubicle17.com some-site.tumblr.com

Well, yes and no. Recursion is hard. It followed a bunch of links off of my website, which is linked from my tumblr. Dammit.

What I actually want is not a recursive download. What I want is to download some-site.tumblr.com/page/1 up to /page/104. So let's just do that.

One page plus some bash

We need a little for-loop at the bash prompt to get this done.

A bit of googling turns up the one-liner for running a command n times in a for loop:

for i in $(seq 1 n); do some-command $i; done

In my case, then, where n goes up to 104, it should be something like:

for i in $(seq 1 104); do wget --config=my-wgetrc some-site.tumblr.com/page/$i; done

And that should do it. This isn't going to download flash streaming music, but I can live with that.

Summary

If you want to download some-site.tumblr.com/page/1 to some-site.tumblr.com/page/N, you can set up a wget config file:

# stuff in my-wgetrc
# same as -H  
span_hosts = on         
# same as -k
convert_links = on      
# same as -K
backup_converted = on   
limit_rate = 500k       
# same as -nv
verbose = off           
# same as -p
page_requisites = on    
random_wait = on
wait = 0.5

Then, at a Linux/Mac command line, do:

for i in $(seq 1 N); do wget --config=my-wgetrc some-site.tumblr.com/page/$i; done

And that'll fetch your precious.

ghost commented Aug 1, 2016

I found that wget honors robots.txt, which may lead to a truncated file tree. -e robots=off (or the same setting in the rc file) fixes that.
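
In wgetrc terms, that fix would look something like this (the rc-file equivalent of -e robots=off), dropped in alongside the other settings:

# same as -e robots=off
robots = off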
