Artist Stalker: Scraping Tumblr

Introduction

I think we can all agree that Tumblr is lame now, joining the ranks of Facebook and Instagram: organizations that regularly censor content on platforms designed with freedom of expression and information in mind. As an art lover, I've been worried that the rise of corporate censorship would affect my favorite artists, so I decided to start backing up their content locally. Here's what I had to do:

JavaScript URL Collection

First I logged into Tumblr, opened Chrome DevTools, went to the Network tab, and dug around until I found the endpoint they use for their infinite scrolling. At the time of this writing it started with indash_blog. By right-clicking the request I was able to select Copy as Fetch.

[Screenshot: copy-as-fetch]

In the Sources tab, you can write code Snippets that run within the context of the site. I created a new one, pasted the fetch code, and wrote some logic around it:

[Screenshot: snippet]

// User set variables
const username = 'cartoonhangover'
const delay = 1000
const offsetStride = 10
const maxAttempts = 3

// Internally used variables
let offset = 0
let errors = 0
let lastCount = null
let links = []

// Not sure if needed, but Tumblr seems to use it
const formKey = document.getElementById('tumblr_form_key').content

function pull() {
  console.log('Fetching', offset)
  // limit/offset drive the pagination; the rest of the query string
  // (including the safemode bypass flags) is what Tumblr's own dashboard sends
  fetch(`https://www.tumblr.com/svc/indash_blog?tumblelog_name_or_id=${username}&post_id=&limit=10&offset=${offset}&should_bypass_safemode_forpost=true&should_bypass_safemode_forblog=true&should_bypass_tagfiltering=true&can_modify_safe_mode=true`, {
    "credentials": "include",
    "headers": {
      "accept": "application/json, text/javascript, */*; q=0.01",
      "accept-language": "en-US,en;q=0.9",
      "x-requested-with": "XMLHttpRequest",
      "x-tumblr-form-key": formKey
    },
    "referrer": "https://www.tumblr.com/",
    "referrerPolicy": "origin-when-cross-origin",
    "body": null,
    "method": "GET",
    "mode": "cors"
  })
  .then(res => res.json())
  .then(json => {
    // Keep track of how many results we got last time
    lastCount = json.response.posts.length

    // Grab the original size image URL
    // for posts that have images
    json.response.posts.forEach(post => {
      if (post.photos) {
        post.photos.forEach(photo => {
          links.push(photo.original_size.url)
        })
      }
    })
  })
  .catch(error => {
    errors++
    console.warn('Unable to parse last response')
    console.warn(error)
  })
  .finally(() => {
    // If we haven't hit too many errors
    // or the last request had posts,
    // make a request for the next group
    if (errors < maxAttempts && lastCount !== 0) {
      offset += offsetStride
      setTimeout(pull, delay)
    } else {
      // Log our output
      console.log(links)
      if (errors) {
        console.log('Ran into some errors, all photos may not be present')
      }
    }
  })
}

pull()

All the code does is simulate infinite scroll to page through a user's posts, collect the original-size image URL for each photo post, and log the array of URLs when it's done. Requests are spaced out with a delay to keep Tumblr from noticing the scraping.
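
For reference, the only parts of the response the snippet reads look roughly like this (field names taken from the code above; everything else in the payload is omitted):

// Approximate shape of the indash_blog response, reconstructed
// from the fields the snippet actually touches
const exampleResponse = {
  response: {
    posts: [
      {
        photos: [
          { original_size: { url: 'https://66.media.tumblr.com/.../tumblr_..._1280.png' } }
        ]
      }
    ]
  }
}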

Once it printed the array, I right-clicked it in the console, clicked Store as global variable, and then typed copy(temp1) to copy the array to my clipboard. I looked briefly for a programmatic way to copy to the clipboard from the Snippet, but got bored of searching.

[Screenshot: store-global]
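
If you'd rather skip the manual copy step, one option that might work (a sketch I didn't test, assuming the page is focused and the browser grants clipboard permission) is the asynchronous Clipboard API:

// Hypothetical replacement for the manual copy(temp1) step.
// navigator.clipboard.writeText needs a focused document and
// may require permission, so it won't work in every context.
function copyLinks (urls) {
  return navigator.clipboard.writeText(JSON.stringify(urls, null, 2))
    .then(() => console.log('Copied', urls.length, 'URLs'))
    .catch(err => console.warn('Clipboard write failed', err))
}

Calling copyLinks(links) in the else branch instead of console.log(links) would let you paste straight into urls.json.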

Python Image Collection

I saved the array as a JSON file called urls.json. It looked kind of like this:

[
  "https://66.media.tumblr.com/01d773f953c56b57a75e75d7fa10e8d6/tumblr_pl2w0chdSI1so49byo1_1280.png",
  "https://66.media.tumblr.com/b2708c2e4020a6f06b592a21119c2087/tumblr_pl6fkx6vW91r6y37vo1_1280.jpg",
  "https://66.media.tumblr.com/09ba2e38e29e5c22908a381fff07fc8c/tumblr_pl4ybqTmC61rps1iho1_1280.png",
  "https://66.media.tumblr.com/51e70464e06a0f07e2bd2dbc053cf09c/tumblr_pkz8nlgah11rps1iho1_1280.png",
  "https://66.media.tumblr.com/38cb5e90fc11a7ae3d5f927184e88020/tumblr_pktn94hjZw1so49byo1_1280.jpg",
  "https://66.media.tumblr.com/717054ad764683e153f4a0706a5dc241/tumblr_pktapjElcQ1r6y37vo1_1280.jpg",
  "https://66.media.tumblr.com/c002f719a2acd4f5a5c538a7a80ef852/tumblr_pkmpm3BMo31qbr6kxo1_1280.jpg"
]

Then I switched to Python 3 to download the URLs:

import json
import os
import urllib.request
from time import sleep

# These should probably be CLI arguments
delay = 1
url_file = 'urls.json'
save_dir = './photos/'

successes = 0
failures = []

# Make sure the save directory exists
os.makedirs(save_dir, exist_ok=True)

with open(url_file) as file:
    urls = json.load(file)
    for url in urls:
        try:
            # Grab the filename from the end of the URL
            name = url.split("/")[-1]
            print(name)
            urllib.request.urlretrieve(url, save_dir + name)
            successes += 1
        except Exception:
            print('Unable to fetch image from: ' + url)
            failures.append(url)

        # Delay in an attempt to avoid IP blocking
        sleep(delay)

print("Done")
print("Successes: " + str(successes))
print("Failures: " + str(len(failures)))
print(failures)

When run (I use python3 download.py), it loops through the list of URLs and downloads each image into the save directory.
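
As the comment in the script admits, the hard-coded settings should probably be CLI arguments. Here's a quick sketch of what that might look like with argparse (the flag names are my own, not part of the original script):

import argparse

# Hypothetical CLI wrapper for the settings hard-coded above;
# flag names and defaults are illustrative
parser = argparse.ArgumentParser(description='Download images from a JSON list of URLs')
parser.add_argument('--url-file', default='urls.json', help='JSON file containing an array of URLs')
parser.add_argument('--save-dir', default='./photos/', help='directory to save images into')
parser.add_argument('--delay', type=float, default=1, help='seconds to sleep between downloads')
args = parser.parse_args()

delay = args.delay
url_file = args.url_file
save_dir = args.save_dir

Then python3 download.py --delay 2 would slow things down without editing the file.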

Disclaimer

Tumblr doesn't want you scraping their site. Whatever you do, don't do it. When I said "I" earlier, I meant hypothetically. I didn't scrape them.
