@seven1m
Last active December 5, 2017 03:08
Download all images for a Twitter archive
# gem install http
# unzip archive.zip -d archive
# cd archive
# ruby archive.rb
require 'http'
require 'fileutils'
require 'digest'

FileUtils.mkdir_p('media')
paths = Dir['data/**/*.js'] + ['index.html']
paths.each_with_index do |path, index|
  puts "#{index + 1} of #{paths.size}"
  data = File.read(path)
  # Replace every quoted media URL with the path of a local copy under media/.
  data.gsub!(/"(http[^"]+)(\.(ico|png|gif|jpg|jpeg|mov|mp4|mpg|mpeg))"/i) do
    print '.'
    # Capture the match now; the inner gsub below overwrites Regexp.last_match.
    match = Regexp.last_match
    ext = match[2]
    url = match[1].gsub(%r{\\/}, '/') # unescape JSON "\/" to "/"
    name = Digest::MD5.hexdigest(url) + ext
    asset_path = 'media/' + name
    unless File.exist?(asset_path)
      begin
        raw = HTTP.get(url + ext).to_s
        File.binwrite(asset_path, raw) # binary-safe write for image and video data
      rescue HTTP::ConnectionError
        puts url + ext + ' could not be downloaded'
        next match[0] # keep the original URL; a bare `next` would replace it with ''
      end
    end
    '"' + asset_path + '"'
  end
  File.write(path, data)
  puts
end
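
The URL rewriting step can be tried on its own, without touching the network. The following is a minimal sketch, not part of the original script: `MEDIA_URL` and `rewrite_media_urls` are hypothetical names, and the sample string is made up. It reuses the same regex idea and the same MD5-based naming, so the output shows what a quoted URL in the archive would be replaced with.

```ruby
require 'digest'

# Hypothetical constant: same pattern as the script, with a non-capturing
# group for the list of extensions.
MEDIA_URL = /"(http[^"]+)(\.(?:ico|png|gif|jpg|jpeg|mov|mp4|mpg|mpeg))"/i

# Hypothetical helper: rewrite quoted media URLs in `text` to local paths
# under media/, without downloading anything.
def rewrite_media_urls(text)
  text.gsub(MEDIA_URL) do
    match = Regexp.last_match
    url = match[1].gsub('\/', '/') # unescape JSON "\/" to "/"
    ext = match[2]                 # includes the leading dot
    %("media/#{Digest::MD5.hexdigest(url)}#{ext}")
  end
end

# Made-up sample resembling a line from the archive's data files.
sample = '{"media_url":"http:\/\/example.com\/a\/photo.JPG","text":"hi"}'
puts rewrite_media_urls(sample)
```

Hashing the URL gives every asset a flat, collision-resistant file name, so the same URL found in several files maps to the same local file and is only downloaded once.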