@spraints
Created September 26, 2022 12:01
Extract stream blocks from a PDF
# Usage: ruby extract.rb PDF...
#
# Extracts every 'stream' block from each PDF, including any embedded images.
# Streams are written exactly as stored (no filters are decoded), so many will
# still be compressed. Every output file gets a '.bin' extension. You can
# sometimes identify the contents with file(1):
#
# $ ruby extract.rb example.pdf
# $ file *.bin
def main
  namer = Namer.new
  ARGV.each do |arg|
    begin
      process(arg, namer)
    rescue => e
      puts "#{arg}: (#{e.class}) #{e}"
    end
  end
end

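# Writes each stream found in the PDF at +path+ to its own file.
def process(path, namer)
  # Read and write in binary mode so the bytes are never translated.
  all = File.binread(path)
  streams = all.split("\nstream\n")
  prev = ""
  # The first chunk is whatever precedes the first 'stream' keyword; skip it.
  streams.drop(1).each do |stream|
    good, rest = stream.split("\nendstream\n", 2)
    if rest.nil?
      # No 'endstream' in this chunk, so the split matched "\nstream\n" inside
      # the stream data. Carry the chunk forward and restore the separator
      # that the split removed.
      prev = prev + good.to_s + "\nstream\n"
      next
    end
    good = prev + good
    prev = ""
    name = namer.name("bin")
    puts "#{path} -> #{name}"
    File.binwrite(name, good)
  end
end
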
# Generates unique output names of the form <start time>-<counter>.<ext>.
class Namer
  def initialize
    @t = Time.now.to_i
    @n = 0
  end

  def name(ext)
    @n += 1
    "#{@t}-#{@n}.#{ext}"
  end
end

main
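
The script writes each stream exactly as it is stored in the PDF, so `file *.bin` will often report compressed data rather than a recognizable format. As a rough follow-up sketch, and not part of the original gist, the helper below tries to inflate each extracted file with Ruby's standard `zlib` library. The names `try_inflate` and `inflate_all` and the `.inflated` suffix are made up for illustration, and the sketch assumes the stream uses a single Flate (zlib) filter; streams stored with other filters, or with several chained filters, are left untouched.

```ruby
require "zlib"

# Hypothetical helper: returns the inflated bytes if +data+ is a zlib stream,
# or nil if zlib rejects it (for example, a JPEG image or plain text).
def try_inflate(data)
  Zlib::Inflate.inflate(data)
rescue Zlib::Error
  nil
end

# Hypothetical helper: inflates every file matching +pattern+ that zlib
# accepts and writes the result next to it with an '.inflated' suffix.
def inflate_all(pattern = "*.bin")
  Dir.glob(pattern).each do |path|
    inflated = try_inflate(File.binread(path))
    next if inflated.nil?

    out = "#{path}.inflated"
    File.binwrite(out, inflated)
    puts "#{path} -> #{out}"
  end
end
```

Calling `inflate_all` in the directory that holds the `.bin` files, and then running `file *.inflated`, may give more useful answers than `file *.bin` alone.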