@dwalters-zz
Created December 7, 2009 00:51
#!/usr/bin/env ruby
#
# Scans a bzip2 file for block headers and prints each header's bit offset within the file.
# Because blocks are compressed independently, the block starting at any reported offset
# can be decompressed on its own, which allows random access.
#
# Author: Dan Walters
BZ2_BLOCK_MAGIC = 0x314159265359 # BCD pi
BZ2_EOS_MAGIC = 0x177245385090 # BCD sqrt(pi)

def bz2_block_scan(filename)
  # Open in binary mode so bytes are read unmodified on every platform.
  File.open(filename, "rb") do |f|
    # valid header? ("BZh" followed by the block-size digit 1-9)
    header = f.read(4)
    raise "not a valid bz2 file" unless header =~ /\ABZh[1-9]/

    # The 6 byte magic can start at any bit position. For each of the 8 possible
    # shifts, take the 5 bytes of the magic that end up byte-aligned, and record
    # how many leading bits spill into the preceding byte and what their value is.
    # For i > 0 the last i bits land in the byte after the match and are not checked.
    patterns = (0..7).collect do |i|
      trailing_fixnum = (BZ2_BLOCK_MAGIC >> i) & 0xffffffffff
      trailing_bytes = ["%010x" % [trailing_fixnum]].pack('H*')
      [trailing_bytes, {:leading_bit_count => 8 - i, :leading_bit_value => BZ2_BLOCK_MAGIC >> (40 + i)}]
    end
    combined = patterns.collect { |p| Regexp.escape(p[0]) }.join('|')
    re = Regexp.new(combined)

    # read the file in chunks, looking for matches
    meta = Hash[patterns]
    base_offset, leftover = header.length, ""
    while (data = f.read(10 * 1024 * 1024))
      data = leftover + data
      while (m = re.match(data))
        matched_offset, matched_meta = m.begin(0), meta[m.to_s]
        # when we have a match, verify that the leading bits match what is expected;
        # a match at offset 0 has no preceding byte in the buffer and cannot be verified
        if matched_offset > 0
          # getbyte returns an Integer; String#[] returns a String on Ruby 1.9+
          prev_byte = data.getbyte(matched_offset - 1)
          leading_bits = prev_byte & ((1 << matched_meta[:leading_bit_count]) - 1)
          bit_offset = (base_offset + matched_offset) * 8 - matched_meta[:leading_bit_count]
          # $stderr.puts({:base_offset => base_offset, :matched_offset => matched_offset, :matched_meta => matched_meta, :prev_byte => prev_byte, :leading_bits => leading_bits, :bit_offset => bit_offset, :m => m.to_s}.inspect)
          yield bit_offset if leading_bits == matched_meta[:leading_bit_value]
        end
        # advance past the match
        base_offset += m.end(0)
        data = m.post_match
      end
      # keep the unmatched tail so a pattern spanning two chunks is still found
      leftover = data
    end
  end
end

if __FILE__ == $0
  bz2_block_scan(ARGV.first || "/dev/stdin") do |bit_offset|
    puts bit_offset
  end
end
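
The scan above compares only the five byte-aligned bytes of the magic plus the leading bits in the preceding byte. When the magic is not byte-aligned, its last few bits fall in the following byte and are never compared, so a reported offset can occasionally be a false positive. Below is a minimal sketch of a stricter check that reads all 48 bits at a candidate bit offset. It assumes the relevant bytes are already in memory as a binary string; the helper names `read_48_bits` and `block_magic_at?` are hypothetical and not part of the original script.

```ruby
# Returns the 48-bit big-endian integer that starts at bit_offset in the
# binary string bytes, or nil if fewer than 48 bits are available there.
def read_48_bits(bytes, bit_offset)
  byte_index, shift = bit_offset.divmod(8)
  # 48 bits starting mid-byte span 7 bytes; byte-aligned they span 6.
  window = bytes.byteslice(byte_index, 7)
  needed = shift.zero? ? 6 : 7
  return nil if window.nil? || window.bytesize < needed

  value = window.bytes.inject(0) { |acc, b| (acc << 8) | b }
  total_bits = window.bytesize * 8
  # Drop the bits after the 48-bit field, then mask off the bits before it.
  (value >> (total_bits - shift - 48)) & 0xffffffffffff
end

# True if the full 6 byte block magic (BCD pi) starts at bit_offset.
def block_magic_at?(bytes, bit_offset, magic = 0x314159265359)
  read_48_bits(bytes, bit_offset) == magic
end

# Usage: the first block of a bzip2 stream follows the 4 byte header directly,
# so its magic sits at bit offset 32 ("1AY&SY" is the magic as ASCII).
sample = "BZh9".b + "1AY&SY".b
puts block_magic_at?(sample, 32) # prints true
```

One way to combine the two would be to filter the offsets printed by the scanner through `block_magic_at?` before using them, at the cost of re-reading a few bytes around each candidate.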