
streaming gzip

em-http-gzip-streaming.rb
require 'rubygems'
require 'zlib'
require 'em-http-request'

# Monkey-patched Gzip decoder to handle gzip streams.
#
# This takes advantage of the fact that Zlib::GzipReader
# takes an IO object and reads from it as it decompresses.
#
# It also relies on Zlib only checking for nil as the method
# of determining whether it has reached EOF.
#
# `IO#read(len, buf)` can also denote EOF by returning a string
# shorter than `len`, but Zlib doesn't care about that.
#
module EventMachine::HttpDecoders
  class GZip < Base
    # An IO-like object that buffers appended chunks and discards
    # whatever has already been read, keeping memory usage bounded.
    class LazyStringIO
      def initialize(string = "")
        @stream = string
      end

      def <<(string)
        @stream << string
      end

      def read(length = nil, buffer = nil)
        buffer ||= ""
        length ||= @stream.size
        buffer << @stream[0, length]
        # Drop the bytes we just handed out; guard against a nil
        # slice when length runs past the end of the buffer.
        @stream = @stream[length..-1] || ""
        buffer
      end

      def size
        @stream.size
      end
    end

    def self.encoding_names
      %w(gzip compressed)
    end

    def decompress(compressed)
      @buf ||= LazyStringIO.new
      @buf << compressed
      # Zlib::GzipReader loads input in 2048-byte chunks, so wait
      # until at least that much has accumulated before creating it.
      if @buf.size > 2048
        @gzip ||= Zlib::GzipReader.new(@buf)
        # Lines are bigger than compressed chunks, so this works.
        # You could also use #readpartial, but then you need to tune
        # the max length. Don't use #read, because it will attempt to
        # read the full file. #readline uses #gets under the covers,
        # so you could try that too.
        @gzip.readline
      end
    end

    def finalize
      @gzip.read if @gzip
    end
  end
end


url      = "my-streaming-url"
user     = "my-user"
password = "my-password"

EventMachine.run do

  http = EventMachine::HttpRequest.new(url).get :head => {
    'Accept-Encoding' => 'gzip',
    'Authorization'   => [user, password] }

  http.headers do |hash|
    p [:status, http.response_header.status]
    p [:headers, hash]
    if http.response_header.status > 299
      puts 'unsuccessful request'
      EM.stop
    end
  end

  http.stream do |chunk|
    print chunk
  end

  http.callback do
    p "done"
    EM.stop
  end

  http.errback do
    puts "there was an error"
    p http.error
    p http.response
    EM.stop
  end

end

Hmm, interesting. Should @buf be reset after @gzip.readline? Part of the appeal of http.stream is that you don't have to buffer a giant blob of data; if you do need the entire blob, you should be explicitly buffering it yourself at that point.

I don't think we'd want to reset it as it might contain more chunks that need to be decompressed.
It only needs to buffer enough to not cause GzipReader to complain. On the other hand, after the headers have been processed, you could probably just call readpartial or gets until the buffer is empty, because the GzipReader probably maintains enough state to keep decompressing.

That sounds reasonable, assuming it can be made to work. :-)

Is there a specific use case where you're streaming a gzipped file? Surprisingly, I haven't had any bug reports or feature requests around this previously.

It's not a file; it's a continuous stream of compressed data, which is why you can't one-shot it: the request never really ends.

Sorry, bad choice of language there. That's what I meant. :-)

If you're streaming, the memory bloat associated with buffering the entire response is not an issue for you?

The LazyStringIO object prevents the whole response from being buffered by dropping the read portions of its buffer.

GzipReader pulls in data on demand provided you call it the right way, so memory usage is limited.
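To see that mechanism in isolation, here's a minimal sketch of the same idea outside of em-http-request. DrainBuffer is an illustrative stand-in for LazyStringIO (the name is mine, not the gist's): it keeps only un-read bytes, so memory stays bounded while Zlib::GzipReader, which duck-types on #read, pulls data on demand.

```ruby
require 'zlib'
require 'stringio'

# IO-like buffer that discards bytes as soon as they're read,
# mirroring the LazyStringIO trick from the patch above.
class DrainBuffer
  def initialize
    @stream = "".b
  end

  def <<(data)
    @stream << data
    self
  end

  def size
    @stream.size
  end

  def read(length = nil, buffer = nil)
    buffer ||= "".b
    length ||= @stream.size
    buffer << @stream[0, length].to_s
    @stream = @stream[length..-1] || "".b
    buffer
  end
end

# Build a small gzip blob to feed in.
compressed = StringIO.new("".b).tap do |io|
  gz = Zlib::GzipWriter.new(io)
  gz.write("line one\nline two\n")
  gz.close
end.string

buf = DrainBuffer.new
buf << compressed
reader = Zlib::GzipReader.new(buf) # only needs buf to respond to #read
puts reader.readline               # prints "line one"
puts buf.size                      # 0 once GzipReader has slurped the blob
```

In the real decoder the `buf << compressed` step happens once per HTTP chunk, but the reader side is the same.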

One problem I could see is if chunks are consistently bigger than lines, you'd start queuing up decompressed data that hasn't been passed to callbacks yet.

Maybe you could use readpartial in a loop to pull out all of the currently decompressed buffer. But readpartial blocks if there's no unzipped data available.
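A hedged sketch of that drain loop (drain_decompressed and CHUNK are illustrative names, not part of the gist): with a fully buffered in-memory stream, readpartial terminates via EOFError rather than blocking, so the loop below runs to completion; over a live socket it could block mid-stream, as noted above.

```ruby
require 'zlib'
require 'stringio'

CHUNK = 4096 # assumed max read size; tune for your stream

# Pull out everything GzipReader can decompress right now,
# stopping at EOFError instead of waiting on #read for the whole body.
def drain_decompressed(gzip)
  out = "".b
  begin
    loop { out << gzip.readpartial(CHUNK) }
  rescue EOFError
    # no more decompressed data available from this stream
  end
  out
end

data    = "hello " * 1000
gz_blob = StringIO.new("".b).tap do |io|
  w = Zlib::GzipWriter.new(io)
  w.write(data)
  w.close
end.string

reader = Zlib::GzipReader.new(StringIO.new(gz_blob))
puts drain_decompressed(reader).bytesize # 6000
```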

I wish you could just use it like you did with Inflate, though. It might be fun to try to implement an equivalent API for Gzip.
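For what it's worth, Zlib::Inflate can be made to handle gzip headers incrementally by passing 32 + Zlib::MAX_WBITS as the window-bits argument, which tells zlib to auto-detect a gzip or zlib header. That gives a chunk-at-a-time decoder with the same shape as em-http's Inflate decoder; the 5-byte chunking below is arbitrary, just to simulate a stream.

```ruby
require 'zlib'
require 'stringio'

# 32 + MAX_WBITS => auto-detect gzip/zlib header, decode incrementally.
inflater = Zlib::Inflate.new(32 + Zlib::MAX_WBITS)

# Build a gzip blob to replay as a stream.
blob = StringIO.new("".b).tap do |io|
  gz = Zlib::GzipWriter.new(io)
  gz.write("streamed body")
  gz.close
end.string

decoded = "".b
blob.each_char.each_slice(5) do |chars| # feed arbitrary 5-byte chunks
  decoded << inflater.inflate(chars.join)
end
puts decoded # prints "streamed body"
```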

Thanks, this helped a lot in getting my Powertrack client to interface with the Gnip console.

The monkey-patch bit of this has been cleaned up and merged into em-http-request, which rocks!

https://github.com/igrigorik/em-http-request/pull/186

@baroquebobcat Thanks for your work producing this. There is discussion that could use your attention at https://github.com/igrigorik/em-http-request/issues/204
