require 'rubygems'
require 'em-http-request'
require 'zlib'

# Monkey-patched Gzip decoder to handle Gzip streams.
#
# This takes advantage of the fact that Zlib::GzipReader takes an
# IO object & reads from it as it decompresses.
#
# It also relies on Zlib only checking for nil as the method of
# determining whether it has reached EOF.
#
# `IO#read(len, buf)` can also denote EOF by returning a string
# shorter than `len`, but Zlib doesn't care about that.
module EventMachine::HttpDecoders
  class GZip < Base
    # IO-like wrapper around a string buffer that discards bytes as
    # they are read, so the full response is never held in memory.
    class LazyStringIO
      def initialize(string = "")
        @stream = string
      end

      def <<(string)
        @stream << string
      end

      def read(length = nil, buffer = nil)
        buffer ||= ""
        if length.nil?
          # No length given: hand over everything currently buffered.
          buffer << @stream
          @stream = ""
        else
          # Zlib only treats nil as EOF, so return nil once the
          # buffer is empty.
          return nil if @stream.empty?
          buffer << @stream[0, length]
          # Drop the consumed bytes; the slice is nil when length
          # runs past the end of the buffer.
          @stream = @stream[length..-1] || ""
        end
        buffer
      end

      def size
        @stream.size
      end
    end

    def self.encoding_names
      %w(gzip compressed)
    end

    def decompress(compressed)
      @buf ||= LazyStringIO.new
      @buf << compressed
      # Zlib::GzipReader loads input in 2048-byte chunks, so wait
      # until at least that much has accumulated.
      if @buf.size > 2048
        @gzip ||= Zlib::GzipReader.new(@buf)
        # Lines are bigger than compressed chunks, so this works.
        # You could also use #readpartial, but then you need to tune
        # the max length. Don't use #read, because it will attempt to
        # read the full file. #readline uses #gets under the covers,
        # so you could try that too.
        @gzip.readline
      end
    end

    def finalize
      # The reader is nil if the stream ended before 2048 bytes
      # arrived, so guard before draining it.
      @gzip && @gzip.read
    end
  end
end

url      = "my-streaming-url"
user     = "my-user"
password = "my-password"

EventMachine.run do
  http = EventMachine::HttpRequest.new(url).get :head => {
    'Accept-Encoding' => 'gzip',
    'Authorization'   => [user, password] }

  http.headers do |hash|
    p [:status, http.response_header.status]
    p [:headers, hash]
    if http.response_header.status > 299
      puts 'unsuccessful request'
      EM.stop
    end
  end

  http.stream do |chunk|
    print chunk
  end

  http.callback do
    p "done"
    EM.stop
  end

  http.errback do
    puts "there was an error"
    p http.error
    p http.response
    EM.stop
  end
end
I don't think we'd want to reset it, as it might contain more chunks that need to be decompressed.
It only needs to buffer enough to keep GzipReader from complaining. On the other hand, once the headers have been processed, you could probably just call readpartial or gets until the buffer is empty, because GzipReader probably maintains enough state to keep decompressing.
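The "call gets until the buffer is empty" idea can be sketched with Ruby's stdlib alone. This sketch pre-builds the compressed data in an in-memory StringIO rather than receiving it over a live EventMachine stream, so gets never blocks; the sample text is made up for illustration:

```ruby
require 'zlib'
require 'stringio'

# Build some gzipped sample data entirely in memory.
gzipped = StringIO.new
gz = Zlib::GzipWriter.new(gzipped)
gz.write("alpha\nbeta\ngamma\n")
gz.close

# GzipReader#gets returns nil at EOF, so a simple while loop drains
# every complete line that has been decompressed so far.
reader = Zlib::GzipReader.new(StringIO.new(gzipped.string))
lines = []
while (line = reader.gets)
  lines << line
end
p lines  # => ["alpha\n", "beta\n", "gamma\n"]
```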
That sounds reasonable, assuming it can be made to work. :-)
Is there a specific use case where you're streaming a gzipped file? Surprisingly, I haven't had any bug reports or feature requests around this previously.
It's not a file; it's a continuous stream of compressed data. That's why you can't one-shot it: the request never really ends.
Sorry, bad choice of language there.. that's what I meant. :-)
If you're streaming, the memory bloat associated with buffering the entire response is not an issue for you?
The LazyStringIO object prevents the whole response from being buffered by dropping the read portions of its buffer.
GzipReader pulls in data on demand provided you call it the right way, so memory usage is limited.
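To see the bounded-memory behavior concretely, here is a trimmed-down variant of the gist's LazyStringIO (copied here so the sketch runs without EventMachine installed), showing that consumed bytes are dropped from the internal buffer:

```ruby
# Minimal IO-like buffer that forgets data as it is read.
class LazyStringIO
  def initialize(string = "")
    @stream = string
  end

  def <<(string)
    @stream << string
  end

  def read(length = nil, buffer = nil)
    buffer ||= ""
    if length.nil?
      buffer << @stream
      @stream = ""
    else
      return nil if @stream.empty?
      buffer << @stream[0, length]
      @stream = @stream[length..-1] || ""
    end
    buffer
  end

  def size
    @stream.size
  end
end

io = LazyStringIO.new
io << "hello world"
chunk = io.read(5)  # consumes "hello"
puts chunk          # => hello
puts io.size        # => 6 -- only " world" is still buffered
```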
One problem I could see: if chunks are consistently bigger than lines, you'd start queuing up decompressed data that hasn't been passed to callbacks yet.
Maybe you could use readpartial in a loop to pull out everything in the current decompressed buffer. But readpartial blocks if there's no unzipped data available.
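A readpartial drain loop might look like the sketch below. Because readpartial would block on a live stream whenever no decompressed data is available (the caveat above), this sketch uses fully buffered in-memory input, where readpartial raises EOFError once everything has been pulled out:

```ruby
require 'zlib'
require 'stringio'

# Gzip some sample text in memory.
gzipped = StringIO.new
gz = Zlib::GzipWriter.new(gzipped)
gz.write("line one\nline two\nline three\n")
gz.close

# Drain the reader with readpartial until the input is exhausted;
# EOFError signals that nothing is left to decompress.
reader = Zlib::GzipReader.new(StringIO.new(gzipped.string))
decompressed = ""
begin
  loop { decompressed << reader.readpartial(1024) }
rescue EOFError
  # all decompressed data has been pulled out
end
puts decompressed
```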
I wish you could just use it like you did with Inflate, though. It might be fun to try to implement an equivalent API for Gzip.
Thanks, this helped a lot in getting my Powertrack client to interface with the Gnip console.
The monkey-patch part of this has been cleaned up and merged into em-http-request, which rocks!
Excellent!
@baroquebobcat Thanks for your work producing this. There is discussion that could use your attention at igrigorik/em-http-request#204
Hmm, interesting. Should @buf be reset after @gzip.readline? Part of the appeal of http.stream is that you don't have to buffer a giant blob of data.. if you do need the entire blob, you should be explicitly buffering it yourself at that point.