@baroquebobcat
Created December 12, 2011 19:11
streaming gzip
require 'rubygems'
require 'em-http-request'
require 'zlib' # Zlib::GzipReader is used below but was never required in the original
# Monkey-patched Gzip Decoder to handle
# Gzip streams.
#
# This takes advantage of the fact that
# Zlib::GzipReader takes an IO object &
# reads from it as it decompresses.
#
# It also relies on Zlib only checking for
# nil as the method of determining whether
# it has reached EOF.
#
# `IO#read(len, buf)` can also denote EOF by returning a string
# shorter than `len`, but Zlib doesn't care about that.
#
module EventMachine::HttpDecoders
  class GZip < Base
    class LazyStringIO
      def initialize(string = "")
        @stream = string
      end

      def <<(string)
        @stream << string
      end

      def read(length = nil, buffer = nil)
        buffer ||= ""
        length ||= 0
        buffer << @stream[0..(length - 1)]
        @stream = @stream[length..-1] || "" # guard: slicing past the end returns nil
        buffer
      end

      def size
        @stream.size
      end
    end

    def self.encoding_names
      %w(gzip compressed)
    end

    def decompress(compressed)
      @buf ||= LazyStringIO.new
      @buf << compressed
      # Zlib::GzipReader loads input in 2048-byte chunks
      if @buf.size > 2048
        @gzip ||= Zlib::GzipReader.new @buf
        @gzip.readline # lines are bigger than compressed chunks, so this works
        # You could also use #readpartial, but then you need to tune
        # the max length.
        # Don't use #read: it would try to read to EOF, which never
        # arrives on a continuous stream.
        # #readline uses #gets under the covers, so you could try that too.
      end
    end

    def finalize
      @gzip.read if @gzip # guard: @gzip is nil if fewer than 2048 bytes ever arrived
    end
  end
end
url = "my-streaming-url"
user = "my-user"
password = "my-password"

EventMachine.run do
  http = EventMachine::HttpRequest.new(url).get :head => {
    'Accept-Encoding' => 'gzip',
    'Authorization'   => [user, password] }

  http.headers do |hash|
    p [:status, http.response_header.status]
    p [:headers, hash]
    if http.response_header.status > 299
      puts 'unsuccessful request'
      EM.stop
    end
  end

  http.stream do |chunk|
    print chunk
  end

  http.callback do
    p "done"
    EM.stop
  end

  http.errback do
    puts "there was an error"
    p http.error
    p http.response
    EM.stop
  end
end
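The core trick can be sketched outside EventMachine. Below is a hedged, standalone illustration: `ChunkBuffer` is a hypothetical stand-in for the gist's `LazyStringIO` (not part of em-http-request), and `Zlib::GzipReader` pulls from it even though the compressed bytes were appended in pieces.

```ruby
require 'zlib'

# A minimal stand-in for LazyStringIO: an IO-like object that
# Zlib::GzipReader can pull from. It only needs #read; consumed
# bytes are dropped so memory stays bounded.
class ChunkBuffer
  def initialize
    @stream = +""
  end

  def <<(data)
    @stream << data
  end

  def read(length = nil, buffer = nil)
    buffer ||= +""
    length ||= @stream.size
    buffer << @stream[0, length].to_s
    @stream = @stream[length..-1] || ""
    buffer
  end
end

# Build a gzip blob of two lines, then append it to the buffer in two
# pieces to mimic network chunking before the reader consumes it.
blob = Zlib.gzip("hello\nworld\n")
buf = ChunkBuffer.new
buf << blob[0, blob.size / 2]   # first half of the compressed stream
buf << blob[blob.size / 2..-1]  # second half

reader = Zlib::GzipReader.new(buf)
puts reader.readline  # => "hello\n"
puts reader.readline  # => "world\n"
```
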
@igrigorik

Hmm, interesting. Should @buf be reset after @gzip.readline? Part of the appeal of http.stream is that you don't have to buffer a giant blob of data; if you do need the entire blob, you should be explicitly buffering it yourself at that point.

@baroquebobcat

I don't think we'd want to reset it, as it might contain more chunks that still need to be decompressed.
It only needs to buffer enough to keep GzipReader from complaining. On the other hand, after the headers have been processed, you could probably just call readpartial or gets until the buffer is empty, because GzipReader maintains enough state to keep decompressing.
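A quick sketch of the state-keeping claim above: GzipReader holds its inflate state between calls, so each successive gets picks up exactly where the previous one stopped (StringIO stands in for the streaming buffer here).

```ruby
require 'zlib'
require 'stringio'

# GzipReader keeps its decompression state between calls, so repeated
# gets calls continue mid-stream rather than restarting.
gz = Zlib::GzipReader.new(StringIO.new(Zlib.gzip("one\ntwo\nthree\n")))
p gz.gets  # => "one\n"
p gz.gets  # => "two\n"
p gz.gets  # => "three\n"
```
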

@igrigorik

That sounds reasonable, assuming it can be made to work. :-)

Is there a specific use case where you're streaming a gzipped file? Surprisingly, I haven't had any bug reports or feature requests around this previously.

@baroquebobcat

It's not a file; it's a continuous stream of compressed data, which is why you can't decompress it in one shot: the request never really ends.

@igrigorik

Sorry, bad choice of words there; that's what I meant. :-)

If you're streaming, the memory bloat associated with buffering the entire response is not an issue for you?

@baroquebobcat

The LazyStringIO object prevents the whole response from being buffered by dropping the portions of its buffer that have already been read.

GzipReader pulls in data on demand provided you call it the right way, so memory usage is limited.

One problem I could see is if chunks are consistently bigger than lines, you'd start queuing up decompressed data that hasn't been passed to callbacks yet.

Maybe you could use readpartial in a loop to pull out all of the currently decompressed buffer. But readpartial blocks when no decompressed data is available yet.

I wish you could just use it like you did with Inflate, though. It might be fun to try implementing an equivalent API for Gzip.
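For comparison, here is a hedged sketch of the Inflate-style API mentioned above: Zlib::Inflate can decode gzip directly when constructed with MAX_WBITS + 32 (which enables gzip/zlib header auto-detection), and it accepts compressed input one chunk at a time with no IO object at all.

```ruby
require 'zlib'

# Zlib::Inflate with MAX_WBITS + 32 auto-detects the gzip header, so
# compressed bytes can be fed in as they arrive -- no IO wrapper needed.
inflater = Zlib::Inflate.new(Zlib::MAX_WBITS + 32)
blob = Zlib.gzip("streamed data\n")

out = +""
# Feed the compressed bytes one at a time to mimic arbitrary chunking;
# inflate returns whatever output is available (possibly "").
blob.each_char { |byte| out << inflater.inflate(byte) }
p out  # => "streamed data\n"
```
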

@WizardOfOgz

Thanks, this helped a lot in getting my PowerTrack client to interface with the Gnip console.

@baroquebobcat

The monkey-patch part of this has been cleaned up and merged into em-http-request, which rocks!

igrigorik/em-http-request#186

@WizardOfOgz

Excellent!

@eriwen

eriwen commented Dec 21, 2012

@baroquebobcat Thanks for your work producing this. There is discussion that could use your attention at igrigorik/em-http-request#204
