@thejefflarson
Created January 5, 2016 18:48
extern crate hyper;
extern crate flate2;

use hyper::Client;
use std::io::{BufRead, BufReader};
use flate2::read::GzDecoder;

fn main() {
    // Fetch one gzipped WAT file from the Common Crawl public dataset.
    let client = Client::new();
    let res = client.get("https://aws-publicdatasets.s3.amazonaws.com/common-crawl/crawl-data/CC-MAIN-2015-40/segments/1443736672328.14/wat/CC-MAIN-20151001215752-00000-ip-10-137-6-227.ec2.internal.warc.wat.gz")
        .send()
        .unwrap();

    // Decompress the response body as it streams in.
    let decoder = GzDecoder::new(res).unwrap();
    let reader = BufReader::new(decoder);

    // Print the decompressed text line by line.
    for line in reader.lines() {
        println!("{}", line.unwrap());
    }
}
/*
Problem: this prints only the few lines below, but the decompressed file
should be many megabytes.

Output:
WARC/1.0
WARC-Type: warcinfo
WARC-Date: 2015-11-08T20:11:50Z
WARC-Filename: CC-MAIN-20151001215752-00000-ip-10-137-6-227.ec2.internal.warc.gz
WARC-Record-ID: <urn:uuid:64b2a8b0-ad11-406f-8619-0d615ff46a0b>
Content-Type: application/warc-fields
Content-Length: 108
Software-Info: ia-web-commons.1.0-SNAPSHOT-20151107044109
Extracted-Date: Sun, 08 Nov 2015 20:11:50 GMT
*/