@anjackson
Created August 1, 2015 07:57
Example WARC from wget
WARC/1.0
WARC-Type: warcinfo
Content-Type: application/warc-fields
WARC-Date: 2015-07-31T16:32:22Z
WARC-Record-ID: <urn:uuid:CD4DD5EA-710A-43A4-9E75-2238B9664926>
WARC-Filename: humans.warc.gz
WARC-Block-Digest: sha1:AARITJBDT4LFDLBOUU63IJAD2MK7WFL3
Content-Length: 241

software: Wget/1.16.3 (darwin14.1.0)
format: WARC File Format 1.0
conformsTo: http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf
robots: classic
wget-arguments: "--warc-file=humans" "https://www.google.com/humans.txt"


WARC/1.0
WARC-Type: request
WARC-Target-URI: https://www.google.com/humans.txt
Content-Type: application/http;msgtype=request
WARC-Date: 2015-07-31T16:32:23Z
WARC-Record-ID: <urn:uuid:A69CB3BD-6FED-40A8-891A-A9D0BA5A7577>
WARC-IP-Address: 74.125.24.106
WARC-Warcinfo-ID: <urn:uuid:CD4DD5EA-710A-43A4-9E75-2238B9664926>
WARC-Block-Digest: sha1:A4JBGO5JITBQTSACSH347OTILSQAOEQZ
Content-Length: 154

GET /humans.txt HTTP/1.1
User-Agent: Wget/1.16.3 (darwin14.1.0)
Accept: */*
Accept-Encoding: identity
Host: www.google.com
Connection: Keep-Alive



WARC/1.0
WARC-Type: response
WARC-Record-ID: <urn:uuid:EBEED9E3-7FD9-4BBA-B82B-3C801069F459>
WARC-Warcinfo-ID: <urn:uuid:CD4DD5EA-710A-43A4-9E75-2238B9664926>
WARC-Concurrent-To: <urn:uuid:A69CB3BD-6FED-40A8-891A-A9D0BA5A7577>
WARC-Target-URI: https://www.google.com/humans.txt
WARC-Date: 2015-07-31T16:32:23Z
WARC-IP-Address: 74.125.24.106
WARC-Block-Digest: sha1:TXMPRCEB7AODZHBD5C27W6VG6ZJSOVDB
WARC-Payload-Digest: sha1:VG5HLEL4BBSYEVABD3URLB5ZW7O3XBOY
Content-Type: application/http;msgtype=response
Content-Length: 687

HTTP/1.1 200 OK
Vary: Accept-Encoding
Content-Type: text/plain
Last-Modified: Tue, 11 Mar 2014 22:11:10 GMT
Date: Fri, 31 Jul 2015 16:32:23 GMT
Expires: Fri, 31 Jul 2015 16:32:23 GMT
Cache-Control: private, max-age=0
X-Content-Type-Options: nosniff
Server: sffe
X-XSS-Protection: 1; mode=block
Alternate-Protocol: 443:quic,p=1
Accept-Ranges: none
Transfer-Encoding: chunked

11e
Google is built by a large team of engineers, designers, researchers, robots, and others in many different sites across the globe. It is updated continuously, and built with more tools and technologies than we can shake a stick at. If you'd like to help us out, see google.com/careers.
0



WARC/1.0
WARC-Type: metadata
WARC-Record-ID: <urn:uuid:B0B3862C-B271-4670-A4B5-B127576C6118>
WARC-Warcinfo-ID: <urn:uuid:CD4DD5EA-710A-43A4-9E75-2238B9664926>
WARC-Target-URI: metadata://gnu.org/software/wget/warc/MANIFEST.txt
WARC-Date: 2015-07-31T16:32:23Z
WARC-Block-Digest: sha1:NEIRM547MH3YUQMT75OCLPB7ERKNBQHL
Content-Type: text/plain
Content-Length: 48

<urn:uuid:CD4DD5EA-710A-43A4-9E75-2238B9664926>


WARC/1.0
WARC-Type: resource
WARC-Record-ID: <urn:uuid:2491AF6D-D1AA-4072-8893-4C3DF2C6E0AF>
WARC-Warcinfo-ID: <urn:uuid:CD4DD5EA-710A-43A4-9E75-2238B9664926>
WARC-Concurrent-To: <urn:uuid:B0B3862C-B271-4670-A4B5-B127576C6118>
WARC-Target-URI: metadata://gnu.org/software/wget/warc/wget_arguments.txt
WARC-Date: 2015-07-31T16:32:23Z
WARC-Block-Digest: sha1:PXARGWNHTQPELVXL6XZFQWOCBB5KIGIQ
Content-Type: text/plain
Content-Length: 58

"--warc-file=humans" "https://www.google.com/humans.txt"


WARC/1.0
WARC-Type: resource
WARC-Record-ID: <urn:uuid:ACDDCC95-8802-432E-991F-2B4F1037A63B>
WARC-Warcinfo-ID: <urn:uuid:CD4DD5EA-710A-43A4-9E75-2238B9664926>
WARC-Concurrent-To: <urn:uuid:B0B3862C-B271-4670-A4B5-B127576C6118>
WARC-Target-URI: metadata://gnu.org/software/wget/warc/wget.log
WARC-Date: 2015-07-31T16:32:23Z
WARC-Block-Digest: sha1:XREPNKKJBBSGFL37GKDOEJFI2DMVZBMT
Content-Type: text/plain
Content-Length: 472

Opening WARC file 'humans.warc.gz'.
--2015-07-31 17:32:22-- https://www.google.com/humans.txt
Resolving www.google.com... 74.125.24.106, 74.125.24.105, 74.125.24.99, ...
Connecting to www.google.com|74.125.24.106|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/plain]
Saving to: 'humans.txt'
0K 6.49M=0s
2015-07-31 17:32:23 (6.49 MB/s) - 'humans.txt' saved [286]
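
As a rough cross-check of the listing above, the following Python sketch rebuilds the block of the request record and compares its byte length with the declared Content-Length. It also converts the chunk-size line `11e` from the response to decimal, which should line up with the `[286]` that wget.log reports. The helper name `parse_headers` and the variable names are hypothetical, not part of wget or any WARC library, and the sketch assumes CRLF line endings plus the empty line that terminates an HTTP header block.

```python
# Minimal sketch: sanity-check two numbers from the WARC listing.
# parse_headers and all variable names are illustrative assumptions.

CRLF = "\r\n"


def parse_headers(lines):
    """Turn 'Name: value' lines into a dict, splitting on the first colon only."""
    headers = {}
    for line in lines:
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return headers


# A few WARC headers copied from the request record.
warc_headers = parse_headers([
    "WARC-Type: request",
    "WARC-Target-URI: https://www.google.com/humans.txt",
    "Content-Type: application/http;msgtype=request",
    "Content-Length: 154",
])

# The HTTP request lines copied from the same record.
http_request_lines = [
    "GET /humans.txt HTTP/1.1",
    "User-Agent: Wget/1.16.3 (darwin14.1.0)",
    "Accept: */*",
    "Accept-Encoding: identity",
    "Host: www.google.com",
    "Connection: Keep-Alive",
]

# Assumption: every line ends in CRLF and an empty line closes the header block.
block = (CRLF.join(http_request_lines) + CRLF + CRLF).encode("ascii")

declared_length = int(warc_headers["Content-Length"])
chunk_size = int("11e", 16)  # chunk sizes in chunked transfer encoding are hexadecimal

print(len(block), declared_length, chunk_size)
```

The URI value survives parsing because `partition` splits only on the first colon. The same length check is not attempted for the other records here, since their exact line endings and trailing whitespace cannot be read off the listing.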