@Veejay
Created July 19, 2016 22:54
Extracts generator meta information from WAT files
require 'json'

# Collects the <meta name="generator"> value for each record of a WAT file.
class WatExtractor
  attr_reader :file

  def initialize(file_name)
    @file = File.new(file_name)
  end

  def target_uri?(line)
    line.start_with?('WARC-Target-URI:')
  end

  def envelope?(line)
    line.start_with?('{')
  end

  # Returns an array of hashes with the keys 'url' and 'generator'.
  def process
    generators = []
    current = {}
    file.each_line do |line|
      if target_uri?(line)
        # chomp drops the trailing "\r\n" so the URL is captured cleanly.
        match_data = /\AWARC-Target-URI: (?<url>.*)\z/.match(line.chomp)
        current.store('url', match_data[:url]) if match_data
      elsif envelope?(line) && line.include?('"generator"')
        json = JSON.parse(line)
        # dig returns nil instead of raising when a level is missing.
        meta_tags = json.dig('Envelope', 'Payload-Metadata', 'HTTP-Response-Metadata',
                             'HTML-Metadata', 'Head', 'Metas') || []
        generator = meta_tags.detect { |tag| tag['name'].eql?('generator') }
        current.store('generator', (generator || {}).fetch('content', ''))
        generators.push(current)
        current = {}
      end
    end
    generators
  end
end

Veejay commented Jul 19, 2016

TODO

  • Don't process the same host twice, since the generator will be the same (keep an MD5 digest of each host already seen for quick comparison)
  • Reduce memory footprint as much as possible
  • Clean up the code (only process should be a public method)
  • Have the main loop push results onto a Queue object and have a Thread write them to a file, so that a crash doesn't force us to reprocess everything
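
A minimal sketch of how the first and last items could fit together, assuming records shaped like the hashes that process returns. The helpers host_of and write_unique_hosts are hypothetical names, not part of the gist. A Set of host strings stands in for the MD5 digests mentioned above, since Set already hashes its members.

```ruby
require 'json'
require 'set'
require 'uri'

# Hypothetical helper: returns the host of a URL, or nil if it cannot be parsed.
def host_of(url)
  URI.parse(url.to_s).host
rescue URI::InvalidURIError
  nil
end

# Hypothetical helper: appends one JSON object per line to output_path for the
# first record seen on each host. Returns the number of distinct hosts queued.
def write_unique_hosts(records, output_path)
  queue = Queue.new
  writer = Thread.new do
    File.open(output_path, 'a') do |out|
      # Queue#pop blocks until an item arrives; nil is used as the stop signal.
      while (record = queue.pop)
        out.puts(JSON.generate(record))
        out.flush # hand each record to the OS right away
      end
    end
  end

  seen_hosts = Set.new
  begin
    records.each do |record|
      host = host_of(record['url'])
      next if host.nil?
      # Set#add? returns nil when the host is already in the set.
      queue.push(record) if seen_hosts.add?(host)
    end
  ensure
    queue.push(nil) # always stop the writer, even if an error was raised
    writer.join
  end
  seen_hosts.size
end
```

It could be called as write_unique_hosts(WatExtractor.new('example.wat').process, 'generators.jsonl'), where both file names are placeholders. That call still builds the whole array in memory first. To address the memory item as well, the push would have to move inside the loop in process.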
