Extracts generator meta information from WAT files
require 'json'

# Collects the <meta name="generator"> value for each URL in a WAT file.
class WatExtractor
  # JSON path to the list of <meta> tags inside a WAT envelope.
  METAS_PATH = %w[Envelope Payload-Metadata HTTP-Response-Metadata
                  HTML-Metadata Head Metas].freeze

  attr_reader :file

  def initialize(file_name)
    @file = File.new(file_name)
  end

  def target_uri?(line)
    line.start_with?('WARC-Target-URI:')
  end

  def envelope?(line)
    line.start_with?('{')
  end

  # Returns an array of { 'url' => ..., 'generator' => ... } hashes.
  def process
    generators = []
    current = {}
    file.each_line do |line|
      if target_uri?(line)
        # chomp handles both "\r\n" and "\n" line endings.
        match_data = /\AWARC-Target-URI: (?<url>.*)\z/.match(line.chomp)
        current.store('url', match_data[:url]) if match_data
      elsif envelope?(line) && line.include?('"generator"')
        # dig returns nil instead of raising when a key is missing.
        meta_tags = JSON.parse(line).dig(*METAS_PATH) || []
        generator = meta_tags.detect { |tag| tag['name'].eql?('generator') }
        current.store('generator', (generator || {}).fetch('content', ''))
        generators.push(current)
        current = {}
      end
    end
    generators
  end
end
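
To make the input format concrete, here is a minimal sketch of the two kinds of lines the extractor reacts to. The sample record is made up for illustration (the URL, the meta tags, and the generator value are assumptions, not taken from a real crawl); only the header name and the JSON key path come from the code above.

```ruby
require 'json'

# Hypothetical sample lines, shaped like the ones WatExtractor reads.
header_line = "WARC-Target-URI: http://example.com/\r\n"
envelope_line = JSON.generate(
  'Envelope' => {
    'Payload-Metadata' => {
      'HTTP-Response-Metadata' => {
        'HTML-Metadata' => {
          'Head' => {
            'Metas' => [
              { 'name' => 'viewport', 'content' => 'width=device-width' },
              { 'name' => 'generator', 'content' => 'WordPress 4.5.3' }
            ]
          }
        }
      }
    }
  }
)

# Capture everything after the field name; chomp strips "\r\n" or "\n".
url = /\AWARC-Target-URI: (?<url>.*)\z/.match(header_line.chomp)[:url]

# dig returns nil instead of raising when a key along the path is missing.
metas = JSON.parse(envelope_line).dig(
  'Envelope', 'Payload-Metadata', 'HTTP-Response-Metadata',
  'HTML-Metadata', 'Head', 'Metas'
) || []
generator = metas.detect { |tag| tag['name'] == 'generator' }

sample_record = {
  'url' => url,
  'generator' => (generator || {}).fetch('content', '')
}
puts sample_record.inspect
```

A record without a generator tag would end up with an empty string under 'generator', which matches the fallback used in process.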
@Veejay (owner, author) commented Jul 19, 2016

TODO

  • Don't process the same host twice, on the assumption that every page on a host reports the same generator (keep an MD5 digest of each host seen so far for quick comparison)
  • Reduce the memory footprint as much as possible
  • Clean up the code (only process should be a public method)
  • Have the main loop push results into a Queue object and have a Thread write them to a file, so that a crash doesn't force us to reprocess everything
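
The host check and the Queue/Thread items above could be sketched roughly as follows. This is only an illustration under stated assumptions: the records array is a hypothetical stand-in for what process produces, host_of is a helper invented for this sketch, a Tempfile stands in for the real output file, and a Set of plain host strings replaces the MD5 digests mentioned in the list (digests would work the same way, keyed on the digest instead of the host).

```ruby
require 'json'
require 'set'
require 'uri'
require 'tempfile'

# Hypothetical stand-in for the records that WatExtractor#process builds.
records = [
  { 'url' => 'http://example.com/a', 'generator' => 'WordPress 4.5.3' },
  { 'url' => 'http://example.com/b', 'generator' => 'WordPress 4.5.3' },
  { 'url' => 'http://example.org/',  'generator' => 'Hugo 0.16' }
]

# Returns the host of a URL, or nil when the URL cannot be parsed.
def host_of(url)
  URI.parse(url).host
rescue URI::InvalidURIError
  nil
end

output = Tempfile.new(['generators', '.jsonl'])
queue = Queue.new

# Writer thread: drains the queue and appends one JSON document per line.
# Flushing after each record keeps finished work on disk if the run dies.
writer = Thread.new do
  while (record = queue.pop) != :done
    output.puts(JSON.generate(record))
    output.flush
  end
end

seen_hosts = Set.new
records.each do |record|
  host = host_of(record['url'])
  next if host.nil?
  # Set#add? returns nil when the host was already present.
  next unless seen_hosts.add?(host)
  queue << record
end

queue << :done # sentinel telling the writer to stop
writer.join

output.rewind
written = output.each_line.map { |line| JSON.parse(line) }
puts written.map { |r| r['url'] }.inspect
# prints ["http://example.com/a", "http://example.org/"]
```

Writing one JSON document per line means a partially written file can still be read back line by line, so a restarted run could skip the hosts it already contains.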