@masao
Created October 18, 2011 12:28
#!/usr/bin/env ruby
# $Id$
require "time"
require "uri"
require "csv"
dupcount = {}
timecount = {}
pdfcount = Hash.new( 0 )
journalcount = Hash.new( 0 )
ARGF.each do |line|
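  # After 2011-03:
  m = line.chomp.match( /\A([0-9\.]+) \S+ \S+ \[(.+?)\] "\S*" "\S*" "(\w+) (\S+) (\S+)" (\d+) (\S*) (\S*) \S* \S* "([^\"]*)" "([^\"]*)" "([^\"]*)" "([^\"]*)" "([^\"]*)"\Z/o )
  unless m
    # Report and skip lines that do not match the expected log format
    # instead of failing with NoMethodError on nil.
    warn "unparsed line: #{ line }"
    next
  end
  ip, time, method, url, http_ver, http_code, mimetype, size, referer, useragent, dummy = m.captures
  # Timestamp looks like "18/Oct/2011:12:28:00 +0900"; split it into its fields.
  t = time.split( /[:\/ ]/ )
  utime = Time.parse( "#{ t[0] } #{ t[1] } #{ t[2] } #{t[3]}:#{t[4]}:#{t[5]} #{ t[6] }" )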
  # Before 2011-03:
  # date, time_s, proto, ip, d1,d2, useragent, d3, d4, d5, host, d6, d7, d8, d9, url, = CSV.parse_line( line )
  # method = "GET"
  # mimetype = "text/html"
  # http_code = "200"
  # http_ver = "HTTP/1.0"
  # size = -1
  # referer = ""
  # utime = Time.parse( "#{ date } #{ time_s } +0900" )
  #p line if data.empty?
  #p useragent
  #p ip
  next if method == "CONNECT"
  case mimetype
  when "application/x-javascript", "application/json", "text/xml", "text/css", /\Aimage\/./o
    next
  end
  case http_code
  when "302", "0"
    next
  end
  #next if url =~ /jquery.*\.min\.js\Z/o
  #next if url =~ /\.min\.js\Z/o
  next if url =~ /\.js\Z/o
  next if url =~ /\.(gif|png|jpg)\Z/o
  next if url =~ /\.css\Z/o
  next if url =~ /\Ahttp:\/\/onlinelibrary\.wiley\.com/o
  begin
    uri = URI.parse( url )
    next if uri.host !~ /\.aps\.org\Z/
    next if uri.host == "feeds.aps.org"
    next if uri.host == "tesseract-assets.aps.org"
    next if uri.path == "/favicon.ico"
  rescue URI::InvalidURIError => e
    p url
    p line
    p e
    next
  end
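  # Skip a repeat of the same URL from the same IP within 30 seconds.
  # The elapsed time is (current - previous); the reverse order is never
  # positive for chronological logs and would drop every repeat.
  if dupcount[ ip + url ] and ( utime - dupcount[ ip + url ] ) < 30
    next
  else
    dupcount[ ip + url ] = utime
  end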
  timecount[ utime.hour ] ||= []
  timecount[ utime.hour ] << ip
  if url =~ /\/pdf/
    #puts url
    pdfcount[ utime.hour ] += 1
  end
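  # Use the captured subdomain string as the key, not the one-element
  # array returned by MatchData#captures.
  hostname = uri.host.match( /\A(.*)\.aps\.org\Z/o )[ 1 ]
  journalcount[ hostname ] += 1
  puts [ ip, utime.iso8601, method, uri, http_ver, http_code, mimetype, size, referer, useragent ].join( "\t" )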
end
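# Per-hour summary: hour, unique IPs, total hits, PDF hits.
timecount.keys.sort.each do |h|
  puts [ h, timecount[h].uniq.size, timecount[h].size, pdfcount[h] ].join( "\t" )
end
# Per-journal summary: subdomain of aps.org, hit count.
journalcount.keys.sort.each do |j|
  puts [ j, journalcount[j] ].join( "\t" )
end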
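
The duplicate filter in the main loop is meant to drop a repeated request for the same URL from the same IP when it arrives within 30 seconds of the last counted one. The sketch below isolates that rule so it can be checked without a log file. The helper name `duplicate_hit?`, its `window` parameter, and the sample IP and URL are assumptions made for illustration and are not part of the original script. Like the loop above, it leaves the stored timestamp unchanged when a hit is dropped.

```ruby
require "time"

# Hypothetical helper (not in the original script): returns true when `key`
# was last recorded less than `window` seconds before `now`. Otherwise it
# records `now` for `key` and returns false.
def duplicate_hit?( seen, key, now, window = 30 )
  last = seen[ key ]
  return true if last and ( now - last ) < window
  seen[ key ] = now
  false
end

seen = {}
key = "192.0.2.1" + "http://prl.aps.org/"   # ip + url, as in the main loop
t0 = Time.parse( "2011-10-18T12:28:00+09:00" )

first  = duplicate_hit?( seen, key, t0 )        # first sighting is counted
repeat = duplicate_hit?( seen, key, t0 + 10 )   # 10 s later: dropped
later  = duplicate_hit?( seen, key, t0 + 45 )   # 45 s after the counted hit: counted
p [ first, repeat, later ]   # prints [false, true, false]
```

Because the timestamp is only refreshed on counted hits, a client that reloads every 10 seconds is still counted once per 30-second window rather than being suppressed indefinitely.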