@Te-k
Last active November 22, 2018 15:04
Parse Apache access logs
import re
from urllib.parse import parse_qs

# Apache "combined" log format:
# ip, identity, user, [time], "request line", status, size, "referer", "user agent"
regex = re.compile(r'^(?P<ip>\S+)\s+-\s*(?P<userid>\S+)\s+\[(?P<datetime>[^\]]+)\]\s+"(?P<method>[A-Z]+)\s*(?P<request>[^ "]+)?\s*(HTTP/(?P<http_version>[0-9.]+))?"\s+(?P<status>[0-9]{3})\s+(?P<size>[0-9]+|-)\s+"(?P<referer>[^"]*)"\s+"(?P<user_agent>[^"]*)"')


def parse_log(log):
    """Parse one access log line into a dict of named fields."""
    res = regex.match(log)
    if not res:
        raise ValueError('Invalid log format')
    res = res.groupdict()
    if res['request']:
        if '?' in res['request']:
            pos = res['request'].find('?')
            res['path'] = res['request'][:pos]
            # Skip the '?' itself so the first key is not parsed as '?key'
            res['query'] = parse_qs(res['request'][pos + 1:])
        else:
            res['path'] = res['request']
            res['query'] = {}
    return res
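
The `datetime` group comes back as a raw string such as `22/Nov/2018:15:04:00 +0000`. A minimal sketch of turning it into a timezone-aware `datetime` object, assuming Apache's default timestamp layout; `parse_apache_time` is a hypothetical helper name, not part of the gist:

```python
from datetime import datetime


def parse_apache_time(value):
    """Convert an Apache timestamp string into an aware datetime."""
    # Assumption: default Apache layout, e.g. 22/Nov/2018:15:04:00 +0000.
    # %b matches the current locale's month abbreviations, so this
    # expects an English (or C) locale.
    return datetime.strptime(value, '%d/%b/%Y:%H:%M:%S %z')


print(parse_apache_time('22/Nov/2018:15:04:00 +0000').isoformat())
# prints 2018-11-22T15:04:00+00:00
```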

Te-k commented Nov 22, 2018

Not perfect: it does not handle IPv6 and invalid URLs with spaces.
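
One possible way to tolerate request targets that contain spaces is to match the target lazily up to an optional trailing ` HTTP/x.y`. The sketch below is a variant, not the gist's own code: `LOG_RE` and `parse_log_lenient` are hypothetical names, and the client address is accepted as any non-whitespace token, so an IPv6 literal passes through without being validated.

```python
import re
from urllib.parse import parse_qs

# Hypothetical variant of the gist's pattern: the request target is
# matched lazily, so it may contain spaces; the optional " HTTP/x.y"
# suffix marks where the target ends.
LOG_RE = re.compile(
    r'^(?P<ip>\S+)\s+\S+\s+(?P<userid>\S+)\s+'
    r'\[(?P<datetime>[^\]]+)\]\s+'
    r'"(?P<method>[A-Z]+)\s+(?P<request>[^"]*?)'
    r'(?:\s+HTTP/(?P<http_version>[0-9.]+))?"\s+'
    r'(?P<status>[0-9]{3})\s+(?P<size>[0-9]+|-)\s+'
    r'"(?P<referer>[^"]*)"\s+"(?P<user_agent>[^"]*)"'
)


def parse_log_lenient(line):
    """Parse one combined-format log line, allowing spaces in the URL."""
    match = LOG_RE.match(line)
    if not match:
        raise ValueError('Invalid log format')
    res = match.groupdict()
    # partition() splits on the first '?'; the query part is empty
    # when the request has no query string.
    path, _, query = res['request'].partition('?')
    res['path'] = path
    res['query'] = parse_qs(query)
    return res


sample = ('2001:db8::1 - - [22/Nov/2018:15:04:00 +0000] '
          '"GET /a b/c?x=1 HTTP/1.1" 200 512 "-" "curl/7.61.0"')
parsed = parse_log_lenient(sample)
print(parsed['ip'], parsed['path'], parsed['query'])
```

The lazy match trades strictness for tolerance: a line whose request field is malformed in other ways may still be accepted, so the result is worth validating if it feeds anything security-sensitive.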