challenge

Implement a script to parse log files stored in S3, apply some simple filtering, and compute summary statistics. You can use any scripting tools you are comfortable with; using Python is a bonus.

We should be able to re-run the script and reproduce the results.

You are free to use any internet resources to complete this script.

Use the following log file located in S3: https://s3.amazonaws.com/buzzfeed-sre-test/2017-06-05-buzzfeed-web.log.gz

The log format is the Apache combined format:

LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined

For a reference on what these fields represent, see the Apache documentation: https://httpd.apache.org/docs/1.3/mod/mod_log_config.html#formats
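
For illustration, the combined format above can be parsed with a single regular expression. The sketch below assumes Python; the group names (host, request, status, user_agent, and so on) are illustrative and not part of the challenge.

```python
import re

# A regex mirroring the combined LogFormat shown above. Group names are illustrative.
COMBINED_RE = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

def parse_line(line):
    """Return a dict of fields for one combined-format log line, or None if it doesn't match."""
    match = COMBINED_RE.match(line)
    return match.groupdict() if match else None
```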

Please create a script that will:

  • give statistics on the HTTP status codes
  • give statistics on user agents (browsers)
  • report the top 5 most popular URLs

Feel free to add any additional statistics for bonus points; a minimal end-to-end sketch follows the list.
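
The sketch below shows one possible approach, assuming the S3 object is publicly readable over plain HTTPS (so no AWS credentials are needed; boto3 would also work). Function names and the output format are illustrative, not part of the challenge.

```python
#!/usr/bin/env python3
"""Minimal sketch, not a full solution: fetch the gzipped log over HTTPS,
parse each line with the combined-format regex from the sketch above, and
print the requested summaries."""
import gzip
import re
import urllib.request
from collections import Counter

LOG_URL = "https://s3.amazonaws.com/buzzfeed-sre-test/2017-06-05-buzzfeed-web.log.gz"

# Same regex as the parsing sketch above, repeated so this file runs standalone.
COMBINED_RE = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

def fetch_lines(url=LOG_URL):
    """Download and decompress the log, yielding one decoded line at a time."""
    with urllib.request.urlopen(url) as resp:
        data = gzip.decompress(resp.read())
    for raw in data.splitlines():
        yield raw.decode("utf-8", errors="replace")

def summarize(lines):
    """Count HTTP status codes, user agents, and requested URLs."""
    statuses, agents, urls = Counter(), Counter(), Counter()
    for line in lines:
        match = COMBINED_RE.match(line)
        if not match:
            continue  # skip malformed lines rather than failing
        statuses[match.group("status")] += 1
        agents[match.group("user_agent")] += 1
        # The request field looks like "GET /path HTTP/1.1"; keep just the path.
        parts = match.group("request").split()
        if len(parts) >= 2:
            urls[parts[1]] += 1
    return statuses, agents, urls

if __name__ == "__main__":
    statuses, agents, urls = summarize(fetch_lines())
    print("HTTP status codes:")
    for code, count in statuses.most_common():
        print(f"  {code}: {count}")
    print("\nTop 5 user agents:")
    for agent, count in agents.most_common(5):
        print(f"  {count:>8}  {agent}")
    print("\nTop 5 URLs:")
    for path, count in urls.most_common(5):
        print(f"  {count:>8}  {path}")
```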
