Implement a script that parses log files stored in S3, applies some simple filtering, and computes summary statistics. You may use any scripting tools you are comfortable with; using Python is a bonus.
We should be able to re-run the script to reproduce the results.
You are free to use any internet resources to complete this script.
Use the following log file located in S3: https://s3.amazonaws.com/buzzfeed-sre-test/2017-06-05-buzzfeed-web.log.gz
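As a sketch of one approach, the gzipped log could be streamed straight from S3 over HTTPS and decompressed on the fly, without writing it to disk first (the function names here are illustrative, not required):

```python
import gzip
import urllib.request

LOG_URL = "https://s3.amazonaws.com/buzzfeed-sre-test/2017-06-05-buzzfeed-web.log.gz"

def read_log_lines(fileobj):
    """Yield decoded text lines from a gzipped file-like object."""
    with gzip.GzipFile(fileobj=fileobj) as gz:
        for raw in gz:
            yield raw.decode("utf-8", errors="replace")

def fetch_log_lines(url=LOG_URL):
    """Open the S3 URL and stream its lines through the gzip decoder."""
    return read_log_lines(urllib.request.urlopen(url))
```

Streaming keeps memory usage flat even for large logs, and re-running the script against the same URL reproduces the same results.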
The log file uses the Apache combined log format:
LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined
For a reference on what these fields represent, see the Apache documentation: https://httpd.apache.org/docs/1.3/mod/mod_log_config.html#formats
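One way to split each line into its fields is a regular expression that mirrors the combined format above; this is a minimal sketch, and the group names chosen here are just illustrative:

```python
import re

# Matches: host ident user [time] "request" status size "referer" "user-agent"
COMBINED_RE = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_line(line):
    """Return a dict of log fields, or None if the line does not match."""
    m = COMBINED_RE.match(line)
    return m.groupdict() if m else None
```

Returning `None` for malformed lines lets the caller count (or skip) lines that do not conform to the format instead of crashing mid-file.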
Please create a script that will:
- give us statistics on HTTP status codes
- statistics on user agents (browsers)
- the top 5 most popular URLs
Feel free to add any additional statistics for bonus points.
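The three required statistics can be sketched with `collections.Counter`, assuming each parsed line is a dict with `status`, `agent`, and `request` keys (hypothetical field names a parsing step might produce):

```python
from collections import Counter

def summarize(records):
    """Aggregate status codes, user agents, and top-5 URLs from parsed records.

    `records` is an iterable of dicts with "status", "agent", and "request"
    keys (assumed field names from an upstream parsing step).
    """
    statuses = Counter()
    agents = Counter()
    urls = Counter()
    for r in records:
        statuses[r["status"]] += 1
        agents[r["agent"]] += 1
        parts = r["request"].split()
        if len(parts) == 3:          # "METHOD /path HTTP/x.x"
            urls[parts[1]] += 1
    return statuses, agents, urls.most_common(5)
```

`Counter.most_common(5)` returns the five highest-count URLs in descending order, which covers the "top 5 popular urls" requirement directly.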