@zilista
Created October 14, 2019 14:20
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Installation"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Install the package with pip:\n",
"\n",
"    pip install apache-log-parser\n",
"\n",
"README: https://github.com/rory/apache-log-parser/blob/master/README.md"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Usage"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"    import apache_log_parser\n",
"\n",
"    line_parser = apache_log_parser.make_parser(\"%v %h %l %u %t \\\"%r\\\" %>s %b\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Supported values"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" '%a' #\tRemote IP-address\n",
" '%A' #\tLocal IP-address\n",
" '%B' #\tSize of response in bytes, excluding HTTP headers.\n",
" '%b' #\tSize of response in bytes, excluding HTTP headers. In CLF format, i.e. a '-' rather than a 0 when no bytes are sent.\n",
" '%D' #\tThe time taken to serve the request, in microseconds.\n",
" '%f' #\tFilename\n",
" '%h' #\tRemote host\n",
" '%H' #\tThe request protocol\n",
" '%k' #\tNumber of keepalive requests handled on this connection. Interesting if KeepAlive is being used, so that, for example, a '1' means the first keepalive request after the initial one, '2' the second, etc...; otherwise this is always 0 (indicating the initial request). Available in versions 2.2.11 and later.\n",
" '%l' #\tRemote logname (from identd, if supplied). This will return a dash unless mod_ident is present and IdentityCheck is set On.\n",
" '%m' #\tThe request method\n",
" '%p' #\tThe canonical port of the server serving the request\n",
" '%P' #\tThe process ID of the child that serviced the request.\n",
" '%q' #\tThe query string (prepended with a ? if a query string exists, otherwise an empty string)\n",
" '%r' #\tFirst line of request\n",
" '%R' #\tThe handler generating the response (if any).\n",
" '%s' #\tStatus. For requests that got internally redirected, this is the status of the *original* request --- %>s for the last.\n",
" '%t' #\tTime the request was received (standard English format)\n",
" '%T' #\tThe time taken to serve the request, in seconds.\n",
" '%u' #\tRemote user (from auth; may be bogus if return status (%s) is 401)\n",
" '%U' #\tThe URL path requested, not including any query string.\n",
" '%v' #\tThe canonical ServerName of the server serving the request.\n",
" '%V' #\tThe server name according to the UseCanonicalName setting.\n",
" '%X' #\tConnection status when response is completed:\n",
" # X =\tconnection aborted before the response completed.\n",
" # + =\tconnection may be kept alive after the response is sent.\n",
" # - =\tconnection will be closed after the response is sent.\n",
" # (This directive was %c in late versions of Apache 1.3, but this conflicted with the historical ssl %{var}c syntax.)\n",
" '%I' #\tBytes received, including request and headers, cannot be zero. You need to enable mod_logio to use this.\n",
" '%O' #\tBytes sent, including headers, cannot be zero. You need to enable mod_logio to use this.\n",
" \n",
" '%\\{User-Agent\\}i' # Special case of below, for matching just user agent\n",
" '%\\{[^\\}]+?\\}i' #\tThe contents of Foobar: header line(s) in the request sent to the server. Changes made by other modules (e.g. mod_headers) affect this. If you're interested in what the request header was prior to when most modules would have modified it, use mod_setenvif to copy the header into an internal environment variable and log that value with the %\\{VARNAME}e described below.\n",
" \n",
" '%\\{[^\\}]+?\\}C' #\tThe contents of cookie Foobar in the request sent to the server. Only version 0 cookies are fully supported.\n",
" '%\\{[^\\}]+?\\}e' #\tThe contents of the environment variable FOOBAR\n",
" '%\\{[^\\}]+?\\}n' #\tThe contents of note Foobar from another module.\n",
" '%\\{[^\\}]+?\\}o' #\tThe contents of Foobar: header line(s) in the reply.\n",
" '%\\{[^\\}]+?\\}p' #\tThe canonical port of the server serving the request or the server's actual port or the client's actual port. Valid formats are canonical, local, or remote.\n",
" '%\\{[^\\}]+?\\}P' #\tThe process ID or thread id of the child that serviced the request. Valid formats are pid, tid, and hextid. hextid requires APR 1.2.0 or higher.\n",
" '%\\{[^\\}]+?\\}t' #\tThe time, in the form given by format, which should be in strftime(3) format. (potentially localized)\n",
" '%\\{[^\\}]+?\\}x' # Extension value, e.g. mod_ssl protocol and cipher"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"import apache_log_parser\n",
"import csv"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"!cat access.log.1 access.log.2 > all_log.log"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"# Create log.csv and write the header row with the column names.\n",
"data = [['remote_host', 'server_name2', 'query_string', 'time_received_isoformat', 'request_method', 'request_url', 'request_http_ver', 'request_url_scheme', 'request_url_query', 'status', 'response_bytes_clf', 'request_header_user_agent', 'request_header_user_agent__browser__family', 'request_header_user_agent__browser__version_string', 'request_header_user_agent__os__family', 'request_header_user_agent__os__version_string', 'request_header_user_agent__is_mobile']]\n",
"\n",
"# The with block closes the file; newline='' lets the csv module control line endings.\n",
"with open('log.csv', 'w', newline='') as csv_file:\n",
"    writer = csv.writer(csv_file)\n",
"    writer.writerows(data)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"# Read all_log.log line by line, parse each line, and append the parsed fields to log.csv.\n",
"\n",
"# Build the parser once instead of on every line; the format must match the server's LogFormat.\n",
"line_parser = apache_log_parser.make_parser(\"%h %V %q %t \\\"%r\\\" %>s %b \\\"%{Referer}i\\\" \\\"%{User-Agent}i\\\"\")\n",
"\n",
"# Fields to export, in the same order as the header row of log.csv.\n",
"fields = ['remote_host', 'server_name2', 'query_string', 'time_received_isoformat', 'request_method', 'request_url', 'request_http_ver', 'request_url_scheme', 'request_url_query', 'status', 'response_bytes_clf', 'request_header_user_agent', 'request_header_user_agent__browser__family', 'request_header_user_agent__browser__version_string', 'request_header_user_agent__os__family', 'request_header_user_agent__os__version_string', 'request_header_user_agent__is_mobile']\n",
"\n",
"# Open the CSV once in append mode rather than reopening it for every log line.\n",
"with open('all_log.log') as file, open('log.csv', 'a', newline='') as csv_file:\n",
"    writer = csv.writer(csv_file)\n",
"    for line in file:\n",
"        log_line_data = line_parser(line.strip())\n",
"        # Write only the required fields to the file.\n",
"        writer.writerow([log_line_data[field] for field in fields])\n",
"\n",
"# print('end')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Working with the generated CSV"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"data = pd.read_csv('log.csv')\n",
"# data.info()"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"scrolled": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"200 1108990\n",
"301 84924\n",
"304 76306\n",
"410 71090\n",
"404 21754\n",
"302 1370\n",
"400 1357\n",
"403 158\n",
"500 82\n",
"499 4\n",
"Name: status, dtype: int64\n"
]
}
],
"source": [
"status_code_count = data['status'].value_counts()\n",
"print(status_code_count)"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"YandexBot 30063\n",
"AhrefsBot 24013\n",
"bingbot 8588\n",
"SemrushBot 3179\n",
"MJ12bot 2062\n",
"crawler 1013\n",
"Yandex Browser 515\n",
"Chrome 449\n",
"Mail.RU_Bot 274\n",
"Firefox 201\n",
"Googlebot 155\n",
"Opera 142\n",
"Chrome Mobile 98\n",
"YandexSearch 87\n",
"Edge 55\n",
"Mobile Safari 39\n",
"YandexMobileBot 39\n",
"YandexAccessibilityBot 26\n",
"Safari 25\n",
"IE 19\n",
"msnbot 12\n",
"Samsung Internet 10\n",
"Mail.ru Chromium Browser 9\n",
"UC Browser 7\n",
"Opera Mini 3\n",
"Other 1\n",
"TwitterBot 1\n",
"Iron 1\n",
"Edge Mobile 1\n",
"Firefox Mobile 1\n",
"Chrome Mobile WebView 1\n",
"Chrome Mobile iOS 1\n",
"Name: request_header_user_agent__browser__family, dtype: int64"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data[data['status']==410]['request_header_user_agent__browser__family'].value_counts()"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"data_YandexBot = data[data['request_header_user_agent__browser__family']=='YandexBot']"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
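
The notebook's pipeline (parse each access-log line, write the chosen fields to CSV, then count status codes) can be sketched with the standard library alone, which may help when trying the idea without installing anything. This is only a rough illustration under stated assumptions: the regular expression `LINE_RE`, the helper names `parse_line` and `logs_to_csv`, the field list, and the sample log lines are all hypothetical and are not taken from the notebook or from `apache_log_parser`, which supports far more format directives.

```python
import csv
import io
import re
from collections import Counter

# Hypothetical, simplified pattern for lines shaped like '%h %l %u %t "%r" %>s %b'.
# It is an assumption for illustration, NOT how apache_log_parser works internally.
LINE_RE = re.compile(
    r'(?P<remote_host>\S+) \S+ \S+ \[(?P<time_received>[^\]]+)\] '
    r'"(?P<request_method>\S+) (?P<request_url>\S+) (?P<request_protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<response_bytes_clf>\S+)'
)

# Column order of the CSV; one column per named group in LINE_RE.
FIELDS = ['remote_host', 'time_received', 'request_method', 'request_url',
          'request_protocol', 'status', 'response_bytes_clf']


def parse_line(line):
    """Return a dict of fields, or None when the line does not match."""
    match = LINE_RE.match(line.strip())
    return match.groupdict() if match else None


def logs_to_csv(lines, out):
    """Write a header plus one row per parsable line; return status counts."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    counts = Counter()
    for line in lines:
        row = parse_line(line)
        if row is None:
            continue  # skip lines that do not match the assumed format
        writer.writerow(row)
        counts[row['status']] += 1
    return counts


# Made-up sample lines, for illustration only.
SAMPLE = [
    '203.0.113.7 - - [14/Oct/2019:14:20:00 +0300] "GET /index.html HTTP/1.1" 200 512',
    '203.0.113.8 - - [14/Oct/2019:14:20:01 +0300] "GET /old-page HTTP/1.1" 410 -',
    'not a log line',
]

buffer = io.StringIO()
status_counts = logs_to_csv(SAMPLE, buffer)
print(status_counts)
```

Skipping unparsable lines is a design choice of this sketch; the notebook's loop makes no such provision, so a line that does not fit the format string would stop it instead.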