@axmakarov
Last active May 5, 2018 06:40
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Building sessions from an event log\n",
"We load the pandas and numpy libraries, plus display for rendering DataFrames."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"from IPython.display import display"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Load the log\n",
"\n",
"Data structure:\n",
"- id - sequential number of the event in the log\n",
"- user_id - unique identifier of the user who triggered the event (in a real log-analysis task, the user's IP address or, for example, a unique cookie identifier could serve as user_id)\n",
"- date_time - time at which the event occurred\n",
"- page - the page the user navigated to (this column is not needed to solve the task; it is included for readability)"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>id</th>\n",
" <th>user_id</th>\n",
" <th>date_time</th>\n",
" <th>page</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>36004921-2faf-45b6-bd35-7496474e6c87</td>\n",
" <td>2018-05-05 07:45:00</td>\n",
" <td>/index</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" <td>61955afd-9718-49df-825c-1b21e352807f</td>\n",
" <td>2018-05-05 07:46:00</td>\n",
" <td>/index</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>3</td>\n",
" <td>61955afd-9718-49df-825c-1b21e352807f</td>\n",
" <td>2018-05-05 07:49:00</td>\n",
" <td>/catalog</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>4</td>\n",
" <td>61955afd-9718-49df-825c-1b21e352807f</td>\n",
" <td>2018-05-05 07:50:00</td>\n",
" <td>/catalog2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>5</td>\n",
" <td>36004921-2faf-45b6-bd35-7496474e6c87</td>\n",
" <td>2018-05-05 07:51:00</td>\n",
" <td>/contacts</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>6</td>\n",
" <td>61955afd-9718-49df-825c-1b21e352807f</td>\n",
" <td>2018-05-05 08:21:00</td>\n",
" <td>/catalog</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>7</td>\n",
" <td>61955afd-9718-49df-825c-1b21e352807f</td>\n",
" <td>2018-05-05 09:22:00</td>\n",
" <td>/index</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>8</td>\n",
" <td>58774a77-5d8d-4459-b5f3-8cb539f4917c</td>\n",
" <td>2018-05-05 09:25:00</td>\n",
" <td>/index</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" id user_id date_time page\n",
"0 1 36004921-2faf-45b6-bd35-7496474e6c87 2018-05-05 07:45:00 /index\n",
"1 2 61955afd-9718-49df-825c-1b21e352807f 2018-05-05 07:46:00 /index\n",
"2 3 61955afd-9718-49df-825c-1b21e352807f 2018-05-05 07:49:00 /catalog\n",
"3 4 61955afd-9718-49df-825c-1b21e352807f 2018-05-05 07:50:00 /catalog2\n",
"4 5 36004921-2faf-45b6-bd35-7496474e6c87 2018-05-05 07:51:00 /contacts\n",
"5 6 61955afd-9718-49df-825c-1b21e352807f 2018-05-05 08:21:00 /catalog\n",
"6 7 61955afd-9718-49df-825c-1b21e352807f 2018-05-05 09:22:00 /index\n",
"7 8 58774a77-5d8d-4459-b5f3-8cb539f4917c 2018-05-05 09:25:00 /index"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"event_df = pd.read_excel('event_log.xlsx')\n",
"display(event_df)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Events generated by different users are interleaved in chronological order. For convenience, we sort them by user_id so that each user's events are contiguous."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>id</th>\n",
" <th>user_id</th>\n",
" <th>date_time</th>\n",
" <th>page</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>36004921-2faf-45b6-bd35-7496474e6c87</td>\n",
" <td>2018-05-05 07:45:00</td>\n",
" <td>/index</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>5</td>\n",
" <td>36004921-2faf-45b6-bd35-7496474e6c87</td>\n",
" <td>2018-05-05 07:51:00</td>\n",
" <td>/contacts</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>8</td>\n",
" <td>58774a77-5d8d-4459-b5f3-8cb539f4917c</td>\n",
" <td>2018-05-05 09:25:00</td>\n",
" <td>/index</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" <td>61955afd-9718-49df-825c-1b21e352807f</td>\n",
" <td>2018-05-05 07:46:00</td>\n",
" <td>/index</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>3</td>\n",
" <td>61955afd-9718-49df-825c-1b21e352807f</td>\n",
" <td>2018-05-05 07:49:00</td>\n",
" <td>/catalog</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>4</td>\n",
" <td>61955afd-9718-49df-825c-1b21e352807f</td>\n",
" <td>2018-05-05 07:50:00</td>\n",
" <td>/catalog2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>6</td>\n",
" <td>61955afd-9718-49df-825c-1b21e352807f</td>\n",
" <td>2018-05-05 08:21:00</td>\n",
" <td>/catalog</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>7</td>\n",
" <td>61955afd-9718-49df-825c-1b21e352807f</td>\n",
" <td>2018-05-05 09:22:00</td>\n",
" <td>/index</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" id user_id date_time page\n",
"0 1 36004921-2faf-45b6-bd35-7496474e6c87 2018-05-05 07:45:00 /index\n",
"4 5 36004921-2faf-45b6-bd35-7496474e6c87 2018-05-05 07:51:00 /contacts\n",
"7 8 58774a77-5d8d-4459-b5f3-8cb539f4917c 2018-05-05 09:25:00 /index\n",
"1 2 61955afd-9718-49df-825c-1b21e352807f 2018-05-05 07:46:00 /index\n",
"2 3 61955afd-9718-49df-825c-1b21e352807f 2018-05-05 07:49:00 /catalog\n",
"3 4 61955afd-9718-49df-825c-1b21e352807f 2018-05-05 07:50:00 /catalog2\n",
"5 6 61955afd-9718-49df-825c-1b21e352807f 2018-05-05 08:21:00 /catalog\n",
"6 7 61955afd-9718-49df-825c-1b21e352807f 2018-05-05 09:22:00 /index"
]
},
"metadata": {},
"output_type": "display_data"
}
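],
"source": [
"# Sort by user, then by time: the default sort is not stable, so sorting by user_id alone could reorder a user's events\n",
"event_df = event_df.sort_values(['user_id', 'date_time'])\n",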
"display(event_df)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the 'diff' column, for each event of a given user, we compute the difference between the time of the page visit and the time of that user's previous page visit. If the page was the user's first, the value in 'diff' is NaT, since there is no previous value."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>id</th>\n",
" <th>user_id</th>\n",
" <th>date_time</th>\n",
" <th>page</th>\n",
" <th>diff</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>36004921-2faf-45b6-bd35-7496474e6c87</td>\n",
" <td>2018-05-05 07:45:00</td>\n",
" <td>/index</td>\n",
" <td>NaT</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>5</td>\n",
" <td>36004921-2faf-45b6-bd35-7496474e6c87</td>\n",
" <td>2018-05-05 07:51:00</td>\n",
" <td>/contacts</td>\n",
" <td>00:06:00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>8</td>\n",
" <td>58774a77-5d8d-4459-b5f3-8cb539f4917c</td>\n",
" <td>2018-05-05 09:25:00</td>\n",
" <td>/index</td>\n",
" <td>NaT</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" <td>61955afd-9718-49df-825c-1b21e352807f</td>\n",
" <td>2018-05-05 07:46:00</td>\n",
" <td>/index</td>\n",
" <td>NaT</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>3</td>\n",
" <td>61955afd-9718-49df-825c-1b21e352807f</td>\n",
" <td>2018-05-05 07:49:00</td>\n",
" <td>/catalog</td>\n",
" <td>00:03:00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>4</td>\n",
" <td>61955afd-9718-49df-825c-1b21e352807f</td>\n",
" <td>2018-05-05 07:50:00</td>\n",
" <td>/catalog2</td>\n",
" <td>00:01:00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>6</td>\n",
" <td>61955afd-9718-49df-825c-1b21e352807f</td>\n",
" <td>2018-05-05 08:21:00</td>\n",
" <td>/catalog</td>\n",
" <td>00:31:00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>7</td>\n",
" <td>61955afd-9718-49df-825c-1b21e352807f</td>\n",
" <td>2018-05-05 09:22:00</td>\n",
" <td>/index</td>\n",
" <td>01:01:00</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" id user_id date_time page \\\n",
"0 1 36004921-2faf-45b6-bd35-7496474e6c87 2018-05-05 07:45:00 /index \n",
"4 5 36004921-2faf-45b6-bd35-7496474e6c87 2018-05-05 07:51:00 /contacts \n",
"7 8 58774a77-5d8d-4459-b5f3-8cb539f4917c 2018-05-05 09:25:00 /index \n",
"1 2 61955afd-9718-49df-825c-1b21e352807f 2018-05-05 07:46:00 /index \n",
"2 3 61955afd-9718-49df-825c-1b21e352807f 2018-05-05 07:49:00 /catalog \n",
"3 4 61955afd-9718-49df-825c-1b21e352807f 2018-05-05 07:50:00 /catalog2 \n",
"5 6 61955afd-9718-49df-825c-1b21e352807f 2018-05-05 08:21:00 /catalog \n",
"6 7 61955afd-9718-49df-825c-1b21e352807f 2018-05-05 09:22:00 /index \n",
"\n",
" diff \n",
"0 NaT \n",
"4 00:06:00 \n",
"7 NaT \n",
"1 NaT \n",
"2 00:03:00 \n",
"3 00:01:00 \n",
"5 00:31:00 \n",
"6 01:01:00 "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"event_df['diff'] = event_df.groupby('user_id')['date_time'].diff(1)\n",
"display(event_df)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"From the main DataFrame 'event_df' we create a helper DataFrame 'sessions_start_df'. It holds the events that count as the first event of a session: every event that occurred more than 30 minutes after the user's previous one, plus every event that was the user's first (NaT in the 'diff' column).\n",
"\n",
"We also add a 'session_id' column to the helper DataFrame, holding the id of the session's first event. It serves as the session identifier when we join the main and helper DataFrames."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/home/makarov/anaconda2/lib/python2.7/site-packages/ipykernel_launcher.py:2: SettingWithCopyWarning: \n",
"A value is trying to be set on a copy of a slice from a DataFrame.\n",
"Try using .loc[row_indexer,col_indexer] = value instead\n",
"\n",
"See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy\n",
" \n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>id</th>\n",
" <th>user_id</th>\n",
" <th>date_time</th>\n",
" <th>page</th>\n",
" <th>diff</th>\n",
" <th>session_id</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>36004921-2faf-45b6-bd35-7496474e6c87</td>\n",
" <td>2018-05-05 07:45:00</td>\n",
" <td>/index</td>\n",
" <td>NaT</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>8</td>\n",
" <td>58774a77-5d8d-4459-b5f3-8cb539f4917c</td>\n",
" <td>2018-05-05 09:25:00</td>\n",
" <td>/index</td>\n",
" <td>NaT</td>\n",
" <td>8</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" <td>61955afd-9718-49df-825c-1b21e352807f</td>\n",
" <td>2018-05-05 07:46:00</td>\n",
" <td>/index</td>\n",
" <td>NaT</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>6</td>\n",
" <td>61955afd-9718-49df-825c-1b21e352807f</td>\n",
" <td>2018-05-05 08:21:00</td>\n",
" <td>/catalog</td>\n",
" <td>00:31:00</td>\n",
" <td>6</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>7</td>\n",
" <td>61955afd-9718-49df-825c-1b21e352807f</td>\n",
" <td>2018-05-05 09:22:00</td>\n",
" <td>/index</td>\n",
" <td>01:01:00</td>\n",
" <td>7</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" id user_id date_time page \\\n",
"0 1 36004921-2faf-45b6-bd35-7496474e6c87 2018-05-05 07:45:00 /index \n",
"7 8 58774a77-5d8d-4459-b5f3-8cb539f4917c 2018-05-05 09:25:00 /index \n",
"1 2 61955afd-9718-49df-825c-1b21e352807f 2018-05-05 07:46:00 /index \n",
"5 6 61955afd-9718-49df-825c-1b21e352807f 2018-05-05 08:21:00 /catalog \n",
"6 7 61955afd-9718-49df-825c-1b21e352807f 2018-05-05 09:22:00 /index \n",
"\n",
" diff session_id \n",
"0 NaT 1 \n",
"7 NaT 8 \n",
"1 NaT 2 \n",
"5 00:31:00 6 \n",
"6 01:01:00 7 "
]
},
"metadata": {},
"output_type": "display_data"
}
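],
"source": [
"# A session starts at a user's first event (NaT) or after a gap of more than 30 minutes\n",
"is_session_start = event_df['diff'].isnull() | (event_df['diff'] > pd.Timedelta(minutes=30))\n",
"# .copy() makes an independent DataFrame, so adding a column does not raise SettingWithCopyWarning\n",
"sessions_start_df = event_df[is_session_start].copy()\n",
"sessions_start_df['session_id'] = sessions_start_df['id']\n",
"display(sessions_start_df)"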
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Using the merge_asof function, we combine the main and helper DataFrames. It works much like a left join, but matches on the nearest key rather than an exact one (by default, the last row of the right DataFrame whose key is less than or equal to the left key). Examples and details are in the documentation: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.merge_asof.html\n",
"\n",
"For this function to work correctly, both DataFrames must be sorted by the key on which merge_asof is performed."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": true
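},
"outputs": [],
"source": [
"# merge_asof requires both frames to be sorted by the 'on' key\n",
"event_df = event_df.sort_values('id')\n",
"sessions_start_df = sessions_start_df.sort_values('id')\n",
"# For each event, take the latest session start of the same user whose id <= the event's id\n",
"event_df = pd.merge_asof(event_df, sessions_start_df[['id', 'user_id', 'session_id']], on='id', by='user_id')"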
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"After merging, we sort the main DataFrame by user_id and date_time and check that sessions are correctly assigned to events."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>id</th>\n",
" <th>user_id</th>\n",
" <th>date_time</th>\n",
" <th>page</th>\n",
" <th>diff</th>\n",
" <th>session_id</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>36004921-2faf-45b6-bd35-7496474e6c87</td>\n",
" <td>2018-05-05 07:45:00</td>\n",
" <td>/index</td>\n",
" <td>NaT</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>5</td>\n",
" <td>36004921-2faf-45b6-bd35-7496474e6c87</td>\n",
" <td>2018-05-05 07:51:00</td>\n",
" <td>/contacts</td>\n",
" <td>00:06:00</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>8</td>\n",
" <td>58774a77-5d8d-4459-b5f3-8cb539f4917c</td>\n",
" <td>2018-05-05 09:25:00</td>\n",
" <td>/index</td>\n",
" <td>NaT</td>\n",
" <td>8</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" <td>61955afd-9718-49df-825c-1b21e352807f</td>\n",
" <td>2018-05-05 07:46:00</td>\n",
" <td>/index</td>\n",
" <td>NaT</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>3</td>\n",
" <td>61955afd-9718-49df-825c-1b21e352807f</td>\n",
" <td>2018-05-05 07:49:00</td>\n",
" <td>/catalog</td>\n",
" <td>00:03:00</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>4</td>\n",
" <td>61955afd-9718-49df-825c-1b21e352807f</td>\n",
" <td>2018-05-05 07:50:00</td>\n",
" <td>/catalog2</td>\n",
" <td>00:01:00</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>6</td>\n",
" <td>61955afd-9718-49df-825c-1b21e352807f</td>\n",
" <td>2018-05-05 08:21:00</td>\n",
" <td>/catalog</td>\n",
" <td>00:31:00</td>\n",
" <td>6</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>7</td>\n",
" <td>61955afd-9718-49df-825c-1b21e352807f</td>\n",
" <td>2018-05-05 09:22:00</td>\n",
" <td>/index</td>\n",
" <td>01:01:00</td>\n",
" <td>7</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" id user_id date_time page \\\n",
"0 1 36004921-2faf-45b6-bd35-7496474e6c87 2018-05-05 07:45:00 /index \n",
"4 5 36004921-2faf-45b6-bd35-7496474e6c87 2018-05-05 07:51:00 /contacts \n",
"7 8 58774a77-5d8d-4459-b5f3-8cb539f4917c 2018-05-05 09:25:00 /index \n",
"1 2 61955afd-9718-49df-825c-1b21e352807f 2018-05-05 07:46:00 /index \n",
"2 3 61955afd-9718-49df-825c-1b21e352807f 2018-05-05 07:49:00 /catalog \n",
"3 4 61955afd-9718-49df-825c-1b21e352807f 2018-05-05 07:50:00 /catalog2 \n",
"5 6 61955afd-9718-49df-825c-1b21e352807f 2018-05-05 08:21:00 /catalog \n",
"6 7 61955afd-9718-49df-825c-1b21e352807f 2018-05-05 09:22:00 /index \n",
"\n",
" diff session_id \n",
"0 NaT 1 \n",
"4 00:06:00 1 \n",
"7 NaT 8 \n",
"1 NaT 2 \n",
"2 00:03:00 2 \n",
"3 00:01:00 2 \n",
"5 00:31:00 6 \n",
"6 01:01:00 7 "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"event_df = event_df.sort_values(['user_id','date_time'])\n",
"display(event_df)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## What else can be done?\n",
"1. We can find the events that were first in their sessions. This is useful for identifying entry (landing) pages.\n",
"\n",
"Finding these events is straightforward: their ids equal their session ids."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>id</th>\n",
" <th>user_id</th>\n",
" <th>date_time</th>\n",
" <th>page</th>\n",
" <th>diff</th>\n",
" <th>session_id</th>\n",
" <th>is_first_event_in_session</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>36004921-2faf-45b6-bd35-7496474e6c87</td>\n",
" <td>2018-05-05 07:45:00</td>\n",
" <td>/index</td>\n",
" <td>NaT</td>\n",
" <td>1</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>5</td>\n",
" <td>36004921-2faf-45b6-bd35-7496474e6c87</td>\n",
" <td>2018-05-05 07:51:00</td>\n",
" <td>/contacts</td>\n",
" <td>00:06:00</td>\n",
" <td>1</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>8</td>\n",
" <td>58774a77-5d8d-4459-b5f3-8cb539f4917c</td>\n",
" <td>2018-05-05 09:25:00</td>\n",
" <td>/index</td>\n",
" <td>NaT</td>\n",
" <td>8</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" <td>61955afd-9718-49df-825c-1b21e352807f</td>\n",
" <td>2018-05-05 07:46:00</td>\n",
" <td>/index</td>\n",
" <td>NaT</td>\n",
" <td>2</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>3</td>\n",
" <td>61955afd-9718-49df-825c-1b21e352807f</td>\n",
" <td>2018-05-05 07:49:00</td>\n",
" <td>/catalog</td>\n",
" <td>00:03:00</td>\n",
" <td>2</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>4</td>\n",
" <td>61955afd-9718-49df-825c-1b21e352807f</td>\n",
" <td>2018-05-05 07:50:00</td>\n",
" <td>/catalog2</td>\n",
" <td>00:01:00</td>\n",
" <td>2</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>6</td>\n",
" <td>61955afd-9718-49df-825c-1b21e352807f</td>\n",
" <td>2018-05-05 08:21:00</td>\n",
" <td>/catalog</td>\n",
" <td>00:31:00</td>\n",
" <td>6</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>7</td>\n",
" <td>61955afd-9718-49df-825c-1b21e352807f</td>\n",
" <td>2018-05-05 09:22:00</td>\n",
" <td>/index</td>\n",
" <td>01:01:00</td>\n",
" <td>7</td>\n",
" <td>True</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" id user_id date_time page \\\n",
"0 1 36004921-2faf-45b6-bd35-7496474e6c87 2018-05-05 07:45:00 /index \n",
"4 5 36004921-2faf-45b6-bd35-7496474e6c87 2018-05-05 07:51:00 /contacts \n",
"7 8 58774a77-5d8d-4459-b5f3-8cb539f4917c 2018-05-05 09:25:00 /index \n",
"1 2 61955afd-9718-49df-825c-1b21e352807f 2018-05-05 07:46:00 /index \n",
"2 3 61955afd-9718-49df-825c-1b21e352807f 2018-05-05 07:49:00 /catalog \n",
"3 4 61955afd-9718-49df-825c-1b21e352807f 2018-05-05 07:50:00 /catalog2 \n",
"5 6 61955afd-9718-49df-825c-1b21e352807f 2018-05-05 08:21:00 /catalog \n",
"6 7 61955afd-9718-49df-825c-1b21e352807f 2018-05-05 09:22:00 /index \n",
"\n",
" diff session_id is_first_event_in_session \n",
"0 NaT 1 True \n",
"4 00:06:00 1 False \n",
"7 NaT 8 True \n",
"1 NaT 2 True \n",
"2 00:03:00 2 False \n",
"3 00:01:00 2 False \n",
"5 00:31:00 6 True \n",
"6 01:01:00 7 True "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"event_df['is_first_event_in_session'] = event_df['id'] == event_df['session_id']\n",
"display(event_df)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"2. We can compute the time spent on a page from the time of the next page visit.\n",
"\n",
"To do this, we first compute, within each session, the time difference between each page visit and the previous one."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>id</th>\n",
" <th>user_id</th>\n",
" <th>date_time</th>\n",
" <th>page</th>\n",
" <th>diff</th>\n",
" <th>session_id</th>\n",
" <th>is_first_event_in_session</th>\n",
" <th>time_on_page</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>36004921-2faf-45b6-bd35-7496474e6c87</td>\n",
" <td>2018-05-05 07:45:00</td>\n",
" <td>/index</td>\n",
" <td>NaT</td>\n",
" <td>1</td>\n",
" <td>True</td>\n",
" <td>NaT</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>5</td>\n",
" <td>36004921-2faf-45b6-bd35-7496474e6c87</td>\n",
" <td>2018-05-05 07:51:00</td>\n",
" <td>/contacts</td>\n",
" <td>00:06:00</td>\n",
" <td>1</td>\n",
" <td>False</td>\n",
" <td>00:06:00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>8</td>\n",
" <td>58774a77-5d8d-4459-b5f3-8cb539f4917c</td>\n",
" <td>2018-05-05 09:25:00</td>\n",
" <td>/index</td>\n",
" <td>NaT</td>\n",
" <td>8</td>\n",
" <td>True</td>\n",
" <td>NaT</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" <td>61955afd-9718-49df-825c-1b21e352807f</td>\n",
" <td>2018-05-05 07:46:00</td>\n",
" <td>/index</td>\n",
" <td>NaT</td>\n",
" <td>2</td>\n",
" <td>True</td>\n",
" <td>NaT</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>3</td>\n",
" <td>61955afd-9718-49df-825c-1b21e352807f</td>\n",
" <td>2018-05-05 07:49:00</td>\n",
" <td>/catalog</td>\n",
" <td>00:03:00</td>\n",
" <td>2</td>\n",
" <td>False</td>\n",
" <td>00:03:00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>4</td>\n",
" <td>61955afd-9718-49df-825c-1b21e352807f</td>\n",
" <td>2018-05-05 07:50:00</td>\n",
" <td>/catalog2</td>\n",
" <td>00:01:00</td>\n",
" <td>2</td>\n",
" <td>False</td>\n",
" <td>00:01:00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>6</td>\n",
" <td>61955afd-9718-49df-825c-1b21e352807f</td>\n",
" <td>2018-05-05 08:21:00</td>\n",
" <td>/catalog</td>\n",
" <td>00:31:00</td>\n",
" <td>6</td>\n",
" <td>True</td>\n",
" <td>NaT</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>7</td>\n",
" <td>61955afd-9718-49df-825c-1b21e352807f</td>\n",
" <td>2018-05-05 09:22:00</td>\n",
" <td>/index</td>\n",
" <td>01:01:00</td>\n",
" <td>7</td>\n",
" <td>True</td>\n",
" <td>NaT</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" id user_id date_time page \\\n",
"0 1 36004921-2faf-45b6-bd35-7496474e6c87 2018-05-05 07:45:00 /index \n",
"4 5 36004921-2faf-45b6-bd35-7496474e6c87 2018-05-05 07:51:00 /contacts \n",
"7 8 58774a77-5d8d-4459-b5f3-8cb539f4917c 2018-05-05 09:25:00 /index \n",
"1 2 61955afd-9718-49df-825c-1b21e352807f 2018-05-05 07:46:00 /index \n",
"2 3 61955afd-9718-49df-825c-1b21e352807f 2018-05-05 07:49:00 /catalog \n",
"3 4 61955afd-9718-49df-825c-1b21e352807f 2018-05-05 07:50:00 /catalog2 \n",
"5 6 61955afd-9718-49df-825c-1b21e352807f 2018-05-05 08:21:00 /catalog \n",
"6 7 61955afd-9718-49df-825c-1b21e352807f 2018-05-05 09:22:00 /index \n",
"\n",
" diff session_id is_first_event_in_session time_on_page \n",
"0 NaT 1 True NaT \n",
"4 00:06:00 1 False 00:06:00 \n",
"7 NaT 8 True NaT \n",
"1 NaT 2 True NaT \n",
"2 00:03:00 2 False 00:03:00 \n",
"3 00:01:00 2 False 00:01:00 \n",
"5 00:31:00 6 True NaT \n",
"6 01:01:00 7 True NaT "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"event_df['time_on_page'] = event_df.groupby(['session_id'])['date_time'].diff(1)\n",
"display(event_df)"
]
},
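{
"cell_type": "markdown",
"metadata": {},
"source": [
"Then we shift the computed difference up by one row within each session, so that each row holds the time until the next page visit in the same session. The last page of a session gets NaT, because there is no next visit to measure against."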
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>id</th>\n",
" <th>user_id</th>\n",
" <th>date_time</th>\n",
" <th>page</th>\n",
" <th>diff</th>\n",
" <th>session_id</th>\n",
" <th>is_first_event_in_session</th>\n",
" <th>time_on_page</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>36004921-2faf-45b6-bd35-7496474e6c87</td>\n",
" <td>2018-05-05 07:45:00</td>\n",
" <td>/index</td>\n",
" <td>NaT</td>\n",
" <td>1</td>\n",
" <td>True</td>\n",
" <td>00:06:00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>5</td>\n",
" <td>36004921-2faf-45b6-bd35-7496474e6c87</td>\n",
" <td>2018-05-05 07:51:00</td>\n",
" <td>/contacts</td>\n",
" <td>00:06:00</td>\n",
" <td>1</td>\n",
" <td>False</td>\n",
" <td>NaT</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>8</td>\n",
" <td>58774a77-5d8d-4459-b5f3-8cb539f4917c</td>\n",
" <td>2018-05-05 09:25:00</td>\n",
" <td>/index</td>\n",
" <td>NaT</td>\n",
" <td>8</td>\n",
" <td>True</td>\n",
" <td>NaT</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" <td>61955afd-9718-49df-825c-1b21e352807f</td>\n",
" <td>2018-05-05 07:46:00</td>\n",
" <td>/index</td>\n",
" <td>NaT</td>\n",
" <td>2</td>\n",
" <td>True</td>\n",
" <td>00:03:00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>3</td>\n",
" <td>61955afd-9718-49df-825c-1b21e352807f</td>\n",
" <td>2018-05-05 07:49:00</td>\n",
" <td>/catalog</td>\n",
" <td>00:03:00</td>\n",
" <td>2</td>\n",
" <td>False</td>\n",
" <td>00:01:00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>4</td>\n",
" <td>61955afd-9718-49df-825c-1b21e352807f</td>\n",
" <td>2018-05-05 07:50:00</td>\n",
" <td>/catalog2</td>\n",
" <td>00:01:00</td>\n",
" <td>2</td>\n",
" <td>False</td>\n",
" <td>NaT</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>6</td>\n",
" <td>61955afd-9718-49df-825c-1b21e352807f</td>\n",
" <td>2018-05-05 08:21:00</td>\n",
" <td>/catalog</td>\n",
" <td>00:31:00</td>\n",
" <td>6</td>\n",
" <td>True</td>\n",
" <td>NaT</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>7</td>\n",
" <td>61955afd-9718-49df-825c-1b21e352807f</td>\n",
" <td>2018-05-05 09:22:00</td>\n",
" <td>/index</td>\n",
" <td>01:01:00</td>\n",
" <td>7</td>\n",
" <td>True</td>\n",
" <td>NaT</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" id user_id date_time page \\\n",
"0 1 36004921-2faf-45b6-bd35-7496474e6c87 2018-05-05 07:45:00 /index \n",
"4 5 36004921-2faf-45b6-bd35-7496474e6c87 2018-05-05 07:51:00 /contacts \n",
"7 8 58774a77-5d8d-4459-b5f3-8cb539f4917c 2018-05-05 09:25:00 /index \n",
"1 2 61955afd-9718-49df-825c-1b21e352807f 2018-05-05 07:46:00 /index \n",
"2 3 61955afd-9718-49df-825c-1b21e352807f 2018-05-05 07:49:00 /catalog \n",
"3 4 61955afd-9718-49df-825c-1b21e352807f 2018-05-05 07:50:00 /catalog2 \n",
"5 6 61955afd-9718-49df-825c-1b21e352807f 2018-05-05 08:21:00 /catalog \n",
"6 7 61955afd-9718-49df-825c-1b21e352807f 2018-05-05 09:22:00 /index \n",
"\n",
" diff session_id is_first_event_in_session time_on_page \n",
"0 NaT 1 True 00:06:00 \n",
"4 00:06:00 1 False NaT \n",
"7 NaT 8 True NaT \n",
"1 NaT 2 True 00:03:00 \n",
"2 00:03:00 2 False 00:01:00 \n",
"3 00:01:00 2 False NaT \n",
"5 00:31:00 6 True NaT \n",
"6 01:01:00 7 True NaT "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"event_df['time_on_page'] = event_df.groupby(['session_id'])['time_on_page'].shift(-1)\n",
"display(event_df)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To simplify later calculations, we convert 'time_on_page' to seconds."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>id</th>\n",
" <th>user_id</th>\n",
" <th>date_time</th>\n",
" <th>page</th>\n",
" <th>diff</th>\n",
" <th>session_id</th>\n",
" <th>is_first_event_in_session</th>\n",
" <th>time_on_page</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>36004921-2faf-45b6-bd35-7496474e6c87</td>\n",
" <td>2018-05-05 07:45:00</td>\n",
" <td>/index</td>\n",
" <td>NaT</td>\n",
" <td>1</td>\n",
" <td>True</td>\n",
" <td>360.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>5</td>\n",
" <td>36004921-2faf-45b6-bd35-7496474e6c87</td>\n",
" <td>2018-05-05 07:51:00</td>\n",
" <td>/contacts</td>\n",
" <td>00:06:00</td>\n",
" <td>1</td>\n",
" <td>False</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>8</td>\n",
" <td>58774a77-5d8d-4459-b5f3-8cb539f4917c</td>\n",
" <td>2018-05-05 09:25:00</td>\n",
" <td>/index</td>\n",
" <td>NaT</td>\n",
" <td>8</td>\n",
" <td>True</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" <td>61955afd-9718-49df-825c-1b21e352807f</td>\n",
" <td>2018-05-05 07:46:00</td>\n",
" <td>/index</td>\n",
" <td>NaT</td>\n",
" <td>2</td>\n",
" <td>True</td>\n",
" <td>180.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>3</td>\n",
" <td>61955afd-9718-49df-825c-1b21e352807f</td>\n",
" <td>2018-05-05 07:49:00</td>\n",
" <td>/catalog</td>\n",
" <td>00:03:00</td>\n",
" <td>2</td>\n",
" <td>False</td>\n",
" <td>60.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>4</td>\n",
" <td>61955afd-9718-49df-825c-1b21e352807f</td>\n",
" <td>2018-05-05 07:50:00</td>\n",
" <td>/catalog2</td>\n",
" <td>00:01:00</td>\n",
" <td>2</td>\n",
" <td>False</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>6</td>\n",
" <td>61955afd-9718-49df-825c-1b21e352807f</td>\n",
" <td>2018-05-05 08:21:00</td>\n",
" <td>/catalog</td>\n",
" <td>00:31:00</td>\n",
" <td>6</td>\n",
" <td>True</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>7</td>\n",
" <td>61955afd-9718-49df-825c-1b21e352807f</td>\n",
" <td>2018-05-05 09:22:00</td>\n",
" <td>/index</td>\n",
" <td>01:01:00</td>\n",
" <td>7</td>\n",
" <td>True</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" id user_id date_time page \\\n",
"0 1 36004921-2faf-45b6-bd35-7496474e6c87 2018-05-05 07:45:00 /index \n",
"4 5 36004921-2faf-45b6-bd35-7496474e6c87 2018-05-05 07:51:00 /contacts \n",
"7 8 58774a77-5d8d-4459-b5f3-8cb539f4917c 2018-05-05 09:25:00 /index \n",
"1 2 61955afd-9718-49df-825c-1b21e352807f 2018-05-05 07:46:00 /index \n",
"2 3 61955afd-9718-49df-825c-1b21e352807f 2018-05-05 07:49:00 /catalog \n",
"3 4 61955afd-9718-49df-825c-1b21e352807f 2018-05-05 07:50:00 /catalog2 \n",
"5 6 61955afd-9718-49df-825c-1b21e352807f 2018-05-05 08:21:00 /catalog \n",
"6 7 61955afd-9718-49df-825c-1b21e352807f 2018-05-05 09:22:00 /index \n",
"\n",
" diff session_id is_first_event_in_session time_on_page \n",
"0 NaT 1 True 360.0 \n",
"4 00:06:00 1 False NaN \n",
"7 NaT 8 True NaN \n",
"1 NaT 2 True 180.0 \n",
"2 00:03:00 2 False 60.0 \n",
"3 00:01:00 2 False NaN \n",
"5 00:31:00 6 True NaN \n",
"6 01:01:00 7 True NaN "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"event_df['time_on_page'] = event_df['time_on_page'] / np.timedelta64(1, 's')\n",
"display(event_df)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"From the resulting data we can compute a few basic metrics. Or we could come up with something more sophisticated :)"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of users: 3\n",
"Number of sessions: 5\n",
"Number of page views: 8\n",
"Average time on page (seconds): 200.0\n"
]
}
],
"source": [
"print(u'Number of users: {0}'.format(event_df['user_id'].nunique()))\n",
"print(u'Number of sessions: {0}'.format(event_df['session_id'].nunique()))\n",
"print(u'Number of page views: {0}'.format(event_df['id'].count()))\n",
"print(u'Average time on page (seconds): {0}'.format(event_df['time_on_page'].mean()))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.14"
}
},
"nbformat": 4,
"nbformat_minor": 2
}