@sueszli
Last active January 26, 2024 14:14
network traffic analysis
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We are receiving identical data from 4 different multicast publishers on separate lines in a `.pcap` file.\n",
"\n",
"The only difference between the data is the timestamp.\n",
"\n",
"We want to find the 2 packets with the earliest time of arrival."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Parsing the data into a pandas dataframe"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `.pcap` file stores the captured packets in a binary format, which is documented [here](https://wiki.wireshark.org/Development/LibpcapFileFormat)."
]
},
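{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough illustration of that format, the sketch below decodes the 24-byte global header that starts a classic `.pcap` file. The helper name `parse_pcap_global_header` is hypothetical, the input is a synthetic header rather than `traffic.pcap`, and only the microsecond-resolution magic number `0xa1b2c3d4` is handled. It is not how `dpkt` reads the file internally."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import struct\n",
"\n",
"PCAP_MAGIC = 0xA1B2C3D4  # classic pcap with microsecond timestamps\n",
"\n",
"\n",
"def parse_pcap_global_header(header: bytes) -> dict:\n",
"    # hypothetical helper, for illustration only\n",
"    if len(header) < 24:\n",
"        raise ValueError(\"pcap global header needs 24 bytes\")\n",
"    # the magic number reveals the byte order the file was written in\n",
"    if struct.unpack(\"<I\", header[:4])[0] == PCAP_MAGIC:\n",
"        endian = \"<\"\n",
"    elif struct.unpack(\">I\", header[:4])[0] == PCAP_MAGIC:\n",
"        endian = \">\"\n",
"    else:\n",
"        raise ValueError(\"not a classic microsecond-resolution pcap file\")\n",
"    # fields: magic, version major, version minor, thiszone, sigfigs, snaplen, link type\n",
"    _, major, minor, _, _, snaplen, linktype = struct.unpack(endian + \"IHHiIII\", header[:24])\n",
"    return {\"version\": f\"{major}.{minor}\", \"snaplen\": snaplen, \"linktype\": linktype}\n",
"\n",
"\n",
"# synthetic little-endian header: version 2.4, snaplen 65535, link type 1 (Ethernet)\n",
"example = struct.pack(\"<IHHiIII\", PCAP_MAGIC, 2, 4, 0, 0, 65535, 1)\n",
"print(parse_pcap_global_header(example))"
]
},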
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"from pathlib import Path\n",
"import socket\n",
"\n",
"import dpkt\n",
"import pandas as pd\n",
"\n",
"rows = []\n",
"\n",
"# the context manager closes the capture file once parsing is done\n",
"with open(Path(\"./traffic.pcap\"), \"rb\") as file:\n",
"    for ts, buf in dpkt.pcap.Reader(file):\n",
"        eth = dpkt.ethernet.Ethernet(buf)\n",
"        ip = eth.data\n",
"        tcp = ip.data\n",
"\n",
"        # parse packet\n",
"        timestamp: float = ts\n",
"        source: str = socket.inet_ntoa(ip.src) + \":\" + str(tcp.sport)\n",
"        destination: str = socket.inet_ntoa(ip.dst) + \":\" + str(tcp.dport)  # ignore, is our own ip\n",
"        data: str = tcp.data.decode(\"utf-8\").strip()\n",
"\n",
"        # parse data (symbol and price are not used below)\n",
"        fields = data.split(\" \")\n",
"        symbol = fields[1].strip()\n",
"        seqno = int(fields[3].strip())\n",
"        price = int(fields[5].strip())\n",
"\n",
"        rows.append([timestamp, source, seqno])\n",
"\n",
"# build the dataframe once instead of concatenating inside the loop\n",
"df = pd.DataFrame(rows, columns=[\"timestamp\", \"source\", \"seqno\"])\n",
"\n",
"assert len(df) > 0"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, let's check whether ordering by the capture `timestamp` gives the same result as ordering by the logical timestamp `seqno`."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Same order: False\n"
]
}
],
"source": [
"sort_by_timestamp = df.sort_values(by=[\"timestamp\"])\n",
"sort_by_seqno = df.sort_values(by=[\"seqno\"])\n",
"same_order = sort_by_timestamp.equals(sort_by_seqno)\n",
"\n",
"print(\"Same order: \" + str(same_order))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The two orderings differ. Let's keep both columns but use `timestamp` to decide which packet arrived first."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Finding the two fastest packets"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now all we need to do is sort by the timestamp and take the first two rows."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>timestamp</th>\n",
" <th>source</th>\n",
" <th>seqno</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1.473412e+09</td>\n",
" <td>10.10.10.4:33000</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1.473412e+09</td>\n",
" <td>10.10.10.1:33000</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" timestamp source seqno\n",
"0 1.473412e+09 10.10.10.4:33000 0\n",
"0 1.473412e+09 10.10.10.1:33000 0"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_sorted = df.sort_values(\"timestamp\")\n",
"fastest_packages = df_sorted.head(2)\n",
"\n",
"fastest_packages"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Comparing publishers"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, let's compare the publishers using the mean, median, and standard deviation of each packet's delay relative to the first packet."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'machine nr.1': {'avg': 499.79107509469986,\n",
" 'median': 499.7542179822922,\n",
" 'std': 288.82414498791394},\n",
" 'machine nr.2': {'avg': 500.82151508741117,\n",
" 'median': 501.29140400886536,\n",
" 'std': 288.6779254909906},\n",
" 'machine nr.3': {'avg': 499.7361731772423,\n",
" 'median': 499.83047103881836,\n",
" 'std': 288.8137098297796},\n",
" 'machine nr.4': {'avg': 499.68862091493605,\n",
" 'median': 499.7177765369415,\n",
" 'std': 288.8120833932828}}"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# delta of every packet from the first row of df_sorted\n",
"fastest_ts = df_sorted[\"timestamp\"].iloc[0]\n",
"df_delta = pd.DataFrame({\n",
"    \"source\": df_sorted[\"source\"],\n",
"    \"delta\": df_sorted[\"timestamp\"] - fastest_ts,\n",
"}).sort_values(\"delta\")\n",
"\n",
"# summarise the deltas per source, labelled by the last octet of the IP address\n",
"metrics = {}\n",
"for source, group in df_delta.groupby(\"source\"):\n",
"    host = source.split(\":\")[0]\n",
"    label = \"machine nr.\" + host.split(\".\")[3]\n",
"    metrics[label] = {\n",
"        \"avg\": group[\"delta\"].mean(),\n",
"        \"median\": group[\"delta\"].median(),\n",
"        \"std\": group[\"delta\"].std()\n",
"    }\n",
"\n",
"metrics\n"
]
},
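{
"cell_type": "markdown",
"metadata": {},
"source": [
"The deltas above are all measured against a single reference timestamp, so they grow with a packet's position in the capture. A possible complement, sketched here on a small made-up table (`toy`, with hypothetical sources `a`, `b`, `c`), measures each copy's delay relative to the earliest copy of the same `seqno`. This is an illustrative alternative under those assumptions, not part of the original analysis."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# toy stand-in for `df` with made-up values: 2 sequence numbers from 3 sources\n",
"toy = pd.DataFrame(\n",
"    {\n",
"        \"timestamp\": [10.0, 10.5, 10.25, 20.0, 20.125, 20.5],\n",
"        \"source\": [\"a\", \"b\", \"c\", \"b\", \"a\", \"c\"],\n",
"        \"seqno\": [0, 0, 0, 1, 1, 1],\n",
"    }\n",
")\n",
"\n",
"# delay of every copy relative to the earliest copy of the same sequence number\n",
"toy[\"delay\"] = toy[\"timestamp\"] - toy.groupby(\"seqno\")[\"timestamp\"].transform(\"min\")\n",
"\n",
"# per-source summary of those delays\n",
"per_source = toy.groupby(\"source\")[\"delay\"].agg([\"mean\", \"median\", \"std\"])\n",
"print(per_source)"
]
},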
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "",
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# plot metrics as bar chart\n",
"import matplotlib.pyplot as plt\n",
"import numpy as np\n",
"\n",
"# Prepare data for plotting\n",
"labels = list(metrics.keys())\n",
"avg = [metrics[m]['avg'] for m in labels]\n",
"median = [metrics[m]['median'] for m in labels]\n",
"std = [metrics[m]['std'] for m in labels]\n",
"\n",
"x = np.arange(len(labels)) # the label locations\n",
"width = 0.3 # the width of the bars\n",
"\n",
"fig, ax = plt.subplots()\n",
"rects1 = ax.bar(x - width, avg, width, label='Avg')\n",
"rects2 = ax.bar(x, median, width, label='Median')\n",
"# rects3 = ax.bar(x + width, std, width, label='Std')\n",
"\n",
"# Add some text for labels, title and custom x-axis tick labels, etc.\n",
"ax.set_ylabel('Delta from first packet (s)')\n",
"ax.set_title('Arrival delta by machine and metric')\n",
"ax.set_xticks(x)\n",
"ax.set_xticklabels(labels)\n",
"ax.legend()\n",
"\n",
"ax.set_yscale('log')\n",
"fig.tight_layout()\n",
"\n",
"plt.show()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.6"
}
},
"nbformat": 4,
"nbformat_minor": 2
}