@tovask
Last active January 18, 2020 04:20
Analyze block lists changes
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Analyze lists changes from [PyFunceble](https://github.com/funilrys/PyFunceble)'s repo's git history"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> ### Jupyter help:\n",
"> \n",
"> Outline of some basics:\n",
"> \n",
"> * [Notebook Basics](https://nbviewer.jupyter.org/github/ipython/ipython-in-depth/blob/master/examples/Notebook/Notebook%20Basics.ipynb)\n",
"> * [IPython - beyond plain python](https://nbviewer.jupyter.org/github/ipython/ipython-in-depth/blob/master/examples/IPython%20Kernel/Beyond%20Plain%20Python.ipynb)\n",
"> * [Markdown Cells](https://nbviewer.jupyter.org/github/ipython/ipython-in-depth/blob/master/examples/Notebook/Working%20With%20Markdown%20Cells.ipynb)\n",
"> * [Rich Display System](https://nbviewer.jupyter.org/github/ipython/ipython-in-depth/blob/master/examples/IPython%20Kernel/Rich%20Output.ipynb)\n",
"> * [Custom Display logic](https://nbviewer.jupyter.org/github/ipython/ipython-in-depth/blob/master/examples/IPython%20Kernel/Custom%20Display%20Logic.ipynb)\n",
"> * [Running a Secure Public Notebook Server](https://nbviewer.jupyter.org/github/ipython/ipython-in-depth/blob/master/examples/Notebook/Running%20the%20Notebook%20Server.ipynb#Securing-the-notebook-server)\n",
"> * [How Jupyter works](https://nbviewer.jupyter.org/github/ipython/ipython-in-depth/blob/master/examples/Notebook/Multiple%20Languages%2C%20Frontends.ipynb) to run code in different languages."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Sources:\n",
"### [Ultimate Hosts Blacklist](https://github.com/Ultimate-Hosts-Blacklist)\n",
"Repositories for testing lists\n",
"\n",
"### [Dead Host](https://github.com/dead-hosts)\n",
"Repositories for testing [PyFunceble](https://github.com/funilrys/PyFunceble)\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Define which list (repo) will be analyzed (e.g. [Ads_Disconnect.me](https://github.com/Ultimate-Hosts-Blacklist/Ads_Disconnect.me))"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"repo_url = \"https://github.com/Ultimate-Hosts-Blacklist/yoyo.org_domains\"\n",
"repo_dir = repo_url.split('/')[-1]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Download the latest version"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Fetching origin\n",
"HEAD is now at 4616264b [Results] Testing for Ultimate Hosts Blacklist [ci skip]\n"
]
}
],
"source": [
"import os\n",
"if not os.path.exists(repo_dir):\n",
" !git clone $repo_url $repo_dir\n",
" os.chdir(repo_dir)\n",
"else:\n",
" os.chdir(repo_dir)\n",
" #!git pull origin master\n",
" !git fetch --all\n",
" !git reset --hard origin/master"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Get the interesting commits\n",
"(Not only the final results are committed; the state is also autosaved periodically (every 15 minutes), see: [PyFunceble/auto_save.py](https://github.com/funilrys/PyFunceble/blob/master/PyFunceble/auto_save.py))"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"git_log_format_param = \"--format=format:\\\"%H %T %at\\\"\"\n",
"def get_git_commits(git_log_command):\n",
" !git reset --hard origin/master\n",
" git_log = !{git_log_command}\n",
" return [{'commit_hash':commit[0],'tree_hash':commit[1],'timestamp':commit[2]} for commit in [ line.split(' ') for line in git_log] ]"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"HEAD is now at 4616264b [Results] Testing for Ultimate Hosts Blacklist [ci skip]\r\n"
]
},
{
"data": {
"text/plain": [
"190"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"result_commits = get_git_commits(\"git log --grep=\\\" \\\\[ci skip\\\\]\\\" \"+git_log_format_param)\n",
"len(result_commits)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Go back to the interesting commits, and get the status there"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Check if the process finished \n",
"[continue.json](https://github.com/Ultimate-Hosts-Blacklist/repository-structure/blob/master/output/continue.json)\n",
"[info.json](https://github.com/Ultimate-Hosts-Blacklist/repository-structure/blob/master/info.json)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"\n",
"def check_not_in_continue():\n",
" with open('output/continue.json') as fd:\n",
" for list_file,status in json.load(fd).items():\n",
" if sum([count for s,count in status.items()]) != 0:\n",
" print('Warning: continue status not 0!',list_file,status)\n",
" !git log -n 1\n",
"\n",
"def check_test_finished():\n",
" with open('info.json') as fd:\n",
" for key,value in json.load(fd).items():\n",
" if key == \"currently_under_test\":\n",
" if value=='0':\n",
" return True\n",
" else:\n",
" print(\"Warning! Test not finished! \"+value)\n",
" !git log -n 1\n",
" return False\n",
" print(\"Warning! 'currently_under_test' not found in 'info.json'\")\n",
" return False"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Helper for parsing [percentage.txt](https://github.com/Ultimate-Hosts-Blacklist/repository-structure/blob/master/output/logs/percentage/percentage.txt)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"import re\n",
"def parse_percentage():\n",
" percentage = {}\n",
" with open('output/logs/percentage/percentage.txt') as fd:\n",
" for line in fd:\n",
" m = re.search(r'(?P<status>ACTIVE|INACTIVE|INVALID)\\s*(?P<percentage>\\d*)%\\s*(?P<numbers>\\d*)',line)\n",
" if not m:\n",
" continue\n",
" #percentage[ m.group('status') ] = m.groupdict()\n",
" percentage[ m.group('status') ] = {}\n",
" percentage[ m.group('status') ]['percentage'] = int(m.group('percentage'))\n",
" percentage[ m.group('status') ]['numbers'] = int(m.group('numbers'))\n",
" percentage['SUM_NUMBERS'] = sum([values['numbers'] for status,values in percentage.items()])\n",
" return percentage\n",
"# parse_percentage()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Loop through the commits"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
".............................................................................................................................................................................................."
]
}
],
"source": [
"percentages = {}\n",
"for commit in result_commits:\n",
" print('.', end='', flush=True)\n",
" !git checkout -q {commit['commit_hash']} # bring back the repo to that commit\n",
" #check_not_in_continue() # safety, but who cares\n",
" #check_test_finished() # safety, but who cares\n",
" percentages[ commit['timestamp'] ] = parse_percentage()\n",
"# print(percentages)"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'ACTIVE': {'percentage': 86, 'numbers': 4267},\n",
" 'INACTIVE': {'percentage': 13, 'numbers': 642},\n",
" 'INVALID': {'percentage': 0, 'numbers': 19},\n",
" 'SUM_NUMBERS': 4928}"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"percentages[list(percentages.keys())[0]]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"#### Get the changes to the original list"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"HEAD is now at 4616264b [Results] Testing for Ultimate Hosts Blacklist [ci skip]\r\n"
]
},
{
"data": {
"text/plain": [
"80"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"domains_change_commits = get_git_commits(\"git log \"+git_log_format_param+\" domains.list\")\n",
"len(domains_change_commits)"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"................................................................................"
]
}
],
"source": [
"domains_counts = {}\n",
"for commit in domains_change_commits:\n",
" print('.', end='', flush=True)\n",
" !git checkout -q {commit['commit_hash']} # bring back the repo to that commit\n",
" with open('domains.list') as fd: # close the file after counting its lines\n",
"  domains_counts[ commit['timestamp'] ] = sum(1 for line in fd)\n",
"# print(domains_counts)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"#### Prepare the data for plotting"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"import datetime\n",
"percentage_dates = [datetime.datetime.fromtimestamp(int(timestamp)) for timestamp,values in percentages.items() ]\n",
"active_percentage = [ value['ACTIVE']['percentage'] for timestamp,value in percentages.items() ]\n",
"active_numbers = [ value['ACTIVE']['numbers'] for timestamp,value in percentages.items() ]\n",
"inactive_percentage = [ value['INACTIVE']['percentage'] for timestamp,value in percentages.items() ]\n",
"inactive_numbers = [ value['INACTIVE']['numbers'] for timestamp,value in percentages.items() ]\n",
"invalid_percentage = [ value['INVALID']['percentage'] for timestamp,value in percentages.items() ]\n",
"invalid_numbers = [ value['INVALID']['numbers'] for timestamp,value in percentages.items() ]\n",
"sum_numbers = [ value['SUM_NUMBERS'] for timestamp,value in percentages.items() ]\n",
"\n",
"domain_count_dates = [datetime.datetime.fromtimestamp(int(timestamp)) for timestamp,value in domains_counts.items() ]\n",
"domain_count_values = [value for timestamp,value in domains_counts.items() ]"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"try:\n",
" import matplotlib.pyplot as plt\n",
"except ImportError:\n",
" !pip install matplotlib\n",
" import matplotlib.pyplot as plt\n",
"import numpy as np\n",
"import matplotlib.dates as mdates\n",
"import matplotlib.ticker as ticker"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 1332x756 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"%matplotlib inline\n",
"# or interactive:\n",
"# %matplotlib notebook\n",
"\n",
"# https://matplotlib.org/api/markers_api.html\n",
"# https://matplotlib.org/gallery/lines_bars_and_markers/line_styles_reference.html\n",
"\n",
"plt.plot( percentage_dates, active_numbers, label=\"active\" )\n",
"plt.plot( percentage_dates, inactive_numbers, label=\"inactive\" )\n",
"plt.plot( percentage_dates, invalid_numbers, label=\"invalid\" )\n",
"plt.plot( percentage_dates, sum_numbers, label=\"sum\" )\n",
"\n",
"plt.plot( domain_count_dates, domain_count_values, label=\"domains number (changes)\", linestyle=':', marker='s' )\n",
"\n",
"plt.title(repo_dir)\n",
"\n",
"plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%Y.%m.%d'))\n",
"plt.gca().xaxis.set_major_locator(ticker.MaxNLocator(20))\n",
"plt.xticks( rotation=45 )\n",
"\n",
"# plt.xlabel(\"Date\")\n",
"plt.ylabel(\"Count\")\n",
"\n",
"plt.legend()\n",
"\n",
"plt.gcf().set_size_inches(18.5, 10.5, forward=True)\n",
"plt.gcf().savefig('../'+repo_dir+'.png', dpi=100)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.2"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
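
To try the percentage parsing from the notebook without checking out a repo, here is a minimal sketch that runs the same regular expression on an in-memory string. The helper name `parse_percentage_text` and the column layout of the sample text are assumptions made for illustration; only the three status values and their numbers come from the notebook output. The real `percentage.txt` may be formatted differently.

```python
import re

# Same pattern as in the notebook's parse_percentage().
LINE_RE = re.compile(
    r'(?P<status>ACTIVE|INACTIVE|INVALID)\s*(?P<percentage>\d*)%\s*(?P<numbers>\d*)'
)

def parse_percentage_text(text):
    """Hypothetical variant of parse_percentage() that reads from a string."""
    result = {}
    for line in text.splitlines():
        m = LINE_RE.search(line)
        if not m:
            continue  # skip headers and any other non-status lines
        result[m.group('status')] = {
            'percentage': int(m.group('percentage')),
            'numbers': int(m.group('numbers')),
        }
    # Total over all statuses, computed before the summary key is added.
    result['SUM_NUMBERS'] = sum(v['numbers'] for v in result.values())
    return result

# Assumed layout; the values are the ones shown in the notebook output.
sample = """\
Status      Percentage   Numbers
ACTIVE      86%          4267
INACTIVE    13%          642
INVALID     0%           19
"""

parsed = parse_percentage_text(sample)
print(parsed)
```

Note that the pattern allows empty digit groups (`\d*`), so a status line without numbers would make `int('')` raise a `ValueError`; switching to `\d+` would skip such lines instead.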