
Steven Englehardt englehardt

@englehardt
englehardt / sqlite2parquet.py
Created Aug 16, 2019
A script for converting OpenWPM SQLite databases to Parquet on S3. It also requires the `parquet_schema.py` file that matches the SQLite schema. See: https://github.com/mozilla/OpenWPM/blob/master/automation/DataAggregator/parquet_schema.py
View sqlite2parquet.py
""" This script reads a sqlite database and writes the content to a parquet
database on S3, formatted the way OpenWPM would format it. It's best to run this
on AWS, since the S3 upload is the bottleneck. This is a lightly modified version
of OpenWPM's S3Aggregator class.
"""
import os
import sqlite3
import sys
from collections import defaultdict
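Parquet is a columnar format, so a conversion like the one described above has to read each SQLite table in batches and pivot the rows into per-column lists before writing. The following is a minimal sketch of that step using only the standard library. `iter_column_batches` and its `batch_size` parameter are hypothetical names, not taken from the gist, and the schema handling and S3 upload of the real script are not shown.

```python
import sqlite3


def iter_column_batches(conn, table, batch_size=10000):
    """Yield {column_name: [values, ...]} dicts of at most batch_size rows."""
    # The table name is interpolated directly, so only pass trusted names.
    cursor = conn.execute('SELECT * FROM "%s"' % table)
    columns = [desc[0] for desc in cursor.description]
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break
        # Pivot the row tuples into one list per column.
        yield {name: [row[i] for row in rows]
               for i, name in enumerate(columns)}
```

Each batch could then be handed to a Parquet writer, for example by building a table with `pyarrow.Table.from_pydict(batch)`. That pairing is an assumption about how one might finish the job, not a description of what the gist does.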
englehardt / disconnect_parsing_example.py
Created May 14, 2019
Example of how to use the DisconnectParser included in `trackingprotection_tools` (https://pypi.org/project/trackingprotection-tools/)
View disconnect_parsing_example.py
from trackingprotection_tools import DisconnectParser
BLOCKLIST_URL = 'https://raw.githubusercontent.com/mozilla-services/shavar-prod-lists/master/disconnect-blacklist.json' # noqa
REMAPPING_URL = 'https://raw.githubusercontent.com/mozilla-services/shavar-list-creation/master/disconnect_mapping.json' # noqa
dc = DisconnectParser(
    blocklist_url=BLOCKLIST_URL,
    disconnect_mapping_url=REMAPPING_URL,
    verbose=True
)
englehardt / generate_hash_list.py
Created Apr 18, 2019
Generate a list of safebrowsing hashes from the raw Disconnect list
View generate_hash_list.py
import base64
import hashlib
import json
import re
import urllib2
from trackingprotection_tools import DisconnectParser
TRACKER_CATEGORIES = [
    'Advertising', 'Analytics', 'Social', 'Content', 'Disconnect'
]
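Safe Browsing-style lists store SHA-256 digests of URL expressions rather than the plain domains. The preview above is cut off before the hashing step, so the following is only a sketch of how it might look. `safebrowsing_hash` and `hash_list` are hypothetical names, and the canonicalization used here (lowercased host plus a trailing slash) is an assumption rather than the gist's confirmed behavior.

```python
import base64
import hashlib


def safebrowsing_hash(domain):
    """Return the base64-encoded SHA-256 digest of 'domain/'."""
    # Assumption: each entry is the bare host followed by a trailing slash.
    expression = domain.strip().lower().rstrip('/') + '/'
    digest = hashlib.sha256(expression.encode('utf-8')).digest()
    return base64.b64encode(digest).decode('ascii')


def hash_list(domains):
    """Hash a collection of domains, de-duplicated and sorted."""
    return sorted({safebrowsing_hash(d) for d in domains})
```

Sorting the output keeps the generated list stable between runs, which makes diffs against a previous list easier to read.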
englehardt / alexa_utils.py
Last active Jul 26, 2018
A utility file to retrieve and parse the Alexa Top 1 Million site list
View alexa_utils.py
from StringIO import StringIO
import requests
import zipfile
import random
import json
import os
EC2_LIST = 'http://s3.amazonaws.com/alexa-static/top-1m.csv.zip'
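The list is distributed as a zip archive containing a CSV of `rank,domain` rows. Since the preview stops at the URL constant, here is a minimal Python 3 sketch of the parsing step (the gist itself targets Python 2, as its `StringIO` import shows). `parse_top_sites` is a hypothetical name, and the sketch assumes the archive holds a single CSV file.

```python
import csv
import io
import zipfile


def parse_top_sites(zip_bytes, n=None):
    """Return [(rank, domain), ...] from a zipped 'rank,domain' CSV."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        # Assumption: the archive holds a single CSV file.
        name = archive.namelist()[0]
        with archive.open(name) as raw:
            reader = csv.reader(io.TextIOWrapper(raw, encoding='utf-8'))
            # Skip blank lines so a stray empty row does not break parsing.
            sites = [(int(row[0]), row[1]) for row in reader if row]
    return sites if n is None else sites[:n]
```

The bytes would typically come from something like `requests.get(EC2_LIST).content`. Keeping the download separate from the parsing makes the parser easy to test offline.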
englehardt / gather_internal_links.py
Last active Nov 6, 2017
A requests-based crawler to gather internal links off of the homepage content of sites.
View gather_internal_links.py
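No preview of this file is shown, so the following is only a sketch of the link-extraction step such a crawler needs, written against the standard library. `internal_links` is a hypothetical name, and treating a link as internal when its hostname equals the page's hostname is an assumption. The gist may instead compare registered domains.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class _LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for key, value in attrs:
                if key == 'href' and value:
                    self.hrefs.append(value)


def internal_links(page_url, html):
    """Return sorted absolute links on the same host as page_url."""
    collector = _LinkCollector()
    collector.feed(html)
    host = urlparse(page_url).hostname
    links = set()
    for href in collector.hrefs:
        # Resolve relative links against the page they were found on.
        absolute = urljoin(page_url, href)
        parsed = urlparse(absolute)
        if parsed.scheme in ('http', 'https') and parsed.hostname == host:
            links.add(absolute)
    return sorted(links)
```

The `html` argument would be the body of a homepage response, for example `requests.get(page_url).text`.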
englehardt / Twitter-Remove_Likes.user.js
Last active Sep 7, 2018
Greasemonkey userscript to remove tweets from timeline which only show up because they were liked by someone you follow.
View Twitter-Remove_Likes.user.js
// ==UserScript==
// @name Remove Likes on Twitter
// @namespace twitter
// @include https://twitter.com/
// @version 2
// @grant GM_addStyle
// ==/UserScript==
GM_addStyle('div.promoted-tweet, div[data-component-context=suggest_activity_tweet] {display: none !important}');
View merge_org_lists.py
from collections import defaultdict
import json
import dill
import os
DATA_DIR = './'
WEBXRAY_LIST = 'webxray_orgs.json'
DISCONNECT_LIST = 'disconnect_list.json'
OUT_LIST = 'merged_organizations.dill'
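The preview ends at the constants, so the merge itself is not visible. A minimal sketch of one way to combine two organization lists follows. It assumes both inputs have already been parsed into `{organization: [domains]}` mappings, which is not necessarily the native format of either the webXray or the Disconnect file, and `merge_org_lists` is a hypothetical name.

```python
from collections import defaultdict


def merge_org_lists(*org_lists):
    """Merge {organization: [domains]} mappings into {organization: set}."""
    merged = defaultdict(set)
    for org_list in org_lists:
        for org, domains in org_list.items():
            # Assumption: organizations match when their names agree
            # case-insensitively.
            merged[org.strip().lower()].update(domains)
    return merged
```

Using sets means a domain listed by both sources appears only once under its organization.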
View organizations.json
{
"persianstat.com": ["persianstat.com"],
"marketgid": ["marketgid.com", "dt07.net", "dt00.net"],
"madvertise": ["madvertise.com"],
"voice2page": ["voice2page.com"],
"mixpanel": ["mixpanel.com"],
"automattic": ["wordpress.com", "polldaddy.com", "automattic.com", "wp.com", "gravatar.com", "intensedebate.com"],
"game advertising online": ["game-advertising-online.com"],
"adconion": ["amgdgt.com", "adconion.com", "smartclip.com", "euroclick.com"],
"sogou": ["sogou.com", "sogoucdn.com"],
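A mapping in this shape (organization name to a list of domains) is usually consumed the other way around: given a hostname, find the owning organization. The following sketch inverts the mapping and looks up a hostname by walking its parent domains. `build_domain_index` and `organization_for` are hypothetical helpers, not part of the gist.

```python
def build_domain_index(organizations):
    """Invert {organization: [domains]} into {domain: organization}."""
    index = {}
    for org, domains in organizations.items():
        for domain in domains:
            # The first organization wins if a domain is listed twice.
            index.setdefault(domain, org)
    return index


def organization_for(hostname, index):
    """Look up a hostname, then each of its parent domains, in the index."""
    labels = hostname.lower().split('.')
    # Stop before the bare top-level label (e.g. 'com').
    for i in range(len(labels) - 1):
        candidate = '.'.join(labels[i:])
        if candidate in index:
            return index[candidate]
    return None
```

With the entries shown above, a hostname such as `stats.wp.com` would resolve to `automattic` through its parent domain `wp.com`.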
englehardt / blocklistparser_utils.py
Created Sep 20, 2016
BlockListParser Utilities
View blocklistparser_utils.py
"""
This file contains a collection of utilities for working with BlockListParser
using HTTP data, such as that collected by OpenWPM (https://github.com/citp/OpenWPM).
publicsuffix (https://pypi.python.org/pypi/publicsuffix/) is required.
Example usage:
from publicsuffix import PublicSuffixList
from BlockListParser import BlockListParser
englehardt / selenium_http_auth.py
Created Dec 17, 2015
Submit HTTP Authentication credentials with Selenium. Note that although the methods exist, Selenium doesn't seem to support native HTTP Auth handling in Firefox.
View selenium_http_auth.py
"""
Steven Englehardt
github.com/englehardt
Some dependencies (probably not exhaustive):
sudo apt-get install python-Xlib scrot xserver-xephyr
sudo pip install pyautogui pyvirtualdisplay
This needs access to a Firefox binary, and hardcodes a relative location.