Steven Englehardt englehardt

  • DuckDuckGo
103092804.com
1rx.io
247realmedia.com
2leep.com
2mdn.net
2o7.net
33across.com
360yield.com
365media.com
3dstats.com
@englehardt
englehardt / alexa_utils.py
Last active March 31, 2022 23:36
A utility file to retrieve and parse the Alexa Top 1 Million site list
from StringIO import StringIO
import requests
import zipfile
import random
import json
import os
EC2_LIST = 'http://s3.amazonaws.com/alexa-static/top-1m.csv.zip'
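The description above covers retrieving a zipped CSV and parsing it. A minimal Python 3 sketch of the parsing step follows, assuming the archive holds a single `top-1m.csv` member with `rank,domain` rows. `parse_top_sites`, `build_sample_archive`, and the sample rows are hypothetical and not part of the original file; nothing is downloaded here.

```python
import csv
import io
import zipfile


def parse_top_sites(zip_bytes, member="top-1m.csv", limit=None):
    """Return [(rank, domain), ...] from a zipped rank,domain CSV.

    Assumption: each row is exactly `rank,domain`; other rows are skipped.
    """
    sites = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        with archive.open(member) as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            for row in csv.reader(text):
                if len(row) != 2:
                    continue  # skip blank or malformed rows
                sites.append((int(row[0]), row[1]))
                if limit is not None and len(sites) >= limit:
                    break
    return sites


def build_sample_archive():
    """Build a tiny stand-in archive in memory (made-up rows, not real ranks)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr(
            "top-1m.csv", "1,example.com\n2,example.org\n3,example.net\n"
        )
    return buffer.getvalue()


if __name__ == "__main__":
    print(parse_top_sites(build_sample_archive(), limit=2))
```

Working on bytes rather than a URL keeps the parsing testable offline; the bytes could just as well come from an HTTP response body.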
@englehardt
englehardt / get_alexa_category_list.py
Last active September 14, 2020 17:01
A scraper that grabs urls for the top 500 sites in each Alexa category. Requires python packages `dill` and `bs4`.
from collections import defaultdict
import dill
import requests
from bs4 import BeautifulSoup
alexa_categories = defaultdict(list)
BASE_URL = 'http://www.alexa.com/topsites/category'
print "Grabbing categories of top sites from %s" % BASE_URL
@englehardt
englehardt / abp-blocklist-parser-options-example.py
Last active April 23, 2020 16:32
ABP blocklist parser options parsing example
def get_option_dict(url, top_level_url, content_type=None):
    """Build an options dict for BlockListParser.

    These options are checked here:
    * https://github.com/englehardt/abp-blocklist-parser/blob/40f6bb5b91ea403b7b9852a16d6c57d5ec26cf7f/abp_blocklist_parser/RegexParser.py#L104-L117
    * https://github.com/englehardt/abp-blocklist-parser/blob/40f6bb5b91ea403b7b9852a16d6c57d5ec26cf7f/abp_blocklist_parser/RegexParser.py#L240-L248

    Parameters
    ----------
    url : string
@englehardt
englehardt / blocklistparser_utils.py
Created September 20, 2016 18:33
BlockListParser Utilities
"""
This file contains a collection of utilities for working with BlockListParser
using HTTP data, such as that collected by OpenWPM (https://github.com/citp/OpenWPM).
publicsuffix (https://pypi.python.org/pypi/publicsuffix/) is required.
Example usage:
from publicsuffix import PublicSuffixList
from BlockListParser import BlockListParser
from urlparse import urlparse
from Crypto.Hash import MD2
import pandas as pd
import cookies as ck
import hackercodecs # noqa
import hashlib
import pyblake2
import urllib
import sha3
import mmh3
@englehardt
englehardt / export(8).csv
Created December 13, 2019 02:00
USE HORIZONTAL SCROLL TO SEE REDIRECT CHAINS
top_level_hostname,url,redirect_chain
www.prabhasakshi.com,https://fcmatch.google.com/pixel?google_gm=AMnCDorgMK87r03e115XCkX55u3NGIsHQVGSw3sfqlf2vTyPC1FUd-EWM0O9WxVM7-EvH31H1yx2L5xh-p78KY4cOU_R6Gekf76P6ukigIDMufCzxoqAWbwfYeNFjxxmLcH56fmsIuWl,"[""https://e.dlx.addthis.com/e/a-1189/s-3614?redirect_provider_id=3614&ru=https%3A%2F%2Fcm.g.doubleclick.net%2Fpixel%3Fgoogle_nid%3Ddatalogix_dmp%26google_hm%3D%3CNA_ID%3E%26google_push%3DAHNF13If3D87PP63h-DtKCOgSghwXpmcwg4r08mF1ZsSUQ&google_gid=CAESEAbb_EW8Fb8b1FCVaJP9kFc&google_cver=1"",""https://e.dlx.addthis.com/e/a-1189/s-3614?redirect_provider_id=3614&ru=https%3A%2F%2Fcm.g.doubleclick.net%2Fpixel%3Fgoogle_nid%3Ddatalogix_dmp%26google_hm%3D%3CNA_ID%3E%26google_push%3DAHNF13If3D87PP63h-DtKCOgSghwXpmcwg4r08mF1ZsSUQ&google_gid=CAESEAbb_EW8Fb8b1FCVaJP9kFc&google_cver=1&rd=Y"",""https://cm.g.doubleclick.net/pixel?google_nid=datalogix_dmp&google_hm=MjAxOTA4MjcwNDI1MDk5ODY2ODEwMTQ0Mzk5Ng%3D%3D&google_push=AHNF13If3D87PP63h-DtKCOgSghwXpmcwg4r08mF1ZsSUQ"",""https://fcmatch.
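Going by the header row and the doubled quotes in the data, the `redirect_chain` column appears to hold a JSON array of URLs serialized inside a quoted CSV field. A minimal Python sketch of reading such a file under that assumption follows. `load_redirect_chains` and the sample row are hypothetical, with short made-up URLs standing in for the real data.

```python
import csv
import io
import json


def load_redirect_chains(csv_text):
    """Yield (top_level_hostname, url, chain) with chain decoded from JSON.

    Assumption: redirect_chain is a JSON array stored in one quoted CSV field.
    """
    reader = csv.DictReader(io.StringIO(csv_text, newline=""))
    for row in reader:
        chain = json.loads(row["redirect_chain"])
        yield row["top_level_hostname"], row["url"], chain


# Made-up sample row in the same three-column layout as the header above.
SAMPLE = (
    "top_level_hostname,url,redirect_chain\n"
    'www.example.com,https://c.example/pixel,'
    '"[""https://a.example/start"",""https://b.example/hop""]"\n'
)

if __name__ == "__main__":
    for hostname, url, chain in load_redirect_chains(SAMPLE):
        print(hostname, url, len(chain))
```

The `csv` module undoes the quote doubling before `json.loads` sees the field, so no manual unescaping is needed. Very long chains may require raising the limit with `csv.field_size_limit`.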
@englehardt
englehardt / level_2_domains_2019-11-21.txt
Created November 21, 2019 18:21
Level 2 blocklist domains as of November 21st, 2019
5min.com
abmr.net
aboutecho.com
accounts.google.com
activengage.com
adap.tv
adobe.com
aim.com
akamai.com
akqa.com
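A domain list like this is commonly consumed by checking whether a hostname equals a listed domain or is a subdomain of one. A minimal Python sketch of that lookup follows. `load_domains` and `is_listed` are hypothetical helpers, and plain suffix walking is an assumption here, not necessarily the matching rule used by whatever consumes this list.

```python
def load_domains(text):
    """Parse one-domain-per-line text into a lowercase set, skipping blanks."""
    return {line.strip().lower() for line in text.splitlines() if line.strip()}


def is_listed(hostname, domains):
    """Return True if hostname or any of its parent domains is in the set."""
    labels = hostname.lower().rstrip(".").split(".")
    # Try "a.b.c", then "b.c", then "c" so whole labels are always compared.
    return any(".".join(labels[i:]) in domains for i in range(len(labels)))


if __name__ == "__main__":
    listed = load_domains("adap.tv\naccounts.google.com\n")
    print(is_listed("cdn.adap.tv", listed))
    print(is_listed("www.google.com", listed))
```

Comparing whole labels rather than raw string suffixes avoids treating `notadap.tv` as a match for `adap.tv`.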
@englehardt
englehardt / Twitter-Remove_Likes.user.js
Last active November 17, 2019 18:08
Greasemonkey userscript to remove tweets from timeline which only show up because they were liked by someone you follow.
// ==UserScript==
// @name Remove Likes on Twitter
// @namespace twitter
// @include https://twitter.com/
// @version 2
// @grant GM_addStyle
// ==/UserScript==
GM_addStyle('div.promoted-tweet, div[data-component-context=suggest_activity_tweet] {display: none !important}');
@englehardt
englehardt / sqlite2parquet.py
Created August 16, 2019 20:58
A file for converting OpenWPM sqlite databases to parquet on S3. This also requires the appropriate `parquet_schema.py` file that matches the sqlite schema. See: https://github.com/mozilla/OpenWPM/blob/master/automation/DataAggregator/parquet_schema.py
""" This script reads a sqlite database and writes its contents to a parquet
database on S3, formatted the way OpenWPM would format it. It's best to run this
on AWS, since it bottlenecks on the S3 upload. This is a lightly modified version
of OpenWPM's S3Aggregator class.
"""
import os
import sqlite3
import sys
from collections import defaultdict