103092804.com
1rx.io
247realmedia.com
2leep.com
2mdn.net
2o7.net
33across.com
360yield.com
365media.com
3dstats.com
from StringIO import StringIO
import requests
import zipfile
import random
import json
import os
EC2_LIST = 'http://s3.amazonaws.com/alexa-static/top-1m.csv.zip'
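The preview above stops after the imports and the list URL, so the rest of the script is not shown. As a rough sketch only, and assuming the archive holds a single `rank,domain` CSV, the in-memory unzip and random sampling that these imports suggest could look like the following. It is written for Python 3, so `io.BytesIO` stands in for `StringIO`; the function names and the member name `top-1m.csv` are hypothetical, not taken from the original script.

```python
import csv
import io
import random
import zipfile


def parse_top_sites(zip_bytes, member="top-1m.csv"):
    """Return [(rank, domain), ...] from a zipped 'rank,domain' CSV held in memory.

    The member name is an assumption about the archive layout.
    """
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        with archive.open(member) as raw:
            # Decode the binary zip member so csv.reader receives text rows.
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            return [(int(rank), domain) for rank, domain in csv.reader(text)]


def sample_sites(sites, k, seed=None):
    """Pick k distinct entries at random; a fixed seed makes the sample repeatable."""
    rng = random.Random(seed)
    return rng.sample(sites, k)
```

With network access, the bytes could come from `requests.get(EC2_LIST).content`; keeping the download separate from the parsing makes the parsing step easy to check offline.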
from collections import defaultdict
import dill
import requests
from bs4 import BeautifulSoup
alexa_categories = defaultdict(list)
BASE_URL = 'http://www.alexa.com/topsites/category'
print "Grabbing categories of top sites from %s" % BASE_URL
def get_option_dict(url, top_level_url, content_type=None):
    """Build an options dict for BlockListParser.

    These options are checked here:
    * https://github.com/englehardt/abp-blocklist-parser/blob/40f6bb5b91ea403b7b9852a16d6c57d5ec26cf7f/abp_blocklist_parser/RegexParser.py#L104-L117
    * https://github.com/englehardt/abp-blocklist-parser/blob/40f6bb5b91ea403b7b9852a16d6c57d5ec26cf7f/abp_blocklist_parser/RegexParser.py#L240-L248

    Parameters
    ----------
    url : string
"""
This file contains a collection of utilities for working with BlockListParser
using http data, such as that collected by OpenWPM (https://github.com/citp/OpenWPM).
publicsuffix (https://pypi.python.org/pypi/publicsuffix/) is required
Example usage:
from publicsuffix import PublicSuffixList
from BlockListParser import BlockListParser
from urlparse import urlparse
from Crypto.Hash import MD2
import pandas as pd
import cookies as ck
import hackercodecs  # noqa
import hashlib
import pyblake2
import urllib
import sha3
import mmh3
top_level_hostname,url,redirect_chain
www.prabhasakshi.com,https://fcmatch.google.com/pixel?google_gm=AMnCDorgMK87r03e115XCkX55u3NGIsHQVGSw3sfqlf2vTyPC1FUd-EWM0O9WxVM7-EvH31H1yx2L5xh-p78KY4cOU_R6Gekf76P6ukigIDMufCzxoqAWbwfYeNFjxxmLcH56fmsIuWl,"[""https://e.dlx.addthis.com/e/a-1189/s-3614?redirect_provider_id=3614&ru=https%3A%2F%2Fcm.g.doubleclick.net%2Fpixel%3Fgoogle_nid%3Ddatalogix_dmp%26google_hm%3D%3CNA_ID%3E%26google_push%3DAHNF13If3D87PP63h-DtKCOgSghwXpmcwg4r08mF1ZsSUQ&google_gid=CAESEAbb_EW8Fb8b1FCVaJP9kFc&google_cver=1"",""https://e.dlx.addthis.com/e/a-1189/s-3614?redirect_provider_id=3614&ru=https%3A%2F%2Fcm.g.doubleclick.net%2Fpixel%3Fgoogle_nid%3Ddatalogix_dmp%26google_hm%3D%3CNA_ID%3E%26google_push%3DAHNF13If3D87PP63h-DtKCOgSghwXpmcwg4r08mF1ZsSUQ&google_gid=CAESEAbb_EW8Fb8b1FCVaJP9kFc&google_cver=1&rd=Y"",""https://cm.g.doubleclick.net/pixel?google_nid=datalogix_dmp&google_hm=MjAxOTA4MjcwNDI1MDk5ODY2ODEwMTQ0Mzk5Ng%3D%3D&google_push=AHNF13If3D87PP63h-DtKCOgSghwXpmcwg4r08mF1ZsSUQ"",""https://fcmatch. |
5min.com
abmr.net
aboutecho.com
accounts.google.com
activengage.com
adap.tv
adobe.com
aim.com
akamai.com
akqa.com
// ==UserScript==
// @name Remove Likes on Twitter
// @namespace twitter
// @include https://twitter.com/
// @version 2
// @grant GM_addStyle
// ==/UserScript==
GM_addStyle('div.promoted-tweet, div[data-component-context=suggest_activity_tweet] {display: none !important}'); |
""" This script reads a sqlite database and writes the content to a parquet
database on S3 formatted as OpenWPM would format. It's best to just run this
on AWS as it bottlenecks on the S3 upload. This is a lightly modified version
of OpenWPM's S3Aggregator class.
"""
import os
import sqlite3
import sys
from collections import defaultdict
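
The body of this script is not shown in the preview. As a minimal sketch of only the first stage the docstring describes (reading the SQLite database), the helpers below list the tables and stream each one in fixed-size batches of row dicts, which is the shape a Parquet writer would typically consume. The function names and batch size are hypothetical, and the Parquet conversion and S3 upload are deliberately left out; this is not OpenWPM's S3Aggregator code.

```python
import sqlite3


def list_tables(conn):
    """Return the names of user tables in the SQLite database."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' "
        "AND name NOT LIKE 'sqlite_%' ORDER BY name"
    )
    return [name for (name,) in rows]


def iter_batches(conn, table, batch_size=1000):
    """Yield lists of row dicts from `table`, at most batch_size rows each."""
    # Table names cannot be bound as SQL parameters, so quote the identifier.
    quoted = '"' + table.replace('"', '""') + '"'
    cursor = conn.execute("SELECT * FROM " + quoted)
    columns = [col[0] for col in cursor.description]
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break
        yield [dict(zip(columns, row)) for row in rows]
```

Streaming in batches keeps memory use bounded on large crawl databases, and each batch can be handed to whatever writer performs the upload.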