Sebastian Nagel sebastian-nagel
View sitemap-index-cc-224-trim-ws.xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="">
<![CDATA[ ]]>
<![CDATA[ 2018-12-12 02:06:56 ]]>
sebastian-nagel / watlinks.path.freq.txt
Created Oct 19, 2017
Link path identifiers from a single Common Crawl WAT file
View watlinks.path.freq.txt
sebastian-nagel /
Last active Mar 9, 2018
Python script to export Common Crawl WARC records found via CDX to a file named my.warc.gz: `zgrep '...pattern...' cdx-*.gz | python3 >my.warc.gz`
import fileinput
import sys
import boto3
import botocore
import ujson as json
no_sign_request = botocore.client.Config(
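The export described above depends on turning each matching CDX line into a byte range inside a WARC file. A minimal sketch of that step follows. It assumes each line has the form `urlkey timestamp {json}` with `filename`, `offset` and `length` fields in the JSON part, and that `zgrep` may prefix the line with the name of the index file. The helper names `parse_cdx_line` and `byte_range` are hypothetical and are not taken from the gist.

```python
import json


def parse_cdx_line(line):
    """Return (filename, offset, length) for one CDX line, or None.

    Assumed line layout: "urlkey timestamp {json}". A "cdx-....gz:" prefix
    added by zgrep is attached to the first field, so splitting on the
    first two spaces still isolates the JSON part.
    """
    parts = line.split(" ", 2)
    if len(parts) != 3:
        return None
    try:
        record = json.loads(parts[2])
        return record["filename"], int(record["offset"]), int(record["length"])
    except (KeyError, TypeError, ValueError):
        # Malformed JSON or missing/non-numeric fields: skip the line.
        return None


def byte_range(offset, length):
    """Build an HTTP Range header value covering exactly one record."""
    return "bytes={}-{}".format(offset, offset + length - 1)
```

Each range can then be requested from the WARC file (for example via the `Range` parameter of an S3 `get_object` call), and the returned gzip members written one after another to standard output.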
# hanging executor on Spark 2.1.0 and Python 2.7
from pyspark import SparkContext
class BadEncodedException(Exception):
    def __init__(self, reason):
        self.msg = str(reason)
        super(BadEncodedException, self).__init__(self.msg)
sebastian-nagel /
Last active Aug 18, 2016
Simple spam detection for the Common Search host-level page rank list: detect blocks of hosts with similar ranks and host names, which may form link farms
import fileinput
import sys
import tldextract
from collections import defaultdict
from math import log
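The idea in the description, grouping neighbouring hosts whose ranks and names are nearly identical, could be sketched as follows. This is an illustration under stated assumptions, not the gist's method: name similarity is measured with `difflib.SequenceMatcher` from the standard library instead of `tldextract`, and the thresholds `max_rank_gap`, `min_ratio` and `min_size` are hypothetical values chosen for the example.

```python
from difflib import SequenceMatcher


def similar(a, b, min_ratio=0.8):
    """True if two host names share most of their characters."""
    return SequenceMatcher(None, a, b).ratio() >= min_ratio


def find_blocks(ranked_hosts, max_rank_gap=0.05, min_ratio=0.8, min_size=3):
    """Group consecutive hosts with close ranks and similar names.

    ranked_hosts: list of (host, rank) tuples, already sorted by rank.
    Returns a list of blocks, each a list of host names.
    """
    blocks, current = [], []
    for host, rank in ranked_hosts:
        if current:
            prev_host, prev_rank = current[-1]
            close_rank = abs(prev_rank - rank) <= max_rank_gap
            if close_rank and similar(prev_host, host, min_ratio):
                # Same block continues.
                current.append((host, rank))
                continue
            # Block ended: keep it only if it is large enough.
            if len(current) >= min_size:
                blocks.append([h for h, _ in current])
        current = [(host, rank)]
    if len(current) >= min_size:
        blocks.append([h for h, _ in current])
    return blocks
```

Comparing only adjacent entries keeps the pass linear in the length of the list, at the cost of missing farms whose hosts are interleaved with unrelated ones.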
# -*- coding: utf-8 -*-
A simple example program to analyze the Common Crawl index.
This is implemented as a single stream job which accesses S3 via HTTP,
so that it can easily be run from any laptop, but it could also be
converted to an EMR job which processes the 300 index files in parallel.
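A single-stream pass of the kind described above can be sketched as a function that reads one gzip-compressed index file from any binary stream and tallies a field. The sketch assumes index lines of the form `urlkey timestamp {json}`; the function name `count_by_field` and the choice of the `status` field are hypothetical and only serve the example.

```python
import gzip
import io
import json
from collections import Counter


def count_by_field(gz_stream, field="status"):
    """Count the values of one JSON field across a gzipped index stream.

    gz_stream: binary file-like object (a local file, an in-memory
    buffer, or a streamed HTTP response body).
    """
    counts = Counter()
    with gzip.open(gz_stream, mode="rt", encoding="utf-8") as lines:
        for line in lines:
            parts = line.rstrip("\n").split(" ", 2)
            if len(parts) != 3:
                continue  # not an index line in the assumed layout
            try:
                record = json.loads(parts[2])
            except ValueError:
                continue  # skip lines whose JSON part is malformed
            counts[record.get(field, "(missing)")] += 1
    return counts


if __name__ == "__main__":
    # Tiny in-memory demo with made-up index lines.
    demo = gzip.compress(
        b'org,example)/ 20180101000000 {"status": "200"}\n'
        b'org,example)/a 20180101000001 {"status": "404"}\n'
    )
    print(count_by_field(io.BytesIO(demo)))
```

Because the function only touches one stream and returns a `Counter`, running it once per index file and summing the counters is one way such a job could be spread across workers.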