Sebastian Nagel (sebastian-nagel)

sebastian-nagel / sitemap-index-cc-224-trim-ws.xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>
      <![CDATA[ http://www.example.com/sitemap1.xml ]]>
    </loc>
    <lastmod>
      <![CDATA[ 2018-12-12 02:06:56 ]]>
    </lastmod>
  </sitemap>
</sitemapindex>
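
The file name ("trim-ws") suggests a test case for whitespace trimming: the URL and date become usable only after the whitespace surrounding the CDATA values is stripped. A minimal parsing sketch using Python's standard library (only the file name comes from the gist; the rest is illustrative):

import xml.etree.ElementTree as ET

NS = '{http://www.sitemaps.org/schemas/sitemap/0.9}'

tree = ET.parse('sitemap-index-cc-224-trim-ws.xml')
for sitemap in tree.getroot().iter(NS + 'sitemap'):
    # CDATA content is merged into .text; strip() removes the
    # surrounding spaces and newlines
    loc = sitemap.find(NS + 'loc').text.strip()
    lastmod = sitemap.find(NS + 'lastmod').text.strip()
    print(loc, lastmod)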

sebastian-nagel / watlinks.path.freq.txt
Created Oct 19, 2017
Link path identifiers from a single Common Crawl WAT file
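
The gist stores only the resulting frequency list. As a hedged illustration (assuming warcio and a hypothetical local WAT file), such a list could be produced like this:

import json
from collections import Counter

from warcio.archiveiterator import ArchiveIterator

counts = Counter()
with open('example.warc.wat.gz', 'rb') as stream:  # hypothetical input file
    for record in ArchiveIterator(stream):
        if record.rec_type != 'metadata':
            continue
        data = json.loads(record.content_stream().read())
        html_meta = (data.get('Envelope', {})
                         .get('Payload-Metadata', {})
                         .get('HTTP-Response-Metadata', {})
                         .get('HTML-Metadata', {}))
        # every extracted link carries a "path" identifier, e.g. "A@/href"
        for link in html_meta.get('Links', []):
            counts[link.get('path', '')] += 1

for path, n in counts.most_common():
    print(n, path)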

sebastian-nagel / cdx_get_warc_record.py
Last active Mar 9, 2018
Python script to export Common Crawl WARC records found via CDX to a file named my.warc.gz: `zgrep '...pattern...' cdx-*.gz | python3 cdx_get_warc_record.py >my.warc.gz`
import fileinput
import sys
import boto3
import botocore
import ujson as json
# assumption: the truncated call configures anonymous (unsigned) S3
# requests, matching the variable name
no_sign_request = botocore.client.Config(
    signature_version=botocore.UNSIGNED)
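
The gist body is cut off above. What follows is a hedged sketch of the loop the description implies, reusing the imports shown; the CDX JSON fields (filename, offset, length) and the commoncrawl bucket name follow the Common Crawl index conventions, and the actual gist code may differ:

s3 = boto3.client('s3', config=no_sign_request)

for line in fileinput.input():
    # a CDX line looks like: "<surt key> <timestamp> {json payload}"
    record = json.loads(line.split(' ', 2)[2])
    offset = int(record['offset'])
    length = int(record['length'])
    # fetch exactly the bytes of one WARC record via an HTTP range request
    resp = s3.get_object(Bucket='commoncrawl',
                         Key=record['filename'],
                         Range='bytes={}-{}'.format(offset, offset + length - 1))
    sys.stdout.buffer.write(resp['Body'].read())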

sebastian-nagel / pyspark_executor_hangup.py
# hanging executor on Spark 2.1.0 and Python 2.7
from pyspark import SparkContext


class BadEncodedException(Exception):
    def __init__(self, reason):
        # on Python 2.7, str() raises a UnicodeEncodeError if `reason` is a
        # unicode string containing non-ASCII characters
        self.msg = str(reason)
        super(BadEncodedException, self).__init__(self.msg)
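
A hypothetical driver snippet (not in the visible part of the gist), assuming the hang is triggered by constructing the exception with a non-ASCII unicode reason: str(reason) then fails while the executor is already handling the first exception, which can leave the executor hanging instead of failing the task cleanly.

def fail(x):
    # the reason is a unicode string with non-ASCII content, so the
    # str() call inside the constructor raises a UnicodeEncodeError
    raise BadEncodedException(u'bad encoding: caf\xe9')

sc = SparkContext(appName='executor-hangup-repro')
sc.parallelize([1, 2, 3]).map(fail).collect()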

sebastian-nagel / get_dmoz_news_links.sh

sebastian-nagel / cs_despam_host_pagerank.py
Last active Aug 18, 2016
Simple spam detection on the Common Search host-level PageRank list: detect blocks of hosts with similar rank and similar host names which possibly form link farms
import fileinput
import sys
import tldextract
from collections import defaultdict
from math import log

# maximum divergence of rank and host name length for two hosts to be
# grouped into the same suspicious block
RANK_DIVERGENCE_THR = 0.02
HOST_LENGTH_DIVERGENCE_THR = 0.15
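
One plausible reading of the two thresholds (an assumption; the rest of the gist is not shown): two hosts adjacent in the rank-sorted list fall into the same suspicious block when both their ranks and their host name lengths diverge by less than the configured fractions.

def in_same_block(rank_a, rank_b, host_a, host_b):
    # relative divergence of the two ranks (assumed interpretation)
    rank_div = abs(rank_a - rank_b) / max(rank_a, rank_b)
    # relative divergence of the host name lengths
    len_div = (abs(len(host_a) - len(host_b))
               / float(max(len(host_a), len(host_b))))
    return (rank_div < RANK_DIVERGENCE_THR
            and len_div < HOST_LENGTH_DIVERGENCE_THR)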

sebastian-nagel / common-crawl-cdx.py
# -*- coding: utf-8 -*-
"""
common-crawl-cdx.py
A simple example program to analyze the Common Crawl index.
This is implemented as a single-stream job which accesses S3 via HTTP,
so that it can easily be run from any laptop, but it could readily be
converted to an EMR job which processes the 300 index files in parallel.
"""
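
The gist is cut off after the docstring. A hedged sketch of the single-stream pattern it describes, counting HTTP status codes in one index shard; the URL is an illustrative example (the real shard paths for each crawl are listed in its cc-index.paths file):

import gzip
import json
from collections import Counter

import requests

# hypothetical example shard; one of roughly 300 per crawl
INDEX_URL = ('https://data.commoncrawl.org/cc-index/collections/'
             'CC-MAIN-2018-05/indexes/cdx-00000.gz')

status_counts = Counter()
resp = requests.get(INDEX_URL, stream=True)
with gzip.GzipFile(fileobj=resp.raw) as cdx:
    for line in cdx:
        # each line: "<surt key> <timestamp> {json payload}"
        fields = json.loads(line.split(b' ', 2)[2])
        status_counts[fields.get('status')] += 1

print(status_counts.most_common())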