
@sebastian-nagel
sebastian-nagel / jython_webgraph_commands.sh
Last active September 28, 2020 13:38
webgraph commands
### Jython
# install Jython (see https://www.jython.org/download)
wget https://repo1.maven.org/maven2/org/python/jython-standalone/2.7.2/jython-standalone-2.7.2.jar
# clone pywebgraph (fork with modifications)
git clone https://github.com/commoncrawl/py-web-graph.git
cd py-web-graph
# copy console.py into the current working directory so that "pywebgraph" is importable as a package
cp pywebgraph/console.py .
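# (sketch, not part of the original gist) launch the interactive console under
# Jython; "webgraph.jar" is a placeholder for a local copy of the LAW webgraph
# jar and its dependencies, which need to be on the classpath
java -cp jython-standalone-2.7.2.jar:webgraph.jar org.python.util.jython console.py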
@sebastian-nagel
sebastian-nagel / REAMDE.md
Created October 21, 2019 13:05
character set and content language correlations
from warcio.archiveiterator import ArchiveIterator

# iterate over the plain-text ("conversion") records of a WET file
with open('path/to/file.wet.gz', 'rb') as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type == 'conversion':
            url = record.rec_headers.get_header('WARC-Target-URI')
            text = record.content_stream().read().decode('utf-8')
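The preview stops here. A minimal sketch of the charset/language tally the description suggests, assuming the WET conversion records carry the WARC-Identified-Content-Language and WARC-Identified-Content-Charset headers (both header names are assumptions; records without them are skipped):

import sys
from collections import Counter

from warcio.archiveiterator import ArchiveIterator

pairs = Counter()
with open(sys.argv[1], 'rb') as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type != 'conversion':
            continue
        # both header names are assumptions, see the note above
        lang = record.rec_headers.get_header('WARC-Identified-Content-Language')
        charset = record.rec_headers.get_header('WARC-Identified-Content-Charset')
        if lang and charset:
            pairs[(charset, lang)] += 1

for (charset, lang), count in pairs.most_common(20):
    print('%6d\t%s\t%s' % (count, charset, lang))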
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>
      <![CDATA[ http://www.example.com/sitemap1.xml ]]>
    </loc>
    <lastmod>
      <![CDATA[ 2018-12-12 02:06:56 ]]>
    </lastmod>
  </sitemap>
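The CDATA-wrapped values carry padding whitespace that a consumer has to strip. A minimal parsing sketch in Python ('sitemapindex.xml' is a placeholder for a complete, well-formed copy of such a sitemap index):

import xml.etree.ElementTree as ET

NS = '{http://www.sitemaps.org/schemas/sitemap/0.9}'

tree = ET.parse('sitemapindex.xml')
for sitemap in tree.getroot().findall(NS + 'sitemap'):
    # CDATA sections arrive as ordinary text; strip the padding whitespace
    loc = (sitemap.findtext(NS + 'loc') or '').strip()
    lastmod = (sitemap.findtext(NS + 'lastmod') or '').strip()
    print(loc, lastmod)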
@sebastian-nagel
sebastian-nagel / watlinks.path.freq.txt
Created October 19, 2017 14:20
Link path identifiers from a single Common Crawl WAT file
@sebastian-nagel
sebastian-nagel / cdx_get_warc_record.py
Last active March 9, 2018 08:59
Python script to export Common Crawl WARC records found via CDX to a file named my.warc.gz: `zgrep '...pattern...' cdx-*.gz | python3 cdx_get_warc_record.py >my.warc.gz`
import fileinput
import sys

import boto3
import botocore
import ujson as json

# Common Crawl's S3 bucket allows anonymous access, so send unsigned requests
no_sign_request = botocore.client.Config(
    signature_version=botocore.UNSIGNED)
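The rest of the script is cut off in the preview. A sketch of the core loop the description implies, assuming standard Common Crawl CDX(J) lines (SURT key, timestamp, then a JSON blob with filename, offset and length) on stdin; the raw gzipped records are concatenated to stdout, which yields a valid my.warc.gz:

s3 = boto3.client('s3', config=no_sign_request)

for line in fileinput.input():
    # the third field of a CDX(J) line is a JSON blob describing the capture
    fields = json.loads(line.strip().split(' ', 2)[2])
    offset = int(fields['offset'])
    length = int(fields['length'])
    # fetch exactly the byte range of one gzipped WARC record
    resp = s3.get_object(Bucket='commoncrawl',
                         Key=fields['filename'],
                         Range='bytes=%d-%d' % (offset, offset + length - 1))
    # concatenated gzip members form a valid .warc.gz
    sys.stdout.buffer.write(resp['Body'].read())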
# hanging executor on Spark 2.1.0 and Python 2.7
from pyspark import SparkContext

class BadEncodedException(Exception):
    # carry only a stringified reason as the exception message
    def __init__(self, reason):
        self.msg = str(reason)
        super(BadEncodedException, self).__init__(self.msg)
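The preview breaks off here. A hypothetical usage sketch (the decode function and the sample data are assumptions, not part of the gist) raising the custom exception from a mapped function:

def decode_line(line):
    # re-raise decoding problems with a plain string message
    try:
        return line.decode('utf-8')
    except UnicodeDecodeError as e:
        raise BadEncodedException(e)

sc = SparkContext(appName='bad-encoding-example')
data = sc.parallelize([b'ok', b'\xff\xfe broken'])
try:
    data.map(decode_line).collect()
except Exception as e:
    print('job failed: %s' % e)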
@sebastian-nagel
sebastian-nagel / cs_despam_host_pagerank.py
Last active November 9, 2022 22:17
Simple spam detection on the Common Search host-level PageRank list: detect blocks of hosts with similar ranks and similar host names, which possibly form link farms
import fileinput
import sys
import tldextract
from collections import defaultdict
from math import log
RANK_DIVERGENCE_THR = 0.02
HOST_LENGTH_DIVERGENCE_THR = 0.15
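The preview ends with the two thresholds. One plausible reading of the heuristic described above, not the gist's actual implementation (the input format and the block-flagging rule are assumptions): walk the rank-sorted host list and flag long runs whose ranks and host-name lengths stay within the two divergence thresholds.

def find_suspicious_blocks(ranked_hosts, min_block_size=10):
    """ranked_hosts: (rank, host) pairs sorted by descending rank"""
    block = []
    for rank, host in ranked_hosts:
        if block:
            prev_rank, prev_host = block[-1]
            rank_div = abs(prev_rank - rank) / max(prev_rank, rank)
            len_div = abs(len(prev_host) - len(host)) / max(len(prev_host), len(host))
            if rank_div > RANK_DIVERGENCE_THR or len_div > HOST_LENGTH_DIVERGENCE_THR:
                if len(block) >= min_block_size:
                    yield list(block)
                block = []
        block.append((rank, host))
    if len(block) >= min_block_size:
        yield list(block)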
# -*- coding: utf-8 -*-
"""
common-crawl-cdx.py
A simple example program to analyze the Common Crawl index.
This is implemented as a single streaming job which accesses S3 via HTTP,
so that it can easily be run from any laptop, but it could just as easily be
converted to an EMR job that processes the 300 index files in parallel.