View jython_webgraph_commands.sh
### Jython
# install Jython (see https://www.jython.org/download)
wget https://repo1.maven.org/maven2/org/python/jython-standalone/2.7.2/jython-standalone-2.7.2.jar
# clone pywebgraph (fork with modifications)
git clone https://github.com/commoncrawl/py-web-graph.git
cd py-web-graph
# copy console.py into current working directory so that "pywebgraph" is visible as package
cp pywebgraph/console.py .
sebastian-nagel / cs_despam_host_pagerank.py
Last active Jun 11, 2020
Simple spam detection for the Common Search host-level PageRank list: detect blocks of hosts with similar ranks and similar host names, which may form link farms
View cs_despam_host_pagerank.py
import fileinput
import sys
import tldextract
from collections import defaultdict
from math import log
RANK_DIVERGENCE_THR = 0.02
HOST_LENGTH_DIVERGENCE_THR = 0.15
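How the two thresholds are applied is not visible in this preview. One plausible reading, sketched here purely as an assumption, is that consecutive hosts in the rank-sorted list are grouped into a block as long as the relative difference of their ranks and of their host name lengths stays below the respective threshold. The names `relative_divergence`, `find_blocks` and `min_block_size` are hypothetical and do not come from the gist.

```python
RANK_DIVERGENCE_THR = 0.02
HOST_LENGTH_DIVERGENCE_THR = 0.15


def relative_divergence(a, b):
    """Relative difference of two non-negative numbers, in [0, 1]."""
    largest = max(a, b)
    return abs(a - b) / largest if largest else 0.0


def find_blocks(ranked_hosts, min_block_size=3):
    """Group consecutive (host, rank) pairs that look alike.

    Assumption: two neighbours belong to the same block if both their
    ranks and their host name lengths diverge by less than the thresholds.
    """
    blocks, current = [], []
    for host, rank in ranked_hosts:
        if current:
            prev_host, prev_rank = current[-1]
            similar = (
                relative_divergence(rank, prev_rank) < RANK_DIVERGENCE_THR
                and relative_divergence(len(host), len(prev_host))
                < HOST_LENGTH_DIVERGENCE_THR
            )
            if not similar:
                # Close the running block; keep it only if it is large enough.
                if len(current) >= min_block_size:
                    blocks.append(current)
                current = []
        current.append((host, rank))
    if len(current) >= min_block_size:
        blocks.append(current)
    return blocks
```

Blocks found this way are only candidates for link farms; legitimate hosts of one organization can look just as uniform, so a manual check or further signals would still be needed.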
sebastian-nagel / REAMDE.md
Created Oct 21, 2019
character set and content language correlations
View REAMDE.md
View iterate_wet_file.py
from warcio.archiveiterator import ArchiveIterator
with open('path/to/file.wet.gz', 'rb') as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type == 'conversion':
            url = record.rec_headers.get_header('WARC-Target-URI')
            text = record.content_stream().read().decode('utf-8')
View sitemap-index-cc-224-trim-ws.xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap>
<loc>
<![CDATA[ http://www.example.com/sitemap1.xml ]]>
</loc>
<lastmod>
<![CDATA[ 2018-12-12 02:06:56 ]]>
</lastmod>
</sitemap>
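The file name suggests that the point of this sample is the whitespace surrounding the CDATA values. A minimal sketch of how such an index could be read with the whitespace trimmed, using only the Python standard library, might look as follows. The function name `sitemap_locations` is hypothetical, and the sketch assumes a complete, well-formed document (the preview above is cut off before the closing `</sitemapindex>`).

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_locations(xml_data):
    """Return (loc, lastmod) pairs with surrounding whitespace trimmed."""
    root = ET.fromstring(xml_data)
    result = []
    for sitemap in root.findall("sm:sitemap", SITEMAP_NS):
        # CDATA content is exposed as ordinary element text by the parser.
        loc = sitemap.findtext("sm:loc", default="", namespaces=SITEMAP_NS)
        lastmod = sitemap.findtext("sm:lastmod", default="", namespaces=SITEMAP_NS)
        result.append((loc.strip(), lastmod.strip()))
    return result
```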
sebastian-nagel / cdx_get_warc_record.py
Last active Mar 9, 2018
Python script to export Common Crawl WARC records found via CDX to a file named my.warc.gz: `zgrep '...pattern...' cdx-*.gz | python3 cdx_get_warc_record.py >my.warc.gz`
View cdx_get_warc_record.py
import fileinput
import sys
import boto3
import botocore
import ujson as json
no_sign_request = botocore.client.Config(
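The rest of the script is not shown in this preview. The core step it depends on, locating a single record inside a WARC file, can be sketched independently of boto3: each CDX line carries a JSON object with the WARC file name, the byte offset and the compressed record length, which together define a byte range to fetch. The helper name `cdx_to_range` is hypothetical, and the sketch assumes that the JSON object starts at the first `{` on the line.

```python
import json


def cdx_to_range(line):
    """Extract (filename, HTTP Range header value) from one CDX line.

    Assumes the line ends with a JSON object holding "filename",
    "offset" and "length"; anything before the first "{" (URL key,
    timestamp, or a file name prefix added by zgrep) is ignored.
    """
    fields = json.loads(line[line.index("{"):])
    offset = int(fields["offset"])
    length = int(fields["length"])
    # HTTP byte ranges are inclusive on both ends.
    return fields["filename"], "bytes={}-{}".format(offset, offset + length - 1)
```

Because every WARC record is compressed as its own gzip member, the fetched byte ranges can be written to the output one after another and still form a valid `.warc.gz` file.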
sebastian-nagel / watlinks.path.freq.txt
Created Oct 19, 2017
Link path identifiers from a single Common Crawl WAT file
View watlinks.path.freq.txt
View pyspark_executor_hangup.py
# hanging executor on Spark 2.1.0 and Python 2.7
from pyspark import SparkContext
class BadEncodedException(Exception):
    def __init__(self, reason):
        self.msg = str(reason)
        super(BadEncodedException, self).__init__(self.msg)
View get_dmoz_news_links.sh
View common-crawl-cdx.py
# -*- coding: utf-8 -*-
"""
common-crawl-cdx.py
A simple example program to analyze the Common Crawl index.
This is implemented as a single stream job which accesses S3 via HTTP,
so that it can easily be run from any laptop, but it could easily be
converted to an EMR job which processes the 300 index files in parallel.