tfmorris /
Created Mar 29, 2016
Online variance and standard deviation using Welford's algorithm and Java 8 Streams - just a sketch! only lightly tested!!
import java.util.Collections;
import java.util.EnumSet;
import java.util.IntSummaryStatistics;
import java.util.Set;
import java.util.function.BiConsumer;
import java.util.function.BinaryOperator;
import java.util.function.Function;
import java.util.function.Supplier;
import java.util.function.ToIntFunction;
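Only the import list of this snippet is visible here. As a rough illustration of the technique the description names, and not the gist author's code, the following minimal sketch wraps Welford's one-pass update in a stream collector. The class name `WelfordStats` and all of its methods are hypothetical, and it uses `Collector.of` for brevity, which may differ from how the original is structured.

```java
import java.util.function.ToDoubleFunction;
import java.util.stream.Collector;

/** Hypothetical accumulator: running count, mean, and sum of squared deviations (m2). */
final class WelfordStats {
    private long count;
    private double mean;
    private double m2;

    /** Welford's update: fold one value into the running statistics. */
    void accept(double x) {
        count++;
        double delta = x - mean;
        mean += delta / count;
        m2 += delta * (x - mean); // second factor uses the already-updated mean
    }

    /** Merge another partial result into this one (used by parallel streams). */
    WelfordStats combine(WelfordStats other) {
        if (other.count == 0) {
            return this;
        }
        if (count == 0) {
            count = other.count;
            mean = other.mean;
            m2 = other.m2;
            return this;
        }
        long total = count + other.count;
        double delta = other.mean - mean;
        m2 += other.m2 + delta * delta * count * other.count / total;
        mean += delta * other.count / total;
        count = total;
        return this;
    }

    long getCount() { return count; }

    double getMean() { return mean; }

    /** Sample variance (n - 1 denominator); NaN with fewer than two values. */
    double getVariance() { return count > 1 ? m2 / (count - 1) : Double.NaN; }

    double getStandardDeviation() { return Math.sqrt(getVariance()); }

    /** Collector that maps each stream element to a double and accumulates it. */
    static <T> Collector<T, WelfordStats, WelfordStats> collector(ToDoubleFunction<? super T> mapper) {
        return Collector.of(
                WelfordStats::new,
                (stats, item) -> stats.accept(mapper.applyAsDouble(item)),
                WelfordStats::combine,
                Collector.Characteristics.IDENTITY_FINISH);
    }
}
```

Under these assumptions, a call such as `Stream.of(2, 4, 4, 4, 5, 5, 7, 9).collect(WelfordStats.<Integer>collector(Integer::doubleValue))` would return an accumulator whose getters expose the mean, sample variance, and standard deviation after a single pass. The `combine` step is what lets the same collector work on a parallel stream.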
# -*- coding: utf-8 -*-
"""
A simple example program to analyze the Common Crawl index.
This is implemented as a single stream job which accesses S3 via HTTP,
so that it can easily be run from any laptop, but it could easily be
converted to an EMR job which processes the 300 index files in parallel.
"""
import shutil
import urllib2
import platform
import tempfile
import urllib
import os
import subprocess
import webbrowser
import stat
tfmorris /
Last active Dec 19, 2015
BBC Desert Island Discs scraper for the current/old ScraperWiki. Until ScraperWiki shuts down, the original is at
# Scrape BBC Desert Island Discs data including songs, books, and luxury item, if available, for the celebrity "castaways"
# based on original work by Francis Irving with the following changes by Tom Morris July 2012:
# - updated to current BBC page format
# - switched from BeautifulSoup to lxml
# - updated deprecated database calls
# - restructured to run as a single integrated process and not rescrape data it already extracted
import scraperwiki
import scraperwiki.apiwrapper
import lxml.html
abbyy2hocr.xsl
<?xml version='1.0' encoding='utf-8'?>
<xsl:stylesheet version='1.0' xmlns:xsl=''>
<!-- Author: Rod Page -->
<xsl:output method='html' version='1.0' encoding='utf-8' indent='yes'/>
<xsl:variable name="scale" select="800 div //page/@width" />