
Andrew Otto ottomata

View gist:d245072fb77f525799c71599a2d27866
== What onboarding projects are options? What about mixing some research questions with the task of processing data (for example, finding patterns among those who open an account on Wikipedia)? @Joseph
* wikidump text analysis?
** category analysis? Take Tiziano's code and use Hadoop instead of wikidump text.
** (1st, 2nd) historical redirect analysis, add to mediawiki_history (very useful for Analytics)
Please see:
ottomata / Created Sep 10, 2019
Spark Streaming SQL demo with netflow
# From stat1004:
# pyspark2 --jars ~otto/spark-sql-kafka-0-10_2.11-2.3.1.jar,~otto/kafka-clients-1.1.0.jar
# Need spark-sql-kafka for DataStream source and kafka-clients for Kafka serdes.
from pyspark.sql.functions import *
from pyspark.sql.types import *
# Declare a Spark schema that matches the JSON data.
# In a future MEP world this would be automatically loaded
# from a JSONSchema.
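A minimal sketch of what that automatic loading might look like, assuming a hypothetical helper (not part of this gist) that maps flat JSONSchema property types onto a Spark SQL DDL string:

```python
# Hypothetical sketch: derive a Spark SQL DDL schema string from a JSONSchema.
# The mapping and the example schema fragment below are illustrative only.

JSONSCHEMA_TO_SPARK = {
    'string': 'STRING',
    'integer': 'BIGINT',
    'number': 'DOUBLE',
    'boolean': 'BOOLEAN',
}

def jsonschema_to_ddl(jsonschema):
    """Build a 'name TYPE, ...' DDL string from a flat JSONSchema object."""
    fields = []
    for name, prop in jsonschema['properties'].items():
        fields.append(f"{name} {JSONSCHEMA_TO_SPARK[prop['type']]}")
    return ', '.join(fields)

# Illustrative netflow-like schema fragment (not the real one).
netflow_schema = {
    'type': 'object',
    'properties': {
        'ip_src': {'type': 'string'},
        'ip_dst': {'type': 'string'},
        'bytes': {'type': 'integer'},
    },
}

ddl = jsonschema_to_ddl(netflow_schema)
# The resulting DDL string could then be handed to from_json() when
# parsing the Kafka message value in the stream.
```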
View async_dynamic_mocha_tests.js
const assert = require('assert');
const _ = require('lodash');
const semver = require('semver');

function generateSchemaTests(title, majorVersion, schemaInfos) {
    it(`All ${title} schemas should have title ${title}`, function() {
        schemaInfos.forEach((info) => assert.equal(info.schema.title, title));
    });
    it(`All ${title} major version ${majorVersion} schemas should be ${majorVersion}.x.y`, function() {
        schemaInfos.forEach((info) => {
            assert.equal(semver.coerce(_.get(info.schema, '$id')).major, majorVersion);
        });
    });
}
View revision_score.current.yaml
title: mediawiki/revision/score
description: Represents a MW Revision Score event (from ORES).
$id: /mediawiki/revision/score/1.0.0
type: object
### revision/score does not include all revision/common fields, so we
### don't include the revision/common schema, and instead specifically list
### the ones we need.
allOf:
  - $ref: /mediawiki/common/1.0.0
# Stop your Jupyter Notebook server from the JupyterHub UI.
# Move your old venv out of the way (or just delete it)
mv $HOME/venv $HOME/venv-old-$(date +%s)
# create a new empty venv
python3 -m venv --system-site-packages $HOME/venv
# Reinstall the jupyter venv
cd /srv/jupyterhub/deploy
$HOME/venv/bin/pip install --upgrade --no-index --force-reinstall --find-links=/srv/jupyterhub/deploy/artifacts/stretch/wheels --requirement=/srv/jupyterhub/deploy/frozen-requirements.txt
View missing integers task.txt
You are given a very very large list of unsorted integers. These
integers are supposed to be unique and, if sorted, contiguous. However, you
suspect that this is not the case, so you want to write code to check for
missing or duplicate integers. Write code to return these results:
- Are there any missing or duplicate integers?
- How many missing integers?
- How many duplicate integers?
- Which integers are missing?
- Which integers are duplicates, and how many duplicates of each
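One way to sketch a solution (in Python, as an illustration; the task doesn't prescribe a language) is to count occurrences and compare against the full expected range:

```python
from collections import Counter

def check_integers(nums):
    """Report missing and duplicate integers for a list that should be
    unique and contiguous once sorted."""
    counts = Counter(nums)
    full_range = range(min(nums), max(nums) + 1)
    missing = [n for n in full_range if n not in counts]
    duplicates = {n: c for n, c in counts.items() if c > 1}
    return {
        'ok': not missing and not duplicates,
        'num_missing': len(missing),
        'num_duplicates': len(duplicates),
        'missing': missing,
        'duplicates': duplicates,  # integer -> total count seen
    }

result = check_integers([3, 1, 5, 3, 6])
# 2 and 4 are missing; 3 appears twice.
```

For a "very very large" list this in-memory sketch would need adapting (chunked processing, or a distributed job), but the logic is the same.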
View 1.1.0.yaml
title: mediawiki/page/links-change
description: Represents a MW Page Links Change event.
$id: /mediawiki/page/links-change/1.1.0
$schema: ''
type: object
required:
  - $schema
  - meta
  - page_id
  - page_is_redirect
View gist:e9b222597b64d693b35421e1f377f628
Downloaded ring-cors manually and put it in .m2/repositories. Then ran:
~/atlas/apache-maven-3.6.1/bin/mvn -Dhttp.proxyHost=webproxy.eqiad.wmnet -Dhttp.proxyPort=8080 -Dhttps.proxyHost=webproxy.eqiad.wmnet -Dhttps.proxyPort=8080 -DskipTests package -Pdist,embedded-hbase-solr -pl \!:storm-bridge-shim
// use elasticsearch and berkeleydb
View generate-schema.js
#!/usr/bin/env node
'use strict';
const util = require('util');
const _ = require('lodash');
const yaml = require('js-yaml');
const path = require('path');
const glob = require('glob');
const semver = require('semver');
const NodeGit = require('nodegit');
View eventgate librdkafka prometheus statsd exporter
# HELP eventgate_rdkafka_producer_broker_int_latency Kafka Producer per broker window metric
# TYPE eventgate_rdkafka_producer_broker_int_latency gauge
eventgate_rdkafka_producer_broker_int_latency{broker_hostname="kafka-jumbo1001_eqiad_wmnet",broker_id="1001",broker_port="9092",producer_type="guaranteed",quantile="0.50",service="eventgate-analytics"} 0
eventgate_rdkafka_producer_broker_int_latency{broker_hostname="kafka-jumbo1001_eqiad_wmnet",broker_id="1001",broker_port="9092",producer_type="guaranteed",quantile="0.75",service="eventgate-analytics"} 0
eventgate_rdkafka_producer_broker_int_latency{broker_hostname="kafka-jumbo1001_eqiad_wmnet",broker_id="1001",broker_port="9092",producer_type="guaranteed",quantile="0.90",service="eventgate-analytics"} 0
eventgate_rdkafka_producer_broker_int_latency{broker_hostname="kafka-jumbo1001_eqiad_wmnet",broker_id="1001",broker_port="9092",producer_type="guaranteed",quantile="0.95",service="eventgate-analytics"} 0
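Each sample above follows the Prometheus text exposition format, `name{label="value",...} value`. A minimal, illustrative parser for such lines (not part of the exporter itself):

```python
import re

# One sample line: metric name, brace-delimited labels, numeric value.
METRIC_LINE = re.compile(r'^(\w+)\{(.*)\}\s+(\S+)$')
LABEL = re.compile(r'(\w+)="([^"]*)"')

def parse_metric_line(line):
    """Split a Prometheus text-format sample into (name, labels, value)."""
    m = METRIC_LINE.match(line)
    name, raw_labels, value = m.group(1), m.group(2), float(m.group(3))
    labels = dict(LABEL.findall(raw_labels))
    return name, labels, value

name, labels, value = parse_metric_line(
    'eventgate_rdkafka_producer_broker_int_latency'
    '{broker_id="1001",quantile="0.50",service="eventgate-analytics"} 0'
)
```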