Dan Andreescu milimetric

  • Wikimedia Foundation
  • New York, NY
def deduplicate(list_of_objects, key_function):
    uniques = dict()
    for o in list_of_objects:
        key = key_function(o)
        if key not in uniques:
            uniques[key] = o
    return uniques.values()
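A usage sketch for the function above, assuming a few hypothetical sample records (the `editors` list and its field names are not from the gist). The function is repeated with its indentation so the block runs on its own. Wrapping the result in `list()` is an addition for Python 3, where `dict.values()` returns a view rather than a list.

```python
def deduplicate(list_of_objects, key_function):
    # keep the first object seen for each key
    uniques = dict()
    for o in list_of_objects:
        key = key_function(o)
        if key not in uniques:
            uniques[key] = o
    # list() is an assumption for Python 3; the original returns values() directly
    return list(uniques.values())


# hypothetical sample data, for illustration only
editors = [
    {"user_id": 1, "name": "first"},
    {"user_id": 2, "name": "second"},
    {"user_id": 1, "name": "duplicate of first"},
]

unique_editors = deduplicate(editors, lambda e: e["user_id"])
print(unique_editors)
```

Because the dictionary only stores a key the first time it appears, later objects with the same key are dropped and the surviving objects keep their original order.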
@milimetric
milimetric / umapi.parallel.test
Last active December 17, 2015 10:09
quick and dirty script to test parallelism on umapi
# fill in u and p to the proper usernames and passwords
username=u
password=p
htuser=u
htpass=p
curl --data "username=$username&password=$password" https://$htuser:$htpass@metrics.wikimedia.org/login -c ~/umapi.session
for cohort in test e2_aft5_cta4 e3_ob2b_gettingstarted_page-impression e3_ob4b_gettingstarted-addlinks_page-impression e3_ob4b_gettingstarted-clarify_page-impression e3_ob4b_gettingstarted-copyedit_page-impression
do
/srv/debugging.wmflabs.org/
/srv/dev-reportcard.wmflabs.org/
/srv/ee-dashboard.wmflabs.org/
/srv/gerrit-stats.wmflabs.org/
/srv/gp.wmflabs.org/
/srv/mobile-reportcard-dev.wmflabs.org/
/srv/mobile-reportcard.wmflabs.org/
/srv/test-reportcard.wmflabs.org/
REGISTER 'kraken-pig-0.0.2-SNAPSHOT.jar'
REGISTER 'kraken-generic-0.0.2-SNAPSHOT-jar-with-dependencies.jar'
REGISTER 'geoip-1.2.5.jar'
IMPORT 'include/load_webrequest.pig';
SET default_parallel 2;
DEFINE TO_HOUR org.wikimedia.analytics.kraken.pig.ConvertDateFormat('yyyy-MM-dd\'T\'HH:mm:ss', 'yyyy-MM-dd_HH');
DEFINE EXTRACT org.apache.pig.builtin.REGEX_EXTRACT_ALL();
DEFINE ZERO org.wikimedia.analytics.kraken.pig.Zero();
LOG_FIELDS = LOAD_WEBREQUEST('/wmf/raw/webrequest/webrequest-wikipedia-mobile/dt=2013-05-01*');
LOG_FIELDS = FILTER LOG_FIELDS BY (x_cs != '-');
self.create_test_cohort(
    editor_count=4,
    revisions_per_editor=3,
    revision_timestamps=[
        [
            datetime(2012, 12, 31, 23, 0, 0),
            datetime(2013, 1, 1, 0, 30, 0),
            datetime(2013, 1, 1, 1, 0, 0),
        ],
        [
@milimetric
milimetric / aggregate_daily.hql
Created October 4, 2013 19:57
Hive script to create an internal table and insert hourly data aggregated at the daily level.
DROP TABLE IF EXISTS milimetric_pagecounts_daily;
CREATE TABLE IF NOT EXISTS milimetric_pagecounts_daily(
    project string,
    page string,
    views int,
    bytes int,
    year int,
    month int,
    day int
)
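The preview stops before the INSERT statement, so the aggregation itself is not shown. As a rough illustration of what rolling hourly data up to the daily level means for the columns above, here is a minimal Python sketch. The hourly row layout (including the trailing `hour` field), the sample values, and the function name `aggregate_daily` are all assumptions, not taken from the script.

```python
from collections import defaultdict

# hypothetical hourly rows: (project, page, views, bytes, year, month, day, hour)
hourly = [
    ("en", "Main_Page", 10, 1000, 2013, 10, 1, 0),
    ("en", "Main_Page", 5, 500, 2013, 10, 1, 1),
    ("en", "Main_Page", 7, 700, 2013, 10, 2, 0),
    ("de", "Hauptseite", 3, 300, 2013, 10, 1, 0),
]


def aggregate_daily(rows):
    """Sum views and bytes per (project, page, year, month, day)."""
    totals = defaultdict(lambda: [0, 0])
    for project, page, views, nbytes, year, month, day, _hour in rows:
        key = (project, page, year, month, day)
        totals[key][0] += views
        totals[key][1] += nbytes
    # emit rows in the column order of the daily table
    return [
        (project, page, views, nbytes, year, month, day)
        for (project, page, year, month, day), (views, nbytes) in sorted(totals.items())
    ]


daily = aggregate_daily(hourly)
for row in daily:
    print(row)
```

In Hive the same idea would normally be a GROUP BY over the day-level columns with SUM over views and bytes.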
@milimetric
milimetric / differ.py
Last active December 25, 2015 00:28
python to diff two lists of datetime strings, each with their own format
from datetime import datetime
def diff_datewise(left, right, left_format=None, right_format=None):
    """
    Parameters
        left            : a list of datetime strings or objects
        right           : a list of datetime strings or objects
        left_format     : None if left contains datetimes, or strptime format
        right_format    : None if right contains datetimes, or strptime format
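The preview is cut off before the function body, so the following is only a sketch of how the documented parameters could be used, under a different name (`diff_datewise_sketch`) to make clear it is not the gist's code. The return shape, a pair of sorted lists holding the datetimes found only in `left` and only in `right`, is an assumption.

```python
from datetime import datetime


def diff_datewise_sketch(left, right, left_format=None, right_format=None):
    """Hypothetical body: returns (only_in_left, only_in_right)."""

    def parse(values, fmt):
        # fmt is None when the values are already datetime objects
        if fmt is None:
            return list(values)
        return [datetime.strptime(v, fmt) for v in values]

    left_set = set(parse(left, left_format))
    right_set = set(parse(right, right_format))

    # compare as datetimes, so differing string formats do not matter
    only_in_left = sorted(left_set - right_set)
    only_in_right = sorted(right_set - left_set)
    return only_in_left, only_in_right


only_left, only_right = diff_datewise_sketch(
    ["2013-01-01 00:00:00", "2013-01-02 00:00:00"],
    ["01/02/2013", "01/03/2013"],
    left_format="%Y-%m-%d %H:%M:%S",
    right_format="%m/%d/%Y",
)
print(only_left, only_right)
```

Parsing both sides into `datetime` objects before comparing is what lets each list carry its own string format.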
@milimetric
milimetric / categoryEdgeList.sql
Last active December 25, 2015 07:09
Category Edge List for MediaWiki content
select p1.page_title as child_title
,p1.page_id as child_id
,p2.page_title as parent_title
,p2.page_id as parent_id
from categorylinks cl
inner join
page p1 on p1.page_id = cl.cl_from
inner join
page p2 on p2.page_title = cl.cl_to
and p2.page_id <> cl.cl_from
*swp
@milimetric
milimetric / redis.conf
Created December 11, 2013 17:01
wikimetrics configuration for redis
daemonize yes
pidfile /var/run/redis.pid
port 6379
timeout 0
loglevel debug
logfile /var/log/redis/redis-server.log
databases 16
save 900 1
save 300 10
save 60 20