Dan Andreescu milimetric

  • Wikimedia Foundation
  • New York, NY
@milimetric
milimetric / basic signal timeout.py
Last active December 2, 2023 23:13
Set a timeout for executing Python code in a with statement
import signal
import re
class TimeoutError(Exception):
    pass
class timeout:
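The preview above cuts off before the body of `timeout`. A minimal sketch of how a signal-based timeout context manager could look follows. It is not the gist's actual code: the constructor parameters, the `_handle` method, and the use of `signal.setitimer` are assumptions made for illustration. It works only on Unix and only in the main thread.

```python
import signal
import time


class TimeoutError(Exception):
    """Raised when the guarded block runs too long (shadows the builtin, as in the preview)."""


class timeout:
    """Hypothetical sketch: interrupt the body of a with statement after `seconds`."""

    def __init__(self, seconds, message="timed out"):
        self.seconds = seconds
        self.message = message

    def _handle(self, signum, frame):
        # Runs in the main thread when SIGALRM is delivered.
        raise TimeoutError(self.message)

    def __enter__(self):
        # Remember the previous handler so it can be restored on exit.
        self._previous = signal.signal(signal.SIGALRM, self._handle)
        signal.setitimer(signal.ITIMER_REAL, self.seconds)
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Cancel the pending timer and restore the previous handler.
        signal.setitimer(signal.ITIMER_REAL, 0)
        signal.signal(signal.SIGALRM, self._previous)
        return False  # never swallow exceptions from the body


if __name__ == "__main__":
    try:
        with timeout(0.5):
            time.sleep(5)
    except TimeoutError as err:
        print("interrupted:", err)
```

Restoring the previous handler in `__exit__` keeps the context manager from clobbering any other SIGALRM handler the program has installed.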
@milimetric
milimetric / Missing Sequence Numbers in Hive
Last active January 11, 2023 16:43
Two ways to find missing sequence numbers in huge Hive tables. The first way gets the left and right boundaries of each run of missing sequence numbers. The second way gets the count of boundaries; a count greater than 2 signifies missing sequence numbers. The second way doesn't tell you which sequence numbers are missing or how many, but it runs faster.
/* Common setup, two variants follow
*/
use test;
set tablename=webrequest_esams0;
add jar /home/otto/hive-serdes-1.0-SNAPSHOT.jar;
add jar /usr/lib/hive/lib/hive-contrib-0.10.0-cdh4.3.1.jar;
create temporary function rowSequence AS 'org.apache.hadoop.hive.contrib.udf.UDFRowSequence';
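The two approaches in the description can be illustrated outside Hive with a small Python sketch. This is one interpretation of the description, not a translation of the gist's HiveQL: the function names are hypothetical, and a "boundary" is assumed to mean a present sequence number whose predecessor or successor is absent.

```python
def missing_runs(seqs):
    """First way: (left, right) bounds of each run of missing sequence numbers."""
    ordered = sorted(set(seqs))
    return [(a + 1, b - 1) for a, b in zip(ordered, ordered[1:]) if b - a > 1]


def boundary_count(seqs):
    """Second way: count the edges of the present runs.

    A fully contiguous range has exactly 2 edges (its minimum and maximum),
    so any count above 2 means at least one gap exists.
    """
    present = set(seqs)
    left_edges = sum(1 for n in present if n - 1 not in present)
    right_edges = sum(1 for n in present if n + 1 not in present)
    return left_edges + right_edges


if __name__ == "__main__":
    sample = [1, 2, 3, 6, 7, 10]
    print(missing_runs(sample))    # [(4, 5), (8, 9)]
    print(boundary_count(sample))  # 6
```

The second function only needs membership checks against neighbours, which is why the counting variant can be cheaper than reconstructing every gap.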
@milimetric
milimetric / dag_gen.py
Created March 9, 2022 17:02
Thoughts on DAG generation
projectview_ready = HiveTriggeredHQLTaskFactory(
    'run_hql_and_arhive',
    default_args=default_args,
    ...
)
archive = ArchiveTaskFactory(...)
projectview_ready.sensors() >> projectview_ready.etl() >> archive()
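To make the `>>` chaining above concrete, here is a self-contained sketch that uses plain Python stand-ins rather than Airflow. `Task`, `SketchTaskFactory`, and `chain_names` are hypothetical names invented for this illustration; they are not the factories from the gist and say nothing about how those are implemented.

```python
class Task:
    """Stand-in for a DAG node; `a >> b` records b as downstream of a."""

    def __init__(self, name):
        self.name = name
        self.downstream = []

    def __rshift__(self, other):
        self.downstream.append(other)
        return other  # returning the right operand lets a >> b >> c chain


class SketchTaskFactory:
    """Hypothetical factory exposing sensors() and etl(), mirroring the calls above."""

    def __init__(self, name):
        self.name = name

    def sensors(self):
        return Task(self.name + ".sensors")

    def etl(self):
        return Task(self.name + ".etl")


def chain_names(task):
    """Follow the first downstream link from `task` and collect task names."""
    names = []
    while task is not None:
        names.append(task.name)
        task = task.downstream[0] if task.downstream else None
    return names


if __name__ == "__main__":
    ready = SketchTaskFactory("projectview")
    first = ready.sensors()
    first >> ready.etl() >> Task("archive")
    print(chain_names(first))
```

Because `>>` is left-associative, `a >> b >> c` evaluates as `(a >> b) >> c`, so returning the right-hand operand is what links `b` to `c`.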
@milimetric
milimetric / import_one_hour.sh
Last active January 7, 2020 15:35
Imports one hour of Domasz's pageview data into a partitioned Hive table. It takes roughly 12 seconds to import one hour of data, btw. This means roughly 8 days to import all 7 years of data.
#!/bin/bash
#
# This script does the following:
# 0. reads four arguments from the CLI, in order, as YEAR, MONTH, DAY, HOUR
# 1. downloads the specified hour worth of data from http://dumps.wikimedia.org/other/pagecounts-raw/
# 2. extracts the data into hdfs
# 3. creates a partition on a hive table pointing to this data
#
print_help() {
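The "roughly 8 days" estimate in the description can be checked with a few lines of arithmetic. The sketch assumes 365-day years and a constant 12 seconds per imported hour, with imports run one after another; the function name is made up for this example.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def estimated_import_days(years=7, seconds_per_hour_of_data=12):
    """Days needed to import `years` of hourly files sequentially."""
    hours_of_data = years * 365 * 24  # assumption: ignores leap days
    total_seconds = hours_of_data * seconds_per_hour_of_data
    return hours_of_data, total_seconds / SECONDS_PER_DAY


if __name__ == "__main__":
    hours, days = estimated_import_days()
    print(hours, "hourly files, about", round(days, 1), "days")
```

With these assumptions the total comes to about 8.5 days, consistent with the rough figure in the description.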
@milimetric
milimetric / query.sql
Created February 22, 2019 16:51
example query to mess around with
use wmf;
-- new data
select coalesce(c.country, g.country_code) as country,
sum(edit_count) as edits,
sum(namespace_zero_edit_count) as namespace_zero_edits
from geoeditors_edits_monthly g
inner join
(select distinct dbname
select ar_id, ar_namespace, ar_title, NULL as ar_text, NULL as ar_comment, NULL as ar_comment_id,
case when ar_deleted&4 != 0 then null when ar_actor = 0
then ar_user else COALESCE( actor_user, 0 ) END AS ar_user,
case when ar_deleted&4 != 0 then null when ar_actor = 0
then ar_user_text else actor_name END AS ar_user_text,
if(ar_deleted&4 <> 0,0,ar_actor) as ar_actor, ar_timestamp, ar_minor_edit, NULL as ar_flags, ar_rev_id,
case when ar_deleted&1 != 0 then null when content_id is NULL then ar_text_id
else content_id end as ar_text_id,
ar_deleted, if(ar_deleted&1 <> 0,null,ar_len) as ar_len,
@milimetric
milimetric / survive-1.sql
Created June 13, 2018 20:28
history query example
with users_with_revisions as (
    select event_user_id,
           event_timestamp
      from mediawiki_history
     where event_entity = 'revision'
       and event_type = 'create'
       and snapshot = '2018-05'
       and wiki_db = 'enwiki'
)
@milimetric
milimetric / OojsUiCheckBoxInputWidget.vue
Created May 24, 2017 21:14
This is a quick example that shows how to wrap an oojs-ui component in a Vue component. It's nasty because of the lack of componentization of oojs-ui, but it's just a proof of concept.
<template>
<!-- In Vue, $el is this root element defined in the template section -->
<span></span>
</template>
<script>
// the script-loader webpack plugin has to be used to hack the oojs-ui files directly into script tags
// because they have no modularization whatsoever (AMD, ES6, etc.)
import 'script-loader!oojs/dist/oojs.jquery'
import 'oojs-ui/dist/oojs-ui-core'
{
"dataSources" : [
{
"spec" : {
"dataSchema" : {
"dataSource" : "pageviews-hourly",
"metricsSpec" : [
{
"name" : "view_count",
"type" : "longSum",
# NOTE: required for the following to work:
# !pip install pymysql
# !git clone https://gerrit.wikimedia.org/r/p/operations/mediawiki-config
# !cd mediawiki-config && git pull origin master
import pymysql
import ipaddress
import os
connection = pymysql.connect(
host='analytics-store.eqiad.wmnet',