Skip to content

Instantly share code, notes, and snippets.

View milimetric's full-sized avatar

Dan Andreescu milimetric

  • Wikimedia Foundation
  • New York, NY
View GitHub Profile
@milimetric
milimetric / distinct-wikis-for-one-hour-of-pageviews
Created November 17, 2015 20:59
To help with finding the correct project name for the WMF Pageview API, here is a list of all distinct projects taken from one hour of pageviews.
aa.wikibooks
aa.wikipedia
aa.wiktionary
ab.wikipedia
ace.wikipedia
af.wikibooks
af.wikipedia
af.wikiquote
af.wiktionary
ak.wikipedia
import sqlalchemy
from sqlalchemy import create_engine, MetaData, Column, Integer
engine = create_engine('mysql://test:test@localhost/test')
meta = MetaData()
table = sqlalchemy.Table(
'test_table',
meta,
Column('test_col', Integer),
-- From the Visibility Data
select event_country,
sum(if(event_issues = 0, 1, 0)) as 'No Issues',
sum(if(event_issues = 1, 1, 0)) as 'Issues with ş',
sum(if(event_issues = 2, 1, 0)) as 'Issues with ţ',
sum(if(event_issues = 3, 1, 0)) as 'Issues with both',
count(*) as 'Total'
from DiacriticsVisibility_15606222
group by event_country
;
val knownStates = sc.parallelize(Array.empty[PageState])
val a2b = new PageEvent( oldTitle = Some("A"), newTitle = Some("B"), timestamp = Some("2011"),
eventType = Some("move"), causedByUserId = Some(1), causedByUserName = Some("Dan")
)
val b2c = new PageEvent( oldTitle = Some("B"), newTitle = Some("C"), timestamp = Some("2015"),
eventType = Some("move"), causedByUserId = Some(1), causedByUserName = Some("Dan")
)
val d2e = new PageEvent( oldTitle = Some("D"), newTitle = Some("E"), timestamp = Some("2014"),
eventType = Some("move"), causedByUserId = Some(1), causedByUserName = Some("Dan")
-- this is a collection of odd findings while trying to reconstruct mediawiki history
-- 1. revisions that should stand for a page creation without a corresponding record in the page table
-- REASON: move_redir orphans records in the revision table instead of archiving them like the move_and_delete
-- MovePage::moveToInternal $moveOverRedirect, $newpage->doDeleteUpdates 503
-- as opposed to the correct: SpeicalMovepage.php $page->doDeleteArticleReal line 559
select count(*)
from revision
where rev_parent_id = 0
and rev_page not in (select page_id from page);
@milimetric
milimetric / namespace.map.py
Last active August 8, 2016 21:35
Map localized namespace prefixes to namespace integer codes for all (public) wikimedia projects.
import requests
import json
import csv
headers = {
'User-Agent': 'Bot Dan Andreescu parsing Special:PrefixIndex pages',
'From': 'dandreescu@wikimedia.org'
}
-- join oldTitle, oldNamespace, timestamp to title, namespace, end and vice versa, should match perfectly
select sum(if(log_title is null, 1, 0)) as history_without_log_line,
sum(if(page_title is null, 1, 0)) as log_line_without_history
from mforns.simplewiki_page_history
FULL OUTER JOIN
milimetric.simplewiki_logging ON cast(log_title as string) = page_title
AND log_namespace = page_namespace
AND cast(log_timestamp as string) = `end`
AND `end` is not null
-- ran this on the hadoop cluster (not available publicly)
select std(edits_per_page),
avg(edits_per_page),
std(days_from_first_to_last_edit),
avg(days_from_first_to_last_edit)
from (select rev_page pid,
(unix_timestamp(max(rev_timestamp),'yyyyMMddHHmmss') - unix_timestamp(min(rev_timestamp),'yyyyMMddHHmmss')) / 86400 days_from_first_to_last_edit,
count(*) edits_per_page
from mediawiki_revision
inner join
ADD JAR /srv/deployment/analytics/refinery/artifacts/org/wikimedia/analytics/refinery/refinery-hive-0.0.35.jar;
CREATE TEMPORARY FUNCTION ua_parser as 'org.wikimedia.analytics.refinery.hive.UAParserUDF';
select ua_parser('Wikimedia Bot');
@milimetric
milimetric / getting wiki project pageviews.sql
Created September 20, 2016 21:16
This is a step-by-step query / command guide to getting pageviews for a particular wiki project. We will productionize this and make it part of the pageview API.
-- Amir's query to get all pages in WikiProject_Medicine, executed on analytics-store
select page_title
from (select tl_from, rd_from
from templatelinks
left join
redirect on rd_from = tl_from
and rd_title = 'WikiProject_Medicine'
and (rd_interwiki is null or rd_interwiki = '')
and rd_namespace = 10
inner join