Aaron Halfaker halfak

mysql:research@analytics-store.eqiad.wmnet [staging]> SELECT -6.4 BETWEEN -5.1 AND -3.1;
+----------------------------+
| -6.4 BETWEEN -5.1 AND -3.1 |
+----------------------------+
| 0 |
+----------------------------+
1 row in set (0.00 sec)
mysql:research@analytics-store.eqiad.wmnet [staging]> SELECT -3.4 BETWEEN -5.1 AND -3.1;
+----------------------------+
CREATE TABLE halfak.searches
SELECT dt, ip, user_agent, uri_host, uri_query
FROM webrequest
WHERE
uri_query LIKE "%title=Special%3ASearch%" AND
uri_query LIKE "%search=%" AND
uri_path = "/w/index.php" AND
year = 2014;
import json
import sys
"""
HEADERS = [
('index', 'index'),
('product/productId', 'product_id'),
('product/productTitle', 'product_title'),
('product/price', 'price'),
('review/userId', 'review_user_id'),
mysql:research@analytics-store.eqiad.wmnet [enwiki]> SELECT COUNT(*) FROM revision WHERE rev_timestamp BETWEEN "20140101" AND "20140102";
+----------+
| COUNT(*) |
+----------+
| 138753 |
+----------+
1 row in set (0.47 sec)
mysql:research@analytics-store.eqiad.wmnet [enwiki]> SELECT COUNT(*) FROM revision WHERE rev_timestamp BETWEEN "2014-01-01" AND "2014-01-02";
+----------+
[halfak@stat1003: ~/projects/productivity]
$ rsync -rv simplewiki_20141025.fields_and_diffs.head.tsv stat1002.wikimedia.org::a/halfak/diffengine/
rsync: getaddrinfo: stat1002.wikimedia.org 873: Name or service not known
rsync error: error in socket IO (code 10) at clientserver.c(128) [sender=3.1.0]
[halfak@stat1003: ~/projects/productivity]
$ rsync -rv simplewiki_20141025.fields_and_diffs.head.tsv stat1002.eqiad.wmnet::a/halfak/diffengine/
@ERROR: access denied to a from stat1003.wikimedia.org (208.80.154.82)
rsync error: error starting client-server protocol (code 5) at main.c(1653) [sender=3.1.0]
gini <- function(x, unbiased = TRUE, na.rm = FALSE){
if (!is.numeric(x)){
warning("'x' is not numeric; returning NA")
return(NA)
}
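# Compute the NA mask up front so it exists even when na.rm = TRUE
# (the short-circuiting && would otherwise skip the assignment).
na.ind <- is.na(x)
if (!na.rm && any(na.ind))
stop("'x' contains NAs")
if (na.rm)
x <- x[!na.ind]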
n <- length(x)
$ ssh wikimedia.altiscale
Last login: Wed Jan 14 17:02:27 2015 from 10.252.17.5
_ _ _ _
| | | | (_) | |
__ _ | |_| |_ _ ___ ___ __ _ | | ___
/ _` || |_ _| |/ __| / __| / _` || | / _ \
| (_| || | | | | |\__ \| (__ | (_| || || __/
\__,_||_| |_| |_||___/ \___| \__,_||_| \___|
[halfak@desktop-wikimedia ~]$ df -h
>>> import revscores
>>> dir(revscores)
['__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__']
>>> from revscores import languages
>>> dir(languages)
['Language', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__', 'english', 'language', 'portuguese']
Notice that the first dir() call doesn't list "languages". This is because the languages subpackage is not imported when revscores itself is imported.
But when we run dir() on languages, we can see "english", "portuguese" and "language". This is because these modules are imported automatically when languages is imported.
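The same behaviour can be reproduced with a throwaway package, which may make the mechanism easier to see. This is only a sketch: the names demo_pkg, languages and english are hypothetical stand-ins written to a temporary directory, not the actual revscores layout. It assumes that the top-level __init__.py imports nothing, while the subpackage's __init__.py explicitly imports one of its modules.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Build a throwaway package on disk. All names here (demo_pkg, languages,
# english) are hypothetical stand-ins, not the real revscores layout.
root = Path(tempfile.mkdtemp())
pkg = root / "demo_pkg"
sub = pkg / "languages"
sub.mkdir(parents=True)

# The top-level __init__.py is empty, so importing demo_pkg loads nothing else.
(pkg / "__init__.py").write_text("")
# The subpackage's __init__.py explicitly imports one of its own modules.
(sub / "__init__.py").write_text("from . import english\n")
(sub / "english.py").write_text("NAME = 'english'\n")

sys.path.insert(0, str(root))
importlib.invalidate_caches()

import demo_pkg
# The subpackage has not been imported yet, so it is not an attribute.
before = "languages" in dir(demo_pkg)

from demo_pkg import languages
# Importing the subpackage binds it as an attribute of its parent package.
after = "languages" in dir(demo_pkg)
# english shows up because languages/__init__.py imported it.
has_english = "english" in dir(languages)

print(before, after, has_english)  # prints: False True True
```

If a package should expose a subpackage without a separate import statement, its own __init__.py has to import that subpackage explicitly.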
(3.4) [halfak@stat1002: ~]
$ scp foo wikimedia.altiscale:
foo 100% 39 0.0KB/s 00:00
(3.4) [halfak@stat1002: ~]
$ ssh -N -L 14000:wikimedia.z42.altiscale.com:14000 wikimedia.altiscale &
[1] 13510
(3.4) [halfak@stat1002: ~]
$ hdfs dfs -ls webhdfs://localhost:14000/user/halfak/streaming/enwiki-20141106/json-bz2/
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
[20:46:05] <harej> halfak: as a gentle reminder: https://meta.wikimedia.org/wiki/Research:WikiProjects_and_Subject_Area_Activity_(English_Wikipedia)
[20:46:42] <halfak> Harej, did you want me to look at the methods section?
[20:46:50] <harej> I think that was what it was
[20:46:58] <halfak> What's a longitudinal factor?
[20:46:59] <harej> I am also interested in information about your quality heuristics!
[20:47:24] <halfak> longitudinal factor == https://en.wikipedia.org/wiki/Censoring_(statistics)
[20:47:52] <harej> the longitudinal factors that affect wikiprojects mostly have to do with how some wikiprojects were active years ago even if they are not active now; differing levels of activity throughout a project's life. To keep everything even from a time scale perspective I am just doing things from July 1 to December 31
[20:48:39] <halfak> I'm not sure this will help. Many WikiProjects will be in different lifecycle stages between July 1 and Dec. 31
[20:49:02] <halfak> Might we try to control for