@Ironholds
Ironholds / gist:428014d22edb7969ff5c
Created December 16, 2014 20:29
App UUID query
DROP VIEW IF EXISTS app_uuid_view;
CREATE VIEW app_uuid_view AS
SELECT
  CASE WHEN user_agent LIKE '%iPhone%' THEN 'iOS'
       ELSE 'Android' END AS platform,
  -- parse_url() expects a full URL, so a placeholder host and path are prepended to the bare query string
  parse_url(concat('http://bla.org/woo/', uri_query), 'QUERY', 'appInstallID') AS uuid
FROM wmf_raw.webrequest
WHERE uri_query LIKE '%sections=0%'
  AND uri_query LIKE '%action=mobileview%'
  AND uri_query LIKE '%appInstallID%'
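The view above derives two values per request: a platform label from the user agent and the `appInstallID` value from the query string. A minimal Java sketch of the same two steps follows. The class name `AppInstallIdSketch` and both helper methods are hypothetical and are not part of Hive or of the original query; the sketch assumes the raw query string may start with `?` and does not URL-decode values.

```java
import java.util.Optional;

public class AppInstallIdSketch {

    // Hypothetical helper: returns the value of `key` in a raw query string,
    // roughly what parse_url(url, 'QUERY', key) is used for in the view.
    static Optional<String> extractQueryParam(String uriQuery, String key) {
        // Assumption: the query string may or may not carry a leading '?'.
        String q = uriQuery.startsWith("?") ? uriQuery.substring(1) : uriQuery;
        for (String pair : q.split("&")) {
            int eq = pair.indexOf('=');
            if (eq > 0 && pair.substring(0, eq).equals(key)) {
                return Optional.of(pair.substring(eq + 1));
            }
        }
        return Optional.empty();
    }

    // Mirrors the CASE expression: anything that is not an iPhone is labelled Android.
    static String platform(String userAgent) {
        return userAgent.contains("iPhone") ? "iOS" : "Android";
    }

    public static void main(String[] args) {
        String query = "?action=mobileview&sections=0&appInstallID=abc-123";
        System.out.println(extractQueryParam(query, "appInstallID").orElse("(none)"));
        System.out.println(platform("WikipediaApp/1.2.3 (iPhone)"));
    }
}
```

Note that the two-way split means any non-iPhone user agent, including iPads or desktop clients, ends up labelled `Android`, exactly as in the view.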
//Example: this is an existing test
public void testIsPageviewApp() {
    Text uriHost = new Text("en.wikipedia.org");
    Text uriPath = new Text("/w/api.php?action=mobileview&sections=0");
    Text httpStatus = new Text("200");
    Text contentType = new Text("application/json");
    Text userAgent = new Text("WikipediaApp/1.2.3");
    IsPageviewUDF udf = new IsPageviewUDF();
    assertTrue(udf.evaluate(uriHost, uriPath, httpStatus, contentType, userAgent).get());
}
SET hive.exec.compress.output=true;
SET whitelisted_mediawiki_projects = 'commons', 'meta', 'incubator', 'species';
CREATE TABLE ironholds.pageviews_sample_test(qualifier STRING, count_views INT);
INSERT OVERWRITE TABLE ironholds.pageviews_sample_test
SELECT
  CONCAT(sub1.language_and_site, sub1.project_suffix) qualifier,
  COUNT(*) count_views
FROM (
  SELECT
    regexp_extract(uri_host, '^([A-Za-z0-9-]+(\\.(zero|m))?)\\.[a-z]*\\.org$') language_and_site,
Ironholds / hashing_benchmarks.R
Created November 28, 2014 02:18
Benchmarks for my upcoming string anonymisation package.
library(anonymise)
library(digest)
library(microbenchmark)
# Generate 30,000 random 30-character strings (random, so almost certainly unique).
uniques <- character(30000)
for(i in seq_along(uniques)){
uniques[i] <- paste(sample(c(0:9,letters,LETTERS), 30), collapse = "")
}
> library(WMUtils)
Loading required package: jsonlite
Attaching package: ‘jsonlite’
The following object is masked from ‘package:utils’:
View
Loading required package: RMySQL
org.apache.thrift.TApplicationException: Internal error processing FetchResults
    at org.apache.thrift.TApplicationException.read(TApplicationException.java:108)
    at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:71)
    at org.apache.hive.service.cli.thrift.TCLIService$Client.recv_FetchResults(TCLIService.java:505)
    at org.apache.hive.service.cli.thrift.TCLIService$Client.FetchResults(TCLIService.java:492)
    at org.apache.hive.jdbc.HiveQueryResultSet.next(HiveQueryResultSet.java:311)
    at info.urbanek.Rpackage.RJDBC.JDBCResultPull.fetch(JDBCResultPull.java:70)
Error in .jcall(rp, "I", "fetch", stride) :
java.sql.SQLException: Error retrieving next row
   country yoy_pattern most_recent
1:      AD  -23.595506      409000
2:      AE  -13.545211    33170000
3:      AF   -2.479339     1539000
4:      AG  -20.183486      431000
5:      AI  -40.350877       42000
Ironholds / gist:38ba1e27017e544925df
Last active August 29, 2015 14:09
Benchmark ALL the things!
ips <- mysql_query("SELECT DISTINCT(cuc_ip) FROM cu_changes WHERE cuc_ip IS NOT NULL LIMIT 10000;","enwiki")$cuc_ip
Unit: milliseconds
                        expr      min       lq     mean   median      uq      max neval
 { test <- c_geo_city(ips) } 50.61829 57.15351 77.02546 60.63893 65.2781 319.1384   100
Unit: seconds
                      expr      min      lq     mean   median       uq      max neval
 { test <- geo_city(ips) } 1.499703 1.78602 1.935753 1.923845 2.067342 2.564509   100
Ironholds / power_of_c.R
Created November 18, 2014 15:13
The power of C
Unit: milliseconds
                           expr      min       lq     mean   median       uq      max neval
 { test <- c_geo_country(ips) } 56.83126 61.43066 64.75143 63.72016 66.15943 136.7492   100
Unit: seconds
                         expr      min       lq     mean   median       uq      max neval
 { test <- geo_country(ips) } 5.597797 5.814648 6.260509 5.900368 6.118264 10.61174   100
> time = "1250-02-46"
> strptime(time, "%Y-%m-%j")
[1] "1250-02-15"
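The result above is consistent with `%j` being read as the day of the year rather than the day of the month: day 46 is the 31 days of January plus 15, which lands on 15 February. A minimal Java sketch of that calendar arithmetic follows. The class and method names are hypothetical, and the sketch only reproduces the day-of-year conversion with `java.time`; it does not model how R's `strptime` resolves conflicting fields.

```java
import java.time.LocalDate;

public class DayOfYearSketch {

    // Converts a (year, day-of-year) pair to a calendar date.
    static LocalDate fromDayOfYear(int year, int dayOfYear) {
        return LocalDate.ofYearDay(year, dayOfYear);
    }

    public static void main(String[] args) {
        // Day 46 of 1250: 31 days of January + 15 days into February.
        System.out.println(fromDayOfYear(1250, 46)); // prints 1250-02-15
    }
}
```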