@ebernhardson
ebernhardson / README
Last active November 24, 2020 18:01
T263781
Notebook and output data for https://phabricator.wikimedia.org/T263781
@ebernhardson
ebernhardson / Poorly_Performing_Queries.ipynb
Last active April 29, 2019 17:54
Poorly Performing Queries notebook
public Dataset<Row> buildPairsForM0Prep(Dataset<Row> df, Dataset<Row> dfOld, GlentParams params) {
    dfOld = dfOld
        .where(col("part").equalTo(params.glentDfM0PrepPartOld)) // limit to previous portion of M0Prep dataframe
        .drop(col("part"));
    Column oldTsCondition = null;
    if (dfOld.isEmpty()) {
        oldTsCondition = lit(true);
    } else {
        Row[] oldTsRows = dfOld.agg(max("q1_ts").alias("tsmax")).collect();
@ebernhardson
ebernhardson / Tensorflow_on_SWAP.ipynb
Last active February 7, 2019 06:13
Tensorflow in SWAP
@ebernhardson
ebernhardson / mlr.puml
Created January 15, 2019 19:53
MLR Pipeline Sequence Diagram
@startuml
== click log generation ==
oozie -> oozie: schedule label generation
note left
arrows signify initiator
of communication, not
data flow
end note
@ebernhardson
ebernhardson / Dockerfile
Last active August 1, 2018 06:41
LightGBM + HDFS Demo
FROM docker-registry.wikimedia.org/wikimedia-jessie
ENTRYPOINT ["/bin/bash"]
COPY cloudera.list /etc/apt/sources.list.d/cloudera.list
COPY cloudera.pref /etc/apt/preferences.d/cloudera.pref
COPY archive.key /root/archive.key
ENV HADOOP_CONF=/etc/hadoop/conf
import argparse
import logging
import os
import re
from tempfile import TemporaryFile
import boto3
import botocore
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.ml.feature.LabeledPoint
import org.apache.spark.rdd.RDD
import scala.collection.mutable.ArrayBuffer
import scala.util.Random
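// Build a dense vector of `size` uniform random doubles in [0, 1).
def randomVec(r: Random, size: Int): Vector = {
  // `until` excludes the upper bound, so exactly `size` values are generated
  val feats = for (_ <- 0 until size) yield r.nextDouble
  Vectors.dense(feats.toArray)
}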