
@avibryant
avibryant / gist:11376572
Last active March 13, 2016 01:51
Google <=> Open Source Rosetta Stone
GFS = HDFS
MapReduce = Hadoop
BigTable = HBase
Protocol Buffers = Thrift or Avro (serialization)
Stubby = Thrift or Avro (RPC)
ColumnIO = Parquet
Dremel = Impala
Chubby = Zookeeper
Omega = Mesos
Borg = Aurora
@danielcompton
danielcompton / gist:9719633
Created March 23, 2014 06:48
Data Science Learning
- [ ] Math
- [ ] Linear Algebra
- [ ] Lay - Linear Algebra
- [ ] 18.06 Linear Algebra
http://www.scotthyoung.com/blog/mit-challenge/#more-link1806
- [ ] Burke Lecture Notes
http://www.math.washington.edu/~burke/crs/407/lectures/
- [ ] Coding The Matrix
http://www.amazon.com/dp/0615880991/?tag=coursera-course198-20
- [ ] Quantifying Uncertainty
@pbailis
pbailis / reproducibility.md
Last active February 15, 2016 12:18
Reproducing (un)reproducibility results

edit: see http://cs.brown.edu/~sk/Memos/Examining-Reproducibility/

Not deserving of a full post, but nonetheless worth writing about: @ongardie, @aalevy, and a few others on Twitter were surprised by the number of papers that were flagged as "not reproducible" according to the recent study at http://reproducibility.cs.arizona.edu. Digging deeper, it appeared that 1.) "code builds" is the standard for reproducibility in this study and that 2.) many broken builds were the result of missing dependencies on the researchers' systems.

I tried reproducing a few of the authors' "unreproducible" results. It's hard to vet 600+ research code repositories, but, with a little effort (< ~10 minutes each?), I was able to get all of the following to actually build (on Ubuntu 13.10). This doesn't inspire confidence in the reproducibility of the study results.

Peter pbailis@cs.berkeley.edu

@jdmaturen
jdmaturen / vacuum.py
Last active December 28, 2015 02:09
Vacuum up Crunchbase
"""
Get a bunch of Crunchbase data, but respect the API limits.
Author JD Maturen
Apache 2 License
"""
import logging
from random import random
import sys
@diegopacheco
diegopacheco / maven3-scala-plugin.txt
Created October 16, 2013 14:48
Scala Maven Plugin Config
Inside your parent pom.xml:
<build>
<pluginManagement>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>2.3.2</version>
</plugin>
@miketheman
miketheman / zook_grow.md
Created July 22, 2013 21:36
Adding nodes to a ZooKeeper ensemble

Adding 2 nodes to an existing 3-node ZooKeeper ensemble without losing the Quorum

Since many deployments may start out with 3 nodes and so little is known about how to grow a cluster from 3 members to 5 members without losing the existing Quorum, here is an example of how this might be achieved.

For the purpose of illustration, all 5 nodes in this example run on the same Vagrant host, each with its own configuration (ports and data directories), and without any actual client load.

YMMV. Caveat usufructuarius.
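
As a rough illustration of the setup described above, the following Python sketch generates `server.N` entries for a five-member ensemble sharing one host and computes the majority quorum size. The port numbers and the `quorum_size` / `server_lines` helper names are assumptions made for this example, not part of the original walkthrough; only the `server.<id>=<host>:<peer-port>:<election-port>` line format is ZooKeeper's own.

```python
def quorum_size(members: int) -> int:
    """Smallest strict majority of an ensemble with `members` voters."""
    return members // 2 + 1


def server_lines(members: int, host: str = "localhost",
                 peer_base: int = 2888, election_base: int = 3888) -> list:
    """Build zoo.cfg server entries, one per node on the shared host.

    The base ports are hypothetical; any free, distinct ports would do.
    """
    return [
        f"server.{i}={host}:{peer_base + i}:{election_base + i}"
        for i in range(1, members + 1)
    ]


if __name__ == "__main__":
    for line in server_lines(5):
        print(line)
    # A 5-member ensemble needs 3 votes, which is why the 3 original
    # nodes matter while the 2 new ones are still joining.
    print("quorum for 3 members:", quorum_size(3))
    print("quorum for 5 members:", quorum_size(5))
```

The arithmetic is the point here: a majority of 5 is 3, so the order in which nodes are reconfigured and restarted determines whether a majority stays available throughout.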

Step 1: Have a healthy 3-node ensemble

@pbailis
pbailis / list.md
Last active April 15, 2018 08:54
Quick and dirty (incomplete) list of interesting, mostly recent data warehousing/"big data" papers

A friend asked me for a few pointers to interesting, mostly recent papers on data warehousing and "big data" database systems, with an eye towards real-world deployments. I figured I'd share the list. It's biased and rather incomplete but maybe of interest to someone. While many are obvious choices (I've omitted several, like MapReduce), I think there are a few underappreciated gems.

### Dataflow Engines:

Dryad--general-purpose distributed parallel dataflow engine
http://research.microsoft.com/en-us/projects/dryad/eurosys07.pdf

Spark--in memory dataflow
http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf

@philz
philz / unixtime.sh
Created November 30, 2012 17:59
seconds since epoch to readable
# $unixtime 1354232717
# local Thu 29 Nov 2012 03:45:17 PM PST utc Thu 29 Nov 2012 11:45:17 PM GMT
unixtime ()
{
    # Pass the timestamp in with -v so the awk program can be single-quoted.
    gawk -v t="$1" 'BEGIN { print "local", strftime("%c", t), "utc", strftime("%c", t, 1) }'
}
# Alternately,
# $date -d @1354298146
# Fri Nov 30 09:55:46 PST 2012
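
For comparison, the same conversion can be sketched in Python using only the standard library. The `describe` function name and the output format string are assumptions chosen for this example; the shell function above remains the original approach.

```python
from datetime import datetime, timezone


def describe(unixtime: int) -> str:
    """Render seconds since the epoch in local time and in UTC."""
    utc = datetime.fromtimestamp(unixtime, tz=timezone.utc)
    local = utc.astimezone()  # convert to the machine's local zone
    fmt = "%Y-%m-%d %H:%M:%S %Z"
    return f"local {local.strftime(fmt)} utc {utc.strftime(fmt)}"


if __name__ == "__main__":
    print(describe(1354232717))
```

The local half of the output depends on the time zone of the machine running it; the UTC half does not.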
@Mithrandir0x
Mithrandir0x / gist:3639232
Created September 5, 2012 16:15
Difference between Service, Factory and Provider in AngularJS
// Source: https://groups.google.com/forum/#!topic/angular/hVrkvaHGOfc
// jsFiddle: http://jsfiddle.net/pkozlowski_opensource/PxdSP/14/
// author: Pawel Kozlowski
var myApp = angular.module('myApp', []);
//service style, probably the simplest one
myApp.service('helloWorldFromService', function() {
this.sayHello = function() {
return "Hello, World!"