Paco Nathan ceteri

@ccsevers
ccsevers / AvroReadExample.java
Created October 29, 2012 18:27
cascading.avro wordcount example
package cascading.avro.examples;
import java.util.Properties;
import cascading.flow.Flow;
import cascading.flow.FlowDef;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexFilter;
import cascading.operation.regex.RegexSplitGenerator;
@ceteri
ceteri / Example3.scala
Last active December 10, 2015 02:58
Cascading for the Impatient, Part 8 -- Scalding examples
import com.twitter.scalding._
class Example3(args : Args) extends Job(args) {
Tsv(args("doc"), ('doc_id, 'text), skipHeader = true)
.read
.flatMap('text -> 'token) { text : String => text.split("[ \\[\\]\\(\\),.]") }
.mapTo('token -> 'token) { token : String => scrub(token) }
.filter('token) { token : String => token.length > 0 }
.groupBy('token) { _.size('count) }
.write(Tsv(args("wc"), writeHeader = true))
@ceteri
ceteri / Cascalog._tutorial
Last active December 10, 2015 09:58
Cascading for the Impatient, Part 9
bash-3.2$ lein repl
Listening for transport dt_socket at address: 51539
nREPL server started on port 51542
REPL-y 0.1.0-beta10
Clojure 1.4.0
Exit: Control+D or (exit) or (quit)
Commands: (user/help)
Docs: (doc function-name-here)
(find-doc "part-of-name-here")
Source: (source function-name-here)
@ceteri
ceteri / Pattern test.log
Last active December 11, 2015 10:39
Pattern machine learning library for Cascading
bash-3.2$ pwd
/Users/ceteri/src/concur/pattern
bash-3.2$ java -version
java version "1.6.0_43"
Java(TM) SE Runtime Environment (build 1.6.0_43-b01-447-11M4203)
Java HotSpot(TM) 64-Bit Server VM (build 20.14-b01-447, mixed mode)
bash-3.2$ hadoop version
Warning: $HADOOP_HOME is deprecated.
Hadoop 1.0.3
@ceteri
ceteri / Cascalog.log
Last active December 11, 2015 18:28
City of Palo Alto Open Data app in Cascalog
bash-3.2$ lein version
Leiningen 2.0.0-preview10 on Java 1.6.0_43 Java HotSpot(TM) 64-Bit Server VM
bash-3.2$ hadoop version
Warning: $HADOOP_HOME is deprecated.
Hadoop 1.0.3
Subversion https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.0 -r 1335192
Compiled by hortonfo on Tue May 8 20:31:25 UTC 2012
From source with checksum e6b0c1e23dcf76907c5fecb4b832f3be
bash-3.2$ lein clean
@ceteri
ceteri / cascalog_build.log
Last active December 14, 2015 22:29
Cascalog testing with Cascading 2.2-wip
bash-3.2$ lein do sub install, deps, compile, repl
Could not find artifact lein-newnew:lein-newnew:pom:0.3.5 in central (http://repo1.maven.org/maven2)
Retrieving lein-newnew/lein-newnew/0.3.5/lein-newnew-0.3.5.pom (3k)
from https://clojars.org/repo/
Could not find artifact stencil:stencil:pom:0.3.0 in central (http://repo1.maven.org/maven2)
Retrieving stencil/stencil/0.3.0/stencil-0.3.0.pom (3k)
from https://clojars.org/repo/
Retrieving org/clojure/clojure/1.3.0/clojure-1.3.0.pom (5k)
from http://repo1.maven.org/maven2/
Retrieving org/sonatype/oss/oss-parent/5/oss-parent-5.pom (4k)
@drewlanenga
drewlanenga / lm.pmml.xml
Created January 7, 2014 23:48
Exploring support for [transformations in PMML](http://www.dmg.org/v4-1/Transformations.html) with Pattern. (Environment notes: Running Vagrant with Cascading SDK 2.2 -- https://github.com/Cascading/vagrant-cascading-hadoop-cluster)
<?xml version="1.0"?>
<PMML version="4.1" xmlns="http://www.dmg.org/PMML-4_1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.dmg.org/PMML-4_1 http://www.dmg.org/v4-1/pmml-4-1.xsd">
<Header copyright="Copyright (c) 2014 lanenga" description="Linear Regression Model">
<Extension name="user" value="lanenga" extender="Rattle/PMML"/>
<Application name="Rattle/PMML" version="1.4"/>
<Timestamp>2014-01-07 15:33:34</Timestamp>
</Header>
<DataDictionary numberOfFields="4">
<DataField name="sepal_width" optype="continuous" dataType="double"/>
<DataField name="sepal_length" optype="continuous" dataType="double"/>
@johnynek
johnynek / gist:8961994
Last active August 29, 2015 13:56
Some Questions with Sketch Monoids

Unifying Sketch Monoids

As I discussed in Algebra for Analytics, many sketch monoids, such as Bloom filters, HyperLogLog, and Count-min sketch, can be described as a hashing (projection) of items into a sparse space, followed by the use of two different commutative monoids to read and write respectively. The read monoids always have the property that (a + b) <= a, b, and the write monoids have the property that (a + b) >= a, b.
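
The read/write pairing described above can be sketched roughly as follows. This is a minimal illustration, not code from Algebra for Analytics or from any library: the class names `Bloom` and `CountMin`, the helper `positions`, and the salted SHA-256 hashing are all assumptions made for the example. It takes the Bloom filter to write with max (bitwise OR) and read with min (bitwise AND), and the Count-min sketch to write with addition over non-negative counts and read with min.

```python
import hashlib


def positions(item, k, m):
    """Return k bucket indices in [0, m) for item (salted SHA-256, an illustrative choice)."""
    return [
        int.from_bytes(hashlib.sha256(f"{i}:{item}".encode()).digest()[:8], "big") % m
        for i in range(k)
    ]


class Bloom:
    """k hashes onto one shared space of m bits. Write monoid: max (OR). Read monoid: min (AND)."""

    def __init__(self, k, m, bits=None):
        self.k, self.m = k, m
        self.bits = list(bits) if bits is not None else [0] * m

    def add(self, item):
        for p in positions(item, self.k, self.m):
            self.bits[p] = max(self.bits[p], 1)  # write: (a + b) >= a, b

    def merge(self, other):
        # Merging two filters applies the write monoid cell by cell.
        return Bloom(self.k, self.m, [max(a, b) for a, b in zip(self.bits, other.bits)])

    def contains(self, item):
        # read: combine the k cells with min, so (a + b) <= a, b
        return min(self.bits[p] for p in positions(item, self.k, self.m)) == 1


class CountMin:
    """k hashes onto k separate rows of width m. Write monoid: +. Read monoid: min."""

    def __init__(self, k, m, rows=None):
        self.k, self.m = k, m
        self.rows = [list(r) for r in rows] if rows is not None else [[0] * m for _ in range(k)]

    def add(self, item, count=1):
        for row, p in enumerate(positions(item, self.k, self.m)):
            self.rows[row][p] += count  # write: non-negative counts only grow

    def merge(self, other):
        summed = [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(self.rows, other.rows)]
        return CountMin(self.k, self.m, summed)

    def estimate(self, item):
        # read: the smallest of the k counters is the least-inflated estimate
        return min(self.rows[row][p] for row, p in enumerate(positions(item, self.k, self.m)))


left, right = Bloom(k=3, m=64), Bloom(k=3, m=64)
left.add("cascading")
right.add("scalding")
both = left.merge(right)
print(both.contains("cascading"), both.contains("scalding"))  # True True
```

In this sketch the only structural difference between the two classes is the layout: `Bloom` sends all k hashes into the same array, while `CountMin` gives each hash its own row.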

## Some questions:

  1. Note how similar CMS and Bloom filters are. The difference: a Bloom filter hashes k times onto the same space, while CMS hashes k times onto k orthogonal subspaces. Why the difference? Imagine a fixed-space Bloom filter that hashes onto k orthogonal spaces, or an overlapping CMS that hashes onto a space of length k * m. How do the error asymptotics change?
  2. CMS has many query modes (dot product, etc.). Can those generalize to other sketches (HLL, Bloom)?
  3. What other sketch or non-sketch algorithms can be expressed in this dual mo
@fperez
fperez / ProgrammaticNotebook.ipynb
Last active May 2, 2024 19:14
Creating an IPython Notebook programmatically
@tlockney
tlockney / Vagrantfile
Last active August 29, 2015 13:57
This setup allows for quick hacking with an sbt console on an EC2 instance, which is very useful for trying out the AWS APIs. As an example, I wanted to make sure I understood how to get the various bits of metadata that are visible only on EC2. Create the following files and run setup.sh to run everything.
Vagrant.configure("2") do |config|
config.vm.box = "dummy"
config.vm.provider :aws do |aws, override|
aws.access_key_id = "..."
aws.secret_access_key = "..."
# you'll need to create the EC2 keypair used here -- I called it vagrant for easy tracking
aws.keypair_name = "vagrant"
# you'll want to use a group that has at least SSH open