Nick Pentreath MLnick

  • Automattic
  • Cape Town, South Africa
  • X @MLnick
@MLnick
MLnick / config
Created May 16, 2014 16:59
Elasticsearch / Spark / Shark setup
Elasticsearch : "1.1.1"
Spark : "0.9.1-hadoop1"
Shark : "0.9.1-hadoop1"
elasticsearch-hadoop-hive : "elasticsearch-hadoop-hive-2.0.0.RC1.jar"
elasticsearch-hadoop : "2.0.0.RC1"
- Spark using ESInputFormat works fine. However, the type returned for the "date" field ("_ts") is Text. I convert it via toString then toLong to get the timestamp, which I can then use as I wish within Spark.
- Shark returns NULL for the timestamp field. There is nothing unusual about the timestamps themselves: I can access them in Spark (as Text), and I can do date math on that field in Elasticsearch queries. The same queries also return NULLs on EC2, so it's not just my machine.
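A minimal sketch (not from the gist) of the Text-to-Long conversion described above, assuming esRDD is the (Text, MapWritable) pair RDD obtained from ESInputFormat:

import org.apache.hadoop.io.{MapWritable, Text}

val timestamps = esRDD.map { case (_, doc: MapWritable) =>
  doc.get(new Text("_ts")).toString.toLong   // Text value -> String -> epoch timestamp
}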
@MLnick
MLnick / fromAvro.scala
Created September 18, 2014 07:24
Spark 1.1: Bringing Hadoop Input/Output Formats to PySpark
def fromAvro(obj: Any, schema: Schema): Any = {
  if (obj == null) {
    return null
  }
  schema.getType match {
    case UNION => unpackUnion(obj, schema)
    case ARRAY => unpackArray(obj, schema)
    case FIXED => unpackFixed(obj, schema)
    case MAP => unpackMap(obj, schema)
    case BYTES => unpackBytes(obj)
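The preview cuts off before the unpack helpers. As one illustration (a sketch, not the gist's actual code), a UNION value can be unpacked by resolving which branch the value belongs to and recursing with that branch's schema:

// Hypothetical helper, not from the gist: Avro's GenericData picks the matching
// union branch, and the value is unpacked against that branch's schema.
import org.apache.avro.generic.GenericData

def unpackUnion(obj: Any, schema: Schema): Any = {
  val branch = schema.getTypes.get(GenericData.get().resolveUnion(schema, obj))
  fromAvro(obj, branch)
}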
@MLnick
MLnick / logstash.txt
Created October 16, 2014 10:07
Logstash crash
Failed to flush outgoing items {:outgoing_count=>1, :exception=>#<SocketError: initialize: java.net.SocketException: Too many open files>, :backtrace=>["org/jruby/ext/socket/RubySocket.java:190:in `initialize'", "org/jruby/RubyIO.java:852:in `new'", "/home/ec2-user/logstash-1.4.2/vendor/bundle/jruby/1.9/gems/ftw-0.0.39/lib/ftw/connection.rb:146:in `connect'", "org/jruby/RubyArray.java:1613:in `each'", "/home/ec2-user/logstash-1.4.2/vendor/bundle/jruby/1.9/gems/ftw-0.0.39/lib/ftw/connection.rb:139:in `connect'", "/home/ec2-user/logstash-1.4.2/vendor/bundle/jruby/1.9/gems/ftw-0.0.39/lib/ftw/agent.rb:406:in `connect'", "org/jruby/RubyProc.java:271:in `call'", "/home/ec2-user/logstash-1.4.2/vendor/bundle/jruby/1.9/gems/ftw-0.0.39/lib/ftw/pool.rb:48:in `fetch'", "/home/ec2-user/logstash-1.4.2/vendor/bundle/jruby/1.9/gems/ftw-0.0.39/lib/ftw/agent.rb:403:in `connect'", "/home/ec2-user/logstash-1.4.2/vendor/bundle/jruby/1.9/gems/ftw-0.0.39/lib/ftw/agent.rb:319:in `execute'", "/home/ec2-user/logstash-1.4.2/vendor/bund
@MLnick
MLnick / call_graphflow_shortcode_in_snippet.php
Last active August 29, 2015 14:16
Custom Graphflow Shortcode using Code Snippets Plugin
add_shortcode('graphflow_shortcode_demo', 'graphflow_shortcode_demo_display');
function graphflow_shortcode_demo_display($attr) {
    ob_start();
    if ( isset( $_REQUEST['cat'] ) ) {
        $cat_id = $_REQUEST['cat'];
        $term = get_term_by( 'id', $cat_id, 'product_cat', 'ARRAY_A' );
        $cat_name = $term['name'];
        echo do_shortcode( '[graphflow_recommendations title="Recommended for you in ' . $cat_name . '" columns=4 per_page=4 product_cat=' . $cat_id . ']' );
    } else {
        echo do_shortcode( '[graphflow_recommendations columns=4 per_page=4]' );
    }
    return ob_get_clean();
}
@MLnick
MLnick / SparkML.scala
Last active December 20, 2015 04:09
Spark Machine Learning API Design Notes
// An Example is an observation with an optional target value and features in the form of a vector of Doubles
case class Example(target: Option[Double] = None, features: Vector[Double])

// Base model API looks something like:
abstract class BaseModel(val modelSettings: Settings)
  extends Serializable
  with Logging {

  def fit(data: RDD[Example])
}

trait Configured {
  val config = ConfigFactory.load().getConfig("spark")
}

object ConfigUtils {
  def asOption[T](t: => T): Option[T] = {
    try {
      Option(t)
    } catch {
      case e: ConfigException.Missing => None
    }
  }
}
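A small usage sketch of the pieces above (the config key "model.maxIterations" is made up for illustration): mixing in Configured gives a handle on the "spark" config block, and asOption turns a missing key into None instead of an exception.

object TrainerSettings extends Configured {
  // None if "spark.model.maxIterations" is absent, Some(n) if present
  val maxIterations: Option[Int] = ConfigUtils.asOption(config.getInt("model.maxIterations"))
}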
@MLnick
MLnick / JavaSparkContext.scala
Last active December 26, 2015 12:19
PySpark read arbitrary Hadoop InputFormats
def newHadoopFileAsText[K, V, F <: NewInputFormat[K, V]](
    path: String,
    inputFormatClazz: String,
    keyClazz: String,
    valueClazz: String,
    delimiter: String): JavaRDD[String] = {
  implicit val kcm = ClassManifest.fromClass(Class.forName(keyClazz))
  implicit val vcm = ClassManifest.fromClass(Class.forName(valueClazz))
  implicit val fcm = ClassManifest.fromClass(Class.forName(inputFormatClazz))
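The method body is cut off above; a hypothetical call might look like the following (the path, delimiter, and choice of TextInputFormat are illustrative only), supplying the Hadoop classes both as type parameters and as class-name strings:

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

val lines: JavaRDD[String] = newHadoopFileAsText[LongWritable, Text, TextInputFormat](
  "hdfs:///data/events/part-*",
  "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
  "org.apache.hadoop.io.LongWritable",
  "org.apache.hadoop.io.Text",
  "\t")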
@MLnick
MLnick / output
Last active December 30, 2015 20:29
PySpark: svmlight Hadoop text files -> RDD[sparse vectors]
13/12/09 23:01:02 INFO spark.SparkContext: Job finished: runJob at PythonRDD.scala:288, took 0.175237 s
Count raw: 683; Count svml: 683
Raw sample: [u'2.000000 1:1000025.000000 2:5.000000 3:1.000000 4:1.000000 5:1.000000 6:2.000000 7:1.000000 8:3.000000 9:1.000000 10:1.000000', u'2.000000 1:1002945.000000 2:5.000000 3:4.000000 4:4.000000 5:5.000000 6:7.000000 7:10.000000 8:3.000000 9:2.000000 10:1.000000', u'2.000000 1:1015425.000000 2:3.000000 3:1.000000 4:1.000000 5:1.000000 6:2.000000 7:2.000000 8:3.000000 9:1.000000 10:1.000000', u'2.000000 1:1016277.000000 2:6.000000 3:8.000000 4:8.000000 5:1.000000 6:3.000000 7:4.000000 8:3.000000 9:7.000000 10:1.000000', u'2.000000 1:1017023.000000 2:4.000000 3:1.000000 4:1.000000 5:3.000000 6:2.000000 7:1.000000 8:3.000000 9:1.000000 10:1.000000', u'4.000000 1:1017122.000000 2:8.000000 3:10.000000 4:10.000000 5:8.000000 6:7.000000 7:10.000000 8:9.000000 9:7.000000 10:1.000000', u'2.000000 1:1018099.000000 2:1.000000 3:1.000000 4:1.000000 5:1.000000 6:2.000000 7
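The raw sample above is svmlight-formatted text: a label followed by index:value pairs. A minimal sketch, not the gist's PySpark code, of parsing one such line into a label and a sparse index-to-value map:

// Parse e.g. "2.000000 1:1000025.000000 2:5.000000 ..." into (label, sparse features)
def parseSvmlightLine(line: String): (Double, Map[Int, Double]) = {
  val tokens = line.trim.split("\\s+")
  val label = tokens.head.toDouble
  val features = tokens.tail.map { t =>
    val Array(index, value) = t.split(":")
    index.toInt -> value.toDouble
  }.toMap
  (label, features)
}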
package example
import org.apache.avro.Schema.Parser
import java.io.{DataInput, DataOutput, File}
import org.apache.avro.generic.GenericData.Record
import org.apache.avro.generic.{GenericRecord, GenericDatumWriter}
import org.apache.avro.file.DataFileWriter
import org.apache.spark.SparkContext
import org.apache.avro.mapreduce.AvroKeyInputFormat
import org.apache.avro.mapred.AvroKey
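The gist body is cut off after the imports; they suggest the standard Avro write pattern, sketched below with a made-up schema, field name, and output path (not the gist's actual code):

// Hypothetical sketch: write one GenericRecord to an Avro data file using the classes imported above.
val schema = new Parser().parse(
  """{"type": "record", "name": "User", "fields": [{"name": "name", "type": "string"}]}""")
val record = new Record(schema)
record.put("name", "alice")

val writer = new DataFileWriter[GenericRecord](new GenericDatumWriter[GenericRecord](schema))
writer.create(schema, new File("/tmp/users.avro"))
writer.append(record)
writer.close()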
@MLnick
MLnick / JavaSparkContext.scala
Last active November 29, 2016 06:28
PySpark / Hadoop InputFormat interop
// Python RDD creation functions //

// SequenceFile converted to Text and then to String
def sequenceFileAsText(path: String) = {
  implicit val kcm = ClassManifest.fromClass(classOf[Text])
  implicit val fcm = ClassManifest.fromClass(classOf[SequenceFileAsTextInputFormat])
  new JavaPairRDD(sc
    .newAPIHadoopFile[Text, Text, SequenceFileAsTextInputFormat](path)
    .map { case (k, v) => (k.toString, v.toString) })
}
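A hypothetical call (the path is made up for illustration) showing the shape of the result: a pair RDD of plain Strings rather than Writables.

val pairs: JavaPairRDD[String, String] = sequenceFileAsText("hdfs:///data/events.seq")
pairs.take(5)   // first five (key, value) string pairs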