Jeremy Lewi (jlewi)

jlewi / avro-ispark
Created May 24, 2014 02:16
Snippet showing how I set up a Spark context with my custom Kryo registrator for Avro generics.
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
val conf = new SparkConf().setMaster("spark://spark-master:7077").setAppName("myapp")
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
conf.set("spark.kryo.registrator", "contrail.AvroGenericRegistrator")

// In the spark-shell, stop the existing context before creating a new one
// that picks up the Kryo settings.
sc.stop()
val sc = new SparkContext(conf)

val contigFile = contrail.AvroHelper.readAvro(sc, "hdfs://hadoop-nn//tmp/contrail.stages.CompressAndCorrect/part-*.avro")
jlewi / AvroSpecific
Last active August 29, 2015 14:01
Reading a FastQ file in Hadoop
import contrail.AvroHelper
import contrail.spark.SerializableGraphNodeData
val contigFile = contrail.AvroHelper.readAvro(sc,"hdfs://hadoop-nn//tmp/contrail.stages.CompressAndCorrect/part-*.avro")
val datums = contigFile.map(r => new SerializableGraphNodeData(r._1.datum))
val keyedById = datums.map(r => (r.node_id.toString, r))
keyedById.cache()
keyedById.lookup("4fw3YAOX8-lvVjIWgNPGYR3gc5CvsxI")
jlewi / CharSequenceCoder.java
Created January 19, 2015 18:04
A Dataflow coder for CharSequence objects.
/*
* Copyright (C) 2014 Google Inc.
*
* Licensed under the Apache License, Version 2.0 (the "License"); you may not
* use this file except in compliance with the License. You may obtain a copy of
* the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
* WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
* License for the specific language governing permissions and limitations under
* the License.
*/
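
The preview cuts off inside the license header, before any of the coder itself. A minimal sketch of what such a coder can look like in the Dataflow SDK, delegating to StringUtf8Coder; the class body below is my assumption, not the gist's actual code.

package snippets;

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

import com.google.cloud.dataflow.sdk.coders.AtomicCoder;
import com.google.cloud.dataflow.sdk.coders.CoderException;
import com.google.cloud.dataflow.sdk.coders.StringUtf8Coder;

/** Encodes CharSequence values by round-tripping them through their String form. */
public class CharSequenceCoder extends AtomicCoder<CharSequence> {
  private static final CharSequenceCoder INSTANCE = new CharSequenceCoder();
  private static final StringUtf8Coder STRING_CODER = StringUtf8Coder.of();

  public static CharSequenceCoder of() {
    return INSTANCE;
  }

  private CharSequenceCoder() {}

  @Override
  public void encode(CharSequence value, OutputStream outStream, Context context)
      throws CoderException, IOException {
    // Any CharSequence can be written via its String form.
    STRING_CODER.encode(value.toString(), outStream, context);
  }

  @Override
  public CharSequence decode(InputStream inStream, Context context)
      throws CoderException, IOException {
    // String implements CharSequence, so the decoded value is returned directly.
    return STRING_CODER.decode(inStream, context);
  }
}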
jlewi / AvroMapperDoFn.java
Created January 19, 2015 19:20
A DoFn for wrapping an AvroMapper
public static class AvroMapperDoFn<
    MAPPER extends AvroMapper<I, Pair<OUT_KEY, OUT_VALUE>>, I, OUT_KEY, OUT_VALUE>
    extends DoFn<I, KV<OUT_KEY, OUT_VALUE>> {
  private Class avroMapperClass;
  private byte[] jobConfBytes;

  // Rebuilt on the worker at runtime; not serialized with the DoFn.
  transient MAPPER mapper;
  transient DataflowAvroCollector collector;
  transient Reporter reporter;
  transient JobConf jobConf;
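
The preview stops at the fields. A sketch of how the wiring can continue, assuming the mapper is instantiated in startBundle and invoked per element; the DataflowAvroCollector constructor taking a ProcessContext is my assumption, not the gist's code.

  @SuppressWarnings("unchecked")
  @Override
  public void startBundle(Context c) throws Exception {
    // Rebuild the transient, non-serializable pieces on each worker.
    mapper = (MAPPER) avroMapperClass.newInstance();
    jobConf = new JobConf();
    // Restore the JobConf from its serialized bytes (JobConf is a Writable).
    jobConf.readFields(new DataInputStream(new ByteArrayInputStream(jobConfBytes)));
    mapper.configure(jobConf);
    reporter = Reporter.NULL;
  }

  @Override
  public void processElement(ProcessContext c) throws Exception {
    // Assumes DataflowAvroCollector wraps the ProcessContext and forwards each
    // Avro Pair emitted by the mapper as a KV output.
    collector = new DataflowAvroCollector(c);
    mapper.map(c.element(), collector, reporter);
  }
}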
jlewi / gist:480fbb41b0ce77a4079a
Created March 17, 2015 22:07
Increasing the logging level.
import java.util.logging.ConsoleHandler;
import java.util.logging.Level;
import java.util.logging.Logger;

public class MyDataflowProgram {
  public static void main(String[] args) {
    // Route com.google.api log records to the console at the most verbose level.
    ConsoleHandler consoleHandler = new ConsoleHandler();
    consoleHandler.setLevel(Level.ALL);
    Logger googleApiLogger = Logger.getLogger("com.google.api");
    googleApiLogger.setLevel(Level.ALL);
    googleApiLogger.setUseParentHandlers(false);
    googleApiLogger.addHandler(consoleHandler);
    // ... build and run the Dataflow pipeline as usual.
  }
}
jlewi / gist:917a82ab8556d3bc4fb7
Created March 26, 2015 04:48
SerializableCoder - setCoder
package snippets;
import java.util.ArrayList;
import java.util.List;
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.coders.DefaultCoder;
import com.google.cloud.dataflow.sdk.coders.SerializableCoder;
import com.google.cloud.dataflow.sdk.options.PipelineOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
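
The preview stops at the imports. A minimal sketch of the pattern the title describes: calling setCoder with a SerializableCoder so the SDK knows how to encode a custom type. MyRecord and the Create import are my additions, not part of the gist's preview.

import com.google.cloud.dataflow.sdk.transforms.Create;

public class SetCoderSnippet {
  // Hypothetical Serializable value type.
  public static class MyRecord implements java.io.Serializable {
    public String name;
  }

  public static void main(String[] args) {
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    Pipeline p = Pipeline.create(options);

    List<MyRecord> records = new ArrayList<>();
    p.apply(Create.of(records))
        // Explicitly pick Java serialization for MyRecord; without this (or a
        // @DefaultCoder annotation) the SDK cannot infer a coder for the type.
        .setCoder(SerializableCoder.of(MyRecord.class));

    p.run();
  }
}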
jlewi / convert_row_to_json.py
Last active June 10, 2016 17:14
Example of converting a row of data to a TensorFlow Example proto in JSON form
import json
from google.protobuf import json_format
from tensorflow.core.example import example_pb2
def convert_row_to_json(row):
  e = example_pb2.Example()
  e.features.feature['id'].bytes_list.value.append(str(row[id_column_index]))
  e.features.feature['target'].int64_list.value.append(row[target_column_index])
  # Add features to predict on.
  # Serialize the populated Example proto to a JSON string.
  return json_format.MessageToJson(e)
jlewi / profile_row_conversion.py
Created June 11, 2016 00:32
profile_row_conversion
import argparse
import logging
import json
import time
from google.protobuf import json_format
from tensorflow.core.example import example_pb2
import google.cloud.dataflow as df
def convert_row_to_json(row):
jlewi / example_optimized.py
Created June 11, 2016 00:33
Cache the example proto
import argparse
import logging
import json
import time
from google.protobuf import json_format
from tensorflow.core.example import example_pb2
import google.cloud.dataflow as df
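# Module-level Example proto, created once and reused for every row instead of
# constructing a new message per call (the optimization this gist demonstrates).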
e = example_pb2.Example()