@mumrah
mumrah / RegexAutomatonTest.java
Last active July 9, 2017 16:13
Iterative regular expression building with Lucene's RegExp and Automaton
// Default (unnamed) package: no package declaration.
import org.apache.lucene.util.automaton.Automaton;
import org.apache.lucene.util.automaton.BasicAutomata;
import org.apache.lucene.util.automaton.RegExp;
public class RegexAutomatonTest {
  public void testSSN() {
    Automaton full = new RegExp("[0-9]{3}-[0-9]{2}-[0-9]{4}").toAutomaton();
@mumrah
mumrah / ivy.xml
Created July 24, 2013 13:25
Temporary workaround for bogus POM in Maven Central for Kafka 0.8 beta
<ivy-module version="2.0">
  <info organisation="demo" module="trihug-kafka-demo"/>
  <configurations>
    <conf name="default"/>
  </configurations>
  <dependencies>
    <dependency org="org.apache.kafka" name="kafka_2.9.2" rev="0.8.0-beta1" conf="default->default"/>
    <exclude org="com.sun.jdmk"/>
    <exclude org="com.sun.jmx"/>
  </dependencies>
@mumrah
mumrah / pom.xml
Last active December 20, 2015 02:49
<?xml version='1.0' encoding='UTF-8'?>
<project xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>
  <groupId>org.apache.kafka</groupId>
  <artifactId>kafka_2.9.2</artifactId>
  <packaging>jar</packaging>
  <description>kafka</description>
  <version>0.8.0-beta1</version>
  <name>kafka</name>
  <organization>
@mumrah
mumrah / script.pig
Last active December 18, 2015 10:29 — forked from rohit-parimi/gist:5768968
/* Pig script to convert user,movie,rating,timestamp data into a user-user graph for running the adsorption algorithm.
The format of the input data is
1::122::5::838985046
*/
/* Load the data into a table. The delimiter may differ between inputs. */
@mumrah
mumrah / AddToMap.java
Created June 11, 2013 02:39
A Pig UDF that allows you to modify a map by adding additional key/value pairs
import java.io.IOException;
import java.util.Map;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataType;
import org.apache.pig.data.Tuple;
/**
* Simple UDF to allow modifying an existing map[] datum
*
* Usage:
@mumrah
mumrah / serialize.py
Created June 8, 2013 01:52
Serializing operations in Python, with blocking behavior on the caller side
from threading import Thread, Event
from Queue import Queue

class Proc(Thread):
    def __init__(self, in_queue):
        Thread.__init__(self)
        self.in_queue = in_queue
        self.die = Event()

    def stop(self):
length 29:           00 00 00 1d
api key:             00 07
api version:         00 00
correlation 42:      00 00 00 2a
clientId "foo":      00 03 66 6f 6f
group "test-group":  00 0a 74 65 73 74 2d 67 72 6f 75 70
array length:        00 00 00 00
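The byte layout above can be reproduced with Python's `struct` module. This is a minimal sketch, not taken from the original gist: it assumes big-endian integers, strings prefixed with a 16-bit length, and a leading 32-bit length that counts every byte after it, as the dump suggests. The helper names `encode_short_string` and `encode_request` are hypothetical.

```python
import binascii
import struct


def encode_short_string(text):
    """Encode a string as a big-endian 16-bit length followed by its bytes."""
    data = text.encode("utf-8")
    return struct.pack(">h", len(data)) + data


def encode_request(api_key, api_version, correlation_id, client_id, group, items=()):
    """Build the length-prefixed message laid out in the listing above.

    Hypothetical helper: the field order is read off the byte dump, and
    `items` is assumed to hold already-encoded array elements (bytes).
    """
    # Fixed-size header fields: two 16-bit integers and one 32-bit integer.
    body = struct.pack(">hhi", api_key, api_version, correlation_id)
    body += encode_short_string(client_id)
    body += encode_short_string(group)
    # Array: 32-bit element count followed by the elements themselves.
    body += struct.pack(">i", len(items)) + b"".join(items)
    # The leading 32-bit length counts every byte that follows it.
    return struct.pack(">i", len(body)) + body


message = encode_request(7, 0, 42, "foo", "test-group")
print(binascii.hexlify(message).decode("ascii"))
```

With the values from the listing (api key 7, version 0, correlation 42, client id "foo", group "test-group", empty array), the body is 2 + 2 + 4 + 5 + 12 + 4 = 29 bytes, which matches the `00 00 00 1d` length prefix.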
@mumrah
mumrah / Indexer.java
Created March 28, 2013 18:21
Java class to index NYTimes data into Solr. The data is available at http://archive.ics.uci.edu/ml/datasets/Bag+of+Words. Beware of hard-coded paths and URLs!
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.Executors;
import java.util.concurrent.ExecutorService;
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.File;
import java.io.IOException;
@mumrah
mumrah / gist:5265560
Created March 28, 2013 18:16
Snippet from schema.xml showing docValue fields
<fields>
  <field name="id" type="string" indexed="true" stored="true" required="true"/>
  <field name="_version_" type="long" indexed="true" stored="true"/>
  <!-- DocValue fields -->
  <field name="threadId_dv" type="string" indexed="false" stored="false" docValues="true" default=""/>
  <field name="docId_dv" type="tint" indexed="false" stored="false" docValues="true" default="0"/>
  <field name="wordId_dv" type="tint" indexed="false" stored="false" docValues="true" default="0"/>
  <field name="word_dv" type="string" indexed="false" stored="false" docValues="true" default=""/>
  <field name="count_dv" type="tint" indexed="false" stored="false" docValues="true" default="0"/>
@mumrah
mumrah / SolrUpdater.java
Last active July 13, 2016 15:13
Example of batching and multi-threading Solr updates
public static class SolrUpdater implements Runnable {
  private final UpdateRequest req = new UpdateRequest();
  private final SolrServer solr;
  private final BlockingQueue<String> strings;
  private final AtomicLong id;
  private final int batchSize = 100;
  private volatile int batchedUpdates = 0;

  public SolrUpdater(SolrServer solr, BlockingQueue<String> strings,
      AtomicLong id) {