@nathanmarz
Created August 6, 2010 20:20
Random number in Cascalog
import cascading.flow.FlowProcess;
import cascading.flow.hadoop.HadoopFlowProcess;
import cascading.operation.FunctionCall;
import cascading.operation.OperationCall;
import cascading.tuple.Tuple;
import java.util.Random;
import cascalog.CascalogFunction;

// Emits one random int per input tuple, reproducibly across task retries.
public class RandInt extends CascalogFunction {
    long _seed;    // base seed, fixed when the instance is constructed
    Random _rand;  // per-task generator, created in prepare()
    Integer _max;  // exclusive upper bound, or null for the full int range

    public RandInt() {
        // Chosen once, before the job starts; travels with the serialized instance.
        _seed = new Random().nextLong();
        _max = null;
    }

    public RandInt(int max) {
        this();
        _max = max;
    }

    public void prepare(FlowProcess flowProcess, OperationCall operationCall) {
        // Offset the base seed by the task number so each task gets its own seed.
        _rand = new Random(_seed + ((HadoopFlowProcess) flowProcess).getCurrentTaskNum());
    }

    public void operate(FlowProcess flow_process, FunctionCall fn_call) {
        int rand;
        if (_max == null) rand = _rand.nextInt();
        else rand = _rand.nextInt(_max);
        fn_call.getOutputCollector().add(new Tuple(rand));
    }
}
@mlimotte

mlimotte commented Sep 9, 2010

I think this is related to your Randomness post at http://blog.rapleaf.com/dev/2009/08/14/using-random-numbers-in-mapreduce-is-dangerous/

The post makes sense, as you suggest distributing the seed to the tasks through the JobConf. But I don't see that happening in this gist. It seems to me that each instance of RandInt (one per task or TaskTracker?) will get its own seed, so if a TaskTracker is lost, a re-execution would get a new seed.

Or is there some magic I'm missing here?

@nathanmarz
Author

The seed is chosen before the job starts (in the constructor). Cascading then serializes the RandInt instance for use in the tasks. Each task calls "prepare" before it starts processing records. If a task fails, the retry reuses the same serialized RandInt instance, and therefore the same seed.
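The mechanism described above can be sketched without Cascading or Hadoop. This is a minimal illustration, assuming plain Java serialization stands in for however the instance reaches the tasks; `SeedDemo`, `SeededOp`, and `firstValue` are hypothetical names, and the task number is passed in by hand rather than read from a `FlowProcess`.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Random;

public class SeedDemo {
    // Hypothetical stand-in for RandInt, with no Cascading dependency.
    public static class SeededOp implements Serializable {
        private static final long serialVersionUID = 1L;
        public long seed;       // chosen once, in the constructor
        transient Random rand;  // rebuilt per task in prepare()

        public SeededOp() {
            seed = new Random().nextLong();
        }

        public void prepare(int taskNum) {
            rand = new Random(seed + taskNum);
        }

        public int next() {
            return rand.nextInt();
        }
    }

    // Serialize the instance once, as would happen before the job starts.
    public static byte[] serialize(SeededOp op) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(op);
        }
        return bytes.toByteArray();
    }

    // Deserialization restores the stored seed; the constructor is not re-run.
    public static SeededOp deserialize(byte[] data)
            throws IOException, ClassNotFoundException {
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(data))) {
            return (SeededOp) in.readObject();
        }
    }

    // Simulates one task attempt: deserialize, prepare, draw the first value.
    public static int firstValue(byte[] shipped, int taskNum)
            throws IOException, ClassNotFoundException {
        SeededOp op = deserialize(shipped);
        op.prepare(taskNum);
        return op.next();
    }

    public static void main(String[] args) throws Exception {
        byte[] shipped = serialize(new SeededOp());
        int attempt1 = firstValue(shipped, 3);  // original attempt of task 3
        int attempt2 = firstValue(shipped, 3);  // retry of task 3
        System.out.println(attempt1 == attempt2);  // prints "true"
    }
}
```

Both attempts start from the same serialized bytes, so they recover the same seed and, with the same task number, replay the same sequence. Constructing a fresh `SeededOp` inside each attempt would instead pick a new seed every time, which is the failure mode the question above was worried about.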

@mlimotte

Oh... cool. I didn't realize Cascading serialized the instance.
