prateek goel goelprateek

  • Lentra ai vt ltd
  • pune
@goelprateek
goelprateek / 00-MapSideJoinLargeDatasets
Created December 23, 2017 07:11 — forked from airawat/00-MapSideJoinLargeDatasets
MapsideJoinOfTwoLargeDatasets(Old API) - Joining (inner join) two large datasets on the map side
**********************
**Gist
**********************
This gist details how to inner join two large datasets on the map side, using the join capability
built into MapReduce. Such a join makes sense if both input datasets are too large to distribute
through the DistributedCache. It can be implemented if both input datasets can be joined on the join key
and both are sorted in the same order by that key.
There are two critical pieces to engaging the join behavior:
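The sort-order requirement described above can be illustrated outside Hadoop. The following is a minimal sketch in plain Java, not the gist's code and not the two pieces the gist goes on to list. It assumes two in-memory lists that are already sorted ascending by an integer join key; the names `SortedMergeJoinSketch`, `Row`, and `innerJoin` are hypothetical and chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class SortedMergeJoinSketch {

    /** Hypothetical record type: an integer join key plus a payload. */
    static final class Row {
        final int key;
        final String value;

        Row(int key, String value) {
            this.key = key;
            this.value = value;
        }
    }

    /**
     * Inner-joins two lists that are BOTH sorted ascending by key.
     * Each side is scanned once, front to back; neither side is loaded
     * into a hash table, which is why the shared sort order matters.
     */
    static List<String> innerJoin(List<Row> left, List<Row> right) {
        List<String> out = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < left.size() && j < right.size()) {
            int lk = left.get(i).key;
            int rk = right.get(j).key;
            if (lk < rk) {
                i++; // left key has no partner yet, advance left
            } else if (lk > rk) {
                j++; // right key has no partner yet, advance right
            } else {
                // Find the run of equal keys on each side.
                int iEnd = i;
                while (iEnd < left.size() && left.get(iEnd).key == lk) {
                    iEnd++;
                }
                int jEnd = j;
                while (jEnd < right.size() && right.get(jEnd).key == lk) {
                    jEnd++;
                }
                // Emit every left/right pairing for this key.
                for (int a = i; a < iEnd; a++) {
                    for (int b = j; b < jEnd; b++) {
                        out.add(lk + "\t" + left.get(a).value + "\t" + right.get(b).value);
                    }
                }
                i = iEnd;
                j = jEnd;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Row> left = Arrays.asList(new Row(1, "a"), new Row(2, "b"), new Row(4, "d"));
        List<Row> right = Arrays.asList(
                new Row(2, "x"), new Row(2, "y"), new Row(3, "z"), new Row(4, "w"));
        for (String line : innerJoin(left, right)) {
            System.out.println(line);
        }
    }
}
```

If either input were unsorted, the two cursors could pass a matching key without ever seeing it, so the single forward scan would silently drop rows.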
goelprateek / 00-ReduceSideJoin
Created December 21, 2017 19:06 — forked from airawat/00-ReduceSideJoin
ReduceSideJoin - Sample Java MapReduce program for joining datasets with cardinality of 1..1 and 1..many on the join key
My blog has an introduction to reduce-side joins in Java MapReduce:
http://hadooped.blogspot.com/2013/09/reduce-side-join-options-in-java-map.html
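As a rough illustration of the reduce-side idea, here is a minimal sketch in plain Java that imitates the three phases in memory. It is not the gist's program and does not use the Hadoop API. It assumes each record is a two-element `String[]` of join key and value; the names `ReduceSideJoinSketch`, `Tagged`, `map`, `reduce`, and `join` are hypothetical. The map step tags each value with its source dataset, a `TreeMap` stands in for the shuffle that groups values by key, and the reduce step pairs left and right values for each key, which covers both the 1..1 and the 1..many case.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class ReduceSideJoinSketch {

    /** A value tagged with the dataset it came from ('L' or 'R'). */
    static final class Tagged {
        final char source;
        final String value;

        Tagged(char source, String value) {
            this.source = source;
            this.value = value;
        }
    }

    /** "Map" step: emit (key, tagged value) into the stand-in shuffle map. */
    static void map(char source, List<String[]> records, Map<String, List<Tagged>> shuffle) {
        for (String[] record : records) {
            shuffle.computeIfAbsent(record[0], k -> new ArrayList<>())
                   .add(new Tagged(source, record[1]));
        }
    }

    /** "Reduce" step: for each key, pair every left value with every right value. */
    static List<String> reduce(Map<String, List<Tagged>> shuffle) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, List<Tagged>> entry : shuffle.entrySet()) {
            List<String> left = new ArrayList<>();
            List<String> right = new ArrayList<>();
            for (Tagged t : entry.getValue()) {
                (t.source == 'L' ? left : right).add(t.value);
            }
            // Inner join: a key missing from either side produces no output.
            for (String l : left) {
                for (String r : right) {
                    out.add(entry.getKey() + "\t" + l + "\t" + r);
                }
            }
        }
        return out;
    }

    static List<String> join(List<String[]> left, List<String[]> right) {
        Map<String, List<Tagged>> shuffle = new TreeMap<>(); // groups and sorts by key
        map('L', left, shuffle);
        map('R', right, shuffle);
        return reduce(shuffle);
    }

    public static void main(String[] args) {
        // Hypothetical sample data: departments (one row per key), employees (many per key).
        List<String[]> departments = Arrays.asList(
                new String[] {"d1", "Sales"}, new String[] {"d2", "Ops"});
        List<String[]> employees = Arrays.asList(
                new String[] {"d1", "ann"}, new String[] {"d1", "bob"}, new String[] {"d2", "cy"});
        for (String line : join(departments, employees)) {
            System.out.println(line);
        }
    }
}
```

In a real job the grouping is done by the framework across machines, so every record crosses the network. That cost is the usual trade-off against a map-side join.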
goelprateek / Sharded mongodb environment on localhost
Created October 25, 2017 18:38 — forked from joewagner/Sharded mongodb environment on localhost
Bash shell script that sets up a sharded MongoDB cluster on a single machine. Handy for testing or development when a sharded deployment is required. Note that this will remove everything in the data/config and data/shard directories. If you are using those for something else, you may want to edit this...
# clean everything up
echo "killing mongod and mongos"
killall mongod
killall mongos
echo "removing data files"
rm -rf data/config
rm -rf data/shard*
# For mac make sure rlimits are high enough to open all necessary connections
ulimit -n 2048
goelprateek / SparkJoin
Created August 13, 2017 17:45 — forked from amithn/SparkJoin
Example showing how to join two RDDs using Apache Spark's Java API
package com.voicestreams.spark;
import org.apache.commons.io.FileUtils;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
goelprateek / gist:af0809e358fee501340f2efb9a3fe66c
Created April 23, 2017 06:18 — forked from stuart11n/gist:9628955
rename git branch locally and remotely
git branch -m old_branch new_branch # Rename branch locally
git push origin :old_branch # Delete the old branch on the remote
git push --set-upstream origin new_branch # Push the new branch, set local branch to track the new remote