snoremac / symlink-java
Created May 1, 2015 03:18
Manage JDK symlinks on OS X
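The gist body is not shown in this listing; a minimal sketch of the idea, where the directory layout under `JVM_DIR`, the `current` link name, and the function name are all assumptions rather than the gist's actual contents:

```shell
#!/bin/sh
# Sketch only: keep a stable "current" symlink pointing at a chosen JDK
# install. Paths and names here are assumptions, not the gist's code.
link_jdk() {
  jvm_dir="$1"
  target="$2"
  [ -d "$jvm_dir/$target" ] || { echo "no such JDK: $target" >&2; return 1; }
  # -n replaces the symlink itself instead of descending into it
  ln -sfn "$jvm_dir/$target" "$jvm_dir/current"
}

# Demo against a throwaway directory rather than the real
# /Library/Java/JavaVirtualMachines
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/jdk1.8.0_45.jdk"
link_jdk "$demo_dir" jdk1.8.0_45.jdk
readlink "$demo_dir/current"
```

Re-running `link_jdk` with a different target atomically repoints `current`, so tools configured against the `current` path pick up the new JDK without edits.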
snoremac / emr-examples-list.sh
Created January 5, 2014 23:35
List all EMR clusters from the command line.
elastic-mapreduce --list
snoremac / emr-examples-hdfs-java.sh
Created January 5, 2014 22:13
Run a Java-based word count from data already on a running cluster's HDFS filesystem.
# Run the word count from local HDFS.
elastic-mapreduce -j j-2XP9O9IRLHHBU \
--jar s3n://emr-examples.dius.com.au/java/emr-examples.jar \
--main-class au.com.dius.emr.CommonCrawlTool \
--arg -D --arg target.words=hello,world \
--arg -D --arg base.uri=hdfs:///common-crawl \
--arg -D --arg max.segments=1
snoremac / emr-examples-distcp.sh
Created January 5, 2014 22:10
Bulk copy a single Common Crawl segment from S3 to an already running cluster using S3DistCp.
# Bulk copy a crawl segment from S3 to the running cluster.
elastic-mapreduce -j j-2XP9O9IRLHHBU \
--jar /home/hadoop/lib/emr-s3distcp-1.0.jar \
--arg --src --arg s3://aws-publicdatasets/common-crawl/parse-output/segment/1346823845675 \
--arg --srcPattern --arg '.*textData.*' \
--arg --dest --arg hdfs:///common-crawl/parse-output/segment/1346823845675
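S3DistCp applies `--srcPattern` as a regex against each source key; which keys a pattern would select can be sanity-checked locally before launching a copy. The sample keys below are stand-ins, and the leading and trailing `.*` make the Java full-match behavior equivalent to this substring `grep`:

```shell
#!/bin/sh
# Preview which keys the srcPattern '.*textData.*' would select.
# Sample keys are stand-ins for real Common Crawl segment objects.
printf '%s\n' \
  'segment/1346823845675/textData-00000' \
  'segment/1346823845675/metadata-00000' \
  'segment/1346823845675/textData-00001' \
| grep -E '.*textData.*'
```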
snoremac / emr-examples-cli-prototype-full.sh
Created January 5, 2014 22:07
An example of prototyping Hadoop streaming using command line utilities.
./src/ruby/common_crawl_input.rb 2>/dev/null \
| ./src/ruby/common_crawl_mapper.rb hello,world \
| sort -t$'\t' -k1 \
| ./src/ruby/common_crawl_reducer.rb
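The ruby scripts themselves are not part of this listing; the same prototype can be sketched with standard utilities, using the tab-separated key/value contract that Hadoop streaming expects between mapper, sort, and reducer. The sample input and target words are stand-ins:

```shell
#!/bin/sh
# Sketch of the Hadoop streaming contract with awk standing in for the
# ruby mapper and reducer; input text and target words are stand-ins.

# Mapper: emit "word<TAB>1" for each occurrence of a target word.
mapper() {
  tr 'A-Z ' 'a-z\n' | grep -E '^(hello|world)$' | awk '{ print $0 "\t1" }'
}

# Reducer: input arrives sorted by key, so counts can be summed per word.
reducer() {
  awk -F'\t' '{ count[$1] += $2 } END { for (w in count) print w "\t" count[w] }'
}

# The sort between mapper and reducer mimics Hadoop's shuffle phase.
printf 'Hello world\nhello again\n' | mapper | sort -t "$(printf '\t')" -k1,1 | reducer
```

Because the pipeline reproduces the map / shuffle-sort / reduce stages locally, a bug in the streaming job can be isolated on a laptop before paying for a cluster.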
snoremac / emr-examples-15-segment-java.sh
Last active January 2, 2016 08:09
Run a Java-based word count against 15 common crawl segments on an already running EMR cluster.
# Run the word count against 15 crawl segments.
elastic-mapreduce -j j-2XP9O9IRLHHBU \
--jar s3n://emr-examples.dius.com.au/java/emr-examples.jar \
--main-class au.com.dius.emr.CommonCrawlTool \
--arg -D --arg target.words=hello,world \
--arg -D --arg max.segments=15
snoremac / emr-examples-single-segment-java.sh
Last active January 2, 2016 08:09
Run a Java-based word count against an already running cluster.
# Run the word count against a single crawl segment.
elastic-mapreduce -j j-2XP9O9IRLHHBU \
--jar s3n://emr-examples.dius.com.au/java/emr-examples.jar \
--main-class au.com.dius.emr.CommonCrawlTool \
--arg -D --arg target.words=hello,world \
--arg -D --arg max.segments=1
snoremac / emr-exampes-10-node-spot.sh
Last active January 2, 2016 08:09
Launch a 10 node EMR cluster with keep-alive from the spot market.
# Launch a cluster from the spot market.
#
# This time we specify --alive to keep the cluster running until we
# manually terminate it.
elastic-mapreduce \
--create \
--name "Common Crawl word count" \
--alive \
--enable-debugging \
snoremac / emr-examples-single-segment-10-node-java.sh
Last active January 2, 2016 06:29
Launch a 10 node EMR cluster and run a Java word count against a single common crawl segment.
# Word count, Java-fied.
#
# In this implementation, the job knows how to discover its input URIs based
# on the max.segments argument, which specifies how many crawl segments
# to process.
#
# See the code for details.
$ elastic-mapreduce \
--create \
snoremac / emr-examples-single-segment-10-node-streaming.sh
Created January 5, 2014 03:10
Launch an EMR cluster and run a word count against a single common crawl segment.
# Launch a cluster and run the word count against a single crawl segment.
$ elastic-mapreduce \
--create \
--name "Common Crawl word count" \
--enable-debugging \
--stream \
--ami-version latest \
--instance-group master --instance-count 1 --instance-type m2.2xlarge \
--instance-group core --instance-count 10 --instance-type c1.xlarge \