@snoremac
Last active January 2, 2016 06:29
Launch an EMR cluster (1 master node, 10 core nodes) and run a Java word count against a single Common Crawl segment.
# Word count, Java-fied.
#
# In this implementation, the job knows how to discover its input URIs based
# on the max.segments argument, which specifies how many crawl segments
# to process.
#
# See the code for details.
$ elastic-mapreduce \
--create \
--name "Common Crawl word count" \
--enable-debugging \
--ami-version latest \
--instance-group master --instance-count 1 --instance-type m2.2xlarge \
--instance-group core --instance-count 10 --instance-type c1.xlarge \
--jar s3n://emr-examples.dius.com.au/java/emr-examples.jar \
--main-class au.com.dius.emr.CommonCrawlTool \
--arg -D --arg target.words=hello,world \
--arg -D --arg max.segments=1
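
Each `--arg` above passes exactly one token to the main class, which is why every property takes the form `--arg -D --arg key=value`. The job presumably receives these as `-D key=value` pairs in the style of Hadoop's generic options, though that is an assumption about `CommonCrawlTool` and not something the gist confirms. A minimal sketch of that expansion follows; `emr_props` is a hypothetical helper, not part of this gist or of the `elastic-mapreduce` CLI, and it only prints the flags without launching anything.

```shell
#!/bin/sh
# Hypothetical helper (not part of the original gist): expands key=value
# pairs into the repeated "--arg -D --arg key=value" form used above.
emr_props() {
  out=""
  for kv in "$@"; do
    # Prefix a separating space only when $out already holds something.
    out="${out:+$out }--arg -D --arg $kv"
  done
  printf '%s\n' "$out"
}

# Prints the flags for the two properties used in the command above.
emr_props target.words=hello,world max.segments=1
```

The printed flags could be pasted at the end of the command when trying other words or a larger `max.segments` value. Values containing spaces are not handled by this sketch.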