Launch an EMR cluster and run a word count against a single Common Crawl segment.
# Launch a cluster and run the word count against a single crawl segment.
$ elastic-mapreduce \
--create \
--name "Common Crawl word count" \
--enable-debugging \
--stream \
--ami-version latest \
--instance-group master --instance-count 1 --instance-type m2.2xlarge \
--instance-group core --instance-count 10 --instance-type c1.xlarge \
--input s3n://aws-publicdatasets/common-crawl/parse-output/segment/1346823845675 \
--output s3n://emr-examples.dius.com.au/output \
--mapper 's3://emr-examples.dius.com.au/ruby/common_crawl_mapper.rb hello,world' \
--reducer s3://emr-examples.dius.com.au/ruby/common_crawl_reducer.rb \
--arg -inputformat --arg org.apache.hadoop.mapred.SequenceFileAsTextInputFormat
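
Before paying for a cluster, the map/sort/reduce flow can be rehearsed locally with an ordinary pipe. The sketch below is only an illustration: the `mapper`, `reducer` and `run_local` functions are hypothetical stand-ins, not the contents of `common_crawl_mapper.rb` or `common_crawl_reducer.rb`, and it assumes the mapper's `hello,world` argument is a comma-separated list of words to count.

```shell
#!/bin/sh
# Hypothetical stand-in for common_crawl_mapper.rb (assumed behaviour):
# split stdin into lowercase tokens and emit "word<TAB>1" for each token
# that appears in the comma-separated list passed as $1.
mapper() {
  words="$1"
  tr -cs '[:alnum:]' '\n' | tr '[:upper:]' '[:lower:]' |
    awk -v words="$words" '
      BEGIN { n = split(words, w, ","); for (i = 1; i <= n; i++) want[w[i]] = 1 }
      ($0 in want) { print $0 "\t1" }'
}

# Hypothetical stand-in for common_crawl_reducer.rb (assumed behaviour):
# sum the counts for each key and print "word<TAB>total".
reducer() {
  awk -F '\t' '{ sum[$1] += $2 } END { for (k in sum) print k "\t" sum[k] }' | sort
}

# Hadoop streaming sorts mapper output by key before the reducer reads it;
# a plain sort plays that role here.
run_local() {
  mapper "$1" | sort | reducer
}
```

For example, `printf 'Hello world\nhello again\n' | run_local hello,world` prints `hello` with a count of 2 and `world` with a count of 1, tab-separated. On the cluster, `SequenceFileAsTextInputFormat` hands each record to the mapper as key and value text separated by a tab, so a real mapper may need to treat the key (the first field) differently from the page text.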