@snoremac
Created January 5, 2014 22:10
Bulk copy a single Common Crawl segment from S3 to an already running cluster using S3DistCp.
# Bulk copy a crawl segment from S3 to the running cluster.
# -j is the ID of the already running job flow; --srcPattern is a regex
# that restricts the copy to the segment's textData files.
elastic-mapreduce -j j-2XP9O9IRLHHBU \
  --jar /home/hadoop/lib/emr-s3distcp-1.0.jar \
  --arg --src --arg s3://aws-publicdatasets/common-crawl/parse-output/segment/1346823845675 \
  --arg --srcPattern --arg '.*textData.*' \
  --arg --dest --arg hdfs:///common-crawl/parse-output/segment/1346823845675
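
The segment ID appears twice in the command above (once in the source path, once in the destination), so copying a different segment means editing both. A minimal sketch of a parameterized version follows. The function name `s3distcp_command` is hypothetical and not part of any tool; it only prints the command (a dry run), on the assumption that you want to inspect it before running it against a live job flow.

```shell
#!/bin/sh
# Hypothetical helper (not part of elastic-mapreduce): builds the S3DistCp
# command for one Common Crawl segment and prints it instead of running it.
#   $1 = job flow ID of the already running cluster
#   $2 = segment ID
s3distcp_command() {
  jobflow_id=$1
  segment_id=$2
  # The same relative path is used under both the S3 bucket and HDFS.
  segment_path="common-crawl/parse-output/segment/${segment_id}"
  printf '%s\n' "elastic-mapreduce -j ${jobflow_id} --jar /home/hadoop/lib/emr-s3distcp-1.0.jar --arg --src --arg s3://aws-publicdatasets/${segment_path} --arg --srcPattern --arg '.*textData.*' --arg --dest --arg hdfs:///${segment_path}"
}

# Dry run with the job flow and segment IDs from the command above.
s3distcp_command j-2XP9O9IRLHHBU 1346823845675
```

Once the printed command looks right, it can be pasted into a shell on a machine where the `elastic-mapreduce` client is configured.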