@skonto, created May 24, 2017 15:46
terasort
Run against the normal package, not the beta one.
DATA = 10 GB (100,000,000 records × 100 bytes each; see the HDFS "Number of bytes written" counter below)
Teragen:
17/05/24 15:30:57 INFO mapreduce.Job: Counters: 21
  File System Counters
    FILE: Number of bytes read=276327
    FILE: Number of bytes written=565835
    FILE: Number of read operations=0
    FILE: Number of large read operations=0
    FILE: Number of write operations=0
    HDFS: Number of bytes read=0
    HDFS: Number of bytes written=10000000000
    HDFS: Number of read operations=4
    HDFS: Number of large read operations=0
    HDFS: Number of write operations=3
  Map-Reduce Framework
    Map input records=100000000
    Map output records=100000000
    Input split bytes=83
    Spilled Records=0
    Failed Shuffles=0
    Merged Map outputs=0
    GC time elapsed (ms)=505
    Total committed heap usage (bytes)=278921216
  org.apache.hadoop.examples.terasort.TeraGen$Counters
    CHECKSUM=214760662691937609
  File Input Format Counters
    Bytes Read=0
  File Output Format Counters
    Bytes Written=10000000000
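A quick way to sanity-check the generated size on HDFS (a sketch; hdfs://hdfs/teraInputTB is the input path used in the commands below):
./bin/hdfs dfs -du -s -h hdfs://hdfs/teraInputTB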
TeraSort, last few lines with summary:
Duration: ~9 min (10 GB in roughly 540 s, i.e. about 18 MB/s end to end).
17/05/24 15:45:53 INFO mapreduce.Job: Counters: 35
  File System Counters
    FILE: Number of bytes read=282857735960
    FILE: Number of bytes written=424889341747
    FILE: Number of read operations=0
    FILE: Number of large read operations=0
    FILE: Number of write operations=0
    HDFS: Number of bytes read=182074929179
    HDFS: Number of bytes written=44922543979
    HDFS: Number of read operations=1359
    HDFS: Number of large read operations=0
    HDFS: Number of write operations=126
  Map-Reduce Framework
    Map input records=100000000
    Map output records=100000000
    Map output bytes=10200000000
    Map output materialized bytes=10400000912
    Input split bytes=1919
    Combine input records=0
    Combine output records=0
    Reduce input groups=100000000
    Reduce shuffle bytes=10400000912
    Reduce input records=100000000
    Reduce output records=100000000
    Spilled Records=300000000
    Shuffled Maps =152
    Failed Shuffles=0
    Merged Map outputs=152
    GC time elapsed (ms)=3888
    Total committed heap usage (bytes)=13784580096
  Shuffle Errors
    BAD_ID=0
    CONNECTION=0
    IO_ERROR=0
    WRONG_LENGTH=0
    WRONG_MAP=0
    WRONG_REDUCE=0
  File Input Format Counters
    Bytes Read=10000000000
  File Output Format Counters
    Bytes Written=10000000000
17/05/24 15:45:53 INFO terasort.TeraSort: done
Commands:
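# Clean up the input/output/validate directories from any previous run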
./bin/hdfs dfs -rm -r -f hdfs://hdfs/teraInputTB hdfs://hdfs/teraOutputTB hdfs://hdfs/teraValidateTB
# TeraGen: generate the input data
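# Assumption: DATA_RECORDS is not defined anywhere in this gist; 100000000 matches
# the "Map input records" counter above (100 M x 100-byte rows = 10 GB)
DATA_RECORDS=100000000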
./bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0-cdh5.9.1.jar teragen \
-Ddfs.block.size=536870912 \
-Dmapred.map.tasks=16 \
-Dmapred.reduce.tasks=8 \
-Dmapred.map.tasks.speculative.execution=true \
-Dmapred.compress.map.output=true \
$DATA_RECORDS hdfs://hdfs/teraInputTB
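# TeraSort: sort the generated input (the mapred.* keys here are the deprecated
# Hadoop 1.x property names, which CDH 5 still accepts)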
./bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0-cdh5.9.1.jar terasort \
-Ddfs.block.size=536870912 \
-Dio.file.buffer.size=32768 \
-Dmapred.map.tasks=16 \
-Dmapred.reduce.tasks=8 \
-Dio.sort.factor=48 \
-Dio.sort.record.percent=0.138 \
hdfs://hdfs/teraInputTB hdfs://hdfs/teraOutputTB
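The cleanup command above also removes hdfs://hdfs/teraValidateTB, so a validation pass presumably completes the suite; a minimal sketch of that step, assuming the same jar and paths (teravalidate checks that the sorted output is globally ordered and writes its report to the last path):
# TeraValidate: verify the sorted output
./bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0-cdh5.9.1.jar teravalidate \
hdfs://hdfs/teraOutputTB hdfs://hdfs/teraValidateTB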