
@ace-subido
Last active November 19, 2023 14:32
Benchmarking a Hadoop Cluster

Requirements

  1. HDP-2.2 installed by Ambari
  2. HDFS Client installed
  3. Patience

Instructions

Setup

SSH into a cluster machine and then run this on the command line:

$ sudo su hdfs

This switches you to the hdfs user, which is created by the HDP installation. Running as hdfs gives you the HDFS permissions needed to run the terasort benchmarking workflow without much interruption.

Terasort tests

Look for hadoop-mapreduce-examples.jar on your local machine. It's usually found under the /usr/hdp folder. You can run the following command to find it (the pattern is quoted so that the shell does not expand it before find sees it):

find /usr/hdp -name 'hadoop-*examples*.jar'

If found, go to its directory and run:

hadoop jar hadoop-mapreduce-examples.jar

Run without further arguments, this lists the available example programs, including teragen, terasort, and teravalidate.
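If you would rather not change directories, the jar's full path can be captured in a variable instead. The following is a minimal sketch: find_jar and EXAMPLES_JAR are hypothetical names chosen for illustration, not part of Hadoop or HDP.

```shell
# find_jar DIR PATTERN: print the first file under DIR whose name matches PATTERN.
# Hypothetical helper for illustration; not part of Hadoop or HDP.
find_jar() {
  local dir="$1"
  local pattern="$2"
  # The pattern is passed quoted so find, not the shell, does the matching.
  # sort gives a stable choice when several versions are installed.
  find "$dir" -type f -name "$pattern" 2>/dev/null | sort | head -n 1
}

# Example usage on an HDP node (assumes the /usr/hdp layout described above):
#   EXAMPLES_JAR="$(find_jar /usr/hdp 'hadoop-mapreduce-examples*.jar')"
#   hadoop jar "$EXAMPLES_JAR"
```

With the path in a variable, the later commands can be run from any directory by replacing the bare jar name with "$EXAMPLES_JAR".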

TestDFSIO

Look for hadoop-mapreduce-client-jobclient-tests.jar on your local machine. It's usually found under the /usr/hdp folder. You can run the following command to find it:

find /usr/hdp -name 'hadoop-mapreduce-client-jobclient*tests*.jar'

If found, go to its directory and run:

hadoop jar hadoop-mapreduce-client-jobclient-tests.jar

Run without further arguments, this lists the available test programs, including TestDFSIO.


What do they do?

  • teragen generates sample data and writes it to an HDFS directory that serves as the input for terasort.

  • terasort runs a MapReduce job that sorts the data produced by teragen and writes the sorted result to an output directory.

  • teravalidate checks that the terasort output is correctly sorted.

  • TestDFSIO is a test for IO throughput of the cluster. -write creates sample files, -read reads them, and -clean deletes the test outputs.

Create a /benchmarks directory in HDFS:

hadoop fs -mkdir /benchmarks

This is where the TestDFSIO, teragen, terasort and teravalidate commands read their inputs and write their outputs.

Running Teragen > Terasort > Teravalidate

Important note: Make sure you are in the folder where hadoop-mapreduce-examples.jar is found as per the instructions above.

Running teragen

Run one of these commands, depending on how much data you want to generate. The first argument is the number of rows, and each row is 100 bytes:

hadoop jar hadoop-mapreduce-examples.jar teragen 5000000000 /benchmarks/terasort-input   # 500GB
hadoop jar hadoop-mapreduce-examples.jar teragen 50000000 /benchmarks/terasort-input     # 5GB
hadoop jar hadoop-mapreduce-examples.jar teragen 100000000 /benchmarks/terasort-input    # 10GB

The first command, for example, creates 500GB of data in HDFS under the /benchmarks/terasort-input folder. This is the input that terasort runs its benchmark on.
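The row counts follow from the fixed 100-byte teragen row size: rows = bytes / 100. A small sketch of that arithmetic is below; teragen_rows is a hypothetical helper name, and it assumes decimal gigabytes (1GB = 10^9 bytes), which matches the sizes quoted above.

```shell
# teragen_rows GB: print the teragen row count needed for GB gigabytes of data.
# Hypothetical helper for illustration. Assumes 1GB = 10^9 bytes and
# relies on teragen writing 100-byte rows.
teragen_rows() {
  local gb="$1"
  echo $(( gb * 1000000000 / 100 ))
}

# Example usage:
#   hadoop jar hadoop-mapreduce-examples.jar teragen "$(teragen_rows 5)" /benchmarks/terasort-input
```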

Running terasort

Run this command:

hadoop jar hadoop-mapreduce-examples.jar terasort /benchmarks/terasort-input /benchmarks/terasort-output

This command runs a benchmarking MapReduce job that sorts the data created by teragen in the /benchmarks/terasort-input folder. The sorted output of the reduce phase is written to files under the /benchmarks/terasort-output folder, which teravalidate checks later.

Running teravalidate

Run this command:

hadoop jar hadoop-mapreduce-examples.jar teravalidate /benchmarks/terasort-output /benchmarks/terasort-validate

This command checks that the output of terasort is valid, that is, correctly sorted. It writes its report to the /benchmarks/terasort-validate folder.
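The three steps can also be chained so that a failure in one step stops the run. This is only a sketch under stated assumptions: run_terasort_suite is a hypothetical function name, and HADOOP_CMD is a hypothetical override (defaulting to hadoop) added so the sequence can be dry-run with echo.

```shell
# run_terasort_suite ROWS [JAR]: run teragen, terasort and teravalidate in order,
# stopping at the first step that fails.
# Hypothetical wrapper; HADOOP_CMD is an assumed override used for dry runs.
run_terasort_suite() {
  local rows="$1"
  local jar="${2:-hadoop-mapreduce-examples.jar}"
  local hadoop="${HADOOP_CMD:-hadoop}"
  local base=/benchmarks

  # && ensures a later step runs only if the previous one succeeded.
  "$hadoop" jar "$jar" teragen "$rows" "$base/terasort-input" &&
  "$hadoop" jar "$jar" terasort "$base/terasort-input" "$base/terasort-output" &&
  "$hadoop" jar "$jar" teravalidate "$base/terasort-output" "$base/terasort-validate"
}

# Dry run: print the three commands instead of submitting jobs.
#   HADOOP_CMD=echo run_terasort_suite 50000000
# Real run (5GB), timed:
#   time run_terasort_suite 50000000
```

MapReduce jobs refuse to write into an output directory that already exists, so before repeating a run you will likely need to remove the previous directories, for example with hadoop fs -rm -r /benchmarks/terasort-input.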

Troubleshooting

  1. mapreduce.tar.gz is missing: find the file on the local machine (it's usually under /usr/hdp/hadoop) and copy it to the HDFS folder where the job is looking for it.

Running TestDFSIO

Important note: Make sure you are in the folder where hadoop-mapreduce-client-jobclient-tests.jar is found as per the instructions above.

Run -write

hadoop jar hadoop-mapreduce-client-jobclient-tests.jar TestDFSIO -write -nrFiles 10 -fileSize 1000   # Write 10 1GB files

The -fileSize value is in MB. This command automatically creates a folder under /benchmarks/TestDFSIO where it writes out these files.
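To size a run by total data volume rather than per file, the -fileSize value can be derived from the target total and the file count. The sketch below uses a hypothetical helper name, dfsio_file_size_mb, and follows the convention used above that 1GB corresponds to -fileSize 1000.

```shell
# dfsio_file_size_mb TOTAL_GB NR_FILES: print the per-file size in MB so that
# NR_FILES files add up to TOTAL_GB.
# Hypothetical helper; assumes 1GB = 1000MB, as in the command above.
# Integer division: any remainder is dropped.
dfsio_file_size_mb() {
  local total_gb="$1"
  local nr_files="$2"
  echo $(( total_gb * 1000 / nr_files ))
}

# Example usage: 10GB spread over 10 files gives -fileSize 1000.
#   hadoop jar hadoop-mapreduce-client-jobclient-tests.jar TestDFSIO -write \
#     -nrFiles 10 -fileSize "$(dfsio_file_size_mb 10 10)"
```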

Run -read

hadoop jar hadoop-mapreduce-client-jobclient-tests.jar TestDFSIO -read -nrFiles 10 -fileSize 1000

This reads the files produced by the -write step to measure read throughput. Use the same -nrFiles and -fileSize values as in the -write step.

Run -clean

hadoop jar hadoop-mapreduce-client-jobclient-tests.jar TestDFSIO -clean

This cleans out the /benchmarks/TestDFSIO folder in HDFS.

@priyank1930

Hi Folks,

When I run the terasort command, it gives me the error below.

Sampling 0 splits of 0
18/01/29 10:57:07 ERROR terasort.TeraSort: / by zero

Any Help on this would be appreciated .
