calmamani/initial-setup.md

## initial-setup.md

      
Requirements


- HDP-2.2 installed by Ambari
- HDFS Client installed
- Patience

Instructions

Setup

ssh into the machine and then run this on the command line:
$ sudo su hdfs

This switches you to the hdfs user, which is created by the HDP installation. Running as hdfs gives you the HDFS permissions needed to run the terasort benchmarking workflow without interruption.
Terasort tests

Look for the hadoop-mapreduce-examples.jar on the local machine. It's usually found under the /usr/hdp folder. You can run the following command to find it (the pattern is quoted so that find, not the shell, expands the wildcards):
find /usr/hdp -name 'hadoop-*examples*.jar'

If found, go to its directory and run:
hadoop jar hadoop-mapreduce-examples.jar

The output lists the available example programs, including teragen, terasort, and teravalidate.
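
The lookup and the change of directory can be combined in a small shell sketch. The function name `find_jar_dir` is hypothetical and not part of Hadoop or HDP; it relies only on the standard `find`, `head`, and `dirname` tools.

```shell
# find_jar_dir ROOT PATTERN
# Prints the directory of the first file under ROOT whose name matches PATTERN.
# Hypothetical helper for illustration only.
find_jar_dir() {
  # Pass the pattern quoted so find expands the wildcard, not the shell.
  jar_path=$(find "$1" -name "$2" 2>/dev/null | head -n 1)
  if [ -z "$jar_path" ]; then
    echo "no jar matching $2 under $1" >&2
    return 1
  fi
  dirname "$jar_path"
}

# Usage on the cluster node:
#   cd "$(find_jar_dir /usr/hdp 'hadoop-*examples*.jar')"
```

If several versions are installed under /usr/hdp, this picks whichever one `find` reports first, so check the result before relying on it.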
TestDFSIO

Look for the hadoop-mapreduce-client-jobclient-tests.jar on the local machine. It's usually found under the /usr/hdp folder. You can run the following command to find it:
find /usr/hdp -name 'hadoop-mapreduce-client-jobclient*tests.jar'

If found, go to its directory and run:
hadoop jar hadoop-mapreduce-client-jobclient-tests.jar

The output lists the available test programs, including TestDFSIO.

What do they do?


teragen generates sample data and writes it to a directory that terasort uses as input. terasort sorts that data with a MapReduce job and writes the result to an output directory. teravalidate checks that the terasort output is correctly sorted.


TestDFSIO is a test for IO throughput of the cluster. -write creates sample files, -read reads them, and -clean deletes the test outputs.


Create a /benchmarks directory in the HDFS:
hadoop fs -mkdir /benchmarks

This is where the TestDFSIO, teragen, terasort and teravalidate commands read their input and write their output.
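
As a sketch, the directory creation can be made safe to re-run. `-mkdir -p` is a standard `hadoop fs` option; the `HADOOP` variable and the function name `ensure_hdfs_dir` are assumptions added here so the command can be dry-run with `HADOOP=echo`.

```shell
# Command used to invoke Hadoop; set HADOOP=echo to print instead of run.
HADOOP="${HADOOP:-hadoop}"

# ensure_hdfs_dir DIR: create DIR in HDFS (hypothetical helper).
ensure_hdfs_dir() {
  # -p makes mkdir succeed even when the directory already exists.
  "$HADOOP" fs -mkdir -p "$1"
}

# Usage, as the hdfs user:
#   ensure_hdfs_dir /benchmarks
```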

  
## teragen-terasort-teravalidate.md

      
Running Teragen > Terasort > Teravalidate

Important note: Make sure you are in the folder where hadoop-mapreduce-examples.jar is found as per the instructions above.
Running teragen

Run one of these commands, depending on how much data you want to generate:
hadoop jar hadoop-mapreduce-examples.jar teragen 5000000000 /benchmarks/terasort-input # 500GB
hadoop jar hadoop-mapreduce-examples.jar teragen 50000000 /benchmarks/terasort-input # 5GB
hadoop jar hadoop-mapreduce-examples.jar teragen 100000000 /benchmarks/terasort-input # 10GB

teragen takes the number of 100-byte rows to generate, so the first command creates 500GB of data in HDFS under the /benchmarks/terasort-input folder. terasort uses this data as its benchmark input.
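
The row counts follow from teragen's fixed row size of 100 bytes: rows = bytes / 100. A minimal sketch of that arithmetic, taking 1GB as 1,000,000,000 bytes to match the sizes quoted above (the function name `teragen_rows` is hypothetical):

```shell
# teragen_rows GB: print how many 100-byte rows make up GB gigabytes.
# Assumes 1GB = 1,000,000,000 bytes. Not a Hadoop command.
teragen_rows() {
  echo $(( $1 * 1000000000 / 100 ))
}

teragen_rows 5     # prints 50000000
teragen_rows 500   # prints 5000000000
```

The result can be passed straight to teragen as its first argument.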
Running terasort

Run this command:
hadoop jar hadoop-mapreduce-examples.jar terasort /benchmarks/terasort-input /benchmarks/terasort-output

This command runs a MapReduce sort job on the data created by teragen in the /benchmarks/terasort-input folder. The sorted output is written to files under the /benchmarks/terasort-output/ folder, which teravalidate checks later.
Running teravalidate

Run this command:
hadoop jar hadoop-mapreduce-examples.jar teravalidate /benchmarks/terasort-output /benchmarks/terasort-validate

This command checks that the output of terasort is correctly sorted and writes its report to the /benchmarks/terasort-validate folder.
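
The three steps can be chained so that a failure in one stops the run. This is only a sketch: the function name `run_tera` and the `HADOOP` override are assumptions, and it expects to be run from the directory containing hadoop-mapreduce-examples.jar.

```shell
# Command used to invoke Hadoop; set HADOOP=echo to print instead of run.
HADOOP="${HADOOP:-hadoop}"
JAR="hadoop-mapreduce-examples.jar"

# run_tera ROWS: generate ROWS rows, sort them, then validate the result.
# Each step runs only if the previous one succeeded. Hypothetical helper.
run_tera() {
  "$HADOOP" jar "$JAR" teragen "$1" /benchmarks/terasort-input &&
  "$HADOOP" jar "$JAR" terasort /benchmarks/terasort-input /benchmarks/terasort-output &&
  "$HADOOP" jar "$JAR" teravalidate /benchmarks/terasort-output /benchmarks/terasort-validate
}

# Usage:
#   run_tera 50000000   # about 5GB
```

MapReduce jobs refuse to write to an output directory that already exists, so remove the output of an earlier run (for example with `hadoop fs -rm -r`) before running the sequence again.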
Troubleshooting


mapreduce.tar.gz is missing: find the file on the local machine (it's usually under /usr/hdp/hadoop) and upload it to the HDFS folder where the job is looking for it.
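
A sketch of that fix, with both paths left as arguments because the HDFS destination depends on your cluster; take it from the job's error message. `upload_to_hdfs` and the `HADOOP` override are hypothetical names, while `-mkdir -p` and `-put` are standard `hadoop fs` options.

```shell
# Command used to invoke Hadoop; set HADOOP=echo to print instead of run.
HADOOP="${HADOOP:-hadoop}"

# upload_to_hdfs LOCAL_FILE HDFS_DIR: copy a local file into an HDFS directory,
# creating the directory first if needed. Hypothetical helper.
upload_to_hdfs() {
  "$HADOOP" fs -mkdir -p "$2" &&
  "$HADOOP" fs -put "$1" "$2"
}

# Usage (both arguments are placeholders to fill in for your cluster):
#   upload_to_hdfs /local/path/to/mapreduce.tar.gz /hdfs/dir/from/error
```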


## testdfsio.md

      
Running TestDFSIO

Important note: Make sure you are in the folder where hadoop-mapreduce-client-jobclient-tests.jar is found as per the instructions above.
Run -write

hadoop jar hadoop-mapreduce-client-jobclient-tests.jar TestDFSIO -write -nrFiles 10 -fileSize 1000 # Write 10 files of 1000MB (about 1GB) each

This automatically creates the /benchmarks/TestDFSIO folder in HDFS and writes the test files there.
Run -read

hadoop jar hadoop-mapreduce-client-jobclient-tests.jar TestDFSIO -read -nrFiles 10 -fileSize 1000

This reads the files produced by the -write step to measure read throughput.
Run -clean

hadoop jar hadoop-mapreduce-client-jobclient-tests.jar TestDFSIO -clean

This cleans out the /benchmarks/TestDFSIO folder in HDFS.
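
The write, read, and clean steps can be run as one sequence. A minimal sketch, assuming it is run from the directory containing the jobclient tests jar; the function name `run_dfsio` and the `HADOOP` override are hypothetical.

```shell
# Command used to invoke Hadoop; set HADOOP=echo to print instead of run.
HADOOP="${HADOOP:-hadoop}"
JAR="hadoop-mapreduce-client-jobclient-tests.jar"

# run_dfsio FILES SIZE_MB: write FILES files of SIZE_MB megabytes each,
# read them back, then delete the test data. Stops at the first failure.
run_dfsio() {
  "$HADOOP" jar "$JAR" TestDFSIO -write -nrFiles "$1" -fileSize "$2" &&
  "$HADOOP" jar "$JAR" TestDFSIO -read -nrFiles "$1" -fileSize "$2" &&
  "$HADOOP" jar "$JAR" TestDFSIO -clean
}

# Usage:
#   run_dfsio 10 1000   # 10 files of 1000MB each
```

Keeping -nrFiles and -fileSize identical for the write and read steps ensures the read test covers exactly the data that was written.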