Imagine we have a single-thread (Hello World) application. We would like to run this same application across multiple nodes within a cluster. This post summarises a simple exercise to implement this architecture on the Intel Colfax Cluster.
As usual, ssh to the Colfax cluster from our laptop.
Last login: Mon Aug 14 10:31:32 on ttys001
johnny@Chuns-MBP $ ssh colfax
######################################################################
# Welcome to Colfax Cluster!
######################################################################
#
# Pre-compiled Machine Learning Frameworks, as well as other tools
# are available in the /opt/ directory
#
# If you have any questions or feature requests, post them on our
# forum at:
# https://colfaxresearch.com/discussion/forum/
#
# Colfax Research Team
######################################################################
Last login: Mon Aug 14 04:43:00 2017 from 10.5.0.7
[u4443@c001 ~]$
Navigate to a directory of our choice. In my case:
[u4443@c001 lec-05]$ pwd
/home/u4443/deepdive/lec-05
Create a C++ source file called hello-mpi-2.cc:
#include "mpi.h"
#include <cstdio>

int main (int argc, char *argv[]) {
  // Initialize MPI environment
  MPI_Init (NULL, NULL);

  // Get rank ID of current process
  int rank;
  MPI_Comm_rank (MPI_COMM_WORLD, &rank);

  // Get number of processes
  int size;
  MPI_Comm_size (MPI_COMM_WORLD, &size);

  // Get cluster node name
  int namelen;
  char name[MPI_MAX_PROCESSOR_NAME];
  MPI_Get_processor_name (name, &namelen);

  // Print on cluster node
  printf ("MPI world rank %d running on node %s!\n", rank, name);
  if (rank == 0) printf ("MPI world size = %d processes\n", size);

  MPI_Finalize (); // Terminate MPI environment
  return 0;
}
There is a lot going on in this program. Let's break it up a bit.
#include "mpi.h"
#include <cstdio>
Here we include the required headers. Note that #include "mpi.h" can also be written as #include <mpi.h>; the quoted form simply searches the source file's directory before the system include paths.
int main (int argc, char *argv[]) {
//...
}
The main program takes two optional arguments:
- int argc : the number of arguments passed to the program from the command line.
- char *argv[] : an array of argument strings (i.e. an array of pointers), equivalent to char **argv. See this Stack Overflow answer for a clear explanation.
These are just the standard arguments to a C++ main program. We can omit them entirely and reduce the signature to int main() { //... } if we do not intend to pass any arguments to the program on invocation.
MPI_Init (NULL, NULL);
//...
MPI_Finalize ();
We wrap the application between MPI_Init() and MPI_Finalize(). Each copy of the program runs as a single process on a node, and MPI coordinates these distributed processes between the two calls.
- MPI_Init() : initializes the MPI environment. Afterwards we can use MPI-defined names such as MPI_MAX_PROCESSOR_NAME (the maximum length of a host/node name) and MPI_COMM_WORLD (the communicator that spans every process in the MPI world), plus more. (Note to self: what else does MPI_Init() set up?). Here, to keep the code simple and easy to read, we call MPI_Init(NULL, NULL) to pass no arguments. The conventional form is MPI_Init(&argc, &argv), which forwards the command-line arguments from main, should we want that.
- MPI_Finalize() : marks the end of the MPI (distributed computing) code. No MPI calls may be made after this line.
Note that this same program is run on each of the requested cluster nodes. Each process gets its own memory address space in which to run the application. See this Stack Overflow post for more info.
// Get rank ID of current process
int rank;
MPI_Comm_rank (MPI_COMM_WORLD, &rank);
Here, we obtain the MPI rank of the current process. The following diagram of MPI_COMM_WORLD paints a nice picture (obtained from the MPI Course Book):
By calling MPI_Init(), we've effectively created an MPI "world" consisting of parallel processes, each assigned a rank ID (0, 1, 2, ...). MPI_Comm_rank() stores that rank in the variable rank.
// Get number of processes
int size;
MPI_Comm_size (MPI_COMM_WORLD, &size);
Here, MPI_Comm_size() stores the total number of parallel processes (the world size) in the variable size.
// Get cluster node name
int namelen;
char name[MPI_MAX_PROCESSOR_NAME];
MPI_Get_processor_name (name, &namelen);
Here, we obtain the name of the cluster node and store it in the character buffer name; its length is written to namelen.
// Print on cluster node
printf ("MPI world rank %d running on node %s!\n", rank, name);
if (rank == 0) printf("MPI world size = %d processes\n", size);
Here, we do the printing so we can visualise what is going on. Note that we can control which rank does what via the rank variable: in this case, we only print the second line from rank 0, to ensure it is printed exactly once.
Extra note: if we run 10 processes distributed across 4 nodes, we get ranks 0, 1, ..., 8, 9 (i.e. 10 ranks in total). The if (rank == 0) check ensures the world size is printed only by rank 0 out of those 10 ranks. Rank corresponds to process, not node.
We can use the Intel MPI Compiler to compile the C++ source code:
[u4443@c001 ~]$ mpiicpc -o hello-mpi-2 hello-mpi-2.cc
Now we've created the binary hello-mpi-2
.
Create a shell script, hello-mpi-2.sh, that invokes the binary. Here we will be adventurous and try out multiple settings to see what we get:
#PBS -l nodes=4:knl
cd $PBS_O_WORKDIR
echo "----------"
echo "Cluster nodes requested..."
cat $PBS_NODEFILE
echo "----------"
echo "Run 1 single-thread process on localhost..."
mpirun -host localhost -np 1 ./hello-mpi-2
echo "----------"
echo "Run 2 single-thread processes on localhost..."
mpirun -host localhost -np 2 ./hello-mpi-2
echo "----------"
echo "Run 1 single-thread process on each cluster node requested..."
mpirun -machinefile $PBS_NODEFILE ./hello-mpi-2
echo "----------"
echo "Run 1 single-thread process across available cluster node..."
mpirun -machinefile $PBS_NODEFILE -np 1 ./hello-mpi-2
echo "----------"
echo "Run 2 single-thread processes across available cluster nodes..."
mpirun -machinefile $PBS_NODEFILE -np 2 ./hello-mpi-2
echo "----------"
echo "Run 4 single-thread processes across available cluster nodes..."
mpirun -machinefile $PBS_NODEFILE -np 4 ./hello-mpi-2
echo "----------"
echo "Run 10 single-thread processes across available cluster nodes..."
mpirun -machinefile $PBS_NODEFILE -np 10 ./hello-mpi-2
A bit of explanation:
- #PBS -l nodes=4:knl : request 4 KNL (Knights Landing Xeon Phi) nodes.
- cd $PBS_O_WORKDIR : navigate to the working directory (where we will execute the qsub command later on).
- cat $PBS_NODEFILE : list the names of the cluster nodes successfully requested.
- mpirun : launch the MPI binary.
- The -np option specifies the number of processes to run.
- The -machinefile option specifies a file listing the cluster nodes to run processes on.
- The -host option specifies a single host to run on; localhost is the node the job script itself runs on (here, the one hosting rank 0).
- If the number of processes exceeds the number of cluster nodes, we should expect some cluster nodes to run multiple processes.
In our working directory, submit job to the Colfax cluster via qsub
:
[u4443@c001 ~]$ qsub hello-mpi-2.sh
21216.c001
Make a note of the job number returned.
The job should only take a second or two to run. View the output file hello-mpi-2.sh.o21216
:
[u4443@c001 lec-05]$ cat hello-mpi-2.sh.o21216
########################################################################
# Colfax Cluster - https://colfaxresearch.com/
# Date: Mon Aug 14 05:05:39 PDT 2017
# Job ID: 21216.c001
# User: u4443
# Resources: neednodes=4:knl,nodes=4:knl,walltime=24:00:00
########################################################################
----------
Cluster nodes requested...
c001-n029
c001-n030
c001-n031
c001-n032
----------
Run 1 single-thread process on localhost...
MPI world rank 0 running on node c001-n029!
MPI world size = 1 processes
----------
Run 2 single-thread processes on localhost...
MPI world rank 1 running on node c001-n029!
MPI world rank 0 running on node c001-n029!
MPI world size = 2 processes
----------
Run 1 single-thread process on each cluster node requested...
MPI world rank 3 running on node c001-n032!
MPI world rank 0 running on node c001-n029!
MPI world size = 4 processes
MPI world rank 1 running on node c001-n030!
MPI world rank 2 running on node c001-n031!
----------
Run 1 single-thread process across available cluster node...
MPI world rank 0 running on node c001-n029!
MPI world size = 1 processes
----------
Run 2 single-thread processes across available cluster nodes...
MPI world rank 1 running on node c001-n030!
MPI world rank 0 running on node c001-n029!
MPI world size = 2 processes
----------
Run 4 single-thread processes across available cluster nodes...
MPI world rank 0 running on node c001-n029!
MPI world size = 4 processes
MPI world rank 1 running on node c001-n030!
MPI world rank 2 running on node c001-n031!
MPI world rank 3 running on node c001-n032!
----------
Run 10 single-thread processes across available cluster nodes...
MPI world rank 4 running on node c001-n029!
MPI world rank 1 running on node c001-n030!
MPI world rank 2 running on node c001-n031!
MPI world rank 3 running on node c001-n032!
MPI world rank 8 running on node c001-n029!
MPI world rank 5 running on node c001-n030!
MPI world rank 6 running on node c001-n031!
MPI world rank 7 running on node c001-n032!
MPI world rank 0 running on node c001-n029!
MPI world size = 10 processes
MPI world rank 9 running on node c001-n030!
########################################################################
# Colfax Cluster
# End of output for job 21216.c001
# Date: Mon Aug 14 05:05:57 PDT 2017
########################################################################
[u4443@c001 lec-05]$
Also view the error file hello-mpi-2.sh.e21216
(which I'm not sure how to interpret, yet):
[u4443@c001 lec-05]$ cat hello-mpi-2.sh.e21216
[3] DAPL startup: RLIMIT_MEMLOCK too small
[1] DAPL startup: RLIMIT_MEMLOCK too small
[2] DAPL startup: RLIMIT_MEMLOCK too small
[1] DAPL startup: RLIMIT_MEMLOCK too small
[3] DAPL startup: RLIMIT_MEMLOCK too small
[2] DAPL startup: RLIMIT_MEMLOCK too small
[1] DAPL startup: RLIMIT_MEMLOCK too small
[2] DAPL startup: RLIMIT_MEMLOCK too small
[3] DAPL startup: RLIMIT_MEMLOCK too small
[6] DAPL startup: RLIMIT_MEMLOCK too small
[7] DAPL startup: RLIMIT_MEMLOCK too small
[1] DAPL startup: RLIMIT_MEMLOCK too small
[5] DAPL startup: RLIMIT_MEMLOCK too small
[9] DAPL startup: RLIMIT_MEMLOCK too small
[u4443@c001 lec-05]$
(Note to self: how to interpret the above error file. Does this matter?)
In this post we've illustrated a simple yet powerful distributed computing and parallel programming example via a Hello World application. A single-threaded C++ program is spawned as multiple (dummy) parallel processes, distributed to run on multiple cluster nodes. We use MPI (Message Passing Interface) in our C++ program to coordinate these parallel processes. The output at the end should help us understand a bit about how the application works.
Note: should we wish to run multi-threaded processes on each node, we could integrate the OpenMP framework to do this; that is out of scope for this post and we may come back to it in a separate post in the future.