Imagine we have a single-thread (Hello World) application. We would like to run this same application across multiple nodes within a cluster. This post summarises a simple exercise to implement this architecture on the Intel Colfax Cluster.
As usual, ssh to the Colfax cluster from our laptop.
Last login: Mon Aug 14 10:31:32 on ttys001
johnny@Chuns-MBP $ ssh colfax
######################################################################
# Welcome to Colfax Cluster!
######################################################################
#
# Pre-compiled Machine Learning Frameworks, as well as other tools
# are available in the /opt/ directory
#
# If you have any questions or feature requests, post them on our
# forum at:
# https://colfaxresearch.com/discussion/forum/
#
# Colfax Research Team
######################################################################
Last login: Mon Aug 14 04:43:00 2017 from 10.5.0.7
[u4443@c001 ~]$
Navigate to a directory of our choice. In my case:
[u4443@c001 lec-05]$ pwd
/home/u4443/deepdive/lec-05
Create a C++ source file called hello-mpi-2.cc:
#include "mpi.h"
#include <cstdio>

int main (int argc, char *argv[]) {
  // Initialize MPI environment
  MPI_Init (NULL, NULL);

  // Get rank ID of current process
  int rank;
  MPI_Comm_rank (MPI_COMM_WORLD, &rank);

  // Get number of processes
  int size;
  MPI_Comm_size (MPI_COMM_WORLD, &size);

  // Get cluster node name
  int namelen;
  char name[MPI_MAX_PROCESSOR_NAME];
  MPI_Get_processor_name (name, &namelen);

  // Print on cluster node
  printf ("MPI world rank %d running on node %s!\n", rank, name);
  if (rank == 0) printf ("MPI world size = %d processes\n", size);

  MPI_Finalize (); // Terminate MPI environment
  return 0;
}
There is a lot going on in this program. Let's break it up a bit.
#include "mpi.h"
#include <cstdio>
Here we include the required headers. Note that #include "mpi.h" can also be written as #include <mpi.h>; the quoted form simply searches the source file's directory before the system include paths.
int main (int argc, char *argv[]) {
//...
}
The main program takes two optional arguments:
- int argc : the number of arguments passed to the program from the command line.
- char *argv[] : an array of argument strings (i.e. an array of pointers), equivalent to char **argv. See this Stack Overflow answer for a clear explanation.
These are just the standard arguments to a C++ main program. We can omit them entirely and reduce the signature to int main() { //... } if we do not intend to pass any arguments to the program on invocation.
MPI_Init (NULL, NULL);
//...
MPI_Finalize ();
We wrap the application between MPI_Init() and MPI_Finalize(). Each copy of the program runs as a single process on a node, and MPI coordinates these distributed processes between the two calls.
- MPI_Init() : initializes the MPI environment. Afterwards we can use MPI-defined names such as MPI_MAX_PROCESSOR_NAME (the maximum length of a host/node name) and MPI_COMM_WORLD (the communicator that spans every process in the MPI world), plus more. (Note to self: what else does MPI_Init() set up?). Here, to keep the code simple and easy to read, we call MPI_Init(NULL, NULL) to pass no arguments. The conventional form is MPI_Init(&argc, &argv), which forwards the command-line arguments from main, should we want that.
- MPI_Finalize() : marks the end of the MPI (distributed computing) code. No MPI calls may be made after this line.
Note that this same program is run on each of the requested cluster nodes. Each process gets its own memory address space in which to run the application. See this Stack Overflow post for more info.
// Get rank ID of current process
int rank;
MPI_Comm_rank (MPI_COMM_WORLD, &rank);
Here, we obtain the MPI rank of the current process. The following diagram of MPI_COMM_WORLD paints a nice picture (obtained from the MPI Course Book):
By calling MPI_Init(), we've effectively created an MPI "world" consisting of parallel processes, each assigned a rank ID (0, 1, 2, ...). MPI_Comm_rank() stores that rank in the variable rank.
// Get number of processes
int size;
MPI_Comm_size (MPI_COMM_WORLD, &size);
Here, MPI_Comm_size() stores the total number of parallel processes (the world size) in the variable size.
// Get cluster node name
int namelen;
char name[MPI_MAX_PROCESSOR_NAME];
MPI_Get_processor_name (name, &namelen);
Here, we obtain the name of the cluster node and store it in the character buffer name; its length is written to namelen.
// Print on cluster node
printf ("MPI world rank %d running on node %s!\n", rank, name);
if (rank == 0) printf("MPI world size = %d processes\n", size);
Here, we do the printing so we can visualise what is going on. Note that we can control which rank does what via the rank variable: in this case, we only print the second line from rank 0, to ensure it is printed exactly once.
Extra note: if we run 10 processes distributed across 4 nodes, we get ranks 0, 1, ..., 8, 9 (i.e. 10 ranks in total). The if (rank == 0) check ensures the world size is printed only by rank 0 out of those 10 ranks. Rank corresponds to process, not node.
We can use the Intel MPI Compiler to compile the C++ source code:
[u4443@c001 ~]$ mpiicpc -o hello-mpi-2 hello-mpi-2.cc
Now we've created the binary hello-mpi-2
.
Create a shell script, hello-mpi-2.sh, that invokes the binary. Here we will be adventurous and try out multiple settings to see what we get:
#PBS -l nodes=4:knl
cd $PBS_O_WORKDIR
echo "----------"
echo "Cluster nodes requested..."
cat $PBS_NODEFILE
echo "----------"
echo "Run 1 single-thread process on localhost..."
mpirun -host localhost -np 1 ./hello-mpi-2
echo "----------"
echo "Run 2 single-thread processes on localhost..."
mpirun -host localhost -np 2 ./hello-mpi-2
echo "----------"
echo "Run 1 single-thread process on each cluster node requested..."
mpirun -machinefile $PBS_NODEFILE ./hello-mpi-2
echo "----------"
echo "Run 1 single-thread process across available cluster node..."
mpirun -machinefile $PBS_NODEFILE -np 1 ./hello-mpi-2
echo "----------"
echo "Run 2 single-thread processes across available cluster nodes..."
mpirun -machinefile $PBS_NODEFILE -np 2 ./hello-mpi-2
echo "----------"
echo "Run 4 single-thread processes across available cluster nodes..."
mpirun -machinefile $PBS_NODEFILE -np 4 ./hello-mpi-2
echo "----------"
echo "Run 10 single-thread processes across available cluster nodes..."
mpirun -machinefile $PBS_NODEFILE -np 10 ./hello-mpi-2
A bit of explanation:
- #PBS -l nodes=4:knl : request 4 KNL (Knights Landing Xeon Phi) nodes.
- cd $PBS_O_WORKDIR : navigate to the working directory (where we will execute the qsub command later on).
- cat $PBS_NODEFILE : list the names of the cluster nodes successfully requested.
- mpirun : launch the MPI binary.
- The -np option specifies the number of processes to run.
- The -machinefile option specifies a file listing the cluster nodes to run processes on.
- The -host option specifies a single host to run on; localhost is the node the job script itself runs on (here, the one hosting rank 0).
- If the number of processes exceeds the number of cluster nodes, we should expect some cluster nodes to run multiple processes.
In our working directory, submit job to the Colfax cluster via qsub
:
[u4443@c001 ~]$ qsub hello-mpi-2.sh
21216.c001
Make a note of the job number returned.
The job should only take a second or two to run. View the output file hello-mpi-2.sh.o21216
:
[u4443@c001 lec-05]$ cat hello-mpi-2.sh.o21216
########################################################################
# Colfax Cluster - https://colfaxresearch.com/
# Date: Mon Aug 14 05:05:39 PDT 2017
# Job ID: 21216.c001
# User: u4443
# Resources: neednodes=4:knl,nodes=4:knl,walltime=24:00:00
########################################################################
----------
Cluster nodes requested...
c001-n029
c001-n030
c001-n031
c001-n032
----------
Run 1 single-thread process on localhost...
MPI world rank 0 running on node c001-n029!
MPI world size = 1 processes
----------
Run 2 single-thread processes on localhost...
MPI world rank 1 running on node c001-n029!
MPI world rank 0 running on node c001-n029!
MPI world size = 2 processes
----------
Run 1 single-thread process on each cluster node requested...
MPI world rank 3 running on node c001-n032!
MPI world rank 0 running on node c001-n029!
MPI world size = 4 processes
MPI world rank 1 running on node c001-n030!
MPI world rank 2 running on node c001-n031!
----------
Run 1 single-thread process across available cluster node...
MPI world rank 0 running on node c001-n029!
MPI world size = 1 processes
----------
Run 2 single-thread processes across available cluster nodes...
MPI world rank 1 running on node c001-n030!
MPI world rank 0 running on node c001-n029!
MPI world size = 2 processes
----------
Run 4 single-thread processes across available cluster nodes...
MPI world rank 0 running on node c001-n029!
MPI world size = 4 processes
MPI world rank 1 running on node c001-n030!
MPI world rank 2 running on node c001-n031!
MPI world rank 3 running on node c001-n032!
----------
Run 10 single-thread processes across available cluster nodes...
MPI world rank 4 running on node c001-n029!
MPI world rank 1 running on node c001-n030!
MPI world rank 2 running on node c001-n031!
MPI world rank 3 running on node c001-n032!
MPI world rank 8 running on node c001-n029!
MPI world rank 5 running on node c001-n030!
MPI world rank 6 running on node c001-n031!
MPI world rank 7 running on node c001-n032!
MPI world rank 0 running on node c001-n029!
MPI world size = 10 processes
MPI world rank 9 running on node c001-n030!
########################################################################
# Colfax Cluster
# End of output for job 21216.c001
# Date: Mon Aug 14 05:05:57 PDT 2017
########################################################################
[u4443@c001 lec-05]$
Also view the error file hello-mpi-2.sh.e21216
(which I'm not sure how to interpret, yet):
[u4443@c001 lec-05]$ cat hello-mpi-2.sh.e21216
[3] DAPL startup: RLIMIT_MEMLOCK too small
[1] DAPL startup: RLIMIT_MEMLOCK too small
[2] DAPL startup: RLIMIT_MEMLOCK too small
[1] DAPL startup: RLIMIT_MEMLOCK too small
[3] DAPL startup: RLIMIT_MEMLOCK too small
[2] DAPL startup: RLIMIT_MEMLOCK too small
[1] DAPL startup: RLIMIT_MEMLOCK too small
[2] DAPL startup: RLIMIT_MEMLOCK too small
[3] DAPL startup: RLIMIT_MEMLOCK too small
[6] DAPL startup: RLIMIT_MEMLOCK too small
[7] DAPL startup: RLIMIT_MEMLOCK too small
[1] DAPL startup: RLIMIT_MEMLOCK too small
[5] DAPL startup: RLIMIT_MEMLOCK too small
[9] DAPL startup: RLIMIT_MEMLOCK too small
[u4443@c001 lec-05]$
(Note to self: how to interpret the above error file. Does this matter?)
In this post we've illustrated a simple yet powerful distributed computing and parallel programming example via a Hello World application. A single-threaded C++ program is spawned as multiple (dummy) parallel processes, distributed to run on multiple cluster nodes. We use MPI (Message Passing Interface) in our C++ program to coordinate these parallel processes. The output at the end should help us understand a bit about how the application works.
Note: should we wish to run multi-threaded processes on each node, we could integrate the OpenMP framework to do this; that is out of scope for this post and we may come back to it in a separate post in the future.