mpirun(1) General Commands Manual mpirun(1)
NAME
mpirun - Runs MPI programs
SYNOPSIS
mpirun [global_options] hp_spec [:hp_spec ...]
DESCRIPTION
The mpirun command is the primary job launcher for the Message Passing Toolkit (MPT) implementations of MPI and OpenSHMEM. The mpirun command must be used when a user wants to run one of these applications on SGI systems. In addition, Array Services software must be running to launch programs.
MPT implements the MPI 3.1 standard, as documented by the MPI Forum in the release of MPI: A Message Passing Interface Standard. However, several MPI implementations available today use a job launcher called mpirun, and because this command is not part of the MPI standard, each implementation's
mpirun command differs in both syntax and functionality.
You can run an application on the local host only (the host from which you issued mpirun) or distribute it to run on any number of hosts that you specify.
The mpirun command syntax consists of an optional global_options list followed by one or more groups of arguments called host-program specifications (hp_specs). The global_options apply to all MPI executable files on all specified hosts. Global options must be specified before local options
specific to a host-program combination (hp_spec).
The following global options are supported:
Global Option Description
-a[rray] array_name Specifies the array to use when launching an MPI application. By default for multihost jobs, Array Services uses the default array, which is identified by the ainfo dfltarray command. You can obtain other valid values by issuing the ainfo arrays command or by viewing the
/etc/array/arrayd.conf file.
-connect pid Connect to another mpirun, allowing jobs from different mpiruns to interact. See the Multiple mpiruns section below. The pid argument is the process ID of the base mpirun that specified -server.
-configfile file Directs MPT to look for the specified file and apply any environment variable settings found within. MPT will check in order the $PWD, $HOME, and $MPI_ROOT/etc/ directories on the launching host for the file. If it finds the file in one directory, it will not check later
directories. This feature is applied after MPT applies any environment variable settings from sgimpt.conf files.
-cpr Enables MPT's checkpoint-restart mode. Some performance optimizations may be disabled. See the mpt_checkpoint(1) man page.
-d[ir] path_name Specifies the working directory for all hosts. In addition to normal path names, the following special values are recognized:
. Translates into the absolute path name of the user's current working directory on the local host. This is the default.
~ Specifies the use of the value of $HOME as it is defined on each machine. In general, this value can be different on each machine.
-f[ile] file_name Specifies a text file that contains mpirun arguments.
-h[elp] Displays a list of options supported by the mpirun command.
-noconf Directs MPT not to search for any sgimpt.conf files. This has no effect on the -configfile option.
-p[refix] prefix_string Specifies a string to prepend to each line of output from stderr and stdout for each MPI process. To delimit lines of text that come from different hosts, output to stdout must be terminated with a newline character. If a process's stdout or stderr stream does not end with a newline character, there will be no prefix associated with the output or error streams of that process from the final newline to the end of the stream.
If the MPI_UNBUFFERED_STDIO environment variable is set, the prefix string is ignored.
Some strings have special meaning and are translated as follows:
* %g translates into the global rank of the process producing the output. This is usually equivalent to the rank of the process in MPI_COMM_WORLD. The global rank will be different from the MPI_COMM_WORLD rank if running in spawn capable mode, if MPT has coalesced some shepherd groups together, or if a tool is being used to remap the ranks. In these latter cases, this translates to the rank of the process within the universe specified at job launch.
* %G translates into the number of processes in MPI_COMM_WORLD, or, if running in spawn capable mode, the value of the MPI_UNIVERSE_SIZE attribute.
* %h translates into the rank of the host on which the process is running, relative to the mpirun command line. This string is not relevant for processes started via MPI_Comm_spawn or MPI_Comm_spawn_multiple.
* %H translates into the total number of hosts in the job. This string is not relevant for processes started via MPI_Comm_spawn or MPI_Comm_spawn_multiple.
* %l translates into the rank of the process relative to other processes running on the same host.
* %L translates into the total number of processes running on the host.
* %w translates into the world rank of the process, i.e., its rank in MPI_COMM_WORLD. When not running in spawn capable mode, this is equivalent to %g.
* %W translates into the total number of processes in MPI_COMM_WORLD. When not running in spawn capable mode, this is equivalent to %G.
* %@ translates into the name of the host on which the process is running.
For examples of the use of these strings, first consider the following code fragment:
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    printf("Hello world\n");
    MPI_Finalize();
    return 0;
}
Depending on how this code is run, the results of running the mpirun command will be similar to those in the following examples:
% mpirun -np 2 a.out
Hello world
Hello world
% mpirun -prefix ">" -np 2 a.out
>Hello world
>Hello world
% mpirun -prefix "%g" 2 a.out
0Hello world
1Hello world
% mpirun -prefix "[%g] " 2 a.out
[0] Hello world
[1] Hello world
% mpirun -prefix "<process %g out of %G> " 4 a.out
<process 1 out of 4> Hello world
<process 0 out of 4> Hello world
<process 3 out of 4> Hello world
<process 2 out of 4> Hello world
% mpirun -prefix "%@: " hosta,hostb 1 a.out
hosta: Hello world
hostb: Hello world
% mpirun -prefix "%@ (%l out of %L) %g: " hosta 2, hostb 3 a.out
hosta (0 out of 2) 0: Hello world
hosta (1 out of 2) 1: Hello world
hostb (0 out of 3) 2: Hello world
hostb (1 out of 3) 3: Hello world
hostb (2 out of 3) 4: Hello world
% mpirun -prefix "%@ (%h out of %H): " hosta,hostb,hostc 2 a.out
hosta (0 out of 3): Hello world
hostb (1 out of 3): Hello world
hostc (2 out of 3): Hello world
hosta (0 out of 3): Hello world
hostc (2 out of 3): Hello world
hostb (1 out of 3): Hello world
-progress Prints a bar across the screen displaying the progress of job launch. It is not printed if stderr is redirected to a non-terminal.
-server Causes mpirun to accept connections from other mpirun instances, allowing jobs from different mpiruns to interact. See the Multiple mpiruns section below.
-stats Prints statistics about the amount of data sent with MPI calls during the MPI_Finalize process. Data is sent to stderr. Users can combine this option with the -p option to prefix the statistics messages with the MPI rank. For more details, see the MPI_SGI_stat_print(3)
man page.
-up u_size Specifies the value of the MPI_UNIVERSE_SIZE attribute to be used in supporting MPI_Comm_spawn and MPI_Comm_spawn_multiple. This option must be set if either of these functions is to be used by the application being launched by mpirun; setting this option causes the MPI job to be run in spawn capable mode. By default, additional MPI processes will be spawned on the local host where mpirun is running. See the section Launching Spawn Capable Jobs below for information on how to spawn additional processes on other hosts.
-v[erbose] Displays comments on what mpirun is doing when launching the MPI application.
hp_spec Syntax
The host-program specification (hp_spec) describes a host on which to run a shepherd group, the number of processes to run, the program to run, and the local options for that shepherd group combination. A shepherd group is a set of processes on a single host that communicate with each other through shared memory. You can list any number of hp_spec items on the mpirun command line.
In the common case (Single Program Multiple Data (SPMD)), in which the same program runs with identical arguments on each host, usually only one hp_spec needs to be specified.
Each hp_spec has the following syntax:
[host_list] [local_options] [-np/-n] pcount program [args]
Component       Description
host_list       One or more host names separated by commas and optional spaces. One shepherd group is started for each item. If host_list is not provided, the MPI processes are started on the local host.
local_options   Local options that apply to these shepherd groups. The following local option is supported:
                -f[ile] file_name
                    Specifies a text file that contains mpirun arguments (same as the global -f[ile] option). For more details, see the subsection "Using a File for mpirun Arguments" on this man page.
pcount          Number of processes to start for the given program in each shepherd group. This number may optionally be preceded by "-np".
program         Name of an executable program.
args            Arguments to the executable program.
The hp_specs and the host list are processed in sequence when
assigning MPI rank numbers within MPI_COMM_WORLD to MPI processes
in the running application.
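For example, in the following illustrative command (host and program names are hypothetical), MPI_COMM_WORLD ranks 0-1 run a.out on host_a and ranks 2-4 run b.out on host_b:
mpirun host_a 2 a.out : host_b 3 b.out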
Using a File for mpirun Arguments
Because the full specification of a complex job can be lengthy, you can enter mpirun arguments in a file and use the -f option to specify the file on the mpirun command line, as in the following example:
mpirun -f my_arguments
The arguments file is a text file that contains argument segments. White space is ignored in the arguments file, so you can include spaces and newline characters for readability. An arguments file can also contain additional -f options.
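For instance, a file named my_arguments (both the name and the contents here are illustrative) could spread one host-program specification across several lines:
host_a, host_b, host_c
-np 10
./fred arg1 arg2
Running mpirun -f my_arguments is then equivalent to passing the same arguments directly on the command line.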
Example: Running MPI Programs on the Local Host
To run an MPI application on the local host, simply enter mpirun with the number of processes and the executable name. The following command starts MPI application a.out using 3 processes on the local host:
mpirun 3 a.out
or
mpirun -np 3 a.out
If you have a heterogeneous application with 2 processes running as a.out and 4 processes running as b.out, you can use the following mpirun command:
mpirun 2 a.out : 4 b.out
On large systems, a batch scheduling system should normally associate the user job script and the mpirun command with a cpuset created for the job. Set the MPI_DSM_DISTRIBUTE environment variable so that MPI processes run on CPUs with best ccNUMA placement. See the mpi(1) man page for more
about MPI_DSM_DISTRIBUTE.
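A minimal job-script sketch (the script body is illustrative, and the variable's accepted values are assumed; see mpi(1) for the exact semantics of MPI_DSM_DISTRIBUTE):
#!/bin/csh
# enable ccNUMA-aware placement of MPI ranks (see mpi(1))
setenv MPI_DSM_DISTRIBUTE
mpirun -np 16 ./a.out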
Example: Running MPI Programs on Clusters
You can use mpirun to launch a program that consists of any number of executable files and processes and distribute it to any number of hosts in your cluster. The hosts in your cluster are known to Array Services, and can be identified by the following command:
ainfo machines
or the following file:
/etc/array/arrayd.conf
The following examples show various ways to launch an application that consists of multiple MPI executable files or multiple hosts.
The following example runs MPI application a.out with 8 processes on host_a:
mpirun host_a 8 a.out
The following example runs MPI application fred with a total of 30 processes, distributed 10 per host on 3 hosts. The program takes two command line arguments. This will result in MPI_COMM_WORLD ranks 0-9 running on host_a, 10-19 on host_b and 20-29 on host_c.
mpirun host_a,host_b,host_c 10 fred arg1 arg2
The following command runs MPI program foo with 2 processes on host_x and 4 processes on host_y. An array name "cluster2" is specified in this case, because host_x and host_y are assumed not to be in the default array (see ainfo(1)).
mpirun -a cluster2 host_x 2 foo : host_y 4 foo
Launching Spawn Capable Jobs
When running MPI applications that use the MPI spawn functions MPI_Comm_spawn or MPI_Comm_spawn_multiple, the user needs to specify information that enables the spawn capability. This information includes the maximum number of MPI processes in the job, called the MPI job universe size, the
hosts on which to spawn the processes, and the number of processes to spawn on each host.
The MPI job universe size can be specified explicitly by using the -up option of mpirun(1) or by setting the MPI_UNIVERSE_SIZE environment variable. It can also be established implicitly by the more detailed information in the MPI_UNIVERSE variable described below.
The hosts on which MPI launches the new MPI processes are established as follows:
- If the -spawn option is used with mpiexec_mpt, then it will automatically set MPI_UNIVERSE from the set of resources reserved for the job.
- If the MPI_UNIVERSE environment variable is set, it identifies the hosts to launch on. The syntax for this variable is a list of hp_specs, but without a specified application or argument list. The following are examples of valid MPI_UNIVERSE values:
"host_a, host_b"
"host_a 8, host_b 16"
"host_a, host_b 12"
"host_a, host_b -np 16"
If MPI_UNIVERSE is not specified, MPI spawn requests will place new processes on the local host.
- The user may pass host information to MPI_Comm_spawn or MPI_Comm_spawn_multiple in info objects. When this method is used, the hosts identified may be outside of the list of hosts from MPI_UNIVERSE. The supported info keys and the values associated with them are:
Info Key Value String
hostfile Name of file containing the list of hosts on which to spawn MPI processes. Host names are separated by space or tab characters.
MPI_SGI_NODELIST Space or tab-separated list of hosts on which to spawn MPI processes.
MPI Spawn Examples
1. This example shows how to set the MPI_UNIVERSE shell variable to enable spawning of processes on two hosts with hostnames host_a and host_b:
setenv MPI_UNIVERSE "host_a, host_b 8"
mpirun -up 16 -np 1 coupler
In this example, the MPI_UNIVERSE_SIZE is 16. Processes started via mpirun count as part of the universe. The coupler runs on the local host where the mpirun was executed. In this case, the local host must be either host_a or host_b. Note that the -up argument is not required if the MPI_UNIVERSE shell variable is set.
2. This example shows how to set the MPI_UNIVERSE shell variable to allow different numbers of MPI processes on different hosts:
setenv MPI_UNIVERSE "host_a 16, host_b 8"
mpirun host_b 1 coupler
In this example, the MPI_UNIVERSE_SIZE is 24. mpirun is used to start one instance of the coupler application on host_b. The -up argument has been omitted as this is not necessary if the MPI_UNIVERSE shell variable is set.
3. This example calls MPI_Comm_spawn using the MPI_SGI_NODELIST and hostfile info keys to specify the hosts on which to launch MPI processes.
MPI_Info info;
MPI_Info_create(&info);
MPI_Info_set(info, "MPI_SGI_NODELIST", "host_b host_b");
MPI_Comm_spawn("b.out", MPI_ARGV_NULL, 2, info,...);
or
char *list = "host_b host_b";
int fd = open("list.txt", O_WRONLY | O_CREAT | O_TRUNC, 0644);
write(fd, list, strlen(list));
close(fd);
MPI_Info_set(info, "hostfile", "list.txt");
MPI_Comm_spawn("b.out", MPI_ARGV_NULL, 2, info,...);
export MPI_UNIVERSE="host_a 4"
mpirun r1i0n0 -up 6 host_a -np 3 coupler
In both cases, two b.out processes will be run on host_b. If the coupler program spawns any processes that do not specify the desired hosts in their info argument, they will be placed within the defined MPI_UNIVERSE. At any single point in time, the sum of starting processes, processes launched onto the hosts in MPI_UNIVERSE, and processes launched onto specified hosts cannot be greater than MPI_UNIVERSE_SIZE.
Multiple mpiruns
MPI applications started by different mpirun instances may interact if so desired. One mpirun must run in spawn-capable mode and specify the -server option to cause it to act as a server and listen for connections. Other mpiruns on the same host may connect to it by using the -connect pid option with the server's process ID. Applications run by the mpiruns then operate in the same MPI universe and may use the standard MPI-2 operations to discover and interact with each other. The MPI universe must contain enough process slots to accommodate the requested processes from all the different mpiruns.
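A hypothetical two-shell session on one host (the pid 12345 and the program names are illustrative):
% mpirun -up 8 -server 2 server_app        (note this mpirun's process ID, here 12345)
% mpirun -connect 12345 2 client_app       (issued from a second shell on the same host)
Both jobs then share one MPI universe of 8 slots and can find each other with standard MPI-2 calls such as MPI_Comm_accept and MPI_Comm_connect.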
Launching Non-MPT Applications
mpirun can also be used to launch non-MPT applications. These are non-MPI / non-OpenSHMEM applications. MPT will launch copies of these applications the same way it launches regular parallel applications. Stdin is not supported in this environment. MPI_SHEPHERD should be set to true in order to
achieve the needed launch behavior.
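For example, to run four copies of a serial utility (the program name is arbitrary):
% setenv MPI_SHEPHERD true
% mpirun -np 4 ./serial_tool
The four copies are launched just as MPI ranks would be, but run independently since they never call MPI_Init.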
Prefixing Commands
Some applications may wish to launch a helper binary on each node right before the main executable is started. This binary wraps and launches the rest of the command. This is normally accomplished by modifying the command line. It may also be accomplished by setting the MPI_PREFIX_CMD environment variable. As an example:
$ mpirun -np 2 helper -l a.out
Alternative method:
$ export MPI_PREFIX_CMD="helper -l"
$ mpirun -np 2 a.out
Job Control
It is possible to terminate, suspend, and/or resume an entire MPI application (potentially running across multiple hosts) by using the same control characters that work for serial programs. For example, sending a SIGINT signal to mpirun terminates all processes in an MPI job. Similarly, sending a
SIGTSTP signal to mpirun suspends an MPI job and sending a SIGCONT signal resumes a job.
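For example, if mpirun is running with process ID 12345 (an illustrative pid):
% kill -TSTP 12345      (suspends all processes in the MPI job)
% kill -CONT 12345      (resumes the job)
% kill -INT 12345       (terminates the job)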
Signal Propagation
It is possible to send some signals to all processes in an MPI application (potentially running across multiple hosts). Presently, mpirun supports the propagation of four signals: SIGURG, SIGUSR1, SIGINT, and SIGTERM. To make use of this feature, the application needs to have a signal handler that catches the respective signal. When the signal is sent to the mpirun process ID, mpirun will catch the signal and propagate it to all MPI / SHMEM processes. If SIGINT is sent twice, the application will be forcibly terminated.
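A minimal sketch of such a handler in the application (the choice of SIGUSR1 and the handler body are illustrative; any of the four propagated signals can be caught the same way):
#include <signal.h>
#include <mpi.h>

static volatile sig_atomic_t got_signal = 0;

static void handler(int sig)
{
    got_signal = 1;   /* just record the signal; act on it outside the handler */
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    signal(SIGUSR1, handler);   /* each rank installs its own handler */
    /* ... application work; check got_signal at convenient points ... */
    MPI_Finalize();
    return 0;
}
Sending the signal to the mpirun process ID (for example, kill -USR1 12345, with an illustrative pid) then delivers it to every MPI / SHMEM process.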
Troubleshooting
Problems you encounter when launching MPI jobs often result in a "could not run executable" error message from mpirun. There are many possible causes for this message, including (but not limited to) the following reasons:
* The . is missing from the user's search path and the program name was specified on the mpirun command line without a full path specification.
* A dynamic library in the user program cannot be found. Try "ldd a.out" to check for dynamic link errors or missing libraries.
* A problem with the XPMEM or InfiniBand interconnect was detected. View /var/log/messages for recent MPI diagnostic messages.
* The system has not been configured for MPT. See the "System Configuration" section in the README.relnotes MPT Release Notes file for more information.
* You have prefixed the executable command with a profiling tool. In many cases tools expect a separate fork and exec of the application for each rank. MPT does an exec once on each host and then forks for each rank. To give the behavior these tools expect, run with MPI_SHEPHERD=true.
* On partitioned systems, XPC might not be activated. See the "Installing Partitioning Software and Configuring Partitions" section of the Linux Configuration and Operation Guide.
* Array Services or Secure Array Services is not installed or has been incorrectly configured; use the following commands to check your configuration:
rpm -qa '*array*'
ainfo machines
ascheck
array uptime
See more troubleshooting help in Chapter 8, "Troubleshooting and Frequently Asked Questions", in the Message Passing Toolkit (MPT) User's Guide.
Limitations
The following practices will break the mpirun parser:
* Using machine names that are numbers (for example, 3, 127, and so on)
* Using MPI applications whose names match mpirun options (for example, -d, -f, and so on)
* Using MPI applications that use a colon (:) in their command lines.
* Array Services is running with authentication NOREMOTE but a hostname to run on was specified to mpirun. See arrayconfig(8).
NOTES
Running an MPI job in the background is supported only when stdin is redirected.
The mpirun process is still connected to the tty when a job is placed in the background. One of the things that mpirun polls for is input from stdin. If it happens to be polling for stdin when a user types in a window after putting an MPI job in the background, and stdin has not been redirected,
the job will abort upon receiving a SIGTTIN signal. This behavior is intermittent, depending on whether mpirun happens to be looking for and sees any stdin input.
The following examples show how to run an MPI job in the background.
For a job that uses input_file as stdin:
mpirun -np 2 ./a.out < input_file > output &
For a job that does not use stdin:
mpirun -np 2 ./a.out < /dev/null > output &
RETURN VALUES
On exit, mpirun returns the appropriate error code to the run environment.
SEE ALSO
mpiexec_mpt(1)
mpi(1) has MPI run-time information and pointers to MPI documentation.
termio(7)