MPI (Message Passing Interface) is a paradigm of parallel programming in which multiple processes
communicate with each other by exchanging messages; we will
also use this name for the library that provides the tools for writing
programs in this paradigm.
There are (known to me) implementations of MPI that allow coding in
FORTRAN, C and C++. These notes deal with the C++ bindings only.
There is a single program that is executed by all processes; the control flow within the code is differentiated according to the process ID, and arranging this is the programmer's job.
A group is an ordered set of processes; each process has its own ID, called its rank. Ranks are contiguous and start from 0. Groups are objects of class MPI::Group. While it is possible to access some parameters of the group directly, it is much more common to operate through an intragroup communicator associated with the group.
The processes communicate with each other through communicators, which are objects of either the intragroup communicator class MPI::Intracomm (point-to-point communication within a single group) or the intergroup communicator class MPI::Intercomm (communication between two groups of processes). Both classes are derived from the abstract base class MPI::Comm.
The total number of processes in the group associated with an
MPI::Intracomm object is returned by
int MPI::Comm::Get_size() const,
and the rank of a process within that group, which is between 0
and Get_size()-1, is returned by
int MPI::Comm::Get_rank() const.
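To illustrate the single-program model, here is a minimal sketch (mine, not from any particular source; the printed messages are arbitrary) in which every process executes the same code and branches on its rank:

#include <mpi.h>
#include <iostream>

int main(int argc, char* argv[]) {
    MPI::Init(argc, argv);                    // start the MPI environment

    int size = MPI::COMM_WORLD.Get_size();    // number of processes in the group
    int rank = MPI::COMM_WORLD.Get_rank();    // this process's rank, 0..size-1

    if (rank == 0)
        std::cout << "running on " << size << " processes" << std::endl;
    else
        std::cout << "process " << rank << " reporting" << std::endl;

    MPI::Finalize();                          // shut down MPI
    return 0;
}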
Given an intracommunicator, the
corresponding group can be partitioned
into disjoint subgroups by calling
MPI::Intracomm MPI::Intracomm::Split(int color, int key) const.
For each value of color, a new subgroup is created that
contains all the processes that supplied that color. The second argument,
key, specifies the rank that should be assigned to the process
within the new group (ties are broken by comparing the ranks in the
"old" group). A new communicator is created for each new group and is
returned by Split(). If the value of color supplied
by a process is MPI::UNDEFINED, that process will not be
included in any of the new groups.
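As an illustration, the following sketch (mine; it assumes MPI has already been initialized) splits MPI::COMM_WORLD into a subgroup of even-ranked processes and a subgroup of odd-ranked ones, keeping the original relative order:

int world_rank = MPI::COMM_WORLD.Get_rank();

// color 0 for even world ranks, 1 for odd ones; key = world_rank preserves the order
MPI::Intracomm subcomm = MPI::COMM_WORLD.Split(world_rank % 2, world_rank);

int sub_rank = subcomm.Get_rank();   // rank within the new (even or odd) group
int sub_size = subcomm.Get_size();   // size of the new group

A process that passes MPI::UNDEFINED as the color receives MPI::COMM_NULL instead of a valid communicator.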
Every point-to-point message carries an envelope that identifies it:

Field | Meaning |
---|---|
source | the rank of the sending process |
destination | the rank of the intended receiving process |
tag | an integer identifying the type of the message |
communicator | the communicator within which the message is sent |
The basic tool for sending a message is
void MPI::Comm::Send(const void* buf, int count, const MPI::Datatype& datatype, int dest, int tag) const;
where
buf is the address of the buffer with the data to be sent;
count is the number of instances of datatype to be sent;
dest is the rank of the process to which the message is sent;
tag is the tag (message type); it is an integer between 0 and MPI::TAG_UB.
The basic tool for receiving messages is
void MPI::Comm::Recv(void* buf, int max_count, const MPI::Datatype& datatype, int source, int tag, MPI::Status& status) const;
where
buf is the address where the received data should be placed (it must be able to contain at least max_count elements of type datatype);
source is the rank of the sending process;
tag is the tag of the message (Recv will block until a message with an envelope matching the values of tag and source arrives);
status (optional) contains the information on the result of the operation.
Both source and tag values allow a wildcard
specification: MPI::ANY_SOURCE and MPI::ANY_TAG, respectively.
An object of class MPI::Status allows the following queries:
The number of entries received in the call to Recv is
available through a call to
int MPI::Status::Get_count(const MPI::Datatype& datatype) const.
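Putting Send, Recv and the status queries together, here is a sketch (mine; a fragment assuming MPI has been initialized and at least two processes are running) in which process 1 sends 100 integers to process 0, which receives them with wildcards and inspects the status (Get_source() and Get_tag() are further MPI::Status queries):

int data[100];
MPI::Status status;
int rank = MPI::COMM_WORLD.Get_rank();

if (rank == 1) {
    for (int i = 0; i < 100; ++i) data[i] = i;
    // send 100 ints to process 0 with tag 7
    MPI::COMM_WORLD.Send(data, 100, MPI::INT, 0, 7);
} else if (rank == 0) {
    // receive at most 100 ints from any source, with any tag
    MPI::COMM_WORLD.Recv(data, 100, MPI::INT, MPI::ANY_SOURCE, MPI::ANY_TAG, status);
    int n   = status.Get_count(MPI::INT);   // actual number of ints received
    int src = status.Get_source();          // rank of the sender
    int tag = status.Get_tag();             // tag of the received message
    std::cout << "got " << n << " ints from " << src << " with tag " << tag << std::endl;
}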
Synchronization between the processes in a group is achieved
through a call to
void MPI::Comm::Barrier(void) const
- the calling process is blocked until all the group members have
called Barrier.
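A typical use (a sketch of mine, not from the notes; do_local_work() is a placeholder for the code being timed) is to line the processes up before and after a timed section:

MPI::COMM_WORLD.Barrier();          // make sure everyone starts timing together
double t0 = MPI::Wtime();

do_local_work();                    // placeholder for the computation being timed

MPI::COMM_WORLD.Barrier();          // wait until the slowest process is done
double elapsed = MPI::Wtime() - t0;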
The collective communications usually involve a single process (the
root) communicating with all processes in the group (itself
included). There are a number of different functionalities in such
communications. First, the root might need to "broadcast" a value to
all processes. This is done by
void MPI::Intracomm::Bcast(void* buffer, int count, const MPI::Datatype& datatype, int root) const;
(the value of root must be identical at all processes). E.g., after the
call
comm.Bcast(int_arr, 100, MPI::INT, 0)
the 100 elements of the integer array int_arr in all
processes are copies of those elements in process 0.
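A slightly fuller sketch of the same idea (mine; the variable name is arbitrary), in which only the root knows the value initially and every process has it after the call:

double tolerance;
if (MPI::COMM_WORLD.Get_rank() == 0)
    tolerance = 1.0e-6;             // only process 0 sets the value

// afterwards, tolerance equals 1.0e-6 in every process of the group
MPI::COMM_WORLD.Bcast(&tolerance, 1, MPI::DOUBLE, 0);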
In "gathering", each process (root itself included) sends a value to the root, which stores
all the incoming values:
void MPI::Intracomm::Gather(const void* sendbuf, int sendcount, const MPI::Datatype& sendtype, void* recvbuf, int recvcount, const MPI::Datatype& recvtype, int root) const;
- the semantics being that each process sends a message
(sendcount elements of type sendtype) from its
sendbuf, and the root concatenates, in rank order, all the received messages
(recvcount elements of recvtype from each process) at its recvbuf. Note that the root must
have recvbuf large enough to contain all the messages
(usually the return value of Comm::Get_size() is used in the
allocation), while the rest of the processes ignore recvbuf
and thus need not allocate it at all.
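For example, in the following sketch (mine; some_local_computation() is a placeholder) every process contributes a single double and the root collects them, in rank order, into an array sized with Get_size():

int rank = MPI::COMM_WORLD.Get_rank();
int size = MPI::COMM_WORLD.Get_size();

double my_result = some_local_computation(rank);   // placeholder per-process value

double* all_results = 0;
if (rank == 0)
    all_results = new double[size];     // only the root needs the receive buffer

// all_results[i] ends up holding the value sent by process i
MPI::COMM_WORLD.Gather(&my_result, 1, MPI::DOUBLE, all_results, 1, MPI::DOUBLE, 0);

if (rank == 0) {
    // ... use all_results ...
    delete[] all_results;
}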
An alternative, more flexible variant is provided in the form of
vector gather:
void MPI::Intracomm::Gatherv(const void* sendbuf, int sendcount, const MPI::Datatype& sendtype, void* recvbuf, const int recvcounts[], const int displs[], const MPI::Datatype& recvtype, int root) const;
Here each process can send a message of a different length: recvcounts[i] is the number of elements expected from process i, and the message from process i is placed in recvbuf starting at offset displs[i] (in units of recvtype).
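A sketch of a possible use (mine; cleanup of the allocated arrays is omitted), in which process i contributes i+1 doubles and the root packs the contributions one after another:

int rank = MPI::COMM_WORLD.Get_rank();
int size = MPI::COMM_WORLD.Get_size();

int mycount = rank + 1;                          // process i sends i+1 values
double* sendbuf = new double[mycount];
for (int i = 0; i < mycount; ++i) sendbuf[i] = rank;

int* recvcounts = 0;
int* displs = 0;
double* recvbuf = 0;
if (rank == 0) {
    recvcounts = new int[size];
    displs     = new int[size];
    int total = 0;
    for (int i = 0; i < size; ++i) {
        recvcounts[i] = i + 1;                   // how many elements to expect from process i
        displs[i]     = total;                   // where in recvbuf to place them
        total        += recvcounts[i];
    }
    recvbuf = new double[total];
}

MPI::COMM_WORLD.Gatherv(sendbuf, mycount, MPI::DOUBLE,
                        recvbuf, recvcounts, displs, MPI::DOUBLE, 0);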
These predefined operations can be used in the reduction calls (e.g., MPI::Intracomm::Reduce(), which combines a value contributed by every process in the group and delivers the result to the root):

Operation | Meaning |
---|---|
MPI::MAX | Maximum |
MPI::MIN | Minimum |
MPI::SUM | Sum |
MPI::PROD | Product |
MPI::LAND | Logical (boolean) and |
MPI::LOR | Logical or |
MPI::MINLOC | Minimum, and the location of the minimum (the rank of the process that has the min). Similar to MATLAB's [minval,minind]=min(...) |
MPI::MAXLOC | Maximum, and the location of the maximum |
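The reduction call itself is not spelled out above; a sketch (mine; compute_partial_sum() is a placeholder, and the Reduce signature follows the MPI-2 C++ bindings) of a global sum delivered to process 0:

double local_sum = compute_partial_sum();   // placeholder for a per-process contribution
double global_sum = 0.0;

// process 0 receives the sum of all local_sum values; the other processes ignore global_sum
MPI::COMM_WORLD.Reduce(&local_sum, &global_sum, 1, MPI::DOUBLE, MPI::SUM, 0);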
Some of the predefined MPI datatypes and the C++ types they correspond to are:

MPI datatype | C++ datatype |
---|---|
MPI::CHAR | signed char |
MPI::INT | signed int |
MPI::DOUBLE | double |
MPI::BOOL | bool |
It is possible to derive a new datatype from the existing ones, using
one of the constructors:
MPI::Datatype MPI::Datatype::Create_contiguous(int count) const
creates a type whose values are a contiguous replication of values of
the old type. For example,
MPI::Datatype newtype = MPI::DOUBLE.Create_contiguous(4);
makes newtype an array of 4 doubles.
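Before the new type can be used it must be committed (see below); a sketch (mine; destination rank and tag are arbitrary) of sending one such block of 4 doubles to process 1:

double block[4] = {1.0, 2.0, 3.0, 4.0};

MPI::Datatype newtype = MPI::DOUBLE.Create_contiguous(4);
newtype.Commit();                               // required before the type can be used

// one instance of newtype corresponds to four contiguous doubles
MPI::COMM_WORLD.Send(block, 1, newtype, 1, 0);
newtype.Free();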
The most general type constructor is
static MPI::Datatype MPI::Datatype::Create_struct(int count, const int array_of_blocklengths[], const MPI::Aint array_of_displacements[], const MPI::Datatype array_of_types[])
For example, let B = [2,1,4], D = [0,16,24], T = [MPI::DOUBLE, MPI::INT, MPI::CHAR]. Then the call
MPI::Datatype::Create_struct(3,B,D,T) creates a layout in memory
with a block of 2 doubles at displacement 0, one int at displacement 16, and 4 chars at displacement 24.
It is much easier to compute the displacements using the function
MPI::Aint MPI::Get_address(void* location)
e.g., in the example above it can be done as follows:
struct S {
    double a;
    double b;
    int x;
    char c[4];
};

MPI::Datatype type[3] = {MPI::DOUBLE, MPI::INT, MPI::CHAR};
int blocklen[3] = {2, 1, 4};
MPI::Aint disp[3];
struct S data[10];

// displacements are measured from the beginning of the structure
MPI::Aint base = MPI::Get_address(&data[0]);
disp[0] = MPI::Get_address(&data[0].a) - base;   // the two doubles a, b
disp[1] = MPI::Get_address(&data[0].x) - base;   // the int
disp[2] = MPI::Get_address(&data[0].c) - base;   // the four chars
MPI::Datatype newtype = MPI::Datatype::Create_struct(3, blocklen, disp, type);
newtype.Commit();
...
Before a derived type can be used in MPI communications, it should be
committed:
void MPI::Datatype::Commit(void)
and the handle can be freed, after the program is done using the type,
by calling
void MPI::Datatype::Free(void)
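Continuing the structure example above, a sketch (mine; the destination rank and tag are arbitrary) of how the committed type might be used and then released:

// send the first structure (one instance of newtype) to process 1 with tag 0
MPI::COMM_WORLD.Send(&data[0], 1, newtype, 1, 0);

// release the datatype handle once it is no longer needed
newtype.Free();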