MPI: parallel computing 1/2
Introduction
The Message Passing Interface (MPI) is a library for C and Fortran, built for running multiple processes in a distributed-memory fashion, where all data exchange happens through explicit communication. It is flexible enough to also work on shared-memory machines if needed.
It offers a large number of functions, but you can already get programs working with a minimal subset of about six commands (typically MPI_Init, MPI_Finalize, MPI_Comm_rank, MPI_Comm_size, MPI_Send and MPI_Recv).
Structure of an MPI program
The following structure applies to all programs:
- include the MPI header file
- declare variables
- MPI_Init(&argc, &argv);
- compute, communicate, etc.
- MPI_Finalize();
Putting everything together, we get the following structure:
// include the MPI header file
#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv){
    // declare variables
    int my_rank, n_processes;
    // init the MPI environment
    MPI_Init(&argc, &argv);
    // compute, communicate, etc.
    // close the MPI environment
    MPI_Finalize();
    return 0;
}
Program execution
Running the program first requires building it via
mpicc program.c -o program.out
After that, running is as easy as just calling
mpirun -n 2 ./program.out
where -n specifies the number of processes.
MPI constants and handles
MPI constants are usually integers and always capitalized (MPI_COMM_WORLD, MPI_INT, MPI_SUCCESS, etc.).
Some constants, such as MPI_COMM_WORLD and MPI_INT, are also “handles”. A handle refers to an internal MPI data structure: when you pass MPI_COMM_WORLD as an argument to a subroutine, MPI uses the referenced communicator, which contains all processes.
Rank
How to get your rank:
int my_rank;
// get my rank
MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
Comm_sz (number of processes)
How to get the size of the communicator:
int comm_sz;
// get the number of processes
MPI_Comm_size(MPI_COMM_WORLD, &comm_sz);
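Putting the two calls into the skeleton from above gives a minimal working example (a sketch for illustration; the print order across processes is not deterministic):
#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv){
    int my_rank, comm_sz;
    MPI_Init(&argc, &argv);
    // every process asks for its own rank and the total number of processes
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &comm_sz);
    printf("Hello from process %d of %d\n", my_rank, comm_sz);
    MPI_Finalize();
    return 0;
}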
Point-to-point communication
MPI_Send
- blocking: returns once the data has been copied out of send_buf and the buffer can safely be reused (it may or may not wait for the matching receive)
MPI_Send(
send_buf,
count,
dtype,
dest,
tag,
comm
);
MPI_Recv
- blocking: returns only after a matching message has been received into recv_buf
MPI_Recv(
recv_buf,
count,
dtype,
source,
tag,
comm,
status /* MPI_STATUS_IGNORE */
);
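As a small sketch (assuming my_rank was obtained as shown above and the program runs with at least two processes), process 0 sends one integer to process 1:
int number;
if(my_rank == 0){
    number = 42;
    // send one int to process 1, using tag 0
    MPI_Send(&number, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
}
else if(my_rank == 1){
    // receive one int from process 0 with tag 0; the status is not needed here
    MPI_Recv(&number, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    printf("process 1 received %d\n", number);
}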
Collective Communication routines
- collective communication involves all processes of a communicator
- the routines can be classified into one-to-many, many-to-one and many-to-many structures
MPI_Barrier
- blocks each process until every process has called the barrier routine
- semantics: synchronizes all processes (e.g. wait until a task is finished everywhere)
- use MPI_Barrier as little as possible, since the synchronization adds overhead
Pseudocode:
- first send a message around a ring (P0 to P1, P1 to P2, ..., P(n-1) to P0)
- once the message has gone all the way around, let P0 broadcast to every process that all processes have reached the barrier
int MPI_Barrier(comm){
    // declare variables
    int my_rank, comm_sz;
    // get my rank
    MPI_Comm_rank(comm, &my_rank);
    // get the number of processes
    MPI_Comm_size(comm, &comm_sz);
    if(my_rank==0){
        // process 0: send a message to process 1
        MPI_Send(NULL, 0, MPI_INT, (my_rank+1)%comm_sz, 0, comm);
        // and wait until the message has gone around the ring and arrives from the last rank
        MPI_Recv(NULL, 0, MPI_INT, comm_sz-1, 0, comm, MPI_STATUS_IGNORE);
    }
    // all other processes
    else{
        // wait until the direct predecessor has reached the barrier and sends its message
        MPI_Recv(NULL, 0, MPI_INT, my_rank-1, 0, comm, MPI_STATUS_IGNORE);
        // then pass the message on to the process with the next rank
        MPI_Send(NULL, 0, MPI_INT, (my_rank+1)%comm_sz, 0, comm);
    }
    // finally block all processes until every other process has reached the barrier:
    // process 0 only calls the broadcast after receiving the message from the last rank,
    // and only then can every process return from this subroutine
    MPI_Bcast(NULL, 0, MPI_INT, 0, comm);
    return MPI_SUCCESS;
}
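A typical usage sketch (not from the slides): synchronizing all processes before and after a timed region, so that MPI_Wtime measures roughly the same interval on every process:
// make sure every process has finished its previous work
MPI_Barrier(MPI_COMM_WORLD);
double t_start = MPI_Wtime();
// ... timed computation and communication ...
MPI_Barrier(MPI_COMM_WORLD);
double t_elapsed = MPI_Wtime() - t_start;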
MPI_Bcast
- one-to-many
Example call:
MPI_Bcast(
message, /* buffer */
1, /* count */
MPI_INT, /* dtype */
3, /* root */
MPI_COMM_WORLD /* comm */
);
Pseudo-Implementation:
int MPI_Bcast(message, count, dtype, root, comm){
    // declare variables
    int my_rank, comm_sz;
    // get my rank
    MPI_Comm_rank(comm, &my_rank);
    // get the number of processes
    MPI_Comm_size(comm, &comm_sz);
    // root process: send the message to every other process
    if(my_rank==root){
        for(int i=0; i<comm_sz; i++){
            // the root already holds the message, so it skips itself
            if(i==root) continue;
            MPI_Send(message, count, dtype, i, 0, comm);
        }
    }
    // all other processes: receive the message
    else{
        MPI_Recv(message, count, dtype, root, 0, comm, MPI_STATUS_IGNORE);
    }
    return MPI_SUCCESS;
}
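Usage sketch (assuming my_rank is set as above): process 0 determines a parameter and broadcasts it so that every process works with the same value:
int n_steps;
if(my_rank == 0){
    // only process 0 knows the value, e.g. from user input or a config file
    n_steps = 1000;
}
// after the call, every process in MPI_COMM_WORLD holds the same value of n_steps
MPI_Bcast(&n_steps, 1, MPI_INT, 0, MPI_COMM_WORLD);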
MPI_Reduce
MPI_Reduce(
send_buf, /* send buffer */
recv_buf, /* receive buffer */
1, /* count */
MPI_INT, /* dtype */
MPI_SUM, /* operation */
3, /* root */
MPI_COMM_WORLD /* comm */
);
List of possible reduction operations:
- MPI_MIN / MPI_MAX
- MPI_MINLOC / MPI_MAXLOC (give the min/max value together with the rank that holds it, like argmin/argmax)
- MPI_SUM / MPI_PROD
- MPI_LAND / MPI_LOR (logical and/or)
- etc.
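Usage sketch (assuming my_rank is set as above): summing one value per process into a total that ends up only on the root, here rank 0:
int local_value = my_rank;   // some per-process value, here simply the rank
int global_sum = 0;          // only meaningful on the root after the call
MPI_Reduce(&local_value, &global_sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
if(my_rank == 0){
    printf("sum over all ranks: %d\n", global_sum);
}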
Pseudo-Implementation:
int MPI_Reduce(send_buf, recv_buf, count, dtype, operation, root, comm){
    // declare variables
    int my_rank, comm_sz;
    // get my rank
    MPI_Comm_rank(comm, &my_rank);
    // get the number of processes
    MPI_Comm_size(comm, &comm_sz);
    if(my_rank==root){
        // start with the root's own contribution (schematic working variable)
        result = send_buf;
        // collect the item of every other process
        for(int i=0; i<comm_sz; i++){
            if(i==root) continue;
            MPI_Recv(recv_buf, count, dtype, i, 0, comm, MPI_STATUS_IGNORE);
            // apply the reduction operation (schematic, an MPI_Op is not directly callable)
            result = operation(result, recv_buf);
        }
        // copy the final result to the recv_buf
        recv_buf = result;
    }
    // all other processes: send an item
    else{
        MPI_Send(send_buf, count, dtype, root, 0, comm);
    }
    return MPI_SUCCESS;
}
MPI_Allreduce
- every process ends up with the result of the reduction (not only the root)
- combine MPI_Reduce to do the reduction and MPI_Bcast to send the result to all processes
int MPI_Allreduce(send_buf, recv_buf, count, dtype, operation, comm){
    // perform the reduction with process 0 as root
    MPI_Reduce(
        send_buf,   /* send buffer */
        recv_buf,   /* receive buffer */
        count,      /* count */
        dtype,      /* dtype */
        operation,  /* operation */
        0,          /* root */
        comm);      /* comm */
    // broadcast the result from process 0 to all other processes
    MPI_Bcast(
        recv_buf,   /* result is stored in the recv_buf */
        count,      /* count */
        dtype,      /* dtype */
        0,          /* root */
        comm);      /* comm */
    return MPI_SUCCESS;
}
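Usage sketch: computing a global maximum that every process needs, e.g. for a convergence check; compute_local_max() is a hypothetical helper standing in for the local computation:
double local_max = compute_local_max();   // hypothetical per-process computation
double global_max;
// after the call, every process holds the maximum over all local_max values
MPI_Allreduce(&local_max, &global_max, 1, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD);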
MPI_Gather
- many-to-one
- the root process gets a piece of data from each process (including itself) and stores all the data in order of the ranks
Pseudo-Implementation:
int MPI_Gather(send_buf, send_count, send_dtype, recv_buf, recv_count, recv_dtype, root, comm){
    // declare variables
    int my_rank, comm_sz;
    // get my rank
    MPI_Comm_rank(comm, &my_rank);
    // get the number of processes
    MPI_Comm_size(comm, &comm_sz);
    // root process: receive one piece per process and store the pieces in rank order
    if(my_rank==root){
        for(int i=0; i<comm_sz; i++){
            // the root's own piece is copied directly into slot i of recv_buf (copy omitted here)
            if(i==root) continue;
            // slot i starts at an offset of i*recv_count elements into recv_buf (schematic)
            MPI_Recv(recv_buf + i*recv_count, recv_count, recv_dtype, i, 0, comm, MPI_STATUS_IGNORE);
        }
    }
    // all other processes: send their piece to the root process
    else{
        MPI_Send(send_buf, send_count, send_dtype, root, 0, comm);
    }
    return MPI_SUCCESS;
}
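Usage sketch (assuming my_rank and comm_sz are set as above and stdlib.h is included for malloc): every process contributes one integer, and process 0 collects them into an array ordered by rank; the receive buffer is only needed on the root:
int my_value = my_rank;
int* all_values = NULL;
if(my_rank == 0){
    // the root needs room for one int per process
    all_values = malloc(comm_sz * sizeof(int));
}
// after the call, all_values[i] on the root holds the value sent by rank i
MPI_Gather(&my_value, 1, MPI_INT, all_values, 1, MPI_INT, 0, MPI_COMM_WORLD);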
MPI_Scatter
- one-to-many
- the data of the root process is split into equal pieces, and each process receives the piece corresponding to its rank
Pseudo-Implementation:
int MPI_Scatter(send_buf, send_count, send_dtype, recv_buf, recv_count, recv_dtype, root, comm){
    // declare variables
    int my_rank, comm_sz;
    // get my rank
    MPI_Comm_rank(comm, &my_rank);
    // get the number of processes
    MPI_Comm_size(comm, &comm_sz);
    // root process: send one piece to every other process
    if(my_rank==root){
        for(int i=0; i<comm_sz; i++){
            // the root's own piece is copied directly into its recv_buf (copy omitted here)
            if(i==root) continue;
            // piece i starts at an offset of i*send_count elements into send_buf (schematic)
            MPI_Send(send_buf + i*send_count, send_count, send_dtype, i, 0, comm);
        }
    }
    // all other processes: receive their piece from the root process
    else{
        MPI_Recv(recv_buf, recv_count, recv_dtype, root, 0, comm, MPI_STATUS_IGNORE);
    }
    return MPI_SUCCESS;
}
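Usage sketch, the counterpart of the MPI_Gather example (same assumptions): process 0 owns one integer per process and hands each rank its element:
int my_piece;
int* all_pieces = NULL;
if(my_rank == 0){
    // only the root needs the full array; element i is destined for rank i
    all_pieces = malloc(comm_sz * sizeof(int));
    for(int i = 0; i < comm_sz; i++) all_pieces[i] = 100 + i;
}
// every process (including the root) receives exactly one int into my_piece
MPI_Scatter(all_pieces, 1, MPI_INT, &my_piece, 1, MPI_INT, 0, MPI_COMM_WORLD);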
MPI_Alltoall
- many-to-many
- every process sends a distinct block of its send_buf to each process and receives one block from each process into its recv_buf
Pseudo-Implementation:
int MPI_Alltoall(send_buf, send_count, send_dtype, recv_buf, recv_count, recv_dtype, comm){
    // declare variables
    int my_rank, comm_sz;
    // get my rank
    MPI_Comm_rank(comm, &my_rank);
    // get the number of processes
    MPI_Comm_size(comm, &comm_sz);
    // every process exchanges one block with every process, including itself
    // (buffer offsets are in elements, schematic)
    for(int i=0; i<comm_sz; i++){
        MPI_Sendrecv(
            send_buf + i*send_count,  /* block destined for process i */
            send_count,               /* send_count */
            send_dtype,               /* send_dtype */
            i,                        /* dest */
            0,                        /* send_tag */
            recv_buf + i*recv_count,  /* block received from process i */
            recv_count,               /* recv_count */
            recv_dtype,               /* recv_dtype */
            i,                        /* source */
            0,                        /* recv_tag */
            comm,                     /* comm */
            MPI_STATUS_IGNORE);       /* status */
    }
    return MPI_SUCCESS;
}
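Usage sketch (same assumptions as for the MPI_Gather example): every process prepares one int per destination; after the call, recv_vals[i] holds the value that rank i sent to this process:
int* send_vals = malloc(comm_sz * sizeof(int));
int* recv_vals = malloc(comm_sz * sizeof(int));
for(int i = 0; i < comm_sz; i++){
    // send_vals[i] is the block destined for rank i
    send_vals[i] = my_rank * 100 + i;
}
// afterwards recv_vals[i] == i * 100 + my_rank, i.e. the value sent by rank i
MPI_Alltoall(send_vals, 1, MPI_INT, recv_vals, 1, MPI_INT, MPI_COMM_WORLD);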
MPI_Allgather
- many-to-many: every process ends up with the gathered data of all processes
To build this function, we can just combine the following two routines (see the sketch below):
- MPI_Gather to the root (implementation above)
- MPI_Bcast from the root to all other processes (implementation above)
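A pseudo-implementation in the same style as the routines above, using process 0 as the root:
int MPI_Allgather(send_buf, send_count, send_dtype, recv_buf, recv_count, recv_dtype, comm){
    // get the number of processes (needed for the size of the gathered data)
    int comm_sz;
    MPI_Comm_size(comm, &comm_sz);
    // gather the pieces of all processes to process 0
    MPI_Gather(send_buf, send_count, send_dtype, recv_buf, recv_count, recv_dtype, 0, comm);
    // broadcast the complete gathered buffer from process 0 to all other processes
    MPI_Bcast(recv_buf, comm_sz*recv_count, recv_dtype, 0, comm);
    return MPI_SUCCESS;
}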
References
- Based on the University of Saskatchewan’s CMPT851 slides on MPI, OpenMP and more.