OMP: parallel computing 2/2
Introduction
The Open Multiprocessing (OpenMP) library is a set of functions and compiler directives for C, C++ and Fortran, designed to run with multiple threads on shared memory machines.
The idea is that we have a serial program as the basis and can incrementally parallelize it (using #pragma omp compiler directives).
Structure of an OpenMP program
Parallel OpenMP programs have a master thread (the thread with rank 0) that runs the entire time, and slave threads (all other threads) that are forked and joined in parallel sections (see the program structure image referenced below). Together, all threads form a team. It is usually recommended to leave thread-unsafe operations like I/O to the master thread.
Note that every time the threads join at the end of a parallel section, they first have to synchronize: each thread waits at an implicit barrier (e.g. at the end of every work-sharing construct) until all threads of the team have reached it.
Program execution
You can compile your program using
gcc -g -Wall -fopenmp program.c -o program
where -fopenmp is required to build the program (the other two flags are optional but helpful: -g for more debug info and -Wall for more warnings). You can then run the program without any special commands:
./program
Rank
To get the rank of a thread, you can use the omp_get_thread_num() function.
int my_rank = omp_get_thread_num();
Thread_count
Similarly, we can obtain the number of threads in the current team using the omp_get_num_threads() function.
int thread_count = omp_get_num_threads();
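Note that omp_get_num_threads() only reports the team size inside a parallel region; in the serial part of the program it returns 1. A small sketch to illustrate this (the choice of 4 threads is just an example):

#include <stdio.h>
#include <omp.h>

int main(void) {
    /* in the serial part there is only the master thread */
    printf("serial section: %d thread(s)\n", omp_get_num_threads()); /* prints 1 */

    #pragma omp parallel num_threads(4)
    {
        if (omp_get_thread_num() == 0) {
            /* inside the parallel region we get the size of the whole team */
            printf("parallel section: %d thread(s)\n", omp_get_num_threads()); /* prints 4 */
        }
    }
    return 0;
}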
We can also read the desired thread count from the command-line arguments (argv):
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(int argc, char** argv)
{
    int thread_count = strtol(argv[1], NULL, 10); /* read the first arg (base 10) */
    printf("requested thread count: %d\n", thread_count);
    return 0;
}
Compiler directives (pragmas)
Parallel section
The most widely used pragma is for the parallel section, where we want to execute a block of code in parallel (by multiple threads).
#pragma omp parallel num_threads(thread_count)
{
    // local (private) variables
    int my_rank = omp_get_thread_num();
    double local_result;
    // calculate stuff...
    // e.g. use part of some input data like data[my_rank]
}
Critical
When writing to shared variables in a parallel section, we have to carefully handle race conditions, e.g. via a critical section that ensures mutually exclusive access to the shared variable:
double total_result = 0;
#pragma omp parallel num_threads(thread_count)
{
    // local (private) variables
    int my_rank = omp_get_thread_num();
    double local_result;
    // calculate stuff...
    // add the partial result to the total result with mutually exclusive access
    #pragma omp critical
    {
        total_result += local_result;
    }
}
Variable scopes
| scope | description | how to init |
|---|---|---|
| shared | Can be seen by all threads. | Declared outside of parallel sections or explicitly with shared(variable). Static variables are always shared. |
| private | Can only be seen by one thread. | Declared inside of parallel sections or explicitly with private(variable). |
| firstprivate | Same as private, but the local variable is initialized with the value the shared variable had before the parallel section. | firstprivate(variable) |
| lastprivate | Same as private, but after the parallel section the shared variable is set to the local value from the last iteration of the parallel loop. | lastprivate(variable) |
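A short sketch of how firstprivate and lastprivate behave (hypothetical example; the variable names are only for illustration):

#include <stdio.h>
#include <omp.h>

int main(void) {
    int offset = 100; /* shared value before the parallel loop */
    int last = 0;

    /* firstprivate: every thread gets its own copy of offset, initialized to 100 */
    /* lastprivate: after the loop, last holds the value from the final iteration */
    #pragma omp parallel for firstprivate(offset) lastprivate(last)
    for (int i = 0; i < 8; i++) {
        last = offset + i; /* uses the thread-local copy of offset */
    }

    printf("last = %d\n", last); /* 107: the value from iteration i == 7 */
    return 0;
}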
Scheduling
#pragma omp parallel for (...) schedule(type, chunksize)
- default is typically a static schedule, e.g. schedule(static, 1) $\rightarrow$ good if all iterations take about the same time
- worth trying: schedule(dynamic, 1) $\rightarrow$ good if the iterations vary a lot in runtime
| type | description |
|---|---|
| static | chunks are assigned to the threads before the loop is executed |
| dynamic | after a thread completes its current chunk, it requests another one |
| auto | schedule determined by the compiler |
| runtime | schedule determined at runtime (via the OMP_SCHEDULE environment variable) |
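For illustration, here is a hypothetical loop with very uneven work per iteration, where dynamic scheduling typically pays off (the iteration counts are arbitrary):

#include <stdio.h>
#include <omp.h>

int main(void) {
    double results[16];

    /* later iterations do much more work than earlier ones, so evenly sized
       static chunks would leave some threads idle; with schedule(dynamic, 1)
       an idle thread simply grabs the next unprocessed iteration */
    #pragma omp parallel for schedule(dynamic, 1)
    for (int i = 0; i < 16; i++) {
        double work = 0.0;
        for (long j = 0; j < (long)i * 1000000L; j++) {
            work += 1.0;
        }
        results[i] = work; /* each iteration writes only its own slot: no race */
    }

    printf("results[15] = %f\n", results[15]);
    return 0;
}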
Reduction
If you use the #pragma omp parallel (...) reduction(op: result) clause, the result variable has a local (private) scope within the parallel block. At the end of the parallel block, the local copies are automatically reduced using the specified operation op and stored in the same result variable (which is shared after the parallel block).
Common operations for reduction are +, *, & and |.
Example call:
int thread_count = 2;
int total_result = 4;
#pragma omp parallel num_threads(thread_count) reduction(*:total_result)
{
    // calculate stuff..
    total_result = 5 + 5; /* partial result of each thread */
}
printf("result: %d\n", total_result);
The result is 4 * 10 * 10 = 400. The first factor (4) is the value that total_result had before the parallel section; it is multiplied by the partial (local) result of each of the two threads (here 5 + 5 = 10). In a typical use case where you just want to multiply the partial results, you should therefore initialize the variable with int total_result = 1 beforehand; similarly, for addition you would use int total_result = 0.
Sample OpenMP program
Combining everything, we get the following program.
#include <omp.h>
#include <stdlib.h>
#include <stdio.h>
int main(int argc, char** argv){
    int thread_count = 4; /* shared */
    int total_result = 0; /* shared */
    printf("hi from master in the serial section!\n");
    #pragma omp parallel num_threads(thread_count)
    {
        // local (private) variables
        int my_rank; /* private */
        int partial_result = 0; /* private */
        // get the thread's rank
        my_rank = omp_get_thread_num();
        // print the rank
        printf("hi from thread %d of %d!\n", my_rank, thread_count);
        // calculate stuff
        for(int i=0; i<1000; i++){
            partial_result++;
        }
        // add the partial result to the total result
        // with mutually exclusive access to avoid a race condition
        #pragma omp critical
        {
            total_result += partial_result;
        }
    }
    // master (rank 0): print result
    printf("result: %d\n", total_result);
    return 0;
}
Example output (the order of the thread messages may vary):
hi from thread 0 of 4!
hi from thread 3 of 4!
hi from thread 1 of 4!
hi from thread 2 of 4!
result: 4000
Shared work constructs
Parallel for-loop
In the following program, we split a for loop across threads (this is also known as work sharing) using the #pragma omp parallel for directive and a reduction at the end.
Note:
- the number of iterations has to be known when the loop starts (while loops and for loops containing a break cannot be parallelized)
- be aware of data dependencies: if a variable written in one iteration is read or written in another iteration, the code will still compile but can yield wrong results (see the short dependency sketch after the example program below)
#include <omp.h>
#include <stdlib.h>
#include <stdio.h>
int main(int argc, char** argv){
    int thread_count = 24; /* shared */
    int total_result = 0;  /* reduction variable */
    // shared work construct: partition the for loop
    // into (n / thread_count) iterations per thread
    // and reduce the individual results with a sum at the end
    #pragma omp parallel for num_threads(thread_count) \
        reduction(+: total_result) shared(thread_count) schedule(static, 1)
    for(int i=0; i<thread_count; i++){
        total_result++;
    }
    // note: there is an implicit barrier at the end
    // of shared work constructs (here)
    // master (rank 0): print result
    printf("result: %d\n", total_result);
    return 0;
}
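As noted above, a loop where one iteration reads a value written by another iteration must not be parallelized this way. A minimal, hypothetical sketch of such a loop-carried dependency:

#include <stdio.h>

int main(void) {
    int a[8] = {1, 0, 0, 0, 0, 0, 0, 0};

    /* loop-carried dependency: iteration i reads a[i-1], which is written
       by iteration i-1; if the iterations ran in parallel, another thread
       may not have written a[i-1] yet, so the result would be wrong */
    // #pragma omp parallel for   /* would compile, but gives wrong results */
    for (int i = 1; i < 8; i++) {
        a[i] = a[i - 1] + 1;
    }

    printf("a[7] = %d\n", a[7]); /* 8 in the correct, serial version */
    return 0;
}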
Measuring execution time
We can measure the elapsed wall-clock time with the omp_get_wtime() function by calling it at the start and at the end of the program. The execution time is then the difference between the two timepoints: execution_time = end_time - start_time.
#include <omp.h>
#include <stdlib.h>
#include <stdio.h>
int main(int argc, char** argv){
    double start_time = omp_get_wtime();
    double x = 10; /* double instead of int: repeated squaring would overflow a signed int */
    // calculate stuff
    for(int i=0; i<2000000000; i++){
        x = x*x;
    }
    double end_time = omp_get_wtime();
    // print the time diff (and x, so the compiler does not remove the loop)
    printf("execution time: %f seconds (x = %f)\n", end_time - start_time, x);
    return 0;
}
References
- Based on the University of Saskatchewan’s CMPT851: slides for MPI, OpenMP and more.
- OpenMP logo.
- OpenMP program structure image.