Performance Tuning

Executing with GPUDirect

OPS supports GPUDirect for MPI+CUDA executions. To enable it on the OPS side, add the -gpudirect flag when running the executable. Depending on the MPI distribution, you may also need to set certain environment variables. For an example of the required flags and environment settings on the Cambridge Wilkes2 GPU cluster see:
https://docs.hpc.cam.ac.uk/hpc/user-guide/performance-tips.html
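
As a minimal sketch, a GPUDirect-enabled run of the CloverLeaf MPI+CUDA build might look as follows (the binary name, rank count, and use of mpirun are illustrative and depend on your build and MPI distribution):

mpirun -np 2 ./cloverleaf_mpi_cuda -gpudirect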

Cache-blocking Tiling

OPS has a code generator (ops_gen_mpi_lazy) and a build target for cache-blocking tiling. Once the application is compiled, enable tiling with the OPS_TILING runtime argument. OPS will then query the L3 cache size of your CPU and estimate a suitable tile size. To change the amount of cache assumed for this estimate, use the OPS_CACHE_SIZE=XX runtime argument, where the value is in megabytes. To specify the tile sizes manually, use the OPS_TILESIZE_X, OPS_TILESIZE_Y, and OPS_TILESIZE_Z runtime arguments.
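
For example, to run the tiled CloverLeaf build while assuming 30 MB of cache for the tile-size estimate (the cache value here is purely illustrative):

./cloverleaf_tiled OPS_TILING OPS_CACHE_SIZE=30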

When MPI is combined with OpenMP, tiling can be extended across the MPI halos. Set OPS_TILING_MAXDEPTH to increase the halo depths so that the halos for multiple ops_par_loops can be exchanged with a single MPI message (see TPDS2017 for more details).
To test this, compile CloverLeaf under OPS/apps/c/CloverLeaf, modify clover.in to use a 6144x6144 mesh, then run as follows:
For OpenMP with tiling:

export OMP_NUM_THREADS=xx; numactl --cpunodebind=0 ./cloverleaf_tiled OPS_TILING

For MPI+OpenMP with tiling:

export OMP_NUM_THREADS=xx; mpirun -np xx ./cloverleaf_mpi_tiled OPS_TILING OPS_TILING_MAXDEPTH=6

To manually specify the tile sizes (in number of grid points), use the OPS_TILESIZE_X, OPS_TILESIZE_Y, and OPS_TILESIZE_Z runtime arguments:

export OMP_NUM_THREADS=xx; numactl --cpunodebind=0 ./cloverleaf_tiled OPS_TILING OPS_TILESIZE_X=600 OPS_TILESIZE_Y=200

OpenMP and MPI+OpenMP

It is recommended that you assign one MPI rank per NUMA region when executing MPI+OpenMP parallel code. On most multi-socket systems, a single CPU socket is a single NUMA region. Thus, for a 4-socket system, OPS's MPI+OpenMP code should be executed with 4 MPI processes, each running multiple OpenMP threads (typically set with the OMP_NUM_THREADS environment variable). Additionally, on some systems, using numactl to bind threads to cores can improve performance (see OPS/scripts/numawrap for an example script that wraps the numactl command for use with common MPI distributions).
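
As an illustrative sketch, on a hypothetical 2-socket node with 16 cores per socket, one MPI rank per NUMA region with 16 OpenMP threads each could be launched as follows (the binary name and core counts are assumptions, not prescribed by OPS):

export OMP_NUM_THREADS=16; mpirun -np 2 ./cloverleaf_mpi_openmp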

CUDA arguments

The CUDA (and OpenCL) thread block sizes can be controlled by setting the OPS_BLOCK_SIZE_X, OPS_BLOCK_SIZE_Y and OPS_BLOCK_SIZE_Z runtime arguments. For example,

./cloverleaf_cuda OPS_BLOCK_SIZE_X=64 OPS_BLOCK_SIZE_Y=4

OpenCL arguments

The OPS_CL_DEVICE=XX runtime flag sets the OpenCL device on which to execute the code. Usually, OPS_CL_DEVICE=0 selects the CPU and OPS_CL_DEVICE=1 selects the GPU.
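
For example, assuming the CloverLeaf OpenCL build (the binary name is illustrative), the device selection can be combined with the block-size arguments on the command line:

./cloverleaf_opencl OPS_CL_DEVICE=1 OPS_BLOCK_SIZE_X=32 OPS_BLOCK_SIZE_Y=4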