Performance Tuning

Executing with GPUDirect

OPS supports GPUDirect for MPI+CUDA and MPI+HIP builds. To enable it on the OPS side, add the -gpudirect flag when running the executable; this activates MPI-aware GPU-to-GPU communication on both AMD and NVIDIA GPUs. Depending on the MPI distribution, you may also need to set specific environment variables. For reference, see the example flags and environment settings provided for the Cambridge Wilkes2 GPU cluster:
https://docs.hpc.cam.ac.uk/hpc/user-guide/performance-tips.html

On NVIDIA GPUs, OPS leverages NVIDIA GPUDirect RDMA technology, which allows direct data transfers between GPU memory and the network interface without staging through host memory.
On AMD GPUs, OPS utilizes the ROCm RDMA technology, which provides similar direct communication capabilities, enabling efficient GPU-to-GPU data movement over high-performance interconnects.

Note: The OPS -gpudirect option is an abstraction that activates MPI-aware GPU communication on both vendor platforms. It should not be confused with NVIDIA’s GPUDirect technology specifically; rather, it is a portable OPS runtime option that selects the appropriate mechanism depending on the underlying hardware and software stack.
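
As an illustration, a launch line with the flag might look like the sketch below. The executable name cloverleaf_mpi_cuda, the rank count, and the helper function gpudirect_cmd are assumptions made for this example, not names taken from this section. Any environment variables required by a particular MPI distribution are omitted. The helper only prints the command so that it can be checked before it is run.

```shell
# Hypothetical helper: prints (does not execute) an MPI launch line
# with the OPS -gpudirect flag appended.
gpudirect_cmd() {
    nranks=$1   # number of MPI ranks, typically one per GPU
    exe=$2      # OPS MPI+CUDA or MPI+HIP executable (name is an assumption)
    echo "mpirun -np ${nranks} ${exe} -gpudirect"
}

# Example: four ranks on an assumed MPI+CUDA build of CloverLeaf.
gpudirect_cmd 4 ./cloverleaf_mpi_cuda
```

Copy the printed line into a job script, or wrap the call in eval once the command looks right.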

Cache-blocking Tiling

OPS provides a code generator (ops_gen_mpi_lazy) and a build target for tiling. Once the application is compiled, enable tiling with the OPS_TILING runtime parameter. OPS then reads the L3 cache size of your CPU and estimates a suitable tile size. To change the amount of cache assumed for this estimate, use the OPS_CACHE_SIZE=XX runtime parameter, where the value is in megabytes.

When MPI is combined with OpenMP, tiling can be extended to the MPI halos. Set OPS_TILING_MAXDEPTH to increase the halo depths so that halos for multiple ops_par_loops can be exchanged with a single MPI message (see TPDS2017 for more details).

To test, compile CloverLeaf under OPS/apps/c/CloverLeaf, modify clover.in to use a \(6144^2\) mesh, then run one of the commands below.

For OpenMP with tiling:

export OMP_NUM_THREADS=xx; numactl --cpunodebind=0 ./cloverleaf_tiled OPS_TILING

For MPI+OpenMP with tiling:

export OMP_NUM_THREADS=xx; mpirun -np xx ./cloverleaf_mpi_tiled OPS_TILING OPS_TILING_MAXDEPTH=6

To manually specify the tile sizes (in number of grid points), use the OPS_TILESIZE_X, OPS_TILESIZE_Y, and OPS_TILESIZE_Z runtime arguments:

export OMP_NUM_THREADS=xx; numactl --cpunodebind=0 ./cloverleaf_tiled OPS_TILING OPS_TILESIZE_X=600 OPS_TILESIZE_Y=200
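
The OPS_CACHE_SIZE override mentioned earlier can be combined with OPS_TILING in the same way. A minimal sketch, assuming a 32 MB cache budget chosen purely for illustration; the helper tiled_cmd is hypothetical and only prints the command line:

```shell
# Hypothetical helper: prints (does not run) a tiled launch line that
# overrides the cache size OPS uses when estimating tile sizes.
tiled_cmd() {
    exe=$1        # tiled executable, e.g. ./cloverleaf_tiled
    cache_mb=$2   # cache budget in megabytes (illustrative value)
    echo "${exe} OPS_TILING OPS_CACHE_SIZE=${cache_mb}"
}

# Example: ask OPS to size tiles for 32 MB of cache.
tiled_cmd ./cloverleaf_tiled 32
```

The thread count and any numactl binding from the commands above would be added in front of the printed line as usual.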

OpenMP and OpenMP+MPI

It is recommended that you assign one MPI rank per NUMA region when executing MPI+OpenMP parallel code. Usually, for a multi-CPU system, a single CPU socket is a single NUMA region. Thus, for a 4-socket system, OPS MPI+OpenMP code should be executed with 4 MPI processes, with each MPI process having multiple OpenMP threads (typically specified by the OMP_NUM_THREADS environment variable). Additionally, on some systems, using numactl to bind threads to cores can provide performance improvements (see OPS/scripts/numawrap for an example script that wraps the numactl command for use with common MPI distributions).
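
The rank and thread counts implied by this rule can be worked out with a little arithmetic. The sketch below assumes that the cores divide evenly among the NUMA regions; the helper threads_per_rank, the 64-core node, and the executable name cloverleaf_mpi_openmp are assumptions made for illustration. It prints the resulting launch line rather than running it.

```shell
# threads_per_rank TOTAL_CORES NUMA_REGIONS
# Prints the OpenMP thread count per MPI rank when one rank is placed
# on each NUMA region (integer division; assumes an even split).
threads_per_rank() {
    total_cores=$1
    numa_regions=$2
    echo $(( total_cores / numa_regions ))
}

# Example: a hypothetical 4-socket node with 64 physical cores in total.
ranks=4
threads=$(threads_per_rank 64 "$ranks")
echo "export OMP_NUM_THREADS=${threads}; mpirun -np ${ranks} ./cloverleaf_mpi_openmp"
```

On a real system the core and NUMA-region counts can be read from tools such as lscpu or numactl --hardware before calling the helper.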

CUDA arguments

The CUDA (and HIP) thread block sizes can be controlled by setting the OPS_BLOCK_SIZE_X, OPS_BLOCK_SIZE_Y, and OPS_BLOCK_SIZE_Z runtime arguments. For example:

./cloverleaf_cuda OPS_BLOCK_SIZE_X=64 OPS_BLOCK_SIZE_Y=4
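
Suitable block sizes depend on the GPU and the application, so it can help to try several shapes and compare run times. The sketch below prints one command line per candidate. The helper block_size_sweep and the candidate values are assumptions chosen for illustration, not recommended settings.

```shell
# Hypothetical helper: prints one launch line per candidate block shape.
# Nothing is executed; the largest shape here is 128 x 8 = 1024 threads.
block_size_sweep() {
    exe=$1
    for bx in 32 64 128; do
        for by in 2 4 8; do
            echo "${exe} OPS_BLOCK_SIZE_X=${bx} OPS_BLOCK_SIZE_Y=${by}"
        done
    done
}

# Example: list the nine candidate runs for the CUDA build of CloverLeaf.
block_size_sweep ./cloverleaf_cuda
```

Each printed line can then be run and timed in turn.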