Key Concepts and Structure
An OPS application can generally be divided into two key parts: initialisation and parallel execution. During the initialisation phase, one or more blocks (ops_block) are defined: these only have a dimensionality (i.e. 1D, 2D, etc.), and serve to group datasets together. Datasets are defined on a block, and have a specific size (in each dimension of the block), which may be slightly different across different datasets (e.g. staggered grids), in some directions they may be degenerate (a size of 1), or they can represent data associated with different multigrid levels (where their size if a multiple or a fraction of other datasets). Datasets can be declared with empty (NULL) pointers, then OPS will allocate the appropriate amount of memory, may be passed non-NULL pointers (currently only supported in non-MPI environments), in which case OPS will assume the memory is large enough for the data and the block halo, and there are HDF5 dataset declaration routines which allow the distributed reading of datasets from HDF5 files. The concept of blocks is necessary to group datasets together, as in a multi-block problem, in a distributed memory environment, OPS needs to be able to determine how to decompose the problem.
The initialisation phase usually also consists of defining the stencils to be used later on (though they can be defined later as well), which describe the data access patterns used in parallel loops. Stencils are always relative to the “current” point; e.g. if at iteration \((i,j)\), we wish to access \((i{-}1,j)\) and \((i,j)\), then the stencil will have two points: \(\{(-1, 0), (0, 0)\}\). To support degenerate datasets (where in one of the dimensions the dataset’s size is 1), as well as for multigrid, there are special strided, restriction, and prolongation stencils: they differ from normal stencils in that as one steps through a grid in a parallel loop, the stepping is done with a non-unit stride for these datasets. For example, in a 2D problem, if we have a degenerate dataset called xcoords, size \((N,1)\), then we will need a stencil with stride \((1,0)\) to access it in a regular 2D loop.
Finally, the initialisation phase may declare a number of global
constants - these are variables in global scope that can be accessed
from within user kernels, without having to pass them in explicitly.
These may be scalars or small arrays, generally for values that do not
change during execution, though they may be updated during execution
with repeated calls to ops_decl_const.
The initialisation phase is terminated by a call to ops_partition.
The bulk of the application consists of parallel loops, implemented
using calls to ops_par_loop. These constructs work with datasets,
passed through the opaque ops_dat handles declared during the
initialisation phase. The iterations of parallel loops are semantically
independent, and it is the responsibility of the user to enforce this:
the order in which iterations are executed cannot affect the result
(within the limits of floating point precision). Parallel loops are
defined on a block, with a prescribed iteration range that is always
defined from the perspective of the dataset written/modified (the sizes
of datasets, particularly in multigrid situations, may be very
different). Datasets are passed in using ops_arg_dat, and during
execution, values at the current grid point will be passed to the user
kernel. These values are passed wrapped in a templated ACC<> object
(templated on the type of the data), whose parentheses operator is
overloaded, which the user must use to specify the relative offset to
access the grid point’s neighbours (which accesses have to match the the
declared stencil). Datasets written may only be accessed with a
one-point, zero-offset stencil (otherwise the parallel semantics may be
violated).
Other than datasets, one can pass in read-only scalars or small arrays
that are iteration space invariant with ops_arg_gbl (typically
weights, \(\delta t\), etc. which may be different in different loops).
The current iteration index can also be passed in with ops_arg_idx,
which will pass a globally consistent index to the user kernel (i.e.
also under MPI).
Reductions in loops are done using the ops_arg_reduce argument, which
takes a reduction handle as an argument. The result of the reduction can
then be acquired using a separate call to ops_reduction_result. The
semantics are the following: a reduction handle after it was declared is
in an “uninitialised” state. The first time it is used as an argument to
a loop, its type is determined (increment/min/max), and is initialised
appropriately \((0,\infty,-\infty)\), and subsequent uses of the handle in
parallel loops are combined together, up until the point, where the
result is acquired using ops_reduction_result, which then sets it back
to an uninitialised state. This also implies, that different parallel
loops, which all use the same reduction handle, but are otherwise
independent, are independent and their partial reduction results can be
combined together associatively and commutatively.
OPS takes responsibility for all data, its movement and the execution of parallel loops. With different execution hardware and optimisations, this means OPS will re-organise data as well as execution (potentially across different loops), and therefore any data accesses or manipulation may only be done through the OPS API.
This restriction is exploited by a lazy execution mechanism in OPS. The idea is that OPS API calls that do not return a result can be not executed immediately, rather queued, and once an API call requires returning some data, operations in the queue are executed, and the result is returned. This allows OPS to analyse and optimise operations in the queue together. This mechanism is fully automated by OPS, and is used with the various _tiled executables. For more information on how to use this mechanism for improving CPU performance, see the Optimizations - tiling section. Some API calls triggering the execution of queued operations include ops_reduction_result, and the functions in the data access API.