# Developing an OPS Application
This page provides a tutorial in the basics of using OPS for multi-block structured mesh application development. This is taken from a [presentation](https://op-dsl.github.io/docs/OPS/tutorial.pdf) given initially in April 2018 and subsequently updated for the latest release of OPS. 

## OPS Abstraction
OPS is a Domain Specific Language embedded in C/C++ and Fortran, targeting the development of multi-block structured mesh computations. The abstraction has two distinct components: the definition of the mesh, and operations over the mesh.
* Defining a number of 1-3D blocks, and on them a number of datasets, which have specific extents in the different dimensions.
* Describing a parallel loop over a given block, with a given iteration range, executing a given "kernel function" at each mesh point, and describing what datasets are going to be accessed and how.
* Additionally, one needs to declare stencils (access patterns) that will be used in parallel loops to access datasets, and any global constants (read-only global scope variables).

Data and computations expressed this way can be automatically managed and parallelized by the OPS library. Higher dimensions are supported in the backend, but are not currently supported by the code generator.

## Example Application
In this tutorial we will use an example application, a simple 2D iterative Laplace equation solver. 
* Go to the `OPS/apps/c/laplace2dtutorial/original` directory and open the `laplace2d.cpp` file
* It uses an $imax$ x $jmax$ mesh, with an additional 1 layer of boundary cells on all sides
* There are a number of loops that set the boundary conditions along the four edges
* The bulk of the simulation is spent in a while loop, repeating a stencil kernel with a maximum reduction, and a copy kernel
* Compile and run the code!

Note: The following tutorial details the step-by-step approach for using OPS for Laplace (C version) application development. A similar step-by-step approach is also followed for the Laplace Fortran version and can be found at `OPS/apps/fortran/laplace2dtutorial`.

## Original - Initialization
The original code begins with initializing the data arrays used in the calculation:
```c
//Size along y
int jmax = 4094;
//Size along x
int imax = 4094;
//Max number of iterations
int iter_max = 100;

double pi  = 2.0 * asin(1.0);
const double tol = 1.0e-6;
double error     = 1.0;

double *A;
double *Anew;
double *y0;

A    = (double *)malloc((imax+2) * (jmax+2) * sizeof(double));
Anew = (double *)malloc((imax+2) * (jmax+2) * sizeof(double));
y0   = (double *)malloc((imax+2) * sizeof(double));

memset(A, 0, (imax+2) * (jmax+2) * sizeof(double));
```
## Original - Boundary loops
The application then sets boundary conditions:
```c
for (int i = 0; i < imax+2; i++)
    A[(0)*(imax+2)+i]   = 0.0;

for (int i = 0; i < imax+2; i++)
    A[(jmax+1)*(imax+2)+i] = 0.0;

for (int j = 0; j < jmax+2; j++) {
    A[(j)*(imax+2)+0] = sin(pi * j / (jmax+1));
}

for (int j = 0; j < imax+2; j++) {
    A[(j)*(imax+2)+imax+1] = sin(pi * j / (jmax+1))*exp(-pi);
}
```  
Note how in the latter two loops the loop index is used.

## Original - Main iteration
The main iterative loop is a while loop iterating until the error tolerance is at a set level and the number of iterations is less than the maximum set. 
```c
while ( error > tol && iter < iter_max ) {
  error = 0.0;
  for( int j = 1; j < jmax+1; j++ ) {
    for( int i = 1; i < imax+1; i++) {
      Anew[(j)*(imax+2)+i] = 0.25f * 
      ( A[(j)*(imax+2)+i+1] + A[(j)*(imax+2)+i-1]
      + A[(j-1)*(imax+2)+i] + A[(j+1)*(imax+2)+i]);
      error = fmax( error, fabs(Anew[(j)*(imax+2)+i]-A[(j)*(imax+2)+i]));
    }
  }
  for( int j = 1; j < jmax+1; j++ ) {
    for( int i = 1; i < imax+1; i++) {
      A[(j)*(imax+2)+i] = Anew[(j)*(imax+2)+i];    
    }
  }
  if(iter % 10 == 0) printf("%5d, %0.6f\n", iter, error);        
  iter++;
}
```
## Build OPS
Build OPS using instructions in the [Getting Started](https://ops-dsl.readthedocs.io/en/markdowndocdev/installation.html#getting-started) page. 

## Step 1 - Preparing to use OPS
Firstly, include the appropriate header files, then initialize OPS, and at the end finalize it.
* Define that this application is 2D, include the OPS header file, and create a header file where the outlined "elemental kernels" will live.
```c
#define OPS_2D
#include <ops_seq.h>
#include "laplace_kernels.h" 
```
* Initialize and finalize OPS
```c  
int main(int argc, const char** argv) {
  //Initialize the OPS library, passing runtime args, and setting diagnostics level to low (1)
  ops_init(argc, argv,1);
  ...
  ...
  //Finalizing the OPS library
  ops_exit();
}  
```  
By this point you need OPS set up - take a look at the Makefile in step1, and observe that the include and library paths are added, and we link against `ops_seq`.

## Step 2 - OPS declarations
Now declare a block and data on the block:
```c
//The 2D block
ops_block block = ops_decl_block(2, "my_grid");

//The two datasets
int size[] = {imax, jmax};
int base[] = {0,0};
int d_m[] = {-1,-1};
int d_p[] = {1,1};
ops_dat d_A    = ops_decl_dat(block, 1, size, base,
                               d_m, d_p, A,    "double", "A");
ops_dat d_Anew = ops_decl_dat(block, 1, size, base,
                               d_m, d_p, Anew, "double", "Anew");
```
Datasets have a size (number of mesh points in each dimension). There is padding for halos or boundaries in the positive (`d_p`) and negative directions (`d_m`). Here we use a 1-thick boundary layer. The base index can be defined as it may be different from 0 (e.g., in Fortran). With a 0 base index and a 1-wide halo, these datasets can be indexed from `−1` to `size+1`.

OPS supports gradual conversion of applications to its API, but in this case the described data sizes will need to match the allocated memory and its extents need to be correctly described to OPS. In this example we have two `(imax+2) * (jmax+2)` size arrays, and the total size in each dimension needs to match size `[i] + d_p[i] − d_m[i]`. This is only supported for the sequential and OpenMP backends. If a `NULL` pointer is passed, OPS will allocate the data internally.

We also need to declare the stencils that will be used - in this example most loops use a simple 1-point stencil, and one uses a 5-point stencil:
```c
//Two stencils, a 1-point, and a 5-point
int s2d_00[] = {0,0};
ops_stencil S2D_00 = ops_decl_stencil(2,1,s2d_00,"0,0");
int s2d_5pt[] = {0,0, 1,0, -1,0, 0,1, 0,-1};
ops_stencil S2D_5pt = ops_decl_stencil(2,5,s2d_5pt,"5pt");
```  
Different names may be used for stencils in your code, but we suggest using some convention.

## Step 3 - First parallel loop
You can now convert the first loop to use OPS:
```c
for (int i = 0; i < imax+2; i++)
    A[(0)*(imax+2)+i]   = 0.0;
```    
This is a loop on the bottom boundary of the domain, which is at the −1 index for our dataset, therefore our iteration range will be over the entire domain, including halos in the X direction, and the bottom boundary in the Y direction. The iteration range is given as beginning (inclusive) and end (exclusive) indices in the x, y, etc. directions.
```c
int bottom_range[] = {-1, imax+1, -1, 0};
```
Next, we need to outline the “elemental” kernel into `laplace_kernels.h`, and place the appropriate access objects - `ACC<double> &A`, in the kernel’s formal parameter list, and `(i,j)` are the stencil offsets in the X and Y directions respectively:
```c
void set_zero(ACC<double> &A) {
  A(0,0) = 0.0;
}
```
The OPS parallel loop can now be written as follows:
```c
ops_par_loop(set_zero, "set_zero", block, 2, bottom_range,
      ops_arg_dat(d_A, 1, S2D_00, "double", OPS_WRITE));
```
The loop will execute `set_zero` at each mesh point defined in the iteration range, and write the dataset `d_A` with the 1-point stencil. The `ops_par_loop` implies that the order in which mesh points will be executed will not affect the end result (within machine precision).

There are three more loops which set values to zero; they can be trivially replaced with the code above, only altering the iteration range. In the main while loop, the second simpler loop simply copies data from one array to another, this time on the interior of the domain:
```c
int interior_range[] = {0,imax,0,jmax};
ops_par_loop(copy, "copy", block, 2, interior_range,
    ops_arg_dat(d_A,    1, S2D_00, "double", OPS_WRITE),
    ops_arg_dat(d_Anew, 1, S2D_00, "double", OPS_READ));
```
And the corresponding outlined elemental kernel is as follows:
```c
void copy(ACC<double> &A, const ACC<double> &Anew) {
  A(0,0) = Anew(0,0);
}
```
## Step 4 - Indexes and global constants
There are two sets of boundary loops which use the loop variable j - this is a common technique to initialize data, such as coordinates `(x = i*dx)`. OPS has a special argument `ops_arg_idx` which gives us a globally coherent (including over MPI) iteration index - between the bounds supplied in the iteration range.
```c
ops_par_loop(left_bndcon, "left_bndcon", block, 2, left_range,
      ops_arg_dat(d_Anew, 1, S2D_00, "double", OPS_WRITE),
      ops_arg_idx());
```
And the corresponding outlined user kernel is as follows. Observe the `idx` argument and the +1 offset due to the difference in indexing:
```c
void left_bndcon(ACC<double> &A, const int *idx) {
  A(0,0) = sin(pi * (idx[1]+1) / (jmax+1));
}
```
This kernel also uses two variables, `jmax` and `pi`, that do not depend on the iteration index - they are iteration space invariant. OPS has two ways of supporting this:

(1) Global scope constants, through `ops_decl_const`, as done in this example: we need to move the declaration of the `imax`, `jmax` and `pi` variables to global scope (outside of main), and call the OPS API:
```c
//declare and define global constants
ops_decl_const("imax",1,"int",&imax);
ops_decl_const("jmax",1,"int",&jmax);
ops_decl_const("pi",1,"double",&pi);
```
These variables do not need to be passed into the elemental kernel; they are accessible in all elemental kernels.

(2) The other option is to explicitly pass it to the elemental kernel with `ops_arg_gbl`: this is for scalars and small arrays that should not be in global scope.

## Step 5 - Complex stencils and reductions
There is only one loop left, which uses a 5-point stencil and a reduction. It can be outlined as usual, and for the stencil, we will use `S2D_5pt`.
```c
ops_par_loop(apply_stencil, "apply_stencil", block, 2, interior_range,
        ops_arg_dat(d_A,    1, S2D_5pt, "double", OPS_READ),
        ops_arg_dat(d_Anew, 1, S2D_00, "double", OPS_WRITE),
        ops_arg_reduce(h_err, 1, "double", OPS_MAX))
```
The corresponding outlined elemental kernel is as follows. Observe the stencil offsets used to access the adjacent 4 points:
```c
void apply_stencil(const ACC<double> &A, ACC<double> &Anew, double *error) {
  Anew(0,0) = 0.25f * ( A(1,0) + A(-1,0)
      + A(0,-1) + A(0,1));
  *error = fmax( *error, fabs(Anew(0,0)-A(0,0)));
}
```
The loop also has a special argument for the reduction, `ops_arg_reduce`. As the first argument, it takes a reduction handle, which has to be defined separately:
```c
ops_reduction h_err = ops_decl_reduction_handle(sizeof(double), "double", "error");
``` 
Reductions may be increment (`OPS_INC`), min (`OPS_MIN`) or max (`OPS_MAX`). The user kernel will have to perform the reduction operation, reducing the passed-in value as well as the computed value.

The result of the reduction can be queried from the handle as follows:
```c
ops_reduction_result(h_err, &error);
```

Multiple parallel loops may use the same handle, and their results will be combined, until the result is queried by the user. Parallel loops that only have the reduction handle in common are semantically independent.

## Step 6 - Handing it all over to OPS

We have now successfully converted all computations on the mesh to OPS parallel loops. In order for OPS to manage data and parallelizations better, we should let OPS allocate the datasets - instead of passing in the pointers to memory allocated by us, we just pass in NULL (`A` and `Anew`). Parallel I/O can be done using HDF5 - see the ops_hdf5.h header.

All data and parallelization is now handed over to OPS. We can now also compile the developer MPI version of the code - see the Makefile, and try building `laplace2d_mpi`.

## Step 7 - Code generation

Now that the developer versions of our code work, it’s time to generate code. On the console, type: 
```bash
python3 $OPS_INSTALL_PATH/../ops_translator/ops-translator -I $OPS_INSTALL_PATH/c/include/ --file_paths laplace2d.cpp
```
We have provided a Makefile which can use several different compilers (gnu, intel, cray, pgi, clang, etc.); we suggest modifying it for your own applications. Try building CUDA, OpenMP, MPI+CUDA, MPI+OpenMP, and other versions of the code. You can take a look at the generated kernels for different parallelizations under the appropriate subfolders. 

If you add the `-OPS_DIAGS=2` runtime flag, at the end of execution, OPS will report timings and achieved bandwidth for each of your kernels. For more options, see [Runtime Flags and Options](https://ops-dsl.readthedocs.io/en/markdowndocdev/devanapp.html#runtime-flags-and-options).

## Code generated versions
OPS will generate and compile a large number of different versions.
* `laplace2d_dev_seq` and `laplace2d_dev_mpi`: these do not use code generation, they are intended for development only.
* `laplace2d_seq`: baseline sequential implementation.
* `laplace2d_openmp`: baseline OpenMP implementation.
* `laplace2d_tiled`: optimized implementation with OpenMP that improves spatial and temporal locality.
* `laplace2d_mpi`: MPI implementation.
* `laplace2d_mpi_tiled`: MPI implementation optimized for halo exchanges and spatial and temporal locality.
* `laplace2d_cuda`, `laplace2d_hip`, `laplace2d_ompoffload`, `laplace2d_sycl`: implementations targeting GPUs`**`.
  * `**Note**`: The generated code (CUDA, HIP, OpenMP Offload, SYCL) is dependent on the compilation environment setup. Ensure that the appropriate compiler flags, libraries, and toolchains for the desired parallelization model are correctly configured during the build process. MPI versions can also be generated when an appropriate MPI compiler (e.g., ``mpicc``, ``mpicxx``) is set in the build environment.


## Re-engineering Fortran Applications
Similar steps can be followed to re-engineer Fortran applications using the OPS API. A detailed, step-by-step conversion tutorial is available in the `OPS/apps/fortran/laplace2dtutorial` directory.

In addition to the standard CPU targets, OPS also enables GPU code generation for Fortran applications. Specifically, OPS can generate:

* CUDA Fortran kernels,
* CUDA C kernels via Fortran-to-C bindings, and
* HIP C kernels, also through Fortran-to-C bindings.

The automatic code generation framework that enables F90-to-CUDA/HIP translation is a distinctive capability of OPS, providing seamless support for GPU acceleration of Fortran applications. Despite the lack of direct HIP support for Fortran, OPS leverages Fortran-to-C bindings to generate HIP C kernels, ensuring portability across both NVIDIA and AMD GPU architectures. It establishes a robust and portable toolchain for deploying large-scale Fortran applications on modern heterogeneous architectures.

## Optimizations - general
Try the following performance tuning options:
* `laplace2d_cuda`: you can set the `OPS_BLOCK_SIZE_X`, `OPS_BLOCK_SIZE_Y`, and `OPS_BLOCK_SIZE_Z` runtime arguments to control thread block or work group sizes

## Optimizations - tiling

Tiling uses lazy execution: as parallel loops follow one another, they are not executed, but put in a queue, and only once some data needs to be returned to the user (e.g., result of a reduction) do these loops have to be executed.

With a chain of loops queued, OPS can analyze them together and come up with a tiled execution schedule.

This works over MPI as well: OPS extends the halo regions, and does one big halo exchange instead of several smaller ones. In the current `laplace2d` code, every stencil application loop is also doing a reduction, therefore only two loops are queued. Try modifying the code so the reduction only happens every 10 iterations! On a Xeon E5-2650, one can get a 2.5x speedup.

The following versions can be executed with the tiling optimizations.

* `laplace2d_tiled`, `laplace2d_mpi_tiled`: add the `OPS_TILING` runtime flag, and use `-OPS_DIAGS=3` to see the cache blocking tiling at work. For some applications, such as this one, the initial guess produces tiles that are too large; try setting `OPS_CACHE_SIZE` to a lower value (in MB, for the L3 cache size). Thread affinity control and using 1 process per socket is strongly recommended.
E.g.
```bash
OMP_NUM_THREADS=20 numactl --cpunodebind=0 ./laplace2d_tiled -OPS_DIAGS=3 OPS_TILING OPS_CACHE_SIZE=5
```
Over MPI, you will have to set `OPS_TILING_MAX_DEPTH` to extend halo regions.

To manually specify tile sizes (in number of grid points), use the `OPS_TILESIZE_X`, `OPS_TILESIZE_Y`, and `OPS_TILESIZE_Z` runtime arguments. E.g.
```bash
OMP_NUM_THREADS=20 numactl --cpunodebind=0 ./cloverleaf_tiled OPS_TILING OPS_TILESIZE_X=600 OPS_TILESIZE_Y=200
```

## Optimizations - GPU Direct

GPU Direct support for MPI+CUDA can be enabled by adding `-gpudirect` when running the executable. You may also need to set certain environment variables depending on the MPI distribution in use.

## Performance measurement and profiling

### Energy Consumption Measurement

OPS now provides support for reporting CPU and GPU energy consumption during the execution of an application. This is integrated into the existing `ops_timing_output()` routine.

When enabled, the following information is reported:

* Total CPU energy consumed (measured via RAPL), and the share consumed by DRAM
* Total GPU energy consumed

Example output:
```bash
Total CPU energy consumed (RAPL): 123.45 J, of which DRAM energy: 12.34 J
Total GPU energy consumed: 56.78 J
```

Usage
To measure energy only for the main computation (and exclude initialization/IO), the user should:

1. Reset power counters just before the main compute section:
```bash
OPS_instance->reset_power_counters();
```
2. Run the computation kernels.
3. Call timing output at the end to print results:
```bash
ops_timing_output(stdout);
```

Example

The CloverLeaf-3D application demonstrates this usage.
See: `apps/c/CloverLeaf_3D/clover_leaf.cpp`