# Developer Guide

This guide provides an overview of OPS internals for developers who wish to contribute to OPS, add new backends, or understand how the code generation and runtime library work.

## Architecture Overview

OPS consists of two main components:

1. **Code Generator** (`ops_translator/`): A Python-based source-to-source translator that parses user applications (using libclang for C++ and fparser2 for Fortran) and generates parallel code for various backends.
2. **Runtime Library** (`ops/c/` and `ops/fortran/`): Backend-specific implementations that handle data management, parallelization, and communication.

```
┌─────────────────────────────────────────────────────────────────────┐
│                         User Application                            │
│                    (ops_par_loop calls + kernels)                   │
└─────────────────────────────────────────────────────────────────────┘
                                  │
                                  ▼
┌─────────────────────────────────────────────────────────────────────┐
│                     Code Generator (ops_translator)                 │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────────────────┐  │
│  │   Parser    │───>│   Scheme    │───>│   Jinja2 Templates      │  │
│  │ (libclang/  │    │  (target    │    │  (loop_host, master_    │  │
│  │  fparser2)  │    │   logic)    │    │   kernel, etc.)         │  │
│  └─────────────┘    └─────────────┘    └─────────────────────────┘  │
└─────────────────────────────────────────────────────────────────────┘
                                  │
                                  ▼
┌─────────────────────────────────────────────────────────────────────┐
│                       Generated Parallel Code                       │
│      (CUDA, HIP, SYCL, OpenMP, OpenMP Offload + MPI variants)       │
└─────────────────────────────────────────────────────────────────────┘
                                  │
                                  ▼
┌─────────────────────────────────────────────────────────────────────┐
│                      Runtime Library (ops/c/src/)                   │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐             │
│  │  core/   │  │  cuda/   │  │  sycl/   │  │   mpi/   │    ...      │
│  └──────────┘  └──────────┘  └──────────┘  └──────────┘             │
└─────────────────────────────────────────────────────────────────────┘
```

---

## Code Generator (ops_translator)

The code generator is located in `ops_translator/ops-translator/` and uses Python with Clang bindings (libclang) for C++ parsing and fparser2 for Fortran parsing.

### Directory Structure

```
ops_translator/
├── ops-translator/          # Main translator code
│   ├── __main__.py          # Entry point & CLI argument handling
│   ├── scheme.py            # Code generation schemes (genLoopHost)
│   ├── target.py            # Target definitions (Cuda, Sycl, Hip, etc.)
│   ├── ops.py               # OPS constructs (Loop, Arg, Dat, etc.)
│   ├── store.py             # Application, Program, ParseError classes
│   ├── util.py              # Utilities, KernelProcess class
│   ├── language.py          # Language definitions (C++, Fortran)
│   ├── jinja_utils.py       # Jinja2 environment setup
│   ├── cpp/                 # C++ specific code
│   │   ├── parser.py        # Clang-based C++ parser
│   │   ├── schemes.py       # C++ target scheme implementations
│   │   └── translator/      # Kernel/program translators
│   └── fortran/             # Fortran specific code
├── resources/               # Code generation resources
│   └── templates/           # Jinja2 templates
│       ├── cpp/             # C++ templates
│       │   ├── loop_host.cpp.j2      # Base loop host template
│       │   ├── master_kernel.cpp.j2  # Master kernel file
│       │   ├── cuda/                 # CUDA-specific templates
│       │   ├── sycl/                 # SYCL-specific templates
│       │   ├── mpi_openmp/           # MPI+OpenMP templates
│       │   └── ...
│       └── fortran/         # Fortran templates
└── .python/                 # Python virtual environment for Makefile builds (generated by `make python`, not in version control)
                             # CMake builds create ops_venv/ under ${CMAKE_INSTALL_PREFIX}/translator/ops_translator/ instead
```

### Key Classes

#### Target (`target.py`)

Defines code generation targets and their configurations:

```python
class Target(Findable):
    name: str                    # Target identifier (e.g., "cuda", "sycl")
    kernel_translation: bool     # Whether kernel code needs transformation
    config: Dict[str, Any]       # Target-specific configuration
```

Available targets:

| Target Class    | Name             | Description               |
|-----------------|------------------|---------------------------|
| `MPIOpenMP`     | `mpi_openmp`     | CPU sequential/OpenMP     |
| `Cuda`          | `cuda`           | NVIDIA GPUs via CUDA      |
| `Hip`           | `hip`            | AMD GPUs via HIP          |
| `Sycl`          | `sycl`           | Intel/AMD/NVIDIA via SYCL |
| `OpenMPOffload` | `openmp_offload` | GPU via OpenMP target     |
| `F2CCuda`       | `f2c_cuda`       | Fortran-to-C CUDA         |
| `F2CHip`        | `f2c_hip`        | Fortran-to-C HIP          |
| `F2CSycl`       | `f2c_sycl`       | Fortran-to-C SYCL         |

#### Scheme (`scheme.py`)

Orchestrates code generation for a language/target combination:

```python
class Scheme(Findable):
    lang: Lang                   # Language (C++, Fortran)
    target: Target               # Target backend
    loop_host_template: Path     # Template for loop host code
    
    def genLoopHost(...) -> Tuple[str, str, str]:
        """Generate loop host code from template"""
        # 1. Translate kernel if needed
        # 2. Process kernel text (KernelProcess)
        # 3. Render Jinja2 template
        return (generated_code, extension, kernel_func)
```
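
The three steps in the `genLoopHost` comments can be sketched as a small pipeline. This is an illustration under stated assumptions, not the translator's implementation: `string.Template` from the standard library stands in for Jinja2, and `translate_kernel`, `process_kernel`, and `gen_loop_host_sketch` are hypothetical names.

```python
from string import Template
from typing import Tuple

# Stand-in for a Jinja2 loop host template (string.Template, not Jinja2).
LOOP_HOST_TEMPLATE = Template(
    "// host wrapper for $name\n"
    "$kernel_func\n"
    "void ops_par_loop_$name() { /* launch $name */ }\n"
)


def translate_kernel(kernel_func: str) -> str:
    """Hypothetical step 1: backend-specific rewrite (here: prefix 'inline')."""
    return "inline " + kernel_func


def process_kernel(kernel_func: str) -> str:
    """Hypothetical step 2: text clean-up (here: strip trailing whitespace)."""
    return "\n".join(line.rstrip() for line in kernel_func.splitlines())


def gen_loop_host_sketch(
    name: str, kernel_func: str, kernel_translation: bool, extension: str
) -> Tuple[str, str, str]:
    # 1. Translate the kernel only if the target asks for it.
    if kernel_translation:
        kernel_func = translate_kernel(kernel_func)
    # 2. Process the kernel text.
    kernel_func = process_kernel(kernel_func)
    # 3. Render the template.
    code = LOOP_HOST_TEMPLATE.substitute(name=name, kernel_func=kernel_func)
    return code, extension, kernel_func
```

The return value mirrors the `(generated_code, extension, kernel_func)` tuple described above.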

#### KernelProcess (`util.py`)

Handles kernel text transformations for different backends:

```python
class KernelProcess:
    # Method summary only: bodies are elided, and instance methods (`self`) are assumed.
    def clean_kernel_func_text(self, kernel_func): ...         # Remove OPS-specific markers
    def cuda_complex_numbers(self, kernel_func): ...           # Handle complex number support
    def sycl_kernel_func_text(self, kernel_func, consts): ...  # SYCL-specific transforms
    def get_kernel_body_and_arg_list(self, kernel_func): ...   # Extract body and args
```
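
To make "extract body and args" concrete, here is a minimal sketch of splitting a kernel function's text into its body and argument names. `split_kernel` is a hypothetical helper, not the `KernelProcess` method, and it assumes a single function whose parameter list contains no nested parentheses or commas inside template arguments.

```python
import re
from typing import List, Tuple


def split_kernel(kernel_func: str) -> Tuple[str, List[str]]:
    """Return (body, argument names) for one C/C++ function definition."""
    open_paren = kernel_func.index("(")
    close_paren = kernel_func.index(")", open_paren)
    params = kernel_func[open_paren + 1:close_paren]

    # The argument name is the last identifier of each parameter.
    names = []
    for param in params.split(","):
        identifiers = re.findall(r"[A-Za-z_]\w*", param)
        if identifiers:
            names.append(identifiers[-1])

    # The body is the text between the first '{' and the last '}'.
    body_start = kernel_func.index("{", close_paren) + 1
    body_end = kernel_func.rindex("}")
    return kernel_func[body_start:body_end].strip(), names
```

A production implementation would need a real parser for the general case; the string handling here only covers simple signatures.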

#### Parser (`cpp/parser.py`)

Uses libclang to parse C++ source files:

```python
def parseLoops(translation_unit, program) -> None:
    """Parse ops_par_loop calls from C++ source"""
    # Find macro instantiations and function calls
    # Extract loop information (kernel, block, range, arguments)
```
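
The kind of information extracted per loop can be illustrated with a plain-text stand-in. This sketch does not use libclang; it scans for `ops_par_loop(` and splits the call's arguments on top-level commas. The assumed argument order (kernel, name string, block, dimension, range, then data arguments) and the helper names are assumptions for illustration, and parentheses or commas inside string literals are not handled.

```python
from typing import Dict, List


def split_top_level(text: str) -> List[str]:
    """Split on commas that are not nested inside parentheses."""
    parts, current, depth = [], [], 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    parts.append("".join(current).strip())
    return parts


def find_par_loops(source: str) -> List[Dict[str, object]]:
    """Collect kernel, block, range, and args for each ops_par_loop call."""
    loops = []
    marker = "ops_par_loop("
    pos = source.find(marker)
    while pos != -1:
        start = pos + len(marker)
        depth, end = 1, start
        # Walk forward to the parenthesis that closes the call
        # (assumes the parentheses in the source are balanced).
        while depth > 0:
            if source[end] == "(":
                depth += 1
            elif source[end] == ")":
                depth -= 1
            end += 1
        fields = split_top_level(source[start:end - 1])
        loops.append(
            {"kernel": fields[0], "block": fields[2],
             "range": fields[4], "args": fields[5:]}
        )
        pos = source.find(marker, end)
    return loops
```

Working on the Clang AST, as the real parser does, avoids the string-literal and macro pitfalls of this text-based approach.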

### Jinja2 Templates

Templates use Jinja2 syntax with OPS-specific filters and tests. Key template variables:

| Variable           | Description                                      |
|--------------------|--------------------------------------------------|
| `lh`               | Loop host object (kernel name, args, ndim, etc.) |
| `kernel_func`      | Original kernel function text                    |
| `kernel_body`      | Extracted kernel body                            |
| `args_list`        | Argument name list                               |
| `target`           | Current target object                            |
| `consts_in_kernel` | Constants used in kernel                         |

Example template structure (`loop_host.cpp.j2`):
```jinja2
{% block host_prologue %}
    // Setup code: args, dimensions, pointers
{% endblock %}

{% block kernel_call %}
    // Parallel launch code (varies by target)
{% endblock %}

{% block host_epilogue %}
    // Cleanup, timing, diagnostics
{% endblock %}
```

### Adding a New Backend

To add a new backend (e.g., "newgpu"):

1. **Define Target** in `target.py`:
```python
class NewGPU(Target):
    name = "newgpu"
    kernel_translation = True
    config = {"grouped": True, "device": 11}

Target.register(NewGPU)
```

2. **Create Scheme** in `cpp/schemes.py`:
```python
class CppNewGPU(CppScheme):
    target = NewGPU()
    loop_host_template = Path("cpp/newgpu/loop_host.cpp.j2")
    master_kernel_template = Path("cpp/newgpu/master_kernel.cpp.j2")
    loop_kernel_extension = "newgpu.cpp"

Scheme.register(CppNewGPU)
```

3. **Create Templates** in `resources/templates/cpp/newgpu/`:
   - `loop_host.cpp.j2` - Loop host wrapper
   - `master_kernel.cpp.j2` - Master include file

4. **Add Runtime Support** in `ops/c/src/newgpu/` (if needed)

5. **Update Makefiles** in the `makefiles/` directory
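
While working through the steps above, a small check can confirm that the template files from step 3 are in place. This is a hypothetical helper, not part of the translator; `translator_root` is assumed to point at the `ops_translator/` directory shown in the tree earlier.

```python
from pathlib import Path
from typing import List

# Template files named in step 3 of the list above.
REQUIRED_TEMPLATES = ("loop_host.cpp.j2", "master_kernel.cpp.j2")


def missing_backend_files(translator_root: Path, backend: str) -> List[Path]:
    """Return the required template paths that do not exist yet."""
    template_dir = translator_root / "resources" / "templates" / "cpp" / backend
    return [
        template_dir / name
        for name in REQUIRED_TEMPLATES
        if not (template_dir / name).is_file()
    ]
```

For example, `missing_backend_files(Path("ops_translator"), "newgpu")` returns an empty list once both templates have been created.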

---

## Runtime Library (ops/c/)

The runtime library provides backend implementations for data management, parallel execution, and communication.

### Directory Structure

```
ops/c/
├── include/                 # Public headers
│   ├── ops_lib_core.h       # Core OPS API
│   ├── ops_seq.h            # Sequential backend header
│   ├── ops_cuda.h           # CUDA backend header
│   ├── ops_hip.h            # HIP backend header
│   ├── ops_sycl.h           # SYCL backend header
│   └── ...
├── src/                     # Implementation
│   ├── core/                # Core library (shared across backends)
│   │   ├── ops_lib_core.cpp # Core API implementation
│   │   ├── ops_lazy.cpp     # Lazy execution & tiling
│   │   └── ops_instance.cpp # OPS instance management
│   ├── sequential/          # Sequential backend
│   ├── cuda/                # CUDA backend
│   ├── hip/                 # HIP backend
│   ├── sycl/                # SYCL backend
│   ├── mpi/                 # MPI support for all backends
│   │   ├── ops_mpi_core.cpp
│   │   ├── ops_mpi_partition.cpp  # Domain decomposition
│   │   ├── ops_mpi_rt_support_cuda.cpp
│   │   ├── ops_mpi_rt_support_sycl.cpp
│   │   └── ...
│   ├── ompoffload/          # OpenMP offload backend
│   └── tridiag/             # Tridiagonal solver support
└── lib/                     # Compiled libraries
```

### Core Components

#### ops_lib_core.cpp
- `ops_init()` / `ops_exit()` - Initialization and cleanup
- `ops_decl_block()` - Block declaration
- `ops_decl_dat()` - Dataset declaration
- `ops_decl_stencil()` - Stencil declaration
- `ops_partition()` - MPI partitioning trigger

#### ops_lazy.cpp
- Lazy execution queue management
- Tiling plan computation
- Communication-avoiding optimizations
- Key structures: `ops_kernel_list`, `tiling_plan`
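
The idea behind a lazy execution queue can be sketched in a few lines of Python. This is a conceptual illustration only: the runtime implements it in C++, and `LazyLoopQueue` and its methods are hypothetical names that do not model tiling plans or `ops_kernel_list`.

```python
from typing import Callable, List, Tuple


class LazyLoopQueue:
    """Hypothetical queue: loops are recorded first and run only on flush()."""

    def __init__(self) -> None:
        self._pending: List[Tuple[str, Callable[[], None]]] = []
        self.executed: List[str] = []

    def enqueue(self, name: str, loop: Callable[[], None]) -> None:
        # Record the loop instead of running it immediately.
        self._pending.append((name, loop))

    def flush(self) -> None:
        # Run queued loops in submission order, e.g. when results are needed.
        for name, loop in self._pending:
            loop()
            self.executed.append(name)
        self._pending.clear()


data = [1.0, 2.0, 3.0]


def scale() -> None:
    data[:] = [2 * x for x in data]


def shift() -> None:
    data[:] = [x + 1 for x in data]


queue = LazyLoopQueue()
queue.enqueue("scale", scale)
queue.enqueue("shift", shift)
```

Nothing touches `data` until `queue.flush()` is called. Deferring execution this way is what gives a runtime the chance to inspect a whole sequence of loops before running them.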

#### MPI Support (ops/c/src/mpi/)
- Domain decomposition (`ops_mpi_partition.cpp`)
- Halo exchange management
- Backend-specific MPI+GPU support:
  - `ops_mpi_rt_support_cuda.cpp` - CUDA+MPI
  - `ops_mpi_rt_support_sycl.cpp` - SYCL+MPI
  - `ops_mpi_rt_support_hip.cpp` - HIP+MPI
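
As a conceptual illustration of domain decomposition and halos, the sketch below splits a 1D index range evenly across ranks and then widens one rank's range by a halo depth. It is a generic block decomposition, not the algorithm in `ops_mpi_partition.cpp`; the function names are hypothetical.

```python
from typing import List, Tuple


def decompose_1d(size: int, nranks: int) -> List[Tuple[int, int]]:
    """Split [0, size) into nranks contiguous half-open ranges."""
    base, extra = divmod(size, nranks)
    ranges = []
    start = 0
    for rank in range(nranks):
        # The first `extra` ranks take one additional point.
        count = base + (1 if rank < extra else 0)
        ranges.append((start, start + count))
        start += count
    return ranges


def with_halo(owned: Tuple[int, int], halo: int, size: int) -> Tuple[int, int]:
    """Extend an owned range by `halo` points, clipped to the global domain."""
    lo, hi = owned
    return max(0, lo - halo), min(size, hi + halo)
```

The points in the extended range that a rank does not own are the ones a halo exchange would have to fetch from its neighbours.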

### Adding Runtime Support for a New Backend

1. **Create backend directory**: `ops/c/src/newgpu/`
2. **Implement required functions**:
   - Device memory allocation/deallocation
   - Data transfer (host ↔ device)
   - Kernel launch wrappers
3. **Add MPI support** (if needed): `ops/c/src/mpi/ops_mpi_rt_support_newgpu.cpp`
4. **Update build system**:
   - Add `makefiles/Makefile.newgpu`
   - Update `CMakeLists.txt`

---

## Build System

OPS supports two build systems: **CMake** (recommended) and **Makefiles**. Both produce the same set of backend libraries and application binaries. For full build instructions see [installation.md](installation.md).

### CMake Build System

The top-level [CMakeLists.txt](../CMakeLists.txt) orchestrates the entire build: compiler detection, dependency discovery, backend library compilation, translator installation, and optional application builds.

#### Key CMake Options

| Option                  | Default          | Description                                          |
|-------------------------|------------------|------------------------------------------------------|
| `CMAKE_BUILD_TYPE`      | —                | Build type: `Release`, `Debug`, or empty for default |
| `BUILD_OPS_CXX`         | `ON`             | Build the C/C++ backend libraries                    |
| `BUILD_OPS_FORTRAN`     | `OFF`            | Build the Fortran backend libraries                  |
| `BUILD_OPS_APPS`        | `OFF`            | Build sample applications (library CMake only)       |
| `OPS_TEST`              | `OFF`            | Enable CTest-based tests                             |
| `OPS_HIP`               | `OFF`            | Enable the HIP backend                               |
| `LEGACY_CODEGEN`        | `OFF`            | Use the legacy code generator                        |
| `ENABLE_IEEE`           | `OFF`            | Enable strict IEEE floating-point flags              |
| `OPS_VERBOSE_WARNING`   | `OFF`            | Show verbose output during build                     |
| `CMAKE_INSTALL_PREFIX`  | `/usr/local`     | Library installation directory                       |
| `APP_INSTALL_DIR`       | `$HOME/OPS-APPS` | Application installation directory                   |
| `OPS_INSTALL_DIR`       | —                | Path to installed OPS library (app CMake only)       |
| `GPU_NUMBER`            | —                | Number of GPUs for tests                             |
| `GPU_ARCH`              | `70`             | CUDA compute capability (e.g., `80` for A100)        |
| `LIBTRID_PATH`          | —                | Path to tridiagonal solver library (optional)        |

#### Dependency Detection

CMake automatically discovers the following dependencies:

- MPI (`find_package(MPI)`)
- HDF5 (`find_package(HDF5)`)
- CUDA (`find_package(CUDAToolkit)`)
- OpenMP (`find_package(OpenMP)`)
- HIP (when `OPS_HIP=ON`)
- Python 3.8+

The translator's Python virtual environment is set up differently depending on the build system:

- **CMake build**: `ops_translator/CMakeLists.txt` copies the translator tree to `${CMAKE_INSTALL_PREFIX}/translator/ops_translator/` and runs `setup_venv_cmake.sh` directly (using `python3 -m venv`) to create the venv at `${CMAKE_INSTALL_PREFIX}/translator/ops_translator/ops_venv/`. It does **not** call `make python`.
- **Makefile build**: `make python` inside `ops_translator/` creates the venv under `ops_translator/.python/`.
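
The two virtual environment locations can be summarised in a small helper, which may be handy in tooling that needs to locate the translator's interpreter. `translator_venv_path` is a hypothetical function; only the path layouts come from the description above.

```python
from pathlib import PurePosixPath


def translator_venv_path(
    build_system: str, install_prefix: str = "/usr/local", source_root: str = "."
) -> PurePosixPath:
    """Return where the translator venv is expected for each build system."""
    if build_system == "cmake":
        # ${CMAKE_INSTALL_PREFIX}/translator/ops_translator/ops_venv/
        return (
            PurePosixPath(install_prefix)
            / "translator" / "ops_translator" / "ops_venv"
        )
    if build_system == "make":
        # ops_translator/.python/ inside the source tree
        return PurePosixPath(source_root) / "ops_translator" / ".python"
    raise ValueError(f"unknown build system: {build_system!r}")
```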

#### Build Structure

```
CMakeLists.txt                  # Top-level: compiler flags, dependencies, options
├── ops_translator/CMakeLists.txt   # Installs translator + sets up Python venv
├── ops/c/CMakeLists.txt            # Backend libraries (ops_seq, ops_cuda, ops_mpi, etc.)
├── ops/fortran/CMakeLists.txt      # Fortran backend libraries
├── apps/c/CMakeLists.txt           # C/C++ example applications
│   ├── CloverLeaf/CMakeLists.txt
│   ├── shsgc/CMakeLists.txt
│   └── ...
└── apps/fortran/CMakeLists.txt     # Fortran example applications
```

#### Library Targets

The CMake build in `ops/c/CMakeLists.txt` produces these library targets:

| CMake Target     | Condition                | Description           |
|------------------|--------------------------|-----------------------|
| `ops_seq`        | Always                   | Sequential + OpenMP   |
| `ops_cuda`       | `CUDAToolkit_FOUND`      | CUDA single-node      |
| `ops_hip`        | `OPS_HIP` + `HIP_FOUND`  | HIP single-node       |
| `ops_ompoffload` | NVHPC compiler + CUDA    | OpenMP Offload        |
| `ops_mpi`        | `MPI_FOUND`              | MPI + sequential      |
| `ops_mpi_cuda`   | MPI + CUDA               | MPI + CUDA            |
| `ops_mpi_hip`    | MPI + HIP                | MPI + HIP             |
| `ops_hdf5_seq`   | `HDF5_FOUND`             | HDF5 I/O (sequential) |
| `ops_hdf5_mpi`   | HDF5 + MPI               | HDF5 I/O (MPI)        |

All libraries are installed under `${CMAKE_INSTALL_PREFIX}/lib` with CMake export files at `${CMAKE_INSTALL_PREFIX}/lib/cmake`, allowing downstream projects to use `find_package(OPS)`.
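
The conditions in the library target table can be read as a simple decision function. The sketch below restates the table in Python for illustration; it is not generated from `ops/c/CMakeLists.txt`, and the keyword names are hypothetical (`hip` stands for `OPS_HIP` being on and HIP being found).

```python
from typing import List


def enabled_library_targets(
    *, cuda: bool = False, hip: bool = False, mpi: bool = False,
    hdf5: bool = False, nvhpc: bool = False
) -> List[str]:
    """List the CMake library targets expected for the detected features."""
    targets = ["ops_seq"]  # always built
    if cuda:
        targets.append("ops_cuda")
    if hip:
        targets.append("ops_hip")
    if nvhpc and cuda:
        targets.append("ops_ompoffload")
    if mpi:
        targets.append("ops_mpi")
    if mpi and cuda:
        targets.append("ops_mpi_cuda")
    if mpi and hip:
        targets.append("ops_mpi_hip")
    if hdf5:
        targets.append("ops_hdf5_seq")
    if hdf5 and mpi:
        targets.append("ops_hdf5_mpi")
    return targets
```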

#### Typical Build Workflow

```bash
# Build everything (library + apps)
mkdir build && cd build
cmake .. -DBUILD_OPS_APPS=ON -DCMAKE_INSTALL_PREFIX=$HOME/OPS-INSTALL \
         -DAPP_INSTALL_DIR=$HOME/OPS-APPS -DGPU_NUMBER=1
make
make install

# Or build library and apps separately
mkdir build && cd build
cmake .. -DCMAKE_INSTALL_PREFIX=$HOME/OPS-INSTALL
make && make install

mkdir appbuild && cd appbuild
cmake ../../apps/c -DOPS_INSTALL_DIR=$HOME/OPS-INSTALL -DAPP_INSTALL_DIR=$HOME/OPS-APPS
make
```

#### Adding a New Backend to CMake

1. Add a new library target in `ops/c/CMakeLists.txt` (following the pattern of existing backends)
2. Add conditional detection logic in the top-level `CMakeLists.txt` if a new dependency is needed
3. Use the `installtarget()` macro to register the target for installation and export
4. Add MPI variant if applicable (create `ops_mpi_<backend>` target)

### Makefile System

The Makefile-based build uses modular includes:

```
makefiles/
├── Makefile.common          # Common flags and definitions
├── Makefile.c_app           # Main C application makefile
├── Makefile.cuda            # CUDA-specific flags
├── Makefile.hip             # HIP-specific flags
├── Makefile.sycl            # SYCL flags (via Makefile.icx)
├── Makefile.mpi             # MPI flags
└── Makefile.<compiler>      # Compiler-specific settings
```

### Build Targets

For an application named `APP`, the Makefile build generates the following targets:

| Target                   | Description                          |
|--------------------------|--------------------------------------|
| `$(APP)_dev_seq`         | Development sequential (no code-gen) |
| `$(APP)_dev_mpi`         | Development MPI (no code-gen)        |
| `$(APP)_seq`             | Sequential with generated kernels    |
| `$(APP)_openmp`          | OpenMP parallel                      |
| `$(APP)_mpi`             | MPI distributed                      |
| `$(APP)_mpi_openmp`      | MPI + OpenMP hybrid                  |
| `$(APP)_tiled`           | Lazy execution with tiling           |
| `$(APP)_cuda`            | CUDA single GPU                      |
| `$(APP)_mpi_cuda`        | MPI + CUDA                           |
| `$(APP)_sycl`            | SYCL single device                   |
| `$(APP)_mpi_sycl`        | MPI + SYCL                           |
| `$(APP)_hip`             | HIP single GPU                       |
| `$(APP)_mpi_hip`         | MPI + HIP                            |
| `$(APP)_ompoffload`      | OpenMP Offload single GPU            |
| `$(APP)_mpi_ompoffload`  | MPI + OpenMP Offload                 |

---

## Debugging Tips

### Code Generator Debugging

```bash
# Verbose output
python3 ops-translator -v --file_paths source.cpp

# Dump parsed structure as JSON
python3 ops-translator -d --file_paths source.cpp

# Target specific backend only
python3 ops-translator -t cuda --file_paths source.cpp
```
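
When the translator is driven from a script, the same invocations can be assembled programmatically. This sketch only builds the argument list and uses just the flags shown above (`-v`, `-d`, `-t`, `--file_paths`); `translator_command` is a hypothetical helper, and the result is meant to be passed to something like `subprocess.run`.

```python
from typing import List, Optional, Sequence


def translator_command(
    file_paths: Sequence[str],
    target: Optional[str] = None,
    verbose: bool = False,
    dump: bool = False,
) -> List[str]:
    """Build the argv list for a translator run."""
    cmd = ["python3", "ops-translator"]
    if verbose:
        cmd.append("-v")           # verbose output
    if dump:
        cmd.append("-d")           # dump parsed structure as JSON
    if target is not None:
        cmd.extend(["-t", target])  # restrict to one backend
    cmd.append("--file_paths")
    cmd.extend(file_paths)
    return cmd
```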

### Runtime Debugging

```bash
# Enable diagnostics
./app_cuda -OPS_DIAGS=2

# Check block decomposition (MPI)
./app_mpi_cuda -OPS_DIAGS=2

# Timing breakdown: this is not a shell command. Call it from the
# application's C/C++ source and rebuild:
#   ops_timing_output(stdout);
```

### Common Issues

| Issue                            | Cause                             | Solution                      |
|----------------------------------|-----------------------------------|-------------------------------|
| `GET_MACRO` redefined            | Name collision with Intel headers | Harmless warning, ignore      |
| `printf` in SYCL kernel          | Variadic functions not allowed    | Guard with `#ifndef OPS_SYCL` |
| Preprocessor directives stripped | Code generator limitation         | Use runtime conditionals      |

---

## Contributing

To contribute to OPS, follow these steps:
1. Clone the [OPS](https://github.com/OP-DSL/OPS) repository on your local system.
2. Create a new branch in your cloned repository.
3. Make changes or contributions in your new branch.
4. Submit your changes by creating a pull request to the `develop` branch of the OPS repository.

Contributions in the `develop` branch will be merged into the `master` branch when a new release is created.