# Developer Guide

This guide provides an overview of OPS internals for developers who wish to contribute to OPS, add new backends, or understand how the code generation and runtime library work.

## Architecture Overview

OPS consists of two main components:

1. **Code Generator** (`ops_translator/`): A Python-based source-to-source translator that parses user applications (using libclang for C++ and fparser2 for Fortran) and generates parallel code for various backends.
2. **Runtime Library** (`ops/c/` and `ops/fortran/`): Backend-specific implementations that handle data management, parallelization, and communication.

```
┌─────────────────────────────────────────────────────────────────────┐
│                          User Application                           │
│                   (ops_par_loop calls + kernels)                    │
└─────────────────────────────────────────────────────────────────────┘
                                   │
                                   ▼
┌─────────────────────────────────────────────────────────────────────┐
│                   Code Generator (ops_translator)                   │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────────────────┐  │
│  │   Parser    │───>│   Scheme    │───>│    Jinja2 Templates     │  │
│  │ (libclang/  │    │  (target    │    │  (loop_host, master_    │  │
│  │  fparser2)  │    │   logic)    │    │   kernel, etc.)         │  │
│  └─────────────┘    └─────────────┘    └─────────────────────────┘  │
└─────────────────────────────────────────────────────────────────────┘
                                   │
                                   ▼
┌─────────────────────────────────────────────────────────────────────┐
│                       Generated Parallel Code                       │
│      (CUDA, HIP, SYCL, OpenMP, OpenMP Offload + MPI variants)       │
└─────────────────────────────────────────────────────────────────────┘
                                   │
                                   ▼
┌─────────────────────────────────────────────────────────────────────┐
│                    Runtime Library (ops/c/src/)                     │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐             │
│  │  core/   │  │  cuda/   │  │  sycl/   │  │  mpi/    │  ...        │
│  └──────────┘  └──────────┘  └──────────┘  └──────────┘             │
└─────────────────────────────────────────────────────────────────────┘
```

---

## Code Generator (ops_translator)

The code generator is located in `ops_translator/ops-translator/` and uses Python with Clang bindings (libclang) for C++ parsing and fparser2 for Fortran parsing.

### Directory Structure

```
ops_translator/
├── ops-translator/          # Main translator code
│   ├── __main__.py          # Entry point & CLI argument handling
│   ├── scheme.py            # Code generation schemes (genLoopHost)
│   ├── target.py            # Target definitions (Cuda, Sycl, Hip, etc.)
│   ├── ops.py               # OPS constructs (Loop, Arg, Dat, etc.)
│   ├── store.py             # Application, Program, ParseError classes
│   ├── util.py              # Utilities, KernelProcess class
│   ├── language.py          # Language definitions (C++, Fortran)
│   ├── jinja_utils.py       # Jinja2 environment setup
│   ├── cpp/                 # C++ specific code
│   │   ├── parser.py        # Clang-based C++ parser
│   │   ├── schemes.py       # C++ target scheme implementations
│   │   └── translator/      # Kernel/program translators
│   └── fortran/             # Fortran specific code
├── resources/               # Code generation resources
│   └── templates/           # Jinja2 templates
│       ├── cpp/             # C++ templates
│       │   ├── loop_host.cpp.j2      # Base loop host template
│       │   ├── master_kernel.cpp.j2  # Master kernel file
│       │   ├── cuda/                 # CUDA-specific templates
│       │   ├── sycl/                 # SYCL-specific templates
│       │   ├── mpi_openmp/           # MPI+OpenMP templates
│       │   └── ...
│       └── fortran/         # Fortran templates
└── .python/                 # Python virtual environment for Makefile builds (generated by `make python`, not in version control)
                             # CMake builds create ops_venv/ under ${CMAKE_INSTALL_PREFIX}/translator/ops_translator/ instead
```

### Key Classes

#### Target (`target.py`)

Defines code generation targets and their configurations:

```python
class Target(Findable):
    name: str                 # Target identifier (e.g., "cuda", "sycl")
    kernel_translation: bool  # Whether kernel code needs transformation
    config: Dict[str, Any]    # Target-specific configuration
```

Available targets:

| Target Class    | Name             | Description               |
|-----------------|------------------|---------------------------|
| `MPIOpenMP`     | `mpi_openmp`     | CPU sequential/OpenMP     |
| `Cuda`          | `cuda`           | NVIDIA GPUs via CUDA      |
| `Hip`           | `hip`            | AMD GPUs via HIP          |
| `Sycl`          | `sycl`           | Intel/AMD/NVIDIA via SYCL |
| `OpenMPOffload` | `openmp_offload` | GPU via OpenMP target     |
| `F2CCuda`       | `f2c_cuda`       | Fortran-to-C CUDA         |
| `F2CHip`        | `f2c_hip`        | Fortran-to-C HIP          |
| `F2CSycl`       | `f2c_sycl`       | Fortran-to-C SYCL         |

#### Scheme (`scheme.py`)

Orchestrates code generation for a language/target combination:

```python
class Scheme(Findable):
    lang: Lang                 # Language (C++, Fortran)
    target: Target             # Target backend
    loop_host_template: Path   # Template for loop host code

    def genLoopHost(...) -> Tuple[str, str, str]:
        """Generate loop host code from template"""
        # 1. Translate kernel if needed
        # 2. Process kernel text (KernelProcess)
        # 3. Render Jinja2 template
        return (generated_code, extension, kernel_func)
```

#### KernelProcess (`util.py`)

Handles kernel text transformations for different backends:

```python
class KernelProcess:
    def clean_kernel_func_text(kernel_func)         # Remove OPS-specific markers
    def cuda_complex_numbers(kernel_func)           # Handle complex number support
    def sycl_kernel_func_text(kernel_func, consts)  # SYCL-specific transforms
    def get_kernel_body_and_arg_list(kernel_func)   # Extract body and args
```

#### Parser (`cpp/parser.py`)

Uses libclang to parse C++ source files:

```python
def parseLoops(translation_unit, program) -> None:
    """Parse ops_par_loop calls from C++ source"""
    # Find macro instantiations and function calls
    # Extract loop information (kernel, block, range, arguments)
```

### Jinja2 Templates

Templates use Jinja2 syntax with OPS-specific filters and tests. Key template variables:

| Variable           | Description                                      |
|--------------------|--------------------------------------------------|
| `lh`               | Loop host object (kernel name, args, ndim, etc.) |
| `kernel_func`      | Original kernel function text                    |
| `kernel_body`      | Extracted kernel body                            |
| `args_list`        | Argument name list                               |
| `target`           | Current target object                            |
| `consts_in_kernel` | Constants used in kernel                         |

Example template structure (`loop_host.cpp.j2`):

```jinja2
{% block host_prologue %}
// Setup code: args, dimensions, pointers
{% endblock %}

{% block kernel_call %}
// Parallel launch code (varies by target)
{% endblock %}

{% block host_epilogue %}
// Cleanup, timing, diagnostics
{% endblock %}
```

### Adding a New Backend

To add a new backend (e.g., "newgpu"):

1. **Define Target** in `target.py`:

   ```python
   class NewGPU(Target):
       name = "newgpu"
       kernel_translation = True
       config = {"grouped": True, "device": 11}

   Target.register(NewGPU)
   ```

2. **Create Scheme** in `cpp/schemes.py`:

   ```python
   class CppNewGPU(CppScheme):
       target = NewGPU()
       loop_host_template = Path("cpp/newgpu/loop_host.cpp.j2")
       master_kernel_template = Path("cpp/newgpu/master_kernel.cpp.j2")
       loop_kernel_extension = "newgpu.cpp"

   Scheme.register(CppNewGPU)
   ```

3. **Create Templates** in `resources/templates/cpp/newgpu/`:
   - `loop_host.cpp.j2` - Loop host wrapper
   - `master_kernel.cpp.j2` - Master include file

4. **Add Runtime Support** in `ops/c/src/newgpu/` (if needed)

5. **Update Makefiles** in `makefiles/` directory

---

## Runtime Library (ops/c/)

The runtime library provides backend implementations for data management, parallel execution, and communication.

### Directory Structure

```
ops/c/
├── include/                    # Public headers
│   ├── ops_lib_core.h          # Core OPS API
│   ├── ops_seq.h               # Sequential backend header
│   ├── ops_cuda.h              # CUDA backend header
│   ├── ops_hip.h               # HIP backend header
│   ├── ops_sycl.h              # SYCL backend header
│   └── ...
├── src/                        # Implementation
│   ├── core/                   # Core library (shared across backends)
│   │   ├── ops_lib_core.cpp    # Core API implementation
│   │   ├── ops_lazy.cpp        # Lazy execution & tiling
│   │   └── ops_instance.cpp    # OPS instance management
│   ├── sequential/             # Sequential backend
│   ├── cuda/                   # CUDA backend
│   ├── hip/                    # HIP backend
│   ├── sycl/                   # SYCL backend
│   ├── mpi/                    # MPI support for all backends
│   │   ├── ops_mpi_core.cpp
│   │   ├── ops_mpi_partition.cpp       # Domain decomposition
│   │   ├── ops_mpi_rt_support_cuda.cpp
│   │   ├── ops_mpi_rt_support_sycl.cpp
│   │   └── ...
│   ├── ompoffload/             # OpenMP offload backend
│   └── tridiag/                # Tridiagonal solver support
└── lib/                        # Compiled libraries
```

### Core Components

#### ops_lib_core.cpp

- `ops_init()` / `ops_exit()` - Initialization and cleanup
- `ops_decl_block()` - Block declaration
- `ops_decl_dat()` - Dataset declaration
- `ops_decl_stencil()` - Stencil declaration
- `ops_partition()` - MPI partitioning trigger

#### ops_lazy.cpp

- Lazy execution queue management
- Tiling plan computation
- Communication-avoiding optimizations
- Key structures: `ops_kernel_list`, `tiling_plan`

#### MPI Support (ops/c/src/mpi/)

- Domain decomposition (`ops_mpi_partition.cpp`)
- Halo exchange management
- Backend-specific MPI+GPU support:
  - `ops_mpi_rt_support_cuda.cpp` - CUDA+MPI
  - `ops_mpi_rt_support_sycl.cpp` - SYCL+MPI
  - `ops_mpi_rt_support_hip.cpp` - HIP+MPI

### Adding Runtime Support for a New Backend

1. **Create backend directory**: `ops/c/src/newgpu/`
2. **Implement required functions**:
   - Device memory allocation/deallocation
   - Data transfer (host ↔ device)
   - Kernel launch wrappers
3. **Add MPI support** (if needed): `ops/c/src/mpi/ops_mpi_rt_support_newgpu.cpp`
4. **Update build system**:
   - Add `makefiles/Makefile.newgpu`
   - Update `CMakeLists.txt`

---

## Build System

OPS supports two build systems: **CMake** (recommended) and **Makefiles**. Both produce the same set of backend libraries and application binaries. For full build instructions see [installation.md](installation.md).

### CMake Build System

The top-level [CMakeLists.txt](../CMakeLists.txt) orchestrates the entire build: compiler detection, dependency discovery, backend library compilation, translator installation, and optional application builds.

#### Key CMake Options

| Option                  | Default          | Description                                          |
|-------------------------|------------------|------------------------------------------------------|
| `CMAKE_BUILD_TYPE`      | —                | Build type: `Release`, `Debug`, or empty for default |
| `BUILD_OPS_CXX`         | `ON`             | Build the C/C++ backend libraries                    |
| `BUILD_OPS_FORTRAN`     | `OFF`            | Build the Fortran backend libraries                  |
| `BUILD_OPS_APPS`        | `OFF`            | Build sample applications (library CMake only)       |
| `OPS_TEST`              | `OFF`            | Enable CTest-based tests                             |
| `OPS_HIP`               | `OFF`            | Enable the HIP backend                               |
| `LEGACY_CODEGEN`        | `OFF`            | Use the legacy code generator                        |
| `ENABLE_IEEE`           | `OFF`            | Enable strict IEEE floating-point flags              |
| `OPS_VERBOSE_WARNING`   | `OFF`            | Show verbose output during build                     |
| `CMAKE_INSTALL_PREFIX`  | `/usr/local`     | Library installation directory                       |
| `APP_INSTALL_DIR`       | `$HOME/OPS-APPS` | Application installation directory                   |
| `OPS_INSTALL_DIR`       | —                | Path to installed OPS library (app CMake only)       |
| `GPU_NUMBER`            | —                | Number of GPUs for tests                             |
| `GPU_ARCH`              | `70`             | CUDA compute capability (e.g., `80` for A100)        |
| `LIBTRID_PATH`          | —                | Path to tridiagonal solver library (optional)        |

#### Dependency Detection

CMake automatically discovers: MPI (`find_package(MPI)`), HDF5 (`find_package(HDF5)`), CUDA (`find_package(CUDAToolkit)`), OpenMP (`find_package(OpenMP)`), HIP (when `OPS_HIP=ON`), and Python 3.8+.

The translator's Python virtual environment is set up differently depending on the build system. For the **CMake build**, `ops_translator/CMakeLists.txt` copies the translator tree to `${CMAKE_INSTALL_PREFIX}/translator/ops_translator/` and runs `setup_venv_cmake.sh` directly (using `python3 -m venv`) to create the venv at `${CMAKE_INSTALL_PREFIX}/translator/ops_translator/ops_venv/`; it does **not** call `make python`. For the **Makefile build**, `make python` inside `ops_translator/` creates the venv under `ops_translator/.python/`.

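
The two venv locations can be captured in a small lookup helper, which may be handy in wrapper scripts that need to find the translator's interpreter. This is a minimal sketch and not part of OPS: the function names `translator_venv_path` and `venv_python`, the `repo_root` parameter, and the example paths are hypothetical, and the `bin/python` layout assumes a POSIX system.

```python
from pathlib import Path


def translator_venv_path(build_system: str,
                         repo_root: Path,
                         install_prefix: Path = Path("/usr/local")) -> Path:
    """Return the expected translator venv directory (hypothetical helper)."""
    if build_system == "cmake":
        # CMake build: venv lives under the install prefix (ops_venv/)
        return install_prefix / "translator" / "ops_translator" / "ops_venv"
    if build_system == "make":
        # Makefile build: `make python` creates the venv in the source tree
        return repo_root / "ops_translator" / ".python"
    raise ValueError(f"unknown build system: {build_system!r}")


def venv_python(venv_dir: Path) -> Path:
    """Interpreter path inside a venv (assumes the POSIX layout bin/python)."""
    return venv_dir / "bin" / "python"


if __name__ == "__main__":
    root = Path("/src/OPS")  # hypothetical checkout location
    print(venv_python(translator_venv_path("cmake", root, Path("/opt/ops"))))
    print(venv_python(translator_venv_path("make", root)))
```

The default for `install_prefix` mirrors the `CMAKE_INSTALL_PREFIX` default listed in the options table; pass the prefix actually used at configure time.
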
#### Build Structure

```
CMakeLists.txt                        # Top-level: compiler flags, dependencies, options
├── ops_translator/CMakeLists.txt     # Installs translator + sets up Python venv
├── ops/c/CMakeLists.txt              # Backend libraries (ops_seq, ops_cuda, ops_mpi, etc.)
├── ops/fortran/CMakeLists.txt        # Fortran backend libraries
├── apps/c/CMakeLists.txt             # C/C++ example applications
│   ├── CloverLeaf/CMakeLists.txt
│   ├── shsgc/CMakeLists.txt
│   └── ...
└── apps/fortran/CMakeLists.txt       # Fortran example applications
```

#### Library Targets

The CMake build in `ops/c/CMakeLists.txt` produces these library targets:

| CMake Target     | Condition                | Description           |
|------------------|--------------------------|-----------------------|
| `ops_seq`        | Always                   | Sequential + OpenMP   |
| `ops_cuda`       | `CUDAToolkit_FOUND`      | CUDA single-node      |
| `ops_hip`        | `OPS_HIP` + `HIP_FOUND`  | HIP single-node       |
| `ops_ompoffload` | NVHPC compiler + CUDA    | OpenMP Offload        |
| `ops_mpi`        | `MPI_FOUND`              | MPI + sequential      |
| `ops_mpi_cuda`   | MPI + CUDA               | MPI + CUDA            |
| `ops_mpi_hip`    | MPI + HIP                | MPI + HIP             |
| `ops_hdf5_seq`   | `HDF5_FOUND`             | HDF5 I/O (sequential) |
| `ops_hdf5_mpi`   | HDF5 + MPI               | HDF5 I/O (MPI)        |

All libraries are installed under `${CMAKE_INSTALL_PREFIX}/lib` with CMake export files at `${CMAKE_INSTALL_PREFIX}/lib/cmake`, allowing downstream projects to use `find_package(OPS)`.

#### Typical Build Workflow

```bash
# Build everything (library + apps)
mkdir build && cd build
cmake .. -DBUILD_OPS_APPS=ON -DCMAKE_INSTALL_PREFIX=$HOME/OPS-INSTALL \
         -DAPP_INSTALL_DIR=$HOME/OPS-APPS -DGPU_NUMBER=1
make
make install

# Or build library and apps separately
mkdir build && cd build
cmake .. -DCMAKE_INSTALL_PREFIX=$HOME/OPS-INSTALL
make && make install
mkdir appbuild && cd appbuild
cmake ../../apps/c -DOPS_INSTALL_DIR=$HOME/OPS-INSTALL -DAPP_INSTALL_DIR=$HOME/OPS-APPS
make
```

#### Adding a New Backend to CMake

1. Add a new library target in `ops/c/CMakeLists.txt` (following the pattern of existing backends)
2. Add conditional detection logic in the top-level `CMakeLists.txt` if a new dependency is needed
3. Use the `installtarget()` macro to register the target for installation and export
4. Add an MPI variant if applicable (create an `ops_mpi_` target)

### Makefile System

The Makefile-based build uses modular includes:

```
makefiles/
├── Makefile.common     # Common flags and definitions
├── Makefile.c_app      # Main C application makefile
├── Makefile.cuda       # CUDA-specific flags
├── Makefile.hip        # HIP-specific flags
├── Makefile.sycl       # SYCL flags (via Makefile.icx)
├── Makefile.mpi        # MPI flags
└── Makefile.           # Compiler-specific settings
```

### Build Targets

For an application named `APP`, the following targets are generated:

| Target                   | Description                          |
|--------------------------|--------------------------------------|
| `$(APP)_dev_seq`         | Development sequential (no code-gen) |
| `$(APP)_dev_mpi`         | Development MPI (no code-gen)        |
| `$(APP)_seq`             | Sequential with generated kernels    |
| `$(APP)_openmp`          | OpenMP parallel                      |
| `$(APP)_mpi`             | MPI distributed                      |
| `$(APP)_mpi_openmp`      | MPI + OpenMP hybrid                  |
| `$(APP)_tiled`           | Lazy execution with tiling           |
| `$(APP)_cuda`            | CUDA single GPU                      |
| `$(APP)_mpi_cuda`        | MPI + CUDA                           |
| `$(APP)_sycl`            | SYCL single device                   |
| `$(APP)_mpi_sycl`        | MPI + SYCL                           |
| `$(APP)_hip`             | HIP single GPU                       |
| `$(APP)_mpi_hip`         | MPI + HIP                            |
| `$(APP)_ompoffload`      | OpenMP Offload single GPU            |
| `$(APP)_mpi_ompoffload`  | MPI + OpenMP Offload                 |

---

## Debugging Tips

### Code Generator Debugging

```bash
# Verbose output
python3 ops-translator -v --file_paths source.cpp

# Dump parsed structure as JSON
python3 ops-translator -d --file_paths source.cpp

# Target specific backend only
python3 ops-translator -t cuda --file_paths source.cpp
```

### Runtime Debugging

```bash
# Enable diagnostics
./app_cuda -OPS_DIAGS=2

# Check block decomposition (MPI)
./app_mpi_cuda -OPS_DIAGS=2

# Timing breakdown: not a shell command; call this from the application's C/C++ source
# ops_timing_output(stdout);
```

### Common Issues

| Issue                            | Cause                             | Solution                      |
|----------------------------------|-----------------------------------|-------------------------------|
| `GET_MACRO` redefined            | Name collision with Intel headers | Harmless warning, ignore      |
| `printf` in SYCL kernel          | Variadic functions not allowed    | Guard with `#ifndef OPS_SYCL` |
| Preprocessor directives stripped | Code generator limitation         | Use runtime conditionals      |

---

## Contributing

To contribute to OPS, please use the following steps:

1. Clone the [OPS](https://github.com/OP-DSL/OPS) repository on your local system.
2. Create a new branch in your cloned repository.
3. Make changes or contributions in your new branch.
4. Submit your changes by creating a pull request to the `develop` branch of the OPS repository.

Contributions in the `develop` branch will be merged into the `master` branch when a new release is created.