Developer Guide
This guide provides an overview of OPS internals for developers who wish to contribute to OPS, add new backends, or understand how the code generation and runtime library work.
Architecture Overview
OPS consists of two main components:
Code Generator (ops_translator/): A Python-based source-to-source translator that parses user applications (using libclang for C++ and fparser2 for Fortran) and generates parallel code for various backends.
Runtime Library (ops/c/ and ops/fortran/): Backend-specific implementations that handle data management, parallelization, and communication.
┌─────────────────────────────────────────────────────────────────────┐
│ User Application │
│ (ops_par_loop calls + kernels) │
└─────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────┐
│ Code Generator (ops_translator) │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────────────────┐ │
│ │ Parser │───>│ Scheme │───>│ Jinja2 Templates │ │
│ │ (libclang/ │ │ (target │ │ (loop_host, master_ │ │
│ │ fparser2) │ │ logic) │ │ kernel, etc.) │ │
│ └─────────────┘ └─────────────┘ └─────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────┐
│ Generated Parallel Code │
│ (CUDA, HIP, SYCL, OpenMP, OpenMP Offload + MPI variants) │
└─────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────┐
│ Runtime Library (ops/c/src/) │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ core/ │ │ cuda/ │ │ sycl/ │ │ mpi/ │ ... │
│ └──────────┘ └──────────┘ └──────────┘ └──────────┘ │
└─────────────────────────────────────────────────────────────────────┘
Code Generator (ops_translator)
The code generator is located in ops_translator/ops-translator/ and uses Python with Clang bindings (libclang) for C++ parsing and fparser2 for Fortran parsing.
Directory Structure
ops_translator/
├── ops-translator/ # Main translator code
│ ├── __main__.py # Entry point & CLI argument handling
│ ├── scheme.py # Code generation schemes (genLoopHost)
│ ├── target.py # Target definitions (Cuda, Sycl, Hip, etc.)
│ ├── ops.py # OPS constructs (Loop, Arg, Dat, etc.)
│ ├── store.py # Application, Program, ParseError classes
│ ├── util.py # Utilities, KernelProcess class
│ ├── language.py # Language definitions (C++, Fortran)
│ ├── jinja_utils.py # Jinja2 environment setup
│ ├── cpp/ # C++ specific code
│ │ ├── parser.py # Clang-based C++ parser
│ │ ├── schemes.py # C++ target scheme implementations
│ │ └── translator/ # Kernel/program translators
│ └── fortran/ # Fortran specific code
├── resources/ # Code generation resources
│ └── templates/ # Jinja2 templates
│ ├── cpp/ # C++ templates
│ │ ├── loop_host.cpp.j2 # Base loop host template
│ │ ├── master_kernel.cpp.j2 # Master kernel file
│ │ ├── cuda/ # CUDA-specific templates
│ │ ├── sycl/ # SYCL-specific templates
│ │ ├── mpi_openmp/ # MPI+OpenMP templates
│ │ └── ...
│ └── fortran/ # Fortran templates
└── .python/ # Python virtual environment for Makefile builds (generated by `make python`, not in version control)
# CMake builds create ops_venv/ under ${CMAKE_INSTALL_PREFIX}/translator/ops_translator/ instead
Key Classes
Target (target.py)
Defines code generation targets and their configurations:
class Target(Findable):
    name: str                  # Target identifier (e.g., "cuda", "sycl")
    kernel_translation: bool   # Whether kernel code needs transformation
    config: Dict[str, Any]     # Target-specific configuration
Available targets:
| Target Class | Name | Description |
|---|---|---|
|  |  | CPU sequential/OpenMP |
|  |  | NVIDIA GPUs via CUDA |
|  |  | AMD GPUs via HIP |
|  |  | Intel/AMD/NVIDIA via SYCL |
|  |  | GPU via OpenMP target |
|  |  | Fortran-to-C CUDA |
|  |  | Fortran-to-C HIP |
|  |  | Fortran-to-C SYCL |
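The guide does not show how Findable stores registered classes, so the following is only a minimal sketch of one way a name-based registry could work. The `_registry` dictionary and the `find` method are assumptions made for illustration; the NewGPU values are taken from the example later in this guide.

```python
from typing import Any, Dict, Optional


class Target:
    """Simplified stand-in for the Target class in target.py."""

    name: str = ""
    kernel_translation: bool = False
    config: Dict[str, Any] = {}

    # Assumed registry (not confirmed by the guide): target name -> instance.
    _registry: Dict[str, "Target"] = {}

    @classmethod
    def register(cls, target_cls: type) -> None:
        """Instantiate a target class and store it under its name."""
        instance = target_cls()
        cls._registry[instance.name] = instance

    @classmethod
    def find(cls, name: str) -> Optional["Target"]:
        """Hypothetical lookup: return the registered target or None."""
        return cls._registry.get(name)


class NewGPU(Target):
    name = "newgpu"
    kernel_translation = True
    config = {"grouped": True, "device": 11}


Target.register(NewGPU)

found = Target.find("newgpu")
print(found.name, found.kernel_translation, found.config["device"])
```

Keeping the lookup keyed on `name` is what would let a command-line target string select the matching class without the caller importing it directly.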
Scheme (scheme.py)
Orchestrates code generation for a language/target combination:
class Scheme(Findable):
    lang: Lang                  # Language (C++, Fortran)
    target: Target              # Target backend
    loop_host_template: Path    # Template for loop host code

    def genLoopHost(...) -> Tuple[str, str, str]:
        """Generate loop host code from template"""
        # 1. Translate kernel if needed
        # 2. Process kernel text (KernelProcess)
        # 3. Render Jinja2 template
        return (generated_code, extension, kernel_func)
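The three steps in genLoopHost can be pictured with a toy pipeline. This is a sketch under stated assumptions, not the translator's code: `string.Template` stands in for Jinja2 so the example needs only the standard library, and `translate_kernel`, `clean_kernel`, and the template text are hypothetical. The extension mirrors the `"newgpu.cpp"` pattern used later in this guide.

```python
from string import Template
from typing import Tuple

# Stand-in for a Jinja2 loop-host template (string.Template, not Jinja2).
LOOP_HOST_TEMPLATE = Template(
    "// host wrapper for $kernel_name\n"
    "$kernel_func\n"
    "void ops_par_loop_$kernel_name(/* args */) {\n"
    "  // launch $kernel_name on target '$target'\n"
    "}\n"
)


def translate_kernel(kernel_func: str, target: str) -> str:
    """Step 1 (hypothetical): add a device qualifier for a CUDA-style target."""
    if target == "cuda":
        return "__device__ " + kernel_func.lstrip()
    return kernel_func


def clean_kernel(kernel_func: str) -> str:
    """Step 2 (hypothetical): strip trailing whitespace from the kernel text."""
    return "\n".join(line.rstrip() for line in kernel_func.strip().splitlines())


def gen_loop_host(kernel_name: str, kernel_func: str, target: str) -> Tuple[str, str, str]:
    """Run translate -> process -> render and return the same 3-tuple shape."""
    kernel_func = translate_kernel(kernel_func, target)
    kernel_func = clean_kernel(kernel_func)
    # Step 3: render the template with the processed kernel text.
    code = LOOP_HOST_TEMPLATE.substitute(
        kernel_name=kernel_name, kernel_func=kernel_func, target=target
    )
    extension = f"{target}.cpp"  # illustrative naming only
    return code, extension, kernel_func


code, extension, kernel = gen_loop_host(
    "save", "void save(const double *a, double *b) { b[0] = a[0]; }   ", "cuda"
)
print(extension)
print(code)
```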
KernelProcess (util.py)
Handles kernel text transformations for different backends:
class KernelProcess:
    def clean_kernel_func_text(kernel_func)          # Remove OPS-specific markers
    def cuda_complex_numbers(kernel_func)            # Handle complex number support
    def sycl_kernel_func_text(kernel_func, consts)   # SYCL-specific transforms
    def get_kernel_body_and_arg_list(kernel_func)    # Extract body and args
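To make the last of these concrete, here is a rough sketch of how a kernel's body and argument names could be pulled out of its source text. It borrows the name get_kernel_body_and_arg_list but is not the implementation in util.py; it assumes a single plain C-style function with no commas inside template arguments.

```python
import re
from typing import List, Tuple


def get_kernel_body_and_arg_list(kernel_func: str) -> Tuple[str, List[str]]:
    """Split a C/C++ kernel definition into its body text and argument names."""
    open_paren = kernel_func.index("(")
    close_paren = kernel_func.index(")", open_paren)
    params = kernel_func[open_paren + 1:close_paren]

    # The argument name is taken to be the last identifier in each parameter.
    arg_list = []
    for param in params.split(","):
        names = re.findall(r"[A-Za-z_]\w*", param)
        if names:
            arg_list.append(names[-1])

    # The body is everything between the first '{' after the parameter list
    # and the last '}' in the text.
    body_start = kernel_func.index("{", close_paren)
    body_end = kernel_func.rindex("}")
    body = kernel_func[body_start + 1:body_end].strip()
    return body, arg_list


kernel = "void add_kernel(const double *a, const double *b, double *out) { out[0] = a[0] + b[0]; }"
body, args = get_kernel_body_and_arg_list(kernel)
print(args)
print(body)
```

Separating the body from the argument list lets a template emit its own function signature for a backend while reusing the user's computation unchanged.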
Parser (cpp/parser.py)
Uses libclang to parse C++ source files:
def parseLoops(translation_unit, program) -> None:
    """Parse ops_par_loop calls from C++ source"""
    # Find macro instantiations and function calls
    # Extract loop information (kernel, block, range, arguments)
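The kind of information the parser extracts can be illustrated without libclang. The sketch below swaps in plain string scanning for the Clang AST walk, so it is a stand-in rather than the real parser. It assumes the argument order kernel, name, block, dimension, range, followed by the loop arguments, and it does not handle commas or parentheses inside string literals.

```python
from typing import Dict, List


def split_top_level(text: str) -> List[str]:
    """Split on commas that are not nested inside parentheses."""
    parts, depth, current = [], 0, []
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    if current:
        parts.append("".join(current).strip())
    return parts


def find_par_loops(source: str) -> List[Dict[str, object]]:
    """Collect kernel, block, range and args for each ops_par_loop call."""
    loops = []
    marker = "ops_par_loop("
    pos = source.find(marker)
    while pos != -1:
        start = pos + len(marker)
        depth, end = 1, start
        # Walk forward to the parenthesis that closes the call.
        while end < len(source) and depth > 0:
            if source[end] == "(":
                depth += 1
            elif source[end] == ")":
                depth -= 1
            end += 1
        fields = split_top_level(source[start:end - 1])
        if len(fields) >= 5:
            loops.append({"kernel": fields[0], "block": fields[2],
                          "range": fields[4], "args": fields[5:]})
        pos = source.find(marker, end)
    return loops


SOURCE = """
ops_par_loop(copy_kernel, "copy_kernel", block, 2, iter_range,
             ops_arg_dat(d_a, 1, S2D_00, "double", OPS_READ),
             ops_arg_dat(d_b, 1, S2D_00, "double", OPS_WRITE));
"""
loops = find_par_loops(SOURCE)
print(loops[0]["kernel"], loops[0]["block"], loops[0]["range"], len(loops[0]["args"]))
```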
Jinja2 Templates
Templates use Jinja2 syntax with OPS-specific filters and tests. Key template variables:
| Variable | Description |
|---|---|
|  | Loop host object (kernel name, args, ndim, etc.) |
|  | Original kernel function text |
|  | Extracted kernel body |
|  | Argument name list |
|  | Current target object |
|  | Constants used in kernel |
Example template structure (loop_host.cpp.j2):
{% block host_prologue %}
// Setup code: args, dimensions, pointers
{% endblock %}
{% block kernel_call %}
// Parallel launch code (varies by target)
{% endblock %}
{% block host_epilogue %}
// Cleanup, timing, diagnostics
{% endblock %}
Adding a New Backend
To add a new backend (e.g., “newgpu”):
Define Target in target.py:
class NewGPU(Target):
    name = "newgpu"
    kernel_translation = True
    config = {"grouped": True, "device": 11}
Target.register(NewGPU)
Create Scheme in cpp/schemes.py:
class CppNewGPU(CppScheme):
    target = NewGPU()
    loop_host_template = Path("cpp/newgpu/loop_host.cpp.j2")
    master_kernel_template = Path("cpp/newgpu/master_kernel.cpp.j2")
    loop_kernel_extension = "newgpu.cpp"
Scheme.register(CppNewGPU)
Create Templates in resources/templates/cpp/newgpu/:
  loop_host.cpp.j2 - Loop host wrapper
  master_kernel.cpp.j2 - Master include file
Add Runtime Support in ops/c/src/newgpu/ (if needed)
Update Makefiles in the makefiles/ directory
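When working through these steps it can help to check that the template files exist before running the translator. The helper below is a hypothetical convenience script, not part of OPS; it relies only on the directory layout and file names described above and runs against a throwaway directory for demonstration.

```python
import tempfile
from pathlib import Path
from typing import List


def missing_backend_templates(translator_root: Path, backend: str) -> List[Path]:
    """Return the expected C++ template files that do not exist yet."""
    template_dir = translator_root / "resources" / "templates" / "cpp" / backend
    expected = [
        template_dir / "loop_host.cpp.j2",
        template_dir / "master_kernel.cpp.j2",
    ]
    return [path for path in expected if not path.is_file()]


# Demo on a temporary tree rather than a real checkout.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    newgpu_dir = root / "resources" / "templates" / "cpp" / "newgpu"
    newgpu_dir.mkdir(parents=True)
    (newgpu_dir / "loop_host.cpp.j2").write_text("{% block kernel_call %}{% endblock %}\n")
    missing = missing_backend_templates(root, "newgpu")

print([path.name for path in missing])
```

In a real checkout, `translator_root` would point at the ops_translator/ directory.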
Runtime Library (ops/c/)
The runtime library provides backend implementations for data management, parallel execution, and communication.
Directory Structure
ops/c/
├── include/ # Public headers
│ ├── ops_lib_core.h # Core OPS API
│ ├── ops_seq.h # Sequential backend header
│ ├── ops_cuda.h # CUDA backend header
│ ├── ops_hip.h # HIP backend header
│ ├── ops_sycl.h # SYCL backend header
│ └── ...
├── src/ # Implementation
│ ├── core/ # Core library (shared across backends)
│ │ ├── ops_lib_core.cpp # Core API implementation
│ │ ├── ops_lazy.cpp # Lazy execution & tiling
│ │ └── ops_instance.cpp # OPS instance management
│ ├── sequential/ # Sequential backend
│ ├── cuda/ # CUDA backend
│ ├── hip/ # HIP backend
│ ├── sycl/ # SYCL backend
│ ├── mpi/ # MPI support for all backends
│ │ ├── ops_mpi_core.cpp
│ │ ├── ops_mpi_partition.cpp # Domain decomposition
│ │ ├── ops_mpi_rt_support_cuda.cpp
│ │ ├── ops_mpi_rt_support_sycl.cpp
│ │ └── ...
│ ├── ompoffload/ # OpenMP offload backend
│ └── tridiag/ # Tridiagonal solver support
└── lib/ # Compiled libraries
Core Components
ops_lib_core.cpp
ops_init() / ops_exit() - Initialization and cleanup
ops_decl_block() - Block declaration
ops_decl_dat() - Dataset declaration
ops_decl_stencil() - Stencil declaration
ops_partition() - MPI partitioning trigger
ops_lazy.cpp
Lazy execution queue management
Tiling plan computation
Communication-avoiding optimizations
Key structures: ops_kernel_list, tiling_plan
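The idea behind a lazy execution queue can be shown with a toy model: loops are recorded instead of being run immediately, and the whole chain executes when something forces a flush. This is a conceptual sketch only. `LazyQueue` and its methods are hypothetical and do not reflect the layout of ops_kernel_list or the tiling logic in ops_lazy.cpp.

```python
from typing import Callable, List, Tuple


class LazyQueue:
    """Toy lazy-execution queue: loops are recorded, then run in order on flush."""

    def __init__(self) -> None:
        self._pending: List[Tuple[str, Callable[[], None]]] = []
        self.executed: List[str] = []

    def enqueue(self, name: str, loop: Callable[[], None]) -> None:
        """Record a loop without running it."""
        self._pending.append((name, loop))

    def flush(self) -> None:
        """Run every recorded loop in submission order, then clear the queue."""
        for name, loop in self._pending:
            loop()
            self.executed.append(name)
        self._pending.clear()


data = [1.0, 2.0, 3.0]


def scale() -> None:
    data[:] = [2 * x for x in data]


def shift() -> None:
    data[:] = [x + 1 for x in data]


queue = LazyQueue()
queue.enqueue("scale", scale)
queue.enqueue("shift", shift)
before = list(data)  # nothing has run yet, so the data is unchanged
queue.flush()
print(before, data, queue.executed)
```

Deferring execution gives a runtime sight of the whole loop chain at once, which is the precondition for planning tiles across several loops.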
MPI Support (ops/c/src/mpi/)
Domain decomposition (ops_mpi_partition.cpp)
Halo exchange management
Backend-specific MPI+GPU support:
  ops_mpi_rt_support_cuda.cpp - CUDA+MPI
  ops_mpi_rt_support_sycl.cpp - SYCL+MPI
  ops_mpi_rt_support_hip.cpp - HIP+MPI
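For readers new to domain decomposition and halos, the following one-dimensional sketch shows the general idea: each rank owns a contiguous index range and reads a few extra points from its neighbours. The functions are illustrative and are not the algorithm used in ops_mpi_partition.cpp, which handles multi-dimensional blocks.

```python
from typing import List, Tuple


def partition_1d(n_points: int, n_ranks: int) -> List[Tuple[int, int]]:
    """Return [start, end) index ranges, one per rank, as evenly as possible."""
    base, extra = divmod(n_points, n_ranks)
    ranges = []
    start = 0
    for rank in range(n_ranks):
        # The first `extra` ranks take one additional point each.
        size = base + (1 if rank < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


def halo_range(owned: Tuple[int, int], depth: int, n_points: int) -> Tuple[int, int]:
    """Extend an owned range by `depth` halo points, clipped to the global domain."""
    return max(0, owned[0] - depth), min(n_points, owned[1] + depth)


ranges = partition_1d(10, 3)
print(ranges)                        # [(0, 4), (4, 7), (7, 10)]
print(halo_range(ranges[1], 1, 10))  # (3, 8)
```

The points in the extended range that a rank does not own are the ones a halo exchange must refresh before a stencil loop reads them.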
Adding Runtime Support for a New Backend
Create backend directory: ops/c/src/newgpu/
Implement required functions:
  Device memory allocation/deallocation
  Data transfer (host ↔ device)
  Kernel launch wrappers
Add MPI support (if needed): ops/c/src/mpi/ops_mpi_rt_support_newgpu.cpp
Update build system:
  Add makefiles/Makefile.newgpu
  Update CMakeLists.txt
Build System
OPS supports two build systems: CMake (recommended) and Makefiles. Both produce the same set of backend libraries and application binaries. For full build instructions see installation.md.
CMake Build System
The top-level CMakeLists.txt orchestrates the entire build: compiler detection, dependency discovery, backend library compilation, translator installation, and optional application builds.
Key CMake Options
| Option | Default | Description |
|---|---|---|
|  | — | Build type: |
|  |  | Build the C/C++ backend libraries |
|  |  | Build the Fortran backend libraries |
|  |  | Build sample applications (library CMake only) |
|  |  | Enable CTest-based tests |
|  |  | Enable the HIP backend |
|  |  | Use the legacy code generator |
|  |  | Enable strict IEEE floating-point flags |
|  |  | Show verbose output during build |
|  |  | Library installation directory |
|  |  | Application installation directory |
|  | — | Path to installed OPS library (app CMake only) |
|  | — | Number of GPUs for tests |
|  |  | CUDA compute capability (e.g., |
|  | — | Path to tridiagonal solver library (optional) |
Dependency Detection
CMake automatically discovers MPI (find_package(MPI)), HDF5 (find_package(HDF5)), CUDA (find_package(CUDAToolkit)), OpenMP (find_package(OpenMP)), HIP (when OPS_HIP=ON), and Python 3.8+.
The translator’s Python virtual environment is set up differently depending on the build system. For the CMake build, ops_translator/CMakeLists.txt copies the translator tree to ${CMAKE_INSTALL_PREFIX}/translator/ops_translator/ and runs setup_venv_cmake.sh directly (using python3 -m venv) to create the venv at ${CMAKE_INSTALL_PREFIX}/translator/ops_translator/ops_venv/; it does not call make python. For the Makefile build, make python inside ops_translator/ creates the venv under ops_translator/.python/.
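Because the venv location depends on the build system, scripts that need to find the translator's interpreter have to branch on it. The helper below is a hypothetical sketch that simply encodes the two locations described above; the function name, the `build_system` values, and the example paths are assumptions.

```python
from pathlib import Path


def translator_venv_path(build_system: str, source_root: Path, install_prefix: Path) -> Path:
    """Return where the translator's virtual environment is expected to live."""
    if build_system == "cmake":
        # CMake build: venv is created inside the installed translator tree.
        return install_prefix / "translator" / "ops_translator" / "ops_venv"
    if build_system == "make":
        # Makefile build: venv is created inside the source tree by `make python`.
        return source_root / "ops_translator" / ".python"
    raise ValueError(f"unknown build system: {build_system!r}")


source_root = Path("/src/OPS")       # illustrative path
install_prefix = Path("/opt/ops")    # illustrative path
print(translator_venv_path("cmake", source_root, install_prefix).as_posix())
print(translator_venv_path("make", source_root, install_prefix).as_posix())
```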
Build Structure
CMakeLists.txt # Top-level: compiler flags, dependencies, options
├── ops_translator/CMakeLists.txt # Installs translator + sets up Python venv
├── ops/c/CMakeLists.txt # Backend libraries (ops_seq, ops_cuda, ops_mpi, etc.)
├── ops/fortran/CMakeLists.txt # Fortran backend libraries
├── apps/c/CMakeLists.txt # C/C++ example applications
│ ├── CloverLeaf/CMakeLists.txt
│ ├── shsgc/CMakeLists.txt
│ └── ...
└── apps/fortran/CMakeLists.txt # Fortran example applications
Library Targets
The CMake build in ops/c/CMakeLists.txt produces these library targets:
| CMake Target | Condition | Description |
|---|---|---|
|  | Always | Sequential + OpenMP |
|  |  | CUDA single-node |
|  |  | HIP single-node |
|  | NVHPC compiler + CUDA | OpenMP Offload |
|  |  | MPI + sequential |
|  | MPI + CUDA | MPI + CUDA |
|  | MPI + HIP | MPI + HIP |
|  |  | HDF5 I/O (sequential) |
|  | HDF5 + MPI | HDF5 I/O (MPI) |
All libraries are installed under ${CMAKE_INSTALL_PREFIX}/lib with CMake export files at ${CMAKE_INSTALL_PREFIX}/lib/cmake, allowing downstream projects to use find_package(OPS).
Typical Build Workflow
# Build everything (library + apps)
mkdir build && cd build
cmake .. -DBUILD_OPS_APPS=ON -DCMAKE_INSTALL_PREFIX=$HOME/OPS-INSTALL \
-DAPP_INSTALL_DIR=$HOME/OPS-APPS -DGPU_NUMBER=1
make
make install
# Or build library and apps separately
mkdir build && cd build
cmake .. -DCMAKE_INSTALL_PREFIX=$HOME/OPS-INSTALL
make && make install
mkdir appbuild && cd appbuild
cmake ../../apps/c -DOPS_INSTALL_DIR=$HOME/OPS-INSTALL -DAPP_INSTALL_DIR=$HOME/OPS-APPS
make
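If the configure step is driven from a script (for example in CI), the option list can be assembled programmatically. This is a small illustrative helper, not part of OPS; it only formats `-D` arguments and uses option names that appear in the commands above. The install path is a placeholder.

```python
from typing import Dict, List, Union


def cmake_configure_args(source_dir: str, options: Dict[str, Union[str, bool, int]]) -> List[str]:
    """Build a `cmake` argument list, mapping Python booleans to ON/OFF."""
    args = ["cmake", source_dir]
    for key, value in options.items():
        # bool must be checked explicitly because it is a subclass of int.
        if isinstance(value, bool):
            value = "ON" if value else "OFF"
        args.append(f"-D{key}={value}")
    return args


args = cmake_configure_args(
    "..",
    {"BUILD_OPS_APPS": True, "CMAKE_INSTALL_PREFIX": "/opt/ops", "GPU_NUMBER": 1},
)
print(" ".join(args))
```

The resulting list can be passed to `subprocess.run(args, check=True)` from inside the build directory.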
Adding a New Backend to CMake
Add a new library target in ops/c/CMakeLists.txt (following the pattern of existing backends)
Add conditional detection logic in the top-level CMakeLists.txt if a new dependency is needed
Use installtarget() macro to register the target for installation and export
Add MPI variant if applicable (create ops_mpi_<backend> target)
Makefile System
The Makefile-based build uses modular includes:
makefiles/
├── Makefile.common # Common flags and definitions
├── Makefile.c_app # Main C application makefile
├── Makefile.cuda # CUDA-specific flags
├── Makefile.hip # HIP-specific flags
├── Makefile.sycl # SYCL flags (via Makefile.icx)
├── Makefile.mpi # MPI flags
└── Makefile.<compiler> # Compiler-specific settings
Build Targets
For an application named APP, the following targets are generated:
| Target | Description |
|---|---|
|  | Development sequential (no code-gen) |
|  | Development MPI (no code-gen) |
|  | Sequential with generated kernels |
|  | OpenMP parallel |
|  | MPI distributed |
|  | MPI + OpenMP hybrid |
|  | Lazy execution with tiling |
|  | CUDA single GPU |
|  | MPI + CUDA |
|  | SYCL single device |
|  | MPI + SYCL |
|  | HIP single GPU |
|  | MPI + HIP |
|  | OpenMP Offload single GPU |
|  | MPI + OpenMP Offload |
Debugging Tips
Code Generator Debugging
# Verbose output
python3 ops-translator -v --file_paths source.cpp
# Dump parsed structure as JSON
python3 ops-translator -d --file_paths source.cpp
# Target specific backend only
python3 ops-translator -t cuda --file_paths source.cpp
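When the translator is invoked repeatedly during debugging, a thin wrapper can keep the flag combinations consistent. The function below is a hypothetical helper that only assembles the flags shown above (-v, -d, -t, --file_paths); it does not run the translator and makes no claim about other options.

```python
from typing import List, Optional


def translator_command(source: str, target: Optional[str] = None,
                       verbose: bool = False, dump_json: bool = False) -> List[str]:
    """Assemble an ops-translator command line from the documented flags."""
    cmd = ["python3", "ops-translator"]
    if verbose:
        cmd.append("-v")
    if dump_json:
        cmd.append("-d")
    if target is not None:
        cmd += ["-t", target]
    cmd += ["--file_paths", source]
    return cmd


cmd = translator_command("source.cpp", target="cuda", verbose=True)
print(" ".join(cmd))
```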
Runtime Debugging
# Enable diagnostics
./app_cuda -OPS_DIAGS=2
# Check block decomposition (MPI)
./app_mpi_cuda -OPS_DIAGS=2
# Timing breakdown (a call made in the application source, not a shell command)
ops_timing_output(stdout);
Common Issues
| Issue | Cause | Solution |
|---|---|---|
|  | Name collision with Intel headers | Harmless warning, ignore |
|  | Variadic functions not allowed | Guard with |
| Preprocessor directives stripped | Code generator limitation | Use runtime conditionals |
Contributing
To contribute to OPS, please use the following steps:
Clone the OPS repository on your local system.
Create a new branch in your cloned repository.
Make changes or contributions in your new branch.
Submit your changes by creating a pull request to the develop branch of the OPS repository.
Contributions in the develop branch will be merged into the master branch when a new release is created.