All notable changes to the OmpSs-2 programming model and its related software will be documented in this file.
The OmpSs-2 2024.11 release adds support for Coroutines through the NODES runtime and the nOS-V tasking library and introduces several new features in nOS-V which include support for a task suspension API, support for RISC-V, a Topology API, and a Memory Pressure API, among others. This release also introduces support for the breakdown model through ovni and nOS-V.
- Add compatibility with ALPI version 1.1 by implementing various functions from the tasking interface
- Introduce support for breakdown model implementation, supported through the use of `ovniemu -b`
- Refactor shutdown mechanism, using a coordinated approach to prevent contention during runtime shutdown
- Introduce a Memory Pressure API, to query the current occupancy of the nOS-V shared memory segment
- Allow re-initialization of nOS-V, permitting the call to `nosv_init()` after `nosv_shutdown()`
- Enable `turbo` setting by default, and add correctness checking to detect changes to FPU flags from outside of nOS-V
- Add support for coroutines and similar constructs through the `nosv_suspend()` API
- Add support for RISC-V
- Introduce a Topology API, which allows the configuration of system topology through the `nosv.toml` file
- Allow submitting tasks as `NOSV_SUBMIT_IMMEDIATE` from a task's run callback
- Introduce `nosv_cond_t` and related calls, as a replacement for pthread condition variables
- Other miscellaneous fixes and improvements
- Introduce support for Coroutines
- Fix immediate successor logic from within busy threads
- Fix wrong header include order in the build system affecting NODES' installation
- Other minor bug fixes and code improvements
- Support other LLVM/Intel compiler generated code in libompv (tracing) by setting `OMP_ENABLE_COMPAT=1`
- Other bug fixes and improvements
- Miscellaneous bug fixes and improvements
- Add breakdown model for nOS-V
- New mark API `ovni_mark_*()` to emit user-defined events
- New API to manage stream metadata `ovni_attr_*()`
- Update trace format to version 3 (to support independent streams)
- Introduce TAMPI-OPT, an update for the Task-Aware MPI (TAMPI) library which implements several optimizations
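
As a rough illustration of the breakdown workflow listed above, the following Python sketch assembles an `ovniemu -b` command line. The helper name `breakdown_command` is hypothetical. The sketch assumes that `ovniemu` takes the trace directory as its positional argument and that the trace lives in the directory named by `OVNI_TRACEDIR`, or in `ovni` when that variable is unset; the ovni documentation remains the reference for the exact invocation.

```python
import os


def breakdown_command(environ=None):
    """Build an ovniemu command line with the breakdown model enabled.

    Hypothetical helper: assumes the trace directory is the positional
    argument and falls back to "ovni" when OVNI_TRACEDIR is not set.
    """
    environ = os.environ if environ is None else environ
    tracedir = environ.get("OVNI_TRACEDIR", "ovni")
    return ["ovniemu", "-b", tracedir]


# Show the command that would be executed for the current environment.
print(" ".join(breakdown_command()))
```

The resulting list can be handed to `subprocess.run` once a trace has been generated.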
The OmpSs-2 2024.05 release includes the Directory/Cache (D/C) for Host and CUDA devices in Nanos6, several new features for the nOS-V tasking library, and performance improvements and bug fixes. The `libompv` in LLVM/OpenMP includes the implementation of OpenMP free-agents and instrumentation through ovni. This release removes support for the Mercurium compiler.
- Add directory/cache (D/C) for Host and CUDA devices
- Add device memory allocation API for D/C-managed memory
- Improvements to the ovni instrumentation
- New batch submission API, which can accumulate tasks to submit them in batch once a certain threshold is reached
- Add `nosv_mutex_t` and `nosv_barrier_t` as nOS-V aware alternatives to their pthread counterparts
- Add instrumentation points for the `nosv_attach` and `nosv_detach` calls
- Add instrumentation for parallel tasks
- Activate the `turbo.enabled` configuration option by default, enabling flush-to-zero in x86-64 and aarch64
- Perform safety checks when the `turbo.enabled` configuration option is set to verify FPU flags are not modified by external libraries
- Split instrumentation events for the scheduler to allow them to be more granularly controlled
- Allow nOS-V programs to call fork() without leaving the forked process in an incoherent state
- Other bugfixes and improvements
- Improve the error-handling of nOS-V return codes
- Improve descriptiveness of ovni instrumentation
- Various improvements related to API integrations (nOS-V, ALPI, ovni)
- Implement the OpenMP free-agents feature by setting `OMP_ENABLE_FREE_AGENTS=1` and `OMP_WAIT_POLICY=passive`
- Instrument through ovni by setting `OMP_OVNI=1` and enabling ovni instrumentation in nOS-V
- Add `OPENMP_RUNTIME` environment variable to choose the runtime library to link against
- Other bugfixes and improvements
- New `ovni_thread_require` function to enable emulation models
- Streams are marked as finished when calling `ovni_thread_free`
- Support per-thread metadata
- Add manual page for `ovnidump`
- Add support for `nosv_attach` and `nosv_detach` events
- Add support for `nosv_mutex_lock`, `nosv_mutex_trylock`, and `nosv_mutex_unlock` events
- Add support for `nosv_barrier` events
- Add OpenMP model to instrument the `libompv` implementation
- Add new body model to support parallel tasks in nOS-V (`taskfor` directive)
- Fix Paraver cfgs for Mac OS
- Other bugfixes and improvements
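
The environment variables named in this release can be applied per launch instead of globally. The Python sketch below is a minimal illustration of that idea: `openmpv_env` is a hypothetical helper, and the child process is only a stand-in for an application linked against `libompv`. Enabling ovni instrumentation in nOS-V itself is configured separately and is not shown.

```python
import os
import subprocess
import sys

# Variables named in the release notes above.
FREE_AGENTS_ENV = {
    "OMP_ENABLE_FREE_AGENTS": "1",
    "OMP_WAIT_POLICY": "passive",
}
OVNI_ENV = {"OMP_OVNI": "1"}


def openmpv_env(free_agents=True, ovni=False, base=None):
    """Return a copy of the environment with the requested settings added."""
    env = dict(os.environ if base is None else base)
    if free_agents:
        env.update(FREE_AGENTS_ENV)
    if ovni:
        # nOS-V must also have ovni instrumentation enabled (not shown here).
        env.update(OVNI_ENV)
    return env


# Stand-in for the real application: a child Python process that echoes
# the two free-agents variables it received.
child = [
    sys.executable,
    "-c",
    "import os; print(os.environ['OMP_ENABLE_FREE_AGENTS'], "
    "os.environ['OMP_WAIT_POLICY'])",
]
result = subprocess.run(
    child, env=openmpv_env(), capture_output=True, text=True, check=True
)
print(result.stdout.strip())
```

Building the environment per launch keeps the settings from leaking into unrelated OpenMP programs started from the same shell.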
The OmpSs-2 2023.11 release includes performance improvements and bug fixes for the runtime systems, several new features for the nOS-V tasking library, and performance improvements on the `taskiter` construct implementation. It also implements the ALPI (version 1.0) in the runtime systems, which provides support for task-aware libraries. The LLVM/OpenMP includes a new OpenMP runtime called OpenMP-V (`libompv`) that works on top of the nOS-V tasking library. A new instrumentation library called Sonar is provided to instrument MPI function calls through ovni.
- The OmpSs-2 runtime systems expose the ALPI generic low-level tasking interface
- Implement the ALPI interface (version 1.0)
- Allow embedding jemalloc allocator
- Embed hwloc and jemalloc by default
- Add `devices.cuda.prefetch` config option to control CUDA prefetching of data dependencies (enabled by default)
- Install the `nanos6.toml` config file in `$prefix/share`
- Remove obsolete instrument.h public interface
- Remove obsolete stats and graph instrumentations
- Remove software dependency with libunwind and elfutils
- Fix execution when enabling extrae instrumentation
- Remove memory leaks
- Various bugfixes and corrections
- Implement the ALPI interface (version 1.0)
- Add `misc.stack_size` config option to change the stack size of nOS-V threads
- Add `ovni.level` config option for fine-grained instrumentation control
- Change `nosv_attach` API to not require an explicit task type and support multiple attaches
- Implement parallel tasks which can be executed on multiple CPUs at once
- Allow calling `nosv_init` and `nosv_shutdown` multiple times
- Change error handling to return custom nOS-V error codes
- Allow early wake of deadline tasks with `nosv_submit` passing the `NOSV_SUBMIT_DEADLINE_WAKE` flag
- Add compatibility layer for calls to `sched_get/setaffinity` and `pthread_get/setaffinity`
- Add instrumentation points for the `nosv_create` and `nosv_destroy` APIs
- Various bugfixes and corrections
- Improve performance of the `taskiter` construct
- Fix several bugs of the `taskiter` implementation
- Ensure nOS-V library is at the first level of dependencies
- Use the updated attach/detach from nOS-V 2.0
- Drop support for nOS-V versions older than 2.0
- Provide OpenMP runtime named OpenMP-V (`libompv`) working over the nOS-V tasking library (`-fopenmp=libompv`)
- Make OpenMP-V runtime compatible with task-aware libraries
- Drop support for task-aware libraries in vanilla OpenMP runtime `libomp`
- Fix task data dependencies' calculation for long double types
- Add `OVNI_TRACEDIR` envar to change the trace directory (default is `ovni`)
- Add the `ovniver` program to report the libovni version and commit
- Add `ovni_version_get()` function
- Add nOS-V API subsystem events for `nosv_create()` and `nosv_destroy()`
- Add TAMPI model with `T` code, subsystem events and cfgs
- Add MPI model with `M` code, function events and cfgs
- Don't hardcode destination directory names like lib, to use the ones in the destination host (like lib64)
- Introduce the Sonar library that uses ovni for instrumenting MPI functions
- Leverage the ALPI interface instead of the Nanos6-specific interface
- Drop support for OmpSs-2 versions older than 2023.11
- See other features and fixes in each task-aware libraries' CHANGELOG
The OmpSs-2 2023.05.1 release includes bug fixes and documentation improvements in the Nanos6 runtime, the LLVM/OpenMP runtime, and the LLVM/Clang compiler.
- Fix CUDA kernel launch configuration and improve performance of OmpSs-2@CUDA support
- Allow failures at CUDA prefetching without aborting the execution
- Fix linking with jemalloc when --as-needed linking flag is used
- Improve testing infrastructure and programs
- Update documentation regarding OmpSs-2@CUDA support
- Improve general documentation
- Fix OpenMP potential use-after-free in polling tasks' mechanism
- Fix unconditional break inside a for-loop which is encapsulated in a task
- Fix device tasks call order when capturing more information in other clauses
- Add support for the `shmem` clause in device tasks
The OmpSs-2 2023.05 release includes new software projects and several performance and usability improvements for the OmpSs-2 programming model. In the context of OmpSs-2, this release introduces the new NODES runtime system supporting OmpSs-2, a novel and efficient tasking library named nOS-V, new Task-Aware libraries for interoperability with GPU offloading models, and new features in the ovni instrumentation library.
- Improve support for ovni instrumentation in the Nanos6 runtime and support for the idle CPUs view
- Add performance and usability improvements in Nanos6
- Allow embedding hwloc library into Nanos6 to avoid conflicts with other third-party software that use different hwloc versions
- Add support for `atomic` and `critical` OmpSs-2 directives in the LLVM/Clang compiler
- Drop support for `task for` clause
- Mercurium is the OmpSs-2 legacy compiler, not supported anymore, and will not provide new features for OmpSs-2. Use the LLVM/Clang compiler instead
- Introduce the new low-level nOS-V threading and tasking library, enabling co-execution of applications
- Introduce the new NODES runtime system, built on top of nOS-V, that supports the OmpSs-2 model. This runtime implements the `taskiter` construct and leverages directed cyclic task graphs (DCTG) to optimize the execution of iterative applications
- Extend `-fompss-2` option from LLVM/Clang to choose between Nanos6 and NODES runtimes by accepting the option values `libnanos6` (default) and `libnodes`, respectively
- Introduce the new Task-Aware CUDA (TACUDA), Task-Aware HIP (TAHIP) and Task-Aware SYCL (TASYCL) libraries. These task-aware libraries seamlessly integrate the CUDA, HIP and SYCL APIs for GPU offloading with the OmpSs-2 and OpenMP tasking models
- Add performance improvements and bug fixes in the Task-Aware MPI (TAMPI) and Task-Aware GASPI (TAGASPI) communication libraries
- Extend Task-Aware MPI (TAMPI) to support ovni instrumentation and allow tracing of multi-node hybrid MPI+OmpSs-2 applications
- Add new graph-based design in ovni to support complex models like the new breakdown timeline
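
To make the runtime selection through `-fompss-2` concrete, here is a small Python sketch that builds a compile command for either runtime. The helper `ompss2_compile_command`, the compiler name `clang`, and the file names are placeholders chosen for illustration; only the two option values come from the notes above.

```python
def ompss2_compile_command(source, output, runtime="libnanos6", compiler="clang"):
    """Build a compile command that selects the OmpSs-2 runtime.

    `runtime` must be one of the two values listed in the release notes;
    the remaining arguments are placeholders for this sketch.
    """
    if runtime not in ("libnanos6", "libnodes"):
        raise ValueError(f"unknown OmpSs-2 runtime: {runtime}")
    return [compiler, f"-fompss-2={runtime}", source, "-o", output]


# Select the NODES runtime for a hypothetical app.c.
print(" ".join(ompss2_compile_command("app.c", "app", runtime="libnodes")))
# prints: clang -fompss-2=libnodes app.c -o app
```

Rejecting unknown values early avoids passing a misspelled runtime name through to the compiler.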
The OmpSs-2 2022.11 release introduces LLVM support for CUDA tasks and runtime loading, OVNI instrumentation support, and several bug fixes and features that improve the overall performance and programmability.
- Add a probabilistic attribute to the Immediate Successor feature (enabled by default)
- Improve locking, reduce allocations, and fix alignment issues
- Improve task creation performance
- Add support for ovni instrumentation through the configuration: `version.instrument=ovni`
- Choose detail level with the option `instrument.ovni.level`
- Support `device(cuda)` tasks in OmpSs-2 programs built with the LLVM compiler
- Drop support for `device(cuda)` in Mercurium
- Support both building kernels separately with NVCC and linking them to the final binary or building directly with LLVM
- Support building PTX binaries and CUDA kernels at runtime when placed in a specific folder (by default nanos6-cuda-kernels)
- Add a new configuration option for the default CUDA kernels folder `devices.cuda.kernels_folder`
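
The options above use dotted names, which suggests they can be combined with the `NANOS6_CONFIG_OVERRIDE` variable introduced in 2020.11. The Python sketch below builds such an override string. It assumes a comma-separated list of `key=value` pairs, which should be checked against the Nanos6 documentation; `config_override` is a hypothetical helper, and the level value `2` is an arbitrary example.

```python
def config_override(options):
    """Join dotted option names and values into one override string.

    Assumes the comma-separated key=value form; booleans are written
    in TOML style (true/false).
    """
    parts = []
    for key, value in options.items():
        if isinstance(value, bool):
            value = "true" if value else "false"
        parts.append(f"{key}={value}")
    return ",".join(parts)


override = config_override({
    "version.instrument": "ovni",
    "instrument.ovni.level": 2,  # example value, not a recommendation
})
print(f"NANOS6_CONFIG_OVERRIDE={override}")
# prints: NANOS6_CONFIG_OVERRIDE=version.instrument=ovni,instrument.ovni.level=2
```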
The OmpSs-2 2021.11.1 release introduces some bug fixes and minor improvements.
- Adapt taskfor to avoid overwriting task args in compiler-generated code
- Improve support for custom CXXFLAGS in the runtime system
- Add `--disable-all-instrumentations` configure option in the runtime system
- Provide `nanos6-info` with new options to show compile/link runtime flags
- Provide `nanos6-info` with new options to show current and default config files
The OmpSs-2 2021.11 release introduces the support for the taskloop with the collapse clause, some performance and code fixes in the runtime system, and several fixes for the CTF tracing tools.
- Add compiler support for the taskloop with the collapse clause
- Set `hybrid` CPU manager policy as default
- Fix the setting of a floating-point optimization bit in the CSR register (x86) when enabling `turbo` mode
- Add several fixes to CTF tracing tools
- Add support for `if(0)` and taskwaits with dependencies in fast CTF converter (`nanos6-ctf2prv-fast`)
- Remove unnecessary warning at run-time in the NUMA-aware code
The OmpSs-2 2021.06 release introduces efficient support in the programming model and the runtime system for NUMA systems. It also enhances the taskloop construct to support data dependencies. Now the CTF instrumentation supports MPI applications.
- Add memory allocation API to distribute allocations across NUMA domains and schedule tasks based on that information
- Add the `onready` task clause to define arbitrary callback functions executed when a task becomes ready
- Add new `hybrid` CPU manager policy in the runtime that combines both `idle` and `busy` policies
- Make idle CPUs block inside the scheduler while there are no ready tasks when enabling the `busy` policy
- Remove polling services API
- Remove OmpSs-2@Cluster features and code from this runtime system version
- Remove `nanos6-cluster` submodule; see OmpSs-2@Cluster Releases for cluster versions
- Other bugfixes, performance and code improvements
- Add fast CTF trace converter enabled through the runtime config option `instrument.ctf.converter.fast`
- Add support for multi-process tracing enabled by the Task-Aware MPI library
- Add merger tool for multi-process traces named `nanos6-mergeprv`
The OmpSs-2 2020.11.1 release introduces bug fixes and code improvements.
- Efficient support for taskloop dependencies
- Fix reductions in taskloops and taskfors
- Fix Fortran reductions in Mercurium
- Fully implement the `assert` directive
- Centralize runtime configuration options
- Abort execution when an unknown runtime configuration option is defined
- Fix CTF instrumentation issues
- Unify instrumentation, monitoring and hwcounter points in Nanos6
- Add stable Nanos6 for OmpSs-2@Cluster at `nanos6-cluster` submodule
- Bugfixes, performance and code improvements
The OmpSs-2 2020.11 release introduces several features and fixes that improve general performance. It replaces all the configuration environment variables with a configuration file, improving the runtime system's usability. Now, the discrete dependency system is the default implementation. Finally, the LLVM-based compiler has been extended to support most of OmpSs-2 features.
- Replace all environment variables with a TOML configuration file
- Add `NANOS6_CONFIG` environment variable to specify the configuration file
- Add `NANOS6_CONFIG_OVERRIDE` to override options from the configuration file
- Enhance performance in architectures with hyperthreading
- Improve locking performance in the runtime system
- Optimize memory allocations in the runtime system
- Add the `assert` declarative directive to check the loaded dependency system
- Support all OmpSs-2 features in the LLVM-based compiler except device tasks
- Other bugfixes, performance and code improvements
- Make `discrete` the default dependency system
- Add support for CUDA task reductions in discrete dependencies
- Improve allocations in discrete dependencies
- Add support for kernel events in CTF instrumentation
- Add new Paraver views for CTF traces
- Add fixes for OpenACC and CUDA devices
The OmpSs-2 2020.06.1 release introduces some bug fixes and performance improvements.
- Improve the interface and performance of the Nanos6 scheduler's lock
- Fix CTF instrumentation bugs and limitations
- Fix PAPI hardware counters backend
- Support newer versions of GCC, Clang and GLIBC in Nanos6
- Fix task external events API
- Remove preemption mechanism from critical sections
- Add Nanos6 test suite built with the OmpSs-2 LLVM-based compiler
- Fix OmpSs-2 `atomic` directive
- Define `_OMPSS_2` when compiling with the OmpSs-2 LLVM-based compiler
- Bugfixes, performance and code improvements
The OmpSs-2 2020.06 release introduces several features that improve the general performance of OmpSs-2 applications. It adds a new variant to extract execution traces with a lightweight internal tracer. It also improves the support for CUDA and provides support for OpenACC tasks. Additionally, it introduces a new compiler for OmpSs-2 programs in beta development based on the open-source LLVM infrastructure.
- Use jemalloc as a scalable multi-threading memory allocator
- Add `turbo` variant enabling floating-point optimizations and the discrete dependency system
- Refactor of CPU Manager and DLB support improvements
- Add new OmpSs-2 compiler based on LLVM in beta development supporting several OmpSs-2 features
- Add support for non-blocking TAMPI in the LLVM OpenMP runtime system
- Bugfixes, performance and code improvements
- Improve taskfor distribution policy
- Improve scheduling performance and code infrastructure
- Implement the discrete dependency system with lock-free techniques
- Add support for weak dependencies in discrete
- Add support for commutative and concurrent dependencies in discrete
- Refactor the hardware counters infrastructure and support both PAPI and PQoS counters
- Add `ctf` variant to extract execution traces in CTF format using a lightweight internal tracer
- Provide the `ctf2prv` tool to convert CTF traces to Paraver traces
- Avoid Extrae trace desynchronizations in hybrid MPI+OmpSs-2 executions
- Remove the `stats-papi` instrumentation variant
- Refactor of the devices' infrastructure
- Perform transparent CUDA Unified Memory prefetching
- Add support for cuBLAS and similar CUDA APIs
- Add support for OpenACC tasks
The OmpSs-2 2019.11.2 release introduces some bug fixes.
- Fix important error at the Nanos6 initialization
- Fix in discrete dependency system
- Several fixes for OmpSs-2@Cluster
- Fix minor issues of Mercurium
The OmpSs-2 2019.11.1 release introduces some bugfixes and performance improvements.
- Fix execution of CUDA tasks
- Fix `dmalloc` in OmpSs-2@Cluster
- Add missing calls to CPU Manager
- Improve taskfor performance
- Improve general performance by using a reasonable cache line size padding
- Add tests checking the execution of CUDA tasks
- Fix minor issues of Mercurium
The OmpSs-2 2019.11 release introduces a new optimized data dependency implementation. It improves the usability, performance and code of the scheduling infrastructure and the `task for` feature. It also adds support for DLB and OmpSs-2@Linter.
- Data dependency implementation can be decided at run-time through `NANOS6_DEPENDENCIES` variable
- Performance and code improvements on the `task for` feature
- Add support for Dynamic Load Balancing (DLB) tool
- Add compiler and runtime support for OmpSs-2@Linter
- Important bugfix in memory allocator (used by OmpSs-2@Cluster)
- Bugfixes, performance and code improvements
- Add new optimized discrete dependency system implementation; enabled by `NANOS6_DEPENDENCIES=discrete`
- Usability, performance and code improvements on the scheduling infrastructure
- Remove profile instrumentation variant
- Remove interception mechanism of memory allocation functions
The OmpSs-2 2019.06.2 release introduces some bugfixes.
- Compiling extrae variant with high optimization flags
- Removing backtrace sampling from the extrae variant
The OmpSs-2 2019.06.1 release mainly introduces bugfixes and code improvements.
- Renaming loop directive to task for
- Tasks can leverage reductions and external events at the same time (over distinct data regions)
- OmpSs-2@Cluster bugfixes
- Fixing binding information reported by nanos6-info binary
- Support for the TAGASPI library
- Other bugfixes and code improvements
The OmpSs-2 2019.06 release mainly introduces the new support for OmpSs-2@Cluster. It also includes some improvements and optimizations for array task reductions and general bugfixes.
- Support for OmpSs-2@Cluster
- Bugfixes and performance improvements
- Bugfixes and optimization for array reductions
- Delete obsolete task data dependency implementations
- Delete obsolete schedulers
The OmpSs-2 2018.11 release provides full support for the TAMPI library. It also includes general bugfixes and performance improvements.
- Full support for TAMPI
- Bugfixes and performance improvements
- Bugfixes in task external events API
The OmpSs-2 2018.06.2 release introduces some bugfixes.
- Bugfixes in HWLOC support
The OmpSs-2 2018.06.1 release introduces some bugfixes.
- Bugfixes in task reductions
The OmpSs-2 2018.06 release introduces support for OmpSs-2@CUDA in Unified Memory NVIDIA devices. It also supports array task reductions in C/C++ and task priorities. Additionally, it provides two new APIs used by the TAMPI library.
- Support for OmpSs-2@CUDA Unified Memory
- Bugfixes and performance improvements
- Support for array task reductions in C/C++
- Support for task priorities
- Add priority scheduler
- Add polling services API
- Add task external events API
- Rename taskloop construct to loop
The OmpSs-2 2017.11.1 release provides general bugfixes.
- Fixes for the building system
- Fixes for the loading system
The OmpSs-2 2017.11 release is the first release of the OmpSs-2 parallel task-based programming model, which comprises the Nanos6 runtime system and the Mercurium source-to-source compiler. This version provides the essential infrastructure to manage the parallelism of user tasks (task creation, task scheduling, etc.) and their data dependencies. The task dependency system supports array section dependencies, the nested dependency domain connection, and both early release and weak dependency models.
- General infrastructure of the runtime system and the compiler
- Support for user tasks and nesting of tasks
- Implement different schedulers: FIFO, LIFO, etc
- Implementation of a task data dependency system
- Support for array section dependencies
- Support for nested dependency domain connection
- Support for early release of task dependencies
- Support for weak task dependencies
- Support for reductions
- Taskloop construct with dependencies
- Task pause/resume API