Releases: LLNL/RAJA
v2022.10.1
This release updates the RAJA release number in CMake, which was inadvertently
missed in the v2022.10.0 release.
Please download the RAJA-v2022.10.1.tar.gz file below. The others, generated by GitHub, may not work for you due to RAJA's dependencies on git submodules.
v2022.10.0
This release contains new features, bug fixes, and build improvements. Please see the RAJA user guide for more information about items in this release.
Please download the RAJA-v2022.10.0.tar.gz file below. The others, generated by GitHub, may not work for you due to RAJA's dependencies on git submodules.
Notable changes include:
-
New features / API changes:
- Introduced new RAJA::forall and reduction interfaces that extend the execution behavior of reduction operations with RAJA::forall. The main difference with the pre-existing reduction interface in RAJA is that reduction variables and operations are passed into the RAJA::forall method and lambda expression instead of using the lambda capture mechanism for reduction objects. This offers flexibility and potential performance advantages when using RAJA reductions as the new interface enables the ability to integrate with programming model back-end reduction machinery directly for OpenMP and SYCL, for example. The interface also enables user-chosen kernel names to be passed to RAJA::forall for performance analysis annotations that are easier to understand. Example codes are included as well as a description of the new interface and comparison with the pre-existing interface in the RAJA User Guide.
- Added support for run time execution policy selection for RAJA::forall kernels. Users can specify any number of execution policies in their code and then select which to use at run time. There is no discussion of this in the RAJA User Guide yet. However, there are a couple of example codes in files RAJA/examples/dynamic-forall.cpp.
- The RAJA::launch framework has been moved out of the experimental namespace, into the RAJA:: namespace, which introduces an API change.
- Add support for all RAJA segment types in the RAJA::launch framework.
- Add SYCL back-end support for RAJA::launch and dynamic shared memory for all back-ends in RAJA::launch. These changes introduce API changes.
- Add additional policies to WorkGroup construct that allow for different methods of dispatching work.
- Add special case implementations to CUDA atomicInc and atomicDec functions to use special hardware support when available. This can result in a significant performance boost.
- Rework HIP atomic implementations to support more native data types.
- Added RAJA_UNROLL_COUNT macro which enables users to unroll loops for a fixed unroll count.
- Major User Guide rework:
- New RAJA tutorial sections, including new exercise source files to work through. Material used in recent RADIUSS/AWS RAJA Tutorial.
- Cleaned up and expanded RAJA feature sections to be more like a reference guide with links to associated tutorial sections for implementation examples.
- Improved presentation of build configuration sections.
-
Build changes / improvements:
- Submodule updates:
- BLT updated to v0.5.2 release.
- Camp updated to v2022.10.0 release.
- The minimum CMake version required has changed. For a HIP build, CMake 3.23 or newer is required. For all other builds CMake 3.20 or newer is required.
- OpenMP back-end support is now off by default to match behavior of all other RAJA parallel back-end support. To enable OpenMP, users must now run CMake with the -DENABLE_OPENMP=On option.
- Support OpenMP back-end enablement in a HIP build configuration.
- RAJA_ENABLE_VECTORIZATION CMake option added to enable/disable new SIMD/SIMT vectorization support. The default is 'On'. The option allows users to disable if they wish.
- Improvements to build target export mechanics coordinated with camp, BLT, and Spack projects.
- Improve HIP builds to better support evolving ROCm software stack.
- Add CMake variable RAJA_ALLOW_INCONSISTENT_OPTIONS and CMake messages to allow users more control when using CMake dependent options. When CMake is run, the code now checks for cases when RAJA_ENABLE_X=On but ENABLE_X=Off. Previously, this was confusing because X would not be enabled despite the value of the RAJA-specific option.
- Build system refactoring to make CMake configurations more robust; added test to check for installed CMake config.
- Added basic support to compile with C++20 standard.
- Add missing compilation macro guards for HIP and CUDA policies in vectorization support when not using a GPU device.
-
Bug fixes / improvements:
- Expanded test coverage to catch more cases that users have run into.
- Various fixes in SIMD/SIMT support for different compilers and versions users have hit recently. Also, changes to internal implementations to improve run time performance for those features.
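The new reduction interface noted above for v2022.10.0 passes reduction variables into the loop method and the lambda rather than capturing reduction objects. The following is a minimal plain C++ sketch of that calling convention only; it does not use RAJA, and `forall_with_sum` and `sum_entries` are hypothetical names introduced for illustration. Please see the RAJA User Guide for the actual interface.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a forall-style loop driver; this is NOT the RAJA API.
// The reduction target is passed explicitly and handed to the loop body as a
// second argument, instead of being captured by the lambda.
template <typename T, typename Body>
void forall_with_sum(std::size_t begin, std::size_t end, T* result, Body&& body)
{
  assert(result != nullptr && begin <= end);
  T local = T{};                 // loop-local accumulator
  for (std::size_t i = begin; i < end; ++i) {
    body(i, local);              // body updates the accumulator it is given
  }
  *result += local;              // combine once, after the loop
}

// Example use: sum the entries of a vector.
inline long sum_entries(const std::vector<int>& a)
{
  long total = 0;
  forall_with_sum(std::size_t{0}, a.size(), &total,
                  [&a](std::size_t i, long& sum) { sum += a[i]; });
  return total;
}
```

Because the driver owns the accumulator, a back-end is free to decide how partial results are combined, which is the kind of flexibility the release note describes.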
v2022.03.1
This is a patch release for v2022.03.0. It fixes compilation errors which occur due to vectorization headers being included for the GPU when not compiling for the GPU.
v2022.03.0
This release contains new features, bug fixes, and build improvements. Please see the release notes below and the RAJA user guide for more information about items in this release.
Please download the RAJA-v2022.03.0.tar.gz file below. The others, generated by GitHub, will not work due to RAJA's use of git submodules.
Notable changes include:
-
Important note: As of this release, the coordinated release of RAJA Portability Suite components (RAJA, Umpire, CHAI, and camp) will be tagged as YYYY.MM.pp for year, month, and patch number. For example, this release is tagged 2022.03.0, indicating that it was part of a coordinated RAJA Suite release in March 2022. The intent of the new labeling scheme is to indicate that all Suite components with a common year-month release tag are compatible and to make the association amongst them clear to users. If an individual component requires a patch release independent of the others, the release for that component will be labeled 2022.03.1, for example, to indicate that it is one patch release beyond the original combined Suite release.
-
New features / API changes:
- BREAKING CHANGE: RAJA OffsetLayout constructor was changed to take (begin, end) args (where end is one past the last index) instead of (first, last) args (where last index was included). This is consistent with expected behavior and other RAJA Layout/View concepts.
- New experimental features that support SIMD/SIMT programming by guaranteeing vectorization without the need to rely on compiler auto-vectorization. Basic documentation for this is included in the RAJA User Guide and should provide enough description for interested users to try it out.
- "Flatten" policies were added for RAJA Teams. This reshapes multi-dimensional GPU thread blocks to 1D.
- RAJA Teams now allows a single execution policy to be provided. Previously, it required two; e.g., a CPU policy and a GPU policy.
- ROCTX support has been added to enable kernel naming with RAJA Teams.
- Details of CUDA and HIP errors are now added to the reported exception string. Previously, this information was going to stderr.
- All CUDA execution policies have been expanded to allow users to specify a minimum number of blocks per SM, if they wish to do that. An analogous capability for HIP execution policies is being hashed out.
- Changes were made to RAJA scans to address a consistency issue and allow passing a const pointer as an input span.
- RAJA View pointer type is fixed to properly allow CHAI ManagedArray type to be passed through to View instead of the raw pointer type. This fixes an issue where some required CHAI memory transfers were not occurring.
- A "combining adapter" concept has been added that allows multi-dimensional loops to be run using one-dimensional interfaces. Please see the RAJA User Guide for more description.
- Additional feature support and improvements have been made to the RAJA SYCL back-end (please see the RAJA User Guide for more information):
- "nontrivially copyable" SYCL interface has been removed (i.e., 'RAJA::sycl_exec_nontrivial<...>' and 'RAJA::SyclKernelNonTrivial<...>') as these constructs are no longer needed when using recent updates to the Intel OneAPI compiler. Execution is now dispatched based on the C++ 'is_trivially_copyable' type trait.
- Support for RAJA::kernel loop tiling policies is now available for SYCL execution.
- The naming scheme for SYCL 'group' and 'local' policies has been changed from 1-based to 0-based for block dimensions.
- The use of the SYCL atomic OneAPI extension namespace has been cleaned up.
-
Build changes/improvements:
- AS OF THIS RELEASE, RAJA REQUIRES A C++14-COMPLIANT COMPILER TO BUILD.
- AS OF THIS RELEASE, RAJA REQUIRES CMAKE version 3.14.5 or newer.
- The BLT submodule is updated to v0.5.0, which includes improved support for ROCm/HIP builds. Although the option CMAKE_HIP_ARCHITECTURES to specify the HIP target architecture is not available until CMake version 3.21, the option is supported in the new BLT version and works with all versions of CMake.
- The camp submodule is updated to v2022.03.0. If you do not use the submodule and build RAJA with an external version of camp, you will need to use camp v2022.03.0 or newer.
- The "RAJA_" prefix has been added to all CMake options. Options that shadow a CMake or BLT option are turned into cmake_dependent_option calls, ensuring that they can be controlled independently and have the correct dependence on the underlying CMake or BLT support; e.g., RAJA_ENABLE_CUDA requires ENABLE_CUDA.
- The camp_DIR export has been removed. Camp paths will be searched using the default logic, which is consistent with camp.
- The raja-config.cmake package file is now "relocatable", meaning it can be moved to another directory location after an install and still work. This should make it easier to use for applications that use RAJA and CMake, but do not use BLT.
- CMake logic for using CUB in RAJA for a CUDA build has been changed. The default behavior is now that when the CUDA version is < 11, the RAJA CUB submodule will be used. When the CUDA version is >= 11, the CUB version that is included in the associated CUDA toolkit will be used. Users have the ability to override these defaults and select a specific version of CUB if they wish.
- CMake logic for using rocPRIM in RAJA for a HIP build is similar. The default behavior is now that when the HIP version is < 4, the RAJA rocPRIM submodule will be used. When the HIP version is >= 4, the rocPRIM version that is included in the associated ROCm toolkit will be used. Users have the ability to override these defaults and select a specific version of rocPRIM if they wish.
- The RAJA Spack package was updated to include the version of this release and address some issues.
- Added a concept of RAJA_HIP_ACTIVE that mirrors RAJA_CUDA_ACTIVE.
- The CMake option RAJA_ENABLE_HIP_INDIRECT_FUNCTION_CALL has been removed. Now the choice is made based on the ROCm compiler version.
-
Bug fixes/improvements:
- A bug in TBB non-inplace scan implementation was fixed.
- RAJA StaticLayout was fixed to avoid compiler warnings due to converting a negative integer value to an unsigned integral type.
- Various improvements, updates, and fixes (formatting, typos, etc.) in RAJA User Guide.
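For code affected by the OffsetLayout breaking change in v2022.03.0, the essential adjustment is that the second bound is now one past the last index. The following plain C++ sketch shows the conversion from inclusive (first, last) bounds to (begin, end) bounds; it does not call RAJA, and `HalfOpenBounds`, `from_inclusive`, and `extent` are hypothetical names used only for illustration.

```cpp
#include <cassert>

// Hypothetical helper type for illustration; not part of RAJA.
struct HalfOpenBounds {
  int begin;  // first valid index
  int end;    // one past the last valid index
};

// Convert old-style inclusive (first, last) bounds to (begin, end) bounds.
constexpr HalfOpenBounds from_inclusive(int first, int last)
{
  return HalfOpenBounds{first, last + 1};
}

// Number of indices covered by half-open bounds.
constexpr int extent(HalfOpenBounds b)
{
  return b.end - b.begin;
}
```

For example, indices -1 through 2 inclusive become begin = -1 and end = 3, covering four indices.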
v0.14.1
This is a patch release for v0.14.0. It updates the camp submodule to v0.2.3 and fixes a couple of broken macro include guards in RAJA.
Please download the RAJA-v0.14.1.tar.gz file below. The others will not work due to the way RAJA uses git submodules.
v0.14.0
This release contains new features, bug fixes, and build improvements. Please see the RAJA user guide for more information about items in this release.
Please download the RAJA-v0.14.0.tar.gz file below. The others will not work due to the way RAJA uses git submodules.
Notable changes include:
-
New features / API changes:
- Initial release of some SYCL execution back-end features for supporting Intel GPUs. Users should be able to exercise RAJA::forall, basic RAJA::kernel, and reductions. Future releases will contain additional RAJA feature support for SYCL.
- Various enhancements to the experimental RAJA "teams" capability, including documentation and complete code examples illustrating usage.
- The RAJA "teams" interface was expanded to include initial support for RAJA/camp resources.
- The RAJA "teams" interface was expanded to allow users to label kernels with name strings to easily attribute execution timings and other details to specific kernels with NVIDIA profiling tools, for example. Usage information is available in the RAJA User Guide. Kernel naming will be available for all other RAJA kernel execution methods in a future release.
- Deprecated sort and scan methods taking iterators have been removed. Now, these methods take RAJA span arguments. For example, (begin, end) args are replaced with RAJA::make_span(begin, N), where N = end - begin. Please see the RAJA User Guide documentation for scan and sort operations for details and usage examples.
- Sort and scan methods now accept an optional resource argument.
- Methods were added to the RAJA::kernel API to accept a resource argument; specifically 'kernel_resource' and 'kernel_param_resource'. These kernel methods return an Event object similar to the RAJA::forall interface.
- RAJA resource support added to RAJA workgroup and worksite constructs.
- OpenMP CPU multithreading policies have been reworked so that usage involving OpenMP scheduling is consistent. Specification of a chunk size for scheduling policies is optional, which is consistent with native OpenMP usage. In addition, no-wait policies are more constrained to prevent potentially non-conforming (to the OpenMP spec) usage. Finally, additional policy type aliases have been added to make common use cases less verbose. Please see the RAJA policy documentation in the User Guide for policy descriptions.
- Host implementation of HIP atomics added.
- Add ability to specify the atomic to use on the host in CUDA and HIP atomic policies (i.e., added host atomic template parameter). This is useful for host-device decorated lambda expressions that may be used for either host or device execution. It also fixes compilation issues with HIP atomics in host-device contexts.
- The RAJA Registry API has been changed to return raw pointers to registry objects rather than shared_ptr type. This is better for performance.
- New content has been added to the RAJA Developer Guide available in the Read The Docs Sphinx documentation. This should help folks align their work with RAJA processes when making contributions to RAJA.
- Basic doxygen source code documentation is now available via a link in our Read The Docs Sphinx documentation.
- Unified memory implementation for storing indices in TypedListSegment, which was marked deprecated in v0.12.0 release has been removed. Now, TypedListSegment constructor requires a camp resource object to be passed which indicates the memory space where the indices will live. Specifically, the array of indices passed to the constructor by a user (assumed to live in host memory for the "owned" case) will be copied to an internally owned allocation in the memory space defined by the resource object.
- The ListSegment constructor takes a resource by value now, previously taken by reference, which allows more resource argument types to be passed more seamlessly to the List Segment constructor.
- 'CudaKernelFixedSM' and 'CudaKernelFixedSMAsync' methods were added which allow users to specify the minimum number of thread blocks to launch per SM. This resulted in a performance improvement for an application use case. Future work will expand this concept to other GPU kernel execution methods in RAJA.
-
Build changes/improvements:
- Update BLT submodule to latest release, v0.4.1.
- Update camp submodule to latest tagged release, v0.2.2.
- The RAJA_CXX_STANDARD_FLAG CMake variable was removed. The BLT_CXX_STD variable is now used instead.
- Support for building RAJA as a shared library on Windows has been added.
- A build system adjustment was made to address an issue when RAJA is built with an external version of camp (e.g., through Spack).
- The build default has been changed to use the version of CUB that is installed in the specified version of the CUDA toolkit, if available, when CUDA is enabled. Similarly, for the analogous functionality in HIP. Specific versions of these libraries can still be specified for a RAJA build. Please see the RAJA User Guide for details.
- The build system now uses the BLT cmake_dependent_options support for options defined by BLT. This avoids shadowing of BLT options by options defined in RAJA and in the cases where RAJA is used as a sub-module in another BLT project. For example, it provides the ability to disable RAJA tests and examples at a more fine granularity.
- Checks were added to the RAJA CMake build system to check for minimum required versions of CUDA (9.2) and HIP (3.5).
- A build system bug was fixed so that targets for third-party dependencies provided by BLT (e.g., CUDA and HIP) are exported properly. This allows non-BLT projects to use the imported RAJA target.
- An issue was fixed to appease the MSVC 2019 compiler.
- Improvements to build system to address Hip linking issues.
-
Bug fixes/improvements:
- Hip and CUDA block reductions were tweaked to fix the number of steps in the final wavefront/warp reduction. This saves a couple rounds of warp shfls.
- A runtime bug resulting from defaulted View constructors not being implemented correctly in CUDA 10.1 is fixed. This fixes an issue where the copy constructor of CHAI managed arrays was not triggered properly.
- Fix bug that caused a CUDA or HIP synchronization error when a zero length loop was enqueued in a workgroup.
- Added missing HIP workgroup unordered execution policy, so HIP version is consistent with CUDA version.
- Fixed issue where the RAJA non-resource API returns an EventProxy object with a dangling resource pointer, by getting a reference to the default resource for the execution context.
- IndexSet utility methods for collecting indices into a separate container now work with any index type.
- The volatile qualifier was removed from a type conversion function used in RAJA atomics. This fixes a performance issue with HIP where the value was written to stack memory during type conversion.
- Numerous improvements, updates, and fixes (formatting, typos, etc.) in RAJA User Guide.
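The move from iterator pairs to span arguments for sort and scan in v0.14.0 amounts to replacing (begin, end) with a pointer and a length N = end - begin. The following plain C++ sketch illustrates that relationship with a hypothetical `SimpleSpan` type and helper functions; none of these names are RAJA entities, and the scan shown is a simple sequential stand-in rather than a RAJA scan.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical minimal span: a pointer plus a length. Not the RAJA span type.
template <typename T>
struct SimpleSpan {
  T* data;
  std::size_t size;
};

// Build a span from an old-style (begin, end) pointer pair, with N = end - begin.
template <typename T>
SimpleSpan<T> span_from_range(T* begin, T* end)
{
  assert(begin <= end);
  return SimpleSpan<T>{begin, static_cast<std::size_t>(end - begin)};
}

// In-place inclusive prefix sum over a span, as a stand-in for a scan.
template <typename T>
void inclusive_sum_inplace(SimpleSpan<T> s)
{
  for (std::size_t i = 1; i < s.size; ++i) {
    s.data[i] += s.data[i - 1];
  }
}
```

Carrying the length with the pointer also makes the zero-length case explicit, since a span of size 0 simply performs no work.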
v0.13.0
This release contains new features, bug fixes, and build improvements. Please see the RAJA user guide for more information about items in this release.
Please download the RAJA-v0.13.0.tar.gz file below. The others will not work due to the way RAJA uses git submodules.
Notable changes include:
-
New features:
- Execution policies for the RAJA HIP back-end and examples have been added to the RAJA User Guide and Tutorial.
- Strongly-typed indices now work with Multiview.
-
Build changes/improvements:
- Update BLT to version X.y.z.
- Added option to enable/disable runtime plugin loading. This is now off by default. Previously, it was always enabled and there was no way to disable it.
-
Bug fixes/improvements:
- Issues have been addressed so that the OpenMP target back-end is now working properly for all supported features. This has been verified with multiple clang compilers, including clang 10, and the XL compiler.
- Various data structures have been made trivially copyable to ensure they are mapped properly to device memory with OpenMP target execution.
- Numerous improvements and fixes (formatting, typos, etc.) in User Guide.
v0.12.1
This release contains fixes for errors when using a CUDA-enabled RAJA with a non-CUDA compiler, squashed compiler warnings, and some other bug fixes related to OpenMP target compilation.
Please download the RAJA-v0.12.1.tar.gz file below. The others will not work due to the way RAJA uses git submodules.
v0.12.0
This release contains new features, notable changes, and bug fixes. Please see the RAJA user guide for more information about items in this release.
Please download the RAJA-v0.12.0.tar.gz file below. The others will not work due to the way RAJA uses git submodules.
Notable changes include:
-
Notable repository change:
- The 'master' branch in the RAJA git repo has been renamed to 'main'.
-
New features:
- New RAJA "work group" capability added. This allows multiple GPU kernels to be executed via one kernel launch, greatly reducing the run time overhead of launching CUDA kernels.
- Dynamic plug-ins added to RAJA, which enable tools such as the Kokkos Performance Profiling Tools to be used with RAJA (https://github.com/kokkos/kokkos-tools).
- Added ability to pass a resource object to RAJA::forall methods to enable asynchronous execution for CUDA and HIP back-ends.
- Added "Multi-view" that works like a regular view, except that it can wrap multiple arrays so their accesses can share index arithmetic.
- Introduced RAJA "Teams" concept as an experimental feature. This enables hierarchical parallelism and additional nested loop patterns beyond what RAJA::kernel supports. Please note that this is very much a work-in-progress and is not yet documented in the user guide.
- Added initial support for dynamic loop tiling.
- New OpenMP execution policies added to support static, dynamic, and guided scheduling.
- Added support for const iterators to be used with RAJA scans.
- Support for bitwise AND and OR reductions has been added.
- The RAJA::kernel interface has been expanded to allow only segment index arguments used in a lambda to be passed to the lambda. In previous versions of RAJA, every lambda invoked in a kernel had to accept an index argument for every segment in the segment tuple passed to RAJA::kernel execution templates, even if not all segment indices were used in a lambda. This release still allows that usage pattern. The new capability requires an additional template parameter to be passed to the RAJA::statement::Lambda type, which identifies the segment indices that will be passed and in which order.
-
API Changes:
- The RAJA 'VarOps' namespace has been removed. All entities previously in that namespace are now in the 'RAJA' namespace.
- RAJA span is now public for users to access and has been made more like std::span.
- RAJA::statement::tile_fixed has been moved to RAJA::tile_fixed (namespace change).
- RAJA::statement::{Segs, Offsets, Params, ValuesT} have been moved to RAJA::{Segs, Offsets, Params, ValuesT} (namespace change).
- RAJA ListSegment constructors have been expanded to accept a camp Resource object. This enables run time specification of the memory space where the data for list segment indices will live. In earlier RAJA versions, the space in which list segment index data lived was a compile-time choice based on whether CUDA or HIP was enabled and the data resided in unified memory for either case. This is still supported in this release, but is marked as a DEPRECATED FEATURE. In the next RAJA release, ListSegment construction will require a camp Resource object. When compiling RAJA with your application, you will see deprecation warnings if you are using the deprecated ListSegment constructor.
- A reset method was added to OpenMP target offload reduction classes so they contain the same functionality as reductions for all other back-ends.
-
Build changes/improvements:
- The BLT, camp, CUB, and rocPRIM submodules have all been updated to more recent versions. Please note that RAJA now requires ROCm version 3.5 or newer to use the HIP back-end.
- Build for clang9 on macOS has been fixed.
- Build for Intel19 on Windows has been fixed.
- Host/device annotations have been added to reduction operations to eliminate compiler warnings for certain use cases.
- Several warnings generated by the MSVC compiler have been eliminated.
- A couple of PGI compiler warnings have been removed.
- CMake improvements to make it easier to use an external camp or CUB library with RAJA.
- Note that the RAJA tests are undergoing a substantial overhaul. Users who choose to build and run RAJA tests should know that many tests are now being generated in the build space directory structure, which mimics the RAJA source directory structure. As a result, only some test executables appear in the top-level 'test' subdirectory of the build directory; others can be found in lower-level directories. The reason for this change is to reduce test build times for certain compilers.
-
Bug fixes:
- An issue with SIMD privatization with the Intel compiler, required to generate correct code, has been fixed.
- An issue with the atomicExchange() operation for the RAJA HIP back-end has been fixed.
- A type issue in the RAJA::kernel implementation involving RAJA span usage has been fixed.
- Checks for iterator ranges and container sizes have been added to RAJA scans, which fixes an issue when users attempted to run a scan over a range of size zero.
- Several type errors in the Layout.hpp header file have been fixed.
- Several fixes have been made in the Layout and Static Layout types.
- Several fixes have been made to the OpenMP target offload back-end to address host-device memory issues.
- A variety of RAJA User Guide issues have been addressed, as well as issues in RAJA example codes.
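As a brief illustration of what the bitwise AND and OR reductions added in v0.12.0 compute, the following sketch uses only the C++ standard library; it is not RAJA code, and `reduce_bit_and` and `reduce_bit_or` are hypothetical names. The point it shows is the choice of initial value: all bits set for AND and zero for OR, so that the initial value does not affect the result.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <numeric>
#include <vector>

// Bitwise-AND reduction: start from all bits set (the identity for AND).
inline std::uint32_t reduce_bit_and(const std::vector<std::uint32_t>& v)
{
  return std::accumulate(v.begin(), v.end(), ~std::uint32_t{0},
                         std::bit_and<std::uint32_t>());
}

// Bitwise-OR reduction: start from zero (the identity for OR).
inline std::uint32_t reduce_bit_or(const std::vector<std::uint32_t>& v)
{
  return std::accumulate(v.begin(), v.end(), std::uint32_t{0},
                         std::bit_or<std::uint32_t>());
}
```

Both operations are associative and commutative, which is what allows a parallel back-end to combine partial results in any order.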
v0.11.0
This release contains new features, several notable changes, and some bug fixes.
Please download the RAJA-v0.11.0.tar.gz file below. The others will not work due to the way RAJA uses git submodules.
Notable changes include:
-
New features:
- HIP compiler back-end added to support AMD GPUs. Usage is essentially the same as for CUDA. Note that this feature is considered a work-in-progress and not yet production ready. It is undocumented, but noted here, for friendly users who would like to try it out.
- Updated version of camp third-party library, which includes a variety of portability fixes. Most users should not need to concern themselves with the details of camp.
- Added new tutorial material and exercises.
- Documentation improvements.
-
API Changes:
- None.
-
Build changes/improvements:
- RAJA version number is now accessible as #define macro constants so that users who need to parameterize their code to support multiple RAJA versions can do this more easily. See the file RAJA/include/RAJA/config.hpp for details. RAJA version numbers are also exported as CMake variables.
- Added support to link to external camp library. By default, the camp git submodule will be used. If you prefer to use a different version of camp, set the RAJA CMake variable 'EXTERNAL_CAMP_SOURCE_DIR' to the location of the desired camp directory.
- BLT submodule (CMake-based build system) has been updated to latest BLT release (v0.3.0). The release contains a new version of GoogleTest, which required us to modify our use of gtest macros and our own testing macros. For the most part, this change should be invisible to users. However, the new GoogleTest does not work with CUDA versions 9.1.x or earlier. Therefore, if you compile RAJA with CUDA enabled and also wish to enable RAJA tests, you must use CUDA 9.2.x or newer.
-
Bug fixes:
- Fixed various issues to make internal implementations more robust, resolved issues with types that were not fully qualified in some places, and added workarounds for some compiler issues.
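The version macros mentioned above for v0.11.0 can be used to guard code that depends on a particular RAJA version. The following is a minimal sketch under stated assumptions: the macro names RAJA_VERSION_MAJOR, RAJA_VERSION_MINOR, and RAJA_VERSION_PATCHLEVEL are assumed here and should be checked against RAJA/include/RAJA/config.hpp, the fallback values exist only so the sketch compiles without RAJA, and `raja_version_at_least` is a hypothetical helper.

```cpp
#include <cassert>

// In a real build these would come from RAJA/config.hpp. The macro names below
// are assumed for illustration; the fallback values only let the sketch
// compile without RAJA.
#ifndef RAJA_VERSION_MAJOR
#define RAJA_VERSION_MAJOR 0
#define RAJA_VERSION_MINOR 11
#define RAJA_VERSION_PATCHLEVEL 0
#endif

// Hypothetical helper: true if the (assumed) RAJA version is at least
// major.minor.patch.
constexpr bool raja_version_at_least(int major, int minor, int patch)
{
  return (RAJA_VERSION_MAJOR > major) ||
         (RAJA_VERSION_MAJOR == major && RAJA_VERSION_MINOR > minor) ||
         (RAJA_VERSION_MAJOR == major && RAJA_VERSION_MINOR == minor &&
          RAJA_VERSION_PATCHLEVEL >= patch);
}
```

An application could then branch on `raja_version_at_least(0, 11, 0)`, or use the macros directly in `#if` directives when the code paths must not both compile.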