[RELEASE] rmm v25.04 #1876

Merged
merged 48 commits into from
Apr 9, 2025


raydouglass

❄️ Code freeze for branch-25.04 and v25.04 release

What does this mean?

Only critical/hotfix level issues should be merged into branch-25.04 until release (merging of this PR).

What is the purpose of this PR?

  • Update documentation
  • Allow testing for the new release
  • Enable a means to merge branch-25.04 into main for the release

raydouglass and others added 30 commits January 23, 2025 15:03
Forward-merge branch-25.02 into branch-25.04
Contributes to rapidsai/build-planning#146

Proposes:

* setting `[tool.scikit-build].ninja.make-fallback = false`, so `scikit-build-core` will not silently fall back to using GNU Make if `ninja` is not available

Authors:
  - James Lamb (https://github.com/jameslamb)

Approvers:
  - Bradley Dice (https://github.com/bdice)

URL: #1804
Forward-merge branch-25.02 to branch-25.04
This migrates amd64 CI jobs (PRs and nightlies) to use L4 GPUs from the NVKS cluster.

xref: rapidsai/build-infra#184

Authors:
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - Gil Forsyth (https://github.com/gforsyth)

URL: #1803
Fixes `build_type` input not being used in `test` workflows. See
#1811 (comment).
## Description
Testing rapidsai/shared-workflows#276.

We will merge this PR and then we can try running manual branch tests.

## Checklist
- [x] I am familiar with the [Contributing
Guidelines](https://github.com/rapidsai/rmm/blob/HEAD/CONTRIBUTING.md).
- [x] New or existing tests cover these changes.
- [x] The documentation is up to date with these changes.
Uses a retry wrapper for `pip` commands to try to alleviate CI failures due to hash mismatches that result from network hiccups

xref rapidsai/build-planning#148

This will retry failures that show up in CI like:

```
   Collecting nvidia-cublas-cu12 (from libraft-cu12==25.2.*,>=0.0.0a0)
    Downloading https://pypi.nvidia.com/nvidia-cublas-cu12/nvidia_cublas_cu12-12.8.3.14-py3-none-manylinux_2_27_aarch64.whl (604.9 MB)
       ━━━━━━━━━━━━━━━━━━━━━                 350.2/604.9 MB 229.2 MB/s eta 0:00:02
  ERROR: THESE PACKAGES DO NOT MATCH THE HASHES FROM THE REQUIREMENTS FILE. If you have updated the package versions, please update the hashes. Otherwise, examine the package contents carefully; someone may have tampered with them.
      nvidia-cublas-cu12 from https://pypi.nvidia.com/nvidia-cublas-cu12/nvidia_cublas_cu12-12.8.3.14-py3-none-manylinux_2_27_aarch64.whl#sha256=93a4e0e386cc7f6e56c822531396de8170ed17068a1e18f987574895044cd8c3 (from libraft-cu12==25.2.*,>=0.0.0a0):
          Expected sha256 93a4e0e386cc7f6e56c822531396de8170ed17068a1e18f987574895044cd8c3
               Got        849c88d155cb4b4a3fdfebff9270fb367c58370b4243a2bdbcb1b9e7e940b7be
```

Authors:
  - Gil Forsyth (https://github.com/gforsyth)

Approvers:
  - Bradley Dice (https://github.com/bdice)

URL: #1814
This completes the migration to NVKS runners now that all libraries have been tested and rapidsai/shared-workflows#273 has been merged.

xref: rapidsai/build-infra#184

Authors:
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - James Lamb (https://github.com/jameslamb)

URL: #1816
This change helps completely insulate rmm (and, transitively, the rest of RAPIDS) from fmt and spdlog as dependencies, thereby solving a large number of issues around ABI stability, symbol visibility, package clobbering, and more. See rapidsai/build-planning#104 for more information.

Authors:
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - Matthew Murray (https://github.com/Matt711)
  - Bradley Dice (https://github.com/bdice)
  - James Lamb (https://github.com/jameslamb)

URL: #1808
Addresses #1808 (comment)

Authors:
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - James Lamb (https://github.com/jameslamb)

URL: #1820
A pair of doxygen comments in `host_memory_resource` referenced `device_memory_resource` when they were not meant to, very likely a simple copy/paste issue.

#1794

Authors:
  - Nicholas Sielicki (https://github.com/aws-nslick)
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - Bradley Dice (https://github.com/bdice)

URL: #1809
This is a cleanup PR. I found that we were extraneously including `<thrust/optional.h>` in the pool memory resource (also `thrust::optional` is deprecated in favor of `cuda::std::optional` in the upcoming major release of CCCL). I did a pass with IWYU to see what else could be fixed. IWYU could only really analyze our tests, since RMM is header-only. There are a lot of false positives/negatives, so I don't think it is appropriate to automate IWYU in our CI. However, this felt valuable enough to open a refactoring PR.

I also updated some deprecated GTest code which was using `TYPED_TEST_CASE` instead of `TYPED_TEST_SUITE` and replaced some uses of `::value` with the corresponding `_v` STL features.
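
For readers unfamiliar with the `_v` shorthand mentioned above, here is a minimal sketch of the two spellings side by side. It is not taken from the RMM tests; the helper names `is_integral_old` and `is_integral_new` are hypothetical and exist only for illustration. It assumes a C++17 compiler.

```cpp
#include <type_traits>

// Pre-C++17 spelling: read the nested ::value member of the trait.
// (Hypothetical helper, not part of RMM.)
template <typename T>
constexpr bool is_integral_old()
{
  return std::is_integral<T>::value;
}

// C++17 spelling: the _v variable template is shorthand for ::value.
// (Hypothetical helper, not part of RMM.)
template <typename T>
constexpr bool is_integral_new()
{
  return std::is_integral_v<T>;
}

// Both spellings must agree for every type; checked at compile time.
static_assert(is_integral_old<int>() == is_integral_new<int>(), "spellings differ for int");
static_assert(is_integral_old<float>() == is_integral_new<float>(), "spellings differ for float");
```

The two forms are equivalent, so the replacement is purely a readability change.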

Authors:
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - Lawrence Mitchell (https://github.com/wence-)

URL: #1821
Forward-merge branch-25.02 into branch-25.04
Summary:

## `recipe.yaml`
- We use the [multi-output cache](https://rattler.build/latest/multiple_output_cache/) to avoid double-compiling. The `build` environment compiles things, and the individual outputs call `cmake --install`
- We make use of the built-in `git` functions for grabbing the short-SHA (https://rattler.build/latest/experimental_features/#git-functions)
- We use `load_from_file` to pull in metadata from the corresponding `pyproject.toml` (https://rattler.build/latest/experimental_features/#load_from_filefile_path)
- Relatively "simple" `*_build.sh` scripts are inlined into `recipe.yaml` instead of existing as separate files

## `build_*.sh`
- We use `--no-build-id` to allow `sccache` to look in a predictable place, see: https://rattler.build/latest/tips_and_tricks/#using-sccache-or-ccache-with-rattler-build
- Depending on the result of `rapids-is-release-build`, we include either `rapidsai` (release) or `rapidsai-nightly` (non-release) in the channel listing
- Channels must be specified at the command-line
  - This uses https://github.com/rapidsai/gha-tools/blob/main/tools/rapids-rattler-channel-string to generate an array of channels
- We remove the `build_cache` directory after building so it doesn't get packaged up with the other artifacts and uploaded to S3

xref: rapidsai/build-planning#47

Authors:
  - Gil Forsyth (https://github.com/gforsyth)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #1796
Update CMake minimum required to 3.30.4 across all of RAPIDS

Authors:
  - Robert Maynard (https://github.com/robertmaynard)

Approvers:
  - Gil Forsyth (https://github.com/gforsyth)
  - Bradley Dice (https://github.com/bdice)

URL: #1826
Removes the `.` from the `py_version` context variable and standardizes whitespace and section ordering

Authors:
  - Gil Forsyth (https://github.com/gforsyth)
  - https://github.com/apps/pre-commit-ci
  - Bradley Dice (https://github.com/bdice)
  - James Lamb (https://github.com/jameslamb)

Approvers:
  - James Lamb (https://github.com/jameslamb)
  - Bradley Dice (https://github.com/bdice)

URL: #1832
Fixes redistribution of `rapids-logger` code which can cause clobbering. See #1833.

After this change, the following paths should _not_ be in the `librmm` package:
- `lib/librapids_logger.so`
- `lib/cmake/rapids_logger/*`
- `include/rapids_logger/*`

Authors:
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - https://github.com/jakirkham
  - Gil Forsyth (https://github.com/gforsyth)

URL: #1834
This PR uses new functionality added to shared-actions and shared-workflows to capture sccache hit rate information. To add this to other repos, we'll need to make the slight alteration here:

`sccache --show-adv-stats | tee ../../telemetry-artifacts/sccache-stats.txt`

That is, output the sccache stats to a file with a particular name in the telemetry-artifacts folder.

Authors:
  - Mike Sarahan (https://github.com/msarahan)

Approvers:
  - James Lamb (https://github.com/jameslamb)

URL: #1830
Fixes for `build`/`host` dependencies in the rattler recipe for librmm.

Authors:
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - Gil Forsyth (https://github.com/gforsyth)

URL: #1835
RMM benchmarks should statically link Google Benchmark. We saw they were linking to `libbenchmark.so` while working with rattler-build: #1836 (comment)

Authors:
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - Gil Forsyth (https://github.com/gforsyth)
  - Kyle Edwards (https://github.com/KyleFromNVIDIA)

URL: #1837
Turns on erroring for overlinking errors and fixes all of those errors.

I've reduced the number of overdepending warnings, but `rapids-logger` seems to
consistently cause an overdepending warning, so I haven't yet switched that to
error mode.

Authors:
  - Gil Forsyth (https://github.com/gforsyth)
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - Bradley Dice (https://github.com/bdice)

URL: #1836
Telemetry is causing build workflows to fail. This adds `telemetry-setup` to the `build.yaml` workflow.

Authors:
  - Bradley Dice (https://github.com/bdice)
  - Mike Sarahan (https://github.com/msarahan)

Approvers:
  - Mike Sarahan (https://github.com/msarahan)

URL: #1838
bdice and others added 12 commits March 1, 2025 01:07
This is a skeleton for adding examples, requested in issue #1784.

I plan to merge some minimal form of this, and then add a few examples that answer common questions about RMM, such as how to use specific memory resource adaptors or how to use RMM for managing multi-thread, multi-stream work.

Authors:
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - Jake Awe (https://github.com/AyodeAwe)
  - Mark Harris (https://github.com/harrism)
  - Lawrence Mitchell (https://github.com/wence-)

URL: #1800
Second attempt at improved error throwing and logging, with bugs fixed and a test added that reproduces the cudf failure. The [original PR](#1827) was [reverted](#1843).

The changes relative to the previously approved PR, including the fixes and the test, are in [this commit](c8a8505). The [original while loop](https://github.com/rapidsai/rmm/blob/6e8539e42d51852faab5f9b330232168f9223eee/include/rmm/mr/device/pool_memory_resource.hpp#L253) has been restored with better error handling.

Note that this changes the interface of the macros, one of which is called in cudf. That call will be updated [here](rapidsai/cudf#18108) after this is merged.

Authors:
  - Paul Mattione (https://github.com/pmattione-nvidia)
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - Bradley Dice (https://github.com/bdice)

URL: #1844
Fix for `-fdebug-prefix-map` breaking sccache (it contains the librmm build number).

Workaround for prefix-dev/rattler-build#1458.

Authors:
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - Gil Forsyth (https://github.com/gforsyth)

URL: #1846
This PR adds tests for internal macros. Closes #1848.

Authors:
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #1847
This PR runs C++ examples in CI. Closes #1845.

Authors:
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - James Lamb (https://github.com/jameslamb)

URL: #1850
Updates several `dependencies.yaml` entries to match the others in the
file which allows the `update-version.sh` script to work correctly.
PR #1844 recently changed how error messages that point to a particular file and line number are generated. In particular, they changed from a typical C-string (`const char*`), which is `\0` terminated, to a C++ `std::string` object, which is not `\0` terminated. As a result, libraries compiled against the RMM headers (like cuDF) ended up containing file paths in strings that are not `\0` terminated.

Conda would detect the paths in these error messages and attempt to fix them as part of its prefix replacement process. When Conda performed the prefix replacement, it would add an additional `\0` terminating character to the string. However, because the strings were now `std::string` based and lacked `\0` terminating characters, the final string written out by Conda would be one byte longer. This could mean overwriting other text data in the library or writing outside the text block. This is a known bug in Conda (conda/conda-build#1674).

Thus, when cuDF started building with the aforementioned RMM change last week, the packages it created had file paths in error messages that lacked the `\0` terminating character. These were then inadvertently corrupted by Conda when the packages were installed in an environment. This led to a quite hairy bug detailed in rapidsai/cudf#18251.

To correct this issue, we drop the `std::string` constructor that was added in the aforementioned PR. More specifically, we adapted the following code from cuDF's [`CUDF_EXPECTS_3`](https://github.com/rapidsai/cudf/blob/8041ac8e370b092229841508fdfd1efb88fef034/cpp/include/cudf/utilities/error.hpp#L186-L192) and [`CUDF_FAIL_2`](https://github.com/rapidsai/cudf/blob/86eb82399f0e056731e2062dc95a4583c26e9af1/cpp/include/cudf/utilities/error.hpp#L225-L227), which still use a C-style string. Also, to address the need for runtime generation of some errors, we use `std::string` for only an initial snippet of the string and add other contents like the `__FILE__` after. This keeps the latter bits as C-style strings.
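
The approach described above can be sketched roughly as follows. This is an illustration only, not the actual RMM or cuDF macro: every name here (`SKETCH_STRINGIFY`, `SKETCH_LOCATION`, `SKETCH_FAIL`, `sketch::fail_at`, `sketch::failure_message`) is hypothetical. The idea it tries to show is that the file path and line number stay in a plain string literal, while `std::string` is used only for the leading snippet and the runtime-generated reason.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Two-level stringification so that __LINE__ is expanded before being quoted.
#define SKETCH_STRINGIFY_DETAIL(x) #x
#define SKETCH_STRINGIFY(x) SKETCH_STRINGIFY_DETAIL(x)

// Adjacent string literals are concatenated by the compiler, so the location
// is a single ordinary C-style string literal containing the file path.
#define SKETCH_LOCATION __FILE__ ":" SKETCH_STRINGIFY(__LINE__)

namespace sketch {

// Hypothetical helper: std::string is used only for the leading snippet and
// the runtime reason; the location is passed in as a const char*.
[[noreturn]] inline void fail_at(const char* location, const std::string& reason)
{
  assert(location != nullptr);
  throw std::logic_error(std::string{"Sketch failure at: "} + location + ": " + reason);
}

}  // namespace sketch

// Hypothetical macro: captures the call site as a C-style string literal.
#define SKETCH_FAIL(reason) ::sketch::fail_at(SKETCH_LOCATION, (reason))

namespace sketch {

// Triggers a failure and returns the message text, for demonstration.
inline std::string failure_message(const std::string& reason)
{
  try {
    SKETCH_FAIL(reason);
  } catch (const std::logic_error& e) {
    return e.what();
  }
  return {};
}

}  // namespace sketch
```

Calling `sketch::failure_message("pool exhausted")` returns a message that begins with `Sketch failure at: `, followed by the source location and the reason.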

Authors:
  - https://github.com/jakirkham

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Paul Mattione (https://github.com/pmattione-nvidia)
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #1858
rmm nightlies are currently failing on CUDA 11.4 because the CUDA 11
`librmm-examples` package is overconstrained.
If the driver supports the flag, unconditionally set the async memory pool usage property to include a request to support HW decompression.

- Closes #1849

Authors:
  - Lawrence Mitchell (https://github.com/wence-)
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - Rong Ou (https://github.com/rongou)
  - Bradley Dice (https://github.com/bdice)
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #1854
…#1873)

This reverts commit 7f0cead.

- Closes #1872

Authors:
  - Lawrence Mitchell (https://github.com/wence-)

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #1873

copy-pr-bot bot commented Mar 20, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

github-actions bot added the CMake, Python (Related to RMM Python API), conda, cpp (Pertains to C++ code), and ci labels Mar 20, 2025
@AyodeAwe AyodeAwe merged commit 35ca074 into main Apr 9, 2025
5 of 7 checks passed