
Conversation

charles-turner-1

@charles-turner-1 charles-turner-1 commented Jul 13, 2025

  • Tests added


welcome bot commented Jul 13, 2025

Thank you for opening this pull request! It may take us a few days to respond here, so thank you for being patient.
If you have questions, some answers may be found in our contributing guidelines.

@github-actions github-actions bot added topic-documentation topic-NamedArray Lightweight version of Variable labels Jul 13, 2025
@charles-turner-1 charles-turner-1 changed the title All works, just need to satisfy mypy and whatnot now Add chunks='auto' support for cftime datasets Jul 13, 2025
@jemmajeffree
Contributor

Would these changes also work for cf timedeltas or are they going to still cause problems?
I'm tempted to write a script to bash through all the ACCESS-NRI intake datastores and see if there's anything else in there that's dtype object — let me know if this would be useful, or if we should just wait for it to break later

@charles-turner-1
Author

> Would these changes also work for cf timedeltas or are they going to still cause problems? I'm tempted to write a script to bash through all the ACCESS-NRI intake datastores and see if there's anything else in there that's dtype object — let me know if this would be useful, or if we should just wait for it to break later

If you can find something that's specifically a cftimedelta and run the _contains_cftime_datetimes function on it, it'd be super helpful to know whether it returns True or False.
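
For reference, a minimal check could look like the sketch below; the import path of the private helper (xarray.core.common) is an assumption and may move between xarray versions.

import cftime
import numpy as np

# Private helper; its location is an assumption and may differ across versions.
from xarray.core.common import _contains_cftime_datetimes

dates = np.array([cftime.DatetimeNoLeap(2000, 1, d) for d in (1, 2, 3)])
deltas = np.diff(dates)  # plain datetime.timedelta objects, not cftime datetimes

print(_contains_cftime_datetimes(dates))   # expected: True
print(_contains_cftime_datetimes(deltas))  # expected: False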

@charles-turner-1 charles-turner-1 marked this pull request as draft July 14, 2025 05:02
@jemmajeffree
Contributor

TL;DR: don't mind me, it's not going to cause any issues.

Firstly, what I thought was a cftimedelta turned out to be a numpy timedelta hanging out with a cftime:
[screenshot of notebook output]
When I did manage to coerce this timedelta into cftime conventions, it just contained a floating-point number of days, so I can't see anything having issues with its size:

import xarray as xr

# `oops` here is the dataset being inspected; encode its average_DT variable
# with the CF timedelta coder to see what dtype comes back.
coder = xr.coding.times.CFTimedeltaCoder()
result = coder.encode(oops.average_DT).load()
print(result.dtype)
result
[screenshot of notebook output]

@charles-turner-1
Author

I did some prodding around yesterday and I realised this won't let us do something like

import xarray as xr
cftime_datafile = "/path/to/file.nc"
xr.open_dataset(cftime_datafile, chunks='auto')

yet, only stuff along the lines of

import xarray as xr
cftime_datafile = "/path/to/file.nc"
ds = xr.open_dataset(cftime_datafile, chunks=-1)
ds = ds.chunk('auto')

I think implementing the former is going to be a bit harder, but I'm starting to clock the code structure a bit more now so I'll have a decent crack.

@dcherian
Contributor

Why so? Are we sending "auto" into normalize_chunks first?

@charles-turner-1
Author

charles-turner-1 commented Jul 23, 2025

Yup, this is the call stack:

----> 3 xr.open_dataset(
      4     "/Users/u1166368/xarray/tos_Omon_CESM2-WACCM_historical_r2i1p1f1_gr_185001-201412.nc", chunks="auto"
  /Users/u1166368/xarray/xarray/backends/api.py(721)open_dataset()
    720     )
--> 721     ds = _dataset_from_backend_dataset(
    722         backend_ds,
  /Users/u1166368/xarray/xarray/backends/api.py(418)_dataset_from_backend_dataset()
    417     if chunks is not None:
--> 418         ds = _chunk_ds(
    419             ds,
  /Users/u1166368/xarray/xarray/backends/api.py(368)_chunk_ds()
    367     for name, var in backend_ds.variables.items():
--> 368         var_chunks = _get_chunk(var, chunks, chunkmanager)
    369         variables[name] = _maybe_chunk(
  /Users/u1166368/xarray/xarray/structure/chunks.py(102)_get_chunk()
    101 
--> 102     chunk_shape = chunkmanager.normalize_chunks(
    103         chunk_shape, shape=shape, dtype=var.dtype, previous_chunks=preferred_chunk_shape
> /Users/u1166368/xarray/xarray/namedarray/daskmanager.py(60)normalize_chunks()

I've fixed it in the latest commit - but I think the implementation leaves a lot to be desired too.

Should I refactor to move the changes in xarray/structure/chunks.py into the daskmanager module, if possible?

Once I've got the structure there cleaned up, I'll work on replacing the build_chunkspec function with something more sensible. I just need to work out how to extract the implementation from dask cleanly now, I think; normalize_chunks also seems to calculate sensible chunk sizes.
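
For reference, dask's normalize_chunks already does the byte-budget arithmetic once it is given a concrete dtype and limit; the awkward part for cftime data is that object dtype has no usable itemsize. A minimal illustration (shape and limit are arbitrary):

import numpy as np
from dask.array.core import normalize_chunks

# "auto" asks dask to pick chunk sizes that fit within the byte limit for the
# given dtype; object-dtype arrays have no meaningful itemsize, hence this PR.
chunks = normalize_chunks(
    "auto",
    shape=(1_000_000, 50),
    dtype=np.dtype("float64"),
    limit="128MiB",
)
print(chunks)  # per-dimension chunk sizes chosen to stay under the 128MiB budget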


from xarray.namedarray.utils import build_chunkspec

target_chunksize = parse_bytes(dask_config.get("array.chunk-size"))
Contributor

How about adding get_auto_chunk_size to the ChunkManager class, and putting the dask-specific stuff in the DaskManager?

cc @TomNicholas
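
A hypothetical sketch of that split (names mirror the suggested change below; not necessarily the final API):

class ChunkManagerEntrypoint:
    def get_auto_chunk_size(self) -> int:
        """Target chunk size in bytes to use when chunks='auto'."""
        raise NotImplementedError


class DaskManager(ChunkManagerEntrypoint):
    def get_auto_chunk_size(self) -> int:
        from dask import config as dask_config
        from dask.utils import parse_bytes

        # dask keeps its default target chunk size under "array.chunk-size"
        return parse_bytes(dask_config.get("array.chunk-size"))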

@dcherian
Contributor

dcherian commented Jul 23, 2025

I guess one bit that's confusing here is that the code path for backends and normal variables is different?

So let's add a test that reads from disk, and one that works with a DataArray constructed in memory.
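
A hedged sketch of those two tests (the names, the cftime_range calendar, and the tmp_path handling are illustrative, not the PR's actual tests):

import numpy as np
import xarray as xr


def test_auto_chunk_cftime_in_memory():
    times = xr.cftime_range("2000-01-01", periods=48, calendar="noleap")
    da = xr.DataArray(np.array(list(times)), dims="time")
    # object-dtype cftime data should now accept chunks="auto"
    assert da.chunk("auto").chunks is not None


def test_auto_chunk_cftime_from_disk(tmp_path):
    times = xr.cftime_range("2000-01-01", periods=48, calendar="noleap")
    ds = xr.Dataset({"x": ("time", np.arange(48))}, coords={"time": times})
    path = tmp_path / "cftime.nc"
    ds.to_netcdf(path)
    # exercises the backend code path (open_dataset -> _get_chunk -> normalize_chunks)
    with xr.open_dataset(path, chunks="auto") as opened:
        assert opened["x"].chunks is not None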

cubed.Array.rechunk
"""

if _contains_cftime_datetimes(data):
Contributor

I guess this can be deleted

Author

Had a play and I don't think I can fully get rid of it; I've reused as much of the abstracted logic as possible, though.

Comment on lines 323 to 330
    def get_auto_chunk_size(self, var: Variable) -> tuple[int, _DType]:
        from dask import config as dask_config
        from dask.utils import parse_bytes

        from xarray.namedarray.utils import fake_target_chunksize

        target_chunksize = parse_bytes(dask_config.get("array.chunk-size"))
        return fake_target_chunksize(var, target_chunksize=target_chunksize)
Contributor

Suggested change:
-    def get_auto_chunk_size(self, var: Variable) -> tuple[int, _DType]:
-        from dask import config as dask_config
-        from dask.utils import parse_bytes
-        from xarray.namedarray.utils import fake_target_chunksize
-        target_chunksize = parse_bytes(dask_config.get("array.chunk-size"))
-        return fake_target_chunksize(var, target_chunksize=target_chunksize)
+    def get_auto_chunk_size(self) -> int:
+        from dask import config as dask_config
+        from dask.utils import parse_bytes
+        return parse_bytes(dask_config.get("array.chunk-size"))

Only this much is dask-specific, so that's what the DaskManager should be responsible for.

Comment on lines 93 to 96
    if _contains_cftime_datetimes(var) and auto_chunks:
        limit, var_dtype = chunkmanager.get_auto_chunk_size(var)
    else:
        limit, var_dtype = None, var.dtype
Contributor

This logic would change to use fake_target_chunksize

@charles-turner-1
Author

I think most of the work left to do on this is just fixing the typing now...

@charles-turner-1 charles-turner-1 marked this pull request as ready for review August 8, 2025 06:36
    if no_op:
        return target_chunksize, data.dtype

    import numpy as np
Contributor

Let's move imports to the top if we can, and remove the no_op bit.

Author

Can only move numpy to the top; a module-level "from xarray.core.formatting import first_n_items" creates a circular import.
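
For reference, the shape of that constraint (the function body below is a placeholder, not the PR's implementation):

import numpy as np  # fine at module scope


def first_item(array):
    # A module-level `from xarray.core.formatting import first_n_items` would
    # create a circular import with xarray.namedarray.utils, so defer it here.
    from xarray.core.formatting import first_n_items

    return first_n_items(np.asarray(array), 1)[0]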

Author
@charles-turner-1 charles-turner-1 Aug 25, 2025

I've removed the no_op stuff - this has the effect of assuming uniform dtypes across all arrays going into dask now. All tests pass (locally), so it's probably not a big deal. I'm not even sure that numpy would allow mixed dtypes - it doesn't feel like it should - but it might be worth noting?

Contributor

What do you mean?

Author

Looked this up subsequently and I think I'm talking waffle - the if no_op was just in there so that the logic for getting the array item size (in bytes) from the first item was skipped when we didn't find both a cftime dtype in the array and a request for auto chunking.

Since arrays can only contain a single dtype, this shouldn't make any difference.

TL;DR: ignore my previous comment, it was nonsense.

@dcherian
Contributor

Sorry for the late review here. I left a few minor comments. Happy to merge after those are addressed

@charles-turner-1
Author

No worries, figured you must have been busy/on holiday.

I've addressed all those comments - thanks for all the help getting this off the ground!

@@ -83,8 +85,15 @@ def _get_chunk(var: Variable, chunks, chunkmanager: ChunkManagerEntrypoint):
        for dim, preferred_chunk_sizes in zip(dims, preferred_chunk_shape, strict=True)
    )

    limit = chunkmanager.get_auto_chunk_size()
    limit, var_dtype = fake_target_chunksize(var, limit)
Contributor

Don't we need to check whether var contains cftime objects?

Author
@charles-turner-1 charles-turner-1 Aug 25, 2025

This is related to what I was getting at yesterday with the no-op bit - reverting b5933ed would put that back in.

With that said, the logic doesn't change meaningfully depending on it. Currently, if we pass, e.g., a 300MiB limit in for a var which is an f64, we tell dask to compute the chunks based on those numbers. If we pass in an f32 with the same limit, it'll currently tell the dask chunking mechanism to compute chunks for an f64 with a 150MiB limit, which gets us the exact same chunk sizes back (based on my tests).
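
A worked version of those numbers: halving the byte limit while doubling the assumed itemsize leaves the element count per chunk unchanged, so dask picks the same chunk shapes either way.

f32_itemsize, f64_itemsize = 4, 8
limit_f32 = 300 * 2**20                               # 300MiB budget for float32 data
limit_f64 = limit_f32 * f32_itemsize // f64_itemsize  # presented to dask as 150MiB of float64
assert limit_f32 // f32_itemsize == limit_f64 // f64_itemsize  # same elements per chunk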

Actually, one of the side effects of the current implementation (no _contains_cftime_datetimes(var) check) is that it would let you chunk arbitrary object dtypes, not just cftime. Whether this is desirable or not would, I guess, depend on whether you'd expect people to put in arbitrarily/variably sized objects - if the size of the objects in an array can vary, then the chunk calculation could produce inappropriate chunks.

Author

I guess with the current implementation maybe_fake_target_chunksize would be a better name for the function; if we revert b5933ed, then fake_target_chunksize makes more sense again.
