
Conversation

charles-turner-1

@charles-turner-1 charles-turner-1 commented Jul 13, 2025

  • Tests added


welcome bot commented Jul 13, 2025

Thank you for opening this pull request! It may take us a few days to respond here, so thank you for being patient.
If you have questions, some answers may be found in our contributing guidelines.

@github-actions github-actions bot added topic-documentation topic-NamedArray Lightweight version of Variable labels Jul 13, 2025
@charles-turner-1 charles-turner-1 changed the title All works, just need to satisfy mypy and whatnot now Add chunks='auto' support for cftime datasets Jul 13, 2025
@jemmajeffree
Contributor

Would these changes also work for cf timedeltas or are they going to still cause problems?
I'm tempted to write a script to bash through all the ACCESS-NRI intake datastores and see if there's anything else in there that's dtype object — let me know if this would be useful, or if we should just wait for it to break later

@charles-turner-1
Author

> Would these changes also work for cf timedeltas or are they going to still cause problems? I'm tempted to write a script to bash through all the ACCESS-NRI intake datastores and see if there's anything else in there that's dtype object — let me know if this would be useful, or if we should just wait for it to break later

If you can find something that's specifically a cftimedelta and run the _contains_cftime_datetimes function on it, it'd be super helpful to know whether it returns True or False.
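
For reference, a minimal check could look like the sketch below; the import path of the private helper (xarray.core.common) is an assumption and may move between xarray versions.

import cftime
import numpy as np

# Private helper; its location is an assumption and may differ across versions.
from xarray.core.common import _contains_cftime_datetimes

dates = np.array([cftime.DatetimeNoLeap(2000, 1, d) for d in (1, 2, 3)])
deltas = np.diff(dates)  # plain datetime.timedelta objects, not cftime datetimes

print(_contains_cftime_datetimes(dates))   # expected: True
print(_contains_cftime_datetimes(deltas))  # expected: False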

@charles-turner-1 charles-turner-1 marked this pull request as draft July 14, 2025 05:02
@jemmajeffree
Contributor

TL;DR: don't mind me, it's not going to cause any issues.

Firstly, what I thought was a cftimedelta turned out to be a numpy timedelta hanging out with a cftime:
[screenshot of notebook output]
When I did manage to coerce this timedelta into cftime conventions, it just contained a floating-point number of days, so I can't see anything having issues with its size:

import xarray as xr

# `oops` here is the dataset being inspected; encode its average_DT variable
# with the CF timedelta coder to see what dtype comes back.
coder = xr.coding.times.CFTimedeltaCoder()
result = coder.encode(oops.average_DT).load()
print(result.dtype)
result
[screenshot of notebook output]

@charles-turner-1
Author

I did some prodding around yesterday and I realised this won't let us do something like

import xarray as xr
cftime_datafile = "/path/to/file.nc"
xr.open_dataset(cftime_datafile, chunks='auto')

yet, only stuff along the lines of

import xarray as xr
cftime_datafile = "/path/to/file.nc"
ds = xr.open_dataset(cftime_datafile, chunks=-1)
ds = ds.chunk('auto')

I think implementing the former is going to be a bit harder, but I'm starting to clock the code structure a bit more now so I'll have a decent crack.

@dcherian
Contributor

Why so? Are we sending "auto" into normalize_chunks first?

@charles-turner-1
Author

charles-turner-1 commented Jul 23, 2025

Yup, this is the call stack:

----> 3 xr.open_dataset(
      4     "/Users/u1166368/xarray/tos_Omon_CESM2-WACCM_historical_r2i1p1f1_gr_185001-201412.nc", chunks="auto"
  /Users/u1166368/xarray/xarray/backends/api.py(721)open_dataset()
    720     )
--> 721     ds = _dataset_from_backend_dataset(
    722         backend_ds,
  /Users/u1166368/xarray/xarray/backends/api.py(418)_dataset_from_backend_dataset()
    417     if chunks is not None:
--> 418         ds = _chunk_ds(
    419             ds,
  /Users/u1166368/xarray/xarray/backends/api.py(368)_chunk_ds()
    367     for name, var in backend_ds.variables.items():
--> 368         var_chunks = _get_chunk(var, chunks, chunkmanager)
    369         variables[name] = _maybe_chunk(
  /Users/u1166368/xarray/xarray/structure/chunks.py(102)_get_chunk()
    101 
--> 102     chunk_shape = chunkmanager.normalize_chunks(
    103         chunk_shape, shape=shape, dtype=var.dtype, previous_chunks=preferred_chunk_shape
> /Users/u1166368/xarray/xarray/namedarray/daskmanager.py(60)normalize_chunks()

I've fixed it in the latest commit - but I think the implementation leaves a lot to be desired too.

Should I refactor to move the changes in xarray/structure/chunks.py into the daskmanager module, if possible?

Once I've got the structure there cleaned up, I'll work on replacing the build_chunkspec function with something more sensible. I just need to work out how to extract the implementation from dask cleanly now, I think; normalize_chunks also seems to calculate sensible chunk sizes.
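
For reference, dask's normalize_chunks already does the byte-budget arithmetic once it is given a concrete dtype and limit; the awkward part for cftime data is that object dtype has no usable itemsize. A minimal illustration (shape and limit are arbitrary):

import numpy as np
from dask.array.core import normalize_chunks

# "auto" asks dask to pick chunk sizes that fit within the byte limit for the
# given dtype; object-dtype arrays have no meaningful itemsize, hence this PR.
chunks = normalize_chunks(
    "auto",
    shape=(1_000_000, 50),
    dtype=np.dtype("float64"),
    limit="128MiB",
)
print(chunks)  # per-dimension chunk sizes chosen to stay under the 128MiB budget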


from xarray.namedarray.utils import build_chunkspec

target_chunksize = parse_bytes(dask_config.get("array.chunk-size"))
Contributor

How about adding get_auto_chunk_size to the ChunkManager class, and putting the dask-specific stuff in the DaskManager?

cc @TomNicholas
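
A hypothetical sketch of that split (names mirror the suggested change below; not necessarily the final API):

class ChunkManagerEntrypoint:
    def get_auto_chunk_size(self) -> int:
        """Target chunk size in bytes to use when chunks='auto'."""
        raise NotImplementedError


class DaskManager(ChunkManagerEntrypoint):
    def get_auto_chunk_size(self) -> int:
        from dask import config as dask_config
        from dask.utils import parse_bytes

        # dask keeps its default target chunk size under "array.chunk-size"
        return parse_bytes(dask_config.get("array.chunk-size"))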

@dcherian
Contributor

dcherian commented Jul 23, 2025

I guess one bit that's confusing here is that the code path for backends and normal variables is different?

So let's add a test that reads from disk, and one that works with a DataArray constructed in memory.
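
A hedged sketch of those two tests (the names, the cftime_range calendar, and the tmp_path handling are illustrative, not the PR's actual tests):

import numpy as np
import xarray as xr


def test_auto_chunk_cftime_in_memory():
    times = xr.cftime_range("2000-01-01", periods=48, calendar="noleap")
    da = xr.DataArray(np.array(list(times)), dims="time")
    # object-dtype cftime data should now accept chunks="auto"
    assert da.chunk("auto").chunks is not None


def test_auto_chunk_cftime_from_disk(tmp_path):
    times = xr.cftime_range("2000-01-01", periods=48, calendar="noleap")
    ds = xr.Dataset({"x": ("time", np.arange(48))}, coords={"time": times})
    path = tmp_path / "cftime.nc"
    ds.to_netcdf(path)
    # exercises the backend code path (open_dataset -> _get_chunk -> normalize_chunks)
    with xr.open_dataset(path, chunks="auto") as opened:
        assert opened["x"].chunks is not None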

cubed.Array.rechunk
"""

if _contains_cftime_datetimes(data):
Contributor

I guess this can be deleted

Author

Had a play and I don't think I can fully get rid of it; I've reused as much of the abstracted logic as possible, though.

Comment on lines 323 to 330
    def get_auto_chunk_size(self, var: Variable) -> tuple[int, _DType]:
        from dask import config as dask_config
        from dask.utils import parse_bytes

        from xarray.namedarray.utils import fake_target_chunksize

        target_chunksize = parse_bytes(dask_config.get("array.chunk-size"))
        return fake_target_chunksize(var, target_chunksize=target_chunksize)
Contributor

Suggested change:
-    def get_auto_chunk_size(self, var: Variable) -> tuple[int, _DType]:
-        from dask import config as dask_config
-        from dask.utils import parse_bytes
-        from xarray.namedarray.utils import fake_target_chunksize
-        target_chunksize = parse_bytes(dask_config.get("array.chunk-size"))
-        return fake_target_chunksize(var, target_chunksize=target_chunksize)
+    def get_auto_chunk_size(self) -> int:
+        from dask import config as dask_config
+        from dask.utils import parse_bytes
+        return parse_bytes(dask_config.get("array.chunk-size"))

Only this much is dask-specific, so that's what the DaskManager should be responsible for.

Comment on lines 93 to 96
    if _contains_cftime_datetimes(var) and auto_chunks:
        limit, var_dtype = chunkmanager.get_auto_chunk_size(var)
    else:
        limit, var_dtype = None, var.dtype
Contributor

This logic would change to use fake_target_chunksize

@charles-turner-1
Author

I think most of the work left to do on this is just fixing the typing now...

@charles-turner-1 charles-turner-1 marked this pull request as ready for review August 8, 2025 06:36
    if no_op:
        return target_chunksize, data.dtype

    import numpy as np
Contributor

Let's move imports to the top if we can, and remove the no_op bit.

Author

Can only move numpy to the top; a module-level "from xarray.core.formatting import first_n_items" creates a circular import.
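
For reference, the shape of that constraint (the function body below is a placeholder, not the PR's implementation):

import numpy as np  # fine at module scope


def first_item(array):
    # A module-level `from xarray.core.formatting import first_n_items` would
    # create a circular import with xarray.namedarray.utils, so defer it here.
    from xarray.core.formatting import first_n_items

    return first_n_items(np.asarray(array), 1)[0]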

Author
@charles-turner-1 charles-turner-1 Aug 25, 2025

I've removed the no_op stuff - this has the effect of assuming uniform dtypes across all arrays going into dask now. All tests pass (locally), so it's probably not a big deal. I'm not even sure that numpy would allow mixed dtypes - it doesn't feel like it should - but it might be worth noting?

Contributor

What do you mean?

Author

Looked this up subsequently and I think I'm talking waffle - the if no_op was just in there so that the logic for getting the array item size (in bytes) from the first item was skipped when we didn't find both a cftime dtype in the array and a request for auto chunking.

Since arrays can only contain a single dtype, this shouldn't make any difference.

TL;DR: ignore my previous comment, it was nonsense.

@dcherian
Contributor

Sorry for the late review here. I left a few minor comments. Happy to merge after those are addressed

@charles-turner-1
Author

No worries, figured you must have been busy/on holiday.

I've addressed all those comments - thanks for all the help getting this off the ground!

@@ -83,8 +85,15 @@ def _get_chunk(var: Variable, chunks, chunkmanager: ChunkManagerEntrypoint):
        for dim, preferred_chunk_sizes in zip(dims, preferred_chunk_shape, strict=True)
    )

    limit = chunkmanager.get_auto_chunk_size()
    limit, var_dtype = fake_target_chunksize(var, limit)
Contributor

Don't we need to check whether var contains cftime objects?

Author
@charles-turner-1 charles-turner-1 Aug 25, 2025

This is related to what I was getting at yesterday with the no-op bit - reverting b5933ed would put that back in.

With that said, the logic doesn't change meaningfully depending on it. Currently, if we pass, e.g., a 300MiB limit in for a var which is an f64, we tell dask to compute the chunks based on those numbers. If we pass in an f32 with the same limit, it'll currently tell the dask chunking mechanism to compute chunks for an f64 with a 150MiB limit, which gets us the exact same chunk sizes back (based on my tests).
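
A worked version of those numbers: halving the byte limit while doubling the assumed itemsize leaves the element count per chunk unchanged, so dask picks the same chunk shapes either way.

f32_itemsize, f64_itemsize = 4, 8
limit_f32 = 300 * 2**20                               # 300MiB budget for float32 data
limit_f64 = limit_f32 * f32_itemsize // f64_itemsize  # presented to dask as 150MiB of float64
assert limit_f32 // f32_itemsize == limit_f64 // f64_itemsize  # same elements per chunk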

Actually, one of the side effects of the current implementation (no _contains_cftime_datetimes(var) check) is that it would let you chunk arbitrary object dtypes, not just cftime. Whether this is desirable or not would, I guess, depend on whether you'd expect people to put in arbitrarily/variably sized objects - if the size of the objects in an array can vary, then the chunk calculation could produce inappropriate chunks.

Author

I guess with the current implementation maybe_fake_target_chunksize would be a better name for the function; if we revert b5933ed, then fake_target_chunksize makes more sense again.
