Add chunks='auto' support for cftime datasets #10527
Conversation
Thank you for opening this pull request! It may take us a few days to respond here, so thank you for being patient.
for more information, see https://pre-commit.ci
Would these changes also work for cf timedeltas, or are they going to still cause problems?

If you can find something that's specifically a cftimedelta and run the …
…1/xarray into autochunk-cftime
…pect disk chunks sensibly & this should be ready to go, I think
I did some prodding around yesterday and I realised this won't let us do something like

```python
import xarray as xr

cftime_datafile = "/path/to/file.nc"
xr.open_dataset(cftime_datafile, chunks='auto')
```

yet, only stuff along the lines of

```python
import xarray as xr

cftime_datafile = "/path/to/file.nc"
ds = xr.open_dataset(cftime_datafile, chunks=-1)
ds = ds.chunk('auto')
```

I think implementing the former is going to be a bit harder, but I'm starting to clock the code structure a bit more now so I'll have a decent crack.
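For context, a minimal illustration of why this needs special handling at all (`FakeCFTime` is a hypothetical stand-in for a `cftime` object, to keep the snippet dependency-free): decoded cftime datetimes live in object-dtype arrays, and an object dtype's reported `itemsize` is just the pointer size, not the size of the objects themselves, so a byte-budget calculation cannot use it directly.

```python
import numpy as np

class FakeCFTime:
    """Stand-in for a cftime datetime object (illustrative only)."""
    def __init__(self, y, m, d):
        self.y, self.m, self.d = y, m, d

times = np.array([FakeCFTime(2000, 1, d) for d in (1, 2, 3)], dtype=object)
print(times.dtype)           # object
print(times.dtype.itemsize)  # pointer size (typically 8 on 64-bit builds),
                             # not the real per-object memory footprint
```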
Why so? Are we sending …
Yup, this is the call stack:

```
----> 3 xr.open_dataset(
      4     "/Users/u1166368/xarray/tos_Omon_CESM2-WACCM_historical_r2i1p1f1_gr_185001-201412.nc", chunks="auto"

/Users/u1166368/xarray/xarray/backends/api.py(721)open_dataset()
    720 )
--> 721 ds = _dataset_from_backend_dataset(
    722     backend_ds,

/Users/u1166368/xarray/xarray/backends/api.py(418)_dataset_from_backend_dataset()
    417 if chunks is not None:
--> 418     ds = _chunk_ds(
    419         ds,

/Users/u1166368/xarray/xarray/backends/api.py(368)_chunk_ds()
    367 for name, var in backend_ds.variables.items():
--> 368     var_chunks = _get_chunk(var, chunks, chunkmanager)
    369     variables[name] = _maybe_chunk(

/Users/u1166368/xarray/xarray/structure/chunks.py(102)_get_chunk()
    101
--> 102 chunk_shape = chunkmanager.normalize_chunks(
    103     chunk_shape, shape=shape, dtype=var.dtype, previous_chunks=preferred_chunk_shape

> /Users/u1166368/xarray/xarray/namedarray/daskmanager.py(60)normalize_chunks()
```

I've fixed it in the latest commit, but I think the implementation leaves a lot to be desired too. Do I want to refactor to move the changes in …? Once I've got the structure there cleaned up, I'll work on replacing the …
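The failure at the bottom of that stack can be reproduced directly (a minimal sketch; the shape is arbitrary): dask's `normalize_chunks` refuses `"auto"` for object dtypes because it cannot estimate the size in bytes of object elements.

```python
import numpy as np
from dask.array.core import normalize_chunks

# "auto" needs a bytes-per-element estimate; object dtype has none
try:
    normalize_chunks("auto", shape=(1980, 180, 360), dtype=np.dtype(object))
except NotImplementedError as err:
    print("raised:", type(err).__name__)
```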
xarray/structure/chunks.py
Outdated

```python
from xarray.namedarray.utils import build_chunkspec

target_chunksize = parse_bytes(dask_config.get("array.chunk-size"))
```
How about adding `get_auto_chunk_size` to the `ChunkManager` class, and putting the dask-specific stuff in the `DaskManager`?

cc @TomNicholas

I guess one bit that's confusing here is that the code path for backends and normal variables is different? So let's add a test that reads from disk, and one that works with a DataArray constructed in memory.
```python
    cubed.Array.rechunk
    """

    if _contains_cftime_datetimes(data):
```
I guess this can be deleted
Had a play and I don't think I can fully get rid of it; I've reused as much of the abstracted logic as possible though.
xarray/namedarray/daskmanager.py
Outdated

```python
def get_auto_chunk_size(self, var: Variable) -> tuple[int, _DType]:
    from dask import config as dask_config
    from dask.utils import parse_bytes

    from xarray.namedarray.utils import fake_target_chunksize

    target_chunksize = parse_bytes(dask_config.get("array.chunk-size"))
    return fake_target_chunksize(var, target_chunksize=target_chunksize)
```
Suggested change:

```diff
-def get_auto_chunk_size(self, var: Variable) -> tuple[int, _DType]:
-    from dask import config as dask_config
-    from dask.utils import parse_bytes
-    from xarray.namedarray.utils import fake_target_chunksize
-    target_chunksize = parse_bytes(dask_config.get("array.chunk-size"))
-    return fake_target_chunksize(var, target_chunksize=target_chunksize)
+def get_auto_chunk_size(self) -> int:
+    from dask import config as dask_config
+    from dask.utils import parse_bytes
+    return parse_bytes(dask_config.get("array.chunk-size"))
```
Only this much is dask-specific, so that's what the DaskManager should be responsible for.
xarray/structure/chunks.py
Outdated

```python
if _contains_cftime_datetimes(var) and auto_chunks:
    limit, var_dtype = chunkmanager.get_auto_chunk_size(var)
else:
    limit, var_dtype = None, var.dtype
```
This logic would change to use `fake_target_chunksize`.
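The core trick behind a helper like this can be sketched as follows (my illustration of the idea only; the PR's `fake_target_chunksize` takes a Variable, whereas this toy version takes a measured per-item size in bytes): shrink the byte budget in proportion to the real item size, then report `float64` to dask, so the element count dask derives from `scaled_limit / 8` matches what the true per-item size allows.

```python
import numpy as np

F64 = np.dtype("float64")

def fake_target_chunksize(itemsize_bytes: int, target_chunksize: int) -> tuple[int, np.dtype]:
    # Dask budgets scaled / F64.itemsize elements per chunk for float64;
    # scaling by F64.itemsize / itemsize_bytes makes that equal
    # target_chunksize / itemsize_bytes, the element count the real
    # per-item size allows.
    scaled = int(target_chunksize * F64.itemsize / itemsize_bytes)
    return scaled, F64

# e.g. 64-byte objects under a 256 MiB budget
limit, dtype = fake_target_chunksize(64, 256 * 2**20)
print(limit // F64.itemsize == (256 * 2**20) // 64)  # same element budget
```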
I think most of the work left to do on this is just fixing the typing now...
xarray/namedarray/utils.py
Outdated

```python
if no_op:
    return target_chunksize, data.dtype

import numpy as np
```
Let's move imports to the top if we can, and remove the `no_op` bit.
Can only move numpy to the top; moving `from xarray.core.formatting import first_n_items` creates a circular import.
I've removed the `no_op` stuff. This has the effect of assuming uniform dtypes across all arrays going into dask now. All tests pass (locally) so it's probably not a big deal; I'm not even sure numpy would allow mixed dtypes, it doesn't feel like it should, but it might be worth noting?
What do you mean?
Looked this up subsequently and I think I'm talking waffle. The `if no_op` was just in there so that the logic for getting the array item size (in bytes) from the first item was skipped if we didn't find a cftime dtype in the array alongside a request for auto chunking.

Since arrays can only contain a single dtype, this shouldn't make any difference.

TL;DR: ignore my previous comment, it was nonsense.
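(For anyone following along: NumPy does indeed enforce a single dtype per array; mixed inputs are coerced to a common dtype rather than stored heterogeneously.)

```python
import numpy as np

print(np.array([1, 2.5]).dtype)  # ints upcast alongside floats -> float64
print(np.array([1, "x"]).dtype)  # coerced to a string dtype, not "mixed"
```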
Sorry for the late review here. I left a few minor comments. Happy to merge after those are addressed.

No worries, figured you must have been busy/on holiday. I've addressed all those comments - thanks for all the help getting off the ground with this!
```python
@@ -83,8 +85,15 @@ def _get_chunk(var: Variable, chunks, chunkmanager: ChunkManagerEntrypoint):
        for dim, preferred_chunk_sizes in zip(dims, preferred_chunk_shape, strict=True)
    )

    limit = chunkmanager.get_auto_chunk_size()
    limit, var_dtype = fake_target_chunksize(var, limit)
```
Don't we need to check if `var` contains cftime objects?
This is related to what I was getting at yesterday with the no-op bit; reverting b5933ed would put that back in.

With that said, the logic doesn't change meaningfully depending on it. Currently, if we put e.g. a 300 MiB limit in for a var which is an f64, we tell dask to compute the chunks based on those numbers. If we put in an f32 with the same limit, it'll currently tell the dask chunking mechanism to compute chunks for an f64 with a 150 MiB limit, which gets us the exact same chunk sizes back (based on my tests).

Actually, one of the side effects of the current implementation (no `_contains_cftime_datetimes(var)` check) is that it would actually let you chunk arbitrary object dtypes, not just cftime. Whether this is desirable or not I guess would depend on whether you'd expect people to put in arbitrarily/variably sized objects: if the size of objects in an array can vary, then the chunk calculation could produce inappropriate chunks.
I guess with the current implementation `maybe_fake_target_chunksize` would be a better name for the function; if we revert b5933ed then `fake_target_chunksize` makes more sense again.