What is the canonical way to compress a Dataset? #7477

Open
eric-czech opened this issue Mar 25, 2025 · 4 comments

@eric-czech

eric-czech commented Mar 25, 2025

Given that Arrow is the preferred backend for a Dataset, what is a user supposed to do if they want concurrent reads, concurrent writes AND on-disk compression for a larger dataset?

Parquet would be the obvious answer, except that there is no native support for writing sharded Parquet datasets concurrently [1].

Am I missing something?

And if not, why isn't this the standard/default way that Datasets work, as it is in Xarray, Ray Data, Composer, etc.?
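For reference, the existing single-file export does handle compression; it's the sharding + concurrency part that seems to be missing. A minimal sketch, assuming to_parquet forwards extra keyword arguments to pyarrow's ParquetWriter:

from datasets import Dataset

ds = Dataset.from_dict({"a": list(range(1_000_000))})

# Single-file export: no sharding, no concurrent writes.
# Extra kwargs are assumed to be forwarded to pyarrow.parquet.ParquetWriter,
# which is where on-disk compression is configured.
ds.to_parquet("data.parquet", compression="zstd")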

@eric-czech
Author

eric-czech commented Apr 2, 2025

I saw this post by @lhoestq: https://discuss.huggingface.co/t/increased-arrow-table-size-by-factor-of-2/26561/4, which suggests that there is at least some internal code for writing sharded Parquet datasets (non-concurrently). This appears to be that code:

for index, shard in hf_tqdm(
    enumerate(shards),
    desc="Uploading the dataset shards",
    total=num_shards,
):
    shard_path_in_repo = f"{data_dir}/{split}-{index:05d}-of-{num_shards:05d}.parquet"
    buffer = BytesIO()
    shard.to_parquet(buffer)
    uploaded_size += buffer.tell()
    shard_addition = CommitOperationAdd(path_in_repo=shard_path_in_repo, path_or_fileobj=buffer)
    api.preupload_lfs_files(
        repo_id=repo_id,
        additions=[shard_addition],
        repo_type="dataset",
        revision=revision,
        create_pr=create_pr,
    )
    additions.append(shard_addition)

Is there any fundamental reason (e.g. race conditions) that this kind of operation couldn't exist as a utility or method on a Dataset with a num_proc argument? I'm not seeing any other issues that explicitly ask for this.
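To be concrete, something like this hypothetical API is what I have in mind (to_parquet_sharded does not exist in datasets today; it's shown only to illustrate the ask):

# Hypothetical method, shown only to illustrate the request.
ds.to_parquet_sharded("out/", num_shards=8, num_proc=8)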

@lhoestq
Member

lhoestq commented Apr 2, 2025

We simply haven't implemented a method to save as sharded parquet locally yet ^^'

Right now the only sharded Parquet export method is push_to_hub(), which writes to HF. But we can have a local one as well.

In the meantime the easiest way to export as sharded parquet locally is to .shard() and .to_parquet() (see code from my comment here)
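For example (a minimal sequential sketch; the dataset name and output paths are placeholders):

import os
from datasets import load_dataset

ds = load_dataset("username/my_dataset", split="train")  # placeholder dataset

num_shards = 8
os.makedirs("out", exist_ok=True)
for index in range(num_shards):
    # Each shard is a contiguous slice of the dataset, written as its own Parquet file.
    shard = ds.shard(num_shards=num_shards, index=index, contiguous=True)
    shard.to_parquet(f"out/train-{index:05d}-of-{num_shards:05d}.parquet")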

@eric-czech
Author

In the meantime the easiest way to export as sharded parquet locally is to .shard() and .to_parquet()

Makes sense, BUT how can it be done concurrently? I could of course use multiprocessing myself or a dozen other libraries for parallelizing single-node/local operations like that.

What I'm asking, though, is: what is the most canonical way to do this for datasets specifically? I.e., what is least likely to cause pickling or other issues because it is used frequently internally by datasets and likely already covers a lot of library-native edge cases?

@lhoestq
Member

lhoestq commented Apr 3, 2025

Everything in datasets is picklable :) and even better: since the data are memory-mapped from disk, pickling in one process and unpickling in another doesn't do any copy - it instantaneously reloads the memory map.

So feel free to use the library you prefer to parallelize your operations.

(it's another story in distributed setups though, because in that case you either need to copy and send the data or set up a distributed filesystem)
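For example, a minimal sketch that parallelizes the .shard() + .to_parquet() approach with the standard library (not an official datasets utility; the dataset name and paths are placeholders):

import os
from concurrent.futures import ProcessPoolExecutor

from datasets import load_dataset

NUM_SHARDS = 8

def write_shard(args):
    ds, index = args
    # `ds` is pickled into the worker; since the Arrow data is memory-mapped,
    # unpickling reloads the memory map rather than copying the data.
    shard = ds.shard(num_shards=NUM_SHARDS, index=index, contiguous=True)
    path = f"out/train-{index:05d}-of-{NUM_SHARDS:05d}.parquet"
    shard.to_parquet(path)
    return path

if __name__ == "__main__":
    ds = load_dataset("username/my_dataset", split="train")  # placeholder dataset
    os.makedirs("out", exist_ok=True)
    with ProcessPoolExecutor(max_workers=NUM_SHARDS) as pool:
        print(list(pool.map(write_shard, [(ds, i) for i in range(NUM_SHARDS)])))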
