What is the canonical way to compress a Dataset? #7477

Open
eric-czech opened this issue Mar 25, 2025 · 4 comments

@eric-czech

eric-czech commented Mar 25, 2025

Given that Arrow is the preferred backend for a Dataset, what is a user supposed to do if they want concurrent reads, concurrent writes AND on-disk compression for a larger dataset?

Parquet would be the obvious answer, except that there is no native support for writing sharded Parquet datasets concurrently [1].

Am I missing something?

And if not, why isn't this the standard/default way that Datasets work, as it is in Xarray, Ray Data, Composer, etc.?
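For reference, the existing single-file export does handle compression; it's the sharding + concurrency part that seems to be missing. A minimal sketch, assuming to_parquet forwards extra keyword arguments to pyarrow's ParquetWriter:

from datasets import Dataset

ds = Dataset.from_dict({"a": list(range(1_000_000))})

# Single-file export: no sharding, no concurrent writes.
# Extra kwargs are assumed to be forwarded to pyarrow.parquet.ParquetWriter,
# which is where on-disk compression is configured.
ds.to_parquet("data.parquet", compression="zstd")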

@eric-czech
Author

eric-czech commented Apr 2, 2025

I saw this post by @lhoestq: https://discuss.huggingface.co/t/increased-arrow-table-size-by-factor-of-2/26561/4, which suggests that there is at least some internal code for writing sharded Parquet datasets (non-concurrently). This appears to be that code:

for index, shard in hf_tqdm(
    enumerate(shards),
    desc="Uploading the dataset shards",
    total=num_shards,
):
    shard_path_in_repo = f"{data_dir}/{split}-{index:05d}-of-{num_shards:05d}.parquet"
    buffer = BytesIO()
    shard.to_parquet(buffer)
    uploaded_size += buffer.tell()
    shard_addition = CommitOperationAdd(path_in_repo=shard_path_in_repo, path_or_fileobj=buffer)
    api.preupload_lfs_files(
        repo_id=repo_id,
        additions=[shard_addition],
        repo_type="dataset",
        revision=revision,
        create_pr=create_pr,
    )
    additions.append(shard_addition)

Is there any fundamental reason (e.g. race conditions) that this kind of operation couldn't exist as a utility or method on a Dataset with a num_proc argument? I'm not seeing any other issues that explicitly ask for this.
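To be concrete, something like this hypothetical API is what I have in mind (to_parquet_sharded does not exist in datasets today; it's shown only to illustrate the ask):

# Hypothetical method, shown only to illustrate the request.
ds.to_parquet_sharded("out/", num_shards=8, num_proc=8)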

@lhoestq
Member

lhoestq commented Apr 2, 2025

We simply haven't implemented a method to save as sharded parquet locally yet ^^'

Right now the only sharded Parquet export method is push_to_hub(), which writes to HF. But we can have a local one as well.

In the meantime the easiest way to export as sharded parquet locally is to .shard() and .to_parquet() (see code from my comment here)
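For example (a minimal sequential sketch; the dataset name and output paths are placeholders):

import os
from datasets import load_dataset

ds = load_dataset("username/my_dataset", split="train")  # placeholder dataset

num_shards = 8
os.makedirs("out", exist_ok=True)
for index in range(num_shards):
    # Each shard is a contiguous slice of the dataset, written as its own Parquet file.
    shard = ds.shard(num_shards=num_shards, index=index, contiguous=True)
    shard.to_parquet(f"out/train-{index:05d}-of-{num_shards:05d}.parquet")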

@eric-czech
Author

In the meantime the easiest way to export as sharded parquet locally is to .shard() and .to_parquet()

Makes sense, BUT how can it be done concurrently? I could of course use multiprocessing myself or a dozen other libraries for parallelizing single-node/local operations like that.

What I'm asking, though, is: what is the most canonical way to do this for datasets specifically? I.e., what is least likely to cause pickling or other issues because it is used frequently internally by datasets and likely already covers a lot of library-native edge cases?

@lhoestq
Member

lhoestq commented Apr 3, 2025

Everything in datasets is picklable :) and even better: since the data are memory-mapped from disk, pickling in one process and unpickling in another doesn't do any copy - it instantaneously reloads the memory map.

So feel free to use the library you prefer to parallelize your operations.

(it's another story in distributed setups though, because in that case you either need to copy and send the data or set up a distributed filesystem)
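For example, a minimal sketch that parallelizes the .shard() + .to_parquet() approach with the standard library (not an official datasets utility; the dataset name and paths are placeholders):

import os
from concurrent.futures import ProcessPoolExecutor

from datasets import load_dataset

NUM_SHARDS = 8

def write_shard(args):
    ds, index = args
    # `ds` is pickled into the worker; since the Arrow data is memory-mapped,
    # unpickling reloads the memory map rather than copying the data.
    shard = ds.shard(num_shards=NUM_SHARDS, index=index, contiguous=True)
    path = f"out/train-{index:05d}-of-{NUM_SHARDS:05d}.parquet"
    shard.to_parquet(path)
    return path

if __name__ == "__main__":
    ds = load_dataset("username/my_dataset", split="train")  # placeholder dataset
    os.makedirs("out", exist_ok=True)
    with ProcessPoolExecutor(max_workers=NUM_SHARDS) as pool:
        print(list(pool.map(write_shard, [(ds, i) for i in range(NUM_SHARDS)])))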
