What is the canonical way to compress a Dataset? #7477
I saw this post by @lhoestq (https://discuss.huggingface.co/t/increased-arrow-table-size-by-factor-of-2/26561/4) suggesting that there is at least some internal code for writing sharded parquet datasets, albeit non-concurrently. This appears to be that code: `src/datasets/arrow_dataset.py`, lines 5380 to 5397 at commit 94ccd1b.
Is there any fundamental reason (e.g. race conditions) that this kind of operation couldn't exist as a utility or method on a `Dataset`?
We simply haven't implemented a method to save as sharded parquet locally yet ^^' Right now the only sharded parquet export method is `push_to_hub()`. In the meantime the easiest way to export as sharded parquet locally is to `.shard()` the dataset and call `.to_parquet()` on each shard.
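A minimal sketch of that workaround, done sequentially; the shard count and file-naming scheme here are illustrative choices, not library conventions:

```python
# Split with .shard() and write each piece with .to_parquet().
from datasets import Dataset

ds = Dataset.from_dict({"text": ["a"] * 10_000})  # stand-in for a real dataset
num_shards = 4

for index in range(num_shards):
    shard = ds.shard(num_shards=num_shards, index=index, contiguous=True)
    shard.to_parquet(f"data-{index:05d}-of-{num_shards:05d}.parquet")
```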
Makes sense, BUT how can it be done concurrently? I could of course use multiprocessing myself or a dozen other libraries for parallelizing single-node/local operations like that. What I'm asking though is, what is the way to do this that is most canonical for `datasets`?
Everything in `datasets` is picklable and backed by memory-mapped Arrow files, so feel free to use the library you prefer to parallelize your operations. (It's another story in distributed setups though, because in that case you either need to copy and send the data or set up a distributed filesystem.)
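A sketch of one way to parallelize the export locally along these lines, assuming a memory-mapped (and therefore cheaply picklable) dataset; the worker count, shard count, and `out/` layout are arbitrary illustrative choices:

```python
import os
from concurrent.futures import ProcessPoolExecutor

from datasets import Dataset


def write_shard(args):
    # Each worker receives the (memory-mapped) dataset, slices out its
    # shard, and writes that shard to its own parquet file.
    ds, num_shards, index = args
    shard = ds.shard(num_shards=num_shards, index=index, contiguous=True)
    shard.to_parquet(f"out/data-{index:05d}-of-{num_shards:05d}.parquet")


if __name__ == "__main__":
    ds = Dataset.from_dict({"text": ["a"] * 10_000})  # stand-in dataset
    num_shards = 8
    os.makedirs("out", exist_ok=True)
    with ProcessPoolExecutor(max_workers=4) as pool:
        # list() drains the iterator so any worker exception is raised here
        list(pool.map(write_shard, [(ds, num_shards, i) for i in range(num_shards)]))
```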
Given that Arrow is the preferred backend for a Dataset, what is a user supposed to do if they want concurrent reads, concurrent writes AND on-disk compression for a larger dataset?
Parquet would be the obvious answer, except that there is no native support for writing sharded Parquet datasets concurrently [1].
Am I missing something?
And if so, why is this not the standard/default way that `Dataset`s work, as they do in Xarray, Ray Data, Composer, etc.?