index error when num_shards > len(dataset) #7443

Open
eminorhan opened this issue Mar 10, 2025 · 1 comment


eminorhan commented Mar 10, 2025

In ds.push_to_hub() and ds.save_to_disk(), num_shards must be smaller than or equal to the number of rows in the dataset, but currently this is not checked anywhere inside these functions. Attempting to invoke these functions with num_shards > len(dataset) should raise an informative ValueError.

I frequently work with datasets that have few rows but where each row is quite large, so I run into this often: the function runs until the shard index in ds.shard(num_shards, index) goes out of bounds. Ideally, a ValueError should be raised before reaching this point (i.e. as soon as ds.push_to_hub() or ds.save_to_disk() is invoked with num_shards > len(dataset)).
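
To make the failure mode concrete, here is a minimal sketch of contiguous, near-equal sharding arithmetic in plain Python. The helper `contiguous_shard_bounds` is hypothetical and is only assumed to resemble what ds.shard() does internally; it is not taken from the library source. It shows that once num_shards exceeds the number of rows, the trailing shards are empty and start at an offset equal to the row count, which is not a valid row index.

```python
def contiguous_shard_bounds(num_rows: int, num_shards: int, index: int) -> range:
    """Row range covered by shard `index` under contiguous, near-equal splitting.

    Hypothetical helper for illustration; assumed, not confirmed, to mirror
    the arithmetic used by the library.
    """
    div, mod = divmod(num_rows, num_shards)
    # The first `mod` shards receive one extra row each.
    start = div * index + min(index, mod)
    end = start + div + (1 if index < mod else 0)
    return range(start, end)


# 3 rows split into 5 shards: shards 3 and 4 have nothing left to take.
bounds = [contiguous_shard_bounds(3, 5, i) for i in range(5)]
for i, rows in enumerate(bounds):
    print(f"shard {i}: start={rows.start}, rows={list(rows)}")
```

Under these assumptions, shards 0 to 2 each hold one row, while shards 3 and 4 are empty and begin at offset 3, one past the last valid row.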

It seems that adding something like:

if len(self) < num_shards:
    raise ValueError(f"num_shards ({num_shards}) must be smaller than or equal to the number of rows in the dataset ({len(self)}). Please either reduce num_shards or increase max_shard_size to make sure num_shards <= len(dataset).")

to the beginning of the definition of the ds.shard() function here would deal with this issue for both ds.push_to_hub() and ds.save_to_disk(). I'm not sure this is the best place to raise the ValueError, though; a more correct approach might be to write separate checks for ds.push_to_hub() and ds.save_to_disk(). I'd be happy to submit a PR if you think something along these lines would be acceptable.
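
For reference, the proposed check can be exercised on its own, outside the library. The sketch below is a standalone version; `check_num_shards` is a hypothetical name and takes the row count as a plain integer rather than reading `len(self)`, so it runs without the datasets package installed.

```python
def check_num_shards(num_rows: int, num_shards: int) -> None:
    """Raise early if more shards are requested than there are rows.

    Hypothetical standalone version of the check proposed above.
    """
    if num_rows < num_shards:
        raise ValueError(
            f"num_shards ({num_shards}) must be smaller than or equal to the "
            f"number of rows in the dataset ({num_rows}). Please either reduce "
            "num_shards or increase max_shard_size to make sure "
            "num_shards <= len(dataset)."
        )


check_num_shards(num_rows=3, num_shards=3)  # fine: one row per shard

try:
    check_num_shards(num_rows=3, num_shards=5)
except ValueError as err:
    print(f"rejected: {err}")
```

Called at the top of each entry point, a check like this would fail before any shard is written, rather than partway through the loop.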

@eminorhan
Author

Actually, looking at the code a bit more carefully, maybe an even better solution is to explicitly set num_shards=len(self) somewhere inside both push_to_hub() and save_to_disk() when these functions are invoked with num_shards > len(dataset).
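
A minimal sketch of this clamping alternative, again in plain Python: `clamp_num_shards` is a hypothetical helper, and the floor of one shard for an empty dataset is my own assumption, not something decided in this thread.

```python
def clamp_num_shards(num_rows: int, num_shards: int) -> int:
    """Cap the requested shard count at the number of rows.

    Hypothetical helper. The lower bound of 1 (so an empty dataset still
    yields a single shard) is an assumption made for this sketch.
    """
    return max(1, min(num_shards, num_rows))


print(clamp_num_shards(num_rows=3, num_shards=5))   # capped at the row count
print(clamp_num_shards(num_rows=10, num_shards=4))  # left unchanged
```

The trade-off compared with raising a ValueError is that clamping silently changes what the caller asked for, so a warning when the value is adjusted may be worth considering.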
