index error when num_shards > len(dataset) #7443

eminorhan · 2025-03-10T22:40:59Z

In ds.push_to_hub() and ds.save_to_disk(), num_shards must be smaller than or equal to the number of rows in the dataset, but currently this is not checked anywhere inside these functions. Attempting to invoke these functions with num_shards > len(dataset) should raise an informative ValueError.

I frequently work with datasets that have few rows where each row is pretty large, so I often hit this issue: the function runs until the shard index in ds.shard(num_shards, index) goes out of bounds. Ideally, a ValueError should be raised before reaching this point (i.e. as soon as ds.push_to_hub() or ds.save_to_disk() is invoked with num_shards > len(dataset)).

It seems that adding something like:

if len(self) < num_shards:
    raise ValueError(f"num_shards ({num_shards}) must be smaller than or equal to the number of rows in the dataset ({len(self)}). Please either reduce num_shards or increase max_shard_size to make sure num_shards <= len(dataset).")

to the beginning of the definition of the ds.shard() function would deal with this issue for both ds.push_to_hub() and ds.save_to_disk(), but I'm not sure this is the best place to raise the ValueError. A more correct way might be to write separate checks for ds.push_to_hub() and ds.save_to_disk(). I'd be happy to submit a PR if you think something along these lines would be acceptable.
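To make the proposed check concrete, here is a minimal standalone sketch of the fail-fast validation. The helper name check_num_shards and its num_rows argument are hypothetical and not part of the datasets library; in the library the same condition would be written against len(self), as in the snippet above.

```python
def check_num_shards(num_rows: int, num_shards: int) -> None:
    """Hypothetical helper: raise early if more shards than rows are requested."""
    if num_rows < num_shards:
        raise ValueError(
            f"num_shards ({num_shards}) must be smaller than or equal to the "
            f"number of rows in the dataset ({num_rows}). Please either reduce "
            "num_shards or increase max_shard_size."
        )


# A 3-row dataset split into 3 shards passes silently.
check_num_shards(num_rows=3, num_shards=3)

# Requesting 5 shards for 3 rows fails before any sharding work starts.
try:
    check_num_shards(num_rows=3, num_shards=5)
except ValueError as err:
    print(f"rejected: {err}")
```

Calling something like this at the top of push_to_hub() and save_to_disk() (rather than inside shard()) would surface the error at the call the user actually made, which is the trade-off I'm unsure about.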

eminorhan · 2025-03-10T23:43:07Z

Actually, looking at the code a bit more carefully, maybe an even better solution is to explicitly set num_shards=len(self) somewhere inside both push_to_hub() and save_to_disk() when these functions are invoked with num_shards > len(dataset).
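As a rough sketch of this alternative, the requested shard count could be capped at the number of rows instead of raising. The function name clamp_num_shards is hypothetical, and the sketch deliberately ignores the empty-dataset case, which would need its own decision.

```python
def clamp_num_shards(num_rows: int, num_shards: int) -> int:
    """Hypothetical helper: never use more shards than there are rows.

    Assumes num_rows >= 1; an empty dataset is not handled here.
    """
    return min(num_shards, num_rows)


# 5 shards requested for a 3-row dataset: fall back to one row per shard.
print(clamp_num_shards(num_rows=3, num_shards=5))

# A request that already fits is left unchanged.
print(clamp_num_shards(num_rows=10, num_shards=4))
```

The downside compared to raising a ValueError is that the user silently gets fewer shards than requested, so a warning might be worth emitting when the value is changed.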
