`load_dataset` of size 40GB creates a cache of >720GB #7502

pietrolesci · 2025-04-07T16:52:34Z

Hi there,

I am trying to load a dataset from the Hugging Face Hub and split it into train and validation splits. When I do this with `load_dataset`, it exhausts my disk quota. So I tried manually downloading the parquet files from the Hub and loading them as follows:

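    import os

    from datasets import DatasetDict, ReadInstruction, load_dataset

    # local_dir, tok, cache_dir, and NUM_TRAIN are defined elsewhere in my script
    ds = DatasetDict(
        {
            # first NUM_TRAIN rows of the "train" split
            "train": load_dataset(
                "parquet",
                data_dir=f"{local_dir}/{tok}",
                cache_dir=cache_dir,
                num_proc=min(12, os.cpu_count()),  # type: ignore
                split=ReadInstruction("train", from_=0, to=NUM_TRAIN, unit="abs"),  # type: ignore
            ),
            # all remaining rows, from NUM_TRAIN to the end
            "validation": load_dataset(
                "parquet",
                data_dir=f"{local_dir}/{tok}",
                cache_dir=cache_dir,
                num_proc=min(12, os.cpu_count()),  # type: ignore
                split=ReadInstruction("train", from_=NUM_TRAIN, unit="abs"),  # type: ignore
            ),
        }
    )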

This still creates 720GB of cache, which is strange for a 40GB dataset. In addition, if I remove the folder with the raw parquet files (`f"{local_dir}/{tok}"` in this example), I am not able to load anything. So I am left wondering what this cache is doing. Am I missing something? Is there a solution to this problem?
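
For anyone who wants to compare the two folders on their own machine, a minimal sketch like the one below could be used to measure the raw parquet folder against the cache. The helpers `dir_size_bytes` and `format_gb` are hypothetical names introduced here, not part of `datasets`, and the sketch sums apparent file sizes, which can differ from what a disk quota actually counts.

```python
import os
from pathlib import Path


def dir_size_bytes(root: str) -> int:
    """Sum the apparent size of all regular files under root.

    Symlinked files are skipped so that linked data is not counted twice.
    Hypothetical helper, not part of the `datasets` library.
    """
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            if not path.is_symlink():
                total += path.stat().st_size
    return total


def format_gb(num_bytes: int) -> str:
    """Render a byte count in decimal gigabytes (1 GB = 1e9 bytes)."""
    return f"{num_bytes / 1e9:.1f} GB"
```

Calling `format_gb(dir_size_bytes(cache_dir))` and `format_gb(dir_size_bytes(f"{local_dir}/{tok}"))` after loading would give the two numbers to compare.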

Thanks a lot in advance for your help!

A related issue: huggingface/transformers#10204 (comment).

Python: 3.11.11
datasets: 3.5.0
