You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
I am trying to load a dataset from the Hugging Face Hub and split it into train and validation splits. Somehow, when I try to do it with load_dataset, it exhausts my disk quota. So, I tried manually downloading the parquet files from the hub and loading them as follows:
which still strangely creates 720GB of cache. In addition, if I remove the raw parquet file folder (f"{local_dir}/{tok}" in this example), I am not able to load anything. So, I am left wondering what this cache is doing. Am I missing something? Is there a solution to this problem?
Hi there,
I am trying to load a dataset from the Hugging Face Hub and split it into train and validation splits. Somehow, when I try to do it with
load_dataset
, it exhausts my disk quota. So, I tried manually downloading the parquet files from the hub and loading them as follows:which still strangely creates 720GB of cache. In addition, if I remove the raw parquet file folder (
f"{local_dir}/{tok}"
in this example), I am not able to load anything. So, I am left wondering what this cache is doing. Am I missing something? Is there a solution to this problem?Thanks a lot in advance for your help!
A related issue: huggingface/transformers#10204 (comment).
Python: 3.11.11
datasets: 3.5.0
The text was updated successfully, but these errors were encountered: