HF_DATASETS_CACHE ignored? #7480

Open
stephenroller opened this issue Mar 26, 2025 · 3 comments

@stephenroller

Describe the bug

I'm struggling to get things to respect HF_DATASETS_CACHE.

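Rationale: my home directory is on NFS, so downloading there is expensive and slow, and it eats into a limited quota that local disk does not have. Instead of honoring HF_DATASETS_CACHE, the download seems to go mostly to the hub cache (HF_HUB_CACHE, which defaults to ~/.cache/huggingface/hub).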

Current version: 3.2.1dev. In the process of testing 3.4.0

Steps to reproduce the bug

[Currently writing using datasets 3.2.1dev. Will follow up with 3.4.0 results]

dump.py:

from datasets import load_dataset
dataset = load_dataset("HuggingFaceFW/fineweb", name="sample-100BT", split="train")

Repro steps

# ensure no cache
$ mv ~/.cache/huggingface ~/.cache/huggingface.bak

$ export HF_DATASETS_CACHE=/tmp/roller/datasets
$ rm -rf ${HF_DATASETS_CACHE}
$ env | grep HF | grep -v TOKEN
HF_DATASETS_CACHE=/tmp/roller/datasets

$ python dump.py
# (omitted for brevity)

# (while downloading) 
$ du -hcs ~/.cache/huggingface/hub
18G     hub
18G     total

# (after downloading)
$ du -hcs ~/.cache/huggingface/hub

It's a shame because datasets supports s3 (which I could really use right now) but hub does not.

Expected behavior

  • ~/.cache/huggingface/hub stays empty
  • /tmp/roller/datasets becomes full of stuff
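
To check the two expectations above from Python rather than with du, a minimal sketch could sum the file sizes under each directory. `dir_size_bytes` is a hypothetical helper, not part of datasets or huggingface_hub. It sums file lengths, so the numbers will not match du's block-based totals exactly, and it skips symlinks on the assumption that the hub cache links snapshot files to blobs stored elsewhere in the same tree.

```python
import os


def dir_size_bytes(path):
    """Sum the sizes of regular files under path (0 if path does not exist)."""
    total = 0
    # os.walk silently yields nothing for a missing directory.
    for dirpath, _dirnames, filenames in os.walk(path):
        for name in filenames:
            full = os.path.join(dirpath, name)
            if os.path.islink(full):
                # Skip symlinks so linked files are not counted twice.
                continue
            try:
                total += os.path.getsize(full)
            except OSError:
                # A file may be renamed or removed while a download is running.
                pass
    return total


if __name__ == "__main__":
    for label in ("~/.cache/huggingface/hub", "/tmp/roller/datasets"):
        size = dir_size_bytes(os.path.expanduser(label))
        print(f"{label}: {size / 1e9:.2f} GB")
```

Running this while dump.py is downloading should show which of the two directories is actually growing.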

Environment info

[Currently writing using datasets 3.2.1dev. Will follow up with 3.4.0 results]

@stephenroller
Author

FWIW, it does eventually write to /tmp/roller/datasets when generating the final version.
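
A possible workaround sketch, not a confirmed fix: it assumes the raw files are fetched through huggingface_hub into its own cache (which would match the growth of ~/.cache/huggingface/hub seen in the repro) and that both libraries read their cache variables at import time. Under those assumptions, pointing HF_HUB_CACHE at local disk alongside HF_DATASETS_CACHE, before anything imports datasets, should keep both the downloads and the generated dataset off NFS. `point_hf_caches_at` is a hypothetical helper name, not part of either library.

```python
import os
from pathlib import Path


def point_hf_caches_at(root):
    """Set the hub and datasets cache variables under root; return what was set.

    Must run before `datasets` or `huggingface_hub` is imported, on the
    assumption that both read these variables once at import time.
    """
    root = Path(root)
    env = {
        "HF_HUB_CACHE": str(root / "hub"),            # raw downloaded files
        "HF_DATASETS_CACHE": str(root / "datasets"),  # generated dataset files
    }
    os.environ.update(env)
    return env


# Usage (commented out because it starts a very large download):
# point_hf_caches_at("/tmp/roller")
# from datasets import load_dataset
# dataset = load_dataset("HuggingFaceFW/fineweb", name="sample-100BT", split="train")
```

HF_HOME is deliberately left alone in this sketch, since other state such as a stored login token may live under it. The same effect can be had from the shell by exporting both variables before running dump.py.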

@Harry-Yang0518

Hey, I’d love to work on this issue, but I'm a beginner. Can I work on it with you?

@Harry-Yang0518

Hi @lhoestq,
I'd like to look into this issue but I'm still learning. Could you share any quick pointers on the HF_DATASETS_CACHE behavior here? Thanks!
