Problems with local dataset after upgrade from 3.3.2 to 3.4.0 #7455

Open
andjoer opened this issue Mar 15, 2025 · 1 comment

andjoer commented Mar 15, 2025

Describe the bug

After upgrading from datasets 3.3.2 to 3.4.0 yesterday, I can no longer open a locally saved dataset that was created with an older datasets version.

The traceback is:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/datasets/packaged_modules/arrow/arrow.py", line 67, in _generate_tables
    batches = pa.ipc.open_stream(f)
  File "/usr/local/lib/python3.10/dist-packages/pyarrow/ipc.py", line 190, in open_stream
    return RecordBatchStreamReader(source, options=options,
  File "/usr/local/lib/python3.10/dist-packages/pyarrow/ipc.py", line 52, in __init__
    self._open(source, options=options, memory_pool=memory_pool)
  File "pyarrow/ipc.pxi", line 1006, in pyarrow.lib._RecordBatchStreamReader._open
  File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Expected to read 538970747 metadata bytes, but only read 2126

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/datasets/builder.py", line 1855, in _prepare_split_single
    for _, table in generator:
  File "/usr/local/lib/python3.10/dist-packages/datasets/packaged_modules/arrow/arrow.py", line 69, in _generate_tables
    reader = pa.ipc.open_file(f)
  File "/usr/local/lib/python3.10/dist-packages/pyarrow/ipc.py", line 234, in open_file
    return RecordBatchFileReader(
  File "/usr/local/lib/python3.10/dist-packages/pyarrow/ipc.py", line 110, in __init__
    self._open(source, footer_offset=footer_offset,
  File "pyarrow/ipc.pxi", line 1090, in pyarrow.lib._RecordBatchFileReader._open
  File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Not an Arrow file
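
One possible reading of the first error, offered as a guess rather than a confirmed diagnosis: the Arrow IPC stream format prefixes each message with a little-endian 32-bit length, so the odd value 538970747 may simply be the first four bytes of a non-Arrow file interpreted as a number. The sketch below converts that number back into bytes. The variable names are illustrative only.

```python
import struct

# Length reported in the ArrowInvalid message from the traceback.
reported_length = 538970747

# Re-encode the number as a little-endian 32-bit integer to recover
# the four bytes the stream reader would have consumed as a length prefix.
leading_bytes = struct.pack("<i", reported_length)

print(leading_bytes)  # b'{\n  '
```

The bytes decode to an opening brace, a newline, and two spaces, which is how an indented JSON file begins. If that reading is right, the loader was handed a small JSON file from the dataset folder instead of an Arrow file. Which file that was is not visible in the traceback.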

Steps to reproduce the bug

Load a dataset from a local folder with:

dataset = load_dataset(
    args.train_data_dir,
    cache_dir=args.cache_dir,
)

This is how the training script for SD3 ControlNet loads its data, for example.

This is the minimal script to test it:

from datasets import load_dataset

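def main():
    # "local_dataset" is the folder containing the locally saved dataset
    dataset = load_dataset(
        "local_dataset",
    )
    print(dataset)
    # Access the first record to confirm the data is actually readable
    print("Sample data:", dataset["train"][0])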

if __name__ == "__main__":
    main()
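
To narrow down which file in the folder trips the loader, a rough header check like the one below might help. It is only a sketch: the `classify_header` and `scan` helpers are hypothetical names, it uses the standard library only, and it assumes the Arrow file format starts with the `ARROW1` magic bytes and the current stream format starts with the `0xFFFFFFFF` continuation marker.

```python
from pathlib import Path

ARROW_FILE_MAGIC = b"ARROW1"               # Arrow IPC file (random access) format
STREAM_CONTINUATION = b"\xff\xff\xff\xff"  # current Arrow IPC stream format


def classify_header(head: bytes) -> str:
    """Guess the file type from its first few bytes."""
    if head.startswith(ARROW_FILE_MAGIC):
        return "arrow-file"
    if head.startswith(STREAM_CONTINUATION):
        return "arrow-stream"
    if head.lstrip()[:1] in (b"{", b"["):
        return "json-like"
    return "unknown"


def scan(folder: str) -> dict:
    """Map every file under `folder` to its guessed type."""
    root = Path(folder)
    results = {}
    if not root.is_dir():
        return results
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        with path.open("rb") as f:
            results[str(path)] = classify_header(f.read(8))
    return results


if __name__ == "__main__":
    # "local_dataset" is the same folder name used in the minimal script.
    for name, kind in scan("local_dataset").items():
        print(f"{kind:12}  {name}")
```

Any file reported as `json-like` or `unknown` would be a candidate for the one that produced "Not an Arrow file".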

Expected behavior

Loading the dataset should work in 3.4.0 as it did in 3.3.2.

Environment info

  • datasets version: 3.4.0
  • Platform: Linux-5.15.0-75-generic-x86_64-with-glibc2.35
  • Python version: 3.10.12
  • huggingface_hub version: 0.29.3
  • PyArrow version: 19.0.1
  • Pandas version: 2.2.3
  • fsspec version: 2024.12.0

lhoestq commented Mar 17, 2025

Hi! I just released 3.4.1 with a fix. Let me know if it's working now!
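
For anyone checking whether their environment already has that release, here is a small sketch using only the standard library. The helper names `release_tuple` and `is_at_least` are made up for illustration, and the parsing only looks at the leading `major.minor.patch` part of the version string.

```python
import re
from importlib.metadata import PackageNotFoundError, version


def release_tuple(v: str) -> tuple:
    """Extract (major, minor, patch) from a version string such as '3.4.1'."""
    m = re.match(r"(\d+)\.(\d+)\.(\d+)", v)
    if m is None:
        raise ValueError(f"unrecognized version string: {v!r}")
    return tuple(int(part) for part in m.groups())


def is_at_least(v: str, minimum=(3, 4, 1)) -> bool:
    """True if version string `v` is the given release or newer."""
    return release_tuple(v) >= minimum


if __name__ == "__main__":
    try:
        installed = version("datasets")
    except PackageNotFoundError:
        print("datasets is not installed in this environment")
    else:
        status = "is 3.4.1 or newer" if is_at_least(installed) else "is older than 3.4.1"
        print(f"datasets {installed} {status}")
```

If the reported version is 3.4.0, upgrading the package and rerunning the minimal script above should show whether the fix resolves the problem.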
