Features.from_arrow_schema is destructive #7479

Open
BramVanroy opened this issue Mar 26, 2025 · 0 comments · May be fixed by #7482
BramVanroy commented Mar 26, 2025

Describe the bug

I came across this, perhaps niche, bug where Features does not (and cannot) account for pyarrow's nullable=False option on Fields. Interestingly, for regular "flat" fields this does not necessarily lead to conflicts, but when a non-nullable field sits inside a struct, an incompatibility arises.

It's not easy to explain in words, so I hope the minimal example below makes it clear.

Note that I suggest a solution in the code comments: simply let Dataset.to_parquet accept a schema argument which, when provided, overrides the default ds.features.arrow_schema.

Steps to reproduce the bug

import os
from datasets import Dataset, Features

import pyarrow as pa
import pyarrow.parquet as pq

# HF datasets is destructive when you call Features.from_arrow_schema(schema):
# it does not preserve nullable=False for fields inside structs (everything becomes nullable).
# Reloading the same data with the original schema then raises an error because the schemas no longer match.
non_nullable_schema = pa.schema(
    [
        pa.field("text", pa.string(), nullable=False),
        pa.field("meta",
            pa.struct(
                [
                    pa.field("date", pa.list_(pa.string()), nullable=False),
                ],
            ),
        ),
    ]
)
print("ORIGINAL SCHEMA")
print(non_nullable_schema)
print()

feats = Features.from_arrow_schema(non_nullable_schema)

print("FEATUR-IZED SCHEMA (nullable-restrictions are gone)")
print(feats.arrow_schema)
print()

ds = Dataset.from_dict(
    {
        "text": ["a", "b", "c"],
        "meta": [{"date": ["2021-01-01"]}, {"date": ["2021-01-02"]}, {"date": ["2021-01-03"]}],
    },
    features=feats,
)

fname = "tmp.parquet"

# This is not possible: TypeError: pyarrow.parquet.core.ParquetWriter() got multiple values for keyword argument 'schema'
# Though I believe this would be the easiest fix: allow schema to be passed to to_parquet and overwrite the schema in the dataset
# ds.to_parquet(fname, schema=non_nullable_schema)

ds.to_parquet(fname)

try:
    _ = pq.read_table(fname, schema=non_nullable_schema)
finally:
    os.unlink(fname)
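
For completeness, the destructiveness can also be observed directly on the schema round-trip, without writing any file (a small check added here for illustration):

# Illustration: the round-trip through Features drops nullable=False, so the
# reconstructed schema no longer equals the original one.
roundtripped = Features.from_arrow_schema(non_nullable_schema).arrow_schema
print(roundtripped.equals(non_nullable_schema))  # False: the nullable flags were dropped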

Expected behavior

  • Non-destructive behavior when converting an arrow schema to Features; or
  • the ability to override the default arrow schema with a custom one
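
Until either of the above is available, a possible workaround (just a sketch, and it assumes the dataset fits in memory) is to bypass Dataset.to_parquet and write the data with pyarrow directly, re-applying the original schema. Reusing ds and non_nullable_schema from the snippet above:

import pyarrow as pa
import pyarrow.parquet as pq

# Workaround sketch: materialize the rows, rebuild a pyarrow Table with the
# original non-nullable schema, and write it with pyarrow instead of
# Dataset.to_parquet. This round-trips through Python objects, so it is only
# practical for datasets that fit in memory.
table = pa.Table.from_pylist(ds.to_list(), schema=non_nullable_schema)
pq.write_table(table, "tmp.parquet")

# Reading back with the original schema now succeeds, because the file was
# written with the nullable=False information intact.
_ = pq.read_table("tmp.parquet", schema=non_nullable_schema)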

Environment info

  • datasets version: 3.2.0
  • Platform: Linux-5.14.0-427.20.1.el9_4.x86_64-x86_64-with-glibc2.34
  • Python version: 3.11.10
  • huggingface_hub version: 0.27.1
  • PyArrow version: 18.1.0
  • Pandas version: 2.2.3
  • fsspec version: 2024.9.0