I came across a perhaps niche bug: Features does not (and cannot) account for pyarrow's nullable=False option on fields. Interestingly, for regular "flat" fields this does not necessarily lead to conflicts, but when a non-nullable field sits inside a struct, an incompatibility arises.
It's not easy to explain in words, so I hope the minimal example below helps.
Note that I suggest a solution in the code comments: let Dataset.to_parquet accept a schema argument which, when provided, overrides the default ds.features.arrow_schema.
Steps to reproduce the bug
import os

from datasets import Dataset, Features
import pyarrow as pa
import pyarrow.parquet as pq

# HF datasets is destructive when you call Features.from_arrow_schema(schema) on a schema
# because it will not account for nullable and non-nullable fields in structs (it will always allow nullable)
# Reloading the same dataset with the original schema will raise an error because the schema is not the same anymore
non_nullable_schema = pa.schema(
    [
        pa.field("text", pa.string(), nullable=False),
        pa.field(
            "meta",
            pa.struct(
                [
                    pa.field("date", pa.list_(pa.string()), nullable=False),
                ],
            ),
        ),
    ]
)

print("ORIGINAL SCHEMA")
print(non_nullable_schema)
print()

feats = Features.from_arrow_schema(non_nullable_schema)
print("FEATUR-IZED SCHEMA (nullable-restrictions are gone)")
print(feats.arrow_schema)
print()

ds = Dataset.from_dict(
    {
        "text": ["a", "b", "c"],
        "meta": [{"date": ["2021-01-01"]}, {"date": ["2021-01-02"]}, {"date": ["2021-01-03"]}],
    },
    features=feats,
)

fname = "tmp.parquet"
# This is not possible: TypeError: pyarrow.parquet.core.ParquetWriter() got multiple values for keyword argument 'schema'
# Though I believe this would be the easiest fix: allow schema to be passed to to_parquet and overwrite the schema in the dataset
# ds.to_parquet(fname, schema=non_nullable_schema)
ds.to_parquet(fname)

try:
    _ = pq.read_table(fname, schema=non_nullable_schema)
finally:
    os.unlink(fname)
Expected behavior
Non-destructive behavior when converting an arrow schema to Features; or
the ability to override the default arrow schema with a custom one
Environment info
datasets version: 3.2.0
huggingface_hub version: 0.27.1
fsspec version: 2024.9.0