Reading int96 timestamp in Parquet #427

Open
linhr opened this issue Mar 27, 2025 · 0 comments
Labels
good first issue Good for newcomers

linhr commented Mar 27, 2025

Spark may write timestamps using the deprecated int96 physical type in Parquet files. Currently, Sail cannot read such data correctly, for two reasons:

  1. Arrow reads int96 as a timestamp with nanosecond unit, while Spark expects microsecond unit, so the two have different valid value ranges.
  2. The schema analysis request (printSchema()) fails because we cannot convert the Arrow data type (nanosecond unit) back to a Spark data type.
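
To make the range difference in item 1 concrete, here is a small sketch using only the Python standard library. It assumes the timestamp is stored as a signed 64-bit count since the Unix epoch, which is how Arrow represents timestamps; the variable names are made up for illustration.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)
I64_MIN, I64_MAX = -(2**63), 2**63 - 1

# Nanosecond unit: datetime only resolves microseconds, so round down for display.
ns_min = EPOCH + timedelta(microseconds=I64_MIN // 1_000)
ns_max = EPOCH + timedelta(microseconds=I64_MAX // 1_000)

# Microsecond unit: the range exceeds datetime's supported years (1..9999),
# so report the distance from the epoch in (Gregorian average) years instead.
us_span_years = I64_MAX / 1_000_000 / 86_400 / 365.2425

print("nanosecond range:", ns_min, "to", ns_max)
print("microsecond range: about +/-", round(us_span_years), "years around 1970")
```

With nanosecond unit the representable range is limited to roughly the years 1677 to 2262, so a timestamp that is perfectly valid at microsecond precision can fall outside what a nanosecond column can hold.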

We should respect the Spark schema (stored under a metadata key) when reading the Parquet file. Casting the timestamp type seems possible after the recent upstream fix (apache/arrow-rs#7285), so we should be able to handle this after the next Arrow release.
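
As a rough illustration of the idea (not Sail's or arrow-rs's implementation), the sketch below looks up timestamp columns in the Spark schema JSON and decodes an int96 value straight to microseconds. Assumptions: the footer key is `org.apache.spark.sql.parquet.row.metadata` (to my knowledge the key Spark uses), the int96 layout is 8 bytes of nanoseconds within the day followed by 4 bytes of Julian day, both little-endian, and the function names are hypothetical.

```python
import json
import struct
from datetime import date

# Assumption: Spark stores its schema as JSON under this Parquet footer key.
SPARK_SCHEMA_KEY = "org.apache.spark.sql.parquet.row.metadata"
JULIAN_DAY_OF_UNIX_EPOCH = 2_440_588
MICROS_PER_DAY = 86_400 * 1_000_000


def spark_timestamp_columns(footer_metadata: dict) -> set:
    """Return names of top-level columns that Spark declared as timestamps."""
    raw = footer_metadata.get(SPARK_SCHEMA_KEY)
    if raw is None:
        return set()
    schema = json.loads(raw)
    # Nested types appear as JSON objects, so only plain "timestamp" matches.
    return {f["name"] for f in schema["fields"] if f["type"] == "timestamp"}


def int96_to_micros(value: bytes) -> int:
    """Decode a 12-byte int96 timestamp to microseconds since the Unix epoch."""
    # Assumed layout: int64 nanoseconds within the day, then int32 Julian day.
    nanos_of_day, julian_day = struct.unpack("<qi", value)
    days = julian_day - JULIAN_DAY_OF_UNIX_EPOCH
    return days * MICROS_PER_DAY + nanos_of_day // 1_000


# Usage: midnight on 3000-01-01 overflows int64 nanoseconds but fits in microseconds.
days_to_3000 = (date(3000, 1, 1) - date(1970, 1, 1)).days
encoded = struct.pack("<qi", 0, JULIAN_DAY_OF_UNIX_EPOCH + days_to_3000)
micros = int96_to_micros(encoded)
print(micros, "fits in int64 ns:", micros * 1_000 <= 2**63 - 1)

metadata = {SPARK_SCHEMA_KEY: json.dumps({
    "type": "struct",
    "fields": [{"name": "ts", "type": "timestamp", "nullable": True, "metadata": {}}],
})}
print(spark_timestamp_columns(metadata))
```

Decoding directly to the unit named by the Spark schema avoids passing through a nanosecond representation, which is where out-of-range values would otherwise be lost.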

linhr added the "good first issue" label Mar 27, 2025