Predicates incorrectly keep missings for Float64 and Int64 dtypes for pyarrow=4 #484
With the following session:

In [1]: import numpy as np
   ...: import pandas as pd
   ...: import pyarrow as pa
   ...:
   ...: from kartothek.io.dask.dataframe import read_dataset_as_ddf
   ...: from kartothek.io.eager import store_dataframes_as_dataset, read_table
   ...: import minimalkv
   ...: from functools import partial
   ...:
   ...:
   ...: df = pd.DataFrame(
   ...:     {
   ...:         "I": pd.array([0, 1, pd.NA], dtype="Int64"),
   ...:         "f": pd.array([0.0, 1.1, np.nan], dtype="float64"),
   ...:         "F": pd.array([0.0, 1.1, pd.NA], dtype="Float64"),
   ...:         "o_1": pd.array([0, 1, None], dtype="object"),
   ...:         "o_2": pd.array(["0", "1", None], dtype="object"),
   ...:         "s": pd.array(["0", "b", None], dtype="string"),
   ...:     }
   ...: )
   ...: df.dtypes
Out[1]:
I        Int64
f      float64
F      Float64
o_1     object
o_2     object
s       string
dtype: object
In [2]: df.to_parquet("/tmp/file.parquet")
   ...:
   ...: store = partial(minimalkv.get_store_from_url, "hfs:///tmp?create_if_missing=False")
   ...: store_dataframes_as_dataset(
   ...:     dfs=[df],
   ...:     dataset_uuid="test",
   ...:     store=store,
   ...:     overwrite=True
   ...: )
   ...:
   ...: pa.parquet.read_table("/tmp/file.parquet").to_pandas()
   ...:
Out[2]:
      I    f     F  o_1   o_2     s
0     0  0.0   0.0  0.0     0     0
1     1  1.1   1.1  1.0     1     b
2  <NA>  NaN  <NA>  NaN  None  <NA>
In [3]: read_table(
   ...:     dataset_uuid="test",
   ...:     store=store,
   ...: )
Out[3]:
      F     I    f  o_1   o_2     s
0   0.0     0  0.0  0.0     0     0
1   1.1     1  1.1  1.0     1     b
2  <NA>  <NA>  NaN  NaN  None  <NA>
In [4]: pa.parquet.read_table("/tmp/file.parquet", filters=[[("I", "!=", 0)]]).to_pandas()
Out[4]:
   I    f    F  o_1  o_2  s
0  1  1.1  1.1    1    1  b
In [5]: pa.parquet.read_table("/tmp/file.parquet", filters=[[("I", "!=", 1)]]).to_pandas()
Out[5]:
   I    f    F  o_1  o_2  s
0  0  0.0  0.0    0    0  0
In [6]: read_table(
   ...:     dataset_uuid="test",
   ...:     store=store,
   ...:     predicates=[[("I", "!=", 0)]]
   ...: )
Out[6]:
     F  I    f  o_1  o_2  s
0  1.1  1  1.1  1.0    1  b
In [7]: read_table(
   ...:     dataset_uuid="test",
   ...:     store=store,
   ...:     predicates=[[("I", "!=", 1)]]
   ...: )
Out[7]:
      F     I    f  o_1   o_2     s
0   0.0     0  0.0  0.0     0     0
1  <NA>  <NA>  NaN  NaN  None  <NA>

@xhochy says this might be due to missings being stored as zeros in the data.
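To illustrate that hypothesis, here is a minimal pure-Python sketch of a masked integer column. It is not kartothek's or pyarrow's actual code path: the names `values`, `is_missing`, `naive_not_equal` and `null_aware_not_equal` are hypothetical, and the placeholder `0` in the missing slot is an assumption taken from the comment above. Under that assumption, a comparison that ignores the mask gives the same row selection as Out[6] and Out[7].

```python
# Hypothetical model of column "I": raw values plus a mask marking missing slots.
values = [0, 1, 0]                 # third slot is missing; 0 is an assumed placeholder
is_missing = [False, False, True]


def naive_not_equal(values, other):
    """Compare the raw values only, ignoring the mask."""
    return [v != other for v in values]


def null_aware_not_equal(values, is_missing, other):
    """Compare the raw values, but never select a missing slot."""
    return [(v != other) and not m for v, m in zip(values, is_missing)]


# "I != 0": the placeholder 0 hides the missing row, so both variants agree.
print(naive_not_equal(values, 0))
print(null_aware_not_equal(values, is_missing, 0))

# "I != 1": the naive variant selects the missing row because 0 != 1.
print(naive_not_equal(values, 1))
print(null_aware_not_equal(values, is_missing, 1))
```

If this is what happens, it would explain why the missing row only shows up for `!= 1`: the stored placeholder happens to equal `0`, so `!= 0` drops the row by coincidence.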