
Predicates incorrectly keep missings for Float64 and Int64 dtypes for pyarrow=4 #484

Open
mlondschien opened this issue Jun 25, 2021 · 1 comment

@mlondschien (Contributor)

In [1]: from functools import partial
   ...: 
   ...: import minimalkv
   ...: import numpy as np
   ...: import pandas as pd
   ...: import pyarrow as pa
   ...: from kartothek.io.dask.dataframe import read_dataset_as_ddf
   ...: from kartothek.io.eager import read_table, store_dataframes_as_dataset
   ...: 
   ...: df = pd.DataFrame(
   ...:     {
   ...:         "I": pd.array([0, 1, pd.NA], dtype="Int64"),
   ...:         "f": pd.array([0.0, 1.1, np.nan], dtype="float64"),
   ...:         "F": pd.array([0.0, 1.1, pd.NA], dtype="Float64"),
   ...:         "o_1": pd.array([0, 1, None], dtype="object"),
   ...:         "o_2": pd.array(["0", "1", None], dtype="object"),
   ...:         "s": pd.array(["0", "b", None], dtype="string"),
   ...:     }
   ...: )
   ...: df.dtypes
Out[1]: 
I        Int64
f      float64
F      Float64
o_1     object
o_2     object
s       string
dtype: object

In [2]: df.to_parquet("/tmp/file.parquet")
   ...: pa.parquet.read_table("/tmp/file.parquet").to_pandas()
Out[2]: 
      I    f     F  o_1   o_2     s
0     0  0.0   0.0  0.0     0     0
1     1  1.1   1.1  1.0     1     b
2  <NA>  NaN  <NA>  NaN  None  <NA>

In [3]: store = partial(minimalkv.get_store_from_url, f"hfs:///tmp?create_if_missing=False")
   ...: store_dataframes_as_dataset(dfs=[df], dataset_uuid="test", store=store, overwrite=True)
   ...: read_table(dataset_uuid="test", store=store)
Out[3]: 
      F     I    f  o_1   o_2     s
0   0.0     0  0.0  0.0     0     0
1   1.1     1  1.1  1.0     1     b
2  <NA>  <NA>  NaN  NaN  None  <NA>

In [4]: # Int64
   ...: pa.parquet.read_table("/tmp/file.parquet", filters=[[("I", "!=", 0)]]).to_pandas()
Out[4]: 
   I    f    F  o_1 o_2  s
0  1  1.1  1.1    1   1  b

In [5]: read_table(dataset_uuid="test", store=store, predicates=[[("I", "!=", 0)]])
Out[5]: 
     F  I    f  o_1 o_2  s
0  1.1  1  1.1  1.0   1  b

In [6]: pa.parquet.read_table("/tmp/file.parquet", filters=[[("I", "==", 0)]]).to_pandas()
Out[6]: 
   I    f    F  o_1 o_2  s
0  0  0.0  0.0    0   0  0

In [7]: read_table(dataset_uuid="test", store=store, predicates=[[("I", "==", 0)]])
Out[7]: 
      F     I    f  o_1   o_2     s
0   0.0     0  0.0  0.0     0     0
1  <NA>  <NA>  NaN  NaN  None  <NA>

In [8]: # Float64
   ...: pa.parquet.read_table("/tmp/file.parquet", filters=[[("F", "!=", 0.0)]]).to_pandas()
Out[8]: 
   I    f    F  o_1 o_2  s
0  1  1.1  1.1    1   1  b

In [10]: read_table(dataset_uuid="test", store=store, predicates=[[("F", "!=", 0.0)]])
Out[10]: 
     F  I    f  o_1 o_2  s
0  1.1  1  1.1  1.0   1  b

In [11]: pa.parquet.read_table("/tmp/file.parquet", filters=[[("F", "==", 0.0)]]).to_pandas()
Out[11]: 
   I    f    F  o_1 o_2  s
0  0  0.0  0.0    0   0  0

In [12]: read_table(dataset_uuid="test", store=store, predicates=[[("F", "==", 0.0)]])
Out[12]: 
      F     I    f  o_1   o_2     s
0   0.0     0  0.0  0.0     0     0
1  <NA>  <NA>  NaN  NaN  None  <NA>

In [15]: # float64
    ...: pa.parquet.read_table("/tmp/file.parquet", filters=[[("f", "!=", 0.0)]]).to_pandas()
Out[15]: 
   I    f    F  o_1 o_2  s
0  1  1.1  1.1    1   1  b

In [16]: read_table(dataset_uuid="test", store=store, predicates=[[("f", "!=", 0.0)]])
Out[16]: 
      F     I    f  o_1   o_2     s
0   1.1     1  1.1  1.0     1     b
1  <NA>  <NA>  NaN  NaN  None  <NA>

In [17]: pa.parquet.read_table("/tmp/file.parquet", filters=[[("f", "==", 0.0)]]).to_pandas()
Out[17]: 
   I    f    F  o_1 o_2  s
0  0  0.0  0.0    0   0  0

In [18]: read_table(dataset_uuid="test", store=store, predicates=[[("f", "==", 0.0)]])
Out[18]: 
     F  I    f  o_1 o_2  s
0  0.0  0  0.0  0.0   0  0

In [19]: # string
    ...: pa.parquet.read_table("/tmp/file.parquet", filters=[[("s", "!=", "0")]]).to_pandas()
Out[19]: 
   I    f    F  o_1 o_2  s
0  1  1.1  1.1    1   1  b

In [20]: read_table(dataset_uuid="test", store=store, predicates=[[("s", "!=", "0")]])
Out[20]: 
     F  I    f  o_1 o_2  s
0  1.1  1  1.1  1.0   1  b

In [21]: pa.parquet.read_table("/tmp/file.parquet", filters=[[("s", "==", "0")]]).to_pandas()
Out[21]: 
   I    f    F  o_1 o_2  s
0  0  0.0  0.0    0   0  0

In [22]: read_table(dataset_uuid="test", store=store, predicates=[[("s", "==", "0")]])
Out[22]: 
     F  I    f  o_1 o_2  s
0  0.0  0  0.0  0.0   0  0
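For reference, the expected behaviour (which the plain `pa.parquet.read_table` calls above show) follows SQL-style three-valued logic: comparing a missing value against anything is "unknown", so the row is dropped for both `==` and `!=`. A minimal pure-Python sketch of these semantics (`tri_eq` is a hypothetical helper, not part of any library):

```python
# SQL-style three-valued logic for comparison predicates:
# comparing against a missing value yields "unknown" (None here),
# and a filter only keeps rows where the predicate is definitely True.
def tri_eq(a, b):
    """Hypothetical helper: three-valued equality (None = unknown)."""
    return None if a is None or b is None else a == b

rows = [0, 1, None]
kept_eq = [r for r in rows if tri_eq(r, 0) is True]   # predicate: == 0
kept_ne = [r for r in rows if tri_eq(r, 0) is False]  # predicate: != 0
print(kept_eq)  # [0] -- the missing row is dropped
print(kept_ne)  # [1] -- the missing row is dropped here too
```

Under these semantics the `<NA>` rows in `Out[7]`, `Out[12]`, and `Out[16]` above should not have been returned.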
@mlondschien (Contributor, Author)

With pyarrow=2.0.0:

In [1]: import numpy as np              
   ...: import pandas as pd
   ...: import pyarrow as pa
   ...:                   
   ...: from kartothek.io.dask.dataframe import read_dataset_as_ddf
   ...: from kartothek.io.eager import store_dataframes_as_dataset, read_table
   ...: import pandas as pd
   ...: import minimalkv       
   ...: from functools import partial
   ...:                                 
   ...:  
   ...: df = pd.DataFrame(
   ...:   {                        
   ...:     "I": pd.array([0, 1, pd.NA], dtype="Int64"),
   ...:     "f": pd.array([0.0, 1.1, np.nan], dtype="float64"),
   ...:     "F": pd.array([0.0, 1.1, pd.NA], dtype="Float64"),
   ...:     "o_1": pd.array([0, 1, None], dtype="object"),
   ...:     "o_2": pd.array(["0", "1", None], dtype="object"),
   ...:     "s": pd.array(["0", "b", None], dtype="string"),
   ...:   }
   ...: )
   ...: df.dtypes
Out[1]: 
I        Int64
f      float64
F      Float64
o_1     object
o_2     object
s       string
dtype: object

In [2]: df.to_parquet("/tmp/file.parquet")
   ...: 
   ...: store = partial(minimalkv.get_store_from_url, f"hfs:///tmp?create_if_missing=False")
   ...: store_dataframes_as_dataset(
   ...:    dfs=[df],
   ...:    dataset_uuid="test",
   ...:    store=store,
   ...:    overwrite=True
   ...: )
   ...: 
   ...: pa.parquet.read_table("/tmp/file.parquet").to_pandas()
   ...: 
Out[2]: 
      I    f     F  o_1   o_2     s
0     0  0.0   0.0  0.0     0     0
1     1  1.1   1.1  1.0     1     b
2  <NA>  NaN  <NA>  NaN  None  <NA>

In [3]: read_table(
   ...:    dataset_uuid="test",
   ...:    store=store,
   ...: )
Out[3]: 
      F     I    f  o_1   o_2     s
0   0.0     0  0.0  0.0     0     0
1   1.1     1  1.1  1.0     1     b
2  <NA>  <NA>  NaN  NaN  None  <NA>

In [4]: pa.parquet.read_table("/tmp/file.parquet", filters=[[("I", "!=", 0)]]).to_pandas()
Out[4]: 
   I    f    F  o_1 o_2  s
0  1  1.1  1.1    1   1  b

In [5]: pa.parquet.read_table("/tmp/file.parquet", filters=[[("I", "!=", 1)]]).to_pandas()
Out[5]: 
   I    f    F  o_1 o_2  s
0  0  0.0  0.0    0   0  0

In [6]: read_table(
   ...:    dataset_uuid="test",
   ...:    store=store,
   ...:    predicates=[[("I", "!=", 0)]]
   ...: )
Out[6]: 
     F  I    f  o_1 o_2  s
0  1.1  1  1.1  1.0   1  b

In [7]: read_table(
   ...:    dataset_uuid="test",
   ...:    store=store,
   ...:    predicates=[[("I", "!=", 1)]]
   ...: )
Out[7]: 
      F     I    f  o_1   o_2     s
0   0.0     0  0.0  0.0     0     0
1  <NA>  <NA>  NaN  NaN  None  <NA>

@xhochy says this might be due to missing values being stored as zeros in the underlying data buffer.
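That hypothesis would explain the observed results: if a predicate is evaluated against the raw value buffer without consulting the validity mask, a zero-filled null slot satisfies `== 0` (and fails `!= 0`) and the null row leaks into the output. A sketch under that assumption; `filter_eq_buggy` and `filter_eq_correct` are hypothetical names, not kartothek's actual implementation:

```python
# Sketch of a nullable Int64 column as a values buffer plus a validity
# mask; undefined (masked) slots are commonly zero-filled on disk.
values = [0, 1, 0]            # third value is undefined, stored as 0
valid = [True, True, False]   # validity mask: third entry is missing

def filter_eq_buggy(values, valid, target):
    # Evaluates the predicate on the raw buffer and ignores the mask,
    # so the zero-filled missing row leaks through for target == 0.
    return [i for i, v in enumerate(values) if v == target]

def filter_eq_correct(values, valid, target):
    # Consults the validity mask: a missing value never satisfies == or !=.
    return [i for i, v in enumerate(values) if valid[i] and v == target]

print(filter_eq_buggy(values, valid, 0))    # [0, 2] -- null row kept
print(filter_eq_correct(values, valid, 0))  # [0]
```

This matches the symptom above: the extra `<NA>` row appears exactly for predicates that the zero-filled slot would satisfy.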
