Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Difficulties in handling NaN and NULL #11029

Open
augcollet opened this issue Mar 21, 2025 · 3 comments
Open

Difficulties in handling NaN and NULL #11029

augcollet opened this issue Mar 21, 2025 · 3 comments

Comments

@augcollet
Copy link

Hello,

I'm having some trouble handling NaN and NULL values, especially for float columns. I don't know if the documentation could be improved on this point, if the API could be improved, or if I'm going about it the wrong way.

Here's a simple example to illustrate my problem. I'm creating a simple table with a string column, an int column, and a float column, with some missing values.

import ibis
from ibis import _
import os
import numpy as np

con=ibis.postgres.connect(
    user=os.getenv('POSTGRES_USER'),
    password=os.getenv('POSTGRES_PASSWORD'),
    host="postgres",
    port=os.getenv('POSTGRES_PORT'),
    database=os.getenv('POSTGRES_DB') 
)
ibis.set_backend(con)

con.raw_sql("""
DROP TABLE IF EXISTS test.test_nan_null;
CREATE TABLE test.test_nan_null (
    string_col TEXT,
    int_col INT,
    float_col DOUBLE PRECISION
);
INSERT INTO test.test_nan_null (string_col, int_col, float_col) VALUES
    ('a', 1, 1.5),
    ('b', 2, NULL),
    (NULL, NULL, 'NaN'::DOUBLE PRECISION);
""")

t=con.table('test_nan_null', database='test')
print(t.schema())
display(t.execute())

Image

The Pandas display doesn't show this, but note that in the case of a float column, I can have a NULL value and a NaN value, which are technically not the same thing. This is visible in my database.
Image
Image

When I try to use fill_null on the float column, it only applies to the NULL value and not the NaN value.

t.mutate(
    string_col=_['string_col'].fill_null('EMPTY'),
    int_col=_['int_col'].fill_null(-1),
    float_col=_['float_col'].fill_null(-1),
).execute()

Image

So I try using fillna instead, but it doesn't change anything because it's just an alias for fill_null (and, moreover, fillna will soon be deprecated).

t.mutate(
    string_col=_['string_col'].fill_null('EMPTY'),
    int_col=_['int_col'].fill_null(-1),
    float_col=_['float_col'].fillna(-1),
).execute()

Image

Ultimately, the only way I have to solve the problem is to use an ifelse, which considerably complicates the syntax.

t.mutate(
    string_col=_['string_col'].fill_null('EMPTY'),
    int_col=_['int_col'].fill_null(-1),
    float_col=ibis.ifelse(_['float_col'].isnull()|_['float_col'].isnan(), -1, _['float_col']),
).execute()

Image

Wouldn't it be interesting to keep fillna while adapting the implementation to this type of case? Or, better yet, to handle NaNs and NULLs interchangeably when applying fill_null to a float column?

Thank you in advance for your help!

Regards,

What version of ibis are you using?

Name: ibis-framework
Version: 10.3.1
Summary: The portable Python dataframe library
Home-page: https://ibis-project.org
Author: 
Author-email: Ibis Maintainers <maintainers@ibis-project.org>
License: Apache-2.0

What backend(s) are you using, if any?

PostgreSQL 16.4 (Debian 16.4-1.pgdg120+2) on x86_64-pc-linux-gnu, compiled by gcc (Debian 12.2.0-14) 12.2.0, 64-bit
@NickCrews
Copy link
Contributor

Hi! Thanks for the issue. I DON'T want to make the same mistake that pandas did, and treat NaN and +/-infinity the same as NULL. I think the current default behavior is good. I also do NOT want fill_na and fill_null to act differently. They are too similarly named, people would footgun themselves.

  1. I think we should definitely improve the docs to explicitly call out that by default, we only target real NULLs, not NaN or inf.

  2. I'm not as sure if this is a good idea or not: make it easier for a user to opt-in to treating NaN and inf the same as NULL. Possibly could adjust the API of .isnull() to be def isnull(*, nan_as_null: bool = False, inf_as_null: bool = False)

The alternative to 2, and what I recommend in the meantime, is making a custom function for yourself:

def is_nullish(n: ir.Value) -> ir.BooleanValue:
    if n.type().is_floating():
        return n.isnull() | n.isnan()
    else:
        return n.isnull()

# and then use as
float_col=is_nullish(_.float_col).ifelse(-1, _float_col),

@augcollet
Copy link
Author

Hello, and thank you very much for your feedback.

I understand and agree that Pandas is making a mistake by not differentiating between NaN and NULL, and that it's a good idea for Ibis to make this distinction.

If Ibis makes this technical choice, users should be aware of this distinction (by improving the documentation), in which case keeping the two functions fill_na and fill_null shouldn't be a problem. In other words, if the coexistence of these two functions is considered ambiguous, it's necessarily because the technical choice to differentiate between NaN and NULL was not sufficiently explained.

In my opinion, it's more the current behavior that can be a source of error, mistakenly thinking that all missing values ​​have been filled after using fill_null, when NaN values ​​remain. This is precisely what happened to me, and debugging it wasn't easy. It's possible that other users have made the same mistake in scripts running in production, without necessarily noticing it yet.

Otherwise, I think adding a nan_as_null (bool) argument to the fill_null function (not just isnull) could be a good compromise! This would be a simple and fairly natural way to make users aware that Ibis distinguishes between null and NaN, without having to reintroduce the deprecated fill_na function or change the default behavior.

@NickCrews
Copy link
Contributor

I made #11077 to improve the docs. I even included in the first line of the docstring of fill_null that "This does NOT affect NaN and inf". I am tempted to just call that sufficient. Really, @cpcloud is the maintainer of ibis and will have the final say on the nan_as_null decision.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: backlog
Development

No branches or pull requests

2 participants