Difficulties in handling NaN and NULL #11029

augcollet · 2025-03-21T16:09:08Z

Hello,

I'm having some trouble handling NaN and NULL values, especially for float columns. I don't know if the documentation could be improved on this point, if the API could be improved, or if I'm going about it the wrong way.

Here's a simple example to illustrate my problem. I'm creating a simple table with a string column, an int column, and a float column, with some missing values.

import ibis
from ibis import _
import os
import numpy as np

con=ibis.postgres.connect(
    user=os.getenv('POSTGRES_USER'),
    password=os.getenv('POSTGRES_PASSWORD'),
    host="postgres",
    port=os.getenv('POSTGRES_PORT'),
    database=os.getenv('POSTGRES_DB') 
)
ibis.set_backend(con)

con.raw_sql("""
DROP TABLE IF EXISTS test.test_nan_null;
CREATE TABLE test.test_nan_null (
    string_col TEXT,
    int_col INT,
    float_col DOUBLE PRECISION
);
INSERT INTO test.test_nan_null (string_col, int_col, float_col) VALUES
    ('a', 1, 1.5),
    ('b', 2, NULL),
    (NULL, NULL, 'NaN'::DOUBLE PRECISION);
""")

t=con.table('test_nan_null', database='test')
print(t.schema())
display(t.execute())

The Pandas display doesn't show this, but note that in the case of a float column, I can have a NULL value and a NaN value, which are technically not the same thing. This is visible in my database.

When I try to use fill_null on the float column, it only applies to the NULL value and not the NaN value.

t.mutate(
    string_col=_['string_col'].fill_null('EMPTY'),
    int_col=_['int_col'].fill_null(-1),
    float_col=_['float_col'].fill_null(-1),
).execute()

So I try using fillna instead, but it doesn't change anything because it's just an alias for fill_null (and, moreover, fillna will soon be deprecated).

t.mutate(
    string_col=_['string_col'].fill_null('EMPTY'),
    int_col=_['int_col'].fill_null(-1),
    float_col=_['float_col'].fillna(-1),
).execute()

Ultimately, the only way I have to solve the problem is to use an ifelse, which considerably complicates the syntax.

t.mutate(
    string_col=_['string_col'].fill_null('EMPTY'),
    int_col=_['int_col'].fill_null(-1),
    float_col=ibis.ifelse(_['float_col'].isnull()|_['float_col'].isnan(), -1, _['float_col']),
).execute()

Wouldn't it be interesting to keep fillna while adapting the implementation to this type of case? Or, better yet, to handle NaNs and NULLs interchangeably when applying fill_null to a float column?

Thank you in advance for your help!

Regards,

What version of ibis are you using?

Name: ibis-framework
Version: 10.3.1
Summary: The portable Python dataframe library
Home-page: https://ibis-project.org
Author: 
Author-email: Ibis Maintainers <maintainers@ibis-project.org>
License: Apache-2.0

What backend(s) are you using, if any?

PostgreSQL 16.4 (Debian 16.4-1.pgdg120+2) on x86_64-pc-linux-gnu, compiled by gcc (Debian 12.2.0-14) 12.2.0, 64-bit

The text was updated successfully, but these errors were encountered:

NickCrews · 2025-04-02T23:48:36Z

Hi! Thanks for the issue. I DON'T want to make the same mistake that pandas did, and treat NaN and +/-infinity the same as NULL. I think the current default behavior is good. I also do NOT want fill_na and fill_null to act differently. They are too similarly named, people would footgun themselves.

I think we should definitely improve the docs to explicitly call out that by default, we only target real NULLs, not NaN or inf.
I'm not as sure if this is a good idea or not: make it easier for a user to opt-in to treating NaN and inf the same as NULL. Possibly could adjust the API of .isnull() to be def isnull(*, nan_as_null: bool = False, inf_as_null: bool = False)

The alternative to 2, and what I recommend in the meantime, is making a custom function for yourself:

def is_nullish(n: ir.Value) -> ir.BooleanValue:
    if n.type().is_floating():
        return n.isnull() | n.isnan()
    else:
        return n.isnull()

# and then use as
float_col=is_nullish(_.float_col).ifelse(-1, _float_col),

augcollet · 2025-04-03T15:23:00Z

Hello, and thank you very much for your feedback.

I understand and agree that Pandas is making a mistake by not differentiating between NaN and NULL, and that it's a good idea for Ibis to make this distinction.

If Ibis makes this technical choice, users should be aware of this distinction (by improving the documentation), in which case keeping the two functions fill_na and fill_null shouldn't be a problem. In other words, if the coexistence of these two functions is considered ambiguous, it's necessarily because the technical choice to differentiate between NaN and NULL was not sufficiently explained.

In my opinion, it's more the current behavior that can be a source of error, mistakenly thinking that all missing values have been filled after using fill_null, when NaN values remain. This is precisely what happened to me, and debugging it wasn't easy. It's possible that other users have made the same mistake in scripts running in production, without necessarily noticing it yet.

Otherwise, I think adding a nan_as_null (bool) argument to the fill_null function (not just isnull) could be a good compromise! This would be a simple and fairly natural way to make users aware that Ibis distinguishes between null and NaN, without having to reintroduce the deprecated fill_na function or change the default behavior.

NickCrews · 2025-04-03T18:07:09Z

I made #11077 to improve the docs. I even included in the first line of the docstring of fill_null that "This does NOT affect NaN and inf". I am tempted to just call that sufficient. Really, @cpcloud is the maintainer of ibis and will have the final say on the nan_as_null decision.

github-project-automation bot added this to Ibis planning and roadmap Mar 21, 2025

github-project-automation bot moved this to backlog in Ibis planning and roadmap Mar 21, 2025

This was referenced Apr 3, 2025

feat: avoid pandas converting float("nan") to NULL in memtable #11075

Open

docs: clarify NULL/NaN/inf distinction in docstring and examples #11077

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Difficulties in handling NaN and NULL #11029

Difficulties in handling NaN and NULL #11029

augcollet commented Mar 21, 2025

NickCrews commented Apr 2, 2025

augcollet commented Apr 3, 2025

NickCrews commented Apr 3, 2025

Difficulties in handling NaN and NULL #11029

Difficulties in handling NaN and NULL #11029

Comments

augcollet commented Mar 21, 2025

What version of ibis are you using?

What backend(s) are you using, if any?

NickCrews commented Apr 2, 2025

augcollet commented Apr 3, 2025

NickCrews commented Apr 3, 2025