@dangotbanned dangotbanned commented Jul 24, 2025

What type of PR is this? (check all applicable)

  • 💾 Refactor
  • ✨ Feature
  • 🐛 Bug Fix
  • 🔧 Optimization
  • 📝 Documentation
  • ✅ Test
  • 🐳 Other

Related issues

Description

Note

Changed scope a bit from the original PR, see (#2879 (review)) and the original description for details

Original description

Description

While I was looking into (#2839 (comment)), I noticed that the default (and only) behavior of nw.DataFrame.to_pandas, when backed by either pa.Table or pl.DataFrame, isn't great for nested datatypes:

import pyarrow as pa

data = {"breakpoint": [2, 3], "count": [3, 2]}

>>> pa.table(data).to_struct_array().to_pandas()
0    {'breakpoint': 2, 'count': 3}
1    {'breakpoint': 3, 'count': 2}
dtype: object

We can do better than this, and currently this PR will instead give us something that works with pd.Series.struct:

import pyarrow as pa
import narwhals as nw

data = {"breakpoint": [2, 3], "count": [3, 2]}

>>> nw.from_native(pa.table(data).to_struct_array(), series_only=True).to_pandas()
0    {'breakpoint': 2, 'count': 3}
1    {'breakpoint': 3, 'count': 2}
Name: , dtype: struct<breakpoint: int64, count: int64>[pyarrow]

Questions

  1. Would we want this for any other DTypes e.g. List?
  2. Would it be worthwhile to expose this as the to_pandas(use_pyarrow_extension_array=...) parameter from polars?
    i. Note that this defaults to False, which is equivalent to our current behavior and pyarrow's

Adds optional keyword-only arguments to both DataFrame.to_pandas and Series.to_pandas.

This aligns us with the same methods found in polars and pyarrow:

As mentioned in (#2123), this can reduce the memory overhead if configured correctly.

But the main benefit I'm excited about is that we can preserve pyarrow data types (nulls, nested data), see (#2123 (comment)), which were previously always lost.

Example

import pyarrow as pa

import narwhals as nw

data = {"breakpoint": [2, 3], "count": [3, 2], "what": [None, 1]}
native = pa.table(data).to_struct_array()
series = nw.from_native(native, series_only=True)

Before

>>> series.to_pandas()
0    {'breakpoint': 2, 'count': 3, 'what': None}
1     {'breakpoint': 3, 'count': 2, 'what': 1.0}
Name: , dtype: object

After

>>> series.to_pandas(use_pyarrow_extension_array=True)
0    {'breakpoint': 2, 'count': 3, 'what': None}
1     {'breakpoint': 3, 'count': 2, 'what': 1.0}
Name: , dtype: struct<breakpoint: int64, count: int64, what: int64>[pyarrow]

Tasks

@dangotbanned dangotbanned added the enhancement, pyarrow, and pandas-like labels Jul 24, 2025
@dangotbanned dangotbanned requested a review from FBruzzesi July 25, 2025 21:51
@FBruzzesi FBruzzesi left a comment

Thanks for spotting this @dangotbanned, it seems reasonable to me!
If we do it, though, I would rather keep it consistent for all the dtypes that are only supported via ArrowDtype. WDYT?

- Lightly adapted from https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table.to_pandas
- Most likely, will edit it down, but there are too many options + ambiguous names to not have something
@dangotbanned dangotbanned changed the title from "feat: Preserve Struct dtype in pyarrow -> pandas" to "feat: Accept kwargs in (DataFrame|Series).to_pandas" Jul 27, 2025
@dangotbanned dangotbanned linked an issue Jul 27, 2025 that may be closed by this pull request
@dangotbanned dangotbanned added the eager-only label and removed the pyarrow and pandas-like labels Jul 27, 2025
@dangotbanned (Member, Author)

Hey @MarcoGorelli, would you be able to give me some boxes to tick in order to get this one over the line please? 🙏

It seems like we've been in agreement from the start on use_pyarrow_extension_array 🙂

But it doesn't seem like the changes I've made in response to (#2879 (comment)) and (#2879 (comment)) have gotten us closer

Really appreciate you bearing with me on this!

@dangotbanned dangotbanned removed the request for review from MarcoGorelli August 28, 2025 20:04
Labels
eager-only, enhancement (New feature or request), nested data (`list`, `struct`, etc)
Development

Successfully merging this pull request may close these issues.

enh: Passing arguments to pa.Table.to_pandas
3 participants