[BLOG] Arrow PyCapsule Interface and Narwhals for universal dataframe support #894

Draft
wants to merge 17 commits into
base: main

MarcoGorelli
Contributor

Text styling

  • The blog is written with plain language (where relevant).
  • If there are headers, they use the proper header tags in order to do so (with only one level-one header).
  • All links describe where they link to (for example, check the Quansight labs website).
  • Any kind of styling that the author uses (for example, bold for emphasis) is consistent throughout the blog.

Non-text contents

  • Blog post featured image is in PNG or JPEG format, not SVG.
  • All content is represented as text (for example, images need alt text and videos need captions or descriptive transcripts).
  • If there are emojis, there are not more than three in a row.
  • Don't use flashing gifs or videos.
  • If it were to be read as plain text, the blog still makes sense and no information is missing.

@MarcoGorelli changed the title from "Arrow PyCapsule Interface and Narwhals: Rust and Tusk for universal dataframe support" to "Arrow PyCapsule Interface and Narwhals for universal dataframe support" on Jan 6, 2025
@MarcoGorelli changed the title from "Arrow PyCapsule Interface and Narwhals for universal dataframe support" to "[BLOG] Arrow PyCapsule Interface and Narwhals for universal dataframe support" on Jan 6, 2025
@MarcoGorelli marked this pull request as ready for review on January 6, 2025, 17:35
Contributor Author

MarcoGorelli commented Jan 6, 2025

cc @kylebarron @WillAyd Here's a blog post I've put together on the PyCapsule Interface and Narwhals - just pinging you in case you were interested in reading about the treatment of the former, and whether I've misrepresented anything (if not, no worries at all! Feel free to unsubscribe)

The post preview is available at: https://labs-lr1c4romv-quansight.vercel.app/blog/narwhals-pycapsule

@WillAyd WillAyd left a comment

looks great


The technical details are beyond the scope of this post, but the summary is:

- We accept any object which implements the

Not sure how technical you want to be, and I realize this post is aimed at dataframe libraries, but there are other methods that can be implemented that still fall under the PyCapsule Interface:

  • `__arrow_c_array__`
  • `__arrow_c_schema__`
  • `__arrow_c_stream__`
  • `__arrow_c_device_array__`
  • `__arrow_c_device_stream__`

Stream may be the most fitting for exchanging dataframes, but the others have uses outside of 2-d structures. You may want to be careful about saying that the presence of this method exclusively means you are implementing the PyCapsule Interface.
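
For readers following the thread, here is a minimal sketch of the point above: an object can take part in the PyCapsule Interface through any of several dunder methods, not only the stream one. The method names follow the Arrow specification; `arrow_pycapsule_methods`, `FakeTable`, and `FakeColumn` are hypothetical names made up for this illustration, and the fake classes do not export real Arrow data.

```python
# Dunder methods defined by the Arrow PyCapsule Interface.
ARROW_PYCAPSULE_METHODS = (
    "__arrow_c_schema__",
    "__arrow_c_array__",
    "__arrow_c_stream__",
    "__arrow_c_device_array__",
    "__arrow_c_device_stream__",
)


def arrow_pycapsule_methods(obj: object) -> list[str]:
    """Return the PyCapsule Interface methods that `obj` exposes (hypothetical helper)."""
    return [
        name
        for name in ARROW_PYCAPSULE_METHODS
        if callable(getattr(obj, name, None))
    ]


class FakeTable:
    """Hypothetical stand-in for a dataframe; exports nothing real."""

    def __arrow_c_stream__(self, requested_schema=None):
        raise NotImplementedError("illustration only")


class FakeColumn:
    """Hypothetical stand-in for a 1-d array-like object; exports nothing real."""

    def __arrow_c_array__(self, requested_schema=None):
        raise NotImplementedError("illustration only")

    def __arrow_c_schema__(self):
        raise NotImplementedError("illustration only")


table_methods = arrow_pycapsule_methods(FakeTable())
column_methods = arrow_pycapsule_methods(FakeColumn())
plain_methods = arrow_pycapsule_methods(object())
```

Under these assumptions, `FakeColumn` exposes part of the interface even though it has no `__arrow_c_stream__`, which is why checking for the stream method alone is a test for "exports a stream", not for "implements the PyCapsule Interface".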

Member

@pavithraes pavithraes left a comment


Thanks! Everything looks good, and it's an interesting read!

without us having to write any specialised code to handle the subtle differences
between them!

> **_NOTE:_** How many times can `agnostic_sum_i64_column` be called? Strictly

Depending on your target audience, you might be able to remove this note. I love the thoroughness, but it is maybe too far into the weeds.

I'm also not sure I agree with the content: each call into the dataframe libraries is going to yield a new Python capsule. In the non-DuckDB cases, those capsules provide a non-owning view of the data; the data has a separate lifetime managed by a different object. I think in the DuckDB case, they are tying the lifetime of the array to the invocation of the first capsule, but that's more of a technical detail of that library than anything related to the specification itself.

@WillAyd WillAyd left a comment


Again minor comments and suggestions, but feel free to ignore. Thanks for writing this - glad to see this get more promotion


Let's cover some scenarios:

- If you want your dataframe logic to stay completely lazy where possible: use Narwhals.

This comment isn't quite true either - the PyCapsule Interface says nothing about lifetimes. In the dataframe case, it yields a pointer to a stream of arrays. It is up to the consumer when they want to actually start iterating over the stream

Perhaps the main consumers of the PyCapsule interface choose to do so eagerly, but that's an implementation detail of the libraries themselves, not the PyCapsule Interface

- If you want complete control over your data: use the
PyCapsule Interface. If you have the necessary Rust / C skills, there should be no limit
to how complex and bespoke you make your data processing.
- If you want to keep your library logic to pure-Python and without heavy dependencies so

Just a suggestion on ordering - I think most users will be totally fine with narwhals, especially if your target audience is Python developers. So might be worth putting narwhals first on the list

- If you want to keep your library logic to pure-Python and without heavy dependencies so
it's easy to maintain and install: use Narwhals. Packaging a pure-Python project is very
easy, especially compared with if you need to get Rust or C in there.
- If you want to do part of your processing in Python, and part of it in Rust - **use both**!

If you want to cater to non-Rust developers, you might also want to mention nanoarrow. I've also done a YouTube video on building Python extensions using C++ - https://www.youtube.com/watch?v=EhUnmXPjTy8&list=PLyPvk-ejh-138onAgM7sVcpzu1HLp-Uiv

Not trying to promote my own content here - just offering in case you wanted to expand the scope of your blog post (which may or may not be worth it)

@kylebarron kylebarron left a comment


Overall looks good! I've had a long-stalled pycapsule post on my back burner too


> **_NOTE:_** How many times can `agnostic_sum_i64_column` be called? Strictly
> speaking, the PyCapsule Interface only guarantees that it can be called once for
> any given input. In practice, however, all implementations that we're aware of


FWIW, I think arro3 IO functions like `read_parquet` provision a single stream instance. So the first time you use the stream pointer from `read_parquet`, it would consume the stream, and you wouldn't be able to use it again.
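
The difference between a producer that provisions a single stream and one that builds a fresh stream on every export can be sketched with plain Python iterators. This is only an analogy under stated assumptions: `OneShotSource`, `ReExportingSource`, and `sum_stream` are hypothetical names, and no real Arrow capsules or arro3 calls are involved.

```python
class OneShotSource:
    """Hypothetical producer that hands out the same underlying stream every time."""

    def __init__(self, batches):
        self._stream = iter(batches)  # created once, shared by every export

    def export_stream(self):
        return self._stream


class ReExportingSource:
    """Hypothetical producer that keeps the data and builds a fresh stream per export."""

    def __init__(self, batches):
        self._batches = list(batches)  # the data outlives any single stream

    def export_stream(self):
        return iter(self._batches)


def sum_stream(stream):
    """Consume a stream of batches and add up every value."""
    return sum(value for batch in stream for value in batch)


batches = [[1, 2], [3, 4]]
one_shot = OneShotSource(batches)
re_exporting = ReExportingSource(batches)

# The single stream is exhausted by the first consumer.
one_shot_first = sum_stream(one_shot.export_stream())
one_shot_second = sum_stream(one_shot.export_stream())

# A fresh stream per export can be consumed repeatedly.
re_exporting_first = sum_stream(re_exporting.export_stream())
re_exporting_second = sum_stream(re_exporting.export_stream())
```

In this sketch the second pass over the one-shot source sees no data, while the re-exporting source gives the same total both times. Which behaviour a given library has is a property of that library, as the comments in this thread point out.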

msg = f"Column '{column_name}' is of type {dtype}, expected Int64"
raise TypeError(msg)
df = lf.collect()
return df[column_name].sum()


What backend do you use to actually compute the sum?


Let's cover some scenarios:

- If you want your dataframe logic to stay completely lazy where possible: use Narwhals.


To be fair, it is possible to write lazy logic in Rust/C++ too. Many of the compute functions exported by arro3 are lazy. See the overloads for e.g. `cast`. If you pass in something with a `__arrow_c_stream__` dunder, the stream will not be materialized automatically. Rather, it will return an `ArrayReader`, so that it's fully lazy. Only when the output `ArrayReader` itself is iterated over will the `cast` calls actually happen.

Let's cover some scenarios:

- If you want your dataframe logic to stay completely lazy where possible: use Narwhals.
The PyCapsule Interface requires you to materialise the data into memory immediately,


If anything, I think it's the opposite; the Arrow Stream interface encourages laziness by being defined as a stream. It's only common in-memory implementations that fully materialize data.
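
As a rough illustration of why a stream-shaped interface lends itself to laziness, the sketch below uses Python generators as a stand-in for an Arrow stream. It is an analogy only: `read_batches` and `lazy_cast` are hypothetical names, not arro3 or Arrow APIs, and the `log` list exists purely to show when work happens.

```python
from typing import Callable, Iterable, Iterator

log: list[str] = []  # records when each batch is actually produced


def read_batches() -> Iterator[list[int]]:
    """Hypothetical source: yields batches one at a time, logging each one."""
    for i in range(3):
        log.append(f"produced batch {i}")
        yield [i, i + 1]


def lazy_cast(stream: Iterable[list], to_type: Callable) -> Iterator[list]:
    """Cast each batch only when the consumer asks for it."""
    for batch in stream:
        yield [to_type(value) for value in batch]


# Building the pipeline does no work: nothing has been read or cast yet.
reader = lazy_cast(read_batches(), float)
work_before_iteration = list(log)

# Pulling one batch triggers exactly one read and one cast.
first_batch = next(reader)
work_after_one_batch = list(log)
```

Here the consumer decides when to start iterating and how far to go; fully materialising the data would correspond to calling `list(reader)` up front, which is a choice of the consumer and not something the stream shape requires.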

to do this in Polars: [Expressions Plugins](https://marcogorelli.github.io/polars-plugins-tutorial/).
How does that differ from writing custom code with the PyCapsule Interface?

- Pros: Polars Plugins slot in seamlessly with Polars' lazy execution. This can enable massive


FWIW, it would be nice to have a generic way to use UDFs defined in terms of the PyCapsule Interface with Polars. It wouldn't be that hard to write the glue code.

@kylebarron

Also in case anyone else is curious, I simplified the rust example: MarcoGorelli/pycapsule-demo#1

@MarcoGorelli marked this pull request as draft on January 15, 2025, 17:08
4 participants