[BLOG] Arrow PyCapsule Interface and Narwhals for universal dataframe support #894
cc @kylebarron @WillAyd Here's a blog post I've put together on the PyCapsule Interface and Narwhals - just pinging you in case you were interested in reading about the treatment of the former, and whether I've misrepresented anything (if not, no worries at all! Feel free to unsubscribe). The post preview is available at: https://labs-lr1c4romv-quansight.vercel.app/blog/narwhals-pycapsule
looks great
The technical details are beyond the scope of this post, but the summary is:
- We accept any object which implements the
Not sure how technical you want to be, and I realize this post is aimed at dataframe libraries, but there are other methods that can be implemented that still fall under the PyCapsule Interface:

- `__arrow_c_array__`
- `__arrow_c_schema__`
- `__arrow_c_stream__`
- `__arrow_c_device_array__`
- `__arrow_c_device_stream__`

Stream may be the most fitting for exchanging dataframes, but the others have uses outside of 2-D structures. You may want to be careful about saying that the presence of this method alone means you are implementing the PyCapsule Interface.
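For readers who want to see what that looks like in practice, here is a minimal sketch (mine, not from the post or the spec; the helper name is made up) of probing which of these dunders an object exposes:

```python
# Any subset of these dunders can be present; __arrow_c_stream__ alone
# does not imply the others.
PYCAPSULE_DUNDERS = (
    "__arrow_c_schema__",
    "__arrow_c_array__",
    "__arrow_c_stream__",
    "__arrow_c_device_array__",
    "__arrow_c_device_stream__",
)

def implemented_capsule_methods(obj) -> list[str]:
    """Return which Arrow PyCapsule Interface dunders `obj` implements."""
    return [name for name in PYCAPSULE_DUNDERS if hasattr(obj, name)]
```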
Thanks! Everything looks good, and it's an interesting read!
without us having to write any specialised code to handle the subtle differences between them!
> **_NOTE:_** How many times can `agnostic_sum_i64_column` be called? Strictly |
Depending on your target audience, you might be able to remove this note. I love the thoroughness, but it is maybe too into the weeds.

I also am not sure I agree with the content - each invocation of the dataframe library is going to yield a new Python capsule. In the non-duckdb cases, those capsules provide a non-owning view of the data; the data has a separate lifetime managed by a different object. I think in the duckdb case, they tie the lifetime of the array to the invocation of the first capsule, but that's more of a technical detail of that library than anything related to the specification itself.
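To illustrate that lifetime point, a small sketch (assuming a recent pyarrow, where `RecordBatchReader.from_stream` accepts any object exporting `__arrow_c_stream__`):

```python
import pyarrow as pa

table = pa.table({"a": [1, 2, 3]})

# Each consumer triggers a fresh __arrow_c_stream__ call under the hood,
# yielding a new capsule; `table` keeps owning the data, so this works
# any number of times.
reader_one = pa.RecordBatchReader.from_stream(table)
reader_two = pa.RecordBatchReader.from_stream(table)

assert reader_one.read_all() == reader_two.read_all()
```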
Again, minor comments and suggestions, but feel free to ignore. Thanks for writing this - glad to see it get more promotion.
Let's cover some scenarios:
- If you want your dataframe logic to stay completely lazy where possible: use Narwhals. |
This comment isn't quite true either - the PyCapsule Interface says nothing about lifetimes. In the dataframe case, it yields a pointer to a stream of arrays. It is up to the consumer when they want to actually start iterating over the stream.
Perhaps the main consumers of the PyCapsule interface choose to do so eagerly, but that's an implementation detail of the libraries themselves, not the PyCapsule Interface
To be fair, it is possible to write lazy logic in Rust/C++ too. Many of the compute functions exported by arro3 are lazy. See the overloads for e.g. `cast`. If you pass in something with a `__arrow_c_stream__` dunder, the stream will not be materialized automatically. Rather, it will return an `ArrayReader`, so that it's fully lazy. Only when the output `ArrayReader` itself is iterated over will the `cast` calls actually happen.
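To make the laziness point concrete, here is a sketch (not from the post; it assumes pyarrow's `RecordBatchReader.from_stream` and a column named `"a"`) of consuming a stream batch-by-batch, so the producer only materialises what is actually pulled:

```python
import pyarrow as pa
import pyarrow.compute as pc

def consume_lazily(obj) -> int:
    # `obj` is any producer implementing __arrow_c_stream__. Wrapping it
    # does not materialise anything; batches are pulled one at a time as
    # we iterate.
    reader = pa.RecordBatchReader.from_stream(obj)
    total = 0
    for batch in reader:
        total += pc.sum(batch.column("a")).as_py()
    return total
```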
- If you want complete control over your data: use the PyCapsule Interface. If you have the necessary Rust / C skills, there should be no limit to how complex and bespoke you make your data processing.
Just a suggestion on ordering - I think most users will be totally fine with Narwhals, especially if your target audience is Python developers. So it might be worth putting Narwhals first in the list.
- If you want to keep your library logic pure-Python and without heavy dependencies so it's easy to maintain and install: use Narwhals. Packaging a pure-Python project is very easy, especially compared with needing to get Rust or C in there.
- If you want to do part of your processing in Python, and part of it in Rust - **use both**! |
If you cared to cater to non-Rust developers, you might also want to mention nanoarrow. I've also done a YouTube video on building Python extensions using C++ - https://www.youtube.com/watch?v=EhUnmXPjTy8&list=PLyPvk-ejh-138onAgM7sVcpzu1HLp-Uiv
Not trying to promote my own content here - just offering in case you wanted to expand the scope of your blog post (which may or may not be worth it)
Overall looks good! I've had a long-stalled PyCapsule post on my back burner too.
> **_NOTE:_** How many times can `agnostic_sum_i64_column` be called? Strictly
> speaking, the PyCapsule Interface only guarantees that it can be called once for
> any given input. In practice, however, all implementations that we're aware of
FWIW, arro3 IO functions like `read_parquet` will, I think, provision a single stream instance. So the first time you use the stream pointer from `read_parquet`, it would consume the stream and you wouldn't be able to use it again.
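If you do need to consume such an output more than once, one hedged workaround (assuming a pyarrow version where `pa.table` accepts stream-exporting objects) is to materialise the one-shot stream into a Table, which can then export fresh streams repeatedly:

```python
import pyarrow as pa

def make_reusable(single_use_stream) -> pa.Table:
    # Consumes the one-shot stream exactly once; the resulting Table owns
    # the data and can produce a new __arrow_c_stream__ capsule per call.
    return pa.table(single_use_stream)
```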
```python
msg = f"Column '{column_name}' is of type {dtype}, expected Int64"
raise TypeError(msg)
df = lf.collect()
return df[column_name].sum()
```
What backend do you use to actually compute the sum?
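For context, here is a hedged reconstruction of how the quoted snippet might fit together with Narwhals; the surrounding code is an assumption, not the post's exact source. The answer to the question above would then be: whichever native backend the caller passed in (pandas, Polars, PyArrow, ...) computes the sum.

```python
import narwhals as nw

def agnostic_sum_i64_column(df_native, column_name: str) -> int:
    frame = nw.from_native(df_native)
    dtype = frame.collect_schema()[column_name]
    if dtype != nw.Int64:
        msg = f"Column '{column_name}' is of type {dtype}, expected Int64"
        raise TypeError(msg)
    # Only collect if the input was lazy; the sum itself is executed by
    # the caller's native backend.
    df = frame.collect() if isinstance(frame, nw.LazyFrame) else frame
    return df[column_name].sum()
```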
- If you want your dataframe logic to stay completely lazy where possible: use Narwhals.
  The PyCapsule Interface requires you to materialise the data into memory immediately,
If anything, I think it's the opposite; the Arrow Stream interface encourages laziness by being defined as a stream. It's only common in-memory implementations that fully materialize data.
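To illustrate that, a minimal sketch (my own names, not from the post) of a producer whose stream is backed by a generator, so nothing is materialised until a consumer iterates:

```python
import pyarrow as pa

schema = pa.schema({"value": pa.int64()})

def generate_batches():
    # Nothing here runs until a consumer pulls from the stream.
    for start in range(0, 1_000_000, 10_000):
        yield pa.RecordBatch.from_pydict(
            {"value": list(range(start, start + 10_000))}, schema=schema
        )

# The reader's __arrow_c_stream__ capsule hands consumers a stream that
# produces batches on demand, rather than a fully materialised table.
reader = pa.RecordBatchReader.from_batches(schema, generate_batches())
```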
to do this in Polars: [Expressions Plugins](https://marcogorelli.github.io/polars-plugins-tutorial/).
How does that differ from writing custom code with the PyCapsule Interface?

- Pros: Polars Plugins slot in seamlessly with Polars' lazy execution. This can enable massive
FWIW it would be nice to have a generic way to use UDFs defined in terms of the PyCapsule Interface with Polars. It wouldn't be that hard to write the glue code.

Also, in case anyone else is curious, I simplified the Rust example: MarcoGorelli/pycapsule-demo#1
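A rough sketch of what that glue code might look like; everything here is hypothetical, and it assumes versions where pyarrow's `chunked_array` accepts stream-exporting objects and Polars' `Series` exports `__arrow_c_stream__`:

```python
import polars as pl
import pyarrow as pa
import pyarrow.compute as pc

def capsule_udf(stream_like) -> pa.ChunkedArray:
    # A UDF written purely against the PyCapsule Interface: it accepts
    # anything exporting __arrow_c_stream__ and consumes it via pyarrow.
    data = pa.chunked_array(stream_like)
    return pc.multiply(data, 2)

def as_polars_udf(udf):
    # Hypothetical glue: adapt a capsule-based UDF for Polars' map_batches.
    def wrapper(s: pl.Series) -> pl.Series:
        return pl.from_arrow(udf(s))
    return wrapper

df = pl.DataFrame({"a": [1, 2, 3]})
out = df.with_columns(pl.col("a").map_batches(as_polars_udf(capsule_udf)))
```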