[BLOG] Arrow PyCapsule Interface and Narwhals for universal dataframe support #894
cc @kylebarron @WillAyd Here's a blog post I've put together on the PyCapsule Interface and Narwhals - just pinging you in case you were interested in reading about the treatment of the former, and whether I've misrepresented anything (if not, no worries at all! Feel free to unsubscribe). The post preview is available at: https://labs-lr1c4romv-quansight.vercel.app/blog/narwhals-pycapsule
looks great
The technical details are beyond the scope of this post, but the summary is:
- We accept any object which implements the
Not sure how technical you want to be, and I realize this post is aimed at dataframe libraries, but there are other methods that can be implemented that still fall under the PyCapsule Interface:

- `__arrow_c_array__`
- `__arrow_c_schema__`
- `__arrow_c_stream__`
- `__arrow_c_device_array__`
- `__arrow_c_device_stream__`

Stream may be the most fitting for exchanging dataframes, but the others have uses outside of 2-D structures. You may want to be careful about saying that the presence of this method alone means you are implementing the PyCapsule Interface.
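For readers who want to see what that looks like in practice, here is a minimal sketch (mine, not from the post or the spec; the helper name is made up) of probing which of these dunders an object exposes:

```python
# Any subset of these dunders can be present; __arrow_c_stream__ alone
# does not imply the others.
PYCAPSULE_DUNDERS = (
    "__arrow_c_schema__",
    "__arrow_c_array__",
    "__arrow_c_stream__",
    "__arrow_c_device_array__",
    "__arrow_c_device_stream__",
)

def implemented_capsule_methods(obj) -> list[str]:
    """Return which Arrow PyCapsule Interface dunders `obj` implements."""
    return [name for name in PYCAPSULE_DUNDERS if hasattr(obj, name)]
```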
Thanks! Everything looks good, and it's an interesting read!
without us having to write any specialised code to handle the subtle differences between them!
> **_NOTE:_** How many times can `agnostic_sum_i64_column` be called? Strictly |
Depending on your target audience, you might be able to remove this note. I love the thoroughness, but it is maybe too into the weeds.

I also am not sure I agree with the content - each invocation of the dataframe library is going to yield a new Python capsule. In the non-duckdb cases, those capsules provide a non-owning view of the data; the data has a separate lifetime managed by a different object. I think in the duckdb case, they tie the lifetime of the array to the invocation of the first capsule, but that's more of a technical detail of that library than anything related to the specification itself.
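To illustrate that lifetime point, a small sketch (assuming a recent pyarrow, where `RecordBatchReader.from_stream` accepts any object exporting `__arrow_c_stream__`):

```python
import pyarrow as pa

table = pa.table({"a": [1, 2, 3]})

# Each consumer triggers a fresh __arrow_c_stream__ call under the hood,
# yielding a new capsule; `table` keeps owning the data, so this works
# any number of times.
reader_one = pa.RecordBatchReader.from_stream(table)
reader_two = pa.RecordBatchReader.from_stream(table)

assert reader_one.read_all() == reader_two.read_all()
```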
Again, minor comments and suggestions, but feel free to ignore. Thanks for writing this - glad to see it get more promotion.
Let's cover some scenarios:
- If you want your dataframe logic to stay completely lazy where possible: use Narwhals. |
This comment isn't quite true either - the PyCapsule Interface says nothing about lifetimes. In the dataframe case, it yields a pointer to a stream of arrays. It is up to the consumer when they want to actually start iterating over the stream.
Perhaps the main consumers of the PyCapsule interface choose to do so eagerly, but that's an implementation detail of the libraries themselves, not the PyCapsule Interface
To be fair, it is possible to write lazy logic in Rust/C++ too. Many of the compute functions exported by arro3 are lazy. See the overloads for e.g. `cast`. If you pass in something with a `__arrow_c_stream__` dunder, the stream will not be materialized automatically. Rather, it will return an `ArrayReader`, so that it's fully lazy. Only when the output `ArrayReader` itself is iterated over will the `cast` calls actually happen.
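To make the laziness point concrete, here is a sketch (not from the post; it assumes pyarrow's `RecordBatchReader.from_stream` and a column named `"a"`) of consuming a stream batch-by-batch, so the producer only materialises what is actually pulled:

```python
import pyarrow as pa
import pyarrow.compute as pc

def consume_lazily(obj) -> int:
    # `obj` is any producer implementing __arrow_c_stream__. Wrapping it
    # does not materialise anything; batches are pulled one at a time as
    # we iterate.
    reader = pa.RecordBatchReader.from_stream(obj)
    total = 0
    for batch in reader:
        total += pc.sum(batch.column("a")).as_py()
    return total
```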
- If you want complete control over your data: use the PyCapsule Interface. If you have the necessary Rust / C skills, there should be no limit to how complex and bespoke you make your data processing.
Just a suggestion on ordering - I think most users will be totally fine with Narwhals, especially if your target audience is Python developers. So it might be worth putting Narwhals first in the list.
- If you want to keep your library logic pure-Python and without heavy dependencies so it's easy to maintain and install: use Narwhals. Packaging a pure-Python project is very easy, especially compared with needing to get Rust or C in there.
- If you want to do part of your processing in Python, and part of it in Rust - **use both**! |
If you cared to cater to non-Rust developers, you might also want to mention nanoarrow. I've also done a YouTube video on building Python extensions using C++ - https://www.youtube.com/watch?v=EhUnmXPjTy8&list=PLyPvk-ejh-138onAgM7sVcpzu1HLp-Uiv
Not trying to promote my own content here - just offering in case you wanted to expand the scope of your blog post (which may or may not be worth it)
Overall looks good! I've had a long-stalled PyCapsule post on my back burner too.
> **_NOTE:_** How many times can `agnostic_sum_i64_column` be called? Strictly
> speaking, the PyCapsule Interface only guarantees that it can be called once for
> any given input. In practice, however, all implementations that we're aware of
FWIW, arro3 IO functions like `read_parquet` will, I think, provision a single stream instance. So the first time you use the stream pointer from `read_parquet`, it would consume the stream and you wouldn't be able to use it again.
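If you do need to consume such an output more than once, one hedged workaround (assuming a pyarrow version where `pa.table` accepts stream-exporting objects) is to materialise the one-shot stream into a Table, which can then export fresh streams repeatedly:

```python
import pyarrow as pa

def make_reusable(single_use_stream) -> pa.Table:
    # Consumes the one-shot stream exactly once; the resulting Table owns
    # the data and can produce a new __arrow_c_stream__ capsule per call.
    return pa.table(single_use_stream)
```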
```python
msg = f"Column '{column_name}' is of type {dtype}, expected Int64"
raise TypeError(msg)
df = lf.collect()
return df[column_name].sum()
```
What backend do you use to actually compute the sum?
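For context, here is a hedged reconstruction of how the quoted snippet might fit together with Narwhals; the surrounding code is an assumption, not the post's exact source. The answer to the question above would then be: whichever native backend the caller passed in (pandas, Polars, PyArrow, ...) computes the sum.

```python
import narwhals as nw

def agnostic_sum_i64_column(df_native, column_name: str) -> int:
    frame = nw.from_native(df_native)
    dtype = frame.collect_schema()[column_name]
    if dtype != nw.Int64:
        msg = f"Column '{column_name}' is of type {dtype}, expected Int64"
        raise TypeError(msg)
    # Only collect if the input was lazy; the sum itself is executed by
    # the caller's native backend.
    df = frame.collect() if isinstance(frame, nw.LazyFrame) else frame
    return df[column_name].sum()
```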
- If you want your dataframe logic to stay completely lazy where possible: use Narwhals.
  The PyCapsule Interface requires you to materialise the data into memory immediately,
If anything, I think it's the opposite; the Arrow Stream interface encourages laziness by being defined as a stream. It's only common in-memory implementations that fully materialize data.
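To illustrate that, a minimal sketch (my own names, not from the post) of a producer whose stream is backed by a generator, so nothing is materialised until a consumer iterates:

```python
import pyarrow as pa

schema = pa.schema({"value": pa.int64()})

def generate_batches():
    # Nothing here runs until a consumer pulls from the stream.
    for start in range(0, 1_000_000, 10_000):
        yield pa.RecordBatch.from_pydict(
            {"value": list(range(start, start + 10_000))}, schema=schema
        )

# The reader's __arrow_c_stream__ capsule hands consumers a stream that
# produces batches on demand, rather than a fully materialised table.
reader = pa.RecordBatchReader.from_batches(schema, generate_batches())
```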
to do this in Polars: [Expressions Plugins](https://marcogorelli.github.io/polars-plugins-tutorial/).
How does that differ from writing custom code with the PyCapsule Interface?

- Pros: Polars Plugins slot in seamlessly with Polars' lazy execution. This can enable massive
FWIW it would be nice to have a generic way to use UDFs defined in terms of the PyCapsule Interface with Polars. It wouldn't be that hard to write the glue code.

Also, in case anyone else is curious, I simplified the Rust example: MarcoGorelli/pycapsule-demo#1
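A rough sketch of what that glue code might look like; everything here is hypothetical, and it assumes versions where pyarrow's `chunked_array` accepts stream-exporting objects and Polars' `Series` exports `__arrow_c_stream__`:

```python
import polars as pl
import pyarrow as pa
import pyarrow.compute as pc

def capsule_udf(stream_like) -> pa.ChunkedArray:
    # A UDF written purely against the PyCapsule Interface: it accepts
    # anything exporting __arrow_c_stream__ and consumes it via pyarrow.
    data = pa.chunked_array(stream_like)
    return pc.multiply(data, 2)

def as_polars_udf(udf):
    # Hypothetical glue: adapt a capsule-based UDF for Polars' map_batches.
    def wrapper(s: pl.Series) -> pl.Series:
        return pl.from_arrow(udf(s))
    return wrapper

df = pl.DataFrame({"a": [1, 2, 3]})
out = df.with_columns(pl.col("a").map_batches(as_polars_udf(capsule_udf)))
```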