[BLOG] Arrow PyCapsule Interface and Narwhals for universal dataframe support #894
Not sure how technical you want to be, and I realize this post is aimed at dataframe libraries, but there are other methods that can be implemented that still fall under the PyCapsule Interface.
Stream may be the most fitting for exchanging dataframes, but the other ones have uses outside of 2-D structures. You may want to be careful about saying that the presence of this method is exclusively what it means to implement the PyCapsule Interface.
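To make that concrete, here is a rough sketch of checking for all three CPU protocol methods rather than only the stream one. The helper `arrow_capsule_exports` and the two toy classes are hypothetical, made up for illustration; only the dunder names come from the Arrow PyCapsule Interface.

```python
# Dunder methods defined by the Arrow PyCapsule Interface (CPU variants).
ARROW_CAPSULE_METHODS = (
    "__arrow_c_schema__",
    "__arrow_c_array__",
    "__arrow_c_stream__",
)


def arrow_capsule_exports(obj):
    """Return the names of the PyCapsule Interface methods that `obj` exposes."""
    return [
        name
        for name in ARROW_CAPSULE_METHODS
        if callable(getattr(obj, name, None))
    ]


class ToyField:
    # Hypothetical schema-only object: not a dataframe at all.
    def __arrow_c_schema__(self):
        raise NotImplementedError("illustration only")


class ToyTable:
    # Hypothetical dataframe-like object that exports a stream.
    def __arrow_c_stream__(self, requested_schema=None):
        raise NotImplementedError("illustration only")


print(arrow_capsule_exports(ToyField()))
print(arrow_capsule_exports(ToyTable()))
print(arrow_capsule_exports(object()))
```

Both toy objects take part in the interface, but only one of them would pass a check that looks for `__arrow_c_stream__` alone.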
Depending on your target audience, you might be able to remove this note. I love the thoroughness, but it may be too far into the weeds.
I'm also not sure I agree with the content: each call into the dataframe libraries is going to yield a new Python capsule. In the non-duckdb cases, those capsules provide a non-owning view of the data; the data has a separate lifetime managed by a different object. I think in the duckdb case, they are tying the lifetime of the array to the invocation of the first capsule, but that's more of a technical detail of that library than anything related to the specification itself.
FWIW, I think arro3 IO functions like `read_parquet` will provision a single stream instance. So the first time you use the stream pointer from `read_parquet`, I think it would consume the stream, and you wouldn't be able to use it again.
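If that's right, the behaviour would look roughly like the plain-Python analogy below. Nothing here is arro3 API: `OneShotStream` and its `export` method are hypothetical stand-ins for a reader-backed object and its `__arrow_c_stream__` call, and the "batches" are ordinary lists.

```python
class OneShotStream:
    """Hypothetical stand-in for a reader-backed stream export (not arro3 code)."""

    def __init__(self, batches):
        self._batches = iter(batches)
        self._consumed = False

    def export(self):
        # Stand-in for __arrow_c_stream__: hands over the underlying iterator once.
        if self._consumed:
            raise ValueError("stream has already been consumed")
        self._consumed = True
        return self._batches


stream = OneShotStream([[1, 2], [3, 4]])
first = list(stream.export())  # the first consumer drains the stream

try:
    stream.export()
    second_export_failed = False
except ValueError:
    second_export_failed = True

print(first, second_export_failed)
```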
What backend do you use to actually compute the sum?
This comment isn't quite true either: the PyCapsule Interface says nothing about lifetimes. In the dataframe case, it yields a pointer to a stream of arrays. It is up to the consumer when they want to actually start iterating over the stream.
Perhaps the main consumers of the PyCapsule Interface choose to do so eagerly, but that's an implementation detail of the libraries themselves, not of the PyCapsule Interface.
To be fair, it is possible to write lazy logic in Rust/C++ too. Many of the compute functions exported by `arro3` are lazy. See the overloads for e.g. `cast`. If you pass in something with an `__arrow_c_stream__` dunder, the stream will not be materialized automatically. Rather, it will return an `ArrayReader`, so that it's fully lazy. Only when the output `ArrayReader` itself is iterated over will the `cast` calls actually happen.
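As a minimal sketch of that pattern, here is a plain-Python analogy using generators. It does not call arro3: `lazy_cast` and the `log` list are hypothetical, and lists of numbers stand in for Arrow arrays. The point is only that building the reader does no work, and the casts run as the consumer iterates.

```python
def lazy_cast(batches, to_type, log):
    """Hypothetical lazy cast: returns a generator, so no work happens at call time."""
    for batch in batches:
        log.append(f"casting batch of {len(batch)}")
        yield [to_type(value) for value in batch]


log = []
reader = lazy_cast([[1, 2], [3]], float, log)
calls_before_iteration = list(log)  # snapshot taken before anything is consumed

result = list(reader)  # iterating the reader is what drives the casts

print(calls_before_iteration, result, log)
```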
If anything, I think it's the opposite; the Arrow Stream interface encourages laziness by being defined as a stream. It's only common in-memory implementations that fully materialize data.
Just a suggestion on ordering: I think most users will be totally fine with narwhals, especially if your target audience is Python developers. So it might be worth putting narwhals first on the list.
If you care to cater to non-Rust developers, you might also want to mention nanoarrow. I've also done a YouTube video on building Python extensions using C++: https://www.youtube.com/watch?v=EhUnmXPjTy8&list=PLyPvk-ejh-138onAgM7sVcpzu1HLp-Uiv
Not trying to promote my own content here, just offering in case you wanted to expand the scope of your blog post (which may or may not be worth it).
FWIW, it would be nice to have a generic way to use UDFs defined in terms of the PyCapsule Interface with Polars. It wouldn't be that hard to write the glue code.