Read Arrow C Stream from Arrow PyCapsule Interface #501
Conversation
Thanks @kylebarron, could you add a quick comment somewhere in the code for how we could replace the vendored code after updating arrow and pulling in pyo3-arrow?
vegafusion-common/src/data/table.rs (outdated)
// pub fn from_arrow_c_stream(table: pyo3_arrow::PyTable) -> PyResult<Self> {
//     let (batches, schema) = table.into_inner();
//     Ok(VegaFusionTable::try_new(schema, batches)?)
// }
Only a couple lines of code with pyo3_arrow!
Aside: I never know whether there's consistency in the ecosystem between an ordering of `(schema, batches)` or `(batches, schema)` 🫠
@jonmmease are you able to see why the CI checks failed? It just says "job failed" for me
I haven't had a chance to dig into it, but it's fine to leave things here. I'll finish it up for the next VegaFusion release
Quick update. I'm slogging through updating DataFusion and arrow-rs over in #504. Once that lands we can rebase this and drop the vendored code.
Awesome. And I just published pyo3-arrow 0.2 last night
@kylebarron I made some updates to this PR after updating arrow to version 52 and updating pyo3-arrow to 0.2.0. If you have a chance to see if what I have here makes sense, that would be great. I tested this using Polars, and it seems to be working in general, but I immediately ran into an issue where string columns from Polars fail to convert with:
Looks like arrow-rs doesn't support the new Utf8View type in https://github.com/apache/arrow-rs/blob/ca5fe7df5ca0bdcfd6f732cc3f10da511b753c5f/arrow/src/datatypes/ffi.rs#L30. Does that track your understanding? It seems work may be needed in arrow-rs, and I don't think I can route Polars through this path until that support is added. Update: Oh, this was merged two weeks ago in apache/arrow-rs#6171, but arrow-rs 52.2.0 was released 3 weeks ago. So we'll just need to wait a bit, I guess. We can still merge this as a fallback behind the dataframe protocol and then reorder later.
pub fn from_arrow_c_stream(table: &Bound<PyAny>) -> PyResult<Self> {
    let pytable = pyo3_arrow::PyTable::from_py_object_bound(table.as_borrowed())?;
If you keep this method the same to only support `PyTable`: `PyTable` implements `FromPyObject`, so you can make it the parameter directly.
- pub fn from_arrow_c_stream(table: &Bound<PyAny>) -> PyResult<Self> {
-     let pytable = pyo3_arrow::PyTable::from_py_object_bound(table.as_borrowed())?;
+ pub fn from_arrow_c_stream(table: pyo3_arrow::PyTable) -> PyResult<Self> {
pub fn from_arrow_c_stream(table: &Bound<PyAny>) -> PyResult<Self> {
    let pytable = pyo3_arrow::PyTable::from_py_object_bound(table.as_borrowed())?;
    let (batches, schema) = pytable.into_inner();
    Ok(VegaFusionTable::try_new(schema, batches)?)
I never know whether `batches, schema` or `schema, batches` is a better ordering 🥲
@@ -271,6 +272,13 @@ impl VegaFusionTable {
    }
}

#[cfg(feature = "pyarrow")]
pub fn from_arrow_c_stream(table: &Bound<PyAny>) -> PyResult<Self> {
Is this a public API or is this an internal API called by your Python code? If it's a public API, it might be preferred to overload the existing `from_pyarrow` method.
You can check for both the PyCapsule interface and fall back to pyarrow in a single method; that's the approach I'd recommend here. You can do this by first trying to extract a `PyTable`; this will work for any recent pyarrow table object (or pyarrow record batch reader). Then you can fall back to your existing `from_pyarrow` code.
pub fn from_arrow(table: &Bound<PyAny>) -> PyResult<Self> {
    if let Ok(table) = table.extract::<pyo3_arrow::PyTable>() {
        let (batches, schema) = table.into_inner();
        Ok(VegaFusionTable::try_new(schema, batches)?)
    } else {
        Self::from_pyarrow(table)
    }
}
Ah yes, of course we'd hit this. Correct, arrow-rs doesn't support the new StringView type until arrow 53 (arrow-rs didn't merge any breaking changes in 52.1 or 52.2). I believe everything is set for this support in arrow 53, though (due in September). The PyCapsule Interface provides a mechanism to handle this: schema requests. It's a way for consumers to ask producers to cast their data export to a specific data type. I'd like to do a follow-up PR to Polars to implement this, but I haven't yet: pola-rs/polars#17676 (comment). So Polars will currently ignore schema requests. So the two routes are either:
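The schema-request mechanism mentioned above can be sketched in plain Python. This is a hypothetical toy producer/consumer pair (the class and function names are invented for illustration): per the PyCapsule Interface, a producer exposes `__arrow_c_stream__(self, requested_schema=None)`, and a consumer duck-types on that method. A real producer returns a PyCapsule wrapping an `ArrowArrayStream`; here a placeholder string stands in so the control flow runs without any Arrow library.

```python
class IgnoresSchemaRequests:
    """Toy producer that accepts but ignores the requested_schema capsule,
    as Polars currently does."""

    def __arrow_c_stream__(self, requested_schema=None):
        # A real implementation would cast its data to the requested types
        # (when possible) and return an ArrowArrayStream PyCapsule.
        return "arrow_array_stream_capsule"


def import_stream(obj, requested_schema=None):
    """Toy consumer: duck-type on the protocol method before importing."""
    if not hasattr(obj, "__arrow_c_stream__"):
        raise TypeError("object does not export an Arrow C stream")
    return obj.__arrow_c_stream__(requested_schema=requested_schema)


capsule = import_stream(IgnoresSchemaRequests())
```

Because `requested_schema` is only a hint, a consumer must still validate the types it actually receives rather than assume the cast happened.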
Thanks for the context @kylebarron, I'm fine to wait for Arrow 53. But this will also require waiting for a DataFusion release based on Arrow 53, so I guess that could end up being a month later. Separately, does the Arrow PyCapsule interface provide a way to query whether extracting the Arrow data is a cheap operation? Ibis is one scenario I have in mind. I believe Ibis supports the PyCapsule interface, but this must require running the SQL query and pulling down every possible column, which can be very expensive. For Ibis, it's much more efficient to request only the specific columns that we actually need. For a library that already has the Arrow representation materialized in memory, I would expect it to be cheaper to return the full representation rather than build a new one containing only the needed columns. It would be nice if there were a way to say: please return these columns, but it's fine to return more if that's cheaper.
No it doesn't, but I think this largely comes down to user semantics. So in the case of ibis, you might have an ibis … In my mind, the interface is mostly around efficiently communicating between libraries that already have an Arrow representation.
Right. I would like to implement schema requests in Polars regardless, but it's a lower priority than getting it implemented to start with.
As an update here, lamentably arrow 53 supports StringView/BinaryView but not through C FFI; see apache/arrow-rs#6368. So if you're restricted to using a released version of …
I think we are restricted to using a released crate, because we publish the Rust crates to crates.io so that conda-forge can rebuild the sdist from source, and I don't think that will work with a tagged dependency.
If you publish to crates.io, then yes, you can only use released versions. I might be able to get around this in arro3 specifically by vendoring the relevant parts of the FFI fixes.
Arrow 53.1 was released, which appears to fully support string view and binary view: kylebarron/arro3#219. I'd love to get this merged! Let me know if you need any help!
Awesome, I'm working on updating to DataFusion 42 and Arrow 53.1 over in #513. I've started a v2 branch and I'm planning to retarget this PR to v2 after merging that one. For VegaFusion 2, I want to make pandas and pyarrow optional, and this PR is a crucial part of that plan.
I started trying to port this to my v2 branch that's updated to arrow 53.1, but ran into a roadblock. The … Are there any blockers for you in updating to PyO3 0.22? I may be able to drop the … In particular, some of VegaFusion's logic currently returns Arrow data from Rust to Python as pyarrow, and we'll need to think about what to do here in order for pyarrow to be an optional dependency. I suppose we could return the PyCapsule object from Rust and then create PyArrow or Polars tables from that.
Yes. I need the …
All of the relevant pyo3_arrow objects have a …
Ok, so I could remove the …
Yep!
Looks like it's scheduled for Wednesday: PyO3/rust-numpy#453
I published pyo3-arrow 0.5 using …
Starting to work on updating to … Here are the compilation errors with …
The minimum Python version does not need to be 3.11. You have three options:
pyo3-arrow now supports buffer-protocol input (like numpy arrays) with zero copy, as long as they are C contiguous. But the buffer protocol only became part of Python's ABI-stable subset as of Python 3.11. Because of this, arro3 builds per-Python version wheels now. https://pypi.org/project/arro3-core/#files |
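The C-contiguity condition mentioned above can be checked from Python itself via `memoryview`, which is essentially what a buffer-protocol consumer sees. A minimal stdlib-only sketch (the variable names are illustrative):

```python
import array

# A 1-D array of float64 values: one C-contiguous block of memory,
# so it is eligible for zero-copy import.
buf = array.array("d", [1.0, 2.0, 3.0, 4.0])
view = memoryview(buf)
assert view.c_contiguous

# A strided slice of the same memory is no longer contiguous;
# importing it would require a copy.
strided = view[::2]
assert not strided.c_contiguous
```

The same flag is what distinguishes, say, a freshly allocated NumPy array from a transposed or sliced view of one.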
Thanks for the context @kylebarron! When I disable the default features, I'm seeing this build error (regardless of Python version).
It looks like the import here might need to be conditioned on the …
Ah, I need to add a CI check for no default features :/
I published pyo3-arrow 0.5.1 with a CI check to ensure …
Merged over in #517. Thanks for all of the help on this @kylebarron, the combo of …
Closes #498.
This is essentially just a vendored copy of these lines of code (then backported to pyo3 0.20).