Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rework the python bindings [WIP] #856

Closed
wants to merge 12 commits into from
Closed

Conversation

kszucs
Copy link
Member

@kszucs kszucs commented Aug 11, 2021

Which issue does this PR close?

Closes #.

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

@kszucs kszucs added the python label Aug 11, 2021
@github-actions github-actions bot added the datafusion Changes in the datafusion crate label Aug 11, 2021
Copy link
Member

@jorgecarleitao jorgecarleitao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great. Thanks a lot, @kszucs !

My only concern is the addition of the Py prefix in a lot of public stuff. Doesn't this cause all public APIs in Python to be prepended by Py also? IMO there should not be any prefix.

@kszucs
Copy link
Member Author

kszucs commented Aug 11, 2021

My only concern is the addition of the Py prefix in a lot of public stuff. Doesn't this cause all public APIs in Python to be prepended by Py also? IMO there should not be any prefix.

Right, the intent there is to expose the rust python objects under datafusion.internals (this is the current setup, though we can use any other submodule like datafusion._lib) and expose them under datafusion as aliases without the Py prefix. This provides us an additional python layer on top of the rust bindings where we can add pure python related functionality more easily, like:

from datafusion import internals

class DataFrame(internals.PyDataFrame):
    # additional functionality 

One additional advantage of using the Py prefix in the rust code is that it clearly indicates which rust types are actually python extension types.

Note that this is a common approach for python bindings, see py-polars for example.

from .internals import PyExpr as Expr
from .internals import functions

__all__ = [
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jorgecarleitao the symbols are exported without the Py prefix.

@kszucs kszucs changed the title Rework the python bindings Rework the python bindings [WIP] Aug 11, 2021
Copy link
Member

@jorgecarleitao jorgecarleitao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ahhh, that makes sense. Thanks for the clarification! LGTM

fn to_pyarrow(&self, py: Python) -> PyResult<PyObject>;
}

impl PyArrowConvert for DataType {
Copy link
Member Author

@kszucs kszucs Aug 11, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to try adding this module to arrow-rs as an optional one, so we can implement the PyO3 conversion traits, like FromPyObject directly for the arrow types.
This will further reduce the required conversion boilerplate in the python bindinds.

For example we should be able to write

fn create_dataframe(
        &mut self,
        partitions: Vec<Vec<RecordBatch>>,
    ) -> PyResult<PyDataFrame> {
        // partitions are going to be converted by PyO3 automatically

instead of the current

fn create_dataframe(
        &mut self,
        partitions: Vec<Vec<&PyAny>>,
    ) -> PyResult<PyDataFrame> {
        let partitions: Vec<Vec<RecordBatch>> = partitions
            .into_iter()
            .map(|batches| {
                batches
                    .into_iter()
                    .map(RecordBatch::from_pyarrow)
                    .collect::<PyResult<_>>()
            })
            .collect::<PyResult<_>>()?;

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I admit part of the reason I avoided something like that at the time was to avoid canibalizing pyarrow, since it makes it appealing from there to have arrow-rs exposed as a Python library, which imo would add confusion to the Python ecosystem (having two official Python libraries, one from each implementation).

OTOH, polars, and likely others, would benefit from such a pyo3 library, as anyone building bindings on top of arrow-rs could re-use that code.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jorgecarleitao created a prototype, see the referenced PRs below. I guess the code should be portable to arrow2 as well.

@kszucs
Copy link
Member Author

kszucs commented Sep 1, 2021

Closing in favor of #873

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
datafusion Changes in the datafusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants