Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rework the python bindings using conversion traits from arrow-rs #873

Merged
merged 21 commits into from
Nov 2, 2021

Conversation

kszucs
Copy link
Member

@kszucs kszucs commented Aug 13, 2021

Alternative to #856
Inspired by #856 (comment)
Depends on apache/arrow-rs#691

The CI will probably fail since I hardcoded arrow-rs local dependency.

@github-actions github-actions bot added datafusion Changes in the datafusion crate python sql SQL Planner labels Aug 13, 2021
datafusion/src/lib.rs Outdated Show resolved Hide resolved
datafusion/Cargo.toml Outdated Show resolved Hide resolved
python/Cargo.toml Show resolved Hide resolved
python/Cargo.toml Outdated Show resolved Hide resolved
datafusion/Cargo.toml Outdated Show resolved Hide resolved
@cpcloud
Copy link
Contributor

cpcloud commented Aug 13, 2021

Any chance of rewording the commits to be something other than single characters?

@kszucs
Copy link
Member Author

kszucs commented Aug 13, 2021

Any chance of rewording the commits to be something other than single characters?

Squashed.

@kszucs
Copy link
Member Author

kszucs commented Aug 30, 2021

@jorgecarleitao @andygrove we should roll up apache/arrow-rs#691 and this PR so we can proceed with the ibis datafusion backend.

@houqp
Copy link
Member

houqp commented Sep 7, 2021

The simplification looks great, thanks @kszucs! Looks like there are some minor rust build errors that need to be addressed.

@kszucs kszucs changed the title Rework the python bindings using conversion traits from arrow-rs [WIP] Rework the python bindings using conversion traits from arrow-rs Sep 8, 2021
@kszucs
Copy link
Member Author

kszucs commented Sep 8, 2021

The simplification looks great, thanks @kszucs! Looks like there are some minor rust build errors that need to be addressed.

I had to bump the arrow-rs dependency everywhere so we need to address a couple more issues.

@houqp there is an async sort order issue which is rather hard to comprehend, any ideas what could have caused this error?

Copy link
Member

@jorgecarleitao jorgecarleitao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great so far! It is a awesome simplification!

I am curious about the naming equivalents but I think we should merge this regardless.

An observation: from here on we are bound to use the same pyo3 version used by arrow-rs.

I suggest an extra reviewer (e.g.  @alamb, @houqp) -- I focused mostly on the Python part.

database: Arc<dyn SchemaProvider>,
}

#[pyclass(name = "Table", subclass)]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this related to pa.Table? It is the Python-equivalent of the TableProvider, right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope, it was an attempt to expose the catalog information to ibis, but I may defer the catalog.rs to a follow-up PR instead.

catalog: Arc<dyn CatalogProvider>,
}

#[pyclass(name = "Database", subclass)]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here; this is SchemaProvider in DataFusion, right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. I'm going to remove this module from the PR to avoid confusion.

datafusion/src/pyarrow.rs Show resolved Hide resolved
@kszucs
Copy link
Member Author

kszucs commented Sep 8, 2021

I need to cover the newly exposed python API with unittests.

@houqp
Copy link
Member

houqp commented Sep 9, 2021

@kszucs I took a quick look at the physical_plan::sort_preserving_merge::tests::test_async error, it is caused by the arrow-rs upgrade. I haven't had the time to trace down on the offending arrow-rs commit yet. Will probably take a closer look at it today or tomorrow.

@alamb
Copy link
Contributor

alamb commented Sep 9, 2021

@kszucs I took a quick look at the physical_plan::sort_preserving_merge::tests::test_async error, it is caused by the arrow-rs upgrade. I haven't had the time to trace down on the offending arrow-rs commit yet. Will probably take a closer look at it today or tomorrow.

I am pretty sure I saw a failure in the sort_preserving_merge tests related to apache/arrow-rs#656

@kszucs
Copy link
Member Author

kszucs commented Sep 9, 2021

@kszucs I took a quick look at the physical_plan::sort_preserving_merge::tests::test_async error, it is caused by the arrow-rs upgrade. I haven't had the time to trace down on the offending arrow-rs commit yet. Will probably take a closer look at it today or tomorrow.

I am pretty sure I saw a failure in the sort_preserving_merge tests related to apache/arrow-rs#656

Thanks Andrew for the heads up! I wonder, wouldn't it better if we would assert on record batch equality rather than its string representation?

@alamb
Copy link
Contributor

alamb commented Sep 9, 2021

Thanks Andrew for the heads up! I wonder, wouldn't it better if we would assert on record batch equality rather than its string representation?

@kszucs -- I agree that doing so would avoid such "small floating point changes require many test changes" type problem. I think the downside is that (in my opinion) the tests are then harder to read and update.

Let me prepare a draft PR for updating datafusion to the latest arrow-rs so we can at least decouple the "changes needed for just arrow-rs upgrade" from the other changes in this PR

@kszucs
Copy link
Member Author

kszucs commented Sep 9, 2021

@kszucs -- I agree that doing so would avoid such "small floating point changes require many test changes" type problem. I think the downside is that (in my opinion) the tests are then harder to read and update.

I understand, though the test actually tests that the two representations are equal rather than that the results are equal.
Aren't there other assertion crates providing better checks, perhaps with optional additional context (as a possible future improvement)?

Let me prepare a draft PR for updating datafusion to the latest arrow-rs so we can at least decouple the "changes needed for just arrow-rs upgrade" from the other changes in this PR

Great, thank You!

@alamb
Copy link
Contributor

alamb commented Sep 9, 2021

I understand, though the test actually tests that the two representations are equal rather than that the results are equal.
Aren't there other assertion crates providing better checks, perhaps with optional additional context (as a possible future improvement)?

🤔 Yes , you are correct. There may be something deeper going on here and perhaps related to the change to use unstable sort in apache/arrow-rs#552. All the more reason to debug it now. I will work on that

@alamb
Copy link
Contributor

alamb commented Sep 9, 2021

#984 is the PR to update to arrow-rs master. The sort_preserving_merge test is indeed failing and I will continue to debug on that PR.

@houqp
Copy link
Member

houqp commented Sep 10, 2021

@kszucs @jorgecarleitao quick question, do we want to release this change as part of 0.3.0 or delay to 0.4.0?

@kszucs
Copy link
Member Author

kszucs commented Nov 2, 2021

@houqp rebased

@houqp houqp merged commit 1c351ec into apache:master Nov 2, 2021
@houqp
Copy link
Member

houqp commented Nov 2, 2021

Thank you @kszucs ! Great simplification to the python binding :)

@houqp houqp added enhancement New feature or request and removed datafusion Changes in the datafusion crate ballista sql SQL Planner labels Nov 4, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants