-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Proposal: remove automated ballista CI checks from DataFusion #2679
Comments
I think it is worth also highlighting that the arrow releases every 2 weeks are often breaking, at least on paper, and currently require going through this process. I put some thoughts on this - #2583 (comment) |
I'm fine with removing these checks. |
We have two separate projects in the company which use DF as a Query Engine, and we came to the decision to maintain two different versions in the forked repository because it's impossible to use the latest version, there are some problems: It's hard to contribute back to the upstream. We had subqueries before DF introduced it, We have ANY and UDTFs, etc. We tried to push it back to the upstream, but It's not merged and probably, will not be merged soon. Rebasing PRs/Forks when someone loves to move functions/doing breaking changes on daily basis is funny 😄 For my project, we rebase every 1-2 months. While rebasing we choose the latest commit from the |
Yeah, I do not envy you this task... I have enough difficulty keeping some of my longer-running PRs up to date 😅 FWIW if there is some way I could help you move onto tracking master more directly I would be happy to help out. In particular I notice your fork of arrow-rs is about 9 months out of date, which is likely to become quite painful as DataFusion comes to rely on it even more... 😅
I'm sorry to hear this has been your experience. As mentioned above I'll happily help get contributions shepherded through, but the further things get from arrow-rs the less context I have to be able to meaningfully review things... |
If it's any consolation, once #2675 and #2682 are merged I have no more refactoring planned other than removing some of the re-exports in the crate (#2683) in an effort to stabilize the public API for DataFusion. We're also moving to monthly releases now, with the next release planned for ~2 weeks from now (#2676) so re-basing monthly might be worth considering. With the refactoring work complete I will have more time for reviews and to help get PRs merged. |
Also @ovr in IOx, we have found that by using the extension mechanisms in DataFusion (user defined Exprs, User Defined Plan Nodes, etc) we have been able to add IOx specific functionality / plans without a fork of DataFusion, which minimizes the amount of maintenance effort we need to expend to use master datafusion directly |
Seems reasonable. It does create a sort of cyclic dependency :) |
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Basically the tests added in #2582 to keep ballista and datafusion in sync add significant burden to DataFusion development and I propose removing them, at least temporarily
Here is a description of the process:
https://github.com/apache/arrow-datafusion/blob/907504c/.github/pull_request_template.md?plain=1#L31-L47
I think the rationale for the new CI was to add friction on DataFusion API changes to encourage a more stable API; However, with the currently ongoing efforts to rework the object store and parquet reading, I think all we are doing with the process is slowing things down.
The alternative, to have Ballista keep up with changes in DataFusion, sounds daunting at first, but my firsthand experience suggests it is not that bad. Specifically, https://github.com/influxdata/influxdb_iox, my project, uses DataFusion similarly to Ballista (as the core query engine) and uses a DataFusion pin directly from master. Instead of impinging on the DataFusion development process, we keep IOx up with DataFusion by manually updating the DataFusion pin in IOX about once a week , and sorting out any API changes.
This does take time, but it is mostly mechanical. We do occasionally find bugs that were introduced into DataFusion such as when we tried most recently with https://github.com/influxdata/influxdb_iox/pull/4743 and we then contribute a fix back upstream (e.g. #2674)
I would be interested to hear how others keep up with pre-release DataFusion as well (maybe @ovr and cube-js?)
Describe the solution you'd like
I propose removing the Ballista CI check in DataFusion
Specifically this check: https://github.com/apache/arrow-datafusion/blob/907504c5aa768601f9d70ad2c8f928bedfa9b069/.github/workflows/rust.yml#L128-L172
And writing up instructions (maybe even automation) on how to upgrade the datafusion pin in Ballista manually
Describe alternatives you've considered
Additional context
The move of Ballista to a new repo is tracked in: #2502
There are several discussions about this pain:
cc @andygrove @thinkharderdev @ming535 @Ted-Jiang @xudong963 @tustvold @korowa
The text was updated successfully, but these errors were encountered: