[DISCUSS] Document criteria for adding new features / what belongs in core DataFusion (e.g. sql syntax, functions, etc) #12357

alamb · 2024-09-06T13:45:21Z

Is your feature request related to a problem or challenge?

DataFusion is growing by almost all measures: community 🤗, features 🪶, and codebase size ✅, which is good 🎉. However, this growth is causing challenges such as:

Lengthy review cycles (especially for new features). For example the PR for lateral subqueries took 5 weeks to review and merge
PRs that are written but then not merged as they seem to be too large in scope (e.g. hugging face from @xinlifoobar, FlightSQLDriver from @ccciudatu, etc.)
Uncertainty on feature scope -- for example, should we be adding all the (very cool) DuckDB SQL extensions / aggregates to make the default SQL engine as easy as possible, or should those be implemented as extension packages?

As described in the Design Goals, it is important for DataFusion to:

Work “out of the box”: Provide a very fast, world class query engine with minimal setup or required configuration.
Customizable everything: All behavior should be customizable by implementing traits.

However, this description doesn't offer any specific criteria about which features should be in the core (to work "out of the box") and which should be implemented as extensions.

I am worried that if we take all possibly useful features, the DataFusion core will become unmanageable / unmaintainable. We are already struggling with review capacity (it takes days / weeks to review new feature PRs).

Describe the solution you'd like

I would like a clearly articulated set of criteria for when features should be added to the core vs when they should live in downstream projects / crates built with the extension APIs.

Describe alternatives you've considered

No response

Additional context

No response

alamb · 2024-09-06T13:52:28Z

It seems to me we also haven't documented anywhere that "the built-in SQL dialect tries to follow PostgreSQL semantics when possible".

alamb · 2024-09-06T13:55:22Z

Some ideas for potential criteria

Bug fixes for existing features: yes
Performance improvements to existing features: yes
Functional improvements to existing features: Yes
New functionality that is part of the "standard sql" (aka postgres dialect): Yes
New APIs for extending various parts of DataFusion: Yes
New functions that aren't part of the "standard sql" (aka postgres dialect): No (rationale being they can be implemented as extensions / packages)
New data sources (e.g. support for ORC): No (rationale being they can be implemented using extensions easily)

jayzhan211 · 2024-09-06T14:54:08Z

Uncertainty on feature scope -- for example, should we be adding https://github.com/apache/datafusion/issues/12254 to make the default SQL engine as easy as possible, or should those be implemented as extension packages?

It's great to start this discussion; I have questions about this too. I don't think we need to have every function in DataFusion. Core functions like count and sum should without doubt live in the DataFusion core. Functions that require a Trait implementation are also reasonable to keep, to showcase that the Trait is extensible enough. Others are nice to have, but downstream projects could implement them themselves. For now, at least, we should focus on more important things such as performance and extensibility. See Roadmap #11442

cisaacson · 2024-09-06T16:14:44Z

@alamb I fully agree with your recommendation. It maintains the power of DataFusion while avoiding too much complexity. In my mind (and, I think, in the project's), DataFusion is first and foremost an extensible query engine, so that many new things can be implemented as a result. That purpose means the core features should be limited to those things that enable extensibility, rather than trying to bundle it all into DataFusion itself.

findepi · 2024-09-06T19:26:51Z

thanks for starting this discussion @alamb. this was a very clearly missing clarification. let's get it discussed!

DataFusion is first and foremost an extensible query engine, so that many new things can be implemented as a result.

@cisaacson very good point!
this might be what brings most of the maintainers to the project and funds their time.

Work “out of the box”: Provide a very fast, world class query engine with minimal setup or required configuration.

This implies a reasonable collection of functions should be bundled to make the engine useful for the end user.

For people building things on top of DataFusion, core performance & extensibility are must-haves. If we want DataFusion to be "just" an extensible query engine, we can stop here.

For users, functionality (broadly speaking) and out of the box experience are must-haves.
If we want to have users, we need to understand what they want, what will make them happy.
Users may interact with DF via our CLI, our next-gen CLI #11979 or projects that build on top of DF (but not necessarily by repackaging it) like Ibis (https://ibis-project.org/backends/datafusion).
My personal opinion is that we want users & this will help this project grow and indirectly attract more people building on top of it.

Performance improvements to existing features: yes

This is an obviously good call. In practice we will face trade-offs, like first-row latency vs throughput, that depend on the intended typical use case. Hopefully this does not happen too often.

2010YOUY01 · 2024-09-07T03:44:13Z

Some ideas for potential criteria

Bug fixes for existing features: yes

Performance improvements to existing features: yes

Functional improvements to existing features: Yes

New functionality that is part of the "standard sql" (aka postgres dialect): Yes

New APIs for extending various parts of DataFusion: Yes

New functions that aren't part of the "standard sql" (aka postgres dialect): No (rationale being they can be implemented as extensions / packages)

New data sources (e.g. support for ORC): No (rationale being they can be implemented using extensions easily)

I think maintaining only SQL features that are also built-in features in PostgreSQL is a good idea (and a very clear criterion).
Postgres should have a smaller feature set than DuckDB/SparkSQL. Many DuckDB functions are really good for UX, but the larger size makes them hard to maintain in DataFusion core.

jayzhan211 · 2024-09-07T05:04:37Z

I think the reason why DuckDB is also taken into consideration is that when we started on the array functions, we found that an OLAP-style DB is a much more suitable choice to follow than Postgres. #6855

Therefore, I'm not sure that sticking with Postgres only is a good idea. We might need to discuss it case by case.

alamb · 2024-09-07T10:49:00Z

I think the reason why DuckDB is also taken into consideration is that when we started on the array functions, we found that an OLAP-style DB is a much more suitable choice to follow than Postgres. #6855

I agree -- this is a good point. There are certain feature sets (VARIANT is another more recent one that comes to mind) where there really isn't a good postgres feature set to follow, and if we want to add such features into DataFusion then we would need to find another standard.

Therefore, I'm not sure that sticking with Postgres only is a good idea. We might need to discuss it case by case.

I think it would be an excellent idea to make sure we don't end up with some "hard and fast rule that must always be followed" -- ensuring we can continue to evaluate each idea on a case by case basis is a great point. Maybe in these cases the "bar" is higher (like a good amount of the community thinks it is an important and widely applicable feature 🤔 )

jonahgao · 2024-09-07T14:01:31Z

I agree with keeping the DataFusion core simple and focused.

I am wondering whether we should maintain an index service, or something like the VSCode marketplace, to showcase third-party extensions developed by other users and make it easy for users to find the extensions they need. These extensions would display different properties depending on their type, such as TableProvider or UDF. We may need to do some work to make integrating extensions into DataFusion easier.

phillipleblanc · 2024-09-07T15:09:04Z

I like the analogy that DataFusion is to query engines what LLVM is to programming languages. (I think I heard Andrew say that once?) Although the analogy isn't perfect, because you can use DataFusion out of the box for a great SQL query experience whereas LLVM (to my knowledge) requires writing a non-trivial amount of code to integrate with it.

Actually I think it's because DataFusion has such a great out-of-the-box experience that people naturally want to add to it to make it even better.

For people building things on top of DataFusion, core performance & extensibility are must-haves. If we want DataFusion to be "just" an extensible query engine, we can stop here.

For users, functionality (broadly speaking) and out of the box experience are must-haves.
If we want to have users, we need to understand what they want, what will make them happy.

This is an important distinction, and where we need to decide if we want to be more like LLVM (i.e. focus on people building things on top of DataFusion) or something that attracts users directly. I don't think that those are mutually exclusive (i.e. most users probably are people building on top of DataFusion) - but I do think it makes sense to focus more on the core part of what makes DataFusion great as mentioned above.

I am thinking whether we should maintain an index service or something like VSCode marketplace to showcase third-party extensions developed by other users and make it easy for users to find the extensions they need. These extensions display different properties based on different types, such as TableProvider or UDF. We may need to do some work to make integrating extensions into DataFusion easier.

Yes, I think part of the solution here is to make it very easy to discover extensions that add to the base DataFusion functionality. I think part of why it's tempting to add new features to DataFusion core is that it makes them more discoverable by default and provides a natural coordination point for implementing a set of functionality.

As a concrete example, when I was first integrating with DataFusion for my project I needed the ability to translate DataFusion expressions back into raw SQL strings to implement a TableProvider. I found the unparser module that implements this functionality by looking through the code in the DataFusion repo. Had I not found it, I probably would have gone on to implement it myself. But I could make the argument that the unparser module doesn't really need to be in DataFusion core, it could just be an extension or in its own crate. (Actually in that case, the unparser was already in a separate crate initially then it was brought into the core.) Having a natural way to discover functionality like the unparser without having to be in the core repo is important to solve - otherwise there will always be an incentive to try to get it into the core. Or we end up with multiple people implementing the same thing instead of working on it together - which almost happened in the unparser example.

alamb · 2024-09-10T14:41:18Z

This is an important distinction, and where we need to decide if we want to be more like LLVM (i.e. focus on people building things on top of DataFusion) or something that attracts users directly. I don't think that those are mutually exclusive (i.e. most users probably are people building on top of DataFusion) - but I do think it makes sense to focus more on the core part of what makes DataFusion great as mentioned above.

I agree -- this idea is somewhat mentioned in https://datafusion.apache.org/user-guide/faq.html#how-does-datafusion-compare-with-xyz as well:

Targeted at developers, rather than end users / data scientists.

Yes, I think part of the solution here is to make it very easy to discover extensions that add to the base DataFusion functionality. I think part of why it's tempting to add new features to DataFusion core is that it makes them more discoverable by default and provides a natural coordination point for implementing a set of functionality.

I am thinking whether we should maintain an index service or something like VSCode marketplace to showcase third-party extensions developed by other users and make it easy for users to find the extensions they need.

I agree with @phillipleblanc and @jonahgao -- here is a proposal to try and make it easier to discover extensions:

Add 'Extensions List' page to the documentation #12420

It isn't quite as easy as VSCode marketplace (or the newly announced DuckDB community extensions: https://community-extensions.duckdb.org/) but it is a start.

I also very much hope that the https://github.com/datafusion-contrib/datafusion-tui project @matthewmturner and I are working on will become an example / easy starting point for pre-cooked integrations, which will help with discoverability. We still have a ways to go but I am feeling bullish.

ozankabak · 2024-09-11T09:36:22Z

Thank you for starting this discussion. I really agree with this concise statement:

keeping the DataFusion core simple and focused.

When we first joined the project (almost two years ago now), it took us some time to internalize/digest this approach as our first instinct was to contribute as much as we can upstream. However, I can safely say that following this guideline helped us with our engineering too -- it forces one to think about the right boundaries between components, what belongs to the core, etc.

kszlim · 2024-09-12T05:16:41Z

One vote here for the other use case. I'd like datafusion to be usable as a single node query engine (alongside a nice dataframe api). This is in the works within the datafusion-python bindings, but I'd personally love for this use case to gain as much priority as datafusion as a library to build other db products on top of.

I really think that with a combination of really strong python bindings (and ensuring that all extension points are also appropriately exposed to python), #4285, and a lot of work on making the docs and the python bindings as nice as polars, DataFusion could become the go-to solution for ETL/OLAP/ML/data engineering/etc. use cases.

DataFusion has a lot of really excellent foundational engineering. How it's used by so many downstream DB engines attests strongly to that. I think it's a real shame that it isn't quite as suitable for the role that pandas/dask/polars/duckdb currently occupies. This isn't due to anything lacking in the query engine, but the overall user experience for a direct user isn't quite as solid (as opposed to someone using it as a library).

alamb · 2024-09-12T10:45:39Z

DataFusion has a lot of really excellent foundational engineering. How it's used by so many downstream DB engines attests strongly to that. I think it's a real shame that it isn't quite as suitable for the role that pandas/dask/polars/duckdb currently occupies. This isn't due to anything lacking in the query engine, but the overall user experience for a direct user isn't quite as solid (as opposed to someone using it as a library).

Thank you @kszlim -- This is well stated, and I think this is one of the core tensions that has existed in the project from the early days.

One way to go is as you suggest: try to make datafusion the superset of all that is good about polars (python dataframes) and duckdb (sql). I worry that this will result in an even larger library that will never be as good as either.

Another potential way is to keep the core focused on fundamentals and work to provide open source alternatives to those other libraries, built on datafusion. This is my not-so-secret goal with the following discussions:

polars: [DISCUSSION] We need a Hero for datafusion-python datafusion-python#440 (🙌 @timsaucer )
duckdb: Proposal: Create dfdb, a new CLI different than datafusion-cli with pre-built integrations #11979 (🙌 @matthewmturner )

I am hoping to see datafusion-python (or maybe a library built on datafusion-python) and dft evolve into delightful end user experiences.

The benefit of keeping the core more focused is that it would make it easier to embed and enable more use cases, thus drawing more users and thus contributors back.

timsaucer · 2024-09-12T12:31:25Z

I really think with a combination of really strong python bindings (and ensuring that all extension points are also appropriately exposed to python), #4285, and a lot of work into making the docs and the python bindings as nice as polars.

I feel like we've made a ton of progress on this in datafusion-python 40 and 41. As someone who is also using datafusion-python in my project, I can already feel the huge usability improvements that make my day to day work more enjoyable. Now, I'm probably biased since I am focusing on building those as I need them for my projects. But the type hinting, simpler apis, html rendering in notebooks, and rust udfs in python all have made a really different experience from when I first started to use it.

The point I'm still struggling with right now is the extension points and how those can/should fit into the python bindings. There are some parts that are trivially easy to do and some parts that are not supported. I should probably open an issue to find out what all of the extensions people would like to see in the python bindings.

That's a bit of an aside from the central discussion here. My thoughts on the core question are much in line with what @alamb suggests above about supporting core features and a minimal set of extensions to demonstrate the usability.

alamb · 2024-09-12T12:42:40Z

I feel like we've made a ton of progress on this in datafusion-python 40 and 41. As someone who is also using datafusion-python in my project, I can already feel the huge usability improvements that make my day to day work more enjoyable. Now, I'm probably biased since I am focusing on building those as I need them for my projects. But the type hinting, simpler apis, html rendering in notebooks, and rust udfs in python all have made a really different experience from when I first started to use it.

I think all great software is created by someone who is in some way building it for themselves and has an intuitive understanding of what is needed. I am very glad you have started to help craft datafusion-python this way.

Omega359 · 2024-09-12T19:20:40Z

One vote here for the other use case. I'd like datafusion to be usable as a single node query engine (alongside a nice dataframe api).

This is my use case - datafusion is an embedded query engine which I use via its dataframe api. I have a very small set of changes that I've made to datafusion in a branch, but for the most part I use it as it is.

alamb · 2024-09-17T18:04:52Z

FYI we created https://github.com/datafusion-contrib/datafusion-functions-extra as a home for extra functions, to try and organize our efforts to build new functions outside the core of datafusion

See #12254 (comment) for more details

alamb · 2024-09-18T11:02:36Z

In case it isn't obvious, one of my goals with encouraging / setting up other repositories is to provide an outlet for contributions that isn't the datafusion core

I don't want the answer to be "no we don't want them" -- I just think the answer can't be "put them in the datafusion core" for everything (mostly to keep the maintenance of the project manageable)

cisaacson · 2024-09-18T12:59:37Z

@alamb This is a great way to do this, it allows the core of DataFusion to keep its focus. I support this approach, and it allows other tools to be added that rely on DataFusion.

alamb · 2024-10-21T11:11:00Z

Unless anyone has further comments, I hope to make a PR codifying the discussion above into the documentation over the next week or two

mkarbo · 2024-11-06T10:27:01Z

I don't know if this is the correct thread, and maybe I am just bad at searching - but I spent at least a few hours trying to figure out if it's possible to create & register custom DDL, for instance (just a silly example to get the point across)

create TACO as t WITH toppings ( ... );

or perhaps something entirely different without the keyword create

In reality, it might be to register external secret managers (similar to duckdb's SECRET) or other non-ANSI semantics and objects that might belong in an application built using datafusion as a library and foundation.

I suspect it might eventually be covered in the unfinished section here though https://datafusion.apache.org/library-user-guide/extending-operators.html, but I thought to ask either way here for good measure.

alamb · 2024-11-06T22:32:54Z

Hi @mkarbo -- DataFusion actually has its own SQL dialect that was implemented as a small extension to the sqlparser

https://docs.rs/datafusion/latest/datafusion/sql/parser/struct.DFParser.html

I think you can take a look at how DataFusion does it -- namely, parse the token stream yourself (unless you need some token that is not defined in sqlparser-rs) and delegate to sqlparser-rs if it isn't your special DDL

Then you have a match statement in front that either interprets / runs your custom statement or passes it to DataFusion's normal Statement
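
To make the shape of that concrete, here is a rough, std-only sketch of the "peek, then delegate" pattern. It does not use sqlparser-rs or DFParser at all: the `AppStatement` enum, `parse_app_statement`, `run`, and the `CREATE TACO AS <name> WITH TOPPINGS (...)` grammar are all hypothetical names made up for illustration, and the keyword check is a naive whitespace split rather than a real tokenizer. In a real application the `Sql` arm is where the string would be handed to DataFusion (for example through `SessionContext::sql`).

```rust
/// Hypothetical application-level statement: either the custom DDL or
/// ordinary SQL that should be handed to the regular parser untouched.
#[derive(Debug, PartialEq)]
enum AppStatement {
    /// Invented syntax: `CREATE TACO AS <name> WITH TOPPINGS (a, b, ...)`.
    CreateTaco { name: String, toppings: Vec<String> },
    /// Anything else, passed through as plain SQL text.
    Sql(String),
}

/// Peek at the leading keywords; parse the custom DDL ourselves,
/// otherwise pass the text through unchanged.
fn parse_app_statement(sql: &str) -> Result<AppStatement, String> {
    let text = sql.trim().trim_end_matches(';').trim_end();

    // Naive keyword peek (a real implementation would inspect tokens).
    let mut peek = text.split_whitespace();
    let is_taco = matches!(
        (peek.next(), peek.next()),
        (Some(a), Some(b))
            if a.eq_ignore_ascii_case("create") && b.eq_ignore_ascii_case("taco")
    );
    if !is_taco {
        return Ok(AppStatement::Sql(text.to_string()));
    }

    let open = text.find('(').ok_or("expected '(' after TOPPINGS")?;
    let close = text.rfind(')').ok_or("expected closing ')'")?;
    if close < open {
        return Err("mismatched parentheses".to_string());
    }

    // Everything before '(' must be: CREATE TACO AS <name> WITH TOPPINGS
    let head: Vec<&str> = text[..open].split_whitespace().collect();
    let well_formed = head.len() == 6
        && head[2].eq_ignore_ascii_case("as")
        && head[4].eq_ignore_ascii_case("with")
        && head[5].eq_ignore_ascii_case("toppings");
    if !well_formed {
        return Err("expected: CREATE TACO AS <name> WITH TOPPINGS (...)".to_string());
    }

    let toppings = text[open + 1..close]
        .split(',')
        .map(|t| t.trim().to_string())
        .filter(|t| !t.is_empty())
        .collect();
    Ok(AppStatement::CreateTaco { name: head[3].to_string(), toppings })
}

/// The "match statement in front": run the custom statement ourselves,
/// or delegate everything else (here the delegation is only a placeholder).
fn run(sql: &str) -> Result<String, String> {
    match parse_app_statement(sql)? {
        AppStatement::CreateTaco { name, toppings } => {
            Ok(format!("registered taco {name} with {} toppings", toppings.len()))
        }
        // Placeholder: a real application would pass `sql` to DataFusion here.
        AppStatement::Sql(sql) => Ok(format!("delegate to DataFusion: {sql}")),
    }
}
```

Calling `run("SELECT 1")` takes the delegate path, while `run("create TACO as t WITH toppings (cheese, salsa);")` is handled entirely by the application. The same structure should carry over when the keyword peek is done on sqlparser-rs tokens instead of whitespace-split words.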

alamb added the enhancement (New feature or request) and documentation (Improvements or additions to documentation) labels Sep 6, 2024

alamb mentioned this issue Sep 6, 2024: [Epic] A collection of issues for extending the Aggregation function #12254 (Closed, 7 tasks)

Xuanwo mentioned this issue Sep 6, 2024: Feat: Implement hf:// / "hugging face" integration in datafusion-cli #10792 (Draft)

alamb mentioned this issue Sep 7, 2024: Add documentation about performance PRs, add (TBD) section on feature criteria #12372 (Merged)

alamb pinned this issue Sep 9, 2024

devanbenz mentioned this issue Sep 10, 2024: Casting existing timestamp to timestamp again strips timezone information #12218 (Open)

dharanad mentioned this issue Sep 15, 2024: Add array_dot_product / list_dot_product function #12476 (Closed)

alamb mentioned this issue Sep 16, 2024: DataFusion weekly project plan (Andrew Lamb) - Sep 16, 2024 #12494 (Closed, 8 tasks)

Weijun-H mentioned this issue Sep 18, 2024: feat(function): add greatest function #12474 (Open)

alamb mentioned this issue Sep 20, 2024: [EPIC] Easier extension configuration SessionState / SessionConfig #12550 (Open, 6 tasks)

findepi mentioned this issue Sep 24, 2024: Proposal: introduced typed expressions, separate AST and IR #12604 (Open)

Weijun-H mentioned this issue Sep 25, 2024: implement kurtosis udaf #12613 (Closed)

findepi mentioned this issue Oct 2, 2024: [Epic] Make DataFusion a reliable foundation for building query engines #12723 (Open, 10 tasks)

alamb mentioned this issue Oct 16, 2024: Oct 16, 2024: This week in DataFusion #12973 (Closed)

timsaucer mentioned this issue Oct 17, 2024: FFI initial implementation #12920 (Merged, 3 tasks)

alamb mentioned this issue Oct 21, 2024: Oct 21, 2024: This week in DataFusion #13035 (Closed, 4 tasks)

alamb mentioned this issue Oct 29, 2024: Oct 28, 2024: This week in DataFusion #13167 (Closed, 3 tasks)

alamb mentioned this issue Nov 5, 2024: Nov 5. 2024: This week in DataFusion #13265 (Open, 3 tasks)

alamb mentioned this issue Nov 13, 2024: Review Backlog and Plan - Andrew Lamb - Nov 2024 #13386 (Open)

waynexia mentioned this issue Nov 14, 2024: Add greatest(T,...) and least(T,...) SQL functions #6531 (Open)
