Bad performance on wide tables (1000+ columns) #7698
Thank you for the report @karlovnv -- I agree with your analysis, and we are indeed tracking various ways to make DataFusion's planning faster in #5637. Another performance issue, which I think is related to the ones you have already identified, involves the representation of schemas and name resolution (often error strings are created and then ignored, for example). If you (or anyone else) have any time to help with this project it would be most appreciated |
@alamb, thank you for the reply! |
Also, it would be good to consider implementing prepared physical plans (with parametrization); that would make it possible to cache them |
Good issue, I'll work on it. We can't use a HashMap because we need to preserve the insertion order. |
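For illustration, a minimal sketch of an order-preserving index that would satisfy the constraint above, using the indexmap crate (FieldIndex and its methods are made-up names, not DataFusion types):

```rust
use indexmap::IndexMap;

/// Illustrative only: a field index that preserves insertion order (like a Vec)
/// while still offering hashed lookups by name (like a HashMap).
struct FieldIndex {
    // column name -> position in the schema
    fields: IndexMap<String, usize>,
}

impl FieldIndex {
    fn new(names: impl IntoIterator<Item = String>) -> Self {
        let fields = names.into_iter().enumerate().map(|(i, n)| (n, i)).collect();
        Self { fields }
    }

    /// O(1) average lookup instead of a linear scan over all columns.
    fn index_of(&self, name: &str) -> Option<usize> {
        self.fields.get(name).copied()
    }
}

fn main() {
    let idx = FieldIndex::new((0..1000).map(|i| format!("col_{i}")));
    // Iteration order matches insertion order, which a plain HashMap would not guarantee.
    assert_eq!(idx.fields.keys().next().map(String::as_str), Some("col_0"));
    assert_eq!(idx.index_of("col_999"), Some(999));
}
```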
Thanks @maruschin. I wonder if this is a good time to take a step back and see if we could make DFSchema easier to use in general -- I think it is already pretty complicated, and optimizing certain methods will likely make it more so. For example, I wonder whether adding the index map will make less complex queries slower, or whether we need to take more care to reuse DFSchema. Thus I suggest sketching out the type of change you have in mind in a draft PR that we can discuss prior to spending all the time getting the PR polished up. |
@maruschin -- it would help in general to have some idea of the why (rationale) behind #7895 -- presumably it is because it makes something easier / less error prone, but I am sorry I don't immediately understand |
@alamb take a look at PR #7870 please, where @oleggator has implemented a BTree instead of a list. It improved physical plan construction by 2x |
I have reviewed #7870 and #7878. Thank you for your work @maruschin and @karlovnv. Here are my thoughts:
|
Made the benchmark. |
Hi, |
I've applied the precomputed qualified_name I mentioned above and the btree draft by @oleggator to DataFusion 31.0.0, then ran valgrind with a simple SELECT-many-columns query against a table that has 3617 columns. The following attachment is the resulting call graph in SVG format. |
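For illustration, a minimal sketch of the "precomputed qualified_name" idea (CachedField and its fields are made-up stand-ins for DFField): the qualified name is built once when the field is created, so later name resolution compares ready-made strings instead of formatting a new String per comparison.

```rust
/// Simplified stand-in for DFField: the qualified name is computed once,
/// when the field is created, rather than on every lookup.
#[allow(dead_code)]
#[derive(Clone)]
struct CachedField {
    qualifier: Option<String>,
    name: String,
    qualified_name: String, // precomputed "qualifier.name" (or just "name")
}

impl CachedField {
    fn new(qualifier: Option<&str>, name: &str) -> Self {
        let qualified_name = match qualifier {
            Some(q) => format!("{q}.{name}"),
            None => name.to_string(),
        };
        Self {
            qualifier: qualifier.map(str::to_string),
            name: name.to_string(),
            qualified_name,
        }
    }

    /// Lookup becomes a plain &str comparison with no allocation.
    fn matches(&self, qualified: &str) -> bool {
        self.qualified_name == qualified
    }
}

fn main() {
    let f = CachedField::new(Some("t"), "amount");
    assert!(f.matches("t.amount"));
    assert!(!f.matches("t.total"));
}
```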
You can try to put it behind a |
I've additionally changed the type of the precomputed qualified_name from |
I think it's a good idea to cache instances of DFSchema (and Arrow Schema as well). The most flexible way is to implement a user-defined SchemaCacheProvider (let users of DataFusion decide how to cache schemas). |
Another thought is to cache physical plans (I tested caching an optimized physical plan serialized into protobuf, and it increased performance dramatically) |
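For illustration, a rough sketch of what such a user-supplied cache hook could look like. SchemaCacheProvider is the hypothetical name from the comment above, not an existing DataFusion API, and the same pattern could hold cached (e.g. protobuf-serialized) plans instead of schemas:

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

/// Stand-in for an Arrow/DF schema; in DataFusion this would be Arc<DFSchema> or SchemaRef.
type CachedSchema = Arc<Vec<(String, String)>>; // (column name, type name) pairs

/// Hypothetical extension point: the application decides how schemas are cached
/// (keyed by table name here, but the key strategy is up to the implementor).
trait SchemaCacheProvider: Send + Sync {
    fn get(&self, table: &str) -> Option<CachedSchema>;
    fn put(&self, table: &str, schema: CachedSchema);
}

/// Simplest possible implementation: an in-process map behind a lock.
#[derive(Default)]
struct InMemorySchemaCache {
    schemas: RwLock<HashMap<String, CachedSchema>>,
}

impl SchemaCacheProvider for InMemorySchemaCache {
    fn get(&self, table: &str) -> Option<CachedSchema> {
        self.schemas.read().unwrap().get(table).cloned()
    }
    fn put(&self, table: &str, schema: CachedSchema) {
        self.schemas.write().unwrap().insert(table.to_string(), schema);
    }
}

fn main() {
    let cache = InMemorySchemaCache::default();
    let schema: CachedSchema = Arc::new(vec![("id".into(), "Int64".into())]);
    cache.put("wide_table", schema.clone());
    // A second planning pass for the same table reuses the cached schema
    // instead of rebuilding it column by column.
    assert!(cache.get("wide_table").is_some());
}
```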
We do something similar to this in IOx (cache schemas that we know don't change rather than recomputing them). It is my opinion that, in order to make DFSchema behave well and not be a bottleneck, we will need to more fundamentally restructure how it works. Right now the amount of copying required is substantial, as has been pointed out several times in this thread. I think with sufficient diligence we could avoid almost all copies when manipulating DFSchema, and then the extra complexity of adding a cache or other techniques would become unnecessary.
I think this is a great idea. Optimizing for the case of the same, reused qualifier is a very good idea. What do people think about the approach described in #7944? I (admittedly biasedly) think that approach would eliminate almost all allocations (instead they would become ref-count updates). We can extend it to incorporate ideas like pre-caching qualified names and hash sets for column checks, and I think it could be pretty fast |
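To illustrate the "ref count updates instead of allocations" point, here is a minimal sketch of the general direction (not the actual #7944 design): sharing qualifiers and field lists behind Arc makes schema clones cheap.

```rust
use std::sync::Arc;

/// Simplified field: the qualifier and name are shared, so cloning a field
/// (or a whole schema) bumps reference counts instead of copying strings.
#[allow(dead_code)]
#[derive(Clone)]
struct SharedField {
    qualifier: Option<Arc<str>>,
    name: Arc<str>,
}

#[derive(Clone)]
struct SharedSchema {
    fields: Arc<Vec<SharedField>>,
}

fn main() {
    let qualifier: Arc<str> = Arc::from("my_table");
    let fields: Vec<SharedField> = (0..1000)
        .map(|i| SharedField {
            qualifier: Some(Arc::clone(&qualifier)), // one shared qualifier for all columns
            name: Arc::from(format!("col_{i}").as_str()),
        })
        .collect();
    let schema = SharedSchema { fields: Arc::new(fields) };
    assert_eq!(schema.fields.len(), 1000);

    // Cloning the schema (as planner and optimizer passes do constantly) is now O(1):
    // no strings or field vectors are copied, only reference counts change.
    let _projected_input = schema.clone();
}
```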
I've tried to optimize the logical planning and optimization routines.
In the following, the optimization steps are cumulative, and the elapsed times decrease accordingly. Note that these optimization steps were not heavily tested. No code change: original timing
Optimization 1: In
Optimization 2: Apply #7870 (Use btree to search fields in DFSchema)
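For illustration, a simplified sketch of the btree-index idea (not the actual #7870 code): keep the fields in a Vec for ordering, plus a BTreeMap from qualified name to position, so index_of_column_by_name becomes O(log n) instead of a linear scan.

```rust
use std::collections::BTreeMap;

/// Simplified schema: field order is kept in the Vec, while the BTreeMap
/// provides ordered, O(log n) lookup by qualified name.
struct IndexedSchema {
    field_names: Vec<String>,
    index: BTreeMap<String, usize>,
}

impl IndexedSchema {
    fn new(field_names: Vec<String>) -> Self {
        let index = field_names
            .iter()
            .enumerate()
            .map(|(i, name)| (name.clone(), i))
            .collect();
        Self { field_names, index }
    }

    /// Replaces the O(n) scan with a tree lookup.
    fn index_of_column_by_name(&self, qualified_name: &str) -> Option<usize> {
        self.index.get(qualified_name).copied()
    }
}

fn main() {
    let schema = IndexedSchema::new((0..3617).map(|i| format!("t.col_{i}")).collect());
    assert_eq!(schema.index_of_column_by_name("t.col_3000"), Some(3000));
    assert_eq!(schema.field_names.len(), 3617);
}
```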
Optimization 3: Change … and change other code accordingly, to avoid string …
Optimization 4: precompute the plan's using_columns once, outside the per-expression loop. Like this:

// Compute the USING-join columns once for the whole projection,
// rather than re-deriving them for every expression.
let using_columns = plan.using_columns()?;
for e in expr {
    let e = e.into();
    match e {
        Expr::Wildcard => {
            projected_expr.extend(expand_wildcard(input_schema, &plan, None)?)
        }
        Expr::QualifiedWildcard { ref qualifier } => projected_expr
            .extend(expand_qualified_wildcard(qualifier, input_schema, None)?),
        // The precomputed using_columns are passed in, so normalization does
        // not call plan.using_columns() again for each expression.
        _ => projected_expr.push(columnize_expr(
            normalize_col_with_using_columns(e, &plan, &using_columns)?,
            input_schema,
        )),
    }
}

And implement normalize_col_with_using_columns accordingly.
Optimization 5: check for duplicate fields with boolean existence tests. Like this:

let duplicated_field = match field.qualifier() {
    Some(q) => self.has_field_with_qualified_name(q, field.name()),
    // for unqualified columns, check as unqualified name
    None => self.has_field_with_unqualified_name(field.name()),
};

And implement has_field_with_qualified_name / has_field_with_unqualified_name accordingly.
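For illustration, a minimal sketch of what those boolean helpers could look like (Field and Schema here are simplified stand-ins, not DataFusion's DFField/DFSchema): a miss answers false instead of building an error string that the caller would discard.

```rust
/// Simplified stand-ins for DFField / DFSchema.
struct Field {
    qualifier: Option<String>,
    name: String,
}

struct Schema {
    fields: Vec<Field>,
}

impl Schema {
    /// True if a field with this qualifier and name exists; no error string is
    /// built on a miss, unlike a lookup returning Result<&Field, Error>.
    fn has_field_with_qualified_name(&self, qualifier: &str, name: &str) -> bool {
        self.fields
            .iter()
            .any(|f| f.qualifier.as_deref() == Some(qualifier) && f.name == name)
    }

    fn has_field_with_unqualified_name(&self, name: &str) -> bool {
        self.fields.iter().any(|f| f.name == name)
    }
}

fn main() {
    let schema = Schema {
        fields: vec![Field { qualifier: Some("t".into()), name: "a".into() }],
    };
    assert!(schema.has_field_with_qualified_name("t", "a"));
    assert!(!schema.has_field_with_unqualified_name("b"));
}
```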
Optimization 6: similar to optimization 5. Like this:

_ => match e.display_name() {
    Ok(name) => match input_schema.get_field_with_unqualified_name(&name) {
        Some(field) => Expr::Column(field.qualified_column()),
        // expression not provided as input, do not convert to a column reference
        None => e,
    },
    Err(_) => e,
},

And implement get_field_with_unqualified_name accordingly.
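And a minimal sketch of the Option-returning lookup used above (again with simplified stand-in types): a miss is just None, so probing many candidate names never formats an error message.

```rust
/// Simplified stand-ins for DFField / DFSchema.
struct Field {
    name: String,
}

struct Schema {
    fields: Vec<Field>,
}

impl Schema {
    /// Option-based lookup: callers that only want to know "is there such a
    /// field?" pay nothing for a miss.
    fn get_field_with_unqualified_name(&self, name: &str) -> Option<&Field> {
        self.fields.iter().find(|f| f.name == name)
    }
}

fn main() {
    let schema = Schema { fields: vec![Field { name: "a".into() }] };
    assert!(schema.get_field_with_unqualified_name("a").is_some());
    // No error string is allocated for this miss.
    assert!(schema.get_field_with_unqualified_name("missing").is_none());
}
```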
Optimization 7: use an insertion-order-preserving set (IndexSet). Like this:

// Deduplicate the matched expressions while keeping their order of appearance.
fn find_exprs_in_exprs<F>(exprs: &[Expr], test_fn: &F) -> Vec<Expr>
where
    F: Fn(&Expr) -> bool,
{
    exprs
        .iter()
        .flat_map(|expr| find_exprs_in_expr(expr, test_fn))
        .fold(IndexSet::new(), |mut acc, expr| {
            acc.insert(expr);
            acc
        })
        .into_iter()
        .collect()
}
Optimization 8: in calc_func_dependencies_for_project, return early when the input schema has no functional dependencies. Like this:

fn calc_func_dependencies_for_project(
    exprs: &[Expr],
    input: &LogicalPlan,
) -> Result<FunctionalDependencies> {
    let input_schema = input.schema();
    // Bail out before doing any per-expression work when there are no
    // dependencies to project.
    if !input_schema.has_functional_dependencies() {
        return Ok(FunctionalDependencies::empty());
    }

And implement has_functional_dependencies accordingly.
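And a minimal sketch of that helper (with a simplified stand-in schema type; the real FunctionalDependencies layout differs): a cheap emptiness check that lets wide projections skip the dependency computation entirely.

```rust
/// Simplified stand-in for DFSchema's functional-dependency metadata.
struct Dependency; // placeholder for a single (source -> target) dependency entry

struct Schema {
    functional_dependencies: Vec<Dependency>,
}

impl Schema {
    /// Cheap check used as an early-out before the per-expression dependency
    /// projection, which is wasted work when there are no dependencies at all.
    fn has_functional_dependencies(&self) -> bool {
        !self.functional_dependencies.is_empty()
    }
}

fn main() {
    let schema = Schema { functional_dependencies: vec![] };
    assert!(!schema.has_functional_dependencies());
}
```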
Now, the resulting 0.8 seconds is acceptable for me. |
Thank you for this very detailed report @zeodtr -- this sounds great. Several of the optimizations you describe above seem like they could be pulled out into small PRs for DataFusion. How do you suggest we proceed to make progress here? |
@alamb I may not be able to make PRs myself. |
In my (humble, may be wrong) opinion, DataFusion planning code may have the following performance problems.
It's just my humble opinion. (I don't have deep knowledge of plan building) |
I think there are tradeoffs of each approach for sure. We have had some discussions about the various pros and cons in the past that might interest you:
In general, I think you have identified the core improvements necessary to support faster planning with complex schemas. |
@alamb I've read the discussions you shared. Thank you.
Thank you. |
@alamb Hi! Could you please let us know if any work is planned here? We noticed that the performance of DataFusion on wide tables degrades significantly from version to version, which forces us to stay on 31 |
Hi @karlovnv -- I would say we are making some progress -- most of the work is tracked in the parent epic, #5637, and there have been some improvements there recently. I think the best help you could give is to ensure that the planning benchmarks we have added (see https://github.com/apache/arrow-datafusion/blob/main/datafusion/core/benches/sql_planner.rs) accurately reflect what you are doing. If there are other types of queries / schemas you are using that are not reflected there, a PR / examples would be most appreciated as we continue to work to improve things |
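For anyone wanting to reproduce this kind of measurement outside the criterion harness, here is a rough standalone sketch (it is not the sql_planner.rs benchmark; it assumes the tokio runtime, and exact APIs may differ between DataFusion versions). It registers a wide, empty table and times planning only.

```rust
use std::sync::Arc;
use std::time::Instant;

use datafusion::arrow::datatypes::{DataType, Field, Schema};
use datafusion::arrow::record_batch::RecordBatch;
use datafusion::error::Result;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> Result<()> {
    // Build a 1000-column schema with no rows: we only care about planning cost.
    let fields: Vec<Field> = (0..1000)
        .map(|i| Field::new(format!("col_{i}"), DataType::Int64, true))
        .collect();
    let schema = Arc::new(Schema::new(fields));
    let batch = RecordBatch::new_empty(schema);

    let ctx = SessionContext::new();
    let _ = ctx.register_batch("wide", batch)?;

    // SELECT every column explicitly, which stresses name resolution.
    let projection: Vec<String> = (0..1000).map(|i| format!("col_{i}")).collect();
    let sql = format!("SELECT {} FROM wide", projection.join(", "));

    let start = Instant::now();
    let df = ctx.sql(&sql).await?;                // SQL -> logical plan
    let logical = start.elapsed();
    let _plan = df.create_physical_plan().await?; // optimize + build physical plan
    println!("logical: {:?}, total planning: {:?}", logical, start.elapsed());
    Ok(())
}
```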
@alamb we ran the same perf test on 37.1, and it seems that now 99% of the request time is spent on planning and optimization (creating and optimizing the logical plan, the planner, and creating and optimizing the physical plan). It seems to be a huge degradation since 31 (we still use 31, as it is much more performant at planning) |
Thank you for the report @karlovnv. Any benchmarks you are able to share / contribute would be most helpful for improving the code. With the benchmarks we have in place, I am happy to report we see 10x faster planning in 38.0.0 (just released) compared to 37.0.0 for 1000 columns. Not sure if you have tried that version yet. Details are here #5637 (comment) |
Thank you for your reply @alamb! We'll check it on 38 and share the results. This particular example is synthetic, as we implemented it using pure in-memory tables without any external dependencies. In a real project (we are developing an in-memory columnar database for antifraud and scoring, using DF as the query engine) we hit similar perf issues (31 vs 37.1) |
Describe the bug
I'm testing DataFusion for use in a system which has several thousand columns and billions of rows.
I'm excited about the flexibility and possibilities this technology provides.
The problems we faced:
We worked around it with prepared queries (we cache the parametrized logical plan)
Some investigation showed that there are a lot of string comparisons (take a look at the flamegraph)
The algorithm currently has O(N^2) complexity: one factor of N comes from iterating over all the columns in datafusion_common::dfschema::DFSchema::index_of_column_by_name, and another from datafusion_common::table_reference::TableReference::resolved_eq.
https://github.com/apache/arrow-datafusion/blob/22d03c127e7c5e56cf97ae33eb4446d5b7022eaa/datafusion/common/src/dfschema.rs#L211
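In simplified form, the quadratic pattern looks roughly like this (illustrative code only, not the actual DataFusion source): every projected column is resolved with a scan over all schema fields.

```rust
/// Illustrative only: the shape of the quadratic lookup described above.
struct SimpleField {
    qualifier: Option<String>,
    name: String,
}

struct SimpleSchema {
    fields: Vec<SimpleField>,
}

impl SimpleSchema {
    // O(N): scans every field and compares names (and, in the real code,
    // resolves/compares table references) for each probe.
    fn index_of_column_by_name(&self, qualifier: Option<&str>, name: &str) -> Option<usize> {
        self.fields
            .iter()
            .position(|f| f.name == name && f.qualifier.as_deref() == qualifier)
    }
}

fn main() {
    let n = 1000;
    let schema = SimpleSchema {
        fields: (0..n)
            .map(|i| SimpleField { qualifier: Some("t".to_string()), name: format!("col_{i}") })
            .collect(),
    };
    // O(N^2) overall: the planner resolves every projected column, and each
    // resolution is itself a linear scan over the schema.
    for i in 0..n {
        let name = format!("col_{i}");
        assert!(schema.index_of_column_by_name(Some("t"), &name).is_some());
    }
}
```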
Some ideas to resolve:
Thank you for developing such a great tool!
To Reproduce
It's hard to extract code from the project, but I will try to build a simple repro
Expected behavior
Creating the physical plan should take much less CPU time than executing it
Additional context
No response