Reduce clone of Statistics in ListingTable and PartitionedFile #11802

Conversation
According to a simple benchmark on ClickBench q0.

This is very exciting and a great idea. Thank you @Rachelint. We have seen similar performance challenges cloning Statistics.

Glad to see that it can help! In fact, I want to refactor the returned value as well, but I don't know if it is ok to modify such a function in the public trait...
(Force pushes and title edits; the title evolved from "Reduce clone of Statistics by using arc" to "Reduce clone of Statistics in ListingTable -- 2x faster for ClickBench Q0".)
More benchmarks (the ones showing no change):

Some strange results were found during the benchmarks (see #11807), but after pulling main and rebasing, the results became different...
Thank you very much @Rachelint -- I went through this PR carefully. I have some comments that I think could make the code better but I don't think they are necessary to merge this PR
@@ -78,10 +78,11 @@ pub struct PartitionedFile {
     ///
     /// DataFusion relies on these statistics for planning (in particular to sort file groups),
     /// so if they are incorrect, incorrect answers may result.
-    pub statistics: Option<Statistics>,
+    pub statistics: Option<Arc<Statistics>>,
💯 This alone will likely avoid a bunch of copying
It is also an API change, so I marked the PR thusly
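For illustration only (this is not DataFusion's actual Statistics definition), a minimal sketch of why the Arc wrapper matters: cloning an Arc is just a reference-count bump, while cloning the inner value copies every heap allocation.

```rust
use std::sync::Arc;

// Hypothetical stand-in for a statistics struct with heap-allocated fields.
#[derive(Clone)]
struct FileStats {
    column_mins: Vec<String>,
    column_maxes: Vec<String>,
}

fn main() {
    let stats = Arc::new(FileStats {
        column_mins: vec!["a".to_string(); 1_000],
        column_maxes: vec!["z".to_string(); 1_000],
    });

    // Cheap: only increments the reference count; the vectors are shared.
    let shared = Arc::clone(&stats);

    // Expensive: copies every String in both vectors.
    let deep: FileStats = (*stats).clone();

    assert_eq!(shared.column_mins.len(), deep.column_mins.len());
}
```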
@@ -159,6 +160,24 @@ impl From<ObjectMeta> for PartitionedFile {
     }
 }

+impl Default for PartitionedFile {
Isn't this the same as #[derive(Default)]?
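As a quick illustration of the question above, using a hypothetical struct rather than the real PartitionedFile: a manual Default impl that just defaults every field is what #[derive(Default)] already generates.

```rust
use std::sync::Arc;

// Hypothetical struct shaped loosely like PartitionedFile.
#[derive(Default)]
struct FileEntry {
    path: String,
    size: u64,
    statistics: Option<Arc<String>>,
}

// The derive above expands to roughly this hand-written impl:
//
// impl Default for FileEntry {
//     fn default() -> Self {
//         Self {
//             path: String::default(),
//             size: u64::default(),
//             statistics: None,
//         }
//     }
// }

fn main() {
    let empty = FileEntry::default();
    assert!(empty.statistics.is_none());
    assert_eq!(empty.size, 0);
}
```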
     total_byte_size =
-        add_row_stats(file_stats.total_byte_size, total_byte_size);
+        add_row_stats(file_stats.total_byte_size.clone(), total_byte_size);
I double checked that the stats here are Precision<usize> (and thus this clone is not a performance problem).
I also made a small experiment with an alternate formulation where it might be clearer that the copy is not occurring (last commit in #11828).
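To make the "this clone is cheap" point concrete, here is a simplified stand-in for a Precision-like enum (not the real DataFusion type): when the payload is a usize, a clone copies only a tag and an integer, unlike a Precision<ScalarValue> clone that may duplicate heap data.

```rust
// Simplified stand-in for a Precision-like enum, not DataFusion's definition.
#[derive(Clone, Debug, PartialEq)]
enum Precision<T> {
    Exact(T),
    Inexact(T),
    Absent,
}

fn main() {
    let total_byte_size: Precision<usize> = Precision::Exact(4096);

    // Cloning copies only the discriminant plus a usize -- effectively free.
    let copied = total_byte_size.clone();
    assert_eq!(copied, Precision::Exact(4096));

    let unknown: Precision<usize> = Precision::Absent;
    assert_ne!(unknown, Precision::Inexact(0));
}
```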
) -> Precision<ScalarValue> {
    match (&min_values, &min_nominee) {
        (Precision::Exact(val1), Precision::Exact(val2)) if val1 > val2 => min_nominee,

    min_nominee: &Precision<ScalarValue>,
💯 to reduce this copy
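A hedged sketch of the pattern this diff moves toward, with simplified types rather than the actual DataFusion code: take the nominee by reference, compare without cloning, and clone only when it actually replaces the current value.

```rust
// Simplified Precision over String, standing in for Precision<ScalarValue>.
#[derive(Clone, Debug, PartialEq)]
enum Precision<T> {
    Exact(T),
    Absent,
}

// The current minimum is taken by value; the nominee only by reference,
// so it is cloned only when it wins the comparison.
fn min_of(current: Precision<String>, nominee: &Precision<String>) -> Precision<String> {
    match (&current, nominee) {
        (Precision::Exact(cur), Precision::Exact(new)) if new < cur => nominee.clone(),
        _ => current,
    }
}

fn main() {
    let current = Precision::Exact("m".to_string());
    let nominee = Precision::Exact("a".to_string());
    assert_eq!(min_of(current, &nominee), Precision::Exact("a".to_string()));
}
```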
💯 to reduce this copy

I think the alternative may be that we refactor the clone-expensive scalars to a clone-cheap representation (like String to Arc<str>)?
I think that would also be something interesting to pursue 💯
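A small sketch of the String to Arc<str> idea mentioned above (illustrative only, not a change to ScalarValue itself): the clone cost drops from copying the whole buffer to bumping a reference count.

```rust
use std::sync::Arc;

fn main() {
    // Cloning a String copies the entire heap buffer: O(len).
    let owned: String = "some long utf8 column value ".repeat(1_000);
    let deep_copy = owned.clone();

    // Cloning an Arc<str> only increments a reference count: O(1).
    // This is why moving clone-heavy scalar payloads behind Arc makes
    // repeated clones cheap.
    let shared: Arc<str> = Arc::from(owned.as_str());
    let cheap_copy = Arc::clone(&shared);

    assert_eq!(deep_copy.len(), cheap_copy.len());
}
```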
     partitioned_files
-        .chunks(chunk_size)
-        .map(|c| c.to_vec())
+        .chunks_mut(chunk_size)
+        .map(|c| c.iter_mut().map(mem::take).collect())
         .collect()
 }
I think this could also be formulated with drain(), which avoids the need for Default: https://doc.rust-lang.org/std/vec/struct.Vec.html#method.drain

Here is a POC of it working: #11829
let mut chunks = Vec::with_capacity(n);
let mut current_chunk = Vec::with_capacity(chunk_size);
// Move each file out of the source vector without cloning.
for file in partitioned_files.drain(..) {
    current_chunk.push(file);
    if current_chunk.len() == chunk_size {
        // Hand the filled chunk over and start a fresh, empty one.
        chunks.push(mem::take(&mut current_chunk));
    }
}
// Keep the trailing, partially filled chunk.
if !current_chunk.is_empty() {
    chunks.push(current_chunk)
}
chunks
(I don't know if this matters for performance)
I used this to replace the chunks_mut approach and see no change in performance, but it is really good to eliminate the Default requirement on PartitionedFile.
I am running benchmarks on my test machine too, but I think this looks great to me.

I think there can be some significant variation in performance when we are measuring queries that take 100s of ms -- so it may be measurement noise.
I couldn't reproduce the performance improvement 🤔
@alamb It mainly improves the short queries 🤔 But it is strange that it gets slower in other cases (especially the longer queries).

I am running more benchmarks.
That is my expectation too.

My script tries to control for this by comparing against main.

Yes, maybe.

Then thank you. I also hope to spend some time shortly looking into this (and it looks like you have done some additional work too).
I am still working on finding the reason why the long queries are slower (e.g. q22, as mentioned in the strange result above). After pulling and rebasing to the latest main, this branch took 9200ms and main took 8800ms... The code change here can hardly make such a difference (the planning stage took less than 5ms). I used perf to collect some cpu metrics, and they also seem almost the same...
It is really interesting. I profiled the two branches with q22 in ClickBench. Then I found ... Then I only reverted the commit ... Seems the ...
@alamb finally I think I got the reason; it seems it is not measurement noise for the long queries (such as q22 in ClickBench)... The introduction of the ... I guess it is related to the ... I eliminated the ... The new benchmarks can be seen below.
The target case:

The other non-target cases:
I find it very strange that ...

I'll rerun and see what I can see.
The statistics are actually not used once execution has started; I guess it may be due to the drop shown in datafusion/core/src/datasource/physical_plan/file_stream.rs, lines 291 to 305 (at 16a3557).
Maybe a possible alternative worth trying in the future: we take and drop the ...
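A hedged sketch of the "take and drop early" idea, assuming the field in question is a planning-only payload such as the statistics; the type and field names below are made up for illustration, not DataFusion's real ones. Moving the value out with Option::take lets it be freed before the timed execution phase rather than at the very end.

```rust
use std::sync::Arc;

// Illustrative struct: carries data that is only needed during planning.
struct PlannedScan {
    planning_only_stats: Option<Arc<Vec<u8>>>,
    files: Vec<String>,
}

impl PlannedScan {
    fn start_execution(&mut self) {
        // Drop the planning-only payload now instead of when the whole scan
        // is dropped, so its deallocation cost is not attributed to execution.
        drop(self.planning_only_stats.take());

        for file in &self.files {
            // ... stream each file ...
            let _ = file;
        }
    }
}

fn main() {
    let mut scan = PlannedScan {
        planning_only_stats: Some(Arc::new(vec![0u8; 1024])),
        files: vec!["part-0.parquet".to_string()],
    };
    scan.start_execution();
    assert!(scan.planning_only_stats.is_none());
}
```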
Removed the api change label as we have now removed the ... I agree the timings look good now. Would you be willing to create a new PR with just the ...? Thanks again @Rachelint.
@alamb I found the reason finally, ...

Yes, planning to; it is really worth pursuing.
Thanks -- filed #11885
Which issue does this PR close?
Part of #11719
Rationale for this change
What changes are included in this PR?
- Reduce clones of Statistics by using Arc.
- Reduce copies in get_statistics_with_limit (it seems some temporary vectors may exist, but I am not sure).

Are these changes tested?

By existing tests.
Are there any user-facing changes?
No.