Optimize "ORDER BY + LIMIT" queries for speed / memory with special TopK operator #7721
Conversation
Thank you @Dandandan -- let me know if you would like help finishing up this PR. It has been on my list but I haven't had a chance yet. Maybe I could make a PR that changed the display of plans to show when topk was being used 🤔
```rust
let schema = self.store.schema().clone();

// generate sorted rows
let topk_rows = std::mem::take(&mut self.inner).into_sorted_vec();
```
Replaced `sort` with `into_sorted_vec`, which utilizes the already-sorted heap.
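As a minimal standalone sketch of why this helps (illustrative, not the PR's actual code): `BinaryHeap::into_sorted_vec` consumes the heap and produces an ascending `Vec` by repeatedly popping, so there is no need for a separate sort pass over the extracted rows.

```rust
use std::collections::BinaryHeap;

fn main() {
    let heap: BinaryHeap<i32> = [3, 1, 4, 1, 5].into_iter().collect();

    // Before: drain the heap and sort the result again, paying for a
    // full O(n log n) sort that ignores the heap's existing ordering.
    // let mut rows: Vec<i32> = heap.into_iter().collect();
    // rows.sort();

    // After: `into_sorted_vec` pops from the heap in order, reusing the
    // heap invariant to produce an ascending Vec directly.
    let rows = heap.into_sorted_vec();
    assert_eq!(rows, vec![1, 1, 3, 4, 5]);
}
```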
write!(f, "SortExec: fetch={fetch}, expr=[{}]", expr.join(",")) | ||
write!( | ||
f, | ||
// TODO should this say topk? |
I'm not sure if we would like to do this? I think there are some other `ExecutionPlan` nodes that have the algorithm depend on one of the parameters (for example: `HashAggregate` modes).
I think in general it would be good to be able to tell what operator was going to be used from looking at the plan. However, I think we can do so as a follow on PR -- I can file a ticket.
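For what it's worth, here is a hypothetical sketch of the kind of display change being discussed; the struct, field names, and output format are illustrative assumptions, not DataFusion's actual code:

```rust
use std::fmt;

// Hypothetical stand-in for the sort operator's display state.
struct SortDisplay {
    fetch: Option<usize>,
    exprs: Vec<String>,
}

impl fmt::Display for SortDisplay {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self.fetch {
            // When a fetch (LIMIT) is present, the TopK path is taken,
            // so the plan could say so explicitly.
            Some(fetch) => write!(
                f,
                "SortExec(TopK): fetch={fetch}, expr=[{}]",
                self.exprs.join(",")
            ),
            None => write!(f, "SortExec: expr=[{}]", self.exprs.join(",")),
        }
    }
}

fn main() {
    let s = SortDisplay {
        fetch: Some(10),
        exprs: vec!["time@7 DESC".to_string()],
    };
    println!("{s}"); // SortExec(TopK): fetch=10, expr=[time@7 DESC]
}
```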
#7750 tracks this work
Thank you for picking this up @Dandandan -- this looks really nice.
I left some small comments / suggestions. Before merging this PR I think we should review the existing LIMIT test coverage. The minimum coverage I think is needed:
- multiple-record-batch input
- single- and multi-column input
- "large N" where N is greater than 20 on randomized input (to ensure the RecordBatch store is covered)
It would also be awesome to do the following (which I can help with / do perhaps):
- Implement a limit "fuzz" test to check the boundary conditions over a wider range (a sketch of such a check appears after this list)
- File a follow-on ticket to display which algorithm is used in which operator in the explain plan
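Here is a minimal sketch of the kind of fuzz check meant above, under the assumption that a full sort-then-truncate is an acceptable oracle; the function names and the pseudo-random generator are illustrative, not DataFusion test code:

```rust
use std::{cmp::Reverse, collections::BinaryHeap};

// Top-k via a bounded min-heap: the root is the smallest survivor,
// so any larger value displaces it. Returns values in descending order.
fn top_k_via_heap(values: &[i64], k: usize) -> Vec<i64> {
    let mut heap: BinaryHeap<Reverse<i64>> = BinaryHeap::new();
    for &v in values {
        heap.push(Reverse(v));
        if heap.len() > k {
            heap.pop();
        }
    }
    let mut out: Vec<i64> = heap.into_iter().map(|Reverse(v)| v).collect();
    out.sort_unstable_by(|a, b| b.cmp(a)); // descending, like ORDER BY ... DESC
    out
}

// Oracle: full descending sort, then truncate to k rows.
fn top_k_via_sort(values: &[i64], k: usize) -> Vec<i64> {
    let mut sorted = values.to_vec();
    sorted.sort_unstable_by(|a, b| b.cmp(a));
    sorted.truncate(k);
    sorted
}

fn main() {
    // deterministic pseudo-random input (simple LCG), no external crates
    let values: Vec<i64> = (0..1000u64)
        .map(|i| {
            let x = i
                .wrapping_mul(6364136223846793005)
                .wrapping_add(1442695040888963407);
            (x >> 33) as i64
        })
        .collect();
    // exercise small, boundary, and "large N" values of k
    for k in [1, 5, 20, 21, 100] {
        assert_eq!(top_k_via_heap(&values, k), top_k_via_sort(&values, k));
    }
    println!("fuzz check passed");
}
```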
```
289 261 296 301 NULL 275 98 98 98 98 85 85 291 289 291 1004 305 305 296 291 301 305 301 283
286 259 291 296 NULL 272 97 97 97 97 84 84 289 286 289 1004 305 305 291 289 296 301 296 278
275 254 289 291 289 269 96 96 96 96 83 83 286 283 286 305 305 305 289 286 291 296 291 275
264 289 266 305 305 305 278 99 99 99 99 86 86 296 291 296 1004 305 305 301 296 305 1002 305 286
```
Added `ts` to show that the first two values are tied, and that the output is correct ✅
```rust
/// Compact this heap, rewriting all stored batches into a single
/// input batch
pub fn maybe_compact(&mut self) -> Result<()> {
    // we compact if the number of "unused" rows in the store is
```
@Dandandan did you review this heuristic -- I remember I tried it on our high-cardinality tracing use case and it seemed to work well (basically it is needed for large N with random-ish inputs)
I'll take a look :)
I think the heuristic is fine:
- it assures we do compaction at most every `n` (> 20) batches of input, or more if batches are well utilized -- compaction reduces the number of rows to `k`. 20 * 8192 = 163840 rows; if we have some wider columns of 1kB each, the memory usage could be ~200MB with some overhead. Thinking about it, I wonder if we need to trigger the compaction as well if it exceeds the configured memory limit 🤔
- for very large `k` (a number of times the batch size) we avoid doing compaction too often

We can tweak the heuristic later if there are cases that would benefit from that. (A sketch of this kind of trigger follows below.)
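To make the shape of that trade-off concrete, here is a hypothetical sketch of such a compaction trigger; the struct, its fields, and the `20` slack factor are illustrative assumptions, not the actual DataFusion heuristic:

```rust
/// Illustrative stand-in for the TopK batch store.
struct TopKStore {
    /// rows currently held across all stored batches
    total_rows: usize,
    /// rows we actually need to keep (the LIMIT)
    k: usize,
    /// typical input batch size, e.g. 8192
    batch_size: usize,
}

impl TopKStore {
    /// Compact only when the store holds many more rows than `k`:
    /// rarely enough that the O(k) rewrite is amortized over many input
    /// batches, but often enough that memory stays bounded near `k` rows.
    fn should_compact(&self) -> bool {
        let unused = self.total_rows.saturating_sub(self.k);
        // allow roughly 20 batches of slack (or 20x `k` for very large
        // limits, whichever is larger) before paying for a rewrite
        unused > 20 * self.batch_size.max(self.k)
    }
}

fn main() {
    // small k: many unused rows accumulate quickly, so we compact
    let store = TopKStore { total_rows: 200_000, k: 10, batch_size: 8192 };
    assert!(store.should_compact()); // 199,990 unused > 20 * 8192

    // very large k: tolerate more slack to avoid compacting too often
    let store = TopKStore { total_rows: 50_000, k: 10_000, batch_size: 8192 };
    assert!(!store.should_compact()); // 40,000 unused <= 20 * 10,000
}
```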
Sounds good -- thank you.
Thanks @Dandandan -- I (biasedly) think this PR is looking quite good 👍
I want to run one final set of performance tests prior to merging (which I am starting now), as well as file the follow-on tickets
Thank you very much @Dandandan .
I think this is ready to go now.
I tested with this dataset: traces.zip (240MB):

| Dataset | Description | Creation |
|---|---|---|
| `traces` | directory of parquet files | N/A |
| `traces.parquet` | same data as `traces` in a single parquet file | `copy (select * from 'traces') to 'traces.parquet'` |
| `traces_oby_random.parquet` | same as `traces.parquet` but ORDER BY random() (so topk gets updated a lot) | `copy (select * from 'traces' ORDER BY random()) to 'traces_oby_random.parquet'` |
| Branch | Query | Time |
|---|---|---|
| topk | `select * from 'traces.parquet' order by time desc limit 10` | 2.416 seconds |
| main | " | 3.403 seconds |
| topk | `select * from 'traces.parquet' order by time desc limit 10000` | 3.073 seconds |
| main | " | 3.868 seconds |
| topk | `select * from 'traces_oby_random.parquet' order by time desc limit 10` | 2.403 seconds |
| main | " | 3.500 seconds |
| topk | `select * from 'traces_oby_random.parquet' order by time desc limit 10000` | 4.024 seconds |
| main | " | 3.997 seconds |
| topk | `select * from 'traces' order by time desc limit 10` | 0.750 seconds |
| main | " | 0.902 seconds |
| topk | `select * from 'traces' order by time desc limit 10000` | 2.256 seconds |
| main | " | 1.244 seconds |
The only query it gets slower for is large N with multiple files. I believe this is because reconstructing the 10,000-row outputs for each of the partitions, merging them, and then reconstructing the heap is fairly expensive. It would be better in this case to avoid the sort and instead do a final TopK (see the sketch after the plan below):
```
TableScan: traces projection=[attributes, duration_nano, end_time_unix_nano, service.name, span.kind, span.name, span_id, time, trace_id, otel.status_code, parent_span_id]

GlobalLimitExec: skip=0, fetch=10000
  SortPreservingMergeExec: [time@7 DESC], fetch=10000
    SortExec: fetch=10000, expr=[time@7 DESC]
      ParquetExec: file_groups={16 groups: [...]}
```
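As a rough illustration of the "final TopK instead of final sort" idea, here is a standalone sketch on plain integers; it is not how the plan operators are actually wired:

```rust
use std::{cmp::Reverse, collections::BinaryHeap};

/// Combine each partition's k results with one more top-k pass instead
/// of fully sorting the concatenation of all partition outputs.
fn final_top_k(partition_results: Vec<Vec<i64>>, k: usize) -> Vec<i64> {
    // min-heap of size k: the root is the smallest survivor, so any
    // value larger than the root displaces it
    let mut heap: BinaryHeap<Reverse<i64>> = BinaryHeap::new();
    for v in partition_results.into_iter().flatten() {
        heap.push(Reverse(v));
        if heap.len() > k {
            heap.pop();
        }
    }
    let mut out: Vec<i64> = heap.into_iter().map(|Reverse(v)| v).collect();
    out.sort_unstable_by(|a, b| b.cmp(a)); // descending, like ORDER BY ... DESC
    out
}

fn main() {
    // e.g. three partitions, each already reduced to its own top 2
    let parts = vec![vec![90, 80], vec![95, 70], vec![85, 84]];
    assert_eq!(final_top_k(parts, 2), vec![95, 90]);
}
```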
I plan to file a follow-on ticket for this shortly.
FYI @gruuya -- it is finally happening
Thanks @alamb for the review -- I plan to test it on our side (Coralogix) and see if there's some follow-up necessary.
This sounds like a good idea, although for distributed usage (e.g. Coralogix) it probably might not be beneficial, as we'll need to fetch all partitions instead of doing TopK + merge in a distributed manner.
Filed
Optimize "ORDER BY + LIMIT" queries for speed / memory with special TopK operator (apache#7721)

* Prototype TopK operator
* Avoid use of Row
* start working on compaction
* checkpoint
* update
* checkpoint
* fmt
* Fix compaction
* add location for re-encoding
* Start sketching dictionary interleave
* checkpoint
* initial specialized dictionary
* finish initial special interleave
* Complete dictionary order
* Merge
* fmt
* Cleanup
* Fix test
* Cleanup
* Make test deterministic
* Clippy, doctest
* Use into_sorted_vec
* Fix nondeterministic tests
* Update cargo.lock
* Update datafusion/physical-plan/src/topk/mod.rs (Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>)
* Update datafusion/physical-plan/src/topk/mod.rs (Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>)
* Update datafusion/physical-plan/src/topk/mod.rs (Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>)
* Update datafusion/physical-plan/src/topk/mod.rs (Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>)
* Add / update some comments
* Rename test file
* Rename table as well
* Update datafusion/sqllogictest/test_files/topk.slt (Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>)

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Which issue does this PR close?
Closes #7196
Closes #7250
Rationale for this change
Adding TopK to DataFusion limits the resources (memory, CPU) needed for `SELECT .. ORDER BY [..] LIMIT N` type queries. @alamb implemented most of the changes necessary; this PR does some final updates / cleanup of the code.
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?