
perf: use v2 to write the sorted IVF partition files #2492

Conversation

westonpace
Contributor

The old v1 approach created a row group for every partition (including empty partitions). This approach converts the input into a List<Struct<...>> array where each row is a partition of data. Empty partitions are not included in the array.

This yields significant performance benefits. I'm not 100% sure whether the benefit comes from changing to v2 or simply from changing the format in which we write the data.
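For reference, a minimal sketch of what that conversion could look like with arrow-rs, assuming each non-empty partition arrives as a `RecordBatch` sharing one schema; the function and its signature are illustrative, not the actual code in this PR:

```rust
use std::sync::Arc;

use arrow_array::{ArrayRef, ListArray, RecordBatch, StructArray};
use arrow_buffer::OffsetBuffer;
use arrow_schema::{ArrowError, DataType, Field};
use arrow_select::concat::concat_batches;

/// Hypothetical helper: pack non-empty partitions into one List<Struct<...>>
/// array where each list element holds all rows of a single partition.
fn partitions_to_list(partitions: &[RecordBatch]) -> Result<ListArray, ArrowError> {
    // Skip empty partitions so they don't produce empty list rows (the v1
    // equivalent was an empty row group per empty partition).
    let non_empty: Vec<&RecordBatch> = partitions
        .iter()
        .filter(|batch| batch.num_rows() > 0)
        .collect();

    // This sketch assumes at least one partition contains data.
    let schema = non_empty[0].schema();

    // Concatenate all partition data into one contiguous batch; this becomes
    // the child "values" array of the list.
    let concatenated = concat_batches(&schema, non_empty.iter().copied())?;
    let values: StructArray = concatenated.into();

    // Offsets record where each partition begins and ends within the values.
    let offsets = OffsetBuffer::from_lengths(non_empty.iter().map(|batch| batch.num_rows()));

    let item_field = Arc::new(Field::new(
        "item",
        DataType::Struct(schema.fields().clone()),
        true,
    ));
    Ok(ListArray::new(item_field, offsets, Arc::new(values) as ArrayRef, None))
}
```

Note that building the list requires concatenating every partition into a single contiguous child array, which is the concatenation step discussed later in this thread.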

@westonpace
Contributor Author

Testing locally I see a bunch of failures from test_create_ivf_hnsw_with_empty_partition. I suspect the call to shuffle_dataset is running more quickly than it did before, which makes it possible to write to more files and makes the open-files error more prevalent, but I don't actually know.

@BubbleCal
Contributor

> Testing locally I see a bunch of failures from test_create_ivf_hnsw_with_empty_partition. I suspect the call to shuffle_dataset is running more quickly than it did before, which makes it possible to write to more files and makes the open-files error more prevalent, but I don't actually know.

shuffle_dataset would be called only once during the entire indexing process.
Is that a recall assertion failure? I fixed such an issue before.

@wjones127
Contributor

> This yields significant performance benefits. I'm not 100% sure whether the benefit comes from changing to v2 or simply from changing the format in which we write the data.

We never got a chance to look at this, but Rob was claiming just the act of concatenating batches improved performance, but I was skeptical. Maybe there's something to that?

#2384 (comment)

@westonpace force-pushed the perf/v2-for-ivf-sorted-part-file-write branch from 548d39f to 791e096 on June 20, 2024 17:07
@westonpace force-pushed the perf/v2-for-ivf-sorted-part-file-write branch from 2f919c6 to b7da513 on June 20, 2024 19:25
@westonpace
Contributor Author

> We never got a chance to look at this, but Rob was claiming just the act of concatenating batches improved performance, but I was skeptical. Maybe there's something to that?

Yeah, I have to concatenate for v2 (v2 can write multiple arrays but I can't make a list array from multiple arrays) so that might explain it.

@codecov-commenter

codecov-commenter commented Jun 22, 2024

Codecov Report

Attention: Patch coverage is 77.49004% with 113 lines in your changes missing coverage. Please review.

Please upload report for BASE (main@9703e50). Learn more about missing BASE report.
Report is 7 commits behind head on main.

| Files | Patch % | Lines |
|---|---|---|
| rust/lance-index/src/vector/ivf/shuffler.rs | 84.33% | 39 Missing and 16 partials ⚠️ |
| rust/lance-index/src/vector/ivf/builder.rs | 0.00% | 26 Missing ⚠️ |
| rust/lance/src/index/vector/ivf.rs | 47.50% | 21 Missing ⚠️ |
| rust/lance/src/dataset/fragment/write.rs | 84.21% | 0 Missing and 9 partials ⚠️ |
| rust/lance-file/src/v2/writer.rs | 87.50% | 0 Missing and 2 partials ⚠️ |
Additional details and impacted files
```
@@           Coverage Diff           @@
##             main    #2492   +/-   ##
=======================================
  Coverage        ?   79.61%
=======================================
  Files           ?      208
  Lines           ?    59425
  Branches        ?    59425
=======================================
  Hits            ?    47314
  Misses          ?     9387
  Partials        ?     2724
```
| Flag | Coverage Δ |
|---|---|
| unittests | 79.61% <77.49%> (?) |


```diff
@@ -77,15 +78,12 @@ where
         });

         let residuals = vectors_slice
-            .par_chunks(dimension)
+            .chunks_exact(dimension)
```
Contributor

We need to bring this back after #2503 is fixed, right?

Contributor Author

I don't think so. I don't think this parallelism is helpful.

Contributor Author

At least, not on the training path. If we do this transform on the query path then it might be useful. Even then, if we have many queries per second (many concurrent queries), I think we might be better off without it.
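For context, a hypothetical sketch of the two variants under discussion: computing residuals (each vector minus its assigned centroid) sequentially with `chunks_exact` versus in parallel with rayon's `par_chunks`. The names (`vectors_slice`, `centroids`, `partition_ids`, `dimension`) echo the diff, but these functions are illustrative rather than the crate's actual code:

```rust
use rayon::prelude::*;

/// Sequential version: fine when the surrounding pipeline (e.g. index
/// training) already keeps the CPU busy. Assumes vectors_slice.len() is a
/// multiple of dimension.
fn residuals_sequential(
    vectors_slice: &[f32],
    centroids: &[f32],
    partition_ids: &[u32],
    dimension: usize,
) -> Vec<f32> {
    vectors_slice
        .chunks_exact(dimension)
        .zip(partition_ids)
        .flat_map(|(vector, &part_id)| {
            let centroid = &centroids[part_id as usize * dimension..][..dimension];
            vector
                .iter()
                .zip(centroid)
                .map(|(v, c)| v - c)
                .collect::<Vec<f32>>()
        })
        .collect()
}

/// Parallel version: only worthwhile if each chunk does enough work to
/// amortize rayon's scheduling overhead and the CPU isn't already saturated,
/// which is the concern raised in this thread.
fn residuals_parallel(
    vectors_slice: &[f32],
    centroids: &[f32],
    partition_ids: &[u32],
    dimension: usize,
) -> Vec<f32> {
    vectors_slice
        .par_chunks(dimension)
        .zip(partition_ids.par_iter())
        .flat_map(|(vector, &part_id)| {
            let centroid = &centroids[part_id as usize * dimension..][..dimension];
            vector
                .iter()
                .zip(centroid)
                .map(|(v, c)| v - c)
                .collect::<Vec<f32>>()
        })
        .collect()
}
```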

Comment on lines +99 to +105
```rust
let mut broken_stream = break_stream(stream, break_limit)
    .map_ok(|batch| vec![batch])
    .boxed();
while let Some(batched_chunk) = broken_stream.next().await {
    let batch_chunk = batched_chunk?;
    writer.write_batches(batch_chunk.iter()).await?;
}
```
Contributor

It seems a little odd that we have to wrap the batch into a Vec here. Why not just:

Suggested change
```diff
-let mut broken_stream = break_stream(stream, break_limit)
-    .map_ok(|batch| vec![batch])
-    .boxed();
-while let Some(batched_chunk) = broken_stream.next().await {
-    let batch_chunk = batched_chunk?;
-    writer.write_batches(batch_chunk.iter()).await?;
-}
+let mut broken_stream = break_stream(stream, break_limit);
+while let Some(batch) = broken_stream.next().await {
+    writer.write_batches([batch?].iter()).await?;
+}
```

Comment on lines +285 to +287
```rust
let row_addr = RowAddress::new_from_id(*row_id);
partition_map[row_addr.fragment_id() as usize][row_addr.row_id() as usize]
```
Contributor

Are you re-using the row id concept but in index files? I do worry it could be confusing.

@westonpace
Contributor Author
Jun 24, 2024

Index files need to refer to something, either row id or row address. So the problem statement here is that we have a GPU process (possibly distributed on multiple systems in the future) that calculates the partition id for each row id. We then need to pass this mapping forward to a CPU process that relies on that mapping to create the next part of the index.

We could use HashMap<u64, u32> (we did originally) but this gets very expensive at high scales (in the future, if we distribute this task as well, maybe not the end of the world).

So basically it is very nice if there is some reliable mechanism to create a "row identifier" -> X mapping without using a hashmap. With row addresses we can do this easily using two Vecs, and it should be stable as long as the read version used to create the mapping is the same as the read version used to apply the mapping (it is).

If we wanted to use row ids here then we could maybe assume that the number of "gaps" will be rather small and just use a straight Vec (one advantage is that it no longer needs to be the same read version, though we might have missing entries otherwise), but I don't know if we can always guarantee these small gaps, and, if there are large gaps, we end up allocating more RAM than strictly needed. What do you think?

Worst case we can just assume that stage 2 is always run on the same data, in the same order, as stage 1. Then we can just do a list of partition ids. I think this might not work today because we process things in parallel and don't order them (but I don't think the cost of doing that ordering would be significant compared to everything else we do).
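For illustration only, a hypothetical sketch of that two-Vec mapping, assuming Lance's row-address convention of packing the fragment id into the upper 32 bits and the row offset into the lower 32 bits; the struct and method names are made up for this example:

```rust
/// Hypothetical lookup table: partition id assigned by the GPU stage, indexed
/// by fragment id and then by row offset within the fragment. This avoids a
/// HashMap<u64, u32>, at the cost of requiring that the mapping is created and
/// applied against the same read version (so the addresses line up).
struct PartitionMap {
    partition_map: Vec<Vec<u32>>,
}

impl PartitionMap {
    fn lookup(&self, row_addr: u64) -> u32 {
        // Assumed layout: fragment id in the upper 32 bits, row offset within
        // the fragment in the lower 32 bits.
        let fragment_id = (row_addr >> 32) as usize;
        let row_offset = (row_addr & 0xFFFF_FFFF) as usize;
        self.partition_map[fragment_id][row_offset]
    }
}
```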

Contributor

Okay, this makes sense. I think I was worried that you were re-using the RowAddress idea for something that isn't a row address. It looked at first like it was the address of the vector within the index partitions. If these are actually RowAddresses as we know them, I agree this makes sense.

Comment on lines +316 to +319
```rust
let minibatch_size = std::env::var("LANCE_SHUFFLE_BATCH_SIZE")
    .unwrap_or("64".to_string())
    .parse::<usize>()
    .unwrap_or(64);
```
Contributor

If someone does specify this but there is a parsing error, I think we should either log or error, rather than silently fall back to the default.
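A minimal sketch of that suggestion (not the crate's actual code), assuming the `log` crate is available: keep the default when the variable is unset, but surface a parse failure instead of silently swallowing it:

```rust
/// Hypothetical helper: read LANCE_SHUFFLE_BATCH_SIZE, defaulting to 64 when
/// unset, but warn (or, alternatively, return an error) when a value is set
/// and fails to parse.
fn shuffle_batch_size() -> usize {
    const DEFAULT: usize = 64;
    match std::env::var("LANCE_SHUFFLE_BATCH_SIZE") {
        Ok(val) => match val.parse::<usize>() {
            Ok(size) => size,
            Err(err) => {
                log::warn!(
                    "Invalid LANCE_SHUFFLE_BATCH_SIZE {:?} ({}); using default {}",
                    val,
                    err,
                    DEFAULT
                );
                DEFAULT
            }
        },
        // Unset (or unreadable) env var: quietly fall back to the default.
        Err(_) => DEFAULT,
    }
}
```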

@westonpace
Contributor Author

Closing as this will be tackled piece by piece in other PRs.

@westonpace closed this Jul 30, 2024