
perf: use v2 to write the sorted IVF partition files #2492

Conversation

westonpace
Contributor

The old v1 approach created a row group for every partition (including empty partitions). This approach converts the input into a List<Struct<...>> array where each row is a partition of data. Empty partitions are not included in the array.

This yields significant performance benefits. I'm not 100% sure whether the benefit comes from changing to v2 or simply from changing the format in which we write the data.
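For reference, a minimal sketch of what that conversion could look like with arrow-rs, assuming each non-empty partition arrives as a `RecordBatch` sharing one schema; the function and its signature are illustrative, not the actual code in this PR:

```rust
use std::sync::Arc;

use arrow_array::{ArrayRef, ListArray, RecordBatch, StructArray};
use arrow_buffer::OffsetBuffer;
use arrow_schema::{ArrowError, DataType, Field};
use arrow_select::concat::concat_batches;

/// Hypothetical helper: pack non-empty partitions into one List<Struct<...>>
/// array where each list element holds all rows of a single partition.
fn partitions_to_list(partitions: &[RecordBatch]) -> Result<ListArray, ArrowError> {
    // Skip empty partitions so they don't produce empty list rows (the v1
    // equivalent was an empty row group per empty partition).
    let non_empty: Vec<&RecordBatch> = partitions
        .iter()
        .filter(|batch| batch.num_rows() > 0)
        .collect();

    // This sketch assumes at least one partition contains data.
    let schema = non_empty[0].schema();

    // Concatenate all partition data into one contiguous batch; this becomes
    // the child "values" array of the list.
    let concatenated = concat_batches(&schema, non_empty.iter().copied())?;
    let values: StructArray = concatenated.into();

    // Offsets record where each partition begins and ends within the values.
    let offsets = OffsetBuffer::from_lengths(non_empty.iter().map(|batch| batch.num_rows()));

    let item_field = Arc::new(Field::new(
        "item",
        DataType::Struct(schema.fields().clone()),
        true,
    ));
    Ok(ListArray::new(item_field, offsets, Arc::new(values) as ArrayRef, None))
}
```

Note that building the list requires concatenating every partition into a single contiguous child array, which is the concatenation step discussed later in this thread.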

@westonpace
Contributor Author

Testing locally I see a bunch of failures from test_create_ivf_hnsw_with_empty_partition. I suspect the call to shuffle_dataset is running more quickly than it did before, which makes it possible to write to more files and makes the open-files error more prevalent, but I don't actually know.

@BubbleCal
Contributor

> Testing locally I see a bunch of failures from test_create_ivf_hnsw_with_empty_partition. I suspect the call to shuffle_dataset is running more quickly than it did before, which makes it possible to write to more files and makes the open-files error more prevalent, but I don't actually know.

shuffle_dataset would be called only once during the entire indexing process.
Is that a recall assertion failure? I fixed such an issue before.

@wjones127
Contributor

> This yields significant performance benefits. I'm not 100% sure whether the benefit comes from changing to v2 or simply from changing the format in which we write the data.

We never got a chance to look at this, but Rob was claiming just the act of concatenating batches improved performance, but I was skeptical. Maybe there's something to that?

#2384 (comment)

@westonpace force-pushed the perf/v2-for-ivf-sorted-part-file-write branch from 548d39f to 791e096 on June 20, 2024 17:07
@westonpace force-pushed the perf/v2-for-ivf-sorted-part-file-write branch from 2f919c6 to b7da513 on June 20, 2024 19:25
@westonpace
Contributor Author

> We never got a chance to look at this, but Rob was claiming just the act of concatenating batches improved performance, but I was skeptical. Maybe there's something to that?

Yeah, I have to concatenate for v2 (v2 can write multiple arrays but I can't make a list array from multiple arrays) so that might explain it.

@codecov-commenter

codecov-commenter commented Jun 22, 2024

Codecov Report

Attention: Patch coverage is 77.49004% with 113 lines in your changes missing coverage. Please review.

Please upload report for BASE (main@9703e50). Learn more about missing BASE report.
Report is 7 commits behind head on main.

| Files | Patch % | Lines |
|---|---|---|
| rust/lance-index/src/vector/ivf/shuffler.rs | 84.33% | 39 Missing and 16 partials ⚠️ |
| rust/lance-index/src/vector/ivf/builder.rs | 0.00% | 26 Missing ⚠️ |
| rust/lance/src/index/vector/ivf.rs | 47.50% | 21 Missing ⚠️ |
| rust/lance/src/dataset/fragment/write.rs | 84.21% | 0 Missing and 9 partials ⚠️ |
| rust/lance-file/src/v2/writer.rs | 87.50% | 0 Missing and 2 partials ⚠️ |
Additional details and impacted files
```
@@           Coverage Diff           @@
##             main    #2492   +/-   ##
=======================================
  Coverage        ?   79.61%
=======================================
  Files           ?      208
  Lines           ?    59425
  Branches        ?    59425
=======================================
  Hits            ?    47314
  Misses          ?     9387
  Partials        ?     2724
```
| Flag | Coverage Δ |
|---|---|
| unittests | 79.61% <77.49%> (?) |


```diff
@@ -77,15 +78,12 @@ where
         });

         let residuals = vectors_slice
-            .par_chunks(dimension)
+            .chunks_exact(dimension)
```
Contributor

We need to bring this back after #2503 is fixed, right?

Contributor Author

I don't think so. I don't think this parallelism is helpful.

Contributor Author

At least, not on the training path. If we do this transform on the query path then it might be useful. Even then, if we have many queries per second (many concurrent queries), I think we might be better off without it.
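For context, a hypothetical sketch of the two variants under discussion: computing residuals (each vector minus its assigned centroid) sequentially with `chunks_exact` versus in parallel with rayon's `par_chunks`. The names (`vectors_slice`, `centroids`, `partition_ids`, `dimension`) echo the diff, but these functions are illustrative rather than the crate's actual code:

```rust
use rayon::prelude::*;

/// Sequential version: fine when the surrounding pipeline (e.g. index
/// training) already keeps the CPU busy. Assumes vectors_slice.len() is a
/// multiple of dimension.
fn residuals_sequential(
    vectors_slice: &[f32],
    centroids: &[f32],
    partition_ids: &[u32],
    dimension: usize,
) -> Vec<f32> {
    vectors_slice
        .chunks_exact(dimension)
        .zip(partition_ids)
        .flat_map(|(vector, &part_id)| {
            let centroid = &centroids[part_id as usize * dimension..][..dimension];
            vector
                .iter()
                .zip(centroid)
                .map(|(v, c)| v - c)
                .collect::<Vec<f32>>()
        })
        .collect()
}

/// Parallel version: only worthwhile if each chunk does enough work to
/// amortize rayon's scheduling overhead and the CPU isn't already saturated,
/// which is the concern raised in this thread.
fn residuals_parallel(
    vectors_slice: &[f32],
    centroids: &[f32],
    partition_ids: &[u32],
    dimension: usize,
) -> Vec<f32> {
    vectors_slice
        .par_chunks(dimension)
        .zip(partition_ids.par_iter())
        .flat_map(|(vector, &part_id)| {
            let centroid = &centroids[part_id as usize * dimension..][..dimension];
            vector
                .iter()
                .zip(centroid)
                .map(|(v, c)| v - c)
                .collect::<Vec<f32>>()
        })
        .collect()
}
```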

Comment on lines +99 to +105
```rust
let mut broken_stream = break_stream(stream, break_limit)
    .map_ok(|batch| vec![batch])
    .boxed();
while let Some(batched_chunk) = broken_stream.next().await {
    let batch_chunk = batched_chunk?;
    writer.write_batches(batch_chunk.iter()).await?;
}
```
Contributor

It seems a little odd that we have to wrap the batch into a Vec here. Why not just:

Suggested change
```diff
-let mut broken_stream = break_stream(stream, break_limit)
-    .map_ok(|batch| vec![batch])
-    .boxed();
-while let Some(batched_chunk) = broken_stream.next().await {
-    let batch_chunk = batched_chunk?;
-    writer.write_batches(batch_chunk.iter()).await?;
-}
+let mut broken_stream = break_stream(stream, break_limit);
+while let Some(batch) = broken_stream.next().await {
+    writer.write_batches([batch?].iter()).await?;
+}
```

Comment on lines +285 to +287
```rust
let row_addr = RowAddress::new_from_id(*row_id);
partition_map[row_addr.fragment_id() as usize][row_addr.row_id() as usize]
```
Contributor

Are you re-using the row id concept but in index files? I do worry it could be confusing.

@westonpace
Contributor Author
Jun 24, 2024

Index files need to refer to something, either row id or row address. So the problem statement here is that we have a GPU process (possibly distributed on multiple systems in the future) that calculates the partition id for each row id. We then need to pass this mapping forward to a CPU process that relies on that mapping to create the next part of the index.

We could use HashMap<u64, u32> (we did originally) but this gets very expensive at high scales (in the future, if we distribute this task as well, maybe not the end of the world).

So basically it is very nice if there is some reliable mechanism to create a "row identifier" -> X mapping without using a hashmap. With row addresses we can do this easily using two Vecs, and it should be stable as long as the read version used to create the mapping is the same as the read version used to apply the mapping (it is).

If we wanted to use row ids here then we could maybe assume that the number of "gaps" will be rather small and just use a straight Vec (one advantage is that it no longer needs to be the same read version, though we might have missing entries otherwise), but I don't know if we can always guarantee these small gaps, and, if there are large gaps, we end up allocating more RAM than strictly needed. What do you think?

Worst case we can just assume that stage 2 is always run on the same data, in the same order, as stage 1. Then we can just do a list of partition ids. I think this might not work today because we process things in parallel and don't order them (but I don't think the cost of doing that ordering would be significant compared to everything else we do).
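For illustration only, a hypothetical sketch of that two-Vec mapping, assuming Lance's row-address convention of packing the fragment id into the upper 32 bits and the row offset into the lower 32 bits; the struct and method names are made up for this example:

```rust
/// Hypothetical lookup table: partition id assigned by the GPU stage, indexed
/// by fragment id and then by row offset within the fragment. This avoids a
/// HashMap<u64, u32>, at the cost of requiring that the mapping is created and
/// applied against the same read version (so the addresses line up).
struct PartitionMap {
    partition_map: Vec<Vec<u32>>,
}

impl PartitionMap {
    fn lookup(&self, row_addr: u64) -> u32 {
        // Assumed layout: fragment id in the upper 32 bits, row offset within
        // the fragment in the lower 32 bits.
        let fragment_id = (row_addr >> 32) as usize;
        let row_offset = (row_addr & 0xFFFF_FFFF) as usize;
        self.partition_map[fragment_id][row_offset]
    }
}
```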

Contributor

Okay, this makes sense. I think I was worried that you were re-using the RowAddress idea for something that isn't a row address. It looked at first like it was the address of the vector within the index partitions. If these are actually RowAddresses as we know them, I agree this makes sense.

Comment on lines +316 to +319
```rust
let minibatch_size = std::env::var("LANCE_SHUFFLE_BATCH_SIZE")
    .unwrap_or("64".to_string())
    .parse::<usize>()
    .unwrap_or(64);
```
Contributor

If someone does specify this but there is a parsing error, I think we should either log or error, rather than silently fall back to the default.
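A minimal sketch of that suggestion (not the crate's actual code), assuming the `log` crate is available: keep the default when the variable is unset, but surface a parse failure instead of silently swallowing it:

```rust
/// Hypothetical helper: read LANCE_SHUFFLE_BATCH_SIZE, defaulting to 64 when
/// unset, but warn (or, alternatively, return an error) when a value is set
/// and fails to parse.
fn shuffle_batch_size() -> usize {
    const DEFAULT: usize = 64;
    match std::env::var("LANCE_SHUFFLE_BATCH_SIZE") {
        Ok(val) => match val.parse::<usize>() {
            Ok(size) => size,
            Err(err) => {
                log::warn!(
                    "Invalid LANCE_SHUFFLE_BATCH_SIZE {:?} ({}); using default {}",
                    val,
                    err,
                    DEFAULT
                );
                DEFAULT
            }
        },
        // Unset (or unreadable) env var: quietly fall back to the default.
        Err(_) => DEFAULT,
    }
}
```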

@westonpace
Contributor Author

Closing as this will be tackled piece by piece in other PRs.

@westonpace closed this Jul 30, 2024