First-class random access API for KnnVectorValues #13779

Merged: 28 commits, Sep 28, 2024

@msokolov msokolov commented Sep 12, 2024

addresses #13778

Key things in this PR:

  1. Introduces abstract KnnVectorValues from which ByteVectorValues and FloatVectorValues derive;
  2. Folds RandomAccessVectorValues into KnnVectorValues thus eliminating some casts.
  3. RandomAccessVectorValues.Floats becomes FloatVectorValues and RandomAccessVectorValues.Bytes becomes ByteVectorValues. RandomAccessQuantizedByteVectorValues folded into QuantizedByteVectorValues.
  4. IndexInput getSlice() moved to a new HasIndexSlice interface.
  5. Introduces VectorEncoding KnnVectorValues.getEncoding() to enable type-specific branches in a few places where we are dealing with abstract KnnVectorValues (tests only IIRC). Also used that to provide a default getVectorByteLength().
  6. KnnVectorValues no longer extends DocIdSetIterator; rather, it provides a tightly-coupled iterator(). This enables refactoring common iteration patterns that were repeated many times in the code base. This new iterator, DocIndexIterator, provides an additional method index(), analogous to IndexedDISI.
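
The split between docid iteration and ordinal-based access described in item 6 can be pictured with a toy model. This is a minimal sketch, not Lucene's actual classes: `ToyFloatVectorValues`, `ToyDocIndexIterator`, and the `ordToDoc` array are hypothetical stand-ins. It only illustrates vectors being addressed by a dense ordinal while the iterator walks docids and reports the current ordinal through `index()`.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy stand-in for an iterator over docs that also exposes the vector ordinal. */
interface ToyDocIndexIterator {
  int NO_MORE_DOCS = Integer.MAX_VALUE;

  /** Advances to the next document that has a vector, or returns NO_MORE_DOCS. */
  int nextDoc();

  /** Ordinal of the current document's vector (hypothetical analogue of index()). */
  int index();
}

/** Toy stand-in for random-access vector values over a sparse set of docs. */
final class ToyFloatVectorValues {
  private final float[][] vectors; // indexed by ordinal
  private final int[] ordToDoc; // assumed strictly increasing

  ToyFloatVectorValues(float[][] vectors, int[] ordToDoc) {
    this.vectors = vectors;
    this.ordToDoc = ordToDoc;
  }

  float[] vectorValue(int ord) {
    return vectors[ord];
  }

  /** Returns a fresh iterator on every call; no iterator state lives in the values. */
  ToyDocIndexIterator iterator() {
    return new ToyDocIndexIterator() {
      private int ord = -1;

      @Override
      public int nextDoc() {
        if (ord + 1 >= ordToDoc.length) {
          ord = ordToDoc.length;
          return NO_MORE_DOCS;
        }
        ord++;
        return ordToDoc[ord];
      }

      @Override
      public int index() {
        return ord;
      }
    };
  }
}

final class ToyIterationDemo {
  /** Collects "doc:firstComponent" strings using the shared iteration pattern. */
  static List<String> collect(ToyFloatVectorValues values) {
    List<String> out = new ArrayList<>();
    ToyDocIndexIterator it = values.iterator();
    for (int doc = it.nextDoc(); doc != ToyDocIndexIterator.NO_MORE_DOCS; doc = it.nextDoc()) {
      out.add(doc + ":" + values.vectorValue(it.index())[0]);
    }
    return out;
  }

  public static void main(String[] args) {
    ToyFloatVectorValues values =
        new ToyFloatVectorValues(new float[][] {{1f}, {2f}, {3f}}, new int[] {0, 4, 9});
    System.out.println(collect(values));
  }
}
```

The loop in `collect` is the kind of repeated pattern the PR description says can now be written once against the abstract type.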

Some of the methods on KnnVectorValues have default implementations that throw UnsupportedOperationException, enabling subclasses to provide partial implementations and relying on testing to catch missing required methods. I'd like feedback on this. Should we provide implementations we never use, just to make these classes complete? That didn't make sense to me. But the previous alternative of attempting strict adherence to declarative contracts was becoming, in my view, overly restrictive and leading to hard-to-maintain code.

Some of these readers would only ever be used iteratively. Random access is required for search, but it is not used when merging the values themselves; when we merge we do search, but using a temporary file, so searching is always done over a file-based value. Random access also gets used during merging when the index is sorted, but again this is provided by specialized readers, so not every reader needs to implement random access. The API maintenance is greatly simplified if we allow partial implementation. Anyway, that is the idea I am trying out here. Can we live with a little less API purity and gain some simplicity and less boilerplate?
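
To make the trade-off concrete, here is a minimal sketch of the "partial implementation" pattern. The class names (`PartialVectorValues`, `InMemoryOnlyValues`) are hypothetical and not part of the PR; the sketch only shows optional operations defaulting to UnsupportedOperationException so that a subclass overrides just what it needs.

```java
/** Toy base class: optional operations throw unless a subclass overrides them. */
abstract class PartialVectorValues {
  abstract int size();

  /** Random access; optional in this sketch. */
  float[] vectorValue(int ord) {
    throw new UnsupportedOperationException("random access not supported");
  }

  /** Independent copy; optional in this sketch. */
  PartialVectorValues copy() {
    throw new UnsupportedOperationException("copy not supported");
  }
}

/** A source that supports random access but never needs copy(). */
final class InMemoryOnlyValues extends PartialVectorValues {
  private final float[][] vectors;

  InMemoryOnlyValues(float[][] vectors) {
    this.vectors = vectors;
  }

  @Override
  int size() {
    return vectors.length;
  }

  @Override
  float[] vectorValue(int ord) {
    return vectors[ord];
  }
}

final class PartialImplDemo {
  /** Returns true when copy() falls through to the throwing default. */
  static boolean copyIsUnsupported(PartialVectorValues values) {
    try {
      values.copy();
      return false;
    } catch (UnsupportedOperationException e) {
      return true;
    }
  }

  public static void main(String[] args) {
    PartialVectorValues values = new InMemoryOnlyValues(new float[][] {{1f, 2f}});
    System.out.println("size=" + values.size() + " copyUnsupported=" + copyIsUnsupported(values));
  }
}
```

The cost of this design is that a missing override surfaces only at run time, which is why the description leans on test coverage to catch it.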

Notes for reviewers:

There is a lot of code change here, but much of it is repetitive. I recommend starting with KnnVectorValues and checking its DocIndexIterator inner class. The rest of the changes are basically consequences of introducing those abstractions in place of the Random*Values we removed.

@msokolov
Contributor Author

Another concern I have is how this would impact ongoing work to enable multiple vectors per doc/field. There would almost certainly be conflicts with that PR on the surface, but I hope this could actually simplify things, in that the new DocIndexIterator class could be enhanced / extended to provide access to a series of values (maybe a list or array?) instead of (or in addition to?) a single one, possibly centralizing the required changes (since we have many fewer iterator implementations after this change).

@benwtrent
Member

> but I hope this could actually simplify things

That is my intuition as well.

Contributor

@jpountz jpountz left a comment

I left a few thoughts/questions. In general, I see how such a random-access API change could help with e.g. your BP reordering work and be valuable in general. I was wondering if this API may be too tailored to HNSW and prevent us from supporting other interesting algorithms, but actually I don't think that this is the case?

* Creates a new copy of this {@link KnnVectorValues}. This is helpful when you need to access
* different values at once, to avoid overwriting the underlying vector returned.
*/
public abstract KnnVectorValues copy() throws IOException;
Contributor

I wonder if we could make the API a bit nicer by removing this copy() and instead have something like a FloatVectorDictionary { float[] vectorValue(int ord); } and a method here that can return a new FloatVectorDictionary (a bit like SortedDocValues and TermsEnum).

Contributor Author

The way SortedDocValuesTermsEnum works, calling its next method will overwrite the internal buffer of the SortedDocValues on which it is built, defeating the purpose of copy(), which is to provide two completely independent sources. Another thing we could do is add vectorValue(int ord, float[] scratch), allowing the caller to provide the memory to write to. If we had that, we wouldn't need copy(). Maybe we could manage to squeeze that into 10.0 too, but I'd rather do it in a separate PR.
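
The aliasing problem behind copy(), and the scratch-array alternative floated above, can be sketched with a toy reader. `SharedBufferReader` and everything in it is hypothetical; it models a reader that decodes into one reused buffer, which is an assumption about how such readers behave, not code from the PR.

```java
/** Toy reader that decodes every vector into one internal, reused buffer. */
final class SharedBufferReader {
  private final float[][] stored;
  private final float[] buffer;

  SharedBufferReader(float[][] stored, int dimension) {
    this.stored = stored;
    this.buffer = new float[dimension];
  }

  /** Returns the internal buffer; the next call overwrites its contents. */
  float[] vectorValue(int ord) {
    System.arraycopy(stored[ord], 0, buffer, 0, buffer.length);
    return buffer;
  }

  /** Hypothetical scratch-based variant: the caller owns the destination array. */
  float[] vectorValue(int ord, float[] scratch) {
    System.arraycopy(stored[ord], 0, scratch, 0, scratch.length);
    return scratch;
  }

  /** Independent reader with its own buffer, similar in spirit to copy(). */
  SharedBufferReader copy() {
    return new SharedBufferReader(stored, buffer.length);
  }
}

final class ScratchDemo {
  private static SharedBufferReader newReader() {
    return new SharedBufferReader(new float[][] {{1f, 2f}, {3f, 4f}}, 2);
  }

  /** Two reads from one reader return the same array, so the first value is lost. */
  static boolean aliasesWithoutCopy() {
    SharedBufferReader reader = newReader();
    float[] a = reader.vectorValue(0);
    float[] b = reader.vectorValue(1);
    return a == b && a[0] == 3f;
  }

  /** Reading through a copy keeps both values alive at once. */
  static boolean independentWithCopy() {
    SharedBufferReader reader = newReader();
    float[] a = reader.vectorValue(0);
    float[] b = reader.copy().vectorValue(1);
    return a[0] == 1f && b[0] == 3f;
  }

  /** Caller-provided scratch arrays achieve the same without a second reader. */
  static boolean independentWithScratch() {
    SharedBufferReader reader = newReader();
    float[] a = reader.vectorValue(0, new float[2]);
    float[] b = reader.vectorValue(1, new float[2]);
    return a[0] == 1f && b[0] == 3f;
  }

  public static void main(String[] args) {
    System.out.println(aliasesWithoutCopy() + " " + independentWithCopy() + " " + independentWithScratch());
  }
}
```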

Contributor

But if you call SortedDocValues#termsEnum twice, this would give you two independent sources of terms?

Contributor

I always found copy very strange, but I get why it is there. I'd be tempted to leave it as is in this PR, changing the access model and cache of 1 float[] will be a bit tricky.

if (iterator == null) {
iterator = createIterator();
}
return iterator;
Contributor

Could we make this return a new iterator every time to make the API a bit nicer? From a quick look, it seems that call sites could easily be adjusted to not rely on this method returning a shared instance?

Contributor Author

Let me try - I was also a bit unhappy about this, but at one point along this journey I was relying on being able to recover the shared state - maybe I finally was able to get rid of that and just didn't notice!

Contributor

a new iterator would be cleaner, if the use sites allow for it.

* Creates an iterator from this instance's ordinal-to-docid mapping which must be monotonic
* (docid increases when ordinal does).
*/
protected DocIndexIterator fromOrdToDoc() {
Contributor

nit: could we make it look a bit more like DocIdSetIterator#all by moving it to DocIndexIterator#all?

Contributor Author

ah, you mean rename this method to all? sure, makes sense

@Override
public int advance(int target) throws IOException {
return slowAdvance(target);
}
Contributor

This looks like it could be a performance trap, which is why DocIdSetIterator offers this helper method without making it the default impl. Should we leave it without a default impl here too?

Contributor Author

yes, I don't think anything relies on this, makes sense
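
For readers unfamiliar with why a linear default for advance() is considered a trap, here is a small self-contained comparison. It is a sketch under assumptions: `AdvanceDemo` and the sorted `ordToDoc` array are illustrative, and the binary search stands in for whatever faster strategy a concrete implementation might have. Both methods return the ordinal of the first doc at or after the target.

```java
import java.util.Arrays;

final class AdvanceDemo {
  /** Linear scan from ordinal {@code from}: cost grows with the distance skipped. */
  static int linearAdvance(int[] ordToDoc, int from, int target) {
    int ord = from;
    while (ord < ordToDoc.length && ordToDoc[ord] < target) {
      ord++;
    }
    return ord;
  }

  /** Binary search over the strictly increasing ordToDoc: logarithmic cost. */
  static int binaryAdvance(int[] ordToDoc, int from, int target) {
    int pos = Arrays.binarySearch(ordToDoc, from, ordToDoc.length, target);
    // A negative result encodes the insertion point as -(insertionPoint) - 1.
    return pos >= 0 ? pos : -pos - 1;
  }

  public static void main(String[] args) {
    int[] ordToDoc = {2, 5, 9, 14};
    System.out.println(linearAdvance(ordToDoc, 0, 6) + " " + binaryAdvance(ordToDoc, 0, 6));
  }
}
```

Both give the same answer; the difference is only in how much work is done when the target is far ahead, which a silent default would hide from implementers.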


@Override
public long cost() {
throw new UnsupportedOperationException();
Contributor

Likewise here, I'd rather leave it unimplemented to force implementers to decide if having cost() throw an exception is fine. Presumably, most of the time it's not.

Contributor Author

hmm I think cost() is rarely used in the vector reader/writers which instead are concerned with KnnVectorValues.size() -- they typically want to know how many vector values there are and to the extent they care about the number of docs it's only when they must iterate through all of them and have no use for an estimate. These iterators aren't really used during searching?

Contributor

If we default cost() to returning size(), that would work for me. But I'm not comfortable with having implementations of DocIdSetIterator#cost that may throw, which means e.g. that they cannot be put in a ConjunctionDISI (which will want to sort its clauses by cost).

Member

I agree here. Either it should default to size() via some provided dependency or it shouldn't implement at all and force sub-classes.
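
The concern about a throwing cost() can be illustrated with a toy version of "sort clauses by cost". None of these classes are Lucene's; `CostedIterator`, `SizeBackedIterator`, and `ThrowingCostIterator` are hypothetical names, and the sort is only a stand-in for what a conjunction would do when choosing its lead clause.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Toy iterator abstraction exposing only the cost estimate. */
abstract class CostedIterator {
  abstract long cost();
}

/** Cost backed by the number of vectors, i.e. the "default to size()" option. */
final class SizeBackedIterator extends CostedIterator {
  private final int size;

  SizeBackedIterator(int size) {
    this.size = size;
  }

  @Override
  long cost() {
    return size;
  }
}

/** The option under debate: cost() is left throwing. */
final class ThrowingCostIterator extends CostedIterator {
  @Override
  long cost() {
    throw new UnsupportedOperationException("cost");
  }
}

final class CostOrderingDemo {
  /** Sorts clauses cheapest-first and returns their costs in that order. */
  static List<Long> costsInLeadOrder(List<CostedIterator> clauses) {
    List<CostedIterator> sorted = new ArrayList<>(clauses);
    sorted.sort(Comparator.comparingLong(CostedIterator::cost));
    List<Long> costs = new ArrayList<>();
    for (CostedIterator it : sorted) {
      costs.add(it.cost());
    }
    return costs;
  }

  /** Returns true when ordering the clauses fails because one cost() throws. */
  static boolean orderingFails(List<CostedIterator> clauses) {
    try {
      costsInLeadOrder(clauses);
      return false;
    } catch (UnsupportedOperationException e) {
      return true;
    }
  }

  public static void main(String[] args) {
    System.out.println(costsInLeadOrder(List.of(new SizeBackedIterator(100), new SizeBackedIterator(3))));
  }
}
```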

@jpountz
Contributor

jpountz commented Sep 12, 2024

Am I guessing correctly that you're targeting 10.0 for this change?

@msokolov
Contributor Author

Thanks for the quick review! I will get started on addressing the comments. As for the timeline for this change, it would definitely be convenient to get it into the 10.0 release. I think you had said 9/22 would be a feature freeze date; it seems we could possibly meet that timeline. I will be traveling starting tomorrow for a week, but I should be able to put in some quality time on the plane LOL

public byte[] vectorValue() throws IOException {
return current.values.vectorValue();
public byte[] vectorValue(int ord) throws IOException {
return current.values.vectorValue(current.values.iterator().index());
Contributor

This part feels a bit hacky, could we instead merge the ord->vector mappings of the various vector values by concatenating them?

Contributor Author

Maybe we can enhance DocIDMerger by adding random access to it

@jpountz
Contributor

jpountz commented Sep 13, 2024

> I think you had said 9/22 would be a feature freeze date

I was thinking of doing it next week, but we can backport this PR even though the branch has been cut if it looks ready/safe.

Contributor

@ChrisHegarty ChrisHegarty left a comment

I really like this change. I see a lot of refactoring similar to what I half started at one point or the other, but never finished. There are some specific comments to be addressed, but otherwise the approach LGTM.

@Override
RandomAccessQuantizedByteVectorValues copy() throws IOException;
/** Returns an IndexInput from which to read this instance's values. */
IndexInput getSlice();
Contributor

I very much like this, and had something similar in a past unmarked PR. 👍

@msokolov
Contributor Author

I pushed a new revision here addressing some of the major comments:

  1. KnnVectorValues.iterator() now generally provides a new iterator; no caching is done. I removed createIterator(). Main impact was on VectorScorer (and in tests) where we now create iterators and store them locally. This is much better; thanks for the feedback.
  2. I added implementations for advance() and got rid of the default impl.
  3. I removed impls of cost() and added a default impl that throws UOE. This method is only ever used during search() and most of these values sources will never be searched. The exceptions are those that can be used by the ValueSource API: basically the indexed values returned by a reader. We have lots and lots of other values impls that are used during indexing for which we don't need cost. I briefly considered separating these new iterators from DISI, but that ended up in some trouble.
  4. re: getVectorByteLength() @ChrisHegarty is correct that this is needed as it is today. We could in theory make it final (or inline it whatever) if we added some more VectorEncodings to represent the compressed cases, but I'm inclined to leave it as is. This way we could in theory support a variable size encoding? And anyway it isn't clear we want to mix up the "encoding" with compression?

I didn't have a chance to look seriously at removing copy() API. I don't think we ought to create a simple wrapper though since afaict it would require an additional memory copy of every vector value.

@msokolov
Contributor Author

msokolov commented Sep 16, 2024

OK there seem to be some test failures ... I did a complete run, but randomized testing always seems to ferret out something interesting!

Actually those really should have failed on any test run -- not sure how I missed them, oops

@msokolov
Contributor Author

Regarding the rename of fromOrdToDoc to all: I think it was not helpful, and I plan to revert it or maybe come up with some other name. The problem is that we also have createDenseIterator, which is also "all". Essentially we have sparse and dense all-iterators. Maybe instead of fromOrdToDoc we can say createSparseIterator?

@jpountz
Contributor

jpountz commented Sep 16, 2024

FWIW I started playing with removing copy() by replacing it with a factory method for a dictionary: msokolov@ae7aca3. Not sure how far I'll go. :)

@msokolov
Contributor Author

I'll post one more iteration here addressing the concerns about dangerous default impls that adds back impls of copy() and cost(). I also added a test-and-throw ensuring that the vectorValues impls that require forward-iteration enforce it. We can fully implement random access later without breaking any APIs.
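
One way to picture a "test-and-throw" guard for forward-only sources is the sketch below. It is an assumption about the general shape of such a check, not the code added in the PR; `ForwardOnlyValues` and its `lastOrd` bookkeeping are hypothetical.

```java
/** Toy forward-only source: ordinals may only be requested in non-decreasing order. */
final class ForwardOnlyValues {
  private final float[][] stored;
  private int lastOrd = -1;

  ForwardOnlyValues(float[][] stored) {
    this.stored = stored;
  }

  float[] vectorValue(int ord) {
    // Test-and-throw: reject any request that would require moving backwards.
    if (ord < lastOrd) {
      throw new IllegalStateException(
          "forward-only source: ord " + ord + " requested after " + lastOrd);
    }
    lastOrd = ord;
    return stored[ord];
  }
}

final class ForwardOnlyDemo {
  /** Returns true when a backwards request is rejected. */
  static boolean backwardAccessThrows() {
    ForwardOnlyValues values = new ForwardOnlyValues(new float[][] {{1f}, {2f}, {3f}});
    values.vectorValue(0);
    values.vectorValue(2);
    try {
      values.vectorValue(1);
      return false;
    } catch (IllegalStateException e) {
      return true;
    }
  }

  public static void main(String[] args) {
    System.out.println(backwardAccessThrows());
  }
}
```

A guard like this keeps the method signature random-access-shaped, so a later full implementation would not need an API change.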

I also think we should go ahead with Adrien's Dictionary idea, but do this in two steps because there is a lot going on here already.

@benwtrent
Member

The dictionary idea is OK, but I still don't see how it removes copy(). Besides the caching of values, copy gives us multi-threaded safety by copying the underlying index readers. Otherwise we are using the same reader between threads. For concurrent merging of graphs, this is important.

I agree, any further refactoring should be done in another PR.

@msokolov
Contributor Author

I think the idea w/Dictionary is that callers, instead of calling copy().vectorValue(int ord) would call dictionary().vectorValue(int ord). So then the scratch vector storage (if needed) would be in the Dictionary not in the VectorValues, and thus not shared by multiple users of the same values instance. In some sense it's not very different, but in the sense that the Dictionary has a much more limited API than the source it came from, it is different.

@jpountz
Contributor

jpountz commented Sep 20, 2024

Exactly. I tried to model it similarly to what doc values do, where SortedDocValues#termsEnum() returns a dictionary with a different backing IndexInput clone on every call.
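
A rough sketch of the dictionary() idea as discussed in this thread follows. The names `DictionaryBackedValues` and `VectorDictionary` are hypothetical, and a per-dictionary scratch array stands in for the per-call IndexInput clone mentioned above; the real design was deferred to a follow-up issue.

```java
/** Toy values source whose per-reader state lives in dictionaries it hands out. */
final class DictionaryBackedValues {
  private final float[][] stored;
  private final int dimension;

  DictionaryBackedValues(float[][] stored, int dimension) {
    this.stored = stored;
    this.dimension = dimension;
  }

  /** Hypothetical factory: every call returns a dictionary with its own scratch space. */
  VectorDictionary dictionary() {
    return new VectorDictionary();
  }

  /** Deliberately narrow API: random access by ordinal and nothing else. */
  final class VectorDictionary {
    private final float[] scratch = new float[dimension];

    float[] vectorValue(int ord) {
      System.arraycopy(stored[ord], 0, scratch, 0, dimension);
      return scratch;
    }
  }
}

final class DictionaryDemo {
  /** Two dictionaries from the same source do not overwrite each other's values. */
  static boolean dictionariesAreIndependent() {
    DictionaryBackedValues values =
        new DictionaryBackedValues(new float[][] {{1f, 2f}, {3f, 4f}}, 2);
    float[] a = values.dictionary().vectorValue(0);
    float[] b = values.dictionary().vectorValue(1);
    return a != b && a[0] == 1f && b[0] == 3f;
  }

  public static void main(String[] args) {
    System.out.println(dictionariesAreIndependent());
  }
}
```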

@msokolov
Contributor Author

OK I think we've addressed the blocking concerns that have been raised here and I plan to push later today if nothing else comes up. Regarding removing copy() in favor of dictionary() I'll open a separate issue. If Adrien finishes it up, great, but I may also see if I can find time to take that forward soon; it would be good to get it done for 10 since it would be a breaking change and ideally we don't want copy() to linger as deprecated. As for implementing better random access in merged values I think we can take that up at a more relaxed pace since it doesn't require any API change.

@msokolov
Contributor Author

hm interesting there was an EOFException in there - I'll dig

@msokolov
Contributor Author

OK, I found an off-by-one error plus a problem with lazy iterator creation that slipped in when we got rid of createIterator(). It makes me a little nervous these didn't show up in earlier testing. I'm now running with tests.iter=20

@msokolov
Contributor Author

OK, I think this is ready after a few minor issues have been addressed. I opened #13831 to track replacing copy() with dictionary()

@msokolov msokolov merged commit 6053e1e into apache:main Sep 28, 2024
4 checks passed
@msokolov msokolov deleted the knn-vector-random branch September 28, 2024 13:14
@javanna
Contributor

javanna commented Sep 29, 2024

Should there be a migrate entry added with this change?

@msokolov
Contributor Author

> Should there be a migrate entry added with this change?

oh thanks, yes, and a CHANGES entry. I opened #13833 if you want to review

javanna added a commit to javanna/elasticsearch that referenced this pull request Sep 30, 2024
Our lucene_snapshot branch requires updating after apache/lucene#13779
@javanna javanna added this to the 10.0.0 milestone Sep 30, 2024
javanna added a commit to elastic/elasticsearch that referenced this pull request Sep 30, 2024
benwtrent added a commit that referenced this pull request Oct 2, 2024
…13850)

introduced in the major refactor #13779

Off-heap scoring is only present for byte[] vectors, and it isn't enough to verify that the vector provider also satisfies the HasIndexSlice interface. The vectors need to be byte vectors; otherwise, the slice iterations and scoring are completely nonsensical, causing HNSW graph building to run until the heat-death of the universe.
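
The shape of the fix described in this commit message can be sketched as a two-part check. This is a toy model only: `ToyEncoding`, `ToyHasSlice`, and the other names are hypothetical stand-ins for the real types, and the byte arrays stand in for an index slice.

```java
/** Toy encodings, standing in for the real vector encodings. */
enum ToyEncoding {
  BYTE,
  FLOAT32
}

/** Hypothetical marker for sources that can expose their raw bytes. */
interface ToyHasSlice {
  byte[] slice();
}

abstract class ToySliceValues {
  abstract ToyEncoding encoding();
}

/** Byte vectors stored off-heap: eligible for the byte-only scoring path. */
final class ToyOffHeapBytes extends ToySliceValues implements ToyHasSlice {
  @Override
  ToyEncoding encoding() {
    return ToyEncoding.BYTE;
  }

  @Override
  public byte[] slice() {
    return new byte[] {1, 2, 3, 4};
  }
}

/** Float vectors stored off-heap: they also expose a slice, but are not bytes. */
final class ToyOffHeapFloats extends ToySliceValues implements ToyHasSlice {
  @Override
  ToyEncoding encoding() {
    return ToyEncoding.FLOAT32;
  }

  @Override
  public byte[] slice() {
    return new byte[16];
  }
}

final class OffHeapCheckDemo {
  /** Insufficient check: only asks whether a slice is available. */
  static boolean sliceOnlyCheck(ToySliceValues values) {
    return values instanceof ToyHasSlice;
  }

  /** Stricter check: a slice must be available and the encoding must be BYTE. */
  static boolean canScoreOffHeap(ToySliceValues values) {
    return values instanceof ToyHasSlice && values.encoding() == ToyEncoding.BYTE;
  }

  public static void main(String[] args) {
    System.out.println(sliceOnlyCheck(new ToyOffHeapFloats()) + " " + canScoreOffHeap(new ToyOffHeapFloats()));
  }
}
```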
6 participants