JAVA-3118: Add support for vector data type in Schema Builder, QueryBuilder #1931
base: 4.x
I can't get this to compile. Am I doing something wrong?
Is there a separate ticket for vector similarity functions?
@@ -146,6 +146,8 @@ default Select orderBy(@NonNull String columnName, @NonNull ClusteringOrder orde
    return orderBy(CqlIdentifier.fromCql(columnName), order);
  }

  @NonNull
  Select orderBy(@NonNull Ann ann);
I would consider adding a direction (ASC, DESC) parameter here. We do not currently support DESC vector ordering, but it may become available in the future, and the CQL syntax allows it.
Nice work @SiyaoIsHiding! This is basically what I was expecting to see with this change. We can have a conversation about the API comments, but otherwise there are just a few things to clean up here.
@@ -60,7 +60,8 @@ public String getClassName() {
  @NonNull
  @Override
  public String asCql(boolean includeFrozen, boolean pretty) {
    return String.format("'%s(%d)'", getClassName(), getDimensions());
    return String.format(
        "VECTOR<%s, %d>", this.subtype.asCql(includeFrozen, pretty).toUpperCase(), getDimensions());
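To make the effect of the asCql change above concrete, here is a minimal standalone sketch of the two format strings from the hunk. The class and method names (VectorTypeCqlSketch, legacyCql, vectorCql) are hypothetical, and the class name passed to the legacy form is only an assumed example value; this is not the driver's code. It also pins Locale.ROOT, which the hunk does not do, so the upper-casing does not vary with the JVM default locale.

```java
import java.util.Locale;

// Hypothetical stand-in for illustration only; not part of the driver.
final class VectorTypeCqlSketch {

  // Old format from the hunk: a quoted custom-type string, e.g. 'SomeClass(3)'.
  static String legacyCql(String className, int dimensions) {
    return String.format(Locale.ROOT, "'%s(%d)'", className, dimensions);
  }

  // New format from the hunk: VECTOR<SUBTYPE, dimensions>.
  // Locale.ROOT is an addition in this sketch to keep upper-casing locale-independent.
  static String vectorCql(String subtypeCql, int dimensions) {
    return String.format(
        Locale.ROOT, "VECTOR<%s, %d>", subtypeCql.toUpperCase(Locale.ROOT), dimensions);
  }

  public static void main(String[] args) {
    // The class name below is an assumed example value.
    System.out.println(legacyCql("org.apache.cassandra.db.marshal.VectorType", 3));
    System.out.println(vectorCql("float", 3)); // prints VECTOR<FLOAT, 3>
  }
}
```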

  public static Ann annOf(@NonNull String cqlIdentifier, @NonNull CqlVector<Number> vector) {
    return new DefaultAnn(CqlIdentifier.fromCql(cqlIdentifier), vector);
  }
These will need to be updated when the PR for JAVA-3143 is merged; the CqlVector constraint won't apply once that's in.
I tested this on C* 5.0.1, and it says ANN only supports float vectors.
cqlsh:default_keyspace> insert INTO vt (key, v ) VALUES ( 1, ['a','b']) ;
cqlsh:default_keyspace> select * from vt order by v ann of ['a', 'c'];
InvalidRequest: Error from server: code=2200 [Invalid query] message="ANN ordering is only supported on float vector indexes"
@@ -146,6 +146,8 @@ default Select orderBy(@NonNull String columnName, @NonNull ClusteringOrder orde
    return orderBy(CqlIdentifier.fromCql(columnName), order);
  }

  @NonNull
  Select orderBy(@NonNull Ann ann);
Maybe I'm missing something but it seems more natural to support something like the following:
Select orderByAnnOf(CqlIdentifier columnId, CqlVector ann);
Select orderByAnnOf(String columnName, CqlVector ann);
Advantage is that with this approach you don't even need to introduce an Ann type... which kinda seems right as that type isn't really doing much for you here.
You could also perhaps add a notion of type checking the specified column to make sure it's a vector type (and to make sure it matches the type of the input CqlVector).
To the point made by @lukasz-antoniak above we could add directionality here (and throw warnings if the user tries to use a DESC order before there's server-side support for it) but I'm not sure it's worth it. There's no mention of ordering in the relevant Cassandra docs so my intuition says to just leave it out for now and add it when it becomes more of a thing.
I also agree to save DESC ordering for later because:
- I find it weird to add a feature that does not work yet and that we cannot even test.
- Neither Bret nor I found any relevant docs saying they may support DESC for vector search. @lukasz-antoniak did you find any? I find it hard to imagine in what cases we would want to find the approximate farthest neighbor.
- If we want to add DESC later, we just need to add another function overload, Select orderByAnnOf(String columnName, CqlVector ann, ClusteringOrder order); I assume this is not hard.
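As a rough illustration of the orderByAnnOf shape discussed above, the following standalone sketch renders the ORDER BY ... ANN OF clause as a plain string. Everything here is hypothetical: AnnOrderingSketch, vectorLiteral, and the List<Float> parameter stand in for the driver's Select and CqlVector types, and the column name is emitted as-is without the identifier handling the real query builder would apply.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical stand-in for illustration only; not the driver's Select implementation.
final class AnnOrderingSketch {

  // Renders a float vector as a CQL literal such as [0.1, 0.2, 0.3].
  static String vectorLiteral(List<Float> vector) {
    return vector.stream()
        .map(f -> Float.toString(f))
        .collect(Collectors.joining(", ", "[", "]"));
  }

  // Sketch of the proposed method: no Ann type, just a column name and a vector.
  // A DESC-capable overload could be added later without changing this signature.
  static String orderByAnnOf(String columnName, List<Float> vector) {
    if (vector.isEmpty()) {
      throw new IllegalArgumentException("ANN ordering needs a non-empty vector");
    }
    return "ORDER BY " + columnName + " ANN OF " + vectorLiteral(vector);
  }

  public static void main(String[] args) {
    String clause = orderByAnnOf("v", Arrays.asList(0.1f, 0.2f, 0.3f));
    // prints: SELECT * FROM vt ORDER BY v ANN OF [0.1, 0.2, 0.3] LIMIT 3
    System.out.println("SELECT * FROM vt " + clause + " LIMIT 3");
  }
}
```

The sketch accepts only Float elements, which matches the server behavior reported earlier in this thread (ANN ordering on float vector indexes only).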
    .alterColumn(
        "v",
        DataTypes.custom(
            "org.apache.cassandra.db.marshal.VectorType(org.apache.cassandra.db.marshal.FloatType,3)")))
DataTypes.custom(DataTypes.vectorOf(DataTypes.FLOAT, 3))
deleteFrom("foo")
    .column("v")
    .whereColumn("k")
    .isEqualTo(literal(Arrays.asList(0.1, 0.2))))
This test relies on the fact that vectors happen to be represented using the same syntax as arrays. This test should be changed to use a CqlVector here instead of an array.
@@ -41,6 +42,12 @@ public void should_generate_column_assignments() {
        .hasCql("INSERT INTO foo (a,b) VALUES (?,?)");
  }

  @Test
  public void should_generate_vector_literals() {
    assertThat(insertInto("foo").value("a", literal(Arrays.asList(0.1, 0.2, 0.3))))
As above: this should use a CqlVector rather than an Array

  @Test
  public void should_generate_alter_type_with_vector() {
    assertThat(alterType("foo", "bar").alterField("vec", new DefaultVectorType(DataTypes.FLOAT, 3)))
DataTypes.vectorOf(DataTypes.FLOAT, 3)
assertThat(
    createTable("foo")
        .withPartitionKey("k", DataTypes.INT)
        .withColumn("v", new DefaultVectorType(DataTypes.FLOAT, 3)))
As above:
DataTypes.vectorOf(DataTypes.FLOAT, 3)
assertThat(
    createType("ks1", "type")
        .withField("c1", DataTypes.INT)
        .withField("vec", new DefaultVectorType(DataTypes.FLOAT, 3)))
As above:
DataTypes.vectorOf(DataTypes.FLOAT, 3)
One other thing worth mentioning: the Cassandra impl also supports a way to get "the similarity calculation of the best scoring node closest to the query data as part of the results". Take a look at the similarity_dot_product() function (and the other choices as well) in the relevant Cassandra docs. The query builder should have support for those as well.
The revapi thing is fixed, and the vector similarity function is already supported by the existing code (lines 235 to 274 in 19148d5).
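For reference, the CQL that a similarity function call is expected to produce can be sketched in plain Java as below. The helper name similarityCall and the class SimilarityCallSketch are hypothetical and only build a string; they do not use or mirror the query builder's function-call API referenced above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical stand-in for illustration only; it only concatenates CQL text.
final class SimilarityCallSketch {

  // Renders functionName(column, [v1, v2, ...]) for a float vector argument.
  static String similarityCall(String functionName, String columnName, List<Float> vector) {
    String literal =
        vector.stream()
            .map(f -> Float.toString(f))
            .collect(Collectors.joining(", ", "[", "]"));
    return functionName + "(" + columnName + ", " + literal + ")";
  }

  public static void main(String[] args) {
    String selector =
        similarityCall("similarity_dot_product", "v", Arrays.asList(0.1f, 0.2f, 0.3f));
    // prints: SELECT similarity_dot_product(v, [0.1, 0.2, 0.3]) FROM foo
    System.out.println("SELECT " + selector + " FROM foo");
  }
}
```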
In terms of the spring-ai downstream, as we won't actually break any API, is there anything we should test, and if so, how?
Currently, the SchemaBuilder works with vector like this:
Or
Please let me know if you want something like .withColumn("v", DataTypes.vector(DataTypes.FLOAT, 3)).