JAVA-3118: Add support for vector data type in Schema Builder, QueryBuilder #1931

SiyaoIsHiding · 2024-05-06T19:18:27Z

Currently, the SchemaBuilder works with vector like this:

    assertThat(
            createTable("foo")
                .withPartitionKey("k", DataTypes.INT)
                .withColumn("v", new DefaultVectorType(DataTypes.FLOAT, 3)))
        .hasCql("CREATE TABLE foo (k int PRIMARY KEY,v VECTOR<FLOAT, 3>)");

Or

    assertThat(
            createTable("foo")
                .withPartitionKey("k", DataTypes.INT)
                .withColumn(
                    "v",
                    DataTypes.custom(
                        "org.apache.cassandra.db.marshal.VectorType(org.apache.cassandra.db.marshal.FloatType,3)")))
        .hasCql("CREATE TABLE foo (k int PRIMARY KEY,v VECTOR<FLOAT, 3>)");

Please let me know if you want something like .withColumn("v", DataTypes.vector(DataTypes.FLOAT, 3)).

michaelsembwever · 2024-05-08T13:39:41Z

what's the CASSANDRA ticket for this ?

i will test this downstream here:

michaelsembwever · 2024-06-11T16:48:05Z

i can't get this to compile

[ERROR] Failed to execute goal org.revapi:revapi-maven-plugin:0.10.5:check (default) on project java-driver-query-builder: The following API problems caused the build to fail:
[ERROR] java.method.addedToInterface: method com.datastax.oss.driver.api.querybuilder.select.Select com.datastax.oss.driver.api.querybuilder.select.Select::orderBy(com.datastax.oss.driver.api.querybuilder.select.Ann): Method was added to an interface.
[ERROR]

am i doing something wrong ?
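
For context on the revapi failure above: `java.method.addedToInterface` is reported because a new abstract method on a public interface breaks every existing implementer. One common way to avoid that, sketched below, is to give the new method a `default` body. `SelectLike`, `LegacySelect`, and `DefaultMethodSketch` are hypothetical stand-ins, not driver classes, and a `float[]` stands in for `CqlVector`. This thread does not say whether the PR resolved the check this way or by another route, so treat it as an illustration only.

```java
// Hypothetical stand-ins for illustration; these are NOT the driver's Select types.
interface SelectLike {
  String asCql();

  // Added after the interface was first published. Because it has a default
  // body, classes written against the old interface still compile unchanged.
  default SelectLike orderByAnnOf(String columnName, float[] vector) {
    throw new UnsupportedOperationException("ANN ordering is not implemented here");
  }
}

// An implementer written before orderByAnnOf existed: it only knows asCql().
final class LegacySelect implements SelectLike {
  @Override
  public String asCql() {
    return "SELECT * FROM foo";
  }
}

final class DefaultMethodSketch {
  // Calls the new method and reports whether the implementer supports it.
  static String tryAnn(SelectLike select) {
    try {
      return select.orderByAnnOf("v", new float[] {0.1f, 0.2f}).asCql();
    } catch (UnsupportedOperationException e) {
      return "unsupported";
    }
  }

  public static void main(String[] args) {
    System.out.println(new LegacySelect().asCql());
    System.out.println(tryAnn(new LegacySelect()));
  }
}
```

Had `orderByAnnOf` been declared without a body, `LegacySelect` would no longer compile, which is the kind of break the check guards against.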

michaelsembwever · 2024-06-11T20:16:30Z

Is there a separate ticket for vector similarity functions ?
https://cassandra.apache.org/doc/latest/cassandra/developing/cql/functions.html#vector-similarity-functions

lukasz-antoniak · 2024-08-19T15:02:01Z

query-builder/src/main/java/com/datastax/oss/driver/api/querybuilder/select/Select.java

@@ -146,6 +146,8 @@ default Select orderBy(@NonNull String columnName, @NonNull ClusteringOrder orde
    return orderBy(CqlIdentifier.fromCql(columnName), order);
  }

+  @NonNull
+  Select orderBy(@NonNull Ann ann);


I would consider adding a direction (ASC, DESC) parameter here. DESC vector ordering is not supported at the moment, but it may become available in the future, and the CQL syntax allows it.

absurdfarce

Nice work @SiyaoIsHiding! This is basically what I was expecting to see with this change. We can have a conversation about the comments about the API but otherwise there's just a few things to clean up here.

absurdfarce · 2024-09-09T20:34:46Z

core/src/main/java/com/datastax/oss/driver/internal/core/type/DefaultVectorType.java

@@ -60,7 +60,8 @@ public String getClassName() {
  @NonNull
  @Override
  public String asCql(boolean includeFrozen, boolean pretty) {
-    return String.format("'%s(%d)'", getClassName(), getDimensions());
+    return String.format(
+        "VECTOR<%s, %d>", this.subtype.asCql(includeFrozen, pretty).toUpperCase(), getDimensions());


This should be a lower case "VECTOR" to match what's done with the other collection types (list, set, map)
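
To make the casing point concrete, here is a minimal sketch of the rendering being asked for. `CqlTypeSketch`, `SimpleType`, and `VectorTypeSketch` are hypothetical stand-ins, not the driver's `DataType` hierarchy. The only thing taken from the review is that the keyword and its subtype should come out in lower case, as with `list`, `set`, and `map`.

```java
// Hypothetical stand-ins for illustration; not the driver's DataType classes.
interface CqlTypeSketch {
  String asCql();
}

// A primitive type that renders as its (already lower-case) CQL name.
final class SimpleType implements CqlTypeSketch {
  private final String name;

  SimpleType(String name) {
    this.name = name;
  }

  @Override
  public String asCql() {
    return name;
  }
}

final class VectorTypeSketch implements CqlTypeSketch {
  private final CqlTypeSketch subtype;
  private final int dimensions;

  VectorTypeSketch(CqlTypeSketch subtype, int dimensions) {
    if (dimensions <= 0) {
      throw new IllegalArgumentException("dimensions must be positive");
    }
    this.subtype = subtype;
    this.dimensions = dimensions;
  }

  @Override
  public String asCql() {
    // Lower-case keyword, and no toUpperCase() on the subtype.
    return "vector<" + subtype.asCql() + ", " + dimensions + ">";
  }

  public static void main(String[] args) {
    System.out.println(new VectorTypeSketch(new SimpleType("float"), 3).asCql());
  }
}
```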

absurdfarce · 2024-09-09T20:36:34Z

query-builder/src/main/java/com/datastax/oss/driver/api/querybuilder/QueryBuilder.java

+
+  public static Ann annOf(@NonNull String cqlIdentifier, @NonNull CqlVector<Number> vector) {
+    return new DefaultAnn(CqlIdentifier.fromCql(cqlIdentifier), vector);
+  }


These will need to be updated when the PR for JAVA-3143 is merged; the CqlVector constraint won't apply once that's in.

SiyaoIsHiding · 2024-10-02

I tested out on C* 5.0.1 and it says ANN only supports float.

    cqlsh:default_keyspace> insert INTO vt (key, v ) VALUES ( 1, ['a','b']) ;
    cqlsh:default_keyspace> select * from vt order by v ann of ['a', 'c'];
    InvalidRequest: Error from server: code=2200 [Invalid query] message="ANN ordering is only supported on float vector indexes"
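
Given the server response above, a client-side guard could reject non-float vectors before a query is sent. The sketch below only illustrates that idea: `AnnGuard`, `requireFloatVector`, and `isFloatVector` are hypothetical names, and a plain `List<?>` stands in for `CqlVector`. Nothing in this thread says the PR adds such a check.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; not part of the driver or this PR.
final class AnnGuard {
  // Throws if the vector is empty or holds anything other than Float values.
  static void requireFloatVector(List<?> vector) {
    if (vector.isEmpty()) {
      throw new IllegalArgumentException("ANN vector must not be empty");
    }
    for (Object element : vector) {
      if (!(element instanceof Float)) {
        throw new IllegalArgumentException(
            "ANN ordering is only supported on float vectors, got element: " + element);
      }
    }
  }

  // Convenience wrapper that reports the check as a boolean.
  static boolean isFloatVector(List<?> vector) {
    try {
      requireFloatVector(vector);
      return true;
    } catch (IllegalArgumentException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    System.out.println(isFloatVector(Arrays.asList(0.1f, 0.2f)));
    System.out.println(isFloatVector(Arrays.asList("a", "c")));
  }
}
```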

absurdfarce · 2024-09-20T22:13:02Z

query-builder/src/main/java/com/datastax/oss/driver/api/querybuilder/select/Select.java

@@ -146,6 +146,8 @@ default Select orderBy(@NonNull String columnName, @NonNull ClusteringOrder orde
    return orderBy(CqlIdentifier.fromCql(columnName), order);
  }

+  @NonNull
+  Select orderBy(@NonNull Ann ann);


Maybe I'm missing something but it seems more natural to support something like the following:

    Select orderByAnnOf(CqlIdentifier columnId, CqlVector ann);
    Select orderByAnnOf(String columnName, CqlVector ann);

Advantage is that with this approach you don't even need to introduce an Ann type... which kinda seems right as that type isn't really doing much for you here.

You could also perhaps add a notion of type checking the specified column to make sure it's a vector type (and to make sure it matches the type of the input CqlVector).

To the point made by @lukasz-antoniak above we could add directionality here (and throw warnings if the user tries to use a DESC order before there's server-side support for it) but I'm not sure it's worth it. There's no mention of ordering in the relevant Cassandra docs so my intuition says to just leave it out for now and add it when it becomes more of a thing.

SiyaoIsHiding · 2024-10-03

I also agree with saving DESC ordering for later, because:

It seems odd to add a feature that does not work yet and that we cannot even test.

Neither Bret nor I found any documentation saying that DESC ordering may be supported for vector search. @lukasz-antoniak, did you find any? I find it hard to imagine a case where we would want the approximate farthest neighbor.

If we want to add DESC later, we just need to add another overload: Select orderByAnnOf(String columnName, CqlVector ann, ClusteringOrder order). I assume this is not hard.
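
The overload idea can be sketched with a tiny stand-alone builder. Everything here is a hypothetical stand-in (`AnnSelectSketch`, its `Order` enum, and a `float[]` in place of `CqlVector`), not the driver's API. It only shows that a direction-aware overload can sit next to the two-argument method without changing it. The `DESC` text it emits is illustrative, since the thread notes there is no server-side support for it.

```java
import java.util.StringJoiner;

// Hypothetical mini builder for illustration; not the driver's Select API.
final class AnnSelectSketch {
  enum Order { ASC, DESC }

  private final String table;
  private String orderBy = "";

  AnnSelectSketch(String table) {
    this.table = table;
  }

  // Shape proposed in the review: no direction argument.
  AnnSelectSketch orderByAnnOf(String columnName, float[] vector) {
    return orderByAnnOf(columnName, vector, Order.ASC);
  }

  // Possible later overload; existing callers of the method above are unaffected.
  AnnSelectSketch orderByAnnOf(String columnName, float[] vector, Order order) {
    String direction = (order == Order.DESC) ? " DESC" : "";
    this.orderBy = " ORDER BY " + columnName + " ANN OF " + literal(vector) + direction;
    return this;
  }

  String asCql() {
    return "SELECT * FROM " + table + orderBy;
  }

  // Renders the vector as a bracketed, comma-separated literal.
  private static String literal(float[] vector) {
    StringJoiner joiner = new StringJoiner(", ", "[", "]");
    for (float component : vector) {
      joiner.add(Float.toString(component));
    }
    return joiner.toString();
  }

  public static void main(String[] args) {
    float[] v = {0.1f, 0.2f, 0.3f};
    System.out.println(new AnnSelectSketch("foo").orderByAnnOf("v", v).asCql());
  }
}
```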

absurdfarce · 2024-09-20T22:31:34Z

query-builder/src/test/java/com/datastax/oss/driver/api/querybuilder/schema/AlterTableTest.java

+                .alterColumn(
+                    "v",
+                    DataTypes.custom(
+                        "org.apache.cassandra.db.marshal.VectorType(org.apache.cassandra.db.marshal.FloatType,3)")))


DataTypes.vectorOf(DataTypes.FLOAT, 3)

absurdfarce · 2024-09-20T22:50:58Z

...uilder/src/test/java/com/datastax/oss/driver/api/querybuilder/delete/DeleteSelectorTest.java

+            deleteFrom("foo")
+                .column("v")
+                .whereColumn("k")
+                .isEqualTo(literal(Arrays.asList(0.1, 0.2))))


This test relies on the fact that vectors happen to be represented using the same syntax as arrays. This test should be changed to use a CqlVector here instead of an array.

absurdfarce · 2024-09-20T22:51:18Z

...builder/src/test/java/com/datastax/oss/driver/api/querybuilder/insert/RegularInsertTest.java

@@ -41,6 +42,12 @@ public void should_generate_column_assignments() {
        .hasCql("INSERT INTO foo (a,b) VALUES (?,?)");
  }

+  @Test
+  public void should_generate_vector_literals() {
+    assertThat(insertInto("foo").value("a", literal(Arrays.asList(0.1, 0.2, 0.3))))


As above: this should use a CqlVector rather than an Array

absurdfarce · 2024-09-20T22:52:13Z

query-builder/src/test/java/com/datastax/oss/driver/api/querybuilder/schema/AlterTypeTest.java

+
+  @Test
+  public void should_generate_alter_type_with_vector() {
+    assertThat(alterType("foo", "bar").alterField("vec", new DefaultVectorType(DataTypes.FLOAT, 3)))


DataTypes.vectorOf(DataTypes.FLOAT, 3)

absurdfarce · 2024-09-20T22:52:35Z

...y-builder/src/test/java/com/datastax/oss/driver/api/querybuilder/schema/CreateTableTest.java

+    assertThat(
+            createTable("foo")
+                .withPartitionKey("k", DataTypes.INT)
+                .withColumn("v", new DefaultVectorType(DataTypes.FLOAT, 3)))


As above:

DataTypes.vectorOf(DataTypes.FLOAT, 3)

absurdfarce · 2024-09-20T22:52:49Z

query-builder/src/test/java/com/datastax/oss/driver/api/querybuilder/schema/CreateTypeTest.java

+    assertThat(
+            createType("ks1", "type")
+                .withField("c1", DataTypes.INT)
+                .withField("vec", new DefaultVectorType(DataTypes.FLOAT, 3)))


As above:

DataTypes.vectorOf(DataTypes.FLOAT, 3)

absurdfarce · 2024-09-20T23:09:10Z

One other thing worth mentioning: the Cassandra impl also supports a way to get "the similarity calculation of the best scoring node closest to the query data as part of the results". Take a look at the similarity_dot_product() function (and the other choices as well) in the relevant Cassandra docs. The query builder should have support for those as well.
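
As a rough picture of what is being requested, the sketch below renders a similarity function call as a selector string. `SimilaritySketch` and its `selector` method are hypothetical helpers, not part of the query builder. The three function names are the ones from the Cassandra documentation linked earlier in this thread, and a `float` array stands in for `CqlVector`.

```java
import java.util.Arrays;
import java.util.List;
import java.util.StringJoiner;

// Hypothetical helper for illustration; not part of the query builder.
final class SimilaritySketch {
  private static final List<String> FUNCTIONS =
      Arrays.asList("similarity_cosine", "similarity_euclidean", "similarity_dot_product");

  // Renders function(column,[v1, v2, ...]) for one of the known similarity functions.
  static String selector(String function, String column, float... vector) {
    if (!FUNCTIONS.contains(function)) {
      throw new IllegalArgumentException("Unknown similarity function: " + function);
    }
    StringJoiner literal = new StringJoiner(", ", "[", "]");
    for (float component : vector) {
      literal.add(Float.toString(component));
    }
    return function + "(" + column + "," + literal + ")";
  }

  public static void main(String[] args) {
    System.out.println(selector("similarity_cosine", "comment_vector", 0.1f, 0.2f));
  }
}
```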

SiyaoIsHiding · 2024-10-03T02:02:57Z

The revapi thing is fixed and the vector similarity function is already supported by the existing Function term. I added tests for it as examples:

cassandra-java-driver/query-builder/src/test/java/com/datastax/oss/driver/api/querybuilder/select/SelectSelectorTest.java

Lines 235 to 274 in 19148d5

    public void should_generate_similarity_functions() {
      Select similarity_cosine_clause =
          selectFrom("cycling", "comments_vs")
              .column("comment")
              .function(
                  "similarity_cosine",
                  Selector.column("comment_vector"),
                  literal(CqlVector.newInstance(0.2, 0.15, 0.3, 0.2, 0.05)))
              .orderByAnnOf("comment_vector", CqlVector.newInstance(0.1, 0.15, 0.3, 0.12, 0.05))
              .limit(1);
      assertThat(similarity_cosine_clause)
          .hasCql(
              "SELECT comment,similarity_cosine(comment_vector,[0.2, 0.15, 0.3, 0.2, 0.05]) FROM cycling.comments_vs ORDER BY comment_vector ANN OF [0.1, 0.15, 0.3, 0.12, 0.05] LIMIT 1");
      Select similarity_euclidean_clause =
          selectFrom("cycling", "comments_vs")
              .column("comment")
              .function(
                  "similarity_euclidean",
                  Selector.column("comment_vector"),
                  literal(CqlVector.newInstance(0.2, 0.15, 0.3, 0.2, 0.05)))
              .orderByAnnOf("comment_vector", CqlVector.newInstance(0.1, 0.15, 0.3, 0.12, 0.05))
              .limit(1);
      assertThat(similarity_euclidean_clause)
          .hasCql(
              "SELECT comment,similarity_euclidean(comment_vector,[0.2, 0.15, 0.3, 0.2, 0.05]) FROM cycling.comments_vs ORDER BY comment_vector ANN OF [0.1, 0.15, 0.3, 0.12, 0.05] LIMIT 1");
      Select similarity_dot_product_clause =
          selectFrom("cycling", "comments_vs")
              .column("comment")
              .function(
                  "similarity_dot_product",
                  Selector.column("comment_vector"),
                  literal(CqlVector.newInstance(0.2, 0.15, 0.3, 0.2, 0.05)))
              .orderByAnnOf("comment_vector", CqlVector.newInstance(0.1, 0.15, 0.3, 0.12, 0.05))
              .limit(1);
      assertThat(similarity_dot_product_clause)
          .hasCql(
              "SELECT comment,similarity_dot_product(comment_vector,[0.2, 0.15, 0.3, 0.2, 0.05]) FROM cycling.comments_vs ORDER BY comment_vector ANN OF [0.1, 0.15, 0.3, 0.12, 0.05] LIMIT 1");
    }

As for the spring-ai downstream: since we won't actually break any API, is there anything we should test there, and if so, how?

SiyaoIsHiding added 6 commits May 2, 2024 08:16

INSERT and DELETE working

9e87033

orderBy(annOf("c1", CqlVector.newInstance(0.1, 0.2, 0.3))))

9ea8689

fmt

b035763

DataTypes.custom("org.apache.cassandra.db.marshal.VectorType(org.apache.cassandra.db.marshal.FloatType,3)")

943e3f7

SchemaBuilder add tests

d176f09

Add blank line

953fa47

absurdfarce self-requested a review May 29, 2024 17:37

michaelsembwever self-requested a review June 11, 2024 16:48

lukasz-antoniak reviewed Aug 19, 2024

lukasz-antoniak mentioned this pull request Aug 19, 2024

CASSANDRA-19837: Support ORDER BY ANN in query builder #1946

Closed

SiyaoIsHiding mentioned this pull request Sep 3, 2024

JAVA-3143: Extend driver vector support to arbitrary subtypes and fix handling of variable length types (OSS C* 5.0) #1952

Open

absurdfarce requested changes Sep 20, 2024

SiyaoIsHiding added 5 commits October 2, 2024 11:47

Merge remote-tracking branch 'upstream/4.x' into vector-support

afa054a

fix tests and fmt

009b6e6

refactor and pass tests

ae90a06

revapi

eb800ba

add test of similarity functions

19148d5

SiyaoIsHiding requested review from lukasz-antoniak and absurdfarce October 3, 2024 02:03
