JAVA-3118: Add support for vector data type in Schema Builder, QueryBuilder #1931

Open
wants to merge 11 commits into
base: 4.x

Contributor

@SiyaoIsHiding commented May 6, 2024

Currently, the SchemaBuilder works with vectors like this:

    assertThat(
            createTable("foo")
                .withPartitionKey("k", DataTypes.INT)
                .withColumn("v", new DefaultVectorType(DataTypes.FLOAT, 3)))
        .hasCql("CREATE TABLE foo (k int PRIMARY KEY,v VECTOR<FLOAT, 3>)");

Or

    assertThat(
            createTable("foo")
                .withPartitionKey("k", DataTypes.INT)
                .withColumn(
                    "v",
                    DataTypes.custom(
                        "org.apache.cassandra.db.marshal.VectorType(org.apache.cassandra.db.marshal.FloatType,3)")))
        .hasCql("CREATE TABLE foo (k int PRIMARY KEY,v VECTOR<FLOAT, 3>)");

Please let me know if you want something like .withColumn("v", DataTypes.vector(DataTypes.FLOAT, 3)).

@absurdfarce self-requested a review May 29, 2024 17:37

@michaelsembwever
Member

I can't get this to compile:

[ERROR] Failed to execute goal org.revapi:revapi-maven-plugin:0.10.5:check (default) on project java-driver-query-builder: The following API problems caused the build to fail:
[ERROR] java.method.addedToInterface: method com.datastax.oss.driver.api.querybuilder.select.Select com.datastax.oss.driver.api.querybuilder.select.Select::orderBy(com.datastax.oss.driver.api.querybuilder.select.Ann): Method was added to an interface.
[ERROR]

Am I doing something wrong?

@michaelsembwever self-requested a review June 11, 2024 16:48

@michaelsembwever
Member

Is there a separate ticket for vector similarity functions?
https://cassandra.apache.org/doc/latest/cassandra/developing/cql/functions.html#vector-similarity-functions

@@ -146,6 +146,8 @@ default Select orderBy(@NonNull String columnName, @NonNull ClusteringOrder orde
     return orderBy(CqlIdentifier.fromCql(columnName), order);
   }

+  @NonNull
+  Select orderBy(@NonNull Ann ann);
Member

I would consider adding a direction (ASC, DESC) parameter here. DESC vector ordering is not currently supported, but it may become available in the future, and the CQL syntax already allows it.

Contributor

@absurdfarce left a comment

Nice work @SiyaoIsHiding! This is basically what I was expecting to see with this change. We can discuss the comments about the API, but otherwise there are just a few things to clean up here.

@@ -60,7 +60,8 @@ public String getClassName() {
   @NonNull
   @Override
   public String asCql(boolean includeFrozen, boolean pretty) {
-    return String.format("'%s(%d)'", getClassName(), getDimensions());
+    return String.format(
+        "VECTOR<%s, %d>", this.subtype.asCql(includeFrozen, pretty).toUpperCase(), getDimensions());
Contributor

This should be lower-case "vector" to match what's done with the other collection types (list, set, map).
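A minimal sketch of the rendering being asked for, using a stand-in class rather than the driver's own type (`VectorCqlSketch` and its `asCql` helper are hypothetical names). It assumes the subtype's CQL name is already lower case, as `int` is in the generated `CREATE TABLE` statements in this PR, so no `toUpperCase()` call is needed:

```java
import java.util.Locale;

/** Hypothetical stand-in for the vector type's CQL rendering; not the driver's class. */
final class VectorCqlSketch {

  /** Renders the type with a lower-case keyword, as list, set and map are rendered. */
  static String asCql(String subtypeCql, int dimensions) {
    // Locale.ROOT keeps the digit formatting independent of the JVM's default locale.
    return String.format(Locale.ROOT, "vector<%s, %d>", subtypeCql, dimensions);
  }

  public static void main(String[] args) {
    System.out.println(asCql("float", 3));
  }
}
```

With this change the expected strings in the schema builder tests would read `v vector<float, 3>` instead of `v VECTOR<FLOAT, 3>`.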


    public static Ann annOf(@NonNull String cqlIdentifier, @NonNull CqlVector<Number> vector) {
      return new DefaultAnn(CqlIdentifier.fromCql(cqlIdentifier), vector);
    }
Contributor

These will need to be updated when the PR for JAVA-3143 is merged; the CqlVector constraint won't apply once that's in.

Contributor Author

I tested this on C* 5.0.1, and it reports that ANN ordering only supports float vectors:

cqlsh:default_keyspace> insert INTO vt (key, v ) VALUES ( 1, ['a','b']) ;
cqlsh:default_keyspace> select * from vt order by v ann of ['a', 'c'];
InvalidRequest: Error from server: code=2200 [Invalid query] message="ANN ordering is only supported on float vector indexes"

@@ -146,6 +146,8 @@ default Select orderBy(@NonNull String columnName, @NonNull ClusteringOrder orde
     return orderBy(CqlIdentifier.fromCql(columnName), order);
   }

+  @NonNull
+  Select orderBy(@NonNull Ann ann);
Contributor

Maybe I'm missing something but it seems more natural to support something like the following:

Select orderByAnnOf(CqlIdentifier columnId, CqlVector ann);
Select orderByAnnOf(String columnName, CqlVector ann);

The advantage of this approach is that you don't even need to introduce an Ann type, which seems right, as that type isn't really doing much for you here.

You could also perhaps add a notion of type checking the specified column to make sure it's a vector type (and to make sure it matches the type of the input CqlVector).

To the point made by @lukasz-antoniak above we could add directionality here (and throw warnings if the user tries to use a DESC order before there's server-side support for it) but I'm not sure it's worth it. There's no mention of ordering in the relevant Cassandra docs so my intuition says to just leave it out for now and add it when it becomes more of a thing.
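A rough sketch of how such a method could look, written against stand-in types so it runs on its own. `AnnSelectSketch` is hypothetical and is not the driver's `Select` interface. Passing the column's dimension count explicitly is an assumption of this sketch, standing in for whatever type information the real builder would have:

```java
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical stand-in for a SELECT builder; this is not the driver's Select interface. */
final class AnnSelectSketch {
  private final String table;
  private final String ordering; // null until orderByAnnOf is called

  private AnnSelectSketch(String table, String ordering) {
    this.table = table;
    this.ordering = ordering;
  }

  static AnnSelectSketch selectFrom(String table) {
    return new AnnSelectSketch(table, null);
  }

  /**
   * Orders by approximate nearest neighbor of the given vector. The caller supplies the
   * column's dimension count (an assumption of this sketch), so a mismatched query vector
   * is rejected before any CQL is generated.
   */
  AnnSelectSketch orderByAnnOf(String columnName, int columnDimensions, List<Float> ann) {
    if (ann.size() != columnDimensions) {
      throw new IllegalArgumentException(
          "Column " + columnName + " has " + columnDimensions
              + " dimensions but the query vector has " + ann.size());
    }
    String literal =
        ann.stream().map(f -> Float.toString(f)).collect(Collectors.joining(", ", "[", "]"));
    return new AnnSelectSketch(table, columnName + " ANN OF " + literal);
  }

  String asCql() {
    String cql = "SELECT * FROM " + table;
    return ordering == null ? cql : cql + " ORDER BY " + ordering;
  }
}
```

With this shape, `selectFrom("foo").orderByAnnOf("v", 3, Arrays.asList(0.1f, 0.2f, 0.3f)).asCql()` yields `SELECT * FROM foo ORDER BY v ANN OF [0.1, 0.2, 0.3]`, and no separate `Ann` type is involved.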

Contributor Author

I also agree to save DESC ordering for later because:

  1. I find it weird to add a feature that does not work yet and that we cannot even test.
  2. Neither Bret nor I found any documentation saying that DESC ordering may be supported for vector search. @lukasz-antoniak, did you find any? I find it hard to imagine a case where we would want the approximate farthest neighbor.
  3. If we want to add DESC later, we just need to add another overload, Select orderByAnnOf(String columnName, CqlVector ann, ClusteringOrder order). I assume this is not hard.
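To illustrate the third point, here is a small standalone sketch of adding the direction later as an overload without touching the existing method. `SketchOrder` and `AnnOrderingSketch` are hypothetical stand-ins, not driver types. Rejecting DESC is an assumption made here because neither server-side support nor the exact CQL rendering for it is established in this thread:

```java
/** Hypothetical mirror of the driver's ClusteringOrder, redeclared to keep the sketch standalone. */
enum SketchOrder {
  ASC,
  DESC
}

final class AnnOrderingSketch {

  /** Two-argument form: no direction, which is all that is proposed for now. */
  static String orderByAnnOf(String columnName, String vectorLiteral) {
    return "ORDER BY " + columnName + " ANN OF " + vectorLiteral;
  }

  /** Possible later overload; existing callers of the two-argument form are unaffected. */
  static String orderByAnnOf(String columnName, String vectorLiteral, SketchOrder order) {
    if (order == SketchOrder.DESC) {
      // Assumption: fail fast rather than guess at a CQL rendering that is not documented.
      throw new UnsupportedOperationException("DESC ANN ordering is not supported yet");
    }
    return orderByAnnOf(columnName, vectorLiteral);
  }
}
```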

    .alterColumn(
        "v",
        DataTypes.custom(
            "org.apache.cassandra.db.marshal.VectorType(org.apache.cassandra.db.marshal.FloatType,3)")))
Contributor

DataTypes.vectorOf(DataTypes.FLOAT, 3)

    deleteFrom("foo")
        .column("v")
        .whereColumn("k")
        .isEqualTo(literal(Arrays.asList(0.1, 0.2))))
Contributor

This test relies on the fact that vectors happen to be represented using the same literal syntax as arrays. It should be changed to use a CqlVector here instead of the list built with Arrays.asList.

@@ -41,6 +42,12 @@ public void should_generate_column_assignments() {
         .hasCql("INSERT INTO foo (a,b) VALUES (?,?)");
   }

+  @Test
+  public void should_generate_vector_literals() {
+    assertThat(insertInto("foo").value("a", literal(Arrays.asList(0.1, 0.2, 0.3))))
Contributor

As above: this should use a CqlVector rather than an array.


    @Test
    public void should_generate_alter_type_with_vector() {
      assertThat(alterType("foo", "bar").alterField("vec", new DefaultVectorType(DataTypes.FLOAT, 3)))
Contributor

DataTypes.vectorOf(DataTypes.FLOAT, 3)

    assertThat(
            createTable("foo")
                .withPartitionKey("k", DataTypes.INT)
                .withColumn("v", new DefaultVectorType(DataTypes.FLOAT, 3)))
Contributor

As above:

DataTypes.vectorOf(DataTypes.FLOAT, 3)

    assertThat(
            createType("ks1", "type")
                .withField("c1", DataTypes.INT)
                .withField("vec", new DefaultVectorType(DataTypes.FLOAT, 3)))
Contributor

As above:

DataTypes.vectorOf(DataTypes.FLOAT, 3)

@absurdfarce
Contributor

One other thing worth mentioning: the Cassandra impl also supports a way to get "the similarity calculation of the best scoring node closest to the query data as part of the results". Take a look at the similarity_dot_product() function (and the other choices as well) in the relevant Cassandra docs. The query builder should have support for those as well.

@SiyaoIsHiding
Contributor Author

The revapi issue is fixed, and the vector similarity functions are already supported by the existing Function term. I added tests for them as examples:

    public void should_generate_similarity_functions() {
      Select similarity_cosine_clause =
          selectFrom("cycling", "comments_vs")
              .column("comment")
              .function(
                  "similarity_cosine",
                  Selector.column("comment_vector"),
                  literal(CqlVector.newInstance(0.2, 0.15, 0.3, 0.2, 0.05)))
              .orderByAnnOf("comment_vector", CqlVector.newInstance(0.1, 0.15, 0.3, 0.12, 0.05))
              .limit(1);
      assertThat(similarity_cosine_clause)
          .hasCql(
              "SELECT comment,similarity_cosine(comment_vector,[0.2, 0.15, 0.3, 0.2, 0.05]) FROM cycling.comments_vs ORDER BY comment_vector ANN OF [0.1, 0.15, 0.3, 0.12, 0.05] LIMIT 1");
      Select similarity_euclidean_clause =
          selectFrom("cycling", "comments_vs")
              .column("comment")
              .function(
                  "similarity_euclidean",
                  Selector.column("comment_vector"),
                  literal(CqlVector.newInstance(0.2, 0.15, 0.3, 0.2, 0.05)))
              .orderByAnnOf("comment_vector", CqlVector.newInstance(0.1, 0.15, 0.3, 0.12, 0.05))
              .limit(1);
      assertThat(similarity_euclidean_clause)
          .hasCql(
              "SELECT comment,similarity_euclidean(comment_vector,[0.2, 0.15, 0.3, 0.2, 0.05]) FROM cycling.comments_vs ORDER BY comment_vector ANN OF [0.1, 0.15, 0.3, 0.12, 0.05] LIMIT 1");
      Select similarity_dot_product_clause =
          selectFrom("cycling", "comments_vs")
              .column("comment")
              .function(
                  "similarity_dot_product",
                  Selector.column("comment_vector"),
                  literal(CqlVector.newInstance(0.2, 0.15, 0.3, 0.2, 0.05)))
              .orderByAnnOf("comment_vector", CqlVector.newInstance(0.1, 0.15, 0.3, 0.12, 0.05))
              .limit(1);
      assertThat(similarity_dot_product_clause)
          .hasCql(
              "SELECT comment,similarity_dot_product(comment_vector,[0.2, 0.15, 0.3, 0.2, 0.05]) FROM cycling.comments_vs ORDER BY comment_vector ANN OF [0.1, 0.15, 0.3, 0.12, 0.05] LIMIT 1");
    }

As for the spring-ai downstream: since we won't actually break any API, is there anything we should test, and if so, how?

4 participants