Add VectorType support to CDC connector #170

Merged: 2 commits, Jul 10, 2023


aymkhalil
Contributor

Minimal CDC connector patch to support the vector type. Please note:

  • I had to make custom changes to the agent locally in order to run against the C* vsearch branch. However, this patch only adds support on the Connector. A new agent version will be required once the DSE7 (dse-db) and C*5 (cassandra-all) artifacts are released. As long as the Vector Type is not part of the Primary Key, the agent doesn't require any changes.
  • Integration tests need Docker images to be built. I will send follow-up PRs as those images become available.
  • The codecs PR from the messaging commons package was not necessary, as the VectorType is read explicitly via the row.getCqlVector(fieldName) API.
  • The resulting schema on the data topic is captured in this example (note the item_vector field in the value section):
{
  "version": 0,
  "schemaInfo": {
    "name": "data-ks1.embeddings",
    "schema": {
      "key": {
        "name": "embeddings",
        "schema": {
          "type": "record",
          "name": "embeddings",
          "namespace": "ks1",
          "doc": "Table ks1.embeddings",
          "fields": [
            {
              "name": "id",
              "type": "int"
            }
          ]
        },
        "type": "AVRO",
        "timestamp": 0,
        "properties": {}
      },
      "value": {
        "name": "embeddings",
        "schema": {
          "type": "record",
          "name": "embeddings",
          "namespace": "ks1",
          "doc": "Table ks1.embeddings",
          "fields": [
            {
              "name": "item_vector",
              "type": [
                "null",
                {
                  "type": "array",
                  "items": "float"
                }
              ]
            },
            {
              "name": "value",
              "type": [
                "null",
                "string"
              ]
            }
          ]
        },
        "type": "AVRO",
        "timestamp": 0,
        "properties": {}
      }
    },
    "type": "KEY_VALUE",
    "timestamp": 1688589573106,
    "properties": {
      "key.schema.name": "embeddings",
      "key.schema.properties": "{}",
      "key.schema.type": "AVRO",
      "kv.encoding.type": "SEPARATED",
      "value.schema.name": "embeddings",
      "value.schema.properties": "{}",
      "value.schema.type": "AVRO"
    }
  }
}
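
For readers who want to see how a vector column could end up as the nullable `array` of `float` shown in the value schema, here is a minimal sketch. It is not the code from this patch: `VectorToAvroSketch` and `toAvroFloatArray` are hypothetical names, and the sketch assumes the vector's components can be iterated as `Float` values. It uses plain Java collections only, so the Cassandra driver and Avro classes are deliberately left out.

```java
import java.util.ArrayList;
import java.util.List;

public class VectorToAvroSketch {

    // Hypothetical helper, not taken from the patch: copies the vector's
    // components into a List<Float>, the Java shape of an Avro array of float.
    // A null column stays null, matching the "null" branch of the union.
    public static List<Float> toAvroFloatArray(Iterable<Float> vector) {
        if (vector == null) {
            return null;
        }
        List<Float> items = new ArrayList<>();
        for (Float component : vector) {
            items.add(component);
        }
        return items;
    }

    public static void main(String[] args) {
        // Stand-in for the value the connector would read via
        // row.getCqlVector("item_vector"); the list here is made-up sample data.
        Iterable<Float> itemVector = List.of(0.1f, 0.2f, 0.3f);
        System.out.println(toAvroFloatArray(itemVector));
    }
}
```

Keeping the null check in the conversion mirrors the `["null", {"type": "array", "items": "float"}]` union above, so rows without an `item_vector` value would still fit the schema.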

@aymkhalil aymkhalil marked this pull request as ready for review July 10, 2023 15:07
Collaborator

@eolivelli eolivelli left a comment

LGTM, but please note that CqlVectorType will change to VectorType in the next major release of the Cassandra Driver.

For the current version, I think this is good.

@aymkhalil aymkhalil merged commit 3fd76e0 into master Jul 10, 2023
@eolivelli eolivelli deleted the connector-vector-type branch July 11, 2023 07:44