Design: Range queries V1.1 Range over value fields #114

Closed
raminqaf opened this issue Oct 13, 2022 · 0 comments
Labels
type/design Design documents for enhancements

Design: Range queries V1.1 Range over value fields

Development: 0.9

last update: 28.10.2022


This issue describes our approach to supporting range queries over value fields in Quick.

Problem definition

Imagine a scenario in which we want to analyze a user's click counts over a range of time. Users are allowed to edit their click count. This means that the key must combine the userId and the timestamp (in a format like userId:timestamp); otherwise, every new record for a user would overwrite the previous one, and we would lose the earlier values. The data would look like this:

[
    {
        "key": "1:0",
        "value": {
            "userId": 1,
            "clickCount": 1,
            "timestamp": 0
        }
    },
    {
        "key": "2:0",
        "value": {
            "userId": 2,
            "clickCount": 1,
            "timestamp": 0
        }
    },
    {
        "key": "2:0",
        "value": {
            "userId": 2,
            "clickCount": 2,
            "timestamp": 0
        }
    },
    {
        "key": "2:1",
        "value": {
            "userId": 2,
            "clickCount": 1,
            "timestamp": 1
        }
    }
]
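
The effect of the key choice on these records can be illustrated with a minimal sketch. It assumes a last-write-wins key-value table, which is how a compacted store would treat repeated keys. The `materialize` helper and the variable names are hypothetical and not part of Quick.

```python
# Records from the example above, in arrival order, as (key, value) pairs.
records = [
    ("1:0", {"userId": 1, "clickCount": 1, "timestamp": 0}),
    ("2:0", {"userId": 2, "clickCount": 1, "timestamp": 0}),
    ("2:0", {"userId": 2, "clickCount": 2, "timestamp": 0}),  # edit of the record above
    ("2:1", {"userId": 2, "clickCount": 1, "timestamp": 1}),
]


def materialize(records, key_of):
    """Build a last-write-wins table (hypothetical helper, not a Quick API)."""
    table = {}
    for key, value in records:
        table[key_of(key, value)] = value
    return table


# Keyed by "userId:timestamp": the edit replaces only the entry for "2:0".
by_composite_key = materialize(records, lambda key, value: key)

# Keyed by userId alone: each record of user 2 replaces the previous one.
by_user_id = materialize(records, lambda key, value: value["userId"])

print(sorted(by_composite_key))
print(by_user_id[2])
```

With the composite key, three entries survive and the edit at timestamp 0 is preserved. With the plain userId key, only the last record of each user remains.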

Pay attention to userId 2. If the key were only the integer value 2, we would have lost the value at timestamp 0. Now let's create the GraphQL schema and use the range query functionality:

type Query {
    userMetrics(
        userId: Int
        timeFrom: Long
        timeTo: Long
    ): [UserMetric] @topic(name: "user-metrics",
        keyArgument: "userId",
        rangeFrom: "timeFrom",
        rangeTo: "timeTo")
}
type UserMetric {
    userId: Int!
    clickCount: Int!
    timestamp: Long
}

Next, we create a topic with a string key and the UserMetric value schema, and ingest the JSON data defined above.

quick topic user-metrics --key string --value schema --schema gateway.UserMetric --range-field timestamp

The problem here is that we can only query over the key of the topic. Because the key is the composite string userId:timestamp, we cannot query the records of userId 2 over a range of time. This design document discusses possible designs and solutions to overcome this limitation.

Goals

  1. Quick CLI: Users should set the --range-key option when creating a topic that supports range queries
  2. Gateway: The gateway should be aware of the type of the newly defined range key field
  3. Mirror: The mirror should repartition the data based on the defined range-key

Out of scope

  1. Range keys of complex types: the range-key can only be applied to primitive types (Int, Long, String)

Implementation

1. Quick CLI

Goal: Users should set the --range-key option when creating a topic that supports range queries

When creating a topic, the user can pass a --range-key <FieldName> option. The manager passes the value to the deployment of the mirror.

Example:

quick topic user-metrics --key string --value schema --schema gateway.UserMetric --range-key userId --range-field timestamp

This command sends a request to the manager, and the manager prepares the deployment of a mirror called user-metrics. This mirror creates two indexes:

  1. Range Index over the new key (userId) and timestamp
  2. Point Index only over the new key (userId)
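
To make the two indexes more concrete, here is a rough in-memory sketch. It is not the mirror's actual storage layer: the `MirrorIndexes` class, its method names, and the inclusive range bounds are all assumptions made for illustration.

```python
from bisect import bisect_left, bisect_right


class MirrorIndexes:
    """Hypothetical in-memory stand-in for the two indexes of a mirror."""

    def __init__(self, range_key, range_field):
        self.range_key = range_key
        self.range_field = range_field
        self._range_keys = []    # sorted (range key, range field) tuples
        self._range_values = {}  # (range key, range field) -> value
        self._point = {}         # range key -> most recently written value

    def put(self, value):
        key = value[self.range_key]
        composite = (key, value[self.range_field])
        if composite not in self._range_values:
            # Keep composite keys sorted so that a range is a contiguous slice.
            position = bisect_left(self._range_keys, composite)
            self._range_keys.insert(position, composite)
        self._range_values[composite] = value
        self._point[key] = value

    def get(self, key):
        """Point lookup over the new key only."""
        return self._point.get(key)

    def range(self, key, range_from, range_to):
        """Range lookup; both bounds are inclusive (an assumption of this sketch)."""
        low = bisect_left(self._range_keys, (key, range_from))
        high = bisect_right(self._range_keys, (key, range_to))
        return [self._range_values[k] for k in self._range_keys[low:high]]


indexes = MirrorIndexes(range_key="userId", range_field="timestamp")
for value in [
    {"userId": 1, "clickCount": 1, "timestamp": 0},
    {"userId": 2, "clickCount": 1, "timestamp": 0},
    {"userId": 2, "clickCount": 2, "timestamp": 0},
    {"userId": 2, "clickCount": 1, "timestamp": 1},
]:
    indexes.put(value)

print(indexes.range(2, 0, 1))  # both records of user 2, ordered by timestamp
print(indexes.get(2))          # the most recently written record of user 2
```

Keeping the composite keys sorted is what makes the range lookup cheap: all entries of one userId are adjacent, so a range over the timestamp is a single slice.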

2. Gateway

Goal: The gateway should be aware of the type of the newly defined range key field

The gateway must create its partitioned mirror client based on the newly defined key. Currently, we use the information in the topic registry (i.e., the key serde) to serialize the keys and find the partition. Instead, serialization should use the type of the new key. One idea is to take the type of keyArgument and supply the matching SerDe.
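
The following sketch shows why the key type matters for routing. It assumes that the partition is derived from a hash of the serialized key bytes. The byte layouts match Kafka's IntegerSerializer (4 bytes, big-endian) and StringSerializer (UTF-8), but the CRC32 hash is only a stand-in for the murmur2 hash of Kafka's default partitioner, and `NUM_PARTITIONS` and all function names are hypothetical.

```python
import struct
import zlib

NUM_PARTITIONS = 3  # hypothetical partition count


def serialize_int(key):
    """4-byte big-endian encoding, the layout of Kafka's IntegerSerializer."""
    return struct.pack(">i", key)


def serialize_string(key):
    """UTF-8 encoding, the default layout of Kafka's StringSerializer."""
    return key.encode("utf-8")


def partition_for(key_bytes, num_partitions=NUM_PARTITIONS):
    # Stand-in hash: Kafka's default partitioner uses murmur2, not CRC32.
    return zlib.crc32(key_bytes) % num_partitions


# The same logical key produces different bytes depending on the serde.
int_bytes = serialize_int(2)
string_bytes = serialize_string("2")

print(int_bytes, partition_for(int_bytes))
print(string_bytes, partition_for(string_bytes))
```

Because the bytes differ, a gateway that serializes the userId with the string serde from the topic registry can compute a different partition than the mirror, which rekeyed the data as an integer.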

3. Mirror

Goal: The mirror should repartition the data based on the defined range-key

The mirror should use the selectKey method to repartition the data based on the new key. Kafka Streams will:

  • Send the rekeyed data to an internal repartition topic
  • Reread the newly rekeyed data back into Kafka Streams

Below you can find a detailed description of the topology:

[Figure: repartition strategy]
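
The rekeying step can be approximated with a plain simulation that does not use Kafka Streams itself. `select_key` mimics the effect of KStream#selectKey, and `repartition` groups the rekeyed records the way the internal repartition topic would. Both function names, the modulo partitioner, and the partition count are assumptions of this sketch.

```python
def select_key(records, range_key):
    """Mimic the effect of KStream#selectKey: use value[range_key] as the new key."""
    return [(value[range_key], value) for _old_key, value in records]


def repartition(records, num_partitions):
    """Group rekeyed records by partition, preserving arrival order."""
    partitions = {p: [] for p in range(num_partitions)}
    for key, value in records:
        # Stand-in partitioner for integer keys; Kafka hashes the serialized
        # key bytes with murmur2 instead.
        partitions[key % num_partitions].append((key, value))
    return partitions


source = [
    ("1:0", {"userId": 1, "clickCount": 1, "timestamp": 0}),
    ("2:0", {"userId": 2, "clickCount": 1, "timestamp": 0}),
    ("2:0", {"userId": 2, "clickCount": 2, "timestamp": 0}),
    ("2:1", {"userId": 2, "clickCount": 1, "timestamp": 1}),
]

rekeyed = select_key(source, "userId")
partitions = repartition(rekeyed, num_partitions=2)

print([key for key, _value in rekeyed])
print({p: [key for key, _value in recs] for p, recs in partitions.items()})
```

After rekeying, all records of one userId land in the same partition, so a single mirror replica can answer a range query for that user.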
