
Consumption metering RFC #2884

Merged: 1 commit into main, Jan 16, 2023
Conversation

@kelvich (Contributor) commented Nov 22, 2022

@kelvich mentioned this pull request on Nov 28, 2022
@petuhovskiy (Member) left a comment

There are a lot of good points in the RFC and in the discussions. I think this RFC is a good starting point, but we could split it into several smaller RFCs on independent topics and move our comments with possible approaches there.

Topics for these RFCs can be:

1. Consumption events calculation

It seems that everyone mostly agrees on the following scheme:

  • All services can produce "consumption events" at any time
  • We collect all events in a single place
  • End-user costs are calculated based on these events

We can further discuss how to calculate these events in each service, how often to report them, and what they should contain. For example, there was a discussion about what "synthetic storage size" is and how to calculate it; that could be settled in this separate RFC.
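To make the discussion more concrete, here is a minimal sketch of what such a consumption event could look like, written in Rust with serde and chrono; the struct and field names are purely illustrative assumptions, not something defined in the RFC.

```rust
use serde::{Deserialize, Serialize};

/// Illustrative shape of a consumption event (field names are assumptions).
/// Requires chrono with its "serde" feature for timestamp (de)serialization.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ConsumptionEvent {
    /// Which service produced the event: pageserver, proxy, autoscaler-agent, ...
    pub source: String,
    /// The metered quantity, e.g. "synthetic_storage_size" or "data_transfer".
    pub metric: String,
    /// Tenant the usage is attributed to.
    pub tenant_id: String,
    /// Measured value in the metric's unit (bytes, byte-hours, ...).
    pub value: u64,
    /// When the measurement was taken, so the collector can aggregate per hour/day.
    pub recorded_at: chrono::DateTime<chrono::Utc>,
    /// Idempotency key so downstream consumers can drop duplicates.
    pub idempotency_key: String,
}
```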

2. Events collection

This RFC can cover which solution we should use to collect events from all our services, e.g. whether to use a pull or push model, or to adopt a ready-made solution like Vector.
It can also cover the event format, e.g. JSON, protobuf, or something else.

Our collection solution should:

  • not lose events, and retry if one of the services is down
  • not lose events on restart, which likely means keeping a persistent on-disk buffer of events
  • be simple to use and configure, so that it is easy to test
  • make it easy to enable collection from new services and disable it for old, stopped services
  • provide exactly-once event delivery, e.g. by attaching a UUID to each event or batch (sketched after this list)
  • have reasonable event delivery latency
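As a rough illustration of the retry and exactly-once points above, here is a hedged Rust sketch of a reporter pushing one batch of events with a per-batch UUID. The endpoint path, payload shape, and crate choices (reqwest with its json feature, uuid, tokio, serde_json, anyhow) are all assumptions, and it reuses the ConsumptionEvent sketch from the previous section.

```rust
use std::time::Duration;
use uuid::Uuid;

/// Push one batch of events to the collector, retrying with backoff.
/// The endpoint and payload shape are assumptions, not a defined API.
async fn push_batch(
    client: &reqwest::Client,
    endpoint: &str,
    events: &[ConsumptionEvent],
) -> anyhow::Result<()> {
    // One UUID per batch: the receiver can discard a batch it has already seen,
    // turning at-least-once delivery (retries) into effectively exactly-once.
    let batch_id = Uuid::new_v4();
    let body = serde_json::json!({ "batch_id": batch_id, "events": events });

    let mut delay = Duration::from_secs(1);
    for _ in 0..5 {
        match client.post(endpoint).json(&body).send().await {
            Ok(resp) if resp.status().is_success() => return Ok(()),
            // Collector is down or returned an error: back off and retry.
            _ => {
                tokio::time::sleep(delay).await;
                delay *= 2;
            }
        }
    }
    // A real agent would keep the batch in a persistent on-disk buffer here
    // instead of giving up, so events also survive restarts.
    anyhow::bail!("failed to deliver batch {batch_id} after retries")
}
```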

3. Events storage

There is an argument that collecting all events in a single place is bad for scalability; we can discuss that in this RFC. From my point of view, we can easily scale by sharding events by tenant_id. Large companies use event sourcing and it works for them, so it should work for us too.

We can discuss where we should store events. Requirements for the storage:

  • should be able to store a lot of events
  • it should be easy to push new events
  • events should be durably stored for some time
  • events should be easy to query/consume at any time

Kafka sounds like a good solution for this, but we can discuss other options too.
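To make the sharding point concrete, here is a small sketch using the rdkafka crate (an assumption, as are the topic name and broker address): keying each message by tenant_id means Kafka's default partitioner keeps a given tenant's events in a single partition, so the stream shards naturally.

```rust
use std::time::Duration;

use rdkafka::config::ClientConfig;
use rdkafka::producer::{FutureProducer, FutureRecord};

/// Build a producer; broker address and timeout values are illustrative.
fn make_producer() -> anyhow::Result<FutureProducer> {
    let producer: FutureProducer = ClientConfig::new()
        .set("bootstrap.servers", "kafka:9092")
        .set("message.timeout.ms", "5000")
        .create()?;
    Ok(producer)
}

/// Publish one serialized event, keyed by tenant_id so all of a tenant's events
/// land in the same partition (ordered, and trivially sharded across brokers).
async fn publish(producer: &FutureProducer, tenant_id: &str, payload: &str) -> anyhow::Result<()> {
    let record = FutureRecord::to("usage-events").key(tenant_id).payload(payload);
    producer
        .send(record, Duration::from_secs(5))
        .await
        .map_err(|(err, _msg)| anyhow::anyhow!("kafka delivery failed: {err}"))?;
    Ok(())
}
```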

4. Further processing

We can discuss how to process events to calculate costs, and the pipelines for pushing events to other services.



I mostly agree with the proposed solution for the first iteration. I see the overall scheme as follows:

  ┌──────────────┬────────────────┐
  │  pageserver  │                │
  └──────────────┘      ┌─────────▼──────────┐        ┌───────┐       ┌─────────────┐
                        │                    │        │       │       │             │
  ┌──────────────┐      │ POST /usage-events ├────────► Kafka ├──────►│   console   │
  │    proxy     ├──────►  (control-plane)   │        │       │       │             │
  └──────────────┘      │                    │        └───────┘       └─────────────┘
                        └─────────▲──────────┘
┌──────────────────┐              │
│ autoscaler-agent ├──────────────┘
└──────────────────┘
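Purely to illustrate the ingestion edge in the diagram, here is a minimal Rust/axum sketch of the POST /usage-events handler; the real control plane may be a different service and language entirely, and the Kafka hand-off is only hinted at in a comment.

```rust
use axum::{http::StatusCode, routing::post, Json, Router};

/// Accept a batch of usage events over HTTP and acknowledge it.
/// In the diagram above, this is where events would be handed to Kafka;
/// acknowledging only after the broker confirms lets callers retry safely.
async fn usage_events(Json(events): Json<Vec<serde_json::Value>>) -> StatusCode {
    tracing::info!("accepted {} usage events", events.len());
    // TODO (sketch): forward `events` to the Kafka producer shown earlier.
    StatusCode::ACCEPTED
}

/// Route table for the ingestion endpoint.
fn router() -> Router {
    Router::new().route("/usage-events", post(usage_events))
}
```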

@vadim2404 (Contributor) commented:

@petuhovskiy, I have seen several implementations like the one you propose, with one difference:
the HTTP endpoint was replaced with a local log file that was pushed to Kafka directly via filebeat (https://www.elastic.co/guide/en/beats/filebeat/current/kafka-output.html).

The single point of failure (the HTTP endpoint) is replaced with filebeat, which supports retries and continuation from a previously committed state.

@kelvich (Contributor, Author) commented Dec 1, 2022

We had a call with Metronome, and it seems they can't do complicated aggregates on their side. So if we bill for gigabyte-hours, we should send one event per hour and they can sum() them. But we can't send events twice as often, because they would still interpret each one as a per-hour event, ignoring the timestamps and doubling the price. So we have to be more precise with our events and handle some aggregation on our side.

With that in mind, I can see a few other possible pipelines. Let's assume we always want to end up with aggregated (1-hour or 1-day) usage events stored long-term on our side (a sketch of that hourly bucketing follows the list below).

a) POST /usage-events (control plane) -> Kafka -> postgres table with 1h or 1-day aggregated events
b) POST /usage-events (control plane) -> postgres table with the last hour/day of unaggregated events -(move data with dbt and truncate the source table)-> postgres table with 1h or 1-day aggregated events
c) the same, but with ClickHouse in the middle and at the end
d) POST to Vector.dev -> Vector downsamples events -> something that can accept HTTP JSON and put it into postgres (control plane or PostgREST) -> postgres table with 1h or 1-day aggregated events
e) POST to Vector.dev -> Vector downsamples events -> ClickHouse table with 1h or 1-day aggregated events
f) the two previous variants, but with Vector.dev polling Prometheus endpoints instead of waiting for POSTs
g) a periodic job that queries VictoriaMetrics for the last hour of usage and puts the results into a postgres table with 1h or 1-day aggregated events
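Whichever of these pipelines we pick, the "aggregate on our side" step boils down to bucketing raw events per (tenant, metric, hour). A hedged Rust sketch of that bucketing follows; the tuple layout of the raw events is an assumption, and sum() only fits counter-like metrics, while a gauge such as storage size would need max or a time-weighted average instead.

```rust
use std::collections::HashMap;

use chrono::{DateTime, Timelike, Utc};

/// Collapse raw events into one value per (tenant, metric, hour), so the
/// billing service only ever sees hourly events it can safely sum().
/// Raw events are (tenant_id, metric, value, recorded_at); layout is illustrative.
fn aggregate_hourly(
    events: &[(String, String, u64, DateTime<Utc>)],
) -> HashMap<(String, String, DateTime<Utc>), u64> {
    let mut buckets: HashMap<(String, String, DateTime<Utc>), u64> = HashMap::new();
    for (tenant, metric, value, at) in events {
        // Truncate the timestamp to the start of its hour.
        let hour = at
            .with_minute(0)
            .and_then(|t| t.with_second(0))
            .and_then(|t| t.with_nanosecond(0))
            .expect("zeroing minutes/seconds/nanos is always valid");
        // sum() is right for counters; gauges (e.g. storage size) would need
        // a different reduction here, such as max or a time-weighted average.
        *buckets
            .entry((tenant.clone(), metric.clone(), hour))
            .or_insert(0) += *value;
    }
    buckets
}
```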

@chaporgin (Member) commented Dec 1, 2022

I personally like d) as an intermediate option. Vector could run a pipeline of HTTP server source -> log-to-metric transform -> aggregate transform (aggregating the needed metrics at the needed granularity) -> HTTP sink -> console -> billing services.

The pipeline sources -> multiple Vector instances per cell -> Kafka -> a single Postgres table for the whole cloud -> some accounting engine seems like a bright future with the following properties: it does not lose events, it tolerates repeated events, it can bill multiple events on a per-hour basis, and it keeps costs low by sending only the needed events to the accounting engine.
Vector:
It gives flexibility in sources and accepts quite a few formats; HTTP and Prometheus are easy to implement on the storage nodes. Push does not require storage node discovery, so the proposed HTTP or statsd seems easier to maintain.
It gives batching, which increases throughput when working with Kafka.
It does not lose events (or loses very few): Vector retries them.
Kafka:
With Kafka Streams we can aggregate and bill on a per-hour basis, reducing the volume of events (which we are going to pay for). If the events have an idempotency field (timestamp + hostname), we can probably drop duplicate events on the Kafka side with log compaction, so at-least-once delivery seems OK for us.
Everything can then be put into Postgres with a Kafka PostgreSQL sink.
Postgres table:
Having the events in a Postgres table would allow us to enrich them in the background with client id and other data, then send them to an external system (like Metronome or others) or to a home-grown one (which does not exist yet and is not rocket science to build, but we would need to build it first), and add taxes etc. We could also show usage from that table to users, while bills come from some external system.

But this is a lot to build.

c) ClickHouse sounds great for showing historical data and downsampling it automatically, but it has weak ingestion guarantees, so I would not use it as a primary data store; as a secondary one for graphs etc. it is fine.
d) seems fast to build, but it might have a scalability problem. If we run more than one Vector instance accepting and aggregating events, each GB-hour of storage gets billed twice. We could add another layer of single-instance Vector post-aggregating the events for each hour; however, I doubt that would be a consistent solution, since it still allows billing an hour twice if events arrive at that post-aggregator far apart.
Having one instance handle all the traffic (a 15k RPS load) would probably work if we give it enough CPU and memory; that is not a big deal for Vector. I see this scheme as an intermediate one and thought it would work for sure, but we would need to rework the ingest pipeline later; we probably do not want to live with a single instance for long, as long as we do not want to add retries to the event emitters.
f) needs discovery of the storage services; I would try to avoid that.
g) I am not sure we have any durability guarantees on this metrics storage; losing metrics is probably not what we want, so I would not rely heavily on VictoriaMetrics as a reliable store for producing billing information.

So I would vote for scheme a) with minor additions (see above), and I would start from the back: deliver data to some accounting engine (which is Orb; it probably has the needed aggregations) with POST /usage via HTTP. That would let us start with paying customers. Then add Vector, just to bring in Kafka later. Then we could add a temporary Kafka -> Vector -> POST /usage step just to reduce the number of events sent to Orb. And then add a Postgres table to store the data ourselves. That would allow us to move step by step to our own solution while already having a fully working billing solution, because building those pipelines from scratch would put us, I would say, person-months away from billing customers.

@lubennikovaav marked this pull request as ready for review on January 16, 2023, 16:37
@lubennikovaav (Contributor) commented Jan 16, 2023

I have resolved all the conversations so that the PR can be merged.
If any discussion needs to be revived, please open a follow-up issue.

@lubennikovaav self-requested a review on January 16, 2023, 16:48
@lubennikovaav enabled auto-merge (rebase) on January 16, 2023, 16:52
@lubennikovaav merged commit 431e464 into main on January 16, 2023
@lubennikovaav deleted the metering_rfc branch on January 16, 2023, 17:16