Consumption metering RFC #2884
Conversation
There are a lot of good points in the RFC and the discussions. I think this RFC is a good starting point, but we could split it into smaller RFCs on smaller, independent topics and move our comments with possible approaches there.
Topics for these RFCs could be:
1. Consumption events calculation
It seems that everyone mostly agrees on the following schema:
- All services can produce "consumption events" at any time
- We collect all events in a single place
- End-user costs are calculated based on these events
We can further discuss how each service should calculate these events, how often they should be reported, and what these events should contain. For example, there was a discussion about what "synthetic storage size" is and how to calculate it; that could be covered in this separate RFC.
2. Events collection
This RFC can cover what solution we should use to collect events from all our services, e.g. whether we should use a pull or a push model, or adopt a ready-made solution like Vector.
We can also cover the format of the events, e.g. whether we should use JSON, protobuf, or something else.
Our collection solution should:
- not lose events, and retry if one of the services is down
- not lose events on restart, which likely means a persistent buffer with events stored on disk
- be simple to use and configure, to make it easy to test
- make it easy to enable collection from new services and to disable collection from old, stopped services
- provide exactly-once event delivery, e.g. by attaching a UUID to each event or batch (see the sketch after this list)
- have reasonable latency for event delivery
3. Events storage
There is an argument that collecting all events in a single place is bad for scalability; we can discuss this in this RFC. From my point of view, we can scale easily by sharding events by tenant_id. I know that large companies use event sourcing and it works for them, so my opinion is that it should work for us too.
We can discuss where we should store events. Requirements for the storage:
- should be able to store a lot of events
- should make it easy to push new events
- events should be durably stored for some time
- events should be easy to query/consume at any time
Kafka sounds like a good solution for this, but we can discuss other options too.
4. Further processing
We can discuss how we should process events to calculate costs, and the pipelines for pushing events to other services.
I mostly agree with the proposed solution for the first iteration. I see the overall schema as follows:
┌──────────────┬────────────────┐
│  pageserver  │                │
└──────────────┘      ┌─────────▼──────────┐      ┌───────┐      ┌─────────────┐
                      │                    │      │       │      │             │
┌──────────────┐      │ POST /usage-events ├──────► Kafka ├─────►│   console   │
│    proxy     ├──────►  (control-plane)   │      │       │      │             │
└──────────────┘      │                    │      └───────┘      └─────────────┘
                      └─────────▲──────────┘
┌──────────────────┐            │
│ autoscaler-agent ├────────────┘
└──────────────────┘
@petuhovskiy, I've seen several implementations like the one you propose, with one difference: the single point of failure (the HTTP endpoint) is replaced with filebeat, which supports retries and continuation from a previously committed state.
We had a call with Metronome, and it seems that they can't do complicated aggregates on their side. So if we bill for gigabytes per hour, we should send one event per hour and they can sum() them. But we can't send events twice as often, because they would still interpret each one as a per-hour event, ignoring timestamps and doubling the price. So we have to be more careful with our events and handle some aggregation on our side. With that, I can see a few other possible pipelines. Let's assume that we always want to end up with aggregated (1-hour or 1-day) usage events stored long-term on our side. a)
I personally like d) as an intermediate option. Vector could work with a pipeline of HTTP server source -> log-to-metric transform -> aggregate transform producing the needed metrics at the needed granularity -> HTTP sink -> console to billing services. But this is a lot to build. c) ClickHouse sounds great for showing historical data, since it can downsample it automatically. But it has weak guarantees on ingestion, so I would not use it as the primary data source, only as a secondary one for graphs, etc. So I would vote for scheme a) with minor additions (see above), and I would start from the back: deliver data to some accounting engine (which is Orb; it probably has the needed aggregations) with
Force-pushed from 1baedc6 to 7841a9f.
I resolved all the conversations to be able to merge the PR.
Rendered: https://github.com/neondatabase/neon/blob/metering_rfc/docs/rfcs/021-metering.md