
Rate Limit Receiver Extension #6908

Open
MovieStoreGuy opened this issue Dec 20, 2021 · 36 comments
Labels
Accepted Component (new component has been sponsored), enhancement (new feature or request), help wanted (extra attention is needed), never stale (issues marked with this label will never be staled and automatically removed), release:after-ga

Comments

@MovieStoreGuy
Contributor

Is your feature request related to a problem? Please describe.
There are two scenarios in which I would like to see some amount of rate limiting happen:

  1. Running the collector in a gateway deployment style (many stateless collectors running as a deployment) can mean accepting 3x traffic from Service A, which can impact delivery for Service B sending at <1x scale.

  2. Running the collector as a standalone / sidecar deployment (single node, potentially stateful, with only a single service sending data).

In either case, if a service sends an excessive amount of traffic, it impacts others trying to send through the same pipeline.

Describe the solution you'd like
A new extension in collector-contrib that can rate limit based on header fields or HTTP metadata.

Ideally it should follow an existing rate limiting convention that is widely adopted, so that the effort to update instrumentation is minimal.

For the Option 1 case, the collectors in the deployment will need to exchange data among each other to ensure that traffic is correctly rate limited (with some tolerance for eventual consistency).

Describe alternatives you've considered
Each collector could have an in-memory counter of a fixed number B over time window T.
The effective rate limit would then be roughly N nodes × B buckets / (2T), assuming a drift of up to T between nodes.
This could work, but the drift might not be tolerable for adopters.
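For illustration, here is a minimal sketch of such a node-local counter in Go. The names (fixedWindowLimiter, Allow) are made up for this example and are not an existing collector API; it only shows the per-node fixed-window behaviour that produces the drift described above.

package ratelimiter

import (
    "sync"
    "time"
)

// fixedWindowLimiter allows at most `bucket` requests per window of length
// `interval`, counted locally on this node only.
type fixedWindowLimiter struct {
    mu       sync.Mutex
    bucket   int64
    interval time.Duration
    count    int64
    start    time.Time
}

func newFixedWindowLimiter(bucket int64, interval time.Duration) *fixedWindowLimiter {
    return &fixedWindowLimiter{bucket: bucket, interval: interval, start: time.Now()}
}

// Allow reports whether one more request fits into the current window.
func (l *fixedWindowLimiter) Allow() bool {
    l.mu.Lock()
    defer l.mu.Unlock()
    now := time.Now()
    if now.Sub(l.start) >= l.interval {
        // Window expired: reset. Windows are node-local, so across N nodes the
        // effective limit drifts by up to one interval, as noted above.
        l.start = now
        l.count = 0
    }
    if l.count >= l.bucket {
        return false
    }
    l.count++
    return true
}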

Additional context
There are some public projects that attempt to provide generic solutions to this:

@MovieStoreGuy
Contributor Author

MovieStoreGuy commented Dec 20, 2021

@tigrannajaryan , @jpkrohling , @bogdandrutu ,

This is a large potential piece of work and I was wondering if it could be brought up for discussion at the next Collector SIG?
Edit: More specifically: what rate limiting strategy to use, what should be configurable by a user, and whether the collector should use some sort of consensus protocol among nodes or a third-party store like Redis, DynamoDB, etc.

Within Atlassian, we've built an internal version of this that works for our own needs; however, I am convinced that we can contribute back an extension that could be used by all adopters of the collector.

@tigrannajaryan
Member

Could the rate limiter be implemented as an auth extension? Alternatively, perhaps it could be a processor, if the processor has access to the data that is needed to make a rate limiting decision.

For the Option 1 case, the collectors in the deployment will need to exchange data among each other to ensure that traffic is correctly rate limited (with some tolerance for eventual consistency).

What do you mean by "correctly rate limited"? Are you for example looking for rate limiting traces by dropping spans from one trace?

Definitely worth discussing in the SIG.

@Aneurysm9
Member

I'd second the suggestion to model rate limiting as an auth extension, as that should provide the clearest path to integrating it with as many receivers as possible, as early in the pipeline as possible. I think using a processor might be insufficient, as all the work of the receiver has been performed at that point, reducing the amount of work that can be avoided on rate-limited requests.

In the past I've found it simplest to avoid coordination between nodes and simply set the desired rate limit to limit / #nodes where it is expected that traffic will be evenly distributed across nodes. Is that a workable assumption here, or is coordination truly required?

The thing I worry about with implementing a rate limiter is losing data. OTLP has a built-in mechanism for telling senders to hold off on retries, which should help for correctly implemented OTLP senders, but what about other protocols that do not account for backpressure?

@MovieStoreGuy
Contributor Author

Could the rate limiter be implemented as an auth extension? Alternatively, perhaps it could be a processor, if the processor has access to the data that is needed to make a rate limiting decision.

I would need to double-check the interface for auth extensions, but I thought it would be a nice addition if any receiver could be wrapped by it, so that it is one thing we can extract out of each of those components.

The way we did it internally was to create a new receiver that wrapped the default OTLP HTTP receiver. It wasn't the cleanest solution, but it meant that we could stop those large deployments from breaking our pipeline.

I also realise while typing that I am assuming I have access to the raw request, which may not be the case in some scenarios. (For example, it might be a future use case to rate limit host-collected metrics from collectd. An out-there example, but I could see that happening with things like JMX or expvar.)

Long-winded way of saying that a processor might be a better option here, since it could apply to more use cases over time.

What do you mean by "correctly rate limited"? Are you for example looking for rate limiting traces by dropping spans from one trace?

Yes, sorry, I had meant that the rate limiting buckets are kept in sync across nodes, so each node knows the expiry and reset interval.

However, there are several things I can think of that we may (and currently do internally) attempt to rate limit on:

  • Number of Requests (it is unusual for someone to experiment with new tracing or metrics instrumentation that happens to be high volume)
  • Cumulative Content Length (Not something we currently do but could be handy for those who use the collector to front some data store such as Kinesis, Kafka)

And then after that it starts getting into introspection of the data:

  • Number of Resource Values (Number of Spans, Datapoints, Logs)
  • Number of Resource Types (Have different thresholds for Tracing, Metrics, logs)

I'd second the suggestion to model rate limiting as an auth extension, as that should provide the clearest path to integrating it with as many receivers as possible, as early in the pipeline as possible. I think using a processor might be insufficient, as all the work of the receiver has been performed at that point, reducing the amount of work that can be avoided on rate-limited requests.

I am okay with this approach; however, it does feel like a misuse of the Auth interface to a degree?

In the past I've found it simplest to avoid coordination between nodes and simply set the desired rate limit to limit / #nodes where it is expected that traffic will be evenly distributed across nodes. Is that a workable assumption here, or is coordination truly required?

So, working off the assumption that the collector is only given an interval duration and makes no effort to align to the nearest duration window (i.e. a 1m interval wouldn't try to start at 08:44:00.000 but would start the interval when the component was ready).

I really think it would come down to:

  • Could the downstream services withstand nearly limit / #nodes * 2 worth of traffic?
    • This could be a result of a new deployment to the gateway (a green/green deployment manner)
    • The nodes becoming ready at the boundaries of the interval duration
  • Does there need to be a maximum amount of traffic that can be processed by your deployment?
    • In the event that the gateway deployment needs to scale, the potential limit on a service could be greater than what could be consumed by the downstream.

I think Atlassian falls into the second bucket, where we are rather happy to scale as needed to meet demand; however, we need to enforce a max limit consumed by the entire deployment before it starts impacting our delivery to Kinesis / Kafka.

(We currently have plans for effectively an ETL of tracing to metric data that is going to be done within the gateway, so it could force a scale-up event within the ASG and raise the allowed limits for the service.)

The thing I worry about with implementing a rate limiter is losing data. OTLP has a built-in mechanism for telling senders to hold off on retries, which should help for correctly implemented OTLP senders, but what about other protocols that do not account for backpressure?

I think this would be a good chance to further help those exporters by expanding the exporterhelper package to account for these situations, instead of each extension needing to reinvent back-pressure handling.
I am not sure of the best pattern to do so, but there are already the Retry and Queue helpers which help with this; we could add a RateLimit helper so that the user could decide what should happen in the event of a rate limit (see the config sketch below):

  • Drop latest
  • Drop earliest
  • Rollup data (not sure it is permissible within the collector to mutate data)

However, I suspect the exporters that do not currently handle back pressure already experience it, and offering further helpers to assist with that would drastically reduce the effort required by those components.
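Purely as a sketch of what such an exporterhelper-style knob could look like in configuration (retry_on_failure and sending_queue exist today; the ratelimit block and its strategy values are hypothetical):

exporters:
  otlp:
    endpoint: collector-gateway:4317
    retry_on_failure:
      enabled: true
    sending_queue:
      enabled: true
    # Hypothetical block, not implemented in exporterhelper today.
    ratelimit:
      enabled: true
      strategy: drop_earliest   # or drop_latest / rollup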

This is more tangential, but if we do go with an uncoordinated approach (say, as a default mode without any external resource to force a sync), we should recommend routing policies be round robin instead of least connections or IP hash (assuming that nginx's rate limiting naming works across different ingress controllers).

@jpkrohling
Member

Long-winded way of saying that a processor might be a better option here

The auth capabilities were also a processor at first, but for security-related tasks (and I would include this one in the mix) it's better to accept/reject a request before much processing has been done. So, something that runs at the same level as authenticators would certainly make sense. I don't think an authenticator is the solution either, as there can only be one authenticator per receiver.

I need to think a bit more about this, but I'll certainly forget everything by the time I come back next year... So, ping me again in a couple of weeks :-)

@jpkrohling added the enhancement label Dec 21, 2021
@MovieStoreGuy
Contributor Author

If it makes it easier, I can limit the scope of the initial feature to receivers that accept HTTP / RPC instead of all receivers.

We can look at resolving the issue for host / pull-based receivers in a future issue if that would help?

@Aneurysm9
Member

I am okay with this approach; however, it does feel like a misuse of the Auth interface to a degree?

It doesn't feel like a misuse to me. I view rate limits as an authorization rule. Once a requestor has exceeded their quota they are no longer authorized to have further requests processed until their quota is again positive.

@MovieStoreGuy
Contributor Author

I had honestly never thought about rate limiting like that, @Aneurysm9, but yeah, that makes sense.

@jpkrohling
Member

Perhaps we need authorization in addition to authentication? While we can have only one authenticator, we could have multiple authorizers.

@MovieStoreGuy
Contributor Author

@jpkrohling, pinging you here to help get some ideas flowing :)

@jpkrohling
Member

How about something like this:

extensions:
  oidc:

  ratelimiter:
    qps: 1000

receivers:
  otlp:
    protocols:
      grpc:
        auth:
          authenticator: oidc
          authorizers:
          - ratelimiter

processors:

exporters:
  logging:
    logLevel: debug

service:
  extensions: [oidc, ratelimiter]
  pipelines:
    traces:
      receivers: [otlp]
      processors: []
      exporters: [logging]

@MovieStoreGuy
Contributor Author

I really like the structure of that configuration file; I think it works well as an example authorizer configuration.

@jpkrohling
Member

@MovieStoreGuy do you think you have the necessary information to get started with this? If so, we might need a change or two as part of the core before implementing the first authorizer here in the contrib, in which case, we need to get clearance from @bogdandrutu and/or @tigrannajaryan.

@MovieStoreGuy
Contributor Author

MovieStoreGuy commented Jan 12, 2022

Yeah, I think I have enough of an idea of how to get started.

My thoughts on this were to do the following:

  • Draft an initial Authorizer interface with a nop type within the core repo
  • Update the configuration types as needed
  • Add some harness testing for future Authorizers following the existing lifecycle pattern
  • Add an in-memory rate limiter as the initial rate limiter (a sketch follows below)

In the meantime, I would want to draft some ideas on how best to implement a global sync of rate limits, to allow a gateway style of deployment to scale as required without requiring the downstream services to have the same elasticity.

I believe I have also seen a pattern where this work starts in contrib and then moves into core; I am happy to follow that development pattern as well.
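For discussion, a hypothetical sketch of what the drafted Authorizer interface, nop type, and in-memory rate limiter could look like. Package, type, and method names are made up (this is not an existing collector API), and the token bucket leans on golang.org/x/time/rate as a stand-in:

package authz

import (
    "context"
    "errors"
    "net/http"

    "golang.org/x/time/rate"
)

// Authorizer decides whether an already-authenticated request may proceed.
type Authorizer interface {
    // Authorize returns an error (which a receiver would map to HTTP 429 or
    // gRPC RESOURCE_EXHAUSTED) when the request should be rejected.
    Authorize(ctx context.Context, headers http.Header) error
}

// nopAuthorizer accepts everything; useful as a default and in lifecycle tests.
type nopAuthorizer struct{}

func (nopAuthorizer) Authorize(context.Context, http.Header) error { return nil }

// rateLimitAuthorizer is the in-memory rate limiter piece: a token bucket
// refilled at qps with the given burst size.
type rateLimitAuthorizer struct {
    limiter *rate.Limiter
}

func newRateLimitAuthorizer(qps float64, burst int) *rateLimitAuthorizer {
    return &rateLimitAuthorizer{limiter: rate.NewLimiter(rate.Limit(qps), burst)}
}

func (a *rateLimitAuthorizer) Authorize(context.Context, http.Header) error {
    if !a.limiter.Allow() {
        return errors.New("rate limit exceeded, retry later")
    }
    return nil
}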

@jpkrohling
Member

You'll also need to hook up the authorizers with the authenticators. You might want to do that from the configauth package. The extensions receive a component.Host where you can get a list of extensions, from which you can extract which authorizers are enabled based on the configuration. We have a similar routine for the authenticators, here:

https://github.com/open-telemetry/opentelemetry-collector/blob/87365b7fde4674d0b50d18a368c775ac05a30576/config/configauth/configauth.go#L37-L49

https://github.com/open-telemetry/opentelemetry-collector/blob/87365b7fde4674d0b50d18a368c775ac05a30576/config/confighttp/confighttp.go#L267-L274
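A rough sketch of the analogous lookup for authorizers, reusing the hypothetical Authorizer interface from the earlier comment. The host/extension types have changed between collector versions, so treat the exact signatures as assumptions rather than the real configauth API:

package authz

import (
    "fmt"

    "go.opentelemetry.io/collector/component"
)

// GetAuthorizers resolves the configured authorizer IDs against the extensions
// exposed by the host, mirroring how authenticators are looked up today.
func GetAuthorizers(ids []component.ID, host component.Host) ([]Authorizer, error) {
    authorizers := make([]Authorizer, 0, len(ids))
    for _, id := range ids {
        ext, ok := host.GetExtensions()[id]
        if !ok {
            return nil, fmt.Errorf("authorizer %q is not configured as an extension", id)
        }
        authz, ok := ext.(Authorizer)
        if !ok {
            return nil, fmt.Errorf("extension %q does not implement Authorizer", id)
        }
        authorizers = append(authorizers, authz)
    }
    return authorizers, nil
}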

@bputt-e

bputt-e commented Jan 13, 2022

It would be interesting to see a rate limiter per "tenant", defined as different subjects for authentication. Additionally, allowing rates of qps and bps (bytes per second) would be beneficial. This may be out of scope here, but an option to limit a tenant to 10M spans / day or 100 GB / day ingested would be helpful.

Related issue

@MovieStoreGuy
Contributor Author

@bputt-e, this is 100% something that Atlassian would need as well, since some services need higher levels of throughput compared to one that, for example, interacts with Slack.

I would like to spar on this in a separate PR once we've got approval to go ahead with a trivial implementation of the rate limit authorizer.

@jpkrohling
Member

@MovieStoreGuy, should we close this one in favor of open-telemetry/opentelemetry-collector#4438, which would be a more generic approach to this?

@Aneurysm9
Member

I don't see how that would solve this issue. This issue is about rate limiting in receivers, on the server side. That is about adding custom RoundTripper implementations to the client side.

@jpkrohling
Member

What I had in mind before is that the rate limiters could be implemented as interceptors, to be executed after the authentication phase, as an "authorizer". What did you have in mind?

@Aneurysm9
Member

That sounds similar to what I would expect. My concern was that the linked issue related to client interception and not server interception.

I also think that there may be some concerns particular to AuthZ that should be explicitly handled with an authorization framework and not simply free-form attachment of middleware. For instance, if multiple AuthZ handlers are configured what are the requirements for a request to be authorized? Must all of them indicate authorization? Any one of them? Two or more? Is there an order for their evaluation that must be followed?

@jpkrohling
Member

jpkrohling commented Jan 25, 2022

My concern was that the linked issue related to client interception and not server interception.

Sorry, my bad. I mixed the two, they should certainly be treated separately.

For instance, if multiple AuthZ handlers are configured what are the requirements for a request to be authorized? Must all of them indicate authorization? Any one of them? Two or more? Is there an order for their evaluation that must be followed?

True. There are typically three types of outcomes for authz routines: pass, block, neutral. For the moment, I think we just need the "block" case, which is what a rate limiter would require. This could be done without a separate API for authZ, so that we can extend it later if we need to.
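As a sketch of one way that combination could work, where any explicit block rejects, an explicit pass short-circuits, and an all-neutral chain falls through to accept (all names here are hypothetical):

package authz

import "context"

// Decision is one of the three outcomes described above.
type Decision int

const (
    Neutral Decision = iota // no opinion, continue with the next authorizer
    Pass                    // explicitly allow, stop evaluating
    Block                   // explicitly deny, reject the request
)

type decider interface {
    Decide(ctx context.Context) Decision
}

// authorize evaluates the configured chain in order: the first Block rejects,
// the first Pass accepts, and a chain of only Neutral results accepts.
func authorize(ctx context.Context, chain []decider) bool {
    for _, d := range chain {
        switch d.Decide(ctx) {
        case Block:
            return false
        case Pass:
            return true
        }
    }
    return true
}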

@jpkrohling
Member

We talked about this during today's SIG and agreed to give this a try after GA.

@MovieStoreGuy
Contributor Author

@jpkrohling could I ask you to add the label for after-ga or something in that vein?

@github-actions
Contributor

github-actions bot commented Nov 7, 2022

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

github-actions bot added the Stale label Nov 7, 2022
@Aneurysm9 removed the Stale label Nov 7, 2022
@Aneurysm9 added the Sponsor Needed and never stale labels Nov 7, 2022
@jpkrohling added the Accepted Component label and removed the Sponsor Needed label Nov 29, 2022
@jpkrohling
Member

I'm open to being a sponsor for this.

@MovieStoreGuy added the help wanted label Dec 16, 2022
@blakeroberts-wk
Contributor

Similar to @bputt-e's desire to rate limit on rps/bps, I'd like to be able to rate limit on details within the request body.

Specifically, we have a processor that contacts an external resource keying off of unique trace IDs. The external resource is far more quickly exhausted by a high number of unique trace IDs than on number of requests or even spans. i.e. our system’s first bottleneck tends to be number of unique trace IDs, not number of requests or number of spans. Some background here, the resources/clients making requests to the collector that will most often need to be rate limited are those using the java agent and jdbc. Propagating context to jdbc connection pool threads isn’t automatic which tends to greatly increase the number of unique trace IDs to the collector.

I understand client side solutions to this problem; however, this results in fire fighting every time the issue crops up and potentially causes data outages until a client specific fix is deployed. Hence the desire for a rate limiter at the collector level to protect our infrastructure.

And so the rate limiting behavior in this very specific case needs to be: without context propagation to jdbc thread pools, the requests need to be rate limited. With context propagation to jdbc thread pools, the requests do not need to be rate limited. It's not a matter of how many requests or how many spans are sent, but how many traces.

I'm not sure it would be possible to perform this type of rate limiting from a custom authorization extension. I believe a processor would be needed in order to interact with the deserialized request payload. But maybe I'm wrong and an extension could still be given the deserialized request payload?

Nonetheless, coding around my very specific issue probably isn't worth it. So here's an example of a very common telemetry data rate limit that'd require access to the deserialized request payload: metric cardinality rate limiting.
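As a sketch of the kind of bookkeeping this would need (illustrative only; wiring it into a processor or extension that walks the deserialized payload is omitted):

package cardinalitylimit

import (
    "sync"
    "time"
)

// uniqueTraceLimiter caps how many distinct trace IDs are admitted per window.
type uniqueTraceLimiter struct {
    mu      sync.Mutex
    maxIDs  int
    window  time.Duration
    seen    map[[16]byte]struct{}
    started time.Time
}

func newUniqueTraceLimiter(maxIDs int, window time.Duration) *uniqueTraceLimiter {
    return &uniqueTraceLimiter{
        maxIDs:  maxIDs,
        window:  window,
        seen:    make(map[[16]byte]struct{}),
        started: time.Now(),
    }
}

// admit returns false when the trace ID is new and the per-window budget of
// distinct trace IDs is already spent; the caller would then reject the
// request as retryable (HTTP 429 / gRPC RESOURCE_EXHAUSTED).
func (l *uniqueTraceLimiter) admit(traceID [16]byte) bool {
    l.mu.Lock()
    defer l.mu.Unlock()
    if time.Since(l.started) >= l.window {
        l.seen = make(map[[16]byte]struct{})
        l.started = time.Now()
    }
    if _, ok := l.seen[traceID]; ok {
        return true // already counted this trace
    }
    if len(l.seen) >= l.maxIDs {
        return false
    }
    l.seen[traceID] = struct{}{}
    return true
}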

Here's a related issue I've opened about supporting rate limiting within the otlp receiver with a potential solution that relies on custom processors: open-telemetry/opentelemetry-collector#6725

@MovieStoreGuy
Contributor Author

Specifically, we have a processor that contacts an external resource keying off of unique trace IDs. The external resource is far more quickly exhausted by a high number of unique trace IDs than on number of requests or even spans. i.e. our system’s first bottleneck tends to be number of unique trace IDs, not number of requests or number of spans. Some background here, the resources/clients making requests to the collector that will most often need to be rate limited are those using the java agent and jdbc. Propagating context to jdbc connection pool threads isn’t automatic which tends to greatly increase the number of unique trace IDs to the collector.

I think this is where an extension makes sense that can be consumed by other components, since, as you pointed out, the volume of ingested data isn't the only thing that can cause "noisy neighbour" problems. However, I feel like what you want here is a sampler, not a rate limiter. (The difference in implementation is probably not much.)

I understand client side solutions to this problem; however, this results in fire fighting every time the issue crops up and potentially causes data outages until a client specific fix is deployed. Hence the desire for a rate limiter at the collector level to protect our infrastructure.

I feel like the use of sampling (i.e. a fixed number of traces per second) and the use of rate limiting would help address this, in the event that one collector's worth of data could impact the infra.

@blakeroberts-wk
Contributor

blakeroberts-wk commented Dec 19, 2022

I feel like what you want here is a sampler, not a rate limiter.

I need to share more details here. The processor is a tail-based sampler. We are probabilistically sampling on trace ID, but we still must cache all spans in the event that, for example, a span with an error is encountered and all spans of the trace must be forwarded. We could more heavily sample these "abusive" services, but we'd like to be able to notify the services that this is happening, since a service owner by default will be assuming "if my service encounters an error, I will be able to see the trace related to the error". When that assertion fails, it'd be nice for them to be able to check the service logs and see that their OTLP exporter was being rate limited (in the event the exporter exhausts its retries).

@MovieStoreGuy
Contributor Author

We could more heavily sample these "abusive" services, but we'd like to be able to notify the services that this is happening, since a service owner by default will be assuming "if my service encounters an error, I will be able to see the trace related to the error". When that assertion fails, it'd be nice for them to be able to check the service logs and see that their OTLP exporter was being rate limited (in the event the exporter exhausts its retries).

There is a means to report internal signals outward (currently done in the form of the Prometheus endpoint for metrics), which could contain this information, e.g. receiver.rate-limited={service.name:bad-service}.
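For example, something like the following could record that signal with the OpenTelemetry Go metrics API; the instrument name mirrors the example above and is not an existing collector metric:

package ratelimiter

import (
    "context"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/metric"
)

// recordRateLimited increments a counter tagged with the offending service so
// that operators can see who is being rate limited via the collector's own
// telemetry (e.g. the Prometheus endpoint).
func recordRateLimited(ctx context.Context, serviceName string) error {
    meter := otel.Meter("ratelimiter")
    counter, err := meter.Int64Counter("receiver.rate_limited")
    if err != nil {
        return err
    }
    counter.Add(ctx, 1, metric.WithAttributes(attribute.String("service.name", serviceName)))
    return nil
}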

I don't think the rate limiting function should be combined with sampling since that is a lot of contextual state. At least not initially.

There is also a risk of amplification, since an error in traces on the receiver could cause the sending services to send more data saying they are having issues trying to send data.

@blakeroberts-wk
Contributor

There is a means to report internal signals outward (currently done in the form of the Prometheus endpoint for metrics), which could contain this information, e.g. receiver.rate-limited={service.name:bad-service}.

This would help my team (the one supporting the OpenTelemetry Collector) understand which services are being rate limited (my custom rate limiter currently emits a metric with this information). But for service owners, this data won't be very discoverable. A service owner would need to intuitively guess that rate limiting is the reason their spans contain orphans, then somehow find the related data coming from a service (the collector) they won't be familiar with. Having the error in the service's logs would be very discoverable and seems to be in line with the OpenTelemetry specification around rate limiting and error handling.

I don't think the rate limiting function should be combined with sampling since that is a lot of contextual state. At least not initially.

I agree. Though I do think that rate limiting should be loosely coupled with processor errors. It seems reasonable for a processor to propagate an error that informs clients that the request can be retried.

There is also a risk of amplification, since an error in traces on the receiver could cause the sending services to send more data saying they are having issues trying to send data.

True. And even having clients retry requests at all could amplify server issues regardless of the potential for additional data related to the issues. But I wouldn't think this risk outweighs the reward of having the collector support rate limiting or any other form of retry-able requests.

@vikrambe
Contributor

@MovieStoreGuy @jpkrohling @tigrannajaryan Any update/progress on this issue?

@MovieStoreGuy
Contributor Author

I don't believe anything has come of this yet unfortunately.

@ebracho

ebracho commented Feb 27, 2024

Honeycomb's tail sampler Refinery has an interesting approach to this problem with their Exponential Moving Average (EMA) Sampler.

The way it works is:

  1. Define a goal sample rate (ex: 10%)
  2. Place each trace into a bucket by creating a key derived from attributes (ex: ${service}-${endpoint}-${statusCode})
  3. Periodically adjust the sample rate of each bucket such that:
    a. Less frequent buckets are sampled at a higher rate
    b. More frequent buckets are sampled at a lower rate
    c. Overall sample rate remains at ~$goalSampleRate

In this model, if serviceA experiences a burst in traffic, its buckets will be dynamically adjusted to a lower sample rate. Once traffic returns to normal, its sample rates will be increased again. I believe traces are also labeled with the rate they were sampled at; that way you know how representative a given trace is.

Perhaps this could be implemented as a policy in the tail sampling processor? If we can avoid the need to synchronize sampling policies across many different collectors that will greatly simplify the problem.

The same approach can be used to achieve a goal throughput (traces/sec) as well.
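As a sketch of the rate-adjustment idea only (not Refinery's actual EMA algorithm): once per interval, recompute per-bucket keep probabilities from the counts observed, giving rare buckets a proportionally larger share of the keep budget.

package emasample

import "math"

// adjustRates targets an overall kept fraction of goalRate. The keep budget is
// split across buckets by log-weight, so heavy buckets end up with a lower
// keep probability and rare buckets with a higher one. Budget left over when a
// bucket caps out at probability 1 is not redistributed in this sketch.
func adjustRates(counts map[string]float64, goalRate float64) map[string]float64 {
    total, weightSum := 0.0, 0.0
    weights := make(map[string]float64, len(counts))
    for key, c := range counts {
        w := math.Log1p(c) // dampen the influence of heavy buckets
        weights[key] = w
        weightSum += w
        total += c
    }
    rates := make(map[string]float64, len(counts))
    if total == 0 || weightSum == 0 {
        return rates
    }
    budget := goalRate * total // how many items we aim to keep overall
    for key, c := range counts {
        share := budget * weights[key] / weightSum // this bucket's keep budget
        rates[key] = math.Min(1, share/c)          // keep probability for the bucket
    }
    return rates
}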

@cforce

cforce commented Sep 4, 2024

Remark: sampling based on rate limits is available here:
https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/tailsamplingprocessor
https://pkg.go.dev/github.com/open-telemetry/opentelemetry-collector-contrib/processor/tailsamplingprocessor#section-readme

Unfortunately it is based on processors and happens much too late to prevent unwanted utilization, which is what is expected from a rate limiter that should drop as far left as possible.

The implementation idea around #6908 (comment) was pretty much what would be needed, but no one has picked it up so far :/

@jpkrohling
Member

jpkrohling commented Sep 6, 2024

Perhaps this could be implemented as a policy in the tail sampling processor? If we can avoid the need to synchronize sampling policies across many different collectors that will greatly simplify the problem.

I would be open to having this. Do you want to contribute this new policy?
