Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

use rq_total in load_stats_reporter, current math might be double counting, also some nits on criteria of "when to report load" #37544

Open
stevenzzzz opened this issue Dec 6, 2024 · 2 comments

Comments

@stevenzzzz
Copy link
Contributor

Title: use rq_total in load_stats_reporter, current math might be double counting,

Description:
In current load_stats_reporter impl, there is a chance that we are double counting requests:
https://github.com/envoyproxy/envoy/blob/main/source/common/upstream/load_stats_reporter.cc#L79-L111

  1. rq_success inc()ed upon seeing "non-5xx" response headers.
  2. rq_active_ dec() happens on the stream deferred deletion time.
  3. if [load reporter latch] happen between 1. and 2., double counting on active requests would happen. This is more visible when stream destruction involving some time consuming operations, e.g. logging to external services.

It's also noteworthy that the if statement is checking on rq_success+rq_error+rq_active, in most happy cases, this number should match rq_total.

There is an error case that leads to load reporting failure tho: with grpc, grpc server fails to send trailers, the rq_success_.inc() or rq_error_.inc() in Router::Filter::onUpstreamTrailers() will never be called:
rq_active_ will dec() as the stream destructs, but rq_[success/error]_ will not inc().

We should probably fix the rq_[success/error]_ inc issue in another PR, but in this issue I hope we can solve the issue on the load_stats_reporter side by changing the if statement to use rq_issued, specifically:
changing if statement to check on rq_issued.

This way it gives the LRS server a chance to infer the "load " totally based on rq_total.

@stevenzzzz stevenzzzz added bug triage Issue requires triage labels Dec 6, 2024
@stevenzzzz
Copy link
Contributor Author

@karthikbox here since #6784 is related.

@stevenzzzz
Copy link
Contributor Author

@adisuissa @yanavlasov as well.

@wbpcode wbpcode added area/upstream and removed triage Issue requires triage labels Dec 13, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants