-
Notifications
You must be signed in to change notification settings - Fork 532
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add warmup period to CPU utilization moving average #5394
Conversation
This fix is well spotted and looks good, but it might be better to add the warmup support directly into |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch! Please see my suggested minor simplification.
Also, @jhalterman's suggestion is worth considering.
Signed-off-by: Marco Pracucci <marco@pracucci.com>
Signed-off-by: Marco Pracucci <marco@pracucci.com>
ca90fb3
to
4efab63
Compare
Thanks for the review. Addressed @aknuds1 suggestions. I think @jhalterman suggestion is a good idea, but will keep it for a follow up PR, given we're currently in a hurry to get this PR merged. |
The backport to
To backport manually, run these commands in your terminal: # Fetch latest updates from GitHub
git fetch
# Create a new branch
git switch --create backport-5394-to-r245 origin/r245
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x f86040d0b31d89de4ed225adbaf24853a2b05c80
# Push it to GitHub
git push --set-upstream origin backport-5394-to-r245
git switch main
# Remove the local backport branch
git branch -D backport-5394-to-r245 Then, create a pull request where the |
* Add warmup period to CPU utilization moving average Signed-off-by: Marco Pracucci <marco@pracucci.com> * Address review comments Signed-off-by: Marco Pracucci <marco@pracucci.com> --------- Signed-off-by: Marco Pracucci <marco@pracucci.com> (cherry picked from commit f86040d)
* Add warmup period to CPU utilization moving average --------- Signed-off-by: Marco Pracucci <marco@pracucci.com> (cherry picked from commit f86040d) --------- Signed-off-by: Arve Knudsen <arve.knudsen@gmail.com> Co-authored-by: Marco Pracucci <marco@pracucci.com>
What this PR does
Yesterday I enabled the CPU-based limiting in a production cell and I noticed requests to be limited right after the ingester startup, even if the 1m CPU utilization average (as computed by
rate(container_cpu_usage_seconds_total[1m])
) was way below the limit (about the half).After some investigation, I realised that the limit triggered right after the limiter was started. We start the limiter after we replay the TSDB WALs in the ingester, and right before the ingester is ready to accept samples. Looking at logs of ingesters where the limit triggered, they all show the same pattern:
While reading https://github.com/VividCortex/ewma#readme to learn more how the EWMA works, I reliased that to get more stable results you should have a warmup period (which is also implemented by VividCortex's
VariableEWMA
). The idea of the warmup period is that for the first N samples, you add samples to the EWMA but you don't consider the computed rate as valid.In this PR I'm adding a warmup period to our
UtilizationBasedLimiter
. I've also added a test to show the CPU utilization "sensitivity" (I just assert on the max CPU utilization because that's what we care with regards to the limiting).Which issue(s) this PR fixes or relates to
N/A
Checklist
CHANGELOG.md
updated - the order of entries should be[CHANGE]
,[FEATURE]
,[ENHANCEMENT]
,[BUGFIX]