Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add warmup period to CPU utilization moving average #5394

Merged
merged 2 commits into from
Jul 5, 2023

Conversation

pracucci
Copy link
Collaborator

@pracucci pracucci commented Jul 1, 2023

What this PR does

Yesterday I enabled the CPU-based limiting in a production cell and I noticed requests to be limited right after the ingester startup, even if the 1m CPU utilization average (as computed by rate(container_cpu_usage_seconds_total[1m])) was way below the limit (about the half).

After some investigation, I realised that the limit triggered right after the limiter was started. We start the limiter after we replay the TSDB WALs in the ingester, and right before the ingester is ready to accept samples. Looking at logs of ingesters where the limit triggered, they all show the same pattern:

ts=2023-06-30T18:07:17.632955811Z caller=ingester.go:2210 level=info msg="successfully opened existing TSDBs"
ts=2023-06-30T18:07:17.633186146Z caller=mimir.go:834 level=info msg="Application started"
ts=2023-06-30T18:07:19.633923839Z caller=utilization.go:149 level=info context="read path" msg="enabling resource utilization based limiting" reason=cpu memory_limit=21474836480 memory_utilization=5136072704 cpu_limit=8 cpu_utilization=8.61
ts=2023-06-30T18:07:22.634063682Z caller=utilization.go:153 level=info context="read path" msg="disabling resource utilization based limiting" memory_limit=21474836480 memory_utilization=5138870272 cpu_limit=8 cpu_utilization=7.892821998317039

While reading https://github.com/VividCortex/ewma#readme to learn more how the EWMA works, I reliased that to get more stable results you should have a warmup period (which is also implemented by VividCortex's VariableEWMA). The idea of the warmup period is that for the first N samples, you add samples to the EWMA but you don't consider the computed rate as valid.

In this PR I'm adding a warmup period to our UtilizationBasedLimiter. I've also added a test to show the CPU utilization "sensitivity" (I just assert on the max CPU utilization because that's what we care with regards to the limiting).

Which issue(s) this PR fixes or relates to

N/A

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@pracucci pracucci marked this pull request as ready for review July 1, 2023 08:09
@pracucci pracucci requested a review from a team as a code owner July 1, 2023 08:09
@jhalterman
Copy link
Member

jhalterman commented Jul 1, 2023

This fix is well spotted and looks good, but it might be better to add the warmup support directly into EwmaRate, so that it doesn't need to be implemented externally and future users don't hit the same problem.

Copy link
Contributor

@aknuds1 aknuds1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! Please see my suggested minor simplification.

Also, @jhalterman's suggestion is worth considering.

pkg/util/limiter/utilization.go Outdated Show resolved Hide resolved
pkg/util/limiter/utilization_test.go Outdated Show resolved Hide resolved
@aknuds1 aknuds1 added component/ingester bug Something isn't working labels Jul 4, 2023
Signed-off-by: Marco Pracucci <marco@pracucci.com>
Signed-off-by: Marco Pracucci <marco@pracucci.com>
@pracucci pracucci force-pushed the add-warmup-to-cpu-moving-average branch from ca90fb3 to 4efab63 Compare July 5, 2023 07:55
@pracucci
Copy link
Collaborator Author

pracucci commented Jul 5, 2023

Thanks for the review. Addressed @aknuds1 suggestions. I think @jhalterman suggestion is a good idea, but will keep it for a follow up PR, given we're currently in a hurry to get this PR merged.

@pracucci pracucci enabled auto-merge (squash) July 5, 2023 07:56
@pracucci pracucci merged commit f86040d into main Jul 5, 2023
@pracucci pracucci deleted the add-warmup-to-cpu-moving-average branch July 5, 2023 08:06
@grafanabot
Copy link
Contributor

The backport to r245 failed:

The process '/usr/bin/git' failed with exit code 1

To backport manually, run these commands in your terminal:

# Fetch latest updates from GitHub
git fetch
# Create a new branch
git switch --create backport-5394-to-r245 origin/r245
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x f86040d0b31d89de4ed225adbaf24853a2b05c80
# Push it to GitHub
git push --set-upstream origin backport-5394-to-r245
git switch main
# Remove the local backport branch
git branch -D backport-5394-to-r245

Then, create a pull request where the base branch is r245 and the compare/head branch is backport-5394-to-r245.

aknuds1 pushed a commit that referenced this pull request Jul 5, 2023
* Add warmup period to CPU utilization moving average

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Address review comments

Signed-off-by: Marco Pracucci <marco@pracucci.com>

---------

Signed-off-by: Marco Pracucci <marco@pracucci.com>
(cherry picked from commit f86040d)
@aknuds1 aknuds1 mentioned this pull request Jul 5, 2023
3 tasks
aknuds1 added a commit that referenced this pull request Jul 5, 2023
* Add warmup period to CPU utilization moving average

---------

Signed-off-by: Marco Pracucci <marco@pracucci.com>
(cherry picked from commit f86040d)

---------

Signed-off-by: Arve Knudsen <arve.knudsen@gmail.com>
Co-authored-by: Marco Pracucci <marco@pracucci.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants