Limit capabilities to respect cgroups cpu quota #403

cnr · 2021-10-21T16:39:13Z

Overview

Capabilities

For speed, we poll the environment for the number of available "capabilities", which GHC defines as "the number of Haskell threads that can run truly simultaneously (on separate physical processors)", and we use that number to spawn worker threads for discovery/analysis tasks.

Previous to this PR, we were letting the GHC runtime automatically set the number of available capabilities with the -N flag.

Containers

Containers, especially those run in a cluster environment, are often run with limits for the amount of CPU they're allowed to consume. Linux cgroups ("control groups") presents these limits to the container as cpu.cfs_quota_us and cpu.cfs_period_us, documented here.

In short, quota / period forms the ratio of cpu time we're allowed to consume, e.g.,

quota	period	description
10	10	we can consume up to one core
20	10	we can consume up to two cores
5	10	the scheduler will give us one cpu core for up to 50% of the time
-1	10	we can use all of the machine's available cpu resources

The problem

Unfortunately, rather than propagating cgroup cpu limits to the container's view of the hardware of the machine, the container will happily report that it has all of the available cores on the host machine.

As such, when GHC automatically sets the number of capabilities for us in a cgroups-cpu-limited container on a large machine, we spawn way too many worker threads relative to the amount of CPU time allocated to us. When the disparity is particularly large, we end up with a lot of context-switching and contention. Threads spend a lot of time getting woken up before momentarily getting throttled and yielding to the scheduler.

The solution

Rather than letting the GHC runtime set the number of capabilities for us with the -N flag, this PR adds a new initCapabilities function that sets the number of available capabilities while respecting cgroup limits, when applicable

Acceptance criteria

Within a linux container, cgroups cpu limits are respected, and the number of capabilities is set based on cgroups limits, when applicable
Outside of a linux container, CLI behavior is unchanged
On macOS and Windows, CLI behavior is unchanged

Testing plan

Manual testing for (1), (2), (3) - in particular with print debugging on getNumCapabilities to ensure they're getting set correctly. [This documentation] is helpful for setting up and testing within cgroups
(1) was additionally tested within a customer's target environment
Tests for cgroup/mountinfo parser logic and cgroups quota logic

Risks

N/A

References

https://fossa.zendesk.com/agent/tickets/3198

Checklist

I added tests for this PR's change (or confirmed tests are not viable).
If this PR introduced a user-visible change, I added documentation into docs/.
I updated Changelog.md if this change is externally facing. If this PR did not mark a release, I added my changes into an # Unreleased section at the top.
I linked this PR to any referenced GitHub issues, if they exist.

skilly-lily · 2021-10-21T18:38:46Z

Overall looks good, I'll wait for tests to give a final approval.
Few questions:

What would happen if we didn't call initCapabilities and omitted the RTS -N option?
Is this really a GHC RTS bug, as opposed to just a missing part of the concurrent library options?
Noting that Controller takes a phantom type, are there types other than CPU that you think spectrometer might need in the future?

cnr · 2021-10-21T21:08:05Z

@scruffystuffs

added tests

What would happen if we didn't call initCapabilities and omitted the RTS -N option?

The number of capabilities defaults to 1 without -N set, so the cli would run under a single thread

Is this really a GHC RTS bug, as opposed to just a missing part of the concurrent library options?

It's a bug (missing feature?) insofar as GHC currently doesn't care about cgroups cpu limits -- similarly, the golang runtime doesn't care about cpu limits either, so we use a similar workaround in our golang codebases to set GOMAXPROCS

Noting that Controller takes a phantom type, are there types other than CPU that you think spectrometer might need in the future?

Probably not. I'd like to spin this off into a separate library where the phantom type would probably have more purpose, and it's probably overkill for now

skilly-lily

One issue with CPP, but otherwise LGTM

skilly-lily · 2021-10-22T04:06:29Z

test/System/CGroup/TypesSpec.hs

@@ -0,0 +1,53 @@
+{-# LANGUAGE CPP #-}


I'd really like to avoid doing anything with CPP if possible. Maybe we could use a wrapper around it which does nothing on windows.

you're right -- not sure what I was thinking. updated

cnr requested review from a team and skilly-lily and removed request for a team October 21, 2021 16:39

skilly-lily approved these changes Oct 22, 2021

View reviewed changes

cnr added 11 commits October 22, 2021 09:43

Limit capabilities to respect cgroups cpu quota

86d1f1b

initCapabilities: catch all non-async exceptions

1d434ac

Strongly type cgroup and mount filepaths

5fc1ba1

Add tests for resolveGroupController

d9ad90d

Add tests for getCPUQuota

2da20f0

Kill unused TemplateHaskell pragma

0c6ce49

Parse proc paths at runtime

1e87874

Disable CGroup test on Windows

f9e4dbf

Add missing import

3ede765

Don't use CPP

586ca9b

Update changelog

096b3c8

cnr force-pushed the feat/cgroup-quota branch from 0a7452c to 096b3c8 Compare October 22, 2021 16:51

cnr merged commit 359dd90 into master Oct 22, 2021

cnr deleted the feat/cgroup-quota branch October 22, 2021 17:06

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Limit capabilities to respect cgroups cpu quota #403

Limit capabilities to respect cgroups cpu quota #403

cnr commented Oct 21, 2021 •

edited

Loading

skilly-lily commented Oct 21, 2021

cnr commented Oct 21, 2021

skilly-lily left a comment

skilly-lily Oct 22, 2021

cnr Oct 22, 2021

Limit capabilities to respect cgroups cpu quota #403

Limit capabilities to respect cgroups cpu quota #403

Conversation

cnr commented Oct 21, 2021 • edited Loading

Overview

Capabilities

Containers

The problem

The solution

Acceptance criteria

Testing plan

Risks

References

Checklist

skilly-lily commented Oct 21, 2021

cnr commented Oct 21, 2021

skilly-lily left a comment

Choose a reason for hiding this comment

skilly-lily Oct 22, 2021

Choose a reason for hiding this comment

cnr Oct 22, 2021

Choose a reason for hiding this comment

cnr commented Oct 21, 2021 •

edited

Loading