Limit capabilities to respect cgroups cpu quota #403
Overall looks good, I'll wait for tests to give a final approval.
@scruffystuffs added tests
The number of capabilities defaults to 1 without
It's a bug (missing feature?) insofar as GHC currently doesn't care about cgroups cpu limits -- similarly, the golang runtime doesn't care about cpu limits either, so we use a similar workaround in our golang codebases to set
Probably not. I'd like to spin this off into a separate library where the phantom type would probably have more purpose, and it's probably overkill for now
One issue with CPP, but otherwise LGTM
test/System/CGroup/TypesSpec.hs
@@ -0,0 +1,53 @@
{-# LANGUAGE CPP #-}
I'd really like to avoid doing anything with CPP if possible. Maybe we could use a wrapper around it which does nothing on Windows.
you're right -- not sure what I was thinking. updated
0a7452c to 096b3c8
Overview
Capabilities
For speed, we poll the environment for the number of available "capabilities", which GHC defines as "the number of Haskell threads that can run truly simultaneously (on separate physical processors)", and we use that number to spawn worker threads for discovery/analysis tasks.
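As a rough illustration of that pattern, here is a minimal sketch that sizes a worker pool from the capability count using only `base`. The names `runWorkers` and `runOnCapabilities` are hypothetical and not taken from this PR, and the real discovery/analysis workers are certainly more involved than this.

```haskell
import Control.Concurrent (forkIO, getNumCapabilities)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Control.Monad (forM)

-- | Hypothetical helper: fork @n@ workers, hand each its index, and collect
-- the results in index order. There is no exception handling here; a worker
-- that throws never fills its MVar, so the collecting thread stays blocked
-- on it.
runWorkers :: Int -> (Int -> IO a) -> IO [a]
runWorkers n task = do
  slots <- forM [0 .. n - 1] $ \i -> do
    slot <- newEmptyMVar
    _ <- forkIO (task i >>= putMVar slot)
    pure slot
  mapM takeMVar slots

-- | Size the pool from whatever the runtime currently reports as the number
-- of capabilities.
runOnCapabilities :: (Int -> IO a) -> IO [a]
runOnCapabilities task = do
  n <- getNumCapabilities
  runWorkers n task
```

Because the pool size comes straight from `getNumCapabilities`, whatever value the runtime was given at startup decides how many workers get spawned.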
Previous to this PR, we were letting the GHC runtime automatically set the number of available capabilities with the -N flag.
Containers
Containers, especially those run in a cluster environment, are often run with limits on the amount of CPU they're allowed to consume. Linux cgroups ("control groups") present these limits to the container as cpu.cfs_quota_us and cpu.cfs_period_us, documented here. In short, quota / period forms the ratio of CPU time we're allowed to consume.
The problem
Unfortunately, rather than propagating cgroup cpu limits to the container's view of the hardware of the machine, the container will happily report that it has all of the available cores on the host machine.
As such, when GHC automatically sets the number of capabilities for us in a cgroups-cpu-limited container on a large machine, we spawn way too many worker threads relative to the amount of CPU time allocated to us. When the disparity is particularly large, we end up with a lot of context-switching and contention. Threads spend a lot of time getting woken up before momentarily getting throttled and yielding to the scheduler.
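To make the mismatch concrete, here is a minimal sketch of the arithmetic, assuming cgroups v1 semantics where a quota of -1 means no limit. The name `effectiveCpus` and the rounding policy (round up, at least 1, never more than the host reports) are illustrative assumptions, not the code in this PR.

```haskell
-- | Hypothetical helper: how many CPUs' worth of time a cgroup CPU limit
-- actually grants, given the processor count the container reports.
effectiveCpus
  :: Int -- ^ processors visible inside the container (the host's count)
  -> Int -- ^ value of cpu.cfs_quota_us (-1 means no limit under cgroups v1)
  -> Int -- ^ value of cpu.cfs_period_us
  -> Int
effectiveCpus hostCpus quota period
  -- No usable limit: fall back to what the container reports.
  | quota <= 0 || period <= 0 = hostCpus
  -- Round quota / period up, then clamp into [1, hostCpus].
  -- The rounding direction is an assumption made for this sketch.
  | otherwise = max 1 (min hostCpus ((quota + period - 1) `div` period))
```

For example, on a host that reports 64 cores, a quota of 200000 with a period of 100000 gives `effectiveCpus 64 200000 100000 == 2`, while a runtime sized from the reported hardware would use all 64.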
The solution
Rather than letting the GHC runtime set the number of capabilities for us with the -N flag, this PR adds a new initCapabilities function that sets the number of available capabilities while respecting cgroup limits, when applicable.
Acceptance criteria
Testing plan
getNumCapabilities to ensure they're getting set correctly. [This documentation] is helpful for setting up and testing within cgroups.
Risks
N/A
References
https://fossa.zendesk.com/agent/tickets/3198
Checklist
docs/.
Changelog.md if this change is externally facing. If this PR did not mark a release, I added my changes into an # Unreleased section at the top.