-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
client/fingerprint: correctly fingerprint E/P cores of Apple Silicon chips #16672
Conversation
bac2ed4
to
c95b13d
Compare
c95b13d
to
3b83232
Compare
…chips This PR adds detection of asymetric core types (Power & Efficiency) (P/E) when running on M1/M2 Apple Silicon CPUs. This functionality is provided by shoenig/go-m1cpu which makes use of the Apple IOKit framework to read undocumented registers containing CPU performance data. Currently working on getting that functionality merged upstream into gopsutil, but gopsutil would still not support detecting P vs E cores like this PR does. Also refactors the CPUFingerprinter code to handle the mixed core types, now setting power vs efficiency cpu attributes. For now the scheduler is still unaware of mixed core types - on Apple platforms tasks cannot reserve cores anyway so it doesn't matter, but at least now the total CPU shares available will be correct. Future work should include adding support for detecting P/E cores on the latest and upcoming Intel chips, where computation of total cpu shares is currently incorrect. For that, we should also include updating the scheduler to be core-type aware, so that tasks of resources.cores on Linux platforms can be assigned the correct number of CPU shares for the core type(s) they have been assigned. node attributes before cpu.arch = arm64 cpu.modelname = Apple M2 Pro cpu.numcores = 12 cpu.reservablecores = 0 cpu.totalcompute = 1000 node attributes after cpu.arch = arm64 cpu.frequency.efficiency = 2424 cpu.frequency.power = 3504 cpu.modelname = Apple M2 Pro cpu.numcores.efficiency = 4 cpu.numcores.power = 8 cpu.reservablecores = 0 cpu.totalcompute = 37728
3b83232
to
a5c121a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly questions, but LGTM. Just missing a CHANGELOG
entry 🙂
Smoke test on my machine:
Attributes
cpu.arch = arm64
cpu.frequency.efficiency = 2064
cpu.frequency.power = 3228
cpu.modelname = Apple M1 Max
cpu.numcores.efficiency = 2
cpu.numcores.power = 8
cpu.reservablecores = 0
cpu.totalcompute = 29952
helper/stats/cpu.go
Outdated
} | ||
var err error | ||
if cpuPowerCoreCount, err = cpu.Counts(true); err != nil { | ||
initErr = errors.Join(initErr, fmt.Errorf("failed to detect number of CPU cores: %v", err)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
initErr = errors.Join(initErr, fmt.Errorf("failed to detect number of CPU cores: %v", err)) | |
initErr = errors.Join(initErr, fmt.Errorf("failed to detect number of CPU cores: %w", err)) |
I think this let us unwrap the inner error? I'm still getting used to the new errors
stuff 😅
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
// detecting little cores on non-apple platforms | ||
cpuTotalTicks = uint64(cpuPowerCoreCount) * cpuPowerCoreMHz | ||
|
||
initErr = err |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this overwriting the accumulated errors with just the last one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this bit isn't in a loop - it's just setting the global initErr
inside a sync.Do
so that the cpu detection only happens once
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But we write to initErr
twice inside the default
clause right?
Lines 60 to 62 in 2b28c83
if cpuInfoStats, err = cpu.InfoWithContext(ctx); err != nil { | |
initErr = errors.Join(initErr, fmt.Errorf("Unable to obtain CPU information: %w", err)) | |
} |
Lines 52 to 54 in 2b28c83
if cpuPowerCoreCount, err = cpu.Counts(true); err != nil { | |
initErr = errors.Join(initErr, fmt.Errorf("failed to detect number of CPU cores: %w", err)) | |
} |
If both fail I expect initErr
to have two erros, but this initErr = err
is making it so only the last error is returned.
helper/stats/cpu.go
Outdated
|
||
var cpuInfo []cpu.InfoStat | ||
if cpuInfo, err = cpu.InfoWithContext(ctx); err != nil { | ||
errors.Join(err, fmt.Errorf("failed to detect CPU information: %v", err)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the joined error being ignore? Or does it mutate err
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// Join returns an error that wraps the given errors.
So, I think we need to capture the returned error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah @jrasell is right - this line actually does nothing because the result is returned from Join
... however this line is unnecessary anyway since the assignment to initErr already contains a contextual message
logger hclog.Logger | ||
|
||
// accumulates result in these resource structs | ||
resources *structs.Resources |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we keep the COMPAT
note in case do we clean this up in the future?
nomad/client/fingerprint/cpu.go
Lines 37 to 38 in c071791
// COMPAT(0.10): Remove in 0.10 | |
resp.Resources = &structs.Resources{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, only minor comments and as Luiz mentioned, missing changelog.
client/fingerprint/cpu_default.go
Outdated
@@ -3,6 +3,6 @@ | |||
|
|||
package fingerprint | |||
|
|||
func (f *CPUFingerprint) deriveReservableCores(req *FingerprintRequest) ([]uint16, error) { | |||
return nil, nil | |||
func (f *CPUFingerprint) deriveReservableCores(string) []uint16 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: ignore the unused parameter
func (f *CPUFingerprint) deriveReservableCores(string) []uint16 { | |
func (f *CPUFingerprint) deriveReservableCores(_ string) []uint16 { |
helper/stats/cpu.go
Outdated
|
||
var cpuInfo []cpu.InfoStat | ||
if cpuInfo, err = cpu.InfoWithContext(ctx); err != nil { | ||
errors.Join(err, fmt.Errorf("failed to detect CPU information: %v", err)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// Join returns an error that wraps the given errors.
So, I think we need to capture the returned error.
helper/stats/cpu.go
Outdated
// node to fall into a unique computed node class | ||
cpuMhzPerCore = math.Floor(cpuMhzPerCore) | ||
cpuTotalTicks = math.Floor(float64(cpuNumCores) * cpuMhzPerCore) | ||
for _, cpu := range cpuInfo { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we rename the cpu
variable, as this clashes with an import?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
This PR adds detection of asymetric core types (Power (P) & Efficiency (E))
when running on M1/M2 Apple Silicon CPUs. This functionality is provided
by
shoenig/go-m1cpu
which makes use of the Apple IOKit framework to readundocumented registers containing CPU performance data. Currently working
on getting that functionality merged upstream into
gopsutil
, butgopsutil
would still not support detecting P vs E cores like this PR does.
Also refactors the
CPUFingerprinter
code to handle the mixed coretypes, now setting power vs efficiency cpu attributes.
For now the scheduler is still unaware of mixed core types - on Apple
platforms tasks cannot reserve cores anyway so it doesn't matter, but
at least now the total CPU shares available will be correct.
Future work should include adding support for detecting P/E cores on
the latest and upcoming Intel chips, where computation of total cpu shares
is currently incorrect. For that, we should also include updating the
scheduler to be core-type aware, so that tasks making use of
resources.cores
on Linux platforms can be assigned the correct number of CPU shares for the core
type(s) they have been assigned.
node attributes before
node attributes after
Fixes #12029
Fixes #11046
Fixes #9408