Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

scheduler: optimize numa affinity store #2209

Merged

Conversation

ZiMengSheng
Copy link
Contributor

Ⅰ. Describe what this PR does

当前每个节点的最佳 NUMA Affinity 再 Filter 阶段判断之后保存在 CycleState 中,这在通常调度场景下没有问题。但是在如下场景下会有问题,我们提供抢占者为 preemptor、被抢占者为 victim,两者都申请 NUMA 感知、绑核、8卡,并且亲和同一个 8 卡 节点。

  1. Victim 调度成功
  2. Preemptor 调度触发抢占,删除掉 Victim
  3. Preemptor 开始调度,由于目标节点存在 NonimatingPod,所以会在 Filter 时:
    a. 带 Nominating Filter:在 Filter 时候做一次 state.Clone,然后用 clone 的 state 去过 Filter,这时候 numa Affinity 保存到了 clone 的state 中
    b. 不带 Nominating Filter:由于判断到 a 中并没有执行实际的 AddPod,所以这里逻辑认为 a Filter 结果有效,直接返回

上述流程就导致 NUMA Affinity 没有实际保存下来,倒是本来要申请双 NUMA 的节点只分到了单 NUMA 资源,看起来就像没有开启 NUMA 感知调度一样。

本 PR 将 affinityStore 的clone 方式改为浅拷贝,并:

  1. 在调度阶段第一次带着抢占成功的 Pod 为当前 Pod 调度成功时时记录下来,调度流程中的后续阶段都复用该结果就好了
  2. 抢占时忽略已经记录的结果,每次都重新判断是否满足 NUMA 对齐规则

Ⅱ. Does this pull request fix one issue?

Ⅲ. Describe how to verify it

Ⅳ. Special notes for reviews

V. Checklist

  • I have written necessary docs and comments
  • I have added necessary unit tests and integration tests
  • All checks passed in make test

Copy link

codecov bot commented Sep 19, 2024

Codecov Report

Attention: Patch coverage is 0% with 14 lines in your changes missing coverage. Please review.

Project coverage is 67.67%. Comparing base (cd6ce25) to head (356f773).
Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
.../scheduler/frameworkext/topologymanager/manager.go 0.00% 10 Missing ⚠️
pkg/scheduler/frameworkext/framework_extender.go 0.00% 3 Missing ⚠️
...kg/scheduler/frameworkext/topologymanager/store.go 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2209      +/-   ##
==========================================
- Coverage   67.71%   67.67%   -0.04%     
==========================================
  Files         446      446              
  Lines       42715    42719       +4     
==========================================
- Hits        28923    28910      -13     
- Misses      11266    11281      +15     
- Partials     2526     2528       +2     
Flag Coverage Δ
unittests 67.67% <0.00%> (-0.04%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@ZiMengSheng ZiMengSheng force-pushed the numa_optimize_affinity_store branch 2 times, most recently from 6da94eb to 27bc349 Compare September 19, 2024 13:04
Copy link
Member

@saintube saintube left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@ZiMengSheng
Copy link
Contributor Author

/approve

Signed-off-by: wangjianyu.wjy <wangjianyu.wjy@alibaba-inc.com>
@ZiMengSheng
Copy link
Contributor Author

ZiMengSheng commented Sep 19, 2024

/approved

@koordinator-bot koordinator-bot bot merged commit c6cddf3 into koordinator-sh:main Sep 20, 2024
21 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants