feat(ai): add multi-node, multi-GPU distributed training to AI #1261
🦋 Changeset detected. Latest commit: 17ae366. The changes in this PR will be included in the next version bump. This PR includes changesets to release 15 packages.
@@ -5,7 +5,7 @@
"main": "build/index.js",
"private": true,
"scripts": {
"generate": "rimraf generated && buf generate --template buf.gen.yaml https://github.com/PKUHPC/scow-scheduler-adapter-interface.git#branch=ai-multi-pod-train",
"generate": "rimraf generated && buf generate --template buf.gen.yaml https://github.com/PKUHPC/scow-scheduler-adapter-interface.git#branch=feat-ai-release",
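Is this feature ready? If it is, the adapter interface needs to release a new version, and this feature needs to check whether the cluster's adapter is a new version that already implements the new interface. Code on the SCOW main branch must not use an adapter interface that has not been formally released.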
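SCOW's master currently points at the feat-ai-release branch, not a version number.
feat-ai-release has to be merged into master in the adapter-interface repository before the adapter version check mentioned above can proceed.
At the time, I think the decision was to hold off on merging because the AI module was still in beta.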
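#1361 separated the interface versions used by SCOW AI and by SCOW. This feature, along with the other interfaces that are ready on the SCOW side, should be merged into master, and a new version v1.6.0 should be released on top of v1.5.0. This PR should then use interface version v1.6.0.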
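The adapter version check requested earlier in this thread could look roughly like the sketch below. It is illustrative only: `parseVersion`, `satisfiesMinimum`, and `supportsDistributedTraining` are hypothetical names, not existing SCOW functions, and how the adapter's version string is actually obtained is not shown in this PR. The only detail taken from the discussion is the proposed minimum version v1.6.0.

```typescript
// Hypothetical shape for a parsed interface version (not SCOW's actual type).
interface ApiVersion {
  major: number;
  minor: number;
  patch: number;
}

// Parses "v1.6.0" or "1.6.0" into numeric parts; throws on any other format.
function parseVersion(text: string): ApiVersion {
  const match = /^v?(\d+)\.(\d+)\.(\d+)$/.exec(text.trim());
  if (!match) {
    throw new Error("Invalid version string: " + text);
  }
  return {
    major: Number(match[1]),
    minor: Number(match[2]),
    patch: Number(match[3]),
  };
}

// Returns true when `actual` is greater than or equal to `required`.
function satisfiesMinimum(actual: ApiVersion, required: ApiVersion): boolean {
  if (actual.major !== required.major) return actual.major > required.major;
  if (actual.minor !== required.minor) return actual.minor > required.minor;
  return actual.patch >= required.patch;
}

// Assumption: v1.6.0 is the first interface release carrying this feature,
// as proposed in the review comment above.
const MIN_INTERFACE_VERSION = parseVersion("v1.6.0");

// Hypothetical gate: decides whether a cluster's adapter is new enough.
function supportsDistributedTraining(adapterVersion: string): boolean {
  return satisfiesMinimum(parseVersion(adapterVersion), MIN_INTERFACE_VERSION);
}
```

A caller would pass the version string reported by the cluster's adapter and reject the submission when the function returns false.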
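#1361 has been merged. The proto changes in this PR only affect AI, right? In that case:
- In the interface project, open a PR that merges the changes this feature relies on into the AI branch.
- Once the interface changes are merged into the AI branch, merge master into this branch of the SCOW project, then point the proto project used by the AI project at the AI branch.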
Review the interface first; please keep an eye on the review comments on PKUHPC/scow-scheduler-adapter-interface#25.
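Changes
Cluster information gains two new parameters, gpuType and vramMb; during training, gpuType must also be passed back to the adapter.
Adds multi-node, multi-GPU support: when a user selects multiple nodes (pods) for AI training, they must specify the algorithm framework to use for distributed training. tensorflow, pytorch, and mindspore (Huawei-specific) are currently supported.
The AI app configuration file gains a tag field, and three new interfaces are added: listApp, listTags, and listClusters, which returns the clusters on which the app identified by appId can be created. These interfaces mainly prepare for the upcoming refactoring of the AI job module.
Jobs and apps submitted to a GPU partition must pass gpuType to tell the adapter which type of GPU is in use.
For Huawei GPU cards, a framework must be specified when submitting apps and jobs, whether or not the training is distributed.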
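The submission rules listed above can be summarized in a small validation sketch. This is not the PR's implementation: `TrainSubmission`, `validateSubmission`, and the `isHuaweiGpu` predicate are hypothetical names introduced for illustration, and how a Huawei card is actually recognized from gpuType is not stated in this description, so the caller supplies that check.

```typescript
// The three frameworks named in the PR description.
type Framework = "tensorflow" | "pytorch" | "mindspore";

// Hypothetical request shape; the field names are assumptions, not SCOW's API.
interface TrainSubmission {
  nodeCount: number;        // number of nodes (pods) selected
  isGpuPartition: boolean;  // whether the target partition is a GPU partition
  gpuType?: string;         // GPU type reported to the adapter
  framework?: Framework;    // distributed training framework
}

// Returns a list of rule violations; an empty list means the submission passes.
function validateSubmission(
  job: TrainSubmission,
  isHuaweiGpu: (gpuType: string) => boolean, // assumption: supplied by the caller
): string[] {
  const errors: string[] = [];

  // Rule: GPU partitions must always pass gpuType to the adapter.
  if (job.isGpuPartition && !job.gpuType) {
    errors.push("gpuType is required for GPU partitions");
  }

  // Rule: a framework is required for multi-node training, and always
  // required on Huawei cards regardless of node count.
  const onHuawei = job.gpuType !== undefined && isHuaweiGpu(job.gpuType);
  if ((job.nodeCount > 1 || onHuawei) && !job.framework) {
    errors.push("framework is required for this submission");
  }

  return errors;
}
```

Collecting violations in a list instead of throwing on the first one lets a form report every missing field at once.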