feat(ai): add multi-node, multi-GPU distributed training to AI #1261

Merged: 17 commits merged on Jul 25, 2024

@ZihanChen821 (Contributor) commented May 22, 2024

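Changes

  1. Cluster information now includes the gpuType and vramMb parameters; during training, gpuType must also be passed back to the adapter.

  2. Adds multi-node, multi-GPU support. When a user selects multiple nodes (pods) for AI training, they must specify the algorithm framework to use for distributed training. Currently supported: tensorflow, pytorch, and mindspore (Huawei-specific).

  3. The AI application configuration file gains a tag field, and three new APIs are added: listApp, listTags, and listClusters, which returns the clusters on which the application with a given appId can be created. These APIs mainly prepare for the upcoming refactoring of the AI job module.

  4. When submitting jobs or applications to a GPU partition, gpuType is required so that the adapter knows which GPU type is in use.

  5. For Huawei GPU cards, the framework must be specified when submitting applications and jobs, whether or not the training is distributed.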
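
The submission rules in items 2, 4, and 5 can be summarised as a small validation routine. The following is only a sketch under stated assumptions: `TrainSubmission`, `validateSubmission`, `isHuaweiGpu`, and the `"huawei"` prefix test are hypothetical names and conventions chosen for illustration, not the types or checks this PR actually adds.

```typescript
// Hypothetical shape of a training submission; field names follow the PR
// description, not the real SCOW schema.
type Framework = "tensorflow" | "pytorch" | "mindspore";

interface TrainSubmission {
  nodeCount: number;        // number of nodes (pods) the user selected
  isGpuPartition: boolean;  // whether the target partition is a GPU partition
  gpuType?: string;         // GPU type reported in the cluster information
  framework?: Framework;    // framework used for (distributed) training
}

// Assumption: Huawei cards are recognised by a gpuType prefix.
// The real detection logic may be different.
const isHuaweiGpu = (gpuType?: string): boolean =>
  gpuType !== undefined && gpuType.toLowerCase().startsWith("huawei");

// Returns a list of human-readable problems; an empty list means the
// submission satisfies the three rules.
function validateSubmission(s: TrainSubmission): string[] {
  const errors: string[] = [];
  if (s.isGpuPartition && !s.gpuType) {
    errors.push("gpuType is required for GPU partitions");
  }
  if (s.nodeCount > 1 && !s.framework) {
    errors.push("framework is required for multi-node training");
  }
  if (isHuaweiGpu(s.gpuType) && !s.framework) {
    errors.push("framework is required for Huawei GPUs");
  }
  return errors;
}
```

Returning a list rather than throwing on the first problem lets a form report every missing field at once; that is a design choice of this sketch, not something stated in the PR.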

changeset-bot bot commented May 22, 2024

🦋 Changeset detected

Latest commit: 17ae366

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 15 packages
| Name | Type |
| --- | --- |
| @scow/scheduler-adapter-protos | Patch |
| @scow/config | Patch |
| @scow/ai | Patch |
| @scow/lib-scheduler-adapter | Patch |
| @scow/lib-server | Patch |
| @scow/mis-server | Patch |
| @scow/portal-server | Patch |
| @scow/test-adapter | Patch |
| @scow/lib-web | Patch |
| @scow/audit-server | Patch |
| @scow/auth | Patch |
| @scow/cli | Patch |
| @scow/mis-web | Patch |
| @scow/portal-web | Patch |
| @scow/gateway | Patch |

@OYX-1 OYX-1 force-pushed the feat-multi-pods-train-ai branch from f01c99e to b28579f Compare July 4, 2024 08:48
@OYX-1 OYX-1 force-pushed the feat-multi-pods-train-ai branch from 7eac59c to c311d9f Compare July 5, 2024 06:07
@OYX-1 OYX-1 marked this pull request as ready for review July 11, 2024 01:48
@pkuhpc-review-bot pkuhpc-review-bot bot added the Code-ReviewRequested Code Review Requested label Jul 11, 2024
@pkuhpc-review-bot pkuhpc-review-bot bot requested a review from ddadaal July 11, 2024 01:49
@@ -5,7 +5,7 @@
"main": "build/index.js",
"private": true,
"scripts": {
"generate": "rimraf generated && buf generate --template buf.gen.yaml https://github.com/PKUHPC/scow-scheduler-adapter-interface.git#branch=ai-multi-pod-train",
"generate": "rimraf generated && buf generate --template buf.gen.yaml https://github.com/PKUHPC/scow-scheduler-adapter-interface.git#branch=feat-ai-release",
@ddadaal (Member), Jul 12, 2024

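Is this feature ready? If so, the adapter interface needs to release a new version, and this feature must check whether the cluster's adapter is a new version that already implements the new interface. Code on the SCOW main branch must not use an adapter interface that has not been officially released.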

Contributor

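SCOW's master currently points at the feat-ai-release branch, not at a version number.
feat-ai-release has to be merged into master in the adapter-interface repository before the earlier adapter-version check can be resumed.
As I recall, the decision at the time was to hold off on merging because the AI module was still in beta.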

Member

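#1361 separates the interface versions used by SCOW AI and by SCOW. This feature, along with the other interfaces that are ready on the SCOW side, should be merged into master, and a new version v1.6.0 should be released on top of v1.5.0. This PR should then use interface version v1.6.0.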
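
The adapter-version check requested in this thread amounts to comparing the interface version an adapter reports against the minimum version that contains the new interface. A minimal sketch, assuming versions are plain `vMAJOR.MINOR.PATCH` strings; `parseVersion` and `satisfiesMinimum` are hypothetical helpers written for illustration, not SCOW's actual implementation.

```typescript
interface ApiVersion {
  major: number;
  minor: number;
  patch: number;
}

// Parses "v1.6.0" or "1.6.0" into numeric components.
// Throws on anything that is not MAJOR.MINOR.PATCH.
function parseVersion(v: string): ApiVersion {
  const m = /^v?(\d+)\.(\d+)\.(\d+)$/.exec(v);
  if (!m) {
    throw new Error(`Invalid version: ${v}`);
  }
  return { major: Number(m[1]), minor: Number(m[2]), patch: Number(m[3]) };
}

// True when `actual` is greater than or equal to `minimum`,
// compared component by component.
function satisfiesMinimum(actual: string, minimum: string): boolean {
  const a = parseVersion(actual);
  const b = parseVersion(minimum);
  if (a.major !== b.major) return a.major > b.major;
  if (a.minor !== b.minor) return a.minor > b.minor;
  return a.patch >= b.patch;
}
```

Under these assumptions, `satisfiesMinimum("v1.5.0", "v1.6.0")` is false, so a cluster whose adapter still reports v1.5.0 would be rejected before the new feature is offered.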

Member

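#1361 has been merged. The proto changes in this PR only apply to AI, right? In that case:

  1. In the interface project, open a PR that merges the changes this feature needs into the ai branch.
  2. Once the interface changes are merged into the ai branch, merge master into this branch of the SCOW project, then point the proto project used by the ai project at the ai branch.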

@pkuhpc-review-bot pkuhpc-review-bot bot added Code-ChangeRequested and removed Code-ReviewRequested Code Review Requested labels Jul 12, 2024
@ddadaal ddadaal requested a review from OYX-1 July 15, 2024 13:35
@OYX-1 OYX-1 requested a review from ddadaal July 18, 2024 11:46
@pkuhpc-review-bot pkuhpc-review-bot bot added Code-ReviewRequested Code Review Requested and removed Code-ChangeRequested labels Jul 18, 2024
@ddadaal (Member) left a comment

Review the interface first; please follow the review comments on PKUHPC/scow-scheduler-adapter-interface#25.

@pkuhpc-review-bot pkuhpc-review-bot bot added Code-ChangeRequested and removed Code-ReviewRequested Code Review Requested labels Jul 18, 2024
@OYX-1 OYX-1 requested a review from ddadaal July 19, 2024 05:52
@pkuhpc-review-bot pkuhpc-review-bot bot added Code-ReviewRequested Code Review Requested and removed Code-ChangeRequested labels Jul 19, 2024
@ddadaal (Member) left a comment

@pkuhpc-review-bot pkuhpc-review-bot bot added Code-ChangeRequested and removed Code-ReviewRequested Code Review Requested labels Jul 22, 2024
@OYX-1 OYX-1 requested a review from ddadaal July 25, 2024 01:27
@pkuhpc-review-bot pkuhpc-review-bot bot added Code-ReviewRequested Code Review Requested and removed Code-ChangeRequested labels Jul 25, 2024
@OYX-1 OYX-1 merged commit 753a996 into master Jul 25, 2024
9 checks passed
@OYX-1 OYX-1 deleted the feat-multi-pods-train-ai branch July 25, 2024 01:39
Labels
Code-ReviewRequested Code Review Requested