Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Fleet Executor] Support multi carrier #38535
[Fleet Executor] Support multi carrier #38535
Changes from all commits
8e3cc62
6130c9c
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
不同carrier没有互通的必要吧
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
这是因为如果src interceptor和dst interceptor对应不同的carrier,但这两个carrier在相同的rank下,src interceptor里调用的是自己carrier的send,这里不处理的话会出问题。
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
嗯,不过我觉得不同carrier没有互通的必要,你这个场景肯定是src和dst有关联,那么应该属于一个carrier。不同carrier可以对应不同program,比如训练和预测program,跑一个epoch训练再预测一下,时间和关系上是相互独立的。
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
同时跑多个carrier吗
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cpu carrier与gpu carrier?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
是想同时启动起来的,不能运行可以等在那
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
一个时刻应该只跑一个carrier吧,同时跑应该是通过graph的拓扑依赖,两个拓扑依赖的节点可以同时跑
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
这里应该跑一个carrier,然后不同Program对应不同的carrier,通过外面的接口选择具体是跑哪个carrier
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
这样的话,一个type到另一个type的组合含义是固定的?
如果以后需要interceptor id到thread id,carrier id到stream id的等其他int64到int64的映射,如何复用这个类呢?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
有todo,下一个pr会改
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
你是说用InterceptorId的structure� 来代替int 64那个todo?这样如果同时存在interceptor id到thread id,interceptor id到carrier id的映射怎么处理?还是说未来会更新这个global map类,把TODO放在这个类里?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
没有这个场景,而且要是有的话也可以加ThreadID和CarrierID
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
目前carrier使用场景都是先初始化固定好的,没有读写并发的情况下,可以不用加锁。当然也不排除之后carrier也是动态的
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
这个global map除了用在多carrier,interceptor id 到 carrier id也有一个map,这里会有并发的情况,就统一了一下,不过现在没有动态增删interceptor也可以把这个map的建立放到初始化里。
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
后面再增加一个没有锁的类吧,这样都加锁确实开销很大
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
barrier那个逻辑应该可以直接放到这来了
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
把message bus的instance传下来?把message bus也放global map里好像也行
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
嗯,不过这样也有点麻烦了
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
是的
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
用global map的话是不是carrier也不用存message bus的instance了,直接存id就行。
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
这里获取了carrier,前面Carrier中不用重新获取了
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
这个和前面carrier的那个场景不一样,这是不同rank的消息收发,前面是相同rank不同carrier的消息收发