-
Notifications
You must be signed in to change notification settings - Fork 75
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimize sharded tensor address generators #12223
Merged
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
SeanNijjar
had a problem deploying
to
production
September 4, 2024 17:10
— with
GitHub Actions
Error
SeanNijjar
temporarily deployed
to
production
September 4, 2024 17:10
— with
GitHub Actions
Inactive
SeanNijjar
temporarily deployed
to
production
September 4, 2024 17:10
— with
GitHub Actions
Inactive
SeanNijjar
had a problem deploying
to
production
September 4, 2024 17:10
— with
GitHub Actions
Error
SeanNijjar
temporarily deployed
to
production
September 4, 2024 17:10
— with
GitHub Actions
Inactive
SeanNijjar
temporarily deployed
to
production
September 4, 2024 17:10
— with
GitHub Actions
Inactive
SeanNijjar
had a problem deploying
to
production
September 4, 2024 17:10
— with
GitHub Actions
Error
SeanNijjar
temporarily deployed
to
production
September 5, 2024 01:37
— with
GitHub Actions
Inactive
SeanNijjar
temporarily deployed
to
production
September 5, 2024 01:37
— with
GitHub Actions
Inactive
SeanNijjar
temporarily deployed
to
production
September 5, 2024 01:37
— with
GitHub Actions
Inactive
SeanNijjar
temporarily deployed
to
production
September 5, 2024 01:37
— with
GitHub Actions
Inactive
SeanNijjar
temporarily deployed
to
production
September 5, 2024 01:37
— with
GitHub Actions
Inactive
SeanNijjar
temporarily deployed
to
production
September 5, 2024 01:37
— with
GitHub Actions
Inactive
SeanNijjar
force-pushed
the
snijjar/all-gather-perf
branch
from
September 5, 2024 16:12
95fa233
to
8527423
Compare
cfjchu
approved these changes
Sep 5, 2024
…rgen This change allows the caller to amortize calls to the address generator when multiple contiguous pages are in the row in the bank (shard in case of sharded address generators), which improves SW performance. Additionally, by enabling the caller to sequence larger contiguous chunks of pages, fewer noc commands may be issued - reducing contention on the noc command buffers, and also inducing lower SW overheads.
SeanNijjar
force-pushed
the
snijjar/all-gather-perf
branch
from
September 5, 2024 18:10
8527423
to
0c1f316
Compare
5 tasks
avoraTT
added a commit
that referenced
this pull request
Sep 13, 2024
* #0: Integrating optimization to read a contiguous chunk of pages in reduce scatter read/write wrapped functions. * #0: Clean up args for advance_worker_global_page_interleaved.
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Ticket
Link to Github Issue
Problem description
CCL kernels are recomputing addresses for sharded tensors for every page. This is not strictly required and presents and optimization opportunity.
What's changed
Add an API to return the number of contiguous pages remaining in the row when doing an address lookup for sharded address generators.
Height sharding did not enable the proper calculation (instead returns an always safe 1 contiguous page) due to some hangs it exposed. Future work will be to correctly enable this mode for height sharding (and update the corresponding g\tests).
This has shown to save several thousand clock cycles for some LLama all-gathers (they end up seeing around a 10 % savings)
Checklist