Optimize sharded tensor address generators #12223

Merged 1 commit on Sep 5, 2024

@SeanNijjar SeanNijjar commented Sep 4, 2024

Ticket

Link to Github Issue

Problem description

CCL kernels recompute addresses for sharded tensors on every page. This is not strictly required and presents an optimization opportunity.

What's changed

Add an API to return the number of contiguous pages remaining in the row when doing an address lookup for sharded address generators.

Height sharding does not yet enable the proper calculation (it instead returns an always-safe 1 contiguous page) because doing so exposed some hangs. Future work is to enable this mode correctly for height sharding (and update the corresponding gtests).

This has been shown to save several thousand clock cycles for some Llama all-gathers (they end up seeing around a 10% savings).
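
To make the idea concrete, here is a minimal host-side sketch of a lookup that also reports how many pages remain contiguous in the current shard row. Every name here (`PageLookup`, `WidthShardedAddrGen`, `HeightShardedAddrGen`, `lookup`) is hypothetical and chosen for illustration; this is not the API added by this PR, and the layout model (row-major pages inside each shard) is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical result type: where a page lives, plus how many pages starting
// at it (inclusive) sit back to back in the same shard row.
struct PageLookup {
    std::uint32_t shard_index;       // which shard (bank/core) holds the page
    std::uint32_t byte_offset;       // offset of the page inside that shard
    std::uint32_t contiguous_pages;  // always >= 1
};

// Assumed width-sharded layout: the tensor is split along its columns, and
// each shard stores its slice of every row in row-major order.
struct WidthShardedAddrGen {
    std::uint32_t pages_per_tensor_row;
    std::uint32_t pages_per_shard_row;
    std::uint32_t page_size_bytes;

    PageLookup lookup(std::uint32_t page_id) const {
        assert(pages_per_tensor_row > 0 && pages_per_shard_row > 0);
        const std::uint32_t row = page_id / pages_per_tensor_row;
        const std::uint32_t col = page_id % pages_per_tensor_row;
        const std::uint32_t shard = col / pages_per_shard_row;
        const std::uint32_t col_in_shard = col % pages_per_shard_row;
        // Pages left before the shard row ends; the last shard may be
        // narrower if the tensor row does not divide evenly.
        const std::uint32_t left_in_shard = pages_per_shard_row - col_in_shard;
        const std::uint32_t left_in_tensor = pages_per_tensor_row - col;
        const std::uint32_t offset =
            (row * pages_per_shard_row + col_in_shard) * page_size_bytes;
        return PageLookup{shard, offset, std::min(left_in_shard, left_in_tensor)};
    }
};

// Assumed height-sharded layout: the tensor is split along its rows. This
// mirrors the conservative fallback described above by always reporting a
// single contiguous page.
struct HeightShardedAddrGen {
    std::uint32_t pages_per_tensor_row;
    std::uint32_t rows_per_shard;
    std::uint32_t page_size_bytes;

    PageLookup lookup(std::uint32_t page_id) const {
        assert(pages_per_tensor_row > 0 && rows_per_shard > 0);
        const std::uint32_t row = page_id / pages_per_tensor_row;
        const std::uint32_t col = page_id % pages_per_tensor_row;
        const std::uint32_t row_in_shard = row % rows_per_shard;
        const std::uint32_t offset =
            (row_in_shard * pages_per_tensor_row + col) * page_size_bytes;
        return PageLookup{row / rows_per_shard, offset, 1};  // always-safe value
    }
};
```

With an 8-page-wide tensor split into 4-page-wide shards, looking up page 5 in this model lands in shard 1 with 3 contiguous pages remaining, so a caller could cover pages 5 through 7 with a single lookup.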

Checklist

…rgen

This change allows the caller to amortize calls to the address generator
when multiple contiguous pages are in the row in the bank (shard in case
of sharded address generators), which improves SW performance.

Additionally, by enabling the caller to sequence larger contiguous
chunks of pages, fewer noc commands may be issued - reducing contention
on the noc command buffers, and also inducing lower SW overheads.
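
A rough sketch of the caller side described above, assuming a hypothetical callback `contiguous_pages_from(page_id)` that stands in for the address generator lookup. `Chunk` and `plan_transfers` are illustrative names, not functions from this change. Each emitted chunk corresponds to one address lookup and one NOC command in this model.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical description of one transfer: a run of contiguous pages.
struct Chunk {
    std::uint32_t first_page;
    std::uint32_t num_pages;
};

// Walk num_pages pages starting at first_page, querying the (assumed)
// address generator once per contiguous run instead of once per page.
std::vector<Chunk> plan_transfers(
    std::uint32_t first_page,
    std::uint32_t num_pages,
    const std::function<std::uint32_t(std::uint32_t)>& contiguous_pages_from) {
    std::vector<Chunk> chunks;
    std::uint32_t page = first_page;
    std::uint32_t remaining = num_pages;
    while (remaining > 0) {
        const std::uint32_t contig = contiguous_pages_from(page);
        // The generator must report at least the page that was looked up.
        assert(contig >= 1);
        // Never run past the end of the requested range.
        const std::uint32_t n = std::min(remaining, contig);
        chunks.push_back(Chunk{page, n});
        page += n;
        remaining -= n;
    }
    return chunks;
}
```

With a generator that always reports 1 (the height-sharded fallback), this degenerates to one lookup and one command per page, which matches the behavior before the change.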
@SeanNijjar SeanNijjar merged commit 227c785 into main Sep 5, 2024
5 checks passed
@SeanNijjar SeanNijjar deleted the snijjar/all-gather-perf branch September 5, 2024 18:12
avoraTT added a commit that referenced this pull request Sep 13, 2024
* #0: Integrating optimization to read a contiguous chunk of pages in reduce scatter read/write wrapped functions.

* #0: Clean up args for advance_worker_global_page_interleaved.