Optimize sharded tensor address generators #12223

SeanNijjar · 2024-09-04T16:22:07Z

Ticket

Problem description

CCL kernels recompute addresses for sharded tensors for every page. This is not strictly required and presents an optimization opportunity.

What's changed

Add an API to return the number of contiguous pages remaining in the row when doing an address lookup for sharded address generators.

Height sharding does not yet enable the proper calculation (it instead returns an always-safe value of 1 contiguous page) because of some hangs it exposed. Future work is to correctly enable this mode for height sharding (and update the corresponding gtests).

This has been shown to save several thousand clock cycles for some Llama all-gathers (they end up seeing around a 10% savings).
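
To illustrate the idea, here is a minimal host-side sketch of a lookup that also reports how many contiguous pages remain in the current shard row. It is not the code from this PR: `WidthShardedAddrGen`, `ShardedLookup`, and their fields are hypothetical names, and the model assumes a simple width-sharded layout where each shard holds an equal, fixed number of consecutive pages per tensor row.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical result of one address lookup (names are illustrative only).
struct ShardedLookup {
    uint32_t shard_index;       // which shard holds the page
    uint32_t page_offset;       // page offset inside that shard
    uint32_t contiguous_pages;  // pages left in this shard row, including this one
};

// Hypothetical, simplified width-sharded address generator: every shard holds
// `pages_per_shard_row` consecutive pages of each tensor row.
struct WidthShardedAddrGen {
    uint32_t pages_per_tensor_row;
    uint32_t pages_per_shard_row;

    ShardedLookup lookup(uint32_t global_page_id) const {
        // Assumption: tensor rows divide evenly across shards.
        assert(pages_per_shard_row > 0 && pages_per_tensor_row % pages_per_shard_row == 0);
        uint32_t row = global_page_id / pages_per_tensor_row;
        uint32_t col = global_page_id % pages_per_tensor_row;
        uint32_t shard = col / pages_per_shard_row;
        uint32_t col_in_shard = col % pages_per_shard_row;
        // Pages from col_in_shard to the end of the shard row are contiguous.
        return {shard, row * pages_per_shard_row + col_in_shard, pages_per_shard_row - col_in_shard};
    }
};
```

In this model, a height-sharded generator that always reports `contiguous_pages == 1` stays correct; it just gives up the savings.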

Checklist

Post commit CI passes: https://github.com/tenstorrent/tt-metal/actions/runs/10711904229
t3000 frequent: https://github.com/tenstorrent/tt-metal/actions/runs/10712981334
- same failure as on main
t3000 model perf: https://github.com/tenstorrent/tt-metal/actions/runs/10712985596
- same failure as on main
t3000 nightly: https://github.com/tenstorrent/tt-metal/actions/runs/10712983786
Blackhole Post commit (if applicable)
Model regression CI testing passes (if applicable)
New/Existing tests provide coverage for changes

…rgen This change allows the caller to amortize calls to the address generator when multiple contiguous pages are in the row in the bank (the shard, in the case of sharded address generators), which improves SW performance. Additionally, by enabling the caller to sequence larger contiguous chunks of pages, fewer noc commands may be issued, which reduces contention on the noc command buffers and lowers SW overhead.

* #0: Integrating optimization to read a contiguous chunk of pages in reduce scatter read/write wrapped functions.
* #0: Clean up args for advance_worker_global_page_interleaved.
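
The amortization described in this commit message can be sketched as a caller-side loop that does one lookup and issues one command per contiguous chunk instead of per page. This is a hedged illustration, not the kernel code: `count_commands` and the `contiguous_pages_at` callback are hypothetical stand-ins for the address generator lookup and the noc read/write it would drive.

```cpp
#include <algorithm>
#include <cstdint>
#include <functional>

// Hypothetical caller loop. `contiguous_pages_at(page)` stands in for an
// address lookup that reports how many contiguous pages start at `page`.
// Returns how many lookups (and therefore transfer commands) were needed.
inline uint32_t count_commands(
    uint32_t first_page,
    uint32_t num_pages,
    const std::function<uint32_t(uint32_t)>& contiguous_pages_at) {
    uint32_t commands = 0;
    uint32_t page = first_page;
    uint32_t remaining = num_pages;
    while (remaining > 0) {
        // Treat a reported 0 as 1 so the loop always makes progress, and
        // never transfer more pages than the caller asked for.
        uint32_t chunk = std::min(std::max<uint32_t>(contiguous_pages_at(page), 1u), remaining);
        // A real kernel would issue one read or write of `chunk` pages here.
        ++commands;
        page += chunk;
        remaining -= chunk;
    }
    return commands;
}
```

With 4 pages per shard row, transferring 8 pages from an aligned start takes 2 commands in this model, versus 8 when the lookup always reports a single contiguous page.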

SeanNijjar requested a review from cfjchu as a code owner September 4, 2024 16:22

SeanNijjar force-pushed the snijjar/all-gather-perf branch from 95fa233 to 8527423 Compare September 5, 2024 16:12

cfjchu approved these changes Sep 5, 2024

SeanNijjar force-pushed the snijjar/all-gather-perf branch from 8527423 to 0c1f316 Compare September 5, 2024 18:10

SeanNijjar merged commit 227c785 into main Sep 5, 2024
5 checks passed

SeanNijjar deleted the snijjar/all-gather-perf branch September 5, 2024 18:12

avoraTT mentioned this pull request Sep 10, 2024

Contiguous pages support in Reduce Scatter read/write #12477

Merged

5 tasks
