
#11741: Enable configs in T3000 all gather #12053

Status: Open. Aswinmcw wants to merge 16 commits into main from Aswinmcw/ccl_configs_set1.

Conversation

@Aswinmcw (Contributor) commented Aug 29, 2024

Tracking Issue: #11741
Sweep through all possible combinations in all gather


@SeanNijjar (Contributor) left a comment

We are missing a lot of coverage here. I posted requested changes for missing coverage as well as some reorganization comments.

Contributor:

I would prefer to have both in the same file, since we will eventually remove the separate "line_" version and merge it with base all-gather. You should be able to use the same run function and just pass the all-gather function as an argument: e.g. parametrize over [ttnn.all_gather, ttnn.line_all_gather] and pass the selected function into the run function, which then calls it.
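
A minimal sketch of that shape (suite and parameter names here are illustrative, not the PR's actual code):

```python
import ttnn

# Sketch only: the gather entrypoint becomes just another swept parameter, so
# one run body serves both ops.
parameters = {
    "all_gather": {
        # ... existing shape/layout/link parameters ...
        "all_gather_operation": [ttnn.all_gather, ttnn.line_all_gather],
    },
}


def run(all_gather_operation, input_tensor_mesh, dim, num_links, output_mem_config):
    # Call whichever gather variant this vector selected.
    return all_gather_operation(
        input_tensor_mesh, dim, num_links=num_links, memory_config=output_mem_config
    )
```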

"num_links": [1, 2],
"input_shape": [
# [1, 1, 32, 32], # throwing python segmentation fault
[1, 1, 32, 1024],
@SeanNijjar (Contributor) commented Aug 29, 2024

We need much more comprehensive coverage here. If it's easier as separate parameter dicts, that's fine, but we need to cover a lot more cases.

From the master issue:

  • Shapes: Constrained to tile/page aligned, Shard grids unpadded
  • For inner dims (y, x), increment by 32 in each direction. For outer dims, increment by 1.
    • Outer dims can be swept by basic numbers, then relatively prime numbers up to 128.
      More values can be swept over after all-gather is migrated to make more use of runtime args.
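
One hedged way to generate values following that guidance (a sketch; the value lists are placeholders, not the final coverage):

```python
# Inner (y, x) dims step by the 32-element tile size; outer (W, Z) dims use
# small counts plus a few primes up to 128.
TILE = 32

inner_dim_values = list(range(TILE, 8 * TILE + 1, TILE))    # 32, 64, ..., 256
outer_dim_values = [1, 2, 3, 4, 8, 7, 11, 13, 31, 61, 127]  # basics + primes <= 128

input_shapes = [
    [w, z, y, x]
    for w in outer_dim_values
    for z in outer_dim_values
    for y in inner_dim_values
    for x in inner_dim_values
]
```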

[32, 32],
],
"shard_grid": [
ttnn.CoreRangeSet({ttnn.CoreRange(ttnn.CoreCoord(0, 0), ttnn.CoreCoord(7, 3))}),
Contributor:

We should be sweeping over many more shard grid sizes (and offsets). These are only 2 shard grids that are very "friendly". We need to include the other shard grid sizes too (those with offsets and those with varying widths/heights). At the very least we need non-power-of-2 and non-multiple-of-2 sizes included.

Additionally, at least one prime-valued size per dim should be included.

These are sweeps so we should essentially be able to enumerate all the base cases.
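
A sketch of how a wider shard-grid set could be enumerated (assumes an 8x8 worker grid and illustrative grid choices; adjust to the actual device):

```python
import ttnn

def make_grid(x0, y0, x1, y1):
    # Single-range shard grid from (x0, y0) to (x1, y1), inclusive.
    return ttnn.CoreRangeSet({ttnn.CoreRange(ttnn.CoreCoord(x0, y0), ttnn.CoreCoord(x1, y1))})

shard_grids = [
    make_grid(0, 0, 7, 7),  # full 8x8 grid
    make_grid(0, 0, 6, 4),  # prime width (7) x non-power-of-2 height (5)
    make_grid(0, 0, 2, 6),  # prime extents in both dims (3 x 7)
    make_grid(2, 1, 7, 5),  # non-zero offset, 6 x 5
    make_grid(1, 3, 5, 7),  # offset with odd width and height (5 x 5)
]
```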

"dim": [0, 1, 2, 3],
"tensor_layout": [ttnn.ROW_MAJOR_LAYOUT, ttnn.TILE_LAYOUT],
"input_dtype": [ttnn.bfloat16, ttnn.bfloat8_b],
"orientation": [ttnn.ShardOrientation.ROW_MAJOR],
Contributor:

COL_MAJOR orientation is missing.

[1, 8, 32, 1024],
],
"input_shard_shape": [
[32, 1024],
Contributor:

Need some non-power-of-2 sizes in here. Also need some that are multi-tile high. For every y-x shape here, there should be a corresponding shape with dims flipped.
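
A small sketch of that kind of list (values are illustrative):

```python
# Non-power-of-2 widths, multi-tile heights, and a transposed counterpart for
# every (y, x) entry.
base_shard_shapes = [
    [32, 1024],  # single tile high, wide
    [64, 96],    # two tiles high, non-power-of-2 wide
    [160, 32],   # five tiles high, single tile wide
]

input_shard_shapes = base_shard_shapes + [[x, y] for (y, x) in base_shard_shapes]
```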

):
all_devices = device

numel = input_shape[0] * input_shape[1] * input_shape[2] * input_shape[3] * num_devices
Contributor:

Can't we make this call the baseline run function in test_all_gather? They should ideally share the same underlying code (even if the entrypoints differ).

Contributor Author:

It has perf analysis in between, so I kept this as a new function.

Contributor:

I don't see why this can't be shared.

Worst case, we wrap each part of the test function into separate smaller functions and call those. Otherwise, we keep the perf time-stamping but disable it for the non-sweep variants for the time being.
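
A sketch of that split, reusing the sweep framework's timing helpers that appear later in this diff (the helper name and measure_perf flag are illustrative):

```python
# One shared body; perf time-stamping is optional so the non-sweep pytest
# variant can call the same code with measure_perf=False.
def run_all_gather_impl(gather_function, input_tensor_mesh, dim, num_links, output_mem_config, measure_perf=False):
    start_time = start_measuring_time() if measure_perf else None
    tt_out_tensor = gather_function(
        input_tensor_mesh, dim, num_links=num_links, memory_config=output_mem_config
    )
    e2e_perf = stop_measuring_time(start_time) if measure_perf else None
    return tt_out_tensor, e2e_perf
```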

@Aswinmcw (Contributor Author):

Hi @SeanNijjar, I have combined both tests into one and also improved the test coverage, but now the issue is that I may have increased coverage more than we need: while generating the test vectors, the process eats up all available RAM (~500 GB) and kills itself. Do we need to reduce it somewhere?

@SeanNijjar (Contributor) left a comment

Requested changes.

Just a reminder that we shouldn't be filtering out cases that are valid and eventually expected to pass. The only exception is cases that crash the sweep infra; those should be very specifically filtered out and tagged with a clear comment about the issue.
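
For the cases that do crash the infra, a sketch of how narrowly the filter could be scoped and tagged (the shape and message here are placeholders):

```python
def invalidate_vector(test_vector):
    if test_vector["input_shape"] == [1, 1, 32, 32]:
        # Crashes the sweep infra (Python segfault); specifically excluded
        # until the underlying issue is fixed.
        return True, "Skipped: crashes sweep infra (segfault)"
    # Cases that are merely expected to fail still run and get reported.
    return False, None
```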

ttnn.TensorMemoryLayout.WIDTH_SHARDED,
ttnn.TensorMemoryLayout.HEIGHT_SHARDED,
ttnn.TensorMemoryLayout.BLOCK_SHARDED,
ttnn.TensorMemoryLayout.INTERLEAVED,
Contributor:

I didn't notice this before but interleaved and single bank don't make sense with a lot of these sharded specific parameters.

Since you are seeing excess memory usage, it will cut down a lot if you separate the sharded and non-sharded tensor memory layouts into separate sets.
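
A sketch of the split (suite names are illustrative):

```python
import ttnn

# Shard-only parameters are kept out of the interleaved suite so they are not
# cross-producted against vectors that cannot use them.
parameters = {
    "all_gather_sharded": {
        "tensor_mem_layout": [
            ttnn.TensorMemoryLayout.WIDTH_SHARDED,
            ttnn.TensorMemoryLayout.HEIGHT_SHARDED,
            ttnn.TensorMemoryLayout.BLOCK_SHARDED,
        ],
        # shard_grid, input_shard_shape, orientation, ... only in this suite
    },
    "all_gather_non_sharded": {
        "tensor_mem_layout": [ttnn.TensorMemoryLayout.INTERLEAVED],
        # no shard-specific parameters here
    },
}
```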

ttnn.CoreRangeSet({ttnn.CoreRange(ttnn.CoreCoord(0, 0), ttnn.CoreCoord(11, 7))}),
],
"dim": [0, 1, 2, 3],
"tensor_layout": [ttnn.ROW_MAJOR_LAYOUT, ttnn.TILE_LAYOUT],
@SeanNijjar (Contributor) commented Aug 30, 2024

Similar to my later comment: for these basic-case sweeps, the shapes that are valid and aligned for row-major are not valid for tile.

I think it is probably worthwhile separating these into separate sets too. That way you can test individual rows for row-major (which would be sub-tile sizes we want to cover, but in a separate bucket).
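
A tiny sketch of that bucketing (shapes are illustrative):

```python
# TILE-layout shapes stay 32-aligned; ROW_MAJOR gets an extra bucket that also
# covers sub-tile heights such as single rows.
tile_layout_shapes = [[1, 1, 32, 1024], [1, 1, 64, 2048]]
row_major_shapes = tile_layout_shapes + [[1, 1, 1, 1024], [1, 1, 17, 1024]]
```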

Comment on lines 49 to 50
[35, 35],
[41, 41],
Contributor:

Padded shard shape? This will be in the padded shards bucket I think.

This reminds me... I'll need to add a sweep category for negative tests that are expected to fail.

if test_vector["tensor_layout"] == ttnn.ROW_MAJOR_LAYOUT and test_vector["input_dtype"] == ttnn.bfloat8_b:
return True, f"bfloat8_b/4_b only supports TILE layout"
if test_vector["tensor_layout"] == ttnn.ROW_MAJOR_LAYOUT:
return True, f"ROW_MAJOR_LAYOUT not supported"
Contributor:

This shouldn't be a skip. We intend to support RM so this should run and fail.

if test_vector["tensor_layout"] == ttnn.ROW_MAJOR_LAYOUT:
return True, f"ROW_MAJOR_LAYOUT not supported"
if test_vector["tensor_mem_layout"] == ttnn.TensorMemoryLayout.WIDTH_SHARDED:
return True, f"Output mismatch"
Contributor:

Why is this here?

):
return True, "BLOCK_SHARDED is only supported for dim = 3"
if test_vector["tensor_mem_layout"] == ttnn.TensorMemoryLayout.SINGLE_BANK:
return True, f"SINGLE_BANK is not supported"
Contributor:

Delete


start_time = start_measuring_time()
tt_out_tensor = gather_function(input_tensor_mesh, dim, num_links=num_links, memory_config=output_mem_config)
e2e_perf = stop_measuring_time(start_time)
Contributor:

@jdesousa-TT, will we need a device sync here, or is that handled by the perf reporting infra?

Contributor:

You'll want a sync here; there's no syncing in the infra.
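
A sketch of where the sync could go, assuming ttnn.synchronize_device is the right call for the device handles used here (devices stands in for the per-test device list):

```python
start_time = start_measuring_time()
tt_out_tensor = gather_function(input_tensor_mesh, dim, num_links=num_links, memory_config=output_mem_config)
# Block until the device(s) finish before stopping the timer, since the sweep
# infra does not synchronize for us.
for d in devices:
    ttnn.synchronize_device(d)
e2e_perf = stop_measuring_time(start_time)
```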


if test_vector["tensor_mem_layout"] == ttnn.TensorMemoryLayout.WIDTH_SHARDED:
return True, f"Output mismatch"
if test_vector["num_links"] == 2 and test_vector["num_devices"] == 8:
return True, f"8 devices and 2 links are not supported"
Contributor:

Please add " on t3000 devices"

unchunked_input_shape[test_vector["dim"]] *= test_vector["num_devices"]
if test_vector["num_devices"] < 2:
return True, f"Requires multiple devices to run"
elif test_vector["num_devices"] == 2 and test_vector["num_links"] == 2:
Contributor:

2 devices could run with 2 links on t3000. Is this meant for n300? If so, this should be deleted here and added to the n300 sweep when that is added.

"tensor_mem_layout": [
ttnn.TensorMemoryLayout.WIDTH_SHARDED,
ttnn.TensorMemoryLayout.HEIGHT_SHARDED,
ttnn.TensorMemoryLayout.BLOCK_SHARDED,
Contributor:

We still need the interleaved and single bank memory layouts to be covered. Did you move them somewhere else? I don't see them.

They don't have to be in the same parameters dictionary

Contributor Author:

Yes, I am covering the non-sharded params as another suite in the same module.

Contributor:

Just confirming, but that change is not yet present here, correct?

Contributor Author:

Yes, not yet pushed; I will re-request your review once it is done.

input_shapes.append([batch_size, channels, height, width])

parameters = {
"line_all_gather": {
Contributor:

Should be renamed to just allgather now

@Aswinmcw force-pushed the Aswinmcw/ccl_configs_set1 branch 5 times, most recently from ed68a93 to 63e98a6 on September 6, 2024 at 04:54.
@SeanNijjar (Contributor):

Hey @jdesousa-TT,

@Aswinmcw got an initial set of tests generated for sharded all-gather and mentioned that it took 5 hours just to generate the tests. That seems really excessive for only 500k-1.5m entries. Is this expected? Is there any low-hanging fruit for optimizing this (e.g. using generators or comprehensions in place of raw loops)?

@Aswinmcw force-pushed the Aswinmcw/ccl_configs_set1 branch 5 times, most recently from 78fbc4c to dd93da0 on September 10, 2024 at 11:07.
],
"all_gather_operation": ["all_gather", "line_all_gather"],
},
"all_gather_non_sharded": {
Contributor:

Why are input_shard_shape and shard_grid populated here if this is for interleaved and single bank? Also, why isn't it sweeping the shapes as mentioned in the issue?

for channels in batch_sizes:
for height in shard_Y:
for width in shard_X:
input_shapes.append([batch_size, channels, height * batch_size * channels, width])
Contributor:

Why are batch_size and channels listed in the W and Z dims and also multiplied through on the height dim?

If the input_shape is the logical tensor shape (it looks like it is), then we shouldn't need to multiply those two in on the height dim.

Contributor:

Also, a minor nitpick: can we rename batch and channels to W and Z, since those dimensions have no semantic meaning for all-gather?

ttnn.CoreRangeSet({ttnn.CoreRange(ttnn.CoreCoord(0, 0), ttnn.CoreCoord(7, 0))}),
ttnn.CoreRangeSet({ttnn.CoreRange(ttnn.CoreCoord(0, 0), ttnn.CoreCoord(7, 7))}),
ttnn.CoreRangeSet({ttnn.CoreRange(ttnn.CoreCoord(0, 0), ttnn.CoreCoord(7, 1))}),
ttnn.CoreRangeSet({ttnn.CoreRange(ttnn.CoreCoord(0, 0), ttnn.CoreCoord(10, 7))}),
Contributor:

10 is out of bounds for the shard grid. This case would be expected to fail (and ideally should throw some sort of error). I think we should leave these invalid shard grids for another sweep that will include negative tests.

For now I think we can remove this test as well as the one below with x-dim == 11

row_block_shard_input_shapes = generate_shard_input_shapes(row_batch_size, block_shard_Y, block_shard_X)


for height in range(32, 128, 32): # Increment by 32
Contributor:

Maybe to save on test cases for now we do `itertools.chain(range(32, 128, 32), (256, 512, 1024))`.

That should cut the number of tests by ~4x for width sharded and ~2x for sharded (assuming we use the same values for the height and width dims, which we should).
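
For reference, a sketch of that reduction:

```python
import itertools

# Tile-aligned small sizes plus a few larger power-of-2 sizes, reused for both
# the height and width dims.
dim_values = list(itertools.chain(range(32, 128, 32), (256, 512, 1024)))
# -> [32, 64, 96, 256, 512, 1024]
```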
