#11741: Enable configs in T3000 all gather #12053
base: main
Conversation
Force-pushed from a01cbd9 to 6400e94
We are missing a lot of coverage here. I posted requested changes for missing coverage as well as some reorganization comments.
Would prefer you have both in the same file (as we will eventually remove the separate "line_" version and merge it with base all-gather). You should be able to use the same run function and just pass the all-gather function as an arg — e.g. you could parametrize over [ttnn.all_gather, ttnn.line_all_gather] and pass that as an arg. In the run function you call that.
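A minimal sketch of the suggested single-run-function parametrization (names are illustrative; a stand-in registry replaces `getattr(ttnn, op_name)`, which the real sweep would use, since `ttnn` is not assumed here):

```python
# Stand-in registry; in the real sweep this would be getattr(ttnn, op_name).
OPS = {
    "all_gather": lambda tensor, dim: ("all_gather", tensor, dim),
    "line_all_gather": lambda tensor, dim: ("line_all_gather", tensor, dim),
}

# The sweep lists the op by name alongside the other parameters.
parameters = {"all_gather_operation": list(OPS)}

def run(all_gather_operation, tensor, dim):
    # One code path serves both variants; only the resolved callable differs.
    gather_fn = OPS[all_gather_operation]
    return gather_fn(tensor, dim)
```

This keeps a single run body, so the eventual merge of `line_all_gather` into base all-gather only shrinks the parameter list.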
"num_links": [1, 2],
"input_shape": [
# [1, 1, 32, 32], # throwing python segmentation fault
[1, 1, 32, 1024],
We need much more comprehensive coverage here. If it's easier as separate parameter dicts, that's fine, but we need to cover a lot more cases.
From the master issue:
- Shapes: Constrained to tile/page aligned, Shard grids unpadded
- For inner dims (y, x), increment by 32 in each direction. For outer dims, increment by 1.
- Outer dims can be swept by basic numbers, then relatively prime numbers up to 128.
More values can be swept over after all-gather is migrated to make more use of runtime args.
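One possible reading of those sweep rules as a generator (all values and the interpretation of "relatively prime numbers up to 128" as primes are assumptions for illustration):

```python
from itertools import product

def primes_up_to(n):
    # Simple sieve; supplies the "prime numbers up to 128" outer-dim values.
    sieve = [True] * (n + 1)
    sieve[0:2] = [False, False]
    for p in range(2, int(n ** 0.5) + 1):
        if sieve[p]:
            sieve[p * p :: p] = [False] * len(sieve[p * p :: p])
    return [i for i, is_p in enumerate(sieve) if is_p]

def sweep_shapes(max_inner=96, basic_outer=(1, 2, 4, 8), prime_limit=128):
    # Inner dims (y, x) step by the tile size (32); outer dims take basic
    # values plus primes up to the limit.
    inner = range(32, max_inner + 1, 32)
    outer = sorted(set(basic_outer) | set(primes_up_to(prime_limit)))
    return [[w, z, y, x] for w, z, y, x in product(outer, outer, inner, inner)]
```

Keeping generation in comprehensions like this also avoids materializing intermediates, which matters once the cross product grows.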
[32, 32],
],
"shard_grid": [
ttnn.CoreRangeSet({ttnn.CoreRange(ttnn.CoreCoord(0, 0), ttnn.CoreCoord(7, 3))}),
We should be sweeping over many more shard grid sizes (and offsets). These are only 2 shard grids that are very "friendly". We need to include the other shard grid sizes too (those with offsets and those with varying widths/heights). At the very least we need non-power-of-2 and non-multiple-of-2 sizes included.
Additionally, at least one prime value size per dim should be included.
These are sweeps so we should essentially be able to enumerate all the base cases.
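A sketch of that enumeration, with grids as plain `(offset_x, offset_y, width, height)` tuples (the 8x8 bound and the candidate sizes are assumptions; each tuple would later become a `ttnn.CoreRangeSet` via `ttnn.CoreRange(ttnn.CoreCoord(ox, oy), ttnn.CoreCoord(ox + w - 1, oy + h - 1))`):

```python
def shard_grid_candidates(max_x=8, max_y=8):
    # Non-power-of-2, odd, and prime sizes included per dim.
    widths = [1, 2, 3, 5, 7, 8]
    heights = [1, 2, 3, 5, 7, 8]
    # A few nonzero offsets to cover "unfriendly" grid placements.
    offsets = [(0, 0), (1, 0), (0, 1)]
    out = []
    for ox, oy in offsets:
        for w in widths:
            for h in heights:
                # Only keep grids that fit on the device; out-of-bounds grids
                # belong in a separate negative-test sweep.
                if ox + w <= max_x and oy + h <= max_y:
                    out.append((ox, oy, w, h))
    return out
```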
"dim": [0, 1, 2, 3],
"tensor_layout": [ttnn.ROW_MAJOR_LAYOUT, ttnn.TILE_LAYOUT],
"input_dtype": [ttnn.bfloat16, ttnn.bfloat8_b],
"orientation": [ttnn.ShardOrientation.ROW_MAJOR],
Col major orientation missing
[1, 8, 32, 1024],
],
"input_shard_shape": [
[32, 1024],
Need some non-power-of-2 sizes in here. Also need some that are multi-tile high. For every y-x shape here, there should be a corresponding shape with dims flipped.
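One way to build that shard-shape set (the base values here are illustrative, not prescriptive): derive the flipped twin mechanically so no (y, x) shape is missing its (x, y) counterpart.

```python
def with_flipped(shard_shapes):
    # For every (y, x) shard shape, also emit (x, y), deduplicating squares.
    out = []
    for y, x in shard_shapes:
        for shape in ([y, x], [x, y]):
            if shape not in out:
                out.append(shape)
    return out

base_shard_shapes = [
    [32, 1024],  # single tile high
    [64, 96],    # multi-tile high, non-power-of-2 width
    [96, 160],   # non-power-of-2 in both dims
]
```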
):
all_devices = device

numel = input_shape[0] * input_shape[1] * input_shape[2] * input_shape[3] * num_devices
Can we not make this call the baseline run function in test_all_gather? They should ideally have the same underlying code (even if the entrypoints differ)
It has perf analysis in between, so I kept this as a new function.
I don't see why this can't be shared.
Worst case we wrap each part of the test function into separate smaller functions and call those. Otherwise we just keep the perf time stamping but disable it for the non sweep variants for the time being.
Force-pushed from 6400e94 to 811edb0
Force-pushed from 811edb0 to ba1dedb
Hi @SeanNijjar, I have combined both tests into one and also improved the test coverage, but now the issue is: did I increase it more than we need? It looks like generating the test vectors eats up all available RAM (~500GB) and the process kills itself. Do we need to reduce it somewhere?
Requested changes.
Just a reminder that we shouldn't be filtering out cases that are valid and eventually expected to pass. The only exception would be cases that cause the sweep infra to crash and those should be very specifically filtered out and tagged with a clear comment about the issue.
ttnn.TensorMemoryLayout.WIDTH_SHARDED,
ttnn.TensorMemoryLayout.HEIGHT_SHARDED,
ttnn.TensorMemoryLayout.BLOCK_SHARDED,
ttnn.TensorMemoryLayout.INTERLEAVED,
I didn't notice this before, but interleaved and single bank don't make sense with a lot of these sharded-specific parameters.
Since you are seeing excess memory usage, it'll cut down a lot if you separate the sharded from non-sharded tensor memory layouts into separate sets.
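A sketch of that split (strings stand in for the `ttnn` enums, and the sample values are placeholders): keeping two parameter dicts means the cross product never pairs interleaved/single-bank with shard-only axes.

```python
SHARD_PARAMS = {
    "tensor_mem_layout": ["WIDTH_SHARDED", "HEIGHT_SHARDED", "BLOCK_SHARDED"],
    "input_shard_shape": [[32, 1024], [64, 96]],
    "shard_grid": ["grid_8x4", "grid_8x8"],
}
NON_SHARD_PARAMS = {
    # No shard_shape / shard_grid here, so those axes never multiply in.
    "tensor_mem_layout": ["INTERLEAVED", "SINGLE_BANK"],
}

def suite_size(params):
    # Size of the full cross product for one suite.
    n = 1
    for values in params.values():
        n *= len(values)
    return n
```

The non-sharded suite collapses to just its own axes instead of inheriting every sharded combination, which is where the memory savings come from.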
ttnn.CoreRangeSet({ttnn.CoreRange(ttnn.CoreCoord(0, 0), ttnn.CoreCoord(11, 7))}),
],
"dim": [0, 1, 2, 3],
"tensor_layout": [ttnn.ROW_MAJOR_LAYOUT, ttnn.TILE_LAYOUT],
Similar to my later comment: for these basic-case sweeps, the shapes that are valid and aligned for row major are not valid for tile.
I think it is probably worthwhile separating these into separate sets too. That way you can test individual rows for row major (which would be sub-tile, which we want to cover, but in a separate bucket).
[35, 35],
[41, 41],
Padded shard shape? This will be in the padded shards bucket I think.
This reminds me... I'll need to add a sweep category for negative tests that are expected to fail.
if test_vector["tensor_layout"] == ttnn.ROW_MAJOR_LAYOUT and test_vector["input_dtype"] == ttnn.bfloat8_b:
return True, f"bfloat8_b/4_b only supports TILE layout"
if test_vector["tensor_layout"] == ttnn.ROW_MAJOR_LAYOUT:
return True, f"ROW_MAJOR_LAYOUT not supported"
This shouldn't be a skip. We intend to support RM so this should run and fail.
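A sketch of the invalidation logic this comment asks for (test-vector values as strings for illustration): keep the genuine dtype constraint, but let ROW_MAJOR vectors run so unsupported cases surface as failures rather than silent skips.

```python
def invalidate_vector(test_vector):
    # bfloat8_b genuinely requires TILE layout, so this skip stays.
    if (
        test_vector["tensor_layout"] == "ROW_MAJOR_LAYOUT"
        and test_vector["input_dtype"] == "bfloat8_b"
    ):
        return True, "bfloat8_b/4_b only supports TILE layout"
    # No blanket ROW_MAJOR skip: RM support is intended, so those vectors
    # run and (for now) fail visibly.
    return False, None
```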
if test_vector["tensor_layout"] == ttnn.ROW_MAJOR_LAYOUT:
return True, f"ROW_MAJOR_LAYOUT not supported"
if test_vector["tensor_mem_layout"] == ttnn.TensorMemoryLayout.WIDTH_SHARDED:
return True, f"Output mismatch"
Why is this here?
):
return True, "BLOCK_SHARDED is only supported for dim = 3"
if test_vector["tensor_mem_layout"] == ttnn.TensorMemoryLayout.SINGLE_BANK:
return True, f"SINGLE_BANK is not supported"
Delete
start_time = start_measuring_time()
tt_out_tensor = gather_function(input_tensor_mesh, dim, num_links=num_links, memory_config=output_mem_config)
e2e_perf = stop_measuring_time(start_time)
@jdesousa-TT will we need device sync here or is that handled by the perf reporting infra?
You'll want a sync here; there's no syncing in the infra.
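The timing fix, as a minimal sketch: synchronize before stopping the clock so queued device work is counted. The sync is a parameter here; in the real sweep it would be a device sync call such as `ttnn.synchronize_device(device)` (assumed name, check the ttnn API).

```python
import time

def measure_e2e(op, sync=lambda: None):
    start = time.perf_counter()
    result = op()
    sync()  # block until device work completes before reading the clock
    return result, time.perf_counter() - start
```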
):
all_devices = device

numel = input_shape[0] * input_shape[1] * input_shape[2] * input_shape[3] * num_devices
I don't see why this can't be shared.
Worst case we wrap each part of the test function into separate smaller functions and call those. Otherwise we just keep the perf time stamping but disable it for the non sweep variants for the time being.
if test_vector["tensor_mem_layout"] == ttnn.TensorMemoryLayout.WIDTH_SHARDED:
return True, f"Output mismatch"
if test_vector["num_links"] == 2 and test_vector["num_devices"] == 8:
return True, f"8 devices and 2 links are not supported"
Please add " on t3000 devices"
unchunked_input_shape[test_vector["dim"]] *= test_vector["num_devices"]
if test_vector["num_devices"] < 2:
return True, f"Requires multiple devices to run"
elif test_vector["num_devices"] == 2 and test_vector["num_links"] == 2:
2 devices could run with 2 links on t3000. Is this meant for n300? If so, this should be deleted and added in the n300 sweep when that is added.
Force-pushed from ba1dedb to 83e8178
"tensor_mem_layout": [
ttnn.TensorMemoryLayout.WIDTH_SHARDED,
ttnn.TensorMemoryLayout.HEIGHT_SHARDED,
ttnn.TensorMemoryLayout.BLOCK_SHARDED,
We still need the interleaved and single bank memory layouts to be covered. Did you move them somewhere else? I don't see them.
They don't have to be in the same parameters dictionary
Yes, I am covering non-sharded params as another suite in the same module.
Just confirming, but that change is not yet present here, correct?
Yes, not yet pushed; I will re-request your review once done.
input_shapes.append([batch_size, channels, height, width])

parameters = {
"line_all_gather": {
Should be renamed to just all_gather now.
Force-pushed from 29f6659 to 60d1957
Force-pushed from ed68a93 to 63e98a6
Hey @jdesousa-TT, @Aswinmcw got an initial set of tests generated for sharded all-gather and mentioned that it took 5 hours just to generate the tests. This number seems really excessive for only generating 500k-1.5m entries. Is this expected? Is there any low-hanging fruit for optimizing this (e.g. use of generators or comprehensions in place of raw loops)?
Force-pushed from 78fbc4c to dd93da0
],
"all_gather_operation": ["all_gather", "line_all_gather"],
},
"all_gather_non_sharded": {
Why are `input_shard_shape` and `shard_grid` populated here if this is for interleaved and single bank? Also, why isn't it sweeping the shapes as mentioned in the issue?
for channels in batch_sizes:
for height in shard_Y:
for width in shard_X:
input_shapes.append([batch_size, channels, height * batch_size * channels, width])
Why are `batch_size` and `channels` listed in the W and Z dims and also multiplied through on the height dim? If the input_shape is the logical tensor shape (it looks like it is), then we shouldn't need to multiply those two into the height dim.
Also, a minor nitpick: can we rename `batch` and `channels` to `W` and `Z`, because there is no semantic meaning for those dimensions in all-gather.
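Putting both comments together, a corrected generator might look like this (names follow the rename suggestion; this is a sketch, not the PR's actual helper):

```python
def generate_input_shapes(Ws, Zs, Ys, Xs):
    # The logical tensor shape is simply [W, Z, Y, X]; W and Z are NOT
    # multiplied into the height dim.
    return [[w, z, y, x] for w in Ws for z in Zs for y in Ys for x in Xs]
```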
ttnn.CoreRangeSet({ttnn.CoreRange(ttnn.CoreCoord(0, 0), ttnn.CoreCoord(7, 0))}),
ttnn.CoreRangeSet({ttnn.CoreRange(ttnn.CoreCoord(0, 0), ttnn.CoreCoord(7, 7))}),
ttnn.CoreRangeSet({ttnn.CoreRange(ttnn.CoreCoord(0, 0), ttnn.CoreCoord(7, 1))}),
ttnn.CoreRangeSet({ttnn.CoreRange(ttnn.CoreCoord(0, 0), ttnn.CoreCoord(10, 7))}),
10 is out of bounds for the shard grid. This case would be expected to fail (and ideally should throw some sort of error). I think we should leave these invalid shard grids for another sweep that will include negative tests.
For now I think we can remove this test as well as the one below with x-dim == 11
row_block_shard_input_shapes = generate_shard_input_shapes(row_batch_size, block_shard_Y, block_shard_X)

for height in range(32, 128, 32):  # Increment by 32
Maybe to save on test cases for now we do `itertools.chain(range(32, 128, 32), (256, 512, 1024))`.
That should cut the number of tests by ~4x for width sharded and ~2x for sharded (assuming we use the same values for the height and width dims, which we should).
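The suggested reduced sweep is just dense tile steps up to 128 followed by a few sparse larger values:

```python
from itertools import chain

# Dense tile-aligned steps, then sparse larger sizes, for each inner dim.
dim_values = list(chain(range(32, 128, 32), (256, 512, 1024)))
# dim_values == [32, 64, 96, 256, 512, 1024]
```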
Force-pushed from dd93da0 to 81331a9
Force-pushed from 77aa8e8 to 75990dc
Tracking Issue: #11741
Sweep through all possible combinations in all-gather.
Checklist