
#11741: Enable configs in T3000 all gather #12053

Status: Open. Aswinmcw wants to merge 16 commits into main from Aswinmcw/ccl_configs_set1.

Conversation

@Aswinmcw (Contributor) commented Aug 29, 2024

Tracking Issue: #11741
Sweep through all possible combinations in all gather


@SeanNijjar (Contributor) left a comment

We are missing a lot of coverage here. I posted requested changes for missing coverage as well as some reorganization comments.

Contributor:

I would prefer to have both in the same file, since we will eventually remove the separate "line_" version and merge it with base all-gather. You should be able to use the same run function and just pass the all-gather function as an argument: e.g. parametrize over [ttnn.all_gather, ttnn.line_all_gather] and pass the selected function into the run function, which then calls it.
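
A minimal sketch of that shape (suite and parameter names here are illustrative, not the PR's actual code):

```python
import ttnn

# Sketch only: the gather entrypoint becomes just another swept parameter, so
# one run body serves both ops.
parameters = {
    "all_gather": {
        # ... existing shape/layout/link parameters ...
        "all_gather_operation": [ttnn.all_gather, ttnn.line_all_gather],
    },
}


def run(all_gather_operation, input_tensor_mesh, dim, num_links, output_mem_config):
    # Call whichever gather variant this vector selected.
    return all_gather_operation(
        input_tensor_mesh, dim, num_links=num_links, memory_config=output_mem_config
    )
```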

"num_links": [1, 2],
"input_shape": [
# [1, 1, 32, 32], # throwing python segmentation fault
[1, 1, 32, 1024],
@SeanNijjar (Contributor) commented Aug 29, 2024

We need much more comprehensive coverage here. If it's easier as separate parameter dicts, that's fine, but we need to cover a lot more cases.

From the master issue:

  • Shapes: Constrained to tile/page aligned, Shard grids unpadded
  • For inner dims (y, x), increment by 32 in each direction. For outer dims, increment by 1.
    • Outer dims can be swept by basic numbers, then relatively prime numbers up to 128.
      More values can be swept over after all-gather is migrated to make more use of runtime args.
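
One hedged way to generate values following that guidance (a sketch; the value lists are placeholders, not the final coverage):

```python
# Inner (y, x) dims step by the 32-element tile size; outer (W, Z) dims use
# small counts plus a few primes up to 128.
TILE = 32

inner_dim_values = list(range(TILE, 8 * TILE + 1, TILE))    # 32, 64, ..., 256
outer_dim_values = [1, 2, 3, 4, 8, 7, 11, 13, 31, 61, 127]  # basics + primes <= 128

input_shapes = [
    [w, z, y, x]
    for w in outer_dim_values
    for z in outer_dim_values
    for y in inner_dim_values
    for x in inner_dim_values
]
```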

[32, 32],
],
"shard_grid": [
ttnn.CoreRangeSet({ttnn.CoreRange(ttnn.CoreCoord(0, 0), ttnn.CoreCoord(7, 3))}),
Contributor:

We should be sweeping over many more shard grid sizes (and offsets). These are only 2 shard grids that are very "friendly". We need to include the other shard grid sizes too (those with offsets and those with varying widths/heights). At the very least we need non-power-of-2 and non-multiple-of-2 sizes included.

Additionally, at least one prime-valued size per dim should be included.

These are sweeps so we should essentially be able to enumerate all the base cases.
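
A sketch of how a wider shard-grid set could be enumerated (assumes an 8x8 worker grid and illustrative grid choices; adjust to the actual device):

```python
import ttnn

def make_grid(x0, y0, x1, y1):
    # Single-range shard grid from (x0, y0) to (x1, y1), inclusive.
    return ttnn.CoreRangeSet({ttnn.CoreRange(ttnn.CoreCoord(x0, y0), ttnn.CoreCoord(x1, y1))})

shard_grids = [
    make_grid(0, 0, 7, 7),  # full 8x8 grid
    make_grid(0, 0, 6, 4),  # prime width (7) x non-power-of-2 height (5)
    make_grid(0, 0, 2, 6),  # prime extents in both dims (3 x 7)
    make_grid(2, 1, 7, 5),  # non-zero offset, 6 x 5
    make_grid(1, 3, 5, 7),  # offset with odd width and height (5 x 5)
]
```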

"dim": [0, 1, 2, 3],
"tensor_layout": [ttnn.ROW_MAJOR_LAYOUT, ttnn.TILE_LAYOUT],
"input_dtype": [ttnn.bfloat16, ttnn.bfloat8_b],
"orientation": [ttnn.ShardOrientation.ROW_MAJOR],
Contributor:

COL_MAJOR orientation is missing.

[1, 8, 32, 1024],
],
"input_shard_shape": [
[32, 1024],
Contributor:

Need some non-power-of-2 sizes in here. Also need some that are multi-tile high. For every y-x shape here, there should be a corresponding shape with dims flipped.
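
A small sketch of that kind of list (values are illustrative):

```python
# Non-power-of-2 widths, multi-tile heights, and a transposed counterpart for
# every (y, x) entry.
base_shard_shapes = [
    [32, 1024],  # single tile high, wide
    [64, 96],    # two tiles high, non-power-of-2 wide
    [160, 32],   # five tiles high, single tile wide
]

input_shard_shapes = base_shard_shapes + [[x, y] for (y, x) in base_shard_shapes]
```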

):
all_devices = device

numel = input_shape[0] * input_shape[1] * input_shape[2] * input_shape[3] * num_devices
Contributor:

Can't we make this call the baseline run function in test_all_gather? They should ideally share the same underlying code (even if the entrypoints differ).

Contributor Author:

It has perf analysis in between, so I kept this as a new function.

Contributor:

I don't see why this can't be shared.

Worst case, we wrap each part of the test function into separate smaller functions and call those. Otherwise, we keep the perf time-stamping but disable it for the non-sweep variants for the time being.
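
A sketch of that split, reusing the sweep framework's timing helpers that appear later in this diff (the helper name and measure_perf flag are illustrative):

```python
# One shared body; perf time-stamping is optional so the non-sweep pytest
# variant can call the same code with measure_perf=False.
def run_all_gather_impl(gather_function, input_tensor_mesh, dim, num_links, output_mem_config, measure_perf=False):
    start_time = start_measuring_time() if measure_perf else None
    tt_out_tensor = gather_function(
        input_tensor_mesh, dim, num_links=num_links, memory_config=output_mem_config
    )
    e2e_perf = stop_measuring_time(start_time) if measure_perf else None
    return tt_out_tensor, e2e_perf
```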

@Aswinmcw (Contributor Author):

Hi @SeanNijjar, I have combined both tests into one and also improved the test coverage, but now the issue is that I may have increased coverage more than we need: while generating the test vectors, the process eats up all available RAM (~500 GB) and kills itself. Do we need to reduce it somewhere?

@SeanNijjar (Contributor) left a comment

Requested changes.

Just a reminder that we shouldn't be filtering out cases that are valid and eventually expected to pass. The only exception is cases that crash the sweep infra; those should be very specifically filtered out and tagged with a clear comment about the issue.
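
For the cases that do crash the infra, a sketch of how narrowly the filter could be scoped and tagged (the shape and message here are placeholders):

```python
def invalidate_vector(test_vector):
    if test_vector["input_shape"] == [1, 1, 32, 32]:
        # Crashes the sweep infra (Python segfault); specifically excluded
        # until the underlying issue is fixed.
        return True, "Skipped: crashes sweep infra (segfault)"
    # Cases that are merely expected to fail still run and get reported.
    return False, None
```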

ttnn.TensorMemoryLayout.WIDTH_SHARDED,
ttnn.TensorMemoryLayout.HEIGHT_SHARDED,
ttnn.TensorMemoryLayout.BLOCK_SHARDED,
ttnn.TensorMemoryLayout.INTERLEAVED,
Contributor:

I didn't notice this before but interleaved and single bank don't make sense with a lot of these sharded specific parameters.

Since you are seeing excess memory usage, it will cut down a lot if you separate the sharded and non-sharded tensor memory layouts into separate sets.
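
A sketch of the split (suite names are illustrative):

```python
import ttnn

# Shard-only parameters are kept out of the interleaved suite so they are not
# cross-producted against vectors that cannot use them.
parameters = {
    "all_gather_sharded": {
        "tensor_mem_layout": [
            ttnn.TensorMemoryLayout.WIDTH_SHARDED,
            ttnn.TensorMemoryLayout.HEIGHT_SHARDED,
            ttnn.TensorMemoryLayout.BLOCK_SHARDED,
        ],
        # shard_grid, input_shard_shape, orientation, ... only in this suite
    },
    "all_gather_non_sharded": {
        "tensor_mem_layout": [ttnn.TensorMemoryLayout.INTERLEAVED],
        # no shard-specific parameters here
    },
}
```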

ttnn.CoreRangeSet({ttnn.CoreRange(ttnn.CoreCoord(0, 0), ttnn.CoreCoord(11, 7))}),
],
"dim": [0, 1, 2, 3],
"tensor_layout": [ttnn.ROW_MAJOR_LAYOUT, ttnn.TILE_LAYOUT],
@SeanNijjar (Contributor) commented Aug 30, 2024

Similar to my later comment: for these basic-case sweeps, the shapes that are valid and aligned for row-major are not valid for tile.

I think it is probably worthwhile separating these into separate sets too. That way you can test individual rows for row-major (which would be sub-tile sizes we want to cover, but in a separate bucket).
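
A tiny sketch of that bucketing (shapes are illustrative):

```python
# TILE-layout shapes stay 32-aligned; ROW_MAJOR gets an extra bucket that also
# covers sub-tile heights such as single rows.
tile_layout_shapes = [[1, 1, 32, 1024], [1, 1, 64, 2048]]
row_major_shapes = tile_layout_shapes + [[1, 1, 1, 1024], [1, 1, 17, 1024]]
```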

Comment on lines 49 to 50
[35, 35],
[41, 41],
Contributor:

Padded shard shape? This will be in the padded shards bucket I think.

This reminds me... I'll need to add a sweep category for negative tests that are expected to fail.

if test_vector["tensor_layout"] == ttnn.ROW_MAJOR_LAYOUT and test_vector["input_dtype"] == ttnn.bfloat8_b:
return True, f"bfloat8_b/4_b only supports TILE layout"
if test_vector["tensor_layout"] == ttnn.ROW_MAJOR_LAYOUT:
return True, f"ROW_MAJOR_LAYOUT not supported"
Contributor:

This shouldn't be a skip. We intend to support RM so this should run and fail.

if test_vector["tensor_layout"] == ttnn.ROW_MAJOR_LAYOUT:
return True, f"ROW_MAJOR_LAYOUT not supported"
if test_vector["tensor_mem_layout"] == ttnn.TensorMemoryLayout.WIDTH_SHARDED:
return True, f"Output mismatch"
Contributor:

Why is this here?

):
return True, "BLOCK_SHARDED is only supported for dim = 3"
if test_vector["tensor_mem_layout"] == ttnn.TensorMemoryLayout.SINGLE_BANK:
return True, f"SINGLE_BANK is not supported"
Contributor:

Delete


start_time = start_measuring_time()
tt_out_tensor = gather_function(input_tensor_mesh, dim, num_links=num_links, memory_config=output_mem_config)
e2e_perf = stop_measuring_time(start_time)
Contributor:

@jdesousa-TT, will we need a device sync here, or is that handled by the perf reporting infra?

Contributor:

You'll want a sync here; there's no syncing in the infra.
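
A sketch of where the sync could go, assuming ttnn.synchronize_device is the right call for the device handles used here (devices stands in for the per-test device list):

```python
start_time = start_measuring_time()
tt_out_tensor = gather_function(input_tensor_mesh, dim, num_links=num_links, memory_config=output_mem_config)
# Block until the device(s) finish before stopping the timer, since the sweep
# infra does not synchronize for us.
for d in devices:
    ttnn.synchronize_device(d)
e2e_perf = stop_measuring_time(start_time)
```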


if test_vector["tensor_mem_layout"] == ttnn.TensorMemoryLayout.WIDTH_SHARDED:
return True, f"Output mismatch"
if test_vector["num_links"] == 2 and test_vector["num_devices"] == 8:
return True, f"8 devices and 2 links are not supported"
Contributor:

Please add " on t3000 devices"

unchunked_input_shape[test_vector["dim"]] *= test_vector["num_devices"]
if test_vector["num_devices"] < 2:
return True, f"Requires multiple devices to run"
elif test_vector["num_devices"] == 2 and test_vector["num_links"] == 2:
Contributor:

2 devices could run with 2 links on t3000. Is this meant for n300? If so, this should be deleted here and added to the n300 sweep when that is added.

"tensor_mem_layout": [
ttnn.TensorMemoryLayout.WIDTH_SHARDED,
ttnn.TensorMemoryLayout.HEIGHT_SHARDED,
ttnn.TensorMemoryLayout.BLOCK_SHARDED,
Contributor:

We still need the interleaved and single bank memory layouts to be covered. Did you move them somewhere else? I don't see them.

They don't have to be in the same parameters dictionary

Contributor Author:

Yes, I am covering the non-sharded params as another suite in the same module.

Contributor:

Just confirming, but that change is not yet present here, correct?

Contributor Author:

Yes, not yet pushed; I will re-request your review once it is done.

input_shapes.append([batch_size, channels, height, width])

parameters = {
"line_all_gather": {
Contributor:

Should be renamed to just allgather now

@Aswinmcw force-pushed the Aswinmcw/ccl_configs_set1 branch 5 times, most recently from ed68a93 to 63e98a6 on September 6, 2024 at 04:54.
@SeanNijjar (Contributor):

Hey @jdesousa-TT,

@Aswinmcw got an initial set of tests generated for sharded all-gather and mentioned that it took 5 hours just to generate the tests. That seems really excessive for only 500k-1.5m entries. Is this expected? Is there any low-hanging fruit for optimizing this (e.g. using generators or comprehensions in place of raw loops)?

@Aswinmcw force-pushed the Aswinmcw/ccl_configs_set1 branch 5 times, most recently from 78fbc4c to dd93da0 on September 10, 2024 at 11:07.
],
"all_gather_operation": ["all_gather", "line_all_gather"],
},
"all_gather_non_sharded": {
Contributor:

Why are input_shard_shape and shard_grid populated here if this is for interleaved and single bank? Also, why isn't it sweeping the shapes as mentioned in the issue?

for channels in batch_sizes:
for height in shard_Y:
for width in shard_X:
input_shapes.append([batch_size, channels, height * batch_size * channels, width])
Contributor:

Why are batch_size and channels listed in the W and Z dims and also multiplied through on the height dim?

If the input_shape is the logical tensor shape (it looks like it is), then we shouldn't need to multiply those two in on the height dim.

Contributor:

Also, a minor nitpick: can we rename batch and channels to W and Z, since those dimensions have no semantic meaning for all-gather?

ttnn.CoreRangeSet({ttnn.CoreRange(ttnn.CoreCoord(0, 0), ttnn.CoreCoord(7, 0))}),
ttnn.CoreRangeSet({ttnn.CoreRange(ttnn.CoreCoord(0, 0), ttnn.CoreCoord(7, 7))}),
ttnn.CoreRangeSet({ttnn.CoreRange(ttnn.CoreCoord(0, 0), ttnn.CoreCoord(7, 1))}),
ttnn.CoreRangeSet({ttnn.CoreRange(ttnn.CoreCoord(0, 0), ttnn.CoreCoord(10, 7))}),
Contributor:

10 is out of bounds for the shard grid. This case would be expected to fail (and ideally should throw some sort of error). I think we should leave these invalid shard grids for another sweep that will include negative tests.

For now I think we can remove this test as well as the one below with x-dim == 11

row_block_shard_input_shapes = generate_shard_input_shapes(row_batch_size, block_shard_Y, block_shard_X)


for height in range(32, 128, 32): # Increment by 32
Contributor:

Maybe to save on test cases for now we do `itertools.chain(range(32, 128, 32), (256, 512, 1024))`.

That should cut the number of tests by ~4x for width sharded and ~2x for sharded (assuming we use the same values for the height and width dims, which we should).
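
For reference, a sketch of that reduction:

```python
import itertools

# Tile-aligned small sizes plus a few larger power-of-2 sizes, reused for both
# the height and width dims.
dim_values = list(itertools.chain(range(32, 128, 32), (256, 512, 1024)))
# -> [32, 64, 96, 256, 512, 1024]
```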
