2D matmul with specific shapes gives bad PCC on a TG #10936
Comments
I think this might be a di/dt issue. @pavlepopovic mentioned that they are seeing similar behaviour for di/dt on galaxies.
@johanna-rock-tt is the grid size 8x4 (32 cores)? Looping in @TT-BrianLiu and @tt-aho: is this related to the ND PCC in #10673?
Also, is this a Galaxy-specific problem? Does the same shape/config pass on N300?
I ran Johanna's test on 4x8 and 8x8 grids on TG with the default program config and reproduced the ND PCC. The 4x8 matmul_2d with subblock 1x1 passes 50 iterations of testing. This test uses matmul_2d while #10673 uses a DRAM-sharded matmul; they may be related if Galaxy is fragile with matmul di/dt. Note that when I first started testing, the 4x8 matmul_2d with the default program config failed very consistently, more than 50% of the time. After the machine warmed up, the 4x8 failures became much less frequent, about 5% of the time. Is there any systems explanation for this?
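For reference, a minimal sketch of forcing 1x1 output subblocks through the 2D mcast program config, which is the kind of override described above. This is an illustrative single-device example with a smaller, self-consistent shape rather than the exact Galaxy test; all blocking values are assumptions and would need to match the real per-device shard shapes.

```python
import torch
import ttnn

# Illustrative only: a small single-device shape whose tiling is consistent
# with an 8x4 core grid, used to show the subblock-1x1 override.
device = ttnn.open_device(device_id=0)

M, K, N = 512, 2048, 2048  # 16 x 64 x 64 tiles (32x32 tiles)
a = ttnn.from_torch(torch.randn(M, K), dtype=ttnn.bfloat16,
                    layout=ttnn.TILE_LAYOUT, device=device)
b = ttnn.from_torch(torch.randn(K, N), dtype=ttnn.bfloat16,
                    layout=ttnn.TILE_LAYOUT, device=device)

program_config = ttnn.MatmulMultiCoreReuseMultiCastProgramConfig(
    compute_with_storage_grid_size=(8, 4),  # x=8 columns, y=4 rows
    in0_block_w=4,        # K tiles per inner block
    out_subblock_h=1,     # force 1x1 output subblocks, as in the passing variant
    out_subblock_w=1,
    per_core_M=4,         # 16 M tiles / 4 rows
    per_core_N=8,         # 64 N tiles / 8 columns
    transpose_mcast=False,
    fused_activation=None,
)

out = ttnn.matmul(a, b, program_config=program_config)
ttnn.close_device(device)
```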
Is the matmul running on 32 devices when it fails? It would be great to run the same test on T3K to see whether it's a Galaxy-specific issue.
Yes, the matmul is running on 32 devices when it fails. I'll check on T3K.
It would also be great to see what happens when it runs on 1, 2, 4, and 8 devices on TG.
I ran the failing 4x8 core grid with the default program config on T3K (8 chips) and got deterministic, good PCC.
I've managed to run this MM with exactly the same config on T3K as on Galaxy this morning; non-determinism occurs every iteration:
I don't know about the DM kernels, but the compute kernel seems fishy - it's using … Attaching the entire Tracy run for Galaxy:
Specify a 1D/2D matmul config to fix.
@bbradelTT to identify why the bad matmul is used.
@pavlepopovic branch jrock/llama3-405b doesn't exist. Is the code in main now? If not, what steps did you use to reproduce the issue?
A couple of things:
Talked to @pavlepopovic. Also, we've started seeing similar behaviour in sweeps recently. See #9059 (comment).
My current theory is that a fallthrough path for automatically choosing parameters, added as a last resort, is distributing tensors while ignoring their shard shapes (e.g. if the shard shape is 768 and m is 768, then 768/32 = 24 cores will be used instead of 1), and the underlying kernels are not expecting such settings. I'm continuing to investigate. @pavlepopovic could you please …
When I tried to specify core_grid in tests/ttnn/multichip_unit_tests/test_multidevice_TG.py::test_galaxy_matmul_2d_fracture on Galaxy, I got an exception.
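For context, a hedged sketch of the general form of a core_grid override on ttnn.matmul; the shapes and the 4x8 grid below are illustrative, and this is not necessarily the exact call that raised the exception.

```python
import torch
import ttnn

device = ttnn.open_device(device_id=0)
a = ttnn.from_torch(torch.randn(512, 2048), dtype=ttnn.bfloat16,
                    layout=ttnn.TILE_LAYOUT, device=device)
b = ttnn.from_torch(torch.randn(2048, 2048), dtype=ttnn.bfloat16,
                    layout=ttnn.TILE_LAYOUT, device=device)

# Ask matmul to use an explicit 4x8 core grid instead of the auto-chosen
# parameters; whether a given grid validates depends on the tensor shapes.
out = ttnn.matmul(a, b, core_grid=ttnn.CoreGrid(y=4, x=8))
ttnn.close_device(device)
```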
@johanna-rock-tt
E.g.
Leads to
I still need to investigate why the default is resulting in incorrect behaviour.
I also tried the following, which led to nothing happening (probably a hang):
In terms of what the code chooses by default, I got the following debug output:
for all 32 devices. Next steps:
The following reproduces on WH the hang I saw on Galaxy:
Talked with @pavlepopovic. Based on this info, the code would choose MatmulMultiCoreNonOptimized. in0 is pretty narrow and fits within 16 tiles, so it looks like a 1D matmul.
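To make the tile arithmetic concrete, a small sketch of the reasoning (the interpretation of the 16-tile threshold is taken from the comment above, not from the heuristic code itself):

```python
# in0 is M x K = 512 x 16384; with 32x32 tiles its M dimension is only
# 512 / 32 = 16 tiles tall, so in0 looks narrow and the parameter-selection
# fallthrough treats the op like a 1D matmul (MatmulMultiCoreNonOptimized).
TILE = 32
M, K, N = 512, 16 * 1024, 52 * 1024

m_tiles, k_tiles, n_tiles = M // TILE, K // TILE, N // TILE
print(m_tiles, k_tiles, n_tiles)  # 16 512 1664
```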
@johanna-rock-tt I replaced bfloat4_b with bfloat16 and the test passed:
Output:
Also, I tried bfloat4_b first and then bfloat16 with some local tests, and the results are quite different.
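A minimal sketch of the kind of local bfloat4_b vs bfloat16 comparison described here, assuming a single device and a torch reference; the shapes and the PCC helper are illustrative, not the exact local tests.

```python
import numpy as np
import torch
import ttnn


def pcc(expected: torch.Tensor, actual: torch.Tensor) -> float:
    # Pearson correlation between the flattened outputs.
    return float(np.corrcoef(expected.flatten().float().numpy(),
                             actual.flatten().float().numpy())[0, 1])


device = ttnn.open_device(device_id=0)
torch.manual_seed(1234)

x = torch.randn(512, 2048)
w = torch.randn(2048, 2048)
reference = x @ w

# Same activations, only the weight dtype changes between runs.
for w_dtype in (ttnn.bfloat16, ttnn.bfloat4_b):
    a = ttnn.from_torch(x, dtype=ttnn.bfloat16, layout=ttnn.TILE_LAYOUT, device=device)
    b = ttnn.from_torch(w, dtype=w_dtype, layout=ttnn.TILE_LAYOUT, device=device)
    out = ttnn.to_torch(ttnn.matmul(a, b))
    print(w_dtype, pcc(reference, out))

ttnn.close_device(device)
```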
One of the other tests that passes has a smaller K and an N that is half of the failing test:
When N is
For the 1D matmul, out_subblock_w needs to be 1.
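For reference, a hedged sketch of a 1D program config with out_subblock_w fixed to 1; every other value is a placeholder and would need to match the actual per-device tiling for the config to validate.

```python
import ttnn

# Placeholder values only; per_core_M / per_core_N / in0_block_w must match
# the real per-device tensor tiling.
program_config = ttnn.MatmulMultiCoreReuseMultiCast1DProgramConfig(
    compute_with_storage_grid_size=(8, 4),
    in0_block_w=4,
    out_subblock_h=1,
    out_subblock_w=1,   # the constraint called out above
    per_core_M=1,
    per_core_N=52,
    fuse_batch=True,
    fused_activation=None,
    mcast_in0=True,
)
```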
validate did check for what I expected, but there was a bug in the code where validation was disabled when the program cache is off. This bug was fixed yesterday. I verified with the following that this is a bfloat4_b precision issue.
This uses MatmulMultiCoreNonOptimizedReuseProgramConfig and both c and d are all 1s. I also verified that MatmulMultiCoreProgramConfig is okay as well with the following:
@johanna-rock-tt you'll need to use the program config or avoid this specific shape + precision combination.
Thanks @bbradelTT! However, I'm still puzzled by bfp4 with this specific shape producing ~0.0 PCC while similar shapes in bfp4 (e.g. M = 256, K = 16 * 1024, N = 52 * 1024) produce 0.99 PCC.
@johanna-rock-tt The tensor has some common info for bfloat4_b and bfloat8_b; based on what I saw, probably at least for the mantissa. E.g. I would get sets of values such as 0.50000, 0.25000, ..., 0.12500, 0.25000 or 512.00000, 384.00000, ..., 128.00000, 896.00000, with nothing between set boundaries. If that common info is set based on inputs or some criteria that can't handle the right sets of ranges, and then there is enough addition, the values probably drift off. It would be good to have a reference for bfloat4_b; then we could do more than speculate.
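To illustrate the clustering described above, here is a toy simulation of a block-float format with a shared power-of-two scale and a tiny mantissa. This is only an assumption-laden model of "shared info + few mantissa bits", not the actual bfp4 packing used by the hardware.

```python
import numpy as np

def toy_block_float(x: np.ndarray, block: int = 16, mantissa_bits: int = 3) -> np.ndarray:
    """Quantize each block of values to a shared power-of-two scale and a small
    mantissa. Toy model only; not the real bfp4 encoding."""
    out = np.empty_like(x, dtype=np.float64)
    for i in range(0, x.size, block):
        chunk = x.flat[i:i + block]
        scale = 2.0 ** np.floor(np.log2(np.max(np.abs(chunk)) + 1e-30))
        steps = 2 ** mantissa_bits
        out.flat[i:i + block] = np.round(chunk / scale * steps) / steps * scale
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=64)
q = toy_block_float(x)
# The quantized values land on a coarse grid (multiples of scale / 2**mantissa_bits),
# similar to the 0.12500 / 0.25000 / 0.50000 sets observed above.
print(sorted(set(np.abs(q).round(5)))[:8])
```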
I see, thanks for the explanation!
Bad PCC could be explained by overflows, but do we have an explanation for the ND PCC that was initially reported?
@uaydonat I don't see a fixed seed in the test. Wouldn't we always get slightly different PCCs each time in that case?
Yes, we would. @johanna-rock-tt please check whether the test mentioned in your top comment sets the seed. If not, please check whether the ND goes away with a seed.
Just checked. We didn't have a manual seed set, but setting the seed (torch.manual_seed(1234)) still results in ND PCC for me. PCC value:
Max ATOL Delta: nan, Max RTOL Delta: nan, PCC: -0.00031071428310952017, PCC check failed
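For reference, the kind of determinism check being discussed, as a sketch: fix the seed so the inputs are identical, then run the op repeatedly and compare against the first result. The matmul below is a plain torch stand-in for the device op under test.

```python
import torch

torch.manual_seed(1234)
x = torch.randn(512, 16 * 1024)
w = torch.randn(16 * 1024, 1024)  # smaller N than the real test, for speed

def run_once():
    # Stand-in for the Galaxy matmul under test.
    return x @ w

reference = run_once()
for i in range(5):
    # With fixed inputs, any difference between iterations points at
    # non-determinism in the op itself rather than in the data.
    max_diff = (run_once() - reference).abs().max().item()
    print(f"iteration {i}: max abs diff vs first run = {max_diff}")
```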
@johanna-rock-tt for random PCCs
I'm wondering if the remaining ND PCC behaviour is related to #10673.
The other shapes of the same test, as well as the problematic shape with the program config that gives good PCC, have deterministic PCC (tested with 3 runs each).
Interesting. The kernels used by MatmulMultiCoreNonOptimized may not be configured properly to handle bfloat4_b when rounding is involved. I'll have to look into that.
I'm trying with a sample test:
where the in0 dtype >= in1 dtype:
in0 dtype < in1 dtype:
Compute kernel with the issue:
Compute kernel used by 2D mcast that does not have the issue:
I'll try to figure out what difference could be causing the different behaviour.
I talked to @tt-aho. I verified that after this change the test passes, and that using torch.manual_seed(1234) and running a couple of times produces the same PCC:
That's great! Thanks for investigating and fixing!
@johanna-rock-tt the fix is merged. Please verify that everything works as expected and then you can uncomment the relevant test parameter combination.
Describe the bug
A 2D fractured matmul with specific shapes gives bad PCC when run on a Galaxy.
The shape is the llama3-405B FF1 for prefill with sequence length = 512; both activation and weight are in DRAM interleaved format. The matmul works with lower sequence lengths (= M), e.g. 128 or 256.
M = 512, K = 16 * 1024, N = 52 * 1024
The PCC is (so far) always around zero but still non-deterministic, e.g. in three runs:
To Reproduce
Steps to reproduce the behavior:
1. Check out branch jrock/llama3-405b
2. Run pytest tests/ttnn/multichip_unit_tests/test_multidevice_TG.py::test_galaxy_matmul_2d_fracture[silicon_arch_name=wormhole_b0-silicon_arch_wormhole_b0=True-Llama3-405B_prefill_seq512_FF1-4x8_grid]
Expected behavior
The test passes with the 0.99 PCC target.