Assertion failure: pipeline_state_ != nullptr on metal #6221

PGZXB · 2022-10-02T04:18:04Z

Describe the bug
Running the following code fails:

import taichi as ti

ti.init(arch=ti.metal)

@ti.kernel
def kernel4(n: ti.i32, m: ti.i32) -> ti.i32:
    res = 0
    for i in range(n):
        for j in range(m):
            res += j
    return res

kernel4(2, 2)

Output:

[Taichi] version 1.1.4, llvm 10.0.0, commit 9d65cbe7, osx, python 3.10.6
[I 10/03/22 09:41:07.419 328128] [misc.py:init@439] Following TI_ARCH setting up for arch=metal
[Taichi] Starting on arch=metal
2022-10-03 09:41:07.579 Python[27767:328128] Error Domain=AGXMetal13_3 Code=3 "Compiler encountered an internal error" UserInfo={NSLocalizedDescription=Compiler encountered an internal error}
[E 10/03/22 09:41:07.579 328128] [kernel_manager.cpp:CompiledMtlKernelBase@80] Assertion failure: pipeline_state_ != nullptr


Traceback (most recent call last):
  File "/Users/pgzxb/taichi_test/t.py", line 13, in <module>
    kernel4(2, 2)
  File "/Users/pgzxb/taichi/python/taichi/lang/kernel_impl.py", line 915, in wrapped
    return primal(*args, **kwargs)
  File "/Users/pgzxb/taichi/python/taichi/lang/kernel_impl.py", line 842, in __call__
    return self.runtime.compiled_functions[key](*args)
  File "/Users/pgzxb/taichi/python/taichi/lang/kernel_impl.py", line 769, in func__
    raise e from None
  File "/Users/pgzxb/taichi/python/taichi/lang/kernel_impl.py", line 766, in func__
    t_kernel(launch_ctx)
RuntimeError: [kernel_manager.cpp:CompiledMtlKernelBase@80] Assertion failure: pipeline_state_ != nullptr

ailzhang · 2022-10-18T02:40:21Z

Reopening since I'm seeing exactly the same issue on my M1 MacBook. What makes it more confusing is that if I change += to *= it works perfectly fine. I suspect that this one is related to the flaky tests on newer macOS systems.

ailzhang · 2022-10-21T07:01:49Z

Adding this issue to the v1.2.0 milestone since it affects many users on macOS 12. It is indeed a bit hard to fix, though, so I cannot guarantee a proper fix before then.

ailzhang · 2022-10-27T10:51:25Z

FYI @jim19930609 and @k-ye helped investigate this issue, and here are a few observations with another (maybe simpler) repro:

import taichi as ti


@ti.kernel
def kernel4(l: ti.i32) -> ti.i32:
    res = 1
    for i in range(2):
        res1 = 0
        for j in range(l):
            res1 += j
        res += res1
    return res


ti.init(arch=ti.metal,
        make_thread_local=False,
        print_kernel_llvm_ir=True,
        log_level=ti.TRACE)
res = kernel4(3)
print(res)

This error occurs on all M1 MacBooks with macOS 12, but the same code works fine on all macOS 11 machines, with exactly the same Metal source code.
It errors even when make_thread_local is off.
If you change res += j to res += 1, or make the second for loop a constant-range for, it passes.
Manually compiling the generated Metal source on my Mac passes.
A more detailed crash log can be found in the Console app on Mac.

A few things worth trying:

Create a pipeline with the same Metal source in metal-cpp and see if it crashes. If it does, we can continue bisecting there.
Try a few different build options in https://github.com/taichi-dev/taichi/blob/master/taichi/runtime/metal/kernel_manager.cpp#L299

ailzhang · 2022-11-20T05:23:48Z

@bobcao3 do you happen to have any insights on this issue? It's blocking us from upgrading our M1 Mac machines to macOS 12, and it's becoming a bottleneck in CI. Thanks a lot!

turbo0628 · 2022-11-21T11:54:46Z

Some follow-up investigation:

This Metal code snippet fails to compile on macOS 12.x:

  int32_t tmp7(0);
  int32_t tmp10 = *((device int32_t*) (ctx_addr + 4));
  for (int32_t tmp12_ = 0; tmp12_ < tmp10; tmp12_ = tmp12_ + 1) {
    tmp7 = tmp7 + tmp12_;
  }

The crash point seems to be inside an LLVM pass. I guess it crashes when attempting to automatically apply an atomic add operation on top of the loop index tmp12_.

k-ye · 2022-11-21T12:53:52Z

Quoting @turbo0628: "I guess it crashes when attempting to automatically apply atomic add operation a top of the loop index tmp12_."

Out of curiosity, where is the "atomic add" in the above code snippet?

turbo0628 · 2022-11-21T13:49:10Z

It's my guess that the Metal compiler is trying to apply an automatic atomic add optimization. It might be crashing in some other pass, though.

turbo0628 · 2022-11-21T14:22:16Z

I've worked out a C++ repro (see the code repository). This is likely a macOS compiler bug.

The following conditions are required to reproduce the bug:

The data type is i32; it does not reproduce with f32.
Some value is incremented by an expression of the inner loop index, and the inner loop has a dynamic range loaded from a buffer. I haven't tested whether it works with a constant loop range passed in.
res = res + j * 3 triggers the bug, but res = res * 3 + j doesn't.
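
As a reference when checking a fix or workaround, the two inner-loop variants above can be modelled on the host in plain Python. This is only a sketch of the expected arithmetic: the function names are made up for illustration, i32 overflow is ignored, and it says nothing about what the Metal compiler does internally.

```python
def add_variant(n: int) -> int:
    """Model of `res = res + j * 3` (the variant reported to trigger the bug)."""
    res = 0
    for j in range(n):
        res = res + j * 3
    return res


def mul_variant(n: int) -> int:
    """Model of `res = res * 3 + j` (the variant reported not to trigger it)."""
    res = 0
    for j in range(n):
        res = res * 3 + j
    return res


if __name__ == "__main__":
    # Expected values a working GPU kernel should agree with for small n.
    for n in range(6):
        print(n, add_variant(n), mul_variant(n))
```

The first variant is a plain arithmetic-series sum, equal to 3 * n * (n - 1) // 2 for n >= 0, while the second is a multiply-and-add recurrence, so the two shapes may well be handled by different compiler transformations.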

feisuzhu · 2022-12-30T15:55:33Z

After some tinkering I found that if we annotate tmp7 to be volatile it would not trigger the bug on Ventura (13.1).

  volatile int32_t tmp7(0);
  int32_t tmp10 = *((device int32_t*) (ctx_addr + 4));
  for (int32_t tmp12_ = 0; tmp12_ < tmp10; tmp12_ = tmp12_ + 1) {
    tmp7 = tmp7 + tmp12_;
  }

Didn't test its performance though...

PENGUINLIONG · 2023-01-12T01:29:42Z

For the second repro, I found that this numerically equivalent implementation works:

import taichi as ti

@ti.kernel
def kernel4(l: ti.i32) -> ti.i32:
    res = 1
    for i in range(2):
        res += (l - 1) * l // 2
    return res

ti.init(arch=ti.metal,
        print_ir=True,
        log_level=ti.TRACE)
res = kernel4(3)
print(res)

I guess the LLVM pass that transforms summing loops into the Gauss sum formula is broken in Apple's MSL compilation pipeline. We can implement it ourselves to work around this issue.
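
To make the equivalence concrete, here is a minimal plain-Python sketch (no Taichi involved) comparing the loop form of the second repro with the closed-form rewrite. The function names are hypothetical, and the clamp to zero is an assumption added here: for negative l the inner range is empty and the loop contributes 0, whereas the unclamped expression (l - 1) * l // 2 would not.

```python
def kernel4_loop(l: int) -> int:
    """Host-side model of the second repro: sums j over range(l), twice."""
    res = 1
    for i in range(2):
        res1 = 0
        for j in range(l):
            res1 += j
        res += res1
    return res


def kernel4_closed_form(l: int) -> int:
    """Same result via the arithmetic-series formula 0 + 1 + ... + (n - 1)."""
    n = max(l, 0)  # range(l) is empty for l <= 0, so the sum is 0 there
    res = 1
    for i in range(2):
        res += (n - 1) * n // 2  # n * (n - 1) is always even, so // is exact
    return res


if __name__ == "__main__":
    print(kernel4_loop(3), kernel4_closed_form(3))
```

For non-negative l the two functions agree, so the rewritten kernel above returns the same value as the original repro for inputs such as kernel4(3).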

@PENGUINLIONG

Issue: #6221

Brief Summary

`kernel4` in `test_offline_cache.py` triggers a weird LLVM internal error on metal backend for macos12+. After moving metal runtime to gfxruntime by @PENGUINLIONG and #7201, this is now the only blocker for us to run full CI on macos ventura. As discussed with @PENGUINLIONG and @feisuzhu offline, let's disable this test to unblock CI and add it back once the issue is fixed.


PGZXB added the potential bug Something that looks like a bug but not yet confirmed label Oct 2, 2022

taichi-gardener added this to Taichi Lang Oct 2, 2022

taichi-gardener moved this to Untriaged in Taichi Lang Oct 2, 2022

PGZXB changed the title ~~Assert failed on metal~~ Assertion failure: pipeline_state_ != nullptr on metal Oct 2, 2022

PGZXB closed this as completed Oct 6, 2022

Repository owner moved this from Untriaged to Done in Taichi Lang Oct 6, 2022

ailzhang reopened this Oct 18, 2022

ailzhang added this to the v1.2.0 milestone Oct 21, 2022

ailzhang removed this from the v1.2.0 milestone Oct 27, 2022

ailzhang added bug We've confirmed that this is an BUG metal Metal backend and removed potential bug Something that looks like a bug but not yet confirmed labels Oct 27, 2022

turbo0628 self-assigned this Nov 1, 2022

ailzhang mentioned this issue Jan 18, 2023

[metal] Disable a kernel test in offline cache to unblock CI #7154

Merged
