mamba2 training speed is very very very slow #389
Comments
For my task (image classification), Mamba-1 takes 40 minutes to run one epoch on an RTX 6000 Ada, while Mamba-2 takes only 20. Mamba-2 also uses less RAM!
See #355. Although I've also encountered issues similar to those described in that issue later on (graph compilation errors). I'd also suggest using torch==2.2.0 with triton 2.2.0 (no idea why, but it ran faster than 2.3.0 in my case).
I also encountered this problem. Running the demo takes around 30 seconds.
Environment information: GPU: NVIDIA A6000.
I tried adding a decorator in ssd_combined.py as suggested by @Kiet0712 in this comment, but it resulted in a bug similar to what @arelkeselbri described in this comment. Is this inference speed for the demo normal, or is there something wrong with my code? I would appreciate any help or suggestions!
Try warming up by running it once first. The first call will invoke the Triton compiler and autotuner, so it'll be slow.
Thank you so much! The second inference takes only 0.005 sec.
I used the same code for testing; I tried running it multiple times and saw no improvement in speed.
The compilation happens every time you launch `python demo.py`. Try running the forward pass twice in the same script.
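The warm-up pattern described above can be sketched as follows. This is a minimal illustration, not the original poster's code: `TritonLikeKernel` is a hypothetical stand-in that simulates a Triton kernel's one-time compile/autotune cost, since `mamba_ssm` itself is not imported here.

```python
import time


class TritonLikeKernel:
    """Hypothetical stand-in for a Triton-compiled forward pass:
    the first call pays a one-time compile/autotune cost."""

    def __init__(self):
        self.compiled = False

    def forward(self, x):
        if not self.compiled:
            time.sleep(0.2)  # simulate JIT compilation + autotuning
            self.compiled = True
        return x * 2


kernel = TritonLikeKernel()

t0 = time.perf_counter()
kernel.forward(1)  # warm-up call: triggers compilation
warmup = time.perf_counter() - t0

t1 = time.perf_counter()
kernel.forward(1)  # steady-state call: reuses the compiled kernel
steady = time.perf_counter() - t1

print(f"warm-up: {warmup:.3f}s, steady-state: {steady:.6f}s")
```

The point is to benchmark only the steady-state call; with real CUDA kernels you would also synchronize the device before reading the timer.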
@Gaodzlearn The problem is solved, thank you very much for your reply.
Why is it that Mamba runs with no problem, but when I run Mamba-2 I get "'NoneType' object has no attribute 'causal_conv1d_fwd'"? I also installed causal-conv1d==1.2.1.
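One plausible cause of that error, sketched below as an assumption rather than a confirmed diagnosis: if the optional causal-conv1d extension fails to import, the fallback reference can be left as `None`, which later surfaces as the `'NoneType' object has no attribute 'causal_conv1d_fwd'` message. A quick importability check:

```python
# Hedged diagnostic sketch: verify that the causal-conv1d package
# actually imports in the environment used for training. A failed
# import (e.g. a torch/CUDA version mismatch at build time) would
# leave the fast path unset.
try:
    import causal_conv1d  # noqa: F401
    have_causal_conv1d = True
except ImportError:
    have_causal_conv1d = False

print("causal-conv1d importable:", have_causal_conv1d)
```

If this prints `False` in the same environment where training runs, reinstalling causal-conv1d against the matching torch/CUDA version is the first thing to try.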
I changed Mamba to Mamba-2, and training speed became very, very slow. Why?