Hi @Monekyzoon, this is because after we split the sequence into multiple chunks, we cannot tell the flash-attn kernel the sequence offset of each chunk. The kernel therefore cannot know the correct relative positions of the tokens, so the ALiBi bias it generates internally is wrong for each sequence chunk.
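To illustrate the problem, here is a minimal sketch (not TransformerEngine's actual code; the `alibi_bias` helper and the 8-token example are made up for illustration). ALiBi bias depends on the global positions of queries and keys, so a kernel that restarts query positions at 0 inside each chunk produces the wrong bias:

```python
import torch

def alibi_bias(q_len, k_len, slope, q_offset=0):
    # ALiBi adds -slope * (query_pos - key_pos) to the attention scores,
    # where the positions are *global* positions in the full sequence.
    q_pos = torch.arange(q_len) + q_offset
    k_pos = torch.arange(k_len)
    return -slope * (q_pos[:, None] - k_pos[None, :]).clamp(min=0)

slope = 0.25
# Bias the model should see for a full 8-token sequence.
full = alibi_bias(8, 8, slope)

# Context parallelism gives the second rank only tokens 4..7 as queries.
# The flash-attn kernel has no argument for the chunk's global offset, so it
# recomputes query positions starting from 0:
wrong = alibi_bias(4, 8, slope)              # what the kernel produces
right = alibi_bias(4, 8, slope, q_offset=4)  # what the chunk actually needs

print(torch.allclose(full[4:], right))  # True
print(torch.allclose(full[4:], wrong))  # False
```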
https://github.com/NVIDIA/TransformerEngine/blob/main/transformer_engine/pytorch/attention.py#L3237
`assert alibi_slopes is None, "Alibi slope bias addition is not supported with context parallelism."`
Why is ALiBi not supported with context parallelism?