CodeGeeX inference support oneflow backend #65
Conversation
from oneflow.nn.parameter import Parameter

def fast_gelu(x):
Optimization 1: quick_gelu
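The body of `fast_gelu` is truncated in this diff. A common tanh-approximation GELU used in Megatron-style models, and a plausible sketch of what the OneFlow port computes here (not confirmed against the actual PR code), is:

```python
import math

def fast_gelu(x):
    # Tanh approximation of GELU; cheaper than the exact erf-based form.
    # 0.7978845608... = sqrt(2 / pi)
    return 0.5 * x * (1.0 + math.tanh(0.7978845608028654 * x * (1.0 + 0.044715 * x * x)))
```

In the model this would be applied elementwise to a tensor; the scalar version above shows the formula itself.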
# Query, Key, and Value
# =====================

if hasattr(torch._C, 'grouped_matmul_bias'):
Optimization 2: grouped matmul
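To illustrate what a grouped matmul+bias kernel fuses: several independent `x @ w + b` products (here, the Q/K/V projections) are batched into a single kernel launch instead of three. The NumPy reference below (`grouped_matmul_bias_ref` is a hypothetical name, not the OneFlow API) shows the semantics only:

```python
import numpy as np

def grouped_matmul_bias_ref(xs, ws, bs):
    # Reference semantics: run each (x @ w + b) separately.
    # A fused grouped kernel would compute all groups in one
    # launch to cut per-kernel launch overhead.
    return [x @ w + b for x, w, b in zip(xs, ws, bs)]

# Example: Q/K/V projections as three groups sharing one input.
x = np.random.randn(4, 8).astype(np.float32)
wq, wk, wv = (np.random.randn(8, 8).astype(np.float32) for _ in range(3))
bq, bk, bv = (np.zeros(8, dtype=np.float32) for _ in range(3))
q, k, v = grouped_matmul_bias_ref([x, x, x], [wq, wk, wv], [bq, bk, bv])
```

The `hasattr` guard in the diff lets the code fall back to separate matmuls when the fused op is not available in the installed OneFlow build.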
origin_key_layer = key_layer
origin_value_layer = value_layer

if hasattr(torch._C, 'fused_multi_head_attention_inference'):
Optimization 3: fused_fmha (fused multi-head attention)
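As a sketch of what a fused multi-head-attention inference kernel computes: scaled dot-product attention per head, with the matmuls and softmax fused into one op so intermediates are not materialized. The NumPy reference below is illustrative only, not the OneFlow kernel's signature:

```python
import numpy as np

def mha_inference_ref(q, k, v):
    # q, k, v: [batch, heads, seq, head_dim].
    # Computes softmax(q @ k^T / sqrt(d)) @ v; a fused kernel does
    # the same math in a single launch without storing `scores`/`probs`.
    d = q.shape[-1]
    scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerically stable softmax
    probs = np.exp(scores)
    probs /= probs.sum(axis=-1, keepdims=True)
    return probs @ v
```

Again the `hasattr` guard keeps the unfused path as a fallback for OneFlow builds without the fused op.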
context_length=None,
):

# hidden_states: [sq, b, h]
TopQueryAttention is optimized in the same way as SelfAttention.
from codegeex.oneflow import CodeGeeXModel
from codegeex.tokenizer import CodeGeeXTokenizer
from codegeex.quantization import quantize
os.environ["ONEFLOW_KERNEL_ENABLE_FUSED_LINEAR"] = "1"
Optimization 4: fuse matmul with bias_add.
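Numerically the fusion changes nothing: with `ONEFLOW_KERNEL_ENABLE_FUSED_LINEAR=1`, the matmul and the following bias_add run as one kernel instead of two, so only the launch count drops. A NumPy sketch of the equivalence (illustrative only, not OneFlow code):

```python
import numpy as np

x = np.random.randn(4, 16).astype(np.float32)
w = np.random.randn(16, 32).astype(np.float32)
b = np.random.randn(32).astype(np.float32)

# Unfused path: two separate ops, matmul then bias_add.
unfused = np.add(np.matmul(x, w), b)
# Fused path computes the same x @ w + b in a single kernel launch.
fused = x @ w + b
```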
Below are the FP16 performance results for oneflow, FasterTransformer, and PyTorch:
Because mock torch runs into some issues in this application, the PyTorch version of the code cannot be switched to oneflow with a single flag; instead, we support the oneflow backend for CodeGeeX inference through a separately added script.