inference support llama3(wint8|4/a8w8) #8630
Conversation
Thanks for your contribution!
Codecov Report
Attention: Patch coverage is

Additional details and impacted files

@@            Coverage Diff            @@
##           develop    #8630   +/-   ##
========================================
  Coverage    55.80%    55.80%
========================================
  Files          620       620
  Lines        96642     96642
========================================
  Hits         53928     53928
  Misses       42714     42714
========================================

☔ View full report in Codecov by Sentry.
Force-pushed from e40bdf1 to a81c16d (compare).
llm/predict/predictor.py (Outdated)
@@ -1213,8 +1214,8 @@ def create_predictor(
     init_chat_template(tokenizer, predictor_args.model_name_or_path, predictor_args.chat_template)

     # TODO(wj-Mcat): fix llama tokenzier pad_token bug
-    if isinstance(tokenizer, LlamaTokenizer) and not tokenizer.pad_token:
+    if (isinstance(tokenizer, LlamaTokenizer) or isinstance(tokenizer, Llama3Tokenizer)) and not tokenizer.pad_token:
         tokenizer.pad_token = tokenizer.unk_token
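For context, a minimal runnable sketch of the fallback this hunk extends; `StubTokenizer` is a hypothetical stand-in for the real PaddleNLP tokenizer classes, which may ship without a pad token.

```python
# StubTokenizer is a hypothetical stand-in for LlamaTokenizer /
# Llama3Tokenizer; it models a tokenizer that has no pad token set.
class StubTokenizer:
    pad_token = None
    unk_token = "<unk>"

tokenizer = StubTokenizer()

# Reuse the unk token as the pad token so batched inputs can be padded.
if not tokenizer.pad_token:
    tokenizer.pad_token = tokenizer.unk_token

print(tokenizer.pad_token)  # -> <unk>
```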
This can be simplified to `if isinstance(tokenizer, (LlamaTokenizer, Llama3Tokenizer)) and not tokenizer.pad_token:`, since isinstance accepts a tuple of types.
OK, will do.
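For reference, a quick sketch of the suggested tuple form, using stub classes in place of the real tokenizer types:

```python
# Stubs in place of the real PaddleNLP tokenizer classes.
class LlamaTokenizer: pass
class Llama3Tokenizer: pass

tok = Llama3Tokenizer()

# isinstance accepts a tuple: True if tok is an instance of any listed type.
assert isinstance(tok, (LlamaTokenizer, Llama3Tokenizer))
# Equivalent to the longer form in the diff above:
assert isinstance(tok, LlamaTokenizer) or isinstance(tok, Llama3Tokenizer)
```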
@@ -549,7 +549,7 @@ def init_weight_shape(self, config):
         self.qkv_weight_shape = (
             [(self.num_heads + 2 * self.kv_num_heads) * self.head_dim, self.embed_dim]
             if config.trans_qkvw
-            else [(self.num_heads + 2 * self.kv_num_heads) * self.head_dim, self.embed_dim]
+            else [self.embed_dim, (self.num_heads + 2 * self.kv_num_heads) * self.head_dim]
Why was this shape changed here?
Because the previous shape was wrong.
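To make the fix concrete, a small sketch with hypothetical dimensions; reading trans_qkvw as "weight stored transposed" is an assumption based on the flag name, not something the thread confirms.

```python
# Hypothetical dimensions, for illustration only.
num_heads, kv_num_heads, head_dim, embed_dim = 32, 8, 128, 4096
out_dim = (num_heads + 2 * kv_num_heads) * head_dim  # fused Q, K, V width

# The two layouts are transposes of each other, so the else branch could
# not legitimately reuse the trans_qkvw shape:
trans_shape = [out_dim, embed_dim]  # trans_qkvw: stored as [out_features, in_features]
plain_shape = [embed_dim, out_dim]  # otherwise:  stored as [in_features, out_features]

assert trans_shape == plain_shape[::-1]
```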
Force-pushed from 833a8a7 to 7130c18 (compare).
As discussed: the llama3 model currently runs inference correctly in the dygraph non-fuse path, while the fuse path has a multi-process issue during inference, to be investigated later. In addition, src_length cannot be set for inference after dynamic-to-static conversion, and high-performance inference does not terminate correctly on eos. @yuanlehome
LGTM
PR types
New features
PR changes
Others
Description
inference support llama3(wint8|4/a8w8)