[LLM]Update yuan model #8786

Merged
merged 50 commits into from
Aug 12, 2024

Conversation

zhaogf01
Contributor

PR types

New features

PR changes

Models

Description

Adds the other Yuan 2.0 models (51B, 102B), fine-tuning (LoRA, SFT), pre-training, and auto_convert_from_torch.

paddle-bot bot commented Jul 19, 2024

Thanks for your contribution!

codecov bot commented Jul 19, 2024

Codecov Report

Attention: Patch coverage is 24.59016% with 138 lines in your changes missing coverage. Please review.

Project coverage is 55.37%. Comparing base (57000fa) to head (ec8ee56).
Report is 278 commits behind head on develop.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| paddlenlp/transformers/yuan/tokenizer.py | 28.30% | 76 Missing ⚠️ |
| paddlenlp/transformers/yuan/modeling.py | 10.29% | 61 Missing ⚠️ |
| paddlenlp/transformers/model_utils.py | 87.50% | 1 Missing ⚠️ |
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #8786      +/-   ##
===========================================
- Coverage    55.44%   55.37%   -0.07%     
===========================================
  Files          626      633       +7     
  Lines        98065    99888    +1823     
===========================================
+ Hits         54368    55311     +943     
- Misses       43697    44577     +880     
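
As a quick cross-check of the figures in this report, project coverage is hits divided by lines. The short sketch below recomputes both percentages and the total of missing patch lines from the numbers quoted above. The dictionary layout and the `coverage_pct` helper are illustrative names, not part of Codecov.

```python
# Figures copied from the Codecov coverage diff in this report.
base = {"lines": 98065, "hits": 54368, "misses": 43697}
head = {"lines": 99888, "hits": 55311, "misses": 44577}


def coverage_pct(report: dict) -> float:
    """Project coverage as a percentage, rounded to two decimals."""
    return round(100 * report["hits"] / report["lines"], 2)


# Hits and misses should account for every tracked line.
for report in (base, head):
    assert report["hits"] + report["misses"] == report["lines"]

base_pct = coverage_pct(base)
head_pct = coverage_pct(head)
delta = round(head_pct - base_pct, 2)

# Missing patch lines per file, as listed in the report.
missing_per_file = {"tokenizer.py": 76, "modeling.py": 61, "model_utils.py": 1}
total_missing = sum(missing_per_file.values())

print(base_pct, head_pct, delta, total_missing)  # 55.44 55.37 -0.07 138
```

The recomputed values agree with the report: 55.44% on the base, 55.37% on the head, a change of -0.07, and 138 missing patch lines.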


@zhaogf01
Contributor Author

Please review this as soon as you can, thanks.


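## 1. Model Introduction

[Yuan 2.0](https://github.com/IEIT-Yuan/Yuan-2.0) is a new-generation foundation large language model released by Inspur Information (浪潮信息). Building on Yuan 1.0, Yuan 2.0 uses more diverse, high-quality pre-training data and instruction fine-tuning datasets, giving the model stronger understanding in areas such as semantics, mathematics, reasoning, code, and knowledge.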
Collaborator

The text is garbled.

Contributor Author

Could this be a browser encoding display issue? It looks fine locally and in my git repository, and I am using UTF-8 encoding.

Collaborator

I checked on a phone and the text is indeed not garbled, but it is garbled on a Mac. There may be a conflict between Windows and Mac. I suggest deleting the file and re-saving it with a tool such as Sublime, choosing UTF-8 (Unix) as the format.
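
The re-save suggested above can also be scripted. The thread does not establish the actual cause of the garbling, so the sketch below only assumes the usual suspects: a UTF-8 byte order mark, Windows (CRLF) line endings, or a legacy encoding such as GBK. The function names and the `source_encoding` parameter are hypothetical.

```python
from pathlib import Path

UTF8_BOM = b"\xef\xbb\xbf"


def normalize_to_utf8_unix(raw: bytes, source_encoding: str = "utf-8") -> bytes:
    """Return the content as UTF-8 without a BOM and with LF line endings."""
    if raw.startswith(UTF8_BOM):
        raw = raw[len(UTF8_BOM):]
    text = raw.decode(source_encoding)
    # Convert Windows (CRLF) and old Mac (CR) line endings to Unix (LF).
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    return text.encode("utf-8")


def resave_as_utf8_unix(path: str, source_encoding: str = "utf-8") -> None:
    """Rewrite the file at `path` in place as UTF-8 (Unix)."""
    file_path = Path(path)
    file_path.write_bytes(normalize_to_utf8_unix(file_path.read_bytes(), source_encoding))
```

For a file that was saved as GBK on Windows, a call such as `resave_as_utf8_unix("README.md", source_encoding="gbk")` would be the equivalent of the manual re-save. The file name here is only an example.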

@@ -0,0 +1,35 @@
{
"model_name_or_path": "/workspace/yuan",
Collaborator

This model can be uploaded to BOS or AI Studio, so the config does not need to point at a local path.

Contributor Author

How should the weights be uploaded to BOS or AI Studio? Is there a README or a link that explains it?

Collaborator

The model space is where the model data is uploaded. The upload can be done through paddlenlp, as follows:

# pip install aistudio_sdk tqdm
from paddlenlp.transformers import AutoModelForCausalLM, AutoTokenizer
# Make sure to pass the correct dtype
model_name_or_path = "IEITYuan/Yuan2-51B-hf"
dtype = "bfloat16"
repo_id = "user_id/Yuan2-51B-hf"  # user_id depends on the account that created the model repo
token = "xxxxxxxxxxx"  # get the token from "Personal Center > Access Token" on AI Studio

model = AutoModelForCausalLM.from_pretrained(model_name_or_path, dtype=dtype)
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)

# safetensors version
model.save_to_aistudio(
    repo_id = repo_id,
    token = token,
    private=True,
    license="Apache License 2.0",
    exist_ok=True,
    safe_serialization=True
)
# non-safetensors version
model.save_to_aistudio(
    repo_id = repo_id,
    token = token,
    private=True,
    license="Apache License 2.0",
    exist_ok=True,
    safe_serialization=False
)
tokenizer.save_to_aistudio(
    repo_id = repo_id,
    token = token,
    private=True,
    license="Apache License 2.0",
    exist_ok=True,
)

@@ -0,0 +1,41 @@
{
"model_name_or_path": "/workspace/yuan",
Collaborator

Same as above.

Collaborator

For pre-training, the data may still need to be prepared.

Contributor Author

image
I tested with the dataset provided by paddlenlp, and it works.

@@ -249,7 +249,7 @@ def apply_rotary_pos_emb(q, k, cos, sin, position_ids):

class YuanPretrainedModel(PretrainedModel):
    config_class = YuanConfig
    base_model_prefix = "model"
Collaborator

Does this change need to stay compatible with the parameters of the model that was merged earlier?

Contributor Author

done
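
The compatibility question in this thread comes down to parameter names: weights saved under one base prefix no longer match once the prefix changes. The thread only records "done", so the sketch below is not what this PR did and not PaddleNLP's own mechanism. It shows one generic way to keep older checkpoints loadable, renaming keys at load time. `rename_prefix`, the example keys, and the `yuan.` to `model.` direction are all assumptions for illustration.

```python
from typing import Any, Dict


def rename_prefix(state_dict: Dict[str, Any], old: str, new: str) -> Dict[str, Any]:
    """Return a copy of state_dict with a leading `old` prefix replaced by `new`."""
    renamed = {}
    for key, value in state_dict.items():
        if key.startswith(old):
            key = new + key[len(old):]
        renamed[key] = value
    return renamed


# Hypothetical checkpoint saved under the old prefix.
old_weights = {"yuan.embed_tokens.weight": "w0", "lm_head.weight": "w1"}
new_weights = rename_prefix(old_weights, "yuan.", "model.")
print(sorted(new_weights))  # ['lm_head.weight', 'model.embed_tokens.weight']
```

Keys that do not start with the old prefix pass through unchanged, so the same helper is safe to apply to checkpoints that already use the new naming.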

@@ -282,7 +287,7 @@ def _get_name_mappings(cls, config: YuanConfig) -> List[StateDictNameMapping]:
         if "YuanModel" not in config.architectures:
             for mapping in model_mappings:
                 mapping[0] = "model." + mapping[0]
-                mapping[1] = "yuan." + mapping[1]
+                mapping[1] = "model." + mapping[1]
Collaborator

Hmm, isn't this prefix already fine as it is? Does it need to be changed like this?

Contributor Author

done

@@ -249,7 +249,7 @@ def apply_rotary_pos_emb(q, k, cos, sin, position_ids):

class YuanPretrainedModel(PretrainedModel):
Collaborator

Please add an `__all__` field here to limit which models get imported. `__init__` uses `import *`, so a lot of other things are imported as well.

Contributor Author

done
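
The `__all__` request in this thread relies on standard Python behavior: `from module import *` copies only the names listed in `__all__` when it is defined, and otherwise every public name, including imported modules and helper functions. The sketch below demonstrates this with a throwaway in-memory module. The module name `toy_yuan_modeling` and the exported names are chosen for illustration and are not the list used in the PR.

```python
import sys
import types

# Source of a toy module; the names are stand-ins for the real modeling file.
source = '''
__all__ = ["YuanModel", "YuanPretrainedModel"]

import math  # would leak through `import *` if __all__ were absent


class YuanPretrainedModel: ...
class YuanModel(YuanPretrainedModel): ...


def apply_rotary_pos_emb(): ...
'''

# Register the toy module so that a normal import statement can find it.
module = types.ModuleType("toy_yuan_modeling")
exec(source, module.__dict__)
sys.modules["toy_yuan_modeling"] = module

# Run the star import in a fresh namespace and see what arrives.
namespace = {}
exec("from toy_yuan_modeling import *", namespace)
exported = sorted(name for name in namespace if not name.startswith("__"))
print(exported)  # ['YuanModel', 'YuanPretrainedModel']
```

Neither `math` nor `apply_rotary_pos_emb` reaches the importing namespace, which is the effect the reviewer asked for in a package whose `__init__` uses `import *`.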

wawltor merged commit 30fc639 into PaddlePaddle:develop on Aug 12, 2024
9 of 12 checks passed
4 participants