add llama and nv-embed training #9323

Li-Z-Q · 2024-10-28T07:37:57Z

Description

add llama and nv-embed training

paddle-bot · 2024-10-28T07:38:02Z

Thanks for your contribution!

CLAassistant · 2024-10-28T07:38:11Z

All committers have signed the CLA.

codecov · 2024-10-28T08:10:03Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 52.94%. Comparing base (c9d5673) to head (8b14719).
Report is 154 commits behind head on develop.

❗ Current head 8b14719 differs from pull request most recent head b538dea

Please upload reports for the commit b538dea to get more accurate results.

Additional details and impacted files

@@             Coverage Diff             @@
##           develop    #9323      +/-   ##
===========================================
+ Coverage    52.86%   52.94%   +0.08%     
===========================================
  Files          669      676       +7     
  Lines       107240   107919     +679     
===========================================
+ Hits         56688    57134     +446     
- Misses       50552    50785     +233


sijunhe · 2024-11-12T02:47:36Z

slm/pipelines/examples/contrastive_training/README.md

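+Single-GPU training is inefficient and limits batch_size to small values, so multi-GPU training is recommended. Contrastive learning in particular benefits from a large batch_size spread across multiple GPUs. An example command: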
+
+```
+python -m paddle.distributed.launch --gpus "1,2,3,4" train.py --do_train \


`--gpus "1,2,3,4"`: shouldn't this start from 0?

OK, fixed.
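GPU device ids are zero-based, which is why the reviewer asked for the list to start at 0. A minimal sketch of building the `--gpus` value programmatically follows. The helper name `gpu_list` is hypothetical and not part of the PR, and the training flags after `--do_train` are left out because the quoted README line is truncated.

```python
def gpu_list(num_gpus: int) -> str:
    """Return a comma-separated, zero-based device list such as "0,1,2,3"."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    return ",".join(str(i) for i in range(num_gpus))


# Assemble the visible part of the launch command with zero-based ids.
# Any further training flags from the README are intentionally omitted.
command = [
    "python", "-m", "paddle.distributed.launch",
    "--gpus", gpu_list(4),
    "train.py", "--do_train",
]
print(" ".join(command))
```

Generating the list from a device count avoids off-by-one lists such as "1,2,3,4", which would skip device 0 and reference a fifth device that a four-GPU machine does not have.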

sijunhe · 2024-11-12T02:47:46Z

slm/pipelines/examples/contrastive_training/README.md

-### Single-GPU training
-
+## Training


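Keep "Single-GPU training".

In the current code, the "Training" heading covers both single-GPU and multi-GPU training. Do you mean that "Single-GPU training" should also get its own subheading?

Yes. Keep the previous single-GPU / multi-GPU subheadings; it is clearer that way.

OK, fixed.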

sijunhe · 2024-11-12T02:49:02Z

slm/pipelines/examples/contrastive_training/eval_mteb.py

-#
+# 
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
 # You may obtain a copy of the License at
-#
+# 
 #     http://www.apache.org/licenses/LICENSE-2.0
-#
+# 


Please leave this as it was.

OK, fixed.

sijunhe · 2024-11-12T03:01:14Z

slm/pipelines/examples/contrastive_training/models/modeling_nv.py

@@ -0,0 +1,517 @@
+# Copyright (c) 2024 PaddlePaddle Authors. All Rights Reserved.


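Where does this model definition come from? Was it moved over from mteb_models_nv.py?

For nv-embed training and evaluation, the previous version used two files: modeling_nv.py for training and mteb_models_nv.py for evaluation.
This update merges the two, so both training and evaluation now load the nv-embed weights through modeling_nv.py.

OK. Then where did the logic in mteb_models go? I see that file was deleted as well.

The logic from mteb_models.py was merged into models/modeling.py.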

sijunhe · 2024-11-12T03:01:25Z

slm/pipelines/examples/contrastive_training/evaluation/benchmarks.py

@@ -1,216 +0,0 @@
-# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved.


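Why was this deleted?

The new version evaluates models with eval_mteb.py, so the old evaluation folder was removed.

The previous evaluation code is for T2Ranking. Why delete it? T2Ranking and MTEB do not conflict.

The evaluation folder has been restored.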

sijunhe · 2024-11-12T03:02:24Z

slm/pipelines/examples/contrastive_training/eval_mteb.py

@@ -87,21 +79,29 @@ def get_args():
        passage_prefix = ""

        if args.task_name == "QuoraRetrieval":
-            assert args.document_instruction != "document: ", "QuoraRetrieval requires a document instruction"
+            assert args.document_instruction != "document: ", f"QuoraRetrieval requires a document instruction"


There is no replacement field here, so don't use an f-string.

OK, fixed.
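To illustrate the reviewer's point: an f-string with no `{...}` replacement field evaluates to exactly the same value as the plain literal, so the prefix only adds noise (pyflakes reports it as F541, "f-string is missing placeholders"). The variable names below are illustrative and not taken from the PR.

```python
task_name = "QuoraRetrieval"

# Plain literal: the right choice when nothing is interpolated.
plain = "QuoraRetrieval requires a document instruction"

# Same text with an f prefix but no replacement field: the prefix does nothing.
no_fields = f"QuoraRetrieval requires a document instruction"

# The f prefix is only warranted once a value is actually interpolated.
with_field = f"{task_name} requires a document instruction"

print(plain == no_fields == with_field)
```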

sijunhe · 2024-11-12T03:03:45Z

slm/pipelines/examples/contrastive_training/eval_mteb.py

+            lora_config = LoRAConfig.from_pretrained(args.peft_model_name_or_path)
+            lora_config.merge_weights = True
+            encode_model = LoRAModel.from_pretrained(
+                encode_model, args.peft_model_name_or_path, lora_config=lora_config, dtype="bfloat16"


Is it OK to hardcode the dtype here?

Changed to dtype=lora_config.dtype.
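The idea behind the fix can be sketched without the real library: read the dtype from the saved adapter config instead of pinning it to "bfloat16", so the loader follows whatever precision the adapter was saved with. `AdapterConfig` and `load_kwargs` below are hypothetical stand-ins for illustration only; they are not the PaddleNLP `LoRAConfig` API, and the default values are assumptions.

```python
from dataclasses import dataclass, replace


@dataclass
class AdapterConfig:
    """Hypothetical stand-in for the LoRA config object used in the diff."""
    dtype: str = "float16"
    merge_weights: bool = False


def load_kwargs(config: AdapterConfig) -> dict:
    """Build loader keyword arguments that follow the saved config."""
    # Mirrors `lora_config.merge_weights = True` from the diff, on a copy
    # so the caller's config object is left untouched.
    merged = replace(config, merge_weights=True)
    # The dtype comes from the config rather than a hardcoded string.
    return {"lora_config": merged, "dtype": merged.dtype}


kwargs = load_kwargs(AdapterConfig(dtype="float16"))
print(kwargs["dtype"])
```

With this shape, an adapter saved in float16 is no longer forced into bfloat16 at evaluation time.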

sijunhe · 2024-11-12T03:45:49Z

slm/pipelines/examples/contrastive_training/eval_mteb.py

@@ -1,11 +1,11 @@
 # Copyright (c) 2024 PaddlePaddle Authors. All Rights Reserved.


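This should still go in the evaluation folder.

The evaluation folder has already been removed in the new version. Do you mean we should keep the evaluation folder and move eval_mteb.py into it?

This should still go in the evaluation folder.

OK, moved.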

DrownFish19

LGTM

add llama and nv-embed training

21ca2e8

paddle-bot bot added the contributor label Oct 28, 2024

paddle-bot bot assigned KB-Ding Oct 28, 2024

add llama and nv-embed training and refine evaluation

d380ba3

sijunhe reviewed Nov 12, 2024


Li-Z-Q added 7 commits November 12, 2024 18:07

add llama and nv-embed training and refine evaluation

dc2ad6b

add llama and nv-embed training and refine evaluation

c4d4001

add llama and nv-embed training and refine evaluation

9d1a873

add llama and nv-embed training and refine evaluation

3968149

add llama and nv-embed training and refine evaluation

1fa993f

add llama and nv-embed training and refine evaluation

8b14719

add llama and nv-embed training and refine evaluation

b538dea

sijunhe requested a review from DrownFish19 November 18, 2024 11:02

sijunhe added the Beijing Innovation Consortium label Nov 18, 2024

DrownFish19 approved these changes Dec 16, 2024


DrownFish19 merged commit c8aa7bf into PaddlePaddle:develop Dec 16, 2024
11 of 12 checks passed
