
[Embedding] Add embedding training #9508

Merged
20 commits merged into PaddlePaddle:develop on Dec 17, 2024

Conversation

@DesmonDay (Contributor) commented Nov 27, 2024

PR types

New features

PR changes

Others

Description

Support embedding training.
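For context (an illustrative aside, not part of the PR description): "embedding training" here means contrastive training of a sentence-embedding model on query-passage pairs, as the max_query_len / max_passage_len / group_size settings quoted later in this thread suggest. Below is a minimal sketch of the usual in-batch-negatives InfoNCE objective in Paddle; the loss actually implemented by this PR may differ.

import paddle
import paddle.nn.functional as F

def contrastive_loss(q_reps, p_reps, temperature=0.02):
    """InfoNCE over in-batch negatives.

    q_reps: [batch, hidden] query embeddings.
    p_reps: [batch * group_size, hidden] passage embeddings; the first
        passage of each group is the positive, the rest are negatives.
    """
    q_reps = F.normalize(q_reps, axis=-1)
    p_reps = F.normalize(p_reps, axis=-1)
    # Score every query against every passage in the batch.
    scores = paddle.matmul(q_reps, p_reps, transpose_y=True) / temperature
    group_size = p_reps.shape[0] // q_reps.shape[0]
    # Index of each query's positive passage within p_reps.
    labels = paddle.arange(q_reps.shape[0], dtype="int64") * group_size
    return F.cross_entropy(scores, labels)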

paddle-bot bot commented Nov 27, 2024

Thanks for your contribution!

@@ -1093,9 +1093,9 @@ def _inner_training_loop(
                 if is_no_sync:
                     # Avoid unnecessary DDP synchronization since there will be no backward pass on this example.
                     with model.no_sync():
-                        tr_loss_step = self.training_step(model, inputs)
+                        tr_loss_step = self.training_step(model, inputs, step_control=step_control)
@DesmonDay (Contributor, Author) commented Nov 27, 2024:
We should see whether there is a better way to do this.

Collaborator commented:
This needs a compatibility check: verify whether self.training_step accepts a step_control parameter.
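A minimal sketch of the compatibility check suggested above, using inspect to detect whether a possibly overridden training_step accepts step_control. Names mirror the hunk above; this is illustrative, not the PR's final code.

import inspect

def _call_training_step(self, model, inputs, step_control):
    # Older Trainer subclasses may override training_step(model, inputs);
    # only pass step_control when the signature actually accepts it.
    if "step_control" in inspect.signature(self.training_step).parameters:
        return self.training_step(model, inputs, step_control=step_control)
    return self.training_step(model, inputs)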

codecov bot commented Nov 27, 2024

Codecov Report

Attention: Patch coverage is 24.39024% with 186 lines in your changes missing coverage. Please review.

Project coverage is 52.71%. Comparing base (db38937) to head (39d324d).
Report is 17 commits behind head on develop.

Files with missing lines                  | Patch %  | Lines
paddlenlp/datasets/embedding_dataset.py   | 24.40%   | 96 Missing ⚠️
paddlenlp/data/data_collator.py           | 24.09%   | 63 Missing ⚠️
paddlenlp/transformers/qwen2/modeling.py  | 22.85%   | 27 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #9508      +/-   ##
===========================================
- Coverage    52.77%   52.71%   -0.07%     
===========================================
  Files          709      710       +1     
  Lines       111172   111417     +245     
===========================================
+ Hits         58674    58733      +59     
- Misses       52498    52684     +186     


@CLAassistant commented Dec 3, 2024

CLA assistant check
All committers have signed the CLA.

@DesmonDay force-pushed the add_embedding_trainer branch from 20f56e6 to ba2c286 on December 4, 2024 09:37
@DesmonDay force-pushed the add_embedding_trainer branch from 7964ad3 to d815fce on December 6, 2024 05:58
"max_query_len": 1024,
"max_passage_len": 2048,
"group_size": 4,
"bp16": true,
Collaborator commented:
Typo: "bp16" should be "bf16".

model_config.embedding_negatives_cross_device = embedding_args.embedding_negatives_cross_device
logger.info(f"Final model config: {model_config}")

model_class = Qwen2SentenceEmbedding
Collaborator commented:
Change this later; we could add an Auto class instead of hard-coding the model class.
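A sketch of the Auto-style dispatch the reviewer suggests. The registry and helper below are hypothetical, not part of this PR; only Qwen2SentenceEmbedding comes from the change itself.

# Import path assumed from the PR's file list (paddlenlp/transformers/qwen2/modeling.py).
from paddlenlp.transformers.qwen2.modeling import Qwen2SentenceEmbedding

# Hypothetical registry; extend it as more backbones gain embedding heads.
SENTENCE_EMBEDDING_CLASSES = {
    "qwen2": Qwen2SentenceEmbedding,
}

def auto_sentence_embedding_class(model_type: str):
    """Resolve the sentence-embedding class for a model type string."""
    try:
        return SENTENCE_EMBEDDING_CLASSES[model_type]
    except KeyError:
        raise ValueError(f"No sentence-embedding class registered for {model_type!r}")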

Comment on lines +264 to +265
trainable_parameters = [p for p in model.parameters() if not p.stop_gradient]
trainer.set_optimizer_grouped_parameters(trainable_parameters)
Collaborator commented:
@lugimzzz, why was this added originally?
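For background (an explanatory aside, not from the thread): Paddle marks frozen parameters with stop_gradient = True, the inverse of PyTorch's requires_grad, so the quoted filter hands the optimizer only parameters that will actually receive gradients.

# Annotated restatement of the quoted lines; behavior is unchanged.
# In Paddle, stop_gradient=True means "frozen", so this keeps optimizer
# state off frozen weights (e.g. under LoRA-style partial fine-tuning).
trainable_parameters = [p for p in model.parameters() if not p.stop_gradient]
trainer.set_optimizer_grouped_parameters(trainable_parameters)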

@DesmonDay force-pushed the add_embedding_trainer branch 2 times, most recently from 1328048 to 63ba9d0 on December 10, 2024 08:09
@DesmonDay force-pushed the add_embedding_trainer branch from 63ba9d0 to 0a618b0 on December 10, 2024 13:43
# Detecting last checkpoint.
last_checkpoint = None
if os.path.isdir(training_args.output_dir) and training_args.do_train and not training_args.overwrite_output_dir:
last_checkpoint = get_last_checkpoint(training_args.output_dir)
Collaborator commented:
Please add a README covering usage and the dataset format.
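For reference, the hunk above is truncated. A typical completion of this last-checkpoint pattern, sketched after the usual PaddleNLP example-script convention (the PR's exact code may differ, and the import paths are assumed):

import os

from paddlenlp.trainer.trainer_utils import get_last_checkpoint  # import path assumed
from paddlenlp.utils.log import logger  # import path assumed

# Detecting last checkpoint.
last_checkpoint = None
if os.path.isdir(training_args.output_dir) and training_args.do_train and not training_args.overwrite_output_dir:
    last_checkpoint = get_last_checkpoint(training_args.output_dir)
    if last_checkpoint is None and len(os.listdir(training_args.output_dir)) > 0:
        raise ValueError(
            f"Output directory ({training_args.output_dir}) already exists and is not empty. "
            "Use --overwrite_output_dir to train from scratch."
        )
    if last_checkpoint is not None:
        logger.info(f"Checkpoint detected, resuming training from {last_checkpoint}.")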

@DesmonDay changed the title from "[Embedding] Add embedding trainer" to "[Embedding] Add embedding training" on Dec 13, 2024
@ZHUI (Collaborator) left a comment:
First, please add a README and sample data. The warm-start resume issue can be analyzed later.

@ZHUI ZHUI merged commit 2231feb into PaddlePaddle:develop Dec 17, 2024
9 of 12 checks passed