
[Embedding] Add embedding training #9508

Merged
20 commits merged into PaddlePaddle:develop on Dec 17, 2024

Conversation

@DesmonDay (Contributor) commented Nov 27, 2024

PR types

New features

PR changes

Others

Description

Support embedding training.
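For context (an illustrative aside, not part of the PR description): "embedding training" here means contrastive training of a sentence-embedding model on query-passage pairs, as the max_query_len / max_passage_len / group_size settings quoted later in this thread suggest. Below is a minimal sketch of the usual in-batch-negatives InfoNCE objective in Paddle; the loss actually implemented by this PR may differ.

import paddle
import paddle.nn.functional as F

def contrastive_loss(q_reps, p_reps, temperature=0.02):
    """InfoNCE over in-batch negatives.

    q_reps: [batch, hidden] query embeddings.
    p_reps: [batch * group_size, hidden] passage embeddings; the first
        passage of each group is the positive, the rest are negatives.
    """
    q_reps = F.normalize(q_reps, axis=-1)
    p_reps = F.normalize(p_reps, axis=-1)
    # Score every query against every passage in the batch.
    scores = paddle.matmul(q_reps, p_reps, transpose_y=True) / temperature
    group_size = p_reps.shape[0] // q_reps.shape[0]
    # Index of each query's positive passage within p_reps.
    labels = paddle.arange(q_reps.shape[0], dtype="int64") * group_size
    return F.cross_entropy(scores, labels)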

paddle-bot bot commented Nov 27, 2024

Thanks for your contribution!

@@ -1093,9 +1093,9 @@ def _inner_training_loop(
                 if is_no_sync:
                     # Avoid unnecessary DDP synchronization since there will be no backward pass on this example.
                     with model.no_sync():
-                        tr_loss_step = self.training_step(model, inputs)
+                        tr_loss_step = self.training_step(model, inputs, step_control=step_control)
@DesmonDay (Contributor, Author) commented Nov 27, 2024:
We should see whether there is a better way to do this.

Collaborator commented:
This needs a compatibility check: verify whether self.training_step accepts a step_control parameter.
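A minimal sketch of the compatibility check suggested above, using inspect to detect whether a possibly overridden training_step accepts step_control. Names mirror the hunk above; this is illustrative, not the PR's final code.

import inspect

def _call_training_step(self, model, inputs, step_control):
    # Older Trainer subclasses may override training_step(model, inputs);
    # only pass step_control when the signature actually accepts it.
    if "step_control" in inspect.signature(self.training_step).parameters:
        return self.training_step(model, inputs, step_control=step_control)
    return self.training_step(model, inputs)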

codecov bot commented Nov 27, 2024

Codecov Report

Attention: Patch coverage is 24.39024% with 186 lines in your changes missing coverage. Please review.

Project coverage is 52.71%. Comparing base (db38937) to head (39d324d).
Report is 17 commits behind head on develop.

Files with missing lines                  | Patch %  | Lines
paddlenlp/datasets/embedding_dataset.py   | 24.40%   | 96 Missing ⚠️
paddlenlp/data/data_collator.py           | 24.09%   | 63 Missing ⚠️
paddlenlp/transformers/qwen2/modeling.py  | 22.85%   | 27 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #9508      +/-   ##
===========================================
- Coverage    52.77%   52.71%   -0.07%     
===========================================
  Files          709      710       +1     
  Lines       111172   111417     +245     
===========================================
+ Hits         58674    58733      +59     
- Misses       52498    52684     +186     


@CLAassistant commented Dec 3, 2024

CLA assistant check
All committers have signed the CLA.

@DesmonDay force-pushed the add_embedding_trainer branch from 20f56e6 to ba2c286 on December 4, 2024 09:37
@DesmonDay force-pushed the add_embedding_trainer branch from 7964ad3 to d815fce on December 6, 2024 05:58
"max_query_len": 1024,
"max_passage_len": 2048,
"group_size": 4,
"bp16": true,
Collaborator commented:
Typo: "bp16" should be "bf16".

model_config.embedding_negatives_cross_device = embedding_args.embedding_negatives_cross_device
logger.info(f"Final model config: {model_config}")

model_class = Qwen2SentenceEmbedding
Collaborator commented:
Change this later; we could add an Auto class instead of hard-coding the model class.
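A sketch of the Auto-style dispatch the reviewer suggests. The registry and helper below are hypothetical, not part of this PR; only Qwen2SentenceEmbedding comes from the change itself.

# Import path assumed from the PR's file list (paddlenlp/transformers/qwen2/modeling.py).
from paddlenlp.transformers.qwen2.modeling import Qwen2SentenceEmbedding

# Hypothetical registry; extend it as more backbones gain embedding heads.
SENTENCE_EMBEDDING_CLASSES = {
    "qwen2": Qwen2SentenceEmbedding,
}

def auto_sentence_embedding_class(model_type: str):
    """Resolve the sentence-embedding class for a model type string."""
    try:
        return SENTENCE_EMBEDDING_CLASSES[model_type]
    except KeyError:
        raise ValueError(f"No sentence-embedding class registered for {model_type!r}")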

Comment on lines +264 to +265
trainable_parameters = [p for p in model.parameters() if not p.stop_gradient]
trainer.set_optimizer_grouped_parameters(trainable_parameters)
Collaborator commented:
@lugimzzz, why was this added originally?
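For background (an explanatory aside, not from the thread): Paddle marks frozen parameters with stop_gradient = True, the inverse of PyTorch's requires_grad, so the quoted filter hands the optimizer only parameters that will actually receive gradients.

# Annotated restatement of the quoted lines; behavior is unchanged.
# In Paddle, stop_gradient=True means "frozen", so this keeps optimizer
# state off frozen weights (e.g. under LoRA-style partial fine-tuning).
trainable_parameters = [p for p in model.parameters() if not p.stop_gradient]
trainer.set_optimizer_grouped_parameters(trainable_parameters)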

@DesmonDay force-pushed the add_embedding_trainer branch 2 times, most recently from 1328048 to 63ba9d0 on December 10, 2024 08:09
@DesmonDay force-pushed the add_embedding_trainer branch from 63ba9d0 to 0a618b0 on December 10, 2024 13:43
# Detecting last checkpoint.
last_checkpoint = None
if os.path.isdir(training_args.output_dir) and training_args.do_train and not training_args.overwrite_output_dir:
last_checkpoint = get_last_checkpoint(training_args.output_dir)
Collaborator commented:
Please add a README covering usage and the dataset format.
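For reference, the hunk above is truncated. A typical completion of this last-checkpoint pattern, sketched after the usual PaddleNLP example-script convention (the PR's exact code may differ, and the import paths are assumed):

import os

from paddlenlp.trainer.trainer_utils import get_last_checkpoint  # import path assumed
from paddlenlp.utils.log import logger  # import path assumed

# Detecting last checkpoint.
last_checkpoint = None
if os.path.isdir(training_args.output_dir) and training_args.do_train and not training_args.overwrite_output_dir:
    last_checkpoint = get_last_checkpoint(training_args.output_dir)
    if last_checkpoint is None and len(os.listdir(training_args.output_dir)) > 0:
        raise ValueError(
            f"Output directory ({training_args.output_dir}) already exists and is not empty. "
            "Use --overwrite_output_dir to train from scratch."
        )
    if last_checkpoint is not None:
        logger.info(f"Checkpoint detected, resuming training from {last_checkpoint}.")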

@DesmonDay changed the title from "[Embedding] Add embedding trainer" to "[Embedding] Add embedding training" on Dec 13, 2024
@ZHUI (Collaborator) left a comment:
First, please add a README and sample data. The warm-start resume issue can be analyzed later.

@ZHUI ZHUI merged commit 2231feb into PaddlePaddle:develop Dec 17, 2024
9 of 12 checks passed