Feature/fsdp lora #435
Conversation
This looks good. I approve; merge if tests pass.
Hey @danbider, with this PR can we do FSDP LoRA on LLaMA models? Basically any model other than MPT?
It should support any model, including MPT.
I'm a fan of the re-design -- in particular moving things into the … .

One area that gives me pause is integration with HF for models that have been LoRA-fied. Are there any gotchas when it comes to converting to HF (and uploading to HF) from a Composer checkpoint? Similarly, are there any gotchas when working with an HF model that is already LoRA-fied? My sense is that, with the latter, everything should be OK as long as the right things are installed, but I'd like a sanity check on that. The former seems like it will still be missing support.

@dakinggg can probably add some insight here, because I'm worried about the code that actually modifies the model (in …).
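For concreteness, the round-trip in question can be sketched with the standard `peft` API, outside this repo's code (model name and paths here are placeholders, and the PR's own conversion path may differ):

```python
# Hedged sketch of the HF <-> LoRA round-trip being asked about, using the
# standard `peft` API directly (not this repo's code); paths are placeholders.
from peft import LoraConfig, PeftModel, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained('mosaicml/mpt-7b',
                                            trust_remote_code=True)

# LoRA-fy an HF model in-process.
lora_model = get_peft_model(
    base,
    LoraConfig(r=1, lora_alpha=32, lora_dropout=0.05,
               target_modules=['up_proj', 'down_proj'],
               task_type='CAUSAL_LM'))

# Saving a LoRA-fied model writes only the small adapter weights...
lora_model.save_pretrained('./lora-adapter')

# ...and an "already LoRA-fied" model is reloaded by pairing the adapter with
# a fresh copy of its base model, so the main requirement is having peft
# installed.
fresh_base = AutoModelForCausalLM.from_pretrained('mosaicml/mpt-7b',
                                                  trust_remote_code=True)
reloaded = PeftModel.from_pretrained(fresh_base, './lora-adapter')

# Converting back to a plain HF checkpoint (e.g. for upload) requires merging
# the adapters into the base weights first.
merged = reloaded.merge_and_unload()
merged.save_pretrained('./merged-hf-model')
```

The Composer-checkpoint-to-HF direction is the part this sketch does not cover, which matches the concern above.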
Could you please also add some basic tests for the LoRA addition?
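Something along these lines might serve as a starting point; this is only a sketch against `peft` itself with a small stand-in model, since the exact builder API in this PR may differ:

```python
# Hypothetical minimal test for the LoRA addition. It exercises peft directly
# with a tiny stand-in model (gpt2) rather than this repo's builders, whose
# exact API this sketch does not assume.
import pytest


def test_lora_leaves_only_adapter_params_trainable():
    peft = pytest.importorskip('peft')
    transformers = pytest.importorskip('transformers')

    base = transformers.AutoModelForCausalLM.from_pretrained('gpt2')
    model = peft.get_peft_model(
        base, peft.LoraConfig(r=1, lora_alpha=32, target_modules=['c_attn']))

    trainable = [n for n, p in model.named_parameters() if p.requires_grad]
    assert trainable, 'expected some trainable (LoRA) parameters'
    assert all('lora' in n for n in trainable), \
        'only LoRA parameters should require grad'
```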
TUTORIAL.md (Outdated)

<!--pytest.mark.skip-->
```yaml
fsdp_config:
  use_orig_params: true
```
Can we confirm if this is necessary?
will verify this tomorrow AM, good point
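For context on why the flag tends to come up with LoRA (my assumption, not something verified in this PR): the base weights stay frozen while only the adapters train, so FSDP's flattened parameter groups end up mixing `requires_grad` values, and `use_orig_params: true` is the FSDP option that tolerates that mix. A toy illustration of the pattern:

```python
# Toy module mimicking a LoRA-fied linear layer: frozen base weights plus
# small trainable adapters. The mixed requires_grad pattern printed below is
# what (I believe) makes use_orig_params: true necessary under FSDP, which
# otherwise expects uniform requires_grad within each flattened parameter.
import torch.nn as nn


class ToyLoRALinear(nn.Module):

    def __init__(self, d: int, r: int = 1):
        super().__init__()
        self.base = nn.Linear(d, d)                # "pretrained", frozen
        self.lora_a = nn.Linear(d, r, bias=False)  # trainable adapter
        self.lora_b = nn.Linear(r, d, bias=False)  # trainable adapter
        self.base.weight.requires_grad_(False)
        self.base.bias.requires_grad_(False)

    def forward(self, x):
        return self.base(x) + self.lora_b(self.lora_a(x))


layer = ToyLoRALinear(8)
print({name: p.requires_grad for name, p in layer.named_parameters()})
# {'base.weight': False, 'base.bias': False,
#  'lora_a.weight': True, 'lora_b.weight': True}
```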
TUTORIAL.md (Outdated)

or default to DDP, as follows:
I think that to default to DDP, just leaving out the FSDP section entirely is a bit cleaner?
scripts/train/train.py (Outdated)

```python
    'lora',
    must_exist=False,
    default_value=None)
if lora_config is not None:
    if lora_config.get('rank', None) is not None:
```
What is supposed to happen if the lora config is provided but rank is None? Should that be an error?
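One possible answer, sketched as a hypothetical helper rather than the PR's actual behavior: fail fast when a `lora` section is present without a usable rank.

```python
# Hypothetical validation for the question above: error out if a lora section
# exists but no rank is given, instead of silently skipping LoRA.
from typing import Optional

from omegaconf import DictConfig


def validate_lora_config(lora_config: Optional[DictConfig]) -> None:
    if lora_config is None:
        return  # no LoRA requested; nothing to check
    if lora_config.get('rank', None) is None:
        raise ValueError(
            'A lora config was provided but `rank` is missing or None; '
            'set a positive lora rank or remove the lora section entirely.')
```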
edit from daniel
Co-authored-by: Daniel King <43149077+dakinggg@users.noreply.github.com>
will add those now.
Any chance we could include an example YAML config like this?

```yaml
max_seq_len: 2048
global_seed: 17
dist_timeout: 5400

# Run Name
run_name:  # If left blank, will be read from env var $RUN_NAME

model:
  name: hf_causal_lm
  pretrained: true
  pretrained_model_name_or_path: mosaicml/mpt-7b
  config_overrides:
    attn_config:
      attn_impl: triton
      attn_uses_sequence_id: false

lora:
  # UPDATE these as needed
  args:
    r: 1
    lora_alpha: 32
    # target_modules: ["Wqkv", "out_proj", "up_proj", "down_proj"]
    target_modules: ["up_proj", "down_proj"]
    lora_dropout: 0.05
    bias: none
    task_type: "CAUSAL_LM"

# Tokenizer
tokenizer:
  name: mosaicml/mpt-7b
  kwargs:
    model_max_length: ${max_seq_len}

# Dataloaders
train_loader:
  name: finetuning
  dataset:
    hf_name: danbider/codegen
    split: train
    max_seq_len: ${max_seq_len}
    allow_pad_trimming: false
    decoder_only_format: true
    packing_ratio: 19.6  # concat examples
    shuffle: true
  drop_last: true
  num_workers: 8
  pin_memory: false
  prefetch_factor: 2
  persistent_workers: true
  timeout: 0

eval_loader:
  name: finetuning
  dataset:
    hf_name: danbider/codegen
    split: test
    max_seq_len: ${max_seq_len}
    allow_pad_trimming: false
    decoder_only_format: true
    packing_ratio: 19.6
    shuffle: true
  drop_last: true
  num_workers: 8
  pin_memory: false
  prefetch_factor: 2
  persistent_workers: true
  timeout: 0

# Optimization
# Based on MPT pretraining
scheduler:
  name: cosine_with_warmup
  t_warmup: 50ba
  alpha_f: 0.0

optimizer:
  name: decoupled_lionw
  lr: 1.0e-4  # lora needs higher LR
  betas:
  - 0.9
  - 0.95
  eps: 1.0e-8
  weight_decay: 1.0e-4

algorithms:
  gradient_clipping:
    clipping_type: norm
    clipping_threshold: 1.0

max_duration: 2ep
eval_interval: 1ep
eval_first: true
global_train_batch_size: 48

# System
seed: ${global_seed}
device_eval_batch_size: 4
device_train_microbatch_size: 1
precision: amp_bf16

# FSDP
# fsdp_config:
#   sharding_strategy: FULL_SHARD
#   mixed_precision: PURE
#   activation_checkpointing: true
#   activation_checkpointing_reentrant: false
#   activation_cpu_offload: false
#   limit_all_gathers: true
#   verbose: false
# leave out fsdp_config for DDP or single GPU

# Logging
progress_bar: false
log_to_console: true
console_log_interval: 1ba

callbacks:
  speed_monitor:
    window_size: 10
  lr_monitor: {}
  memory_monitor: {}
  runtime_estimator: {}

# loggers:
#   wandb: {}

# Checkpoint to local filesystem or remote object store
save_interval: 5000ba
save_num_checkpoints_to_keep: 1  # Important, this cleans up checkpoints saved to DISK
save_folder: ./{run_name}/checkpoints
# save_folder: s3://my-bucket/my-folder/{run_name}/checkpoint
```

Any example we could point users to would be great, even if we plan on refining it later.
Hi, I tested your branch and found some bugs.
I'm using your suggestion from … and I'm using LLaMA-2 LoRA.
So, let's start with bug №1: you define … here, and here you build the model, but the lora section is already gone by that point. It happens because you pop the lora section here and don't use it anywhere else, only for …

You should either not pop it, or use the function from ….

About bug №2: we build …, and later you re-initialize your LoRA model, so we get an error here because of FSDP, here. For …, I think we need to rename this for LoRA LLaMA, or refactor it in some other way.

Thanks! I hope we will soon be able to train ….
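For bug №1, the kind of fix one would expect (a sketch only; the actual helper names in train.py may differ) is to read the `lora` section without popping it, so that later consumers of the config still see it:

```python
# Sketch of a non-destructive read of the lora section for bug №1;
# the config contents here are made up for illustration.
from omegaconf import OmegaConf

cfg = OmegaConf.create({
    'model': {'name': 'hf_causal_lm'},
    'lora': {'args': {'r': 1, 'lora_alpha': 32}},
})

# Select instead of pop: the lora section stays in cfg for anything built later.
lora_cfg = OmegaConf.select(cfg, 'lora', default=None)
if lora_cfg is not None:
    print('building LoRA wrapper with', OmegaConf.to_container(lora_cfg.args))

assert 'lora' in cfg  # unlike cfg.pop('lora'), nothing was removed
```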
Thanks for this. Fixed the first one, we think. Will take care of the second as well.
For Jose: two main issues with this PR currently that I know of:
Force-pushed from 1351637 to 5b905b0.
Closing in favor of #886.
Add a fix to wrap trainable LoRA modules with FSDP.
Verified successful training on 8 GPUs.
Related, but not affecting this PR: @bcui19 and I discussed a small enhancement to Composer to accompany this PR, which will spare big untrained modules from being fetched by FSDP.
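A sketch of what "wrap trainable LoRA modules with FSDP" could look like, assuming Composer's convention of honoring a per-module `_fsdp_wrap` attribute in its auto-wrap policy (the function name is illustrative, and the PR's actual implementation may differ):

```python
# Illustrative helper: flag every submodule that owns a trainable (LoRA)
# parameter so Composer's FSDP auto-wrap policy treats it as its own unit,
# keeping frozen base weights and trainable adapters in separate wraps.
import torch.nn as nn


def mark_trainable_modules_for_fsdp_wrap(model: nn.Module) -> None:
    for module in model.modules():
        owns_trainable = any(
            p.requires_grad for p in module.parameters(recurse=False))
        if owns_trainable:
            # Composer checks for this attribute when deciding what to wrap.
            module._fsdp_wrap = True  # type: ignore[attr-defined]
```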