Mistral loss instability #26498
Comments
I have tried training with and without max_grad_norm: 1.0. I've basically run out of hyperparams to tune, with several runs on fresh venvs. |
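For readers unfamiliar with the max_grad_norm setting mentioned above, here is a minimal pure-Python sketch of gradient clipping by global norm. It is an illustration only: the function name `clip_by_global_norm` and the toy gradients are hypothetical, and the small `eps` in the denominator is an assumption about how such clipping is usually stabilised, not a claim about the trainer's exact code.

```python
import math

def clip_by_global_norm(grads, max_norm=1.0, eps=1e-6):
    """Scale all gradients so their combined L2 norm is at most max_norm.

    grads is a list of lists of floats (one inner list per parameter tensor).
    Returns (clipped_grads, total_norm_before_clipping).
    """
    total_norm = math.sqrt(sum(g * g for tensor in grads for g in tensor))
    # Never scale up: gradients already below max_norm are left unchanged.
    scale = min(1.0, max_norm / (total_norm + eps))
    clipped = [[g * scale for g in tensor] for tensor in grads]
    return clipped, total_norm

# Toy gradients (made up for illustration); their global norm is 5.0.
grads = [[3.0, 0.0], [0.0, 4.0]]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for t in clipped for g in t))
print(norm_before, norm_after)
```

Clipping bounds the size of each update but does not change its direction, so it can hide a loss spike without removing its cause.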
I am facing the same issue, and the loss is going up while finetuning on the Dolly-15k dataset. |
I am using SFTTrainer from trl. Note that both runs failed. The orange one cannot converge. The green one dropped to loss=0.0, but in fact the model produced garbage. |
@teknium1 these both 404 😞 |
Sorry, my projects default to private, public'ed them |
How did you load your model? |
with transformers? or do you mean precision? |
I was just wondering if you used one of the HuggingFace AutoModel classes or if you loaded it using the Mistral reference implementation. |
MistralForCausalLM |
I see. I guess one idea to sanity check could be to load the model using the reference implementation and ensure it behaves similarly to the HuggingFace version. |
Do you mean outside of huggingface/hf trainer? The mistral dev did do this, we have totally different training results when he trains the same dataset, same hyperparams, without hf trainer. |
Yeah I mean just making sure both models are behaving similarly for a single forward/backwards pass on the same data without the trainer. If they are the same, then my guess is it probably narrows it down to the Trainer |
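The sanity check described above boils down to feeding identical inputs to both implementations and comparing the resulting numbers. A minimal sketch of the comparison step, assuming the outputs have already been flattened into plain Python lists: the helper names and tolerances are hypothetical, the first list reuses the sampled values quoted later in this thread, and the second list is made up purely for illustration.

```python
def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two equal-length lists."""
    if len(a) != len(b):
        raise ValueError("outputs must have the same number of elements")
    return max(abs(x - y) for x, y in zip(a, b))

def outputs_match(a, b, atol=1e-3, rtol=1e-3):
    """True if every element of a is within atol + rtol * |b| of b."""
    if len(a) != len(b):
        raise ValueError("outputs must have the same number of elements")
    return all(abs(x - y) <= atol + rtol * abs(y) for x, y in zip(a, b))

# Sampled reference values quoted in this thread.
reference_out = [-1.635, 0.4966, -1.647]
# Hypothetical values standing in for the other implementation (made up).
other_out = [-3.1, 1.2, -2.9]

print(max_abs_diff(reference_out, other_out))
print(outputs_match(reference_out, other_out))
```

For bf16 or fp16 runs the tolerances would need to be looser than for fp32, so a mismatch is only meaningful when it is far outside the expected rounding error.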
Indeed, they are not the same. They are actually completely inverse lol |
interesting. |
Trying the Pippa-ShareGPT dataset from huggingface, the loss is big. https://huggingface.co/Undi95/Mistral-pippa-sharegpt-7b-qlora Results are not what I expected, and I can't find a way to train properly. |
I made a script that compares the last hidden state embeddings of both models. Sampled values from Mistral embedding: [[-1.635 0.4966 -1.647 ] See the comparison script at https://github.com/bdytx5/mistral7B_finetune/blob/main/train/dev/cmp_models.py Also, you will have to add
into the 'transformer' class of the reference implementation |
So is this the cause of the loss issues or just a cleaner more proper implementation? |
It's definitely possible that a difference in initial weights is causing the strange training behavior. I might try using the official weights and converting it with their script to make sure the weights on huggingface are the same as the official weights. One thing I have noticed is the config class for the model has default "rms_norm_eps": 1e-06 where the config used on huggingface hub uses 1e-05. I'm not sure if this matters but I might try converting the weights to make sure that they were originally converted using the right config. You can find the default config here https://github.com/huggingface/transformers/blob/main/src/transformers/models/mistral/configuration_mistral.py |
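To see why the rms_norm_eps value could matter, here is a pure-Python sketch of RMS normalisation evaluated with both epsilons. It assumes the standard RMSNorm formula (divide by the square root of the mean square plus eps, then multiply by a learned weight); the input vectors are made up for illustration and are not taken from the model.

```python
import math

def rms_norm(x, weight, eps):
    """Standard RMSNorm: x / sqrt(mean(x^2) + eps) * weight, element-wise."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [w * v * inv_rms for v, w in zip(x, weight)]

weight = [1.0, 1.0, 1.0]
large = [1.0, -2.0, 1.5]        # activations far above sqrt(eps) (made up)
small = [1e-3, -2e-3, 1.5e-3]   # activations comparable to sqrt(eps) (made up)

for name, x in (("large", large), ("small", small)):
    out_1e6 = rms_norm(x, weight, eps=1e-6)
    out_1e5 = rms_norm(x, weight, eps=1e-5)
    print(name, out_1e6[0] / out_1e5[0])
```

In this sketch the two epsilons give nearly identical results for ordinary activations and differ noticeably only when the mean square of the input is close to eps, so whether the mismatch affects training depends on the activation scale inside the real model.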
To follow up Tek, after looking a little closer at the final layer embeddings: Sampled values from Mistral embedding: [[-1.635 0.4966 -1.647 2.324 -0.1011 ] The huggingface outputs seem pretty high in comparison to the official ones, which does seem suspicious... |
Reading through the thread and the options you have tried, I first suspected that the issue might come from the new window causal mask.
1- Using vanilla causal mask
I have fine-tuned the 7B using QLoRA, this script, and a context length of 512 with a sliding window size of 256 to make sure the sliding window mask behaves correctly: https://gist.github.com/younesbelkada/9f7f75c94bdc1981c8ca5cc937d4a4da with model_id changed to mistral 7b and with packing. Here is the behaviour of the losses. Despite the model not converging as "nicely" as the ideal loss curve you shared, the model manages to produce generations that are coherent with the Guanaco dataset.
Model weights here: https://huggingface.co/ybelkada/mistral-7b-guanaco
What @bdytx5 said makes sense: there might be some differences between the original model's logits and ours. Indeed, the HF version uses 1e-5: https://huggingface.co/mistralai/Mistral-7B-v0.1/blob/main/config.json#L16 whereas mistral uses 1e-6: https://github.com/mistralai/mistral-src/blob/main/mistral/model.py#L129
@teknium1 can you try to run a training with this version of the model instead: https://huggingface.co/mistralai/Mistral-7B-v0.1/discussions/35 just pass |
I haven't looked into much detail yet, but the mask seems to unconditionally attend to cached key/values. Shouldn't the sliding window apply to cached key/values as well?
(In the case of generating a batch of single tokens at a time, there is also https://github.com/huggingface/transformers/blob/ae9a344cce52ff244f721425f660b55ebc522b88/src/transformers/models/mistral/modeling_mistral.py#L795C30-L795C30, which skips applying the window to the k/v cache.) |
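The question above is whether cached keys outside the window are masked as well. A small pure-Python sketch of a sliding-window causal mask that treats cached and new positions uniformly by absolute position: it assumes a query may attend to at most `window` keys including itself, and the exact off-by-one convention in the real implementations may differ.

```python
def sliding_window_mask(num_queries, num_cached, window):
    """Return mask[i][j] == True if new query i may attend to key j.

    Keys are the cached positions 0..num_cached-1 followed by the new
    positions; query i sits at absolute position num_cached + i.
    """
    total = num_cached + num_queries
    mask = []
    for i in range(num_queries):
        q_pos = num_cached + i
        row = [
            (k_pos <= q_pos) and (q_pos - k_pos < window)  # causal AND in window
            for k_pos in range(total)
        ]
        mask.append(row)
    return mask

# Generating one token with 5 cached positions and a window of 3:
# only the last two cached keys plus the new token itself are visible.
for row in sliding_window_mask(num_queries=1, num_cached=5, window=3):
    print(["x" if allowed else "." for allowed in row])
```

A mask that ignores `num_cached` would instead let the single new query see every cached key, which is the behaviour being questioned here.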
Next time I try a full finetune I will. I actually did succeed at training airoboros' dataset over mistral 7b, with a qlora. Leading me to one of two conclusions: One (or more) of the datasets for hermes 2.0 is malformed, or, qlora is the only way to get the reliable training/good loss curves that I want atm. Will try with the revision next full finetune I try. |
On a side note about Mistral, @younesbelkada: when I run inference with 7b Mistral on a 4090, with just 2k max seq length, it uses >24GB of VRAM. It hits 23.3GB of VRAM used, then starts offloading to CPU. The code I run to make this happen:
import torch  #, json, os, sys
import time
from transformers import AutoTokenizer, AutoModelForCausalLM
from transformers import LlamaTokenizer, LlamaForCausalLM, MistralForCausalLM
#import bitsandbytes

# Load the tokenizer and the fine-tuned checkpoint from a local directory
tokenizer = LlamaTokenizer.from_pretrained('./collectivecognition-run6', trust_remote_code=True)
model = MistralForCausalLM.from_pretrained(
    "./collectivecognition-run6",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    use_flash_attention_2=True
)

benchmarks = [
    "Hello, tell me about the history of the United States",
    "Roleplay as a scientist, who just discovered artificial general intelligence. What do you think about this discovery? What possibilities are there now?"]

index = 0
for obj in benchmarks:
    index += 1
    if index < 1:
        continue
    else:
        start_time = time.time()  # Start timing
        prompt = f"USER:\n{obj}\n\nASSISTANT:\n"
        input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda")
        generated_ids = model.generate(input_ids, max_new_tokens=2048, temperature=None)  #, do_sample=True, eos_token_id=tokenizer.eos_token_id)
        # Decode only the newly generated tokens, skipping the prompt
        response = tokenizer.decode(generated_ids[0][input_ids.shape[-1]:], skip_special_tokens=True, clean_up_tokenization_spaces=True)
        print(f"Response {index}: {response}")
        end_time = time.time()  # End timing
        elapsed_time = end_time - start_time  # Calculate time taken for the iteration
        print(f"Time taken for Response {index}: {elapsed_time:.4f} seconds")
        print(f"tokens total: {len(tokenizer.encode(response))}")
|
@teknium1 Check the results of my benchmark here: #26464 (comment) |
Did you see my other comment above wrt inference and rotating cache? |
If you use the vanilla HF attention, yes, that is the case: we did not implement the rotating buffer cache mechanism, as it requires an important refactor. However, we tried to mimic the rotating buffer caching mechanism, constraining it to the case where padding_side=left for FA-2 models, by shifting the cache and slicing out the previous tokens when generating the next token. See my benchmarks here for more details: #26464 (comment) |
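As a rough illustration of the rotating buffer idea discussed here, the sketch below stores the entry for position i in slot i % window, so memory stays bounded by the window size. It is a toy model using plain Python lists, not the reference implementation or the transformers code; the class name `RotatingCache` is hypothetical.

```python
class RotatingCache:
    """Fixed-size cache where position i overwrites slot i % window."""

    def __init__(self, window):
        self.window = window
        self.slots = [None] * window
        self.num_seen = 0  # total number of positions appended so far

    def append(self, item):
        self.slots[self.num_seen % self.window] = item
        self.num_seen += 1

    def contents(self):
        """Return the cached items in chronological order, oldest first."""
        if self.num_seen <= self.window:
            return self.slots[:self.num_seen]
        start = self.num_seen % self.window  # slot holding the oldest item
        return self.slots[start:] + self.slots[:start]

cache = RotatingCache(window=3)
for position in range(5):
    cache.append(position)  # stand-in for the key/value of this position
print(cache.slots)       # physical layout after wrapping around
print(cache.contents())  # logical order: the last `window` positions
```

The "shift and slice" approach described in the comment above reaches the same logical contents by dropping the oldest entry each step, at the cost of copying the cache instead of overwriting in place.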
ok I get it, here: https://github.com/huggingface/transformers/pull/26464/files#diff-fa1653b47666859672060712644a8c40b2e61eb1b79c06a21f9b94569217ed43R372-R393 |
yes exactly
No you can scale to very large sequence length as the cache will be always having Per my understanding (cc @timlacroix please correct me if I am wrong) since we always use absolute positional embedding the model is able to keep the whole context even if we go beyond 4096 tokens. |
hmm the cache size is not the only limiting factor. You still need to forward the full sequence to the model, and the flash2 still happens with the full length even if the mechanism makes it linear to length (and not quadratic) |
but that's the case in any case, right? For the first forward, if you pass a large context you'll need to compute the attention scores on all tokens. |
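To make the "linear and not quadratic" point concrete, this sketch counts how many query/key pairs are scored for a prompt of n tokens under a full causal mask versus a sliding window. It assumes each query attends to at most `window` keys including itself, and it counts scores only, ignoring every other cost of the forward pass.

```python
def causal_pairs(n):
    """Number of (query, key) pairs under a full causal mask."""
    return n * (n + 1) // 2

def windowed_pairs(n, window):
    """Number of (query, key) pairs when each query sees at most `window` keys."""
    return sum(min(i + 1, window) for i in range(n))

window = 4096
for n in (4096, 8192, 32768):
    print(n, causal_pairs(n), windowed_pairs(n, window))
```

Below the window size both counts coincide. Beyond it, the windowed count grows by a constant `window` per extra token, while the full causal count keeps growing with the sequence length, and the full sequence still has to be passed through the model on the first forward either way.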
@younesbelkada @bdytx5 @vince62s @arthurmensch Okay, update on the issue. The above image is testing with deepspeed zero 2 vs FSDP. Zero 2 is the run with the more stable trajectory. Same hyperparams on all else. I feel like I tested with zero3 in the past and found the same as the FSDP run, a U-shaped pattern, but I am not sure atm. At the moment I don't know if it is being caused by axolotl's interactions with FSDP, or if it is something in transformers/accelerate/who knows what. But this seems like an important development in figuring out what's going on. Not sure how much you guys can look into it, but I figured I'd place the info here in case it isn't axolotl's code. However, it still looks far better than my loss curves on runs with much lower LRs than the one above (it has 2.5e-5) |
hi @nps798 |
Thanks for your reply. I'll give it a try soon. BTW, I have just encountered another issue with my previous float16 and padding left setting, qlora
Nothing was printed. So... input[0] has nans Detected inf/nan during batch_number=54681 |
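A message like "Detected inf/nan during batch_number=..." comes from a check for non-finite values. Here is a minimal stdlib sketch of such a check over plain Python lists: the function name and the toy batches are hypothetical, and in a real run the same idea would be applied to tensors (inputs, activations, or gradients) rather than lists.

```python
import math

def first_non_finite(batches):
    """Return (batch_number, index) of the first NaN or inf value, or None."""
    for batch_number, batch in enumerate(batches):
        for index, value in enumerate(batch):
            if not math.isfinite(value):  # catches NaN, +inf and -inf
                return batch_number, index
    return None

# Toy data (made up): the second batch contains a NaN.
batches = [[0.5, 1.25], [2.0, float("nan")], [float("inf"), 0.0]]
print(first_non_finite(batches))
```

Knowing the first batch where a non-finite value appears makes it possible to replay just that batch and inspect whether the inputs or the model produced it.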
@younesbelkada thank you. I successfully ran QLoRA fine-tuning for 5 epochs without exploding loss or zero loss. I will keep experimenting with other combinations of parameters. |
Hi everyone |
I think this was a misunderstanding, and actually it's not successfully training. However @tmabraham did show a workaround in that thread. |
Hello, I ran the experiment below to check whether fine-tuning Mistral with FSDP works as expected. Below are the results:
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: 'no'
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch_policy: BACKWARD_PRE
  fsdp_cpu_ram_efficient_loading: true
  fsdp_forward_prefetch: false
  fsdp_offload_params: false
  fsdp_sharding_strategy: 1
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_use_orig_params: true
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
accelerate launch \
--config_file configs/fsdp_config.yaml \
train.py \
--model_name "mistralai/Mistral-7B-v0.1" \
--dataset_name "smangrul/chat-instruct-mixer" \
--max_seq_len 4096 \
--max_steps 5000 \
--logging_steps 25 \
--eval_steps 1000 \
--save_steps 1000 \
--bf16 True \
--packing True \
--output_dir "/fsx/sourab/experiments/full-finetune-mistral-7b-fsdp-chat-asst" \
--per_device_train_batch_size 1 \
--gradient_accumulation_steps 1 \
--dataset_text_field "content" \
--use_gradient_checkpointing False \
--learning_rate 5e-6 \
--lr_scheduler_type "cosine" \
--weight_decay 0.01 \
--warmup_ratio 0.03 \
--max_grad_norm 1.0 \
--use_flash_attn True
|
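The learning-rate flags in the command above (learning_rate 5e-6, cosine scheduler, warmup_ratio 0.03, max_steps 5000) imply a schedule that can be sketched in a few lines. This is a simplified model under two assumptions: warmup is linear from zero over ceil(max_steps * warmup_ratio) steps, and the cosine decays to zero at the last step. The function name `lr_at_step` is hypothetical, and the real scheduler may differ in details.

```python
import math

def lr_at_step(step, max_steps=5000, base_lr=5e-6, warmup_ratio=0.03):
    """Linear warmup followed by cosine decay to zero (simplified sketch)."""
    warmup_steps = math.ceil(max_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Fraction of the post-warmup phase that has elapsed, in [0, 1].
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

for step in (0, 75, 150, 2500, 5000):
    print(step, lr_at_step(step))
```

With these settings the warmup covers only a small fraction of the run, so almost all of the training happens on the decaying part of the curve.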
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored. |
Is this solved due to the previous mention? |
System Info
Hello, I've been working with dhokas, who finetuned Mistral's official instruct model. I have been trying to finetune Mistral with several datasets over dozens of ablations. There is severe loss instability when training this model with transformers that never seems to appear in his training runs, which do not use the HF trainer.
I am opening this so we can get to the bottom of this. Here are some of my runs using axolotl with some datasets.
With hermes 2.0 dataset (unpublished):
https://wandb.ai/teknium1/hermes2.0-mistral-7b?workspace=user-teknium1
With Teknium/GPT4-LLM-CLEANED dataset
https://wandb.ai/teknium1/gpt4llm-mistral-7b
With a 5-sequences run to ensure loss goes to 0 (that memorization is occurring):
https://wandb.ai/teknium1/5seq-mistral-7b?workspace=user-teknium1
With OpenHermes dataset teknium1/openhermes:
https://wandb.ai/teknium1/hermes-mistral-7b
As can be seen, the loss charts across all these ablations are unreliable, and the runs generally produce bad results no matter which hyperparams are changed.
The Mistral dev who worked with me trained Mistral with GPT4-LLM cleaned and got this result:
@younesbelkada @muellerz
Who can help?
No response
Reproduction
Train Mistral on any of the above datasets with Mistral's own finetune hyperparams, as reported in Mistral's discord, and see the loss fail to work out.
Expected behavior
A smooth or downward trajectory for the loss.