About RepLLaMA #103
Hi Xueguang, @MXueguang Thank you very much for sharing your code. However, when I tested it on a small MS MARCO passage test corpus (the first 100 passages), I ran into an issue: after encoding, the embeddings of some passages came out as NaN. Have you encountered this problem? The part of your code that I modified is here: tevatron/examples/repllama/utils.py, line 41 (commit 2e5d00e).
Please forgive my limited experience in this area; your insights would be greatly appreciated. Here are the changes I made:
Modified to:
My transformers version is 4.31.0; I think later versions have some issue here. By the way, the RepLLaMA code in Tevatron is a re-implementation, and due to limited resources I didn't get the chance to do very detailed tests. Feel free to let me know about any issues.
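For reference, a quick way to check that the installed version matches the one mentioned above; the "newer than 4.31.0 may be problematic" cutoff is an assumption taken from this comment, not a documented requirement:

```python
import transformers
from packaging import version

# The comment above reports 4.31.0 working and later versions being problematic,
# so warn if the installed version is newer (assumption based on this thread).
if version.parse(transformers.__version__) > version.parse("4.31.0"):
    print(f"Warning: transformers {transformers.__version__} is newer than 4.31.0; "
          "the RepLLaMA example discussed here was tested with 4.31.0.")
```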
OK, so I only need to comment out this line of code?
Yes, in train.py and encode.py.
Hi Xueguang, I think I've found the cause of the NaN embeddings. The problem occurs when we use fp16 during encoding; when we switch to fp32, everything seems fine. By the way, could I ask you to provide the training data (or the CoCondenser hard negatives) for MS MARCO passage/doc used in your paper 'Fine-Tuning LLaMA for Multi-Stage Text Retrieval'?
It's a bit weird that fp16 doesn't work... the model was fine-tuned with fp16. I'll take a look. I created training data for RepLLaMA in Tevatron format; it can be downloaded here.
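For anyone hitting the same issue, here is a minimal sketch of the kind of check being described: encode a few passages in fp16, test the output for NaNs, and fall back to fp32. The model name, pooling, and normalization here are illustrative assumptions, not the exact Tevatron/RepLLaMA encoding code.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Minimal sketch, not the Tevatron encode code: model name and last-token
# pooling are assumptions for illustration.
def encode(texts, dtype=torch.float16, model_name="meta-llama/Llama-2-7b-hf"):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.padding_side = "left"  # so position -1 is the real last token
    model = AutoModel.from_pretrained(model_name, torch_dtype=dtype).cuda().eval()
    batch = tokenizer(texts, padding=True, truncation=True, max_length=256,
                      return_tensors="pt").to("cuda")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state   # [batch, seq_len, dim]
        reps = hidden[:, -1]                        # last-token pooling (assumption)
        reps = torch.nn.functional.normalize(reps, p=2, dim=-1)
    return reps

passages = ["example passage 1", "example passage 2"]
reps = encode(passages, dtype=torch.float16)
if torch.isnan(reps).any():
    # fp16 has a narrow range; large activations can overflow to inf, and the
    # subsequent normalization then yields NaN. Re-encoding in fp32 avoids this.
    reps = encode(passages, dtype=torch.float32)
print("NaN embeddings:", torch.isnan(reps).any().item())
```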
Hi @sunxiaojie99, are you getting a training log similar to the one in #104?
I just completed the test on the small corpus. I will run the entire process later and then confirm this.
Thanks for sharing! Does this JSON file contain both the MS MARCO passage and document datasets?
I trained RepLLaMA on V100 GPUs, which only support fp16. When I added the implementation to Tevatron I was working on A6000s, so bf16 also works there. But the released model was trained with fp16. I'll take a look at the NaN issue next week. The data in the link above is the training data for passage ranking.
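A small sketch of how the dtype choice could be automated based on what the GPU supports; the fallback logic is an assumption for illustration, not part of the Tevatron configuration:

```python
import torch

def pick_encoding_dtype() -> torch.dtype:
    # Ampere-class GPUs (A100/A6000/A800) support bf16, which keeps the fp32
    # exponent range and avoids the fp16 overflow that can produce NaNs;
    # V100-class GPUs only offer fp16.
    if torch.cuda.is_available() and torch.cuda.is_bf16_supported():
        return torch.bfloat16
    return torch.float16

print(pick_encoding_dtype())
```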
Okay, I sincerely appreciate your help! Please let me know when the document data is ready.
Hi Xueguang, sorry to bother you again. I have completed the training process for RepLLaMA. However, it seems that encoding the MS MARCO passage corpus requires at least 300 hours, and I've noticed that Tevatron doesn't support multi-GPU encoding. Could you tell me how long the encoding process took for you? Also, is the document data ready? Haha.
Hi Xiaojie, 300 hours on a single GPU is reasonable.
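One common way around the single-GPU bottleneck is to split the corpus into shards and encode each shard in a separate per-GPU process. The sketch below shows that pattern only; the `encode_shard` worker is a hypothetical placeholder, not Tevatron's actual CLI, and the real encoder call would replace the placeholder line.

```python
import multiprocessing as mp
import os
import pickle

def split_corpus(corpus_path: str, num_shards: int) -> list[str]:
    """Split a JSONL corpus into num_shards files by striding over lines."""
    with open(corpus_path) as f:
        lines = f.readlines()
    paths = []
    for i in range(num_shards):
        path = f"corpus.shard{i:02d}.jsonl"
        with open(path, "w") as out:
            out.writelines(lines[i::num_shards])
        paths.append(path)
    return paths

def encode_shard(gpu_id: int, shard_path: str, out_path: str) -> None:
    """Hypothetical worker: pin the process to one GPU and encode its shard."""
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    with open(shard_path) as f:
        passages = [line.strip() for line in f]
    # Placeholder "embeddings": swap in the real encoder call (e.g. the fp32
    # encode() sketch above) to turn passages into dense vectors.
    embeddings = [[float(len(p))] for p in passages]
    with open(out_path, "wb") as out:
        pickle.dump(embeddings, out)

if __name__ == "__main__":
    shards = split_corpus("corpus.jsonl", num_shards=4)  # one shard per GPU
    procs = [
        mp.Process(target=encode_shard, args=(i, shard, f"emb.shard{i:02d}.pkl"))
        for i, shard in enumerate(shards)
    ]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```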
Oops, thanks for the reminder... uploading the document data now.
Hi Xiaojie, the processed training data for document ranking is big and hard to upload.
OK, thanks! Actually, I think I only need the CoCondenser-MaxP hard negatives for the document ranking data to reliably reproduce the results of the paper. By the way, is the slim version obtained by sampling a smaller proportion?
The hard negatives should be the top-100 BM25 and top-100 CoCondenser results, but the document contents are not saved in the training data, to save space.
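A rough sketch of how the stripped-out document text could be restored by joining the training file's docids back against the corpus. The field names (`docid`, `title`, `text`, `negative_passages`) are assumptions about a Tevatron-style JSONL layout, not a documented schema:

```python
import json

def load_corpus(corpus_path: str) -> dict[str, dict]:
    # Assumption: JSONL corpus with {"docid": ..., "title": ..., "text": ...} per line.
    corpus = {}
    with open(corpus_path) as f:
        for line in f:
            doc = json.loads(line)
            corpus[str(doc["docid"])] = doc
    return corpus

def rehydrate(train_path: str, corpus: dict[str, dict], out_path: str) -> None:
    """Fill each negative entry's title/text back in from the corpus by docid."""
    with open(train_path) as f, open(out_path, "w") as out:
        for line in f:
            example = json.loads(line)
            for neg in example.get("negative_passages", []):
                doc = corpus[str(neg["docid"])]
                neg["title"] = doc.get("title", "")
                neg["text"] = doc["text"]
            out.write(json.dumps(example) + "\n")
```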
Okay. Would it be convenient to tell me the other parameters, such as the size of p
Hi @sunxiaojie99, sorry I missed your latest comment.
I am getting this error while saving the checkpoint: ValueError: Unsupported model class DenseModel(
Hi, I am trying to reproduce the results of RepLLaMA on an A800 GPU. If I train RepLLaMA from scratch with your code, it looks like it will take about 80 hours; is this normal? If possible, I would also like to know how long it took you to train RepLLaMA (LoRA) on the MS MARCO passage and doc datasets. Thank you very much. @MXueguang