Reproducing results on Llama-3.1-8B-Inst #8
Comments
Hi, thank you for your interest in our work! Taking a closer look at the differences between your results and the linked spreadsheet, most numbers at 128k are within 1-2 absolute points of each other, which, unfortunately, is expected given the nondeterministic nature of flash attention and bf16. The largest deviation appears to be in the ICL datasets; I will double-check whether the random seeds are set correctly so that the same demos and shuffled labels are used across different runs.
Quick update: I checked the ICL code, and the problem is that the random seed is not set correctly, which results in different demos and label mappings, and thus high variance between runs. Thanks for prompting me to look into this issue; I will update the code soon, as well as the results on the spreadsheet and in the next iteration of the paper!
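For illustration, here is a minimal sketch of the kind of seeding fix described above. The helper names (`set_icl_seed`, `sample_icl_demos`) and the `(text, label)` structure of the demo pool are assumptions for this example, not the repo's actual API:

```python
import random

import numpy as np
import torch


def set_icl_seed(seed: int = 42) -> None:
    """Fix the relevant RNGs so that demo selection and label shuffling
    are identical across runs (hypothetical helper, not the repo's API)."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)


def sample_icl_demos(train_pool, num_demos, seed=42):
    """Deterministically sample demonstrations and a shuffled label mapping.

    `train_pool` is assumed to be a list of (text, label) pairs; the real
    ICL code may structure its data differently.
    """
    rng = random.Random(seed)  # local RNG so other code paths are unaffected
    demos = rng.sample(train_pool, num_demos)
    labels = sorted({label for _, label in train_pool})
    shuffled = labels[:]
    rng.shuffle(shuffled)
    label_mapping = dict(zip(labels, shuffled))
    return demos, label_mapping
```

With a fixed seed passed through to every run, the demos and label mapping stay constant, so run-to-run variance should shrink to the flash-attention/bf16 noise mentioned above.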
This is good to know! I will pause running the ICL benchmarks for now and wait for updates from your end. Thank you for getting back to me.
Thank you for your great work!
I tried to reproduce the results for meta-llama/Llama-3.1-8B-Instruct, but I noticed several discrepancies between my outcomes and those reported in the public result file under Llama-3.1-8B-Inst.
You can view my results here: Results. I obtained these results by running `scripts/run_eval_slurm.sh` on both short and long configs for some benchmarks, and I compiled all the results using `scripts/collect_results.py`. Do you have any suggestions on specific things I could adjust to align my results with yours?
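In case it helps to quantify the discrepancies, here is a rough sketch of how one might compare per-task scores between two collected result files. The file names and the `task`/`score` column layout are assumptions for this example; the actual output of `scripts/collect_results.py` may be organized differently:

```python
import csv


def load_scores(path):
    """Read a results CSV with assumed columns 'task' and 'score'."""
    with open(path, newline="") as f:
        return {row["task"]: float(row["score"]) for row in csv.DictReader(f)}


# Hypothetical file names: one collected locally, one exported from the
# public spreadsheet for Llama-3.1-8B-Inst.
mine = load_scores("my_results.csv")
reported = load_scores("reported_results.csv")

for task in sorted(mine.keys() & reported.keys()):
    diff = abs(mine[task] - reported[task])
    flag = "  <-- larger than the expected 1-2 point noise" if diff > 2.0 else ""
    print(f"{task:30s} mine={mine[task]:6.2f} reported={reported[task]:6.2f} "
          f"abs diff={diff:5.2f}{flag}")
```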