Reproducing results on Llama-3.1-8B-Inst #8
Comments
Hi, thank you for your interest in our work! Taking a closer look at the differences between your results and the linked spreadsheet, most numbers at 128k are within 1-2 absolute points of each other, which, unfortunately, is expected given the nondeterministic nature of flash attention and bf16. The largest deviation appears to be in the ICL datasets; I will double-check whether the random seeds are set correctly so that the same demos and shuffled labels are used across different runs.
Quick update: I checked the ICL code, and the problem is that the random seed is not set correctly, which results in different demos and label mappings, and thus high variance between runs. Thanks for prompting me to look into this issue; I will update the code soon, as well as the results on the spreadsheet and in the next iteration of the paper!
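For illustration, here is a minimal sketch of the kind of seeding fix described above. The helper names (`set_icl_seed`, `sample_icl_demos`) and the `(text, label)` structure of the demo pool are assumptions for this example, not the repo's actual API:

```python
import random

import numpy as np
import torch


def set_icl_seed(seed: int = 42) -> None:
    """Fix the relevant RNGs so that demo selection and label shuffling
    are identical across runs (hypothetical helper, not the repo's API)."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)


def sample_icl_demos(train_pool, num_demos, seed=42):
    """Deterministically sample demonstrations and a shuffled label mapping.

    `train_pool` is assumed to be a list of (text, label) pairs; the real
    ICL code may structure its data differently.
    """
    rng = random.Random(seed)  # local RNG so other code paths are unaffected
    demos = rng.sample(train_pool, num_demos)
    labels = sorted({label for _, label in train_pool})
    shuffled = labels[:]
    rng.shuffle(shuffled)
    label_mapping = dict(zip(labels, shuffled))
    return demos, label_mapping
```

With a fixed seed passed through to every run, the demos and label mapping stay constant, so run-to-run variance should shrink to the flash-attention/bf16 noise mentioned above.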
This is good to know! I will pause running the ICL benchmarks for now and wait for updates from your end. Thank you for getting back to me.
Thank you for your great work!
I tried to reproduce the results for meta-llama/Llama-3.1-8B-Instruct, but I noticed several discrepancies between my outcomes and those reported in the public result file under Llama-3.1-8B-Inst.
You can view my results here: Results. I obtained these results by running `scripts/run_eval_slurm.sh` on both short and long configs for some benchmarks, and I compiled all the results using `scripts/collect_results.py`. Do you have any suggestions on specific things I could adjust to align my results with yours?
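In case it helps to quantify the discrepancies, here is a rough sketch of how one might compare per-task scores between two collected result files. The file names and the `task`/`score` column layout are assumptions for this example; the actual output of `scripts/collect_results.py` may be organized differently:

```python
import csv


def load_scores(path):
    """Read a results CSV with assumed columns 'task' and 'score'."""
    with open(path, newline="") as f:
        return {row["task"]: float(row["score"]) for row in csv.DictReader(f)}


# Hypothetical file names: one collected locally, one exported from the
# public spreadsheet for Llama-3.1-8B-Inst.
mine = load_scores("my_results.csv")
reported = load_scores("reported_results.csv")

for task in sorted(mine.keys() & reported.keys()):
    diff = abs(mine[task] - reported[task])
    flag = "  <-- larger than the expected 1-2 point noise" if diff > 2.0 else ""
    print(f"{task:30s} mine={mine[task]:6.2f} reported={reported[task]:6.2f} "
          f"abs diff={diff:5.2f}{flag}")
```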