Multi-GPU inference and parameter questions #12

MonrenZheng · 2024-09-03T08:46:02Z

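Why does running the project on multiple GPUs produce garbled inference output and very poor evaluation results? What causes this?
Another question: the paper says the experiments use Llama 2's default parameters, such as temperature. In practice, though, inference seems to use LLaMA-Factory's parameters, where the temperature is 0.95, whereas the model's default temperature is 0.6.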

rickyang1114 · 2024-09-03T09:21:11Z

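Hello, and thank you for your interest in our project!

The vast majority of experiments in this project used a single GPU only. For multi-GPU inference problems, please refer to the original LLaMA-Factory repository. For the parameters, go by what is actually printed to the console.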

MonrenZheng · 2024-09-03T09:24:01Z

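> Hello, and thank you for your interest in our project!
>
> The vast majority of experiments in this project used a single GPU only. For multi-GPU inference problems, please refer to the original LLaMA-Factory repository. For the parameters, go by what is actually printed to the console.

Oh, thanks. And what about the parameter question?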

rickyang1114 · 2024-09-03T09:25:59Z

They should just be LLaMA-Factory's default parameters; I did not tune them.

MonrenZheng · 2024-09-03T09:27:57Z

Got it.
[WARNING|logging.py:328] 2024-09-03 16:13:36,533 >> We detected that you are passing past_key_values as a tuple and this is deprecated and will be removed in v4.43. Please use an appropriate Cache class (https://huggingface.co/docs/transformers/v4.41.3/en/internal/generation_utils#transformers.Cache)

One last question: does this warning have any effect?

rickyang1114 · 2024-09-03T09:28:48Z

I don't think it has any effect. I ignored warnings when running the experiments.

MonrenZheng · 2024-09-03T09:35:10Z

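OK, thanks.
My gsm8k_test results are as follows:

There still seems to be a gap from the reported results. Of the two rows, the upper one is the paper's result and the lower one is mine.
Any advice would be appreciated.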

rickyang1114 · 2024-09-03T09:42:07Z

It may be due to the environment...

MonrenZheng · 2024-09-03T09:48:10Z

Oh, OK, thanks.

SivilTaram · 2024-09-03T11:48:54Z

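> OK, thanks. My gsm8k_test results are as follows: There still seems to be a gap from the reported results. Of the two rows, the upper one is the paper's result and the lower one is mine. Any advice would be appreciated.

@zmr66z6xx6 Could you describe the exact setup in more detail? Did you run the evaluation with the command provided in this repo, or did you change any parameters? Have you checked the data generated by the model's self-distillation?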

SivilTaram · 2024-09-03T11:55:56Z

@zmr66z6xx6 Also, did this problem occur in a single-GPU or a multi-GPU environment?

MonrenZheng · 2024-09-03T12:39:42Z

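> OK, thanks. My gsm8k_test results are as follows: There still seems to be a gap from the reported results. Of the two rows, the upper one is the paper's result and the lower one is mine. Any advice would be appreciated.
>
> @zmr66z6xx6 Could you describe the exact setup in more detail? Did you run the evaluation with the command provided in this repo, or did you change any parameters? Have you checked the data generated by the model's self-distillation?

I ran the main branch and did not change any parameters; I used the original ones.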

MonrenZheng · 2024-09-03T12:40:20Z

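> @zmr66z6xx6 Also, did this problem occur in a single-GPU or a multi-GPU environment?

With multiple GPUs the output is garbled and the results are very poor. What I posted above came from a single-GPU run.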

SivilTaram · 2024-09-04T02:19:04Z

@zmr66z6xx6 You could try the code on the reproduce branch. I'm not sure whether the problem is caused by the latest LLaMA-Factory codebase.

MonrenZheng · 2024-09-04T06:12:36Z

@SivilTaram Got it, thanks.

MonrenZheng · 2024-09-04T14:17:32Z

@SivilTaram Does this warning have any effect?

SivilTaram · 2024-09-04T23:29:10Z

@zmr66z6xx6 No, it has no effect. It only says that this API will be deprecated soon.

MonrenZheng · 2024-09-05T00:27:41Z

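> @zmr66z6xx6 No, it has no effect. It only says that this API will be deprecated soon.

@SivilTaram OK, thanks. So far I have used the branch to get the seed results. The openfunctions result is off by quite a lot: only 10.71%.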

SivilTaram · 2024-09-05T00:41:53Z

@zmr66z6xx6 Do you mean that the seed model's own inference result on openfunctions is only 10.71%?

MonrenZheng · 2024-09-05T00:54:22Z

@SivilTaram Yes. Earlier, on the main branch, the openfunctions test result was also unsatisfactory.

SivilTaram · 2024-09-05T01:17:16Z

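@zmr66z6xx6 The seed model itself has nothing to do with our method; it is just llama-2-chat. What precision did you use for inference, and which GPU? Also, are the results unsatisfactory only on openfunctions, or elsewhere too?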

MonrenZheng · 2024-09-05T01:35:04Z

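@SivilTaram I did not change any parameters; everything is as specified in the project. The GPU is an RTX 3090.
Yes, so far only the openfunctions test is off by a large margin.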

SivilTaram · 2024-09-05T02:12:40Z

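@zmr66z6xx6 OK, thanks for the feedback! Could you first try other datasets on the reproduce branch, for example whether GSM8K reproduces the SDFT vs. SFT results? It sounds like it might be a hardware precision-support issue 😂 but I'm not sure yet.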

MonrenZheng · 2024-09-05T02:20:36Z

@SivilTaram OK, got it, thanks.

MonrenZheng · 2024-09-12T01:00:24Z

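@SivilTaram Results after training on the gsm8k dataset: the discrepancy on openfunction still seems rather large.

The SDFT results above match the paper, but the first two rows do not. Also, regarding the precision issue mentioned earlier: the 3090 does appear to support bf16. By the way, which GPUs were used for the paper?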

rickyang1114 · 2024-09-12T01:24:09Z

Some experiments used 3090s, and some used A800s.

MonrenZheng · 2024-09-12T06:49:28Z

@rickyang1114 May I also ask why the gsm8k results from the branch do not match the paper? The forgetting shown in the paper does not appear.

rickyang1114 · 2024-09-12T07:10:08Z

Probably some small environment differences meant the randomness was not fully eliminated = =

MonrenZheng · 2024-09-12T07:13:30Z

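But it is rather odd that openfunction improves this much here. Also, did the paper use do_sample when running predict? I ran some tasks several times and got exactly the same accuracy every time.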

rickyang1114 · 2024-09-12T07:20:29Z

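HumanEval evaluation was too slow, so I used do_sample False to speed it up. Everything else uses the default LLaMA-Factory predict configuration, which should include sampling. Getting identical results across repeated runs in the same environment is expected, because LLaMA-Factory fixes the random seed.

The results may not reproduce exactly because the environment I ran the experiments in was not identical to the reproduction environment. The difference could come from packages in requirements.txt whose versions are not pinned, or from the operating system... I don't know the exact cause either.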
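To illustrate why sampling with a fixed seed gives the same accuracy on every run, here is a minimal standard-library sketch. It does not use LLaMA-Factory; `sample_tokens`, the three-entry distribution, and the seed values are hypothetical and stand in for seeded token sampling in general.

```python
import random


def sample_tokens(probs, steps, seed):
    """Draw `steps` token ids from a fixed distribution using a seeded RNG."""
    rng = random.Random(seed)  # private generator, so the seed fully determines the draws
    vocab = list(range(len(probs)))
    return [rng.choices(vocab, weights=probs, k=1)[0] for _ in range(steps)]


probs = [0.5, 0.3, 0.2]  # toy next-token distribution (assumption)
run_a = sample_tokens(probs, steps=20, seed=42)
run_b = sample_tokens(probs, steps=20, seed=42)
run_c = sample_tokens(probs, steps=20, seed=43)

print("same seed, identical draws:", run_a == run_b)
print("different seed, identical draws:", run_a == run_c)
```

The draws are still random; they are just reproducible. Seeing run-to-run variation requires changing the seed, not only enabling sampling.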

MonrenZheng · 2024-09-12T07:22:10Z

OK, thanks.

SivilTaram · 2024-09-12T07:34:34Z

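@zmr66z6xx6 I think the openfunctions performance may be related to do sample. Could you turn on do sample and run it several times? HumanEval itself has very few examples, so the variance can easily be large.

The other issue is that the seed model with greedy decoding (do sample=False) should reproduce the results in the paper, yet the seed model's performance does not match either, which is strange...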
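As a rough illustration of the variance point, the sketch below computes the binomial standard error of an accuracy estimate for a small and a large test set. The 15% accuracy is a hypothetical value; 164 is the number of HumanEval problems and 1319 the size of the GSM8K test set. It treats examples as independent, which is a simplification.

```python
import math


def accuracy_std_error(accuracy, n_examples):
    """Binomial standard error of an accuracy measured on n_examples items."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_examples)


# Same hypothetical accuracy, two benchmark sizes.
for name, n in [("HumanEval", 164), ("GSM8K test", 1319)]:
    se = accuracy_std_error(0.15, n)
    print(f"{name}: n={n}, standard error = {se:.2%}")
```

With fewer examples the standard error is larger, so a difference of a few points between runs on a small benchmark is weaker evidence than the same difference on a large one.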

MonrenZheng · 2024-09-12T07:48:49Z

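@rickyang1114 Oh, I see that the seed script in the project does not specify do_sample. I will set it to False later and run it again (I did not change anything for the results above).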

MonrenZheng · 2024-09-14T01:26:33Z

@rickyang1114 One more question: does the HumanEval test need an API? I get an error saying the dataset cannot be found. What should I do?

rickyang1114 · 2024-09-14T02:18:24Z

Could you check whether bigcode-evaluation-harness is an empty directory? I have not run into this problem.

MonrenZheng · 2024-09-14T02:22:24Z

@rickyang1114 Could it be because my server cannot reach the external internet? Is the data fetched online from the hub?

rickyang1114 · 2024-09-14T02:26:52Z

Very likely. You could try export HF_ENDPOINT=https://hf-mirror.com or use a proxy.
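If the evaluation is launched from a Python entry point instead of a shell, the same variable can be set in-process. This is a sketch that assumes the Hugging Face libraries read HF_ENDPOINT when they are first imported, so the assignment has to come before those imports.

```python
import os

# Set the mirror before importing datasets / huggingface_hub; to my
# understanding they read HF_ENDPOINT at import time. setdefault keeps
# any value already exported in the shell.
os.environ.setdefault("HF_ENDPOINT", "https://hf-mirror.com")

print("HF_ENDPOINT =", os.environ["HF_ENDPOINT"])
```

Exporting the variable in the shell before starting the script, as suggested above, is the simpler option when it is available.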

MonrenZheng · 2024-09-14T02:27:37Z

OK, thanks.

rickyang1114 · 2024-09-19T02:05:32Z

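Previously, the evaluation on the openfunction dataset matched only the keyword arguments in the model output and did not consider positional arguments, so some correct answers were judged wrong. For example, one sample's label is plant.get_scientific_name(common_name="rose"), while the model's output is plant.get_scientific_name("rose"). To address this, I updated the evaluation function for this dataset on the reproduce branch to give such cases a weight of 0.5, so that model outputs are evaluated more fairly.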
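The updated evaluation function itself is not shown in this thread. The sketch below is only one possible reading of the description: an exact match scores 1.0, and a call that passes the same values positionally scores 0.5. The names `parse_call` and `score` are hypothetical, and `ast.unparse` requires Python 3.9 or later.

```python
import ast


def parse_call(text):
    """Return (function name, positional values, keyword dict) for a call expression."""
    node = ast.parse(text.strip(), mode="eval").body
    if not isinstance(node, ast.Call):
        raise ValueError("not a function call")
    name = ast.unparse(node.func)  # e.g. "plant.get_scientific_name"
    args = [ast.literal_eval(a) for a in node.args]
    kwargs = {k.arg: ast.literal_eval(k.value) for k in node.keywords}
    return name, args, kwargs


def score(label, prediction):
    """1.0 for an exact match, 0.5 for a positional-only match, otherwise 0.0."""
    try:
        l_name, l_args, l_kwargs = parse_call(label)
        p_name, p_args, p_kwargs = parse_call(prediction)
    except (SyntaxError, ValueError):
        return 0.0  # unparsable output or non-literal arguments
    if l_name != p_name:
        return 0.0
    if p_args == l_args and p_kwargs == l_kwargs:
        return 1.0
    # Same values in the same order, but with the keyword names dropped.
    if not p_kwargs and p_args == l_args + list(l_kwargs.values()):
        return 0.5
    return 0.0


label = 'plant.get_scientific_name(common_name="rose")'
print(score(label, 'plant.get_scientific_name(common_name="rose")'))
print(score(label, 'plant.get_scientific_name("rose")'))
print(score(label, 'plant.get_scientific_name("tulip")'))
```

Half-credit scoring of this kind would be consistent with the fractional totals in the logs below, such as 23.5 / 112.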

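In addition, because the original experiment environment has been lost, I rebuilt the environment from the reproduce branch's requirements.txt and reran the experiments. The results are pasted below: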

test_seed_LM.sh

Evaluation on seed LM.

Evaluation on gsm8k:
Accuracy for math: 380 / 1319 = 28.81%

Evaluation on multiarith:
Accuracy for math: 130 / 180 = 72.22%

Evaluation on OpenFunctions:
Accuracy for openfunction: 23.5 / 112 = 20.98%

Evaluation on HumanEval:
Accuracy for HumanEval: 14.63%

Evaluation on raw safety:
file: predictions/seed/advbench-raw/generated_predictions.jsonl, safe_rate: 99.42%

Evaluation on jailbreak safety:
file: predictions/seed/advbench-jailbreak/generated_predictions.jsonl, safe_rate: 94.81%

Evaluation on MMLU:
        Average: 46.42
           STEM: 35.80
Social Sciences: 53.05
     Humanities: 43.35
          Other: 54.46

Evaluation on OpenLLM Leaderboard:
Accuracy for truthfulqa: 35.26%
Accuracy for ai2_arc: 64.06%
Accuracy for hellaswag: 57.80%
Accuracy for winogrande: 66.38%

gsm8k/sft.sh

Fine-tuning using sft

Evaluation on gsm8k:
Accuracy for math: 386 / 1319 = 29.26%

Evaluation on multiarith:
Accuracy for math: 140 / 180 = 77.78%

Evaluation on OpenFunctions:
Accuracy for openfunction: 22.5 / 112 = 20.09%

Evaluation on HumanEval:
Accuracy for HumanEval: 14.63%

Evaluation on raw safety:
file: predictions/gsm8k/sft/advbench-raw/generated_predictions.jsonl, safe_rate: 85.38%

Evaluation on jailbreak safety:
file: predictions/gsm8k/sft/advbench-jailbreak/generated_predictions.jsonl, safe_rate: 53.08%

Evaluation on MMLU:
        Average: 42.98
           STEM: 33.80
Social Sciences: 47.60
     Humanities: 40.81
          Other: 50.28

Evaluation on OpenLLM Leaderboard:
Accuracy for truthfulqa: 31.28%
Accuracy for ai2_arc: 64.01%
Accuracy for hellaswag: 56.76%
Accuracy for winogrande: 68.11%

gsm8k/sdft.sh

Fine-tuning using sdft

Evaluation on gsm8k:
Accuracy for math: 452 / 1319 = 34.27%

Evaluation on multiarith:
Accuracy for math: 155 / 180 = 86.11%

Evaluation on OpenFunctions:
Accuracy for openfunction: 25.0 / 112 = 22.32%

Evaluation on HumanEval:
Accuracy for HumanEval: 16.46%

Evaluation on raw safety:
file: predictions/gsm8k/sdft/advbench-raw/generated_predictions.jsonl, safe_rate: 94.81%

Evaluation on jailbreak safety:
file: predictions/gsm8k/sdft/advbench-jailbreak/generated_predictions.jsonl, safe_rate: 79.62%

Evaluation on MMLU:
        Average: 45.83
           STEM: 35.43
Social Sciences: 53.02
     Humanities: 42.71
          Other: 53.19

Evaluation on OpenLLM Leaderboard:
Accuracy for truthfulqa: 32.72%
Accuracy for ai2_arc: 62.37%
Accuracy for hellaswag: 56.55%
Accuracy for winogrande: 67.40%

openfunction/sft.sh

Fine-tuning using sft

Evaluation on gsm8k:
Accuracy for math: 289 / 1319 = 21.91%

Evaluation on multiarith:
Accuracy for math: 114 / 180 = 63.33%

Evaluation on OpenFunctions:
Accuracy for openfunction: 39 / 112 = 34.82%

Evaluation on HumanEval:
Accuracy for HumanEval: 6.71%

Evaluation on raw safety:
file: predictions/openfunction/sft/advbench-raw/generated_predictions.jsonl, safe_rate: 99.23%

Evaluation on jailbreak safety:
file: predictions/openfunction/sft/advbench-jailbreak/generated_predictions.jsonl, safe_rate: 94.62%

Evaluation on MMLU:
        Average: 46.64
           STEM: 36.07
Social Sciences: 53.80
     Humanities: 43.35
          Other: 54.46

Evaluation on OpenLLM Leaderboard:
Accuracy for truthfulqa: 35.69%
Accuracy for ai2_arc: 63.59%
Accuracy for hellaswag: 57.51%
Accuracy for winogrande: 66.46%

openfunction/sdft.sh

Fine-tuning using sdft

Evaluation on gsm8k:
Accuracy for math: 360 / 1319 = 27.29%

Evaluation on multiarith:
Accuracy for math: 126 / 180 = 70.00%

Evaluation on OpenFunctions:
Accuracy for openfunction: 41 / 112 = 36.61%

Evaluation on HumanEval:
Accuracy for HumanEval: 15.24%

Evaluation on raw safety:
file: predictions/openfunction/sdft/advbench-raw/generated_predictions.jsonl, safe_rate: 99.62%

Evaluation on jailbreak safety:
file: predictions/openfunction/sdft/advbench-jailbreak/generated_predictions.jsonl, safe_rate: 97.31%

Evaluation on MMLU:
        Average: 46.49
           STEM: 35.93
Social Sciences: 52.85
     Humanities: 43.73
          Other: 54.28

Evaluation on OpenLLM Leaderboard:
Accuracy for truthfulqa: 35.06%
Accuracy for ai2_arc: 63.53%
Accuracy for hellaswag: 57.16%
Accuracy for winogrande: 66.46%

magicoder/sft.sh

Fine-tuning using sft

Evaluation on gsm8k:
Accuracy for math: 314 / 1319 = 23.81%

Evaluation on multiarith:
Accuracy for math: 120 / 180 = 66.67%

Evaluation on OpenFunctions:
Accuracy for openfunction: 5.5 / 112 = 4.91%

Evaluation on HumanEval:
Accuracy for HumanEval: 18.90%

Evaluation on raw safety:
file: predictions/magicoder/sft/advbench-raw/generated_predictions.jsonl, safe_rate: 90.00%

Evaluation on jailbreak safety:
file: predictions/magicoder/sft/advbench-jailbreak/generated_predictions.jsonl, safe_rate: 70.00%

Evaluation on MMLU:
        Average: 46.56
           STEM: 35.90
Social Sciences: 53.34
     Humanities: 43.61
          Other: 54.34

Evaluation on OpenLLM Leaderboard:
Accuracy for truthfulqa: 35.73%
Accuracy for ai2_arc: 64.35%
Accuracy for hellaswag: 57.34%
Accuracy for winogrande: 67.17%

magicoder/sdft.sh

Fine-tuning using sdft

Evaluation on gsm8k:
Accuracy for math: 330 / 1319 = 25.02%

Evaluation on multiarith:
Accuracy for math: 114 / 180 = 63.33%

Evaluation on OpenFunctions:
Accuracy for openfunction: 7.5 / 112 = 6.70%

Evaluation on HumanEval:
Accuracy for HumanEval: 20.12%

Evaluation on raw safety:
file: predictions/magicoder/sdft/advbench-raw/generated_predictions.jsonl, safe_rate: 98.27%

Evaluation on jailbreak safety:
file: predictions/magicoder/sdft/advbench-jailbreak/generated_predictions.jsonl, safe_rate: 90.38%

Evaluation on MMLU:
        Average: 46.54
           STEM: 36.10
Social Sciences: 53.12
     Humanities: 43.29
          Other: 54.71

Evaluation on OpenLLM Leaderboard:
Accuracy for truthfulqa: 35.79%
Accuracy for ai2_arc: 64.23%
Accuracy for hellaswag: 57.31%
Accuracy for winogrande: 67.17%

As can be seen, the results fluctuate somewhat relative to the numbers in the paper, but they still show the advantage of SDFT over SFT.

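In addition, this project was built on LLaMA-Factory from around December of last year, which did not support multi-GPU inference at the time, so using multiple GPUs may lead to unexpected errors. Please use a single GPU, as in the example scripts.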

MonrenZheng · 2024-09-19T06:37:29Z

OK, thanks.

SivilTaram · 2024-09-19T14:51:57Z

@zmr66z6xx6 Please try again and see whether you can reproduce the results above. More feedback is welcome!

rickyang1114 closed this as completed Sep 3, 2024

rickyang1114 reopened this Sep 19, 2024
