Evaluation mechanism update #82

cizhenshi · 2024-10-30T02:42:35Z

Many evaluations of long-context models currently cite LongBench results. However, n-gram-based metrics do not truly reflect response quality, and many papers have instead adopted GPT-4o as a scorer. Could you provide an official version of the GPT-4o scoring code, so that 4o-based scoring is standardized across evaluations and the results are more comparable?
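
As an illustration of the concern, here is a minimal sketch of a token-overlap F1 score. The `token_f1` helper and the example strings are assumptions made up for this illustration; this is a simplified stand-in for an n-gram style metric, not LongBench's actual metric code.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Unigram-overlap F1 between whitespace-split, lowercased tokens.

    Hypothetical helper for illustration only; not the LongBench implementation.
    """
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


reference = "paris"
verbose_correct = "the capital of france is paris"  # correct, but wordy
short_wrong = "not paris"                           # wrong, but short

print(token_f1(verbose_correct, reference))
print(token_f1(short_wrong, reference))
```

In this toy case the short wrong answer scores higher than the wordy correct one, because the overlap metric penalizes extra tokens and ignores meaning. A judge model that reads the whole response would not be misled in this way.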

bys0318 · 2024-10-31T13:26:35Z

Great suggestion! I will update the code to support LLM-as-a-judge evaluation in the next few days.
