Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Showing 15 changed files with 1,086 additions and 15 deletions.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1 +1,21 @@ | ||
<rss version="2.0"><channel><title>Chat Arxiv cs.CL</title><link>https://github.com/qhduan/cn-chat-arxiv</link><description>This is arxiv RSS feed for cs.CL</description></channel></rss> | ||
<rss version="2.0"><channel><title>Chat Arxiv cs.CL</title><link>https://github.com/qhduan/cn-chat-arxiv</link><description>This is arxiv RSS feed for cs.CL</description><item><title>对于评估大型语言模型中多选题回答的合理性进行了回顾,发现当前基于多选题回答的基准可能无法充分捕捉大型语言模型的真实能力。</title><link>https://rss.arxiv.org/abs/2402.01349</link><description><p> | ||
超越答案:对于评估大型语言模型中多选题回答的合理性的回顾 | ||
</p> | ||
<p> | ||
Beyond the Answers: Reviewing the Rationality of Multiple Choice Question Answering for the Evaluation of Large Language Models | ||
</p> | ||
<p> | ||
https://rss.arxiv.org/abs/2402.01349 | ||
</p> | ||
<p> | ||
对于评估大型语言模型中多选题回答的合理性进行了回顾,发现当前基于多选题回答的基准可能无法充分捕捉大型语言模型的真实能力。 | ||
</p> | ||
<p> | ||
|
||
</p> | ||
<p> | ||
在自然语言处理领域,大型语言模型(LLMs)引发了一场范式转变,显著提升了自然语言生成任务的性能。尽管取得了这些进展,对LLMs的全面评估仍然是社区面临的必然挑战。最近,将多选题回答(MCQA)作为LLMs的基准已经引起了广泛关注。本研究调查了MCQA作为LLMs评估方法的合理性。如果LLMs真正理解问题的语义,它们的性能应该在从相同问题派生的各种配置上表现一致。然而,我们的实证结果表明LLMs的响应一致性存在显著差异,我们将之定义为LLMs的响应可变性综合征(REVAS),这表明目前基于MCQA的基准可能无法充分捕捉LLMs的真实能力,强调了对更合适的评估方法的需要。 | ||
</p> | ||
<p> | ||
In the field of natural language processing (NLP), Large Language Models (LLMs) have precipitated a paradigm shift, markedly enhancing performance in natural language generation tasks. Despite these advancements, the comprehensive evaluation of LLMs remains an inevitable challenge for the community. Recently, the utilization of Multiple Choice Question Answering (MCQA) as a benchmark for LLMs has gained considerable traction. This study investigates the rationality of MCQA as an evaluation method for LLMs. If LLMs genuinely understand the semantics of questions, their performance should exhibit consistency across the varied configurations derived from the same questions. Contrary to this expectation, our empirical findings suggest a notable disparity in the consistency of LLM responses, which we define as REsponse VAriability Syndrome (REVAS) of the LLMs, indicating that current MCQA-based benchmarks may not adequately capture the true capabilities of LLMs, which underscores the need f | ||
</p></description></item></channel></rss> |