Add changes
qhduan committed Dec 9, 2024
1 parent 26e0085 commit 15d702c
Showing 15 changed files with 1,086 additions and 15 deletions.
105 changes: 104 additions & 1 deletion cs.AI.md

Large diffs are not rendered by default.

142 changes: 141 additions & 1 deletion cs.AI.xml

Large diffs are not rendered by default.

15 changes: 14 additions & 1 deletion cs.CL.md
@@ -2,9 +2,22 @@

| Ref | Title | Summary |
| --- | --- | --- |

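| [^1] | [Beyond the Answers: Reviewing the Rationality of Multiple Choice Question Answering for the Evaluation of Large Language Models](https://rss.arxiv.org/abs/2402.01349) | Reviews the rationality of multiple-choice question answering for evaluating large language models and finds that current MCQA-based benchmarks may not adequately capture their true capabilities. |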

# Details

[^1]: Beyond the Answers: Reviewing the Rationality of Multiple Choice Question Answering for the Evaluation of Large Language Models

[https://rss.arxiv.org/abs/2402.01349](https://rss.arxiv.org/abs/2402.01349)

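Reviews the rationality of multiple-choice question answering for evaluating large language models and finds that current MCQA-based benchmarks may not adequately capture their true capabilities.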



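Large language models (LLMs) have brought about a paradigm shift in natural language processing, markedly improving performance on natural language generation tasks. Despite this progress, comprehensive evaluation of LLMs remains an unavoidable challenge for the community. Recently, multiple-choice question answering (MCQA) has attracted wide attention as a benchmark for LLMs. This study examines whether MCQA is a reasonable evaluation method for LLMs. If LLMs truly understand the semantics of a question, their performance should be consistent across the various configurations derived from that same question. However, our empirical results show marked differences in the consistency of LLM responses, which we define as the REsponse VAriability Syndrome (REVAS) of LLMs. This suggests that current MCQA-based benchmarks may not adequately capture the true capabilities of LLMs and highlights the need for more suitable evaluation methods.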

In the field of natural language processing (NLP), Large Language Models (LLMs) have precipitated a paradigm shift, markedly enhancing performance in natural language generation tasks. Despite these advancements, the comprehensive evaluation of LLMs remains an inevitable challenge for the community. Recently, the utilization of Multiple Choice Question Answering (MCQA) as a benchmark for LLMs has gained considerable traction. This study investigates the rationality of MCQA as an evaluation method for LLMs. If LLMs genuinely understand the semantics of questions, their performance should exhibit consistency across the varied configurations derived from the same questions. Contrary to this expectation, our empirical findings suggest a notable disparity in the consistency of LLM responses, which we define as REsponse VAriability Syndrome (REVAS) of the LLMs, indicating that current MCQA-based benchmarks may not adequately capture the true capabilities of LLMs, which underscores the need f
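The consistency argument in this abstract can be illustrated with a small sketch. It assumes that one of the "varied configurations" is a reordering of the answer options, which the abstract does not spell out, and it is not the paper's evaluation protocol. The `Model` interface, `choice_consistency`, and both toy models are hypothetical names introduced only for illustration.

```python
from collections import Counter
from itertools import permutations
from typing import Callable, Sequence

# Hypothetical interface: takes a question and an ordered list of options,
# returns the index of the option the model selects.
Model = Callable[[str, Sequence[str]], int]


def choice_consistency(model: Model, question: str, options: Sequence[str]) -> float:
    """Fraction of option orderings on which the model picks its most frequent answer."""
    picks = []
    for order in permutations(options):
        index = model(question, order)
        picks.append(order[index])  # map the chosen position back to the option text
    modal_count = Counter(picks).most_common(1)[0][1]
    return modal_count / len(picks)


def always_first(question: str, options: Sequence[str]) -> int:
    """Toy stand-in for a position-biased model: always selects the first option."""
    return 0


def knows_answer(question: str, options: Sequence[str]) -> int:
    """Toy stand-in for an order-invariant model: always finds the same option."""
    return list(options).index("Paris")


question = "What is the capital of France?"
options = ["Paris", "Rome", "Madrid", "Berlin"]
biased_score = choice_consistency(always_first, question, options)
stable_score = choice_consistency(knows_answer, question, options)
```

A score of 1.0 means the selected option never changes with ordering. The position-biased toy model scores far lower because each option sits in the first slot equally often, which is the kind of variability the abstract describes.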


22 changes: 21 additions & 1 deletion cs.CL.xml
@@ -1 +1,21 @@
<rss version="2.0"><channel><title>Chat Arxiv cs.CL</title><link>https://github.com/qhduan/cn-chat-arxiv</link><description>This is arxiv RSS feed for cs.CL</description></channel></rss>
<rss version="2.0"><channel><title>Chat Arxiv cs.CL</title><link>https://github.com/qhduan/cn-chat-arxiv</link><description>This is arxiv RSS feed for cs.CL</description><item><title>Reviews the rationality of multiple-choice question answering for evaluating large language models and finds that current MCQA-based benchmarks may not adequately capture their true capabilities.</title><link>https://rss.arxiv.org/abs/2402.01349</link><description>&lt;p&gt;
Beyond the Answers: Reviewing the Rationality of Multiple Choice Question Answering for the Evaluation of Large Language Models
&lt;/p&gt;
&lt;p&gt;
https://rss.arxiv.org/abs/2402.01349
&lt;/p&gt;
&lt;p&gt;
Reviews the rationality of multiple-choice question answering for evaluating large language models and finds that current MCQA-based benchmarks may not adequately capture their true capabilities.
&lt;/p&gt;
&lt;p&gt;

&lt;/p&gt;
&lt;p&gt;
Large language models (LLMs) have brought about a paradigm shift in natural language processing, markedly improving performance on natural language generation tasks. Despite this progress, comprehensive evaluation of LLMs remains an unavoidable challenge for the community. Recently, multiple-choice question answering (MCQA) has attracted wide attention as a benchmark for LLMs. This study examines whether MCQA is a reasonable evaluation method for LLMs. If LLMs truly understand the semantics of a question, their performance should be consistent across the various configurations derived from that same question. However, our empirical results show marked differences in the consistency of LLM responses, which we define as the REsponse VAriability Syndrome (REVAS) of LLMs. This suggests that current MCQA-based benchmarks may not adequately capture the true capabilities of LLMs and highlights the need for more suitable evaluation methods.
&lt;/p&gt;
&lt;p&gt;
In the field of natural language processing (NLP), Large Language Models (LLMs) have precipitated a paradigm shift, markedly enhancing performance in natural language generation tasks. Despite these advancements, the comprehensive evaluation of LLMs remains an inevitable challenge for the community. Recently, the utilization of Multiple Choice Question Answering (MCQA) as a benchmark for LLMs has gained considerable traction. This study investigates the rationality of MCQA as an evaluation method for LLMs. If LLMs genuinely understand the semantics of questions, their performance should exhibit consistency across the varied configurations derived from the same questions. Contrary to this expectation, our empirical findings suggest a notable disparity in the consistency of LLM responses, which we define as REsponse VAriability Syndrome (REVAS) of the LLMs, indicating that current MCQA-based benchmarks may not adequately capture the true capabilities of LLMs, which underscores the need f
&lt;/p&gt;</description></item></channel></rss>
30 changes: 29 additions & 1 deletion cs.IR.md
@@ -2,9 +2,37 @@

| Ref | Title | Summary |
| --- | --- | --- |

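| [^1] | [All-in-One: Heterogeneous Interaction Modeling for Cold-Start Rating Prediction](https://arxiv.org/abs/2403.17740) | Proposes the heterogeneous interaction rating network (HIRE) framework, which uses a Heterogeneous Interaction Module (HIM) to jointly model heterogeneous interactions and directly infer the important features. |
| [^2] | [TPRF: A Transformer-based Pseudo-Relevance Feedback Model for Efficient and Effective Retrieval.](http://arxiv.org/abs/2401.13509) | Proposes a Transformer-based pseudo-relevance feedback model (TPRF) suited to resource-constrained environments. Compared with other deep language models, TPRF has a smaller memory footprint and faster inference, and it effectively combines relevance feedback signals from dense passage representations. |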

# Details

[^1]: All-in-One: Heterogeneous Interaction Modeling for Cold-Start Rating Prediction

[https://arxiv.org/abs/2403.17740](https://arxiv.org/abs/2403.17740)

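Proposes the heterogeneous interaction rating network (HIRE) framework, which uses a Heterogeneous Interaction Module (HIM) to jointly model heterogeneous interactions and directly infer the important features.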



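Cold-start rating prediction is a fundamental problem in recommender systems and has been studied extensively. Many methods have been proposed that exploit explicit relations among existing data, such as collaborative filtering, social recommendation, and heterogeneous information networks, to alleviate the data insufficiency of cold-start users and items. However, explicit relations constructed from data between different roles may be unreliable and irrelevant, which limits the performance ceiling of a specific recommendation task. Motivated by this, we propose a flexible framework called the heterogeneous interaction rating network (HIRE). HIRE does not rely solely on predefined interaction patterns or manually constructed heterogeneous information networks. Instead, we design a Heterogeneous Interaction Module (HIM) to jointly model heterogeneous interactions and directly infer the important features.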

arXiv:2403.17740v1 Announce Type: cross Abstract: Cold-start rating prediction is a fundamental problem in recommender systems that has been extensively studied. Many methods have been proposed that exploit explicit relations among existing data, such as collaborative filtering, social recommendations and heterogeneous information network, to alleviate the data insufficiency issue for cold-start users and items. However, the explicit relations constructed based on data between different roles may be unreliable and irrelevant, which limits the performance ceiling of the specific recommendation task. Motivated by this, in this paper, we propose a flexible framework dubbed heterogeneous interaction rating network (HIRE). HIRE does not solely rely on the pre-defined interaction pattern or the manually constructed heterogeneous information network. Instead, we devise a Heterogeneous Interaction Module (HIM) to jointly model the heterogeneous interactions and directly infer the important in

[^2]: TPRF: A Transformer-based Pseudo-Relevance Feedback Model for Efficient and Effective Retrieval. (arXiv:2401.13509v1 [cs.IR])

[http://arxiv.org/abs/2401.13509](http://arxiv.org/abs/2401.13509)

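Proposes a Transformer-based pseudo-relevance feedback model (TPRF) suited to resource-constrained environments. Compared with other deep language models, TPRF has a smaller memory footprint and faster inference, and it effectively combines relevance feedback signals from dense passage representations.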



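This paper considers pseudo-relevance feedback (PRF) methods for dense retrievers in resource-constrained environments, such as cheap cloud instances or embedded systems (e.g., smartphones and smartwatches), where memory and CPU are limited and no GPU is available. To this end, we propose a Transformer-based PRF method (TPRF) that, compared with other deep language models employing PRF mechanisms, has a smaller memory footprint and faster inference time at a small cost in effectiveness. TPRF learns how to effectively combine relevance feedback signals from dense passage representations. Specifically, TPRF provides a mechanism for modelling the relationships and weights between the query and the relevance feedback signals. The method is agnostic to the specific dense representation used and can therefore be applied to any dense retriever.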

This paper considers Pseudo-Relevance Feedback (PRF) methods for dense retrievers in a resource constrained environment such as that of cheap cloud instances or embedded systems (e.g., smartphones and smartwatches), where memory and CPU are limited and GPUs are not present. For this, we propose a transformer-based PRF method (TPRF), which has a much smaller memory footprint and faster inference time compared to other deep language models that employ PRF mechanisms, with a marginal effectiveness loss. TPRF learns how to effectively combine the relevance feedback signals from dense passage representations. Specifically, TPRF provides a mechanism for modelling relationships and weights between the query and the relevance feedback signals. The method is agnostic to the specific dense representation used and thus can be generally applied to any dense retriever.
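To make the idea of combining a query with weighted feedback signals concrete, here is a minimal sketch. It is not TPRF: the learned transformer is replaced by a single softmax-weighted pooling step followed by a fixed blend, and `prf_refine`, `alpha`, and the two-dimensional vectors are hypothetical names and values chosen only for illustration.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def prf_refine(query: Vector, feedback: Sequence[Vector], alpha: float = 0.5) -> List[float]:
    """Blend a query embedding with a softmax-weighted mean of feedback passage embeddings."""
    if not feedback:
        return list(query)  # nothing to combine, keep the original query
    scores = [dot(query, passage) for passage in feedback]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(score - top) for score in scores]
    total = sum(exps)
    weights = [value / total for value in exps]
    pooled = [
        sum(weight * passage[i] for weight, passage in zip(weights, feedback))
        for i in range(len(query))
    ]
    return [(1 - alpha) * q + alpha * p for q, p in zip(query, pooled)]


query = [1.0, 0.0]
feedback = [[0.8, 0.6], [0.6, 0.8]]  # embeddings of the top-ranked passages (made up)
refined = prf_refine(query, feedback)
```

The refined vector would then be used for a second retrieval pass. Because the step only consumes embeddings, it works with any dense encoder, which mirrors the abstract's point that the method is agnostic to the specific dense representation.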

