Commit

docs: concise vs full experiment
Ki-Seki committed Aug 25, 2024
1 parent a1091fe commit 887242f
Showing 29 changed files with 3,117,348 additions and 3 deletions.
1 change: 0 additions & 1 deletion README.md
@@ -141,7 +141,6 @@ To facilitate evaluation, we have developed a user-friendly evaluation framework

<details><summary>Click me to show all TODOs</summary>

- [ ] fix: test bias in concise dataset
- [ ] docs: finish all TODOs in docs
- [ ] feat: vLLM offline inference benchmarking
- [ ] build: packaging
12 changes: 10 additions & 2 deletions docs/experiments.md
@@ -2,6 +2,8 @@

## Experiment-20231117

The original experimental results can be found in [./experiments/20231117](./experiments/20231117).

These are the experimental results corresponding to the [ACL 2024 paper](https://aclanthology.org/2024.acl-long.288/). All evaluations were conducted on the full version of the UHGEvalDataset.

<p align="center"><img src="./experiments/20231117/images/discri_and_sel.png" alt=""></p>
@@ -10,7 +12,13 @@ These are the experimental results corresponding to the [ACL 2024 paper](https://aclanthology.org/2024.acl-long.288/)

<p align="center"><img src="./experiments/20231117/images/by_type.png" alt="" width="60%"></p>

The original experimental code can be found in [./experiments/20231117](./experiments/20231117).

> [!Caution]
> The Eval Suite used at that time was an older version. Running the same experiments with the current version might produce slightly different results.

## Experiment-20240822

The original experimental results can be found in [./experiments/20240822](./experiments/20240822).

This experiment tested whether the evaluation results produced with the full dataset differ significantly from those produced with the concise dataset.

The experimental results show that the differences between the full and concise datasets are minimal, so the concise dataset can be used instead of the full dataset to improve evaluation speed.
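For reference, a minimal sketch distilled from the expt.py script added in this commit (full script below); it assumes the evaluators' `use_full` flag toggles between the full (`True`) and concise (`False`) versions of the UHGEvalDataset:

```python
from eval.benchs import UHGSelectiveEvaluator
from eval.llms import OpenAIAPI

# Any OpenAI-compatible endpoint works; the key and base URL are placeholders.
model = OpenAIAPI(
    model_name="THUDM/glm-4-9b-chat",
    api_key="your_api_key",
    base_url="https://api.siliconflow.cn/v1",
)

# Run the same benchmark on both dataset variants and compare the reports.
for use_full in (True, False):
    UHGSelectiveEvaluator(model=model, use_full=use_full).evaluate()
```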
35 changes: 35 additions & 0 deletions docs/experiments/20240822/expt.py
@@ -0,0 +1,35 @@
from eval.benchs import (
    UHGDiscKeywordEvaluator,
    UHGDiscSentenceEvaluator,
    UHGGenerativeEvaluator,
    UHGSelectiveEvaluator,
)
from eval.llms import OpenAIAPI

# Three chat models, all served through an OpenAI-compatible API endpoint
# (SiliconFlow). Replace "your_api_key" with a real key before running.
glm = OpenAIAPI(
    model_name="THUDM/glm-4-9b-chat",
    api_key="your_api_key",
    base_url="https://api.siliconflow.cn/v1",
)

qwen = OpenAIAPI(
    model_name="Qwen/Qwen1.5-7B-Chat",
    api_key="your_api_key",
    base_url="https://api.siliconflow.cn/v1",
)

llama = OpenAIAPI(
    model_name="meta-llama/Meta-Llama-3.1-8B-Instruct",
    api_key="your_api_key",
    base_url="https://api.siliconflow.cn/v1",
)

# Run every benchmark twice per model: once on the full UHGEvalDataset
# (use_full=True) and once on the concise version (use_full=False).
for model in (glm, qwen, llama):
    for use_full in (True, False):
        for evaluator in [
            UHGSelectiveEvaluator,
            UHGGenerativeEvaluator,
            UHGDiscSentenceEvaluator,
            UHGDiscKeywordEvaluator,
        ]:
            evaluator(model=model, use_full=use_full).evaluate()

Large diffs are not rendered by default.
