Commit

docs: concise vs full experiment
Ki-Seki committed Aug 25, 2024
1 parent a1091fe commit 887242f
Showing 29 changed files with 3,117,348 additions and 3 deletions.
1 change: 0 additions & 1 deletion README.md
@@ -141,7 +141,6 @@ To facilitate evaluation, we have developed a user-friendly evaluation framework

<details><summary>Click me to show all TODOs</summary>

- [ ] fix: test bias in concise dataset
- [ ] docs: finish all TODOs in docs
- [ ] feat: vLLM offline inference benchmarking
- [ ] build: packaging
12 changes: 10 additions & 2 deletions docs/experiments.md
@@ -2,6 +2,8 @@

## Experiment-20231117

The original experimental results can be found in [./experiments/20231117](./experiments/20231117).

These are the experimental results corresponding to the [ACL 2024 paper](https://aclanthology.org/2024.acl-long.288/). All evaluations were conducted on the full version of the UHGEvalDataset.

<p align="center"><img src="./experiments/20231117/images/discri_and_sel.png" alt=""></p>
@@ -10,7 +12,13 @@ These are the experimental results corresponding to the [ACL 2024 paper](https://aclanthology.org/2024.acl-long.288/)

<p align="center"><img src="./experiments/20231117/images/by_type.png" alt="" width="60%"></p>

The original experimental code can be found in [./experiments/20231117](./experiments/20231117).

> [!Caution]
> The Eval Suite used at that time was an older version. Running the same experiments with the current version might produce slightly different results.

## Experiment-20240822

The original experimental results can be found in [./experiments/20240822](./experiments/20240822).

This experiment tested whether the evaluation results produced with the full dataset differ significantly from those produced with the concise dataset.

The experimental results show that the differences between the full and concise datasets are minimal, so the concise dataset can be used instead of the full dataset to improve evaluation speed.
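For reference, a minimal sketch distilled from the expt.py script added in this commit (full script below); it assumes the evaluators' `use_full` flag toggles between the full (`True`) and concise (`False`) versions of the UHGEvalDataset:

```python
from eval.benchs import UHGSelectiveEvaluator
from eval.llms import OpenAIAPI

# Any OpenAI-compatible endpoint works; the key and base URL are placeholders.
model = OpenAIAPI(
    model_name="THUDM/glm-4-9b-chat",
    api_key="your_api_key",
    base_url="https://api.siliconflow.cn/v1",
)

# Run the same benchmark on both dataset variants and compare the reports.
for use_full in (True, False):
    UHGSelectiveEvaluator(model=model, use_full=use_full).evaluate()
```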
35 changes: 35 additions & 0 deletions docs/experiments/20240822/expt.py
@@ -0,0 +1,35 @@
from eval.benchs import (
    UHGDiscKeywordEvaluator,
    UHGDiscSentenceEvaluator,
    UHGGenerativeEvaluator,
    UHGSelectiveEvaluator,
)
from eval.llms import OpenAIAPI

# Three chat models, all served through an OpenAI-compatible API endpoint
# (SiliconFlow). Replace "your_api_key" with a real key before running.
glm = OpenAIAPI(
    model_name="THUDM/glm-4-9b-chat",
    api_key="your_api_key",
    base_url="https://api.siliconflow.cn/v1",
)

qwen = OpenAIAPI(
    model_name="Qwen/Qwen1.5-7B-Chat",
    api_key="your_api_key",
    base_url="https://api.siliconflow.cn/v1",
)

llama = OpenAIAPI(
    model_name="meta-llama/Meta-Llama-3.1-8B-Instruct",
    api_key="your_api_key",
    base_url="https://api.siliconflow.cn/v1",
)

# Run every benchmark twice per model: once on the full UHGEvalDataset
# (use_full=True) and once on the concise version (use_full=False).
for model in (glm, qwen, llama):
    for use_full in (True, False):
        for evaluator in [
            UHGSelectiveEvaluator,
            UHGGenerativeEvaluator,
            UHGDiscSentenceEvaluator,
            UHGDiscKeywordEvaluator,
        ]:
            evaluator(model=model, use_full=use_full).evaluate()

Large diffs are not rendered by default.
