docs: update
Ki-Seki committed Aug 31, 2024
1 parent 4a09004 commit 78105e6
Showing 6 changed files with 64 additions and 37 deletions.
10 changes: 9 additions & 1 deletion .github/CONTRIBUTING.md
@@ -2,6 +2,14 @@

We appreciate your interest in contributing. To ensure a smooth collaboration, please review the following guidelines.

> [!NOTE]
> Please ensure that your code passes all tests and the `black` formatting check before opening a pull request.
> You can run the following commands to check your code:
> ```bash
> python -m unittest discover -s tests/ -p 'test*.py' -v
> black . --check
> ```

## How to Contribute

1. Get the latest version of the repository:
@@ -23,7 +31,7 @@ We appreciate your interest in contributing. To ensure a smooth collaboration, p
## Code Style
- (Mandatory) Use [black](https://black.readthedocs.io/en/stable/) to format code
- Use [isort](https://pycqa.github.io/isort/) to reorder import statements
- Use [Google Docstring Format](https://sphinxcontrib-napoleon.readthedocs.io/en/latest/example_google.html) to standardize docstrings (a short example follows this list)
- Use [Conventional Commits](https://www.conventionalcommits.org/) to make commit messages more readable
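
For reference, a Google-style docstring looks roughly like the sketch below. The function itself is purely illustrative and not taken from this repository.

```python
def accuracy(correct: int, total: int) -> float:
    """Compute the fraction of correctly answered items.

    Args:
        correct: Number of items answered correctly.
        total: Total number of evaluated items.

    Returns:
        The accuracy as a float in the range [0, 1].

    Raises:
        ValueError: If ``total`` is zero.
    """
    if total == 0:
        raise ValueError("total must be positive")
    return correct / total
```

A matching Conventional Commits message for such a change could read `feat(metrics): add accuracy helper`.
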
6 changes: 6 additions & 0 deletions CITATION.bib
@@ -0,0 +1,6 @@
@article{liang2023uhgeval,
title={Uhgeval: Benchmarking the hallucination of chinese large language models via unconstrained generation},
author={Liang, Xun and Song, Shichao and Niu, Simin and Li, Zhiyu and Xiong, Feiyu and Tang, Bo and Wy, Zhaohui and He, Dawei and Cheng, Peng and Wang, Zhonghao and others},
journal={arXiv preprint arXiv:2311.15296},
year={2023}
}
29 changes: 0 additions & 29 deletions CITATION.cff

This file was deleted.

2 changes: 1 addition & 1 deletion README.md
@@ -141,10 +141,10 @@ To facilitate evaluation, we have developed a user-friendly evaluation framework

<details><summary>Click me to show all TODOs</summary>

- [ ] docs: finish all TODOs in docs
- [ ] feat: vLLM offline inference benchmarking
- [ ] build: packaging
- [ ] feat(benchs): add TruthfulQA benchmark
- [ ] docs: update citation with DOI
- [ ] other: promotion

</details>
48 changes: 44 additions & 4 deletions docs/add-bench-or-model.md
@@ -1,9 +1,49 @@
# Customization Guidelines

## Adding a New Benchmark

You can refer to the structure of the `eval/benchs/exampleqa` folder, which serves as a minimal benchmark example. Additionally, you might want to check the `eval/benchs/base_dataset.py` and `eval/benchs/base_evaluator.py` files, as they provide the base classes for benchmarks.

1. **Creating a Benchmark Folder**
- Create a new folder under the `benchs` directory.
- The folder should contain the dataset, evaluator, and any other necessary files.
- The folder should include a `README.md` file to provide an overview of the benchmark and any specific instructions for running it.

2. **Dataset**
- Ensure the benchmark folder includes a `dataset.py` file, which contains the dataset loading logic.
- Implement a subclass that inherits from `BaseDataset`.
- The subclass must implement the `load` method, which returns a `list[dict]`, where each element is an evaluation data sample and must contain a unique `id` field.

3. **Evaluator**
- The folder must include an evaluator script, typically named `eval_{benchmark_name}.py`, which implements the evaluation logic.
- This script should define a subclass of `BaseEvaluator` that contains the benchmark's evaluation logic (a combined sketch of the dataset and evaluator follows this list).
- The subclass should implement `set_generation_configs` to determine the default token generation settings for the LLM during evaluation.
- Implement `load_batched_dataset` to load the batched dataset from `dataset.py`.
- Implement `scoring` to evaluate one data item from the datasets, returning the evaluation result in a dictionary format.
- Implement `compute_overall` to aggregate the evaluation results into an overall assessment, returning a dictionary with the final evaluation results.

4. **Registering the Benchmark**
- Add the benchmark to the `__init__.py` file under the `benchs` directory to ensure it is discoverable by the framework.
- Import the benchmark class in the `__init__.py` file and add it to the `__all__` list.

5. **Documentation**
- Create a `README.md` file in the benchmark folder.
- Include any specific instructions or requirements for running the benchmark.
- Add one line to the `README.md` file in the root directory under the `## Eval Suite` section to introduce the new benchmark.
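
Putting steps 2 through 4 together, a new benchmark might look roughly like the sketch below. It follows the method names described above, but the exact signatures of `BaseDataset` and `BaseEvaluator` may differ from this outline; the `MyQA` benchmark, its file paths, and attributes such as `data_path`, `batch_size`, and the model-request call are assumptions made for illustration.

```python
# benchs/myqa/dataset.py: hypothetical dataset for a benchmark called "MyQA"
import json

from ..base_dataset import BaseDataset  # actual import path may differ


class MyQADataset(BaseDataset):
    def load(self) -> list[dict]:
        # Each sample must carry a unique "id" field.
        with open(self.data_path, encoding="utf-8") as f:  # `data_path` is assumed
            samples = json.load(f)
        return [{"id": f"myqa_{i:04d}", **sample} for i, sample in enumerate(samples)]


# benchs/myqa/eval_myqa.py: hypothetical evaluator
from ..base_evaluator import BaseEvaluator
from .dataset import MyQADataset


class MyQAEvaluator(BaseEvaluator):
    def set_generation_configs(self) -> None:
        # Default token-generation settings for the LLM during evaluation.
        self.generation_configs = {"max_new_tokens": 64, "temperature": 0.0}

    def load_batched_dataset(self) -> list[list[dict]]:
        data = MyQADataset(self.data_path).load()
        size = self.batch_size  # attribute name assumed
        return [data[i : i + size] for i in range(0, len(data), size)]

    def scoring(self, data_point: dict) -> dict:
        # How the evaluator queries the model is framework-specific;
        # `self.model.safe_request` is only a placeholder name here.
        response = self.model.safe_request(data_point["question"])
        return {
            "id": data_point["id"],
            "correct": data_point["reference"].strip() in response,
        }

    def compute_overall(self, results: list[dict]) -> dict:
        # Aggregate per-sample results into the final report.
        return {"accuracy": sum(r["correct"] for r in results) / len(results)}
```

Registration (step 4) then amounts to importing `MyQAEvaluator` in the `__init__.py` under `benchs` and adding it to `__all__`.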

## Adding a New Model Loader

You can refer to the `eval/llms/huggingface.py` and `eval/llms/openai_api.py` files as examples for loading LLMs.

1. **Language Model Loader**
- Create a new file under the `llms` directory.
- The file should contain the logic for loading the LLM from a specific source (e.g., Hugging Face, OpenAI API).

2. **Implementation Steps**
- Implement a subclass that inherits from `BaseLLM`.
- The subclass must implement `update_generation_configs` to handle parameter conversions, as different LLM loaders may have varying parameter names (e.g., `max_tokens` vs. `max_new_tokens`).
- The subclass must implement `_request`, which accepts a `str` as input and returns a `str` as the generated output (see the sketch after this list).

3. **Registering the LLM Loader**
- Register the new LLM loader in the `__init__.py` file under the `llms` directory.
- Import the new LLM loader in the `__init__.py` file and add it to the `__all__` list.
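
As a rough sketch of these steps, a loader for an OpenAI-compatible HTTP endpoint might look like the following. The module name, constructor arguments, and the `generation_configs` attribute are assumptions; the actual `BaseLLM` interface under `eval/llms` may differ.

```python
# llms/my_api.py: hypothetical loader for an OpenAI-compatible endpoint
import requests  # the real loaders may use a different HTTP client or SDK

from .base_llm import BaseLLM  # actual import path of BaseLLM may differ


class MyAPILLM(BaseLLM):
    def __init__(self, model: str, base_url: str, api_key: str, **kwargs):
        super().__init__(**kwargs)
        self.model = model
        self.base_url = base_url
        self.api_key = api_key

    def update_generation_configs(self, configs: dict) -> dict:
        # Translate framework-level parameter names into what this backend
        # expects, e.g. `max_new_tokens` (Hugging Face style) -> `max_tokens`.
        configs = dict(configs)
        if "max_new_tokens" in configs:
            configs["max_tokens"] = configs.pop("max_new_tokens")
        return configs

    def _request(self, prompt: str) -> str:
        # Single prompt string in, single generated string out.
        resp = requests.post(
            f"{self.base_url}/chat/completions",
            headers={"Authorization": f"Bearer {self.api_key}"},
            json={
                "model": self.model,
                "messages": [{"role": "user", "content": prompt}],
                **self.generation_configs,  # attribute name assumed
            },
            timeout=60,
        )
        resp.raise_for_status()
        return resp.json()["choices"][0]["message"]["content"]
```

Registering the loader then means importing `MyAPILLM` in the `llms/__init__.py` file and adding it to `__all__`.
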
6 changes: 4 additions & 2 deletions docs/architecture.md
@@ -2,9 +2,11 @@

## Architecture

Eval Suite is a lightweight, extensible framework with three main components: `benchs`, `llms`, and auxiliary modules. The `benchs` component defines benchmarks, each with datasets and evaluators. The `llms` component manages model loading from OpenAI-compatible APIs or Hugging Face. Auxiliary modules handle CLI, logging, and metrics.

A base evaluator and dataset under `benchs` provide default evaluation logic and data loading, which benchmarks inherit and extend.
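
For orientation, a typical run wires these pieces together roughly as follows. Every name in this snippet (the import paths, `OpenAIAPILLM`, `ExampleQAEvaluator`, and the `evaluate` call) is an assumption used to illustrate the flow, not the framework's actual API.

```python
# Illustrative flow only; class and method names are assumptions.
from eval.llms import OpenAIAPILLM          # a loader from the `llms` component
from eval.benchs import ExampleQAEvaluator  # an evaluator from the `benchs` component

llm = OpenAIAPILLM(model="gpt-4o-mini", base_url="https://api.openai.com/v1")
evaluator = ExampleQAEvaluator(llm)

results = evaluator.evaluate()                 # per-sample scoring
overall = evaluator.compute_overall(results)   # aggregated metrics
print(overall)
```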

## Structure

```bash
eval
# … (remainder of the directory tree is collapsed in the diff view)
```
