docs: update
Ki-Seki committed Aug 31, 2024
1 parent 4a09004 commit 78105e6
Showing 6 changed files with 64 additions and 37 deletions.
10 changes: 9 additions & 1 deletion .github/CONTRIBUTING.md
@@ -2,6 +2,14 @@

We appreciate your interest in contributing. To ensure a smooth collaboration, please review the following guidelines.

> [!NOTE]
> Please ensure that your code passes all tests and the `black` formatting check before opening a pull request.
> You can run the following commands to check your code:
> ```bash
> python -m unittest discover -s tests/ -p 'test*.py' -v
> black . --check
> ```

## How to Contribute

1. Get the latest version of the repository:
@@ -23,7 +31,7 @@ We appreciate your interest in contributing. To ensure a smooth collaboration, p
## Code Style
- (Mandatory) Use [black](https://black.readthedocs.io/en/stable/) to format code
- Use [isort](https://pycqa.github.io/isort/) to reorder import statements
- Use [Google Docstring Format](https://sphinxcontrib-napoleon.readthedocs.io/en/latest/example_google.html) to standardize docstrings (a short example follows this list)
- Use [Conventional Commits](https://www.conventionalcommits.org/) to make commit messages more readable
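
For reference, a Google-style docstring looks roughly like the sketch below. The function itself is purely illustrative and not taken from this repository.

```python
def accuracy(correct: int, total: int) -> float:
    """Compute the fraction of correctly answered items.

    Args:
        correct: Number of items answered correctly.
        total: Total number of evaluated items.

    Returns:
        The accuracy as a float in the range [0, 1].

    Raises:
        ValueError: If ``total`` is zero.
    """
    if total == 0:
        raise ValueError("total must be positive")
    return correct / total
```

A matching Conventional Commits message for such a change could read `feat(metrics): add accuracy helper`.
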
6 changes: 6 additions & 0 deletions CITATION.bib
@@ -0,0 +1,6 @@
@article{liang2023uhgeval,
title={Uhgeval: Benchmarking the hallucination of chinese large language models via unconstrained generation},
author={Liang, Xun and Song, Shichao and Niu, Simin and Li, Zhiyu and Xiong, Feiyu and Tang, Bo and Wy, Zhaohui and He, Dawei and Cheng, Peng and Wang, Zhonghao and others},
journal={arXiv preprint arXiv:2311.15296},
year={2023}
}
29 changes: 0 additions & 29 deletions CITATION.cff

This file was deleted.

2 changes: 1 addition & 1 deletion README.md
@@ -141,10 +141,10 @@ To facilitate evaluation, we have developed a user-friendly evaluation framework

<details><summary>Click me to show all TODOs</summary>

- [ ] docs: finish all TODOs in docs
- [ ] feat: vLLM offline inference benchmarking
- [ ] build: packaging
- [ ] feat(benchs): add TruthfulQA benchmark
- [ ] docs: update citation with DOI
- [ ] other: promotion

</details>
48 changes: 44 additions & 4 deletions docs/add-bench-or-model.md
@@ -1,9 +1,49 @@
# Customization Guidelines

## Adding a New Benchmark

You can refer to the structure of the `eval/benchs/exampleqa` folder, which serves as a minimal benchmark example. Additionally, you might want to check the `eval/benchs/base_dataset.py` and `eval/benchs/base_evaluator.py` files, as they provide the base classes for benchmarks.

1. **Creating a Benchmark Folder**
- Create a new folder under the `benchs` directory.
- The folder should contain the dataset, evaluator, and any other necessary files.
- The folder should include a `README.md` file to provide an overview of the benchmark and any specific instructions for running it.

2. **Dataset**
- Ensure the benchmark folder includes a `dataset.py` file, which contains the dataset loading logic.
- Implement a subclass that inherits from `BaseDataset`.
- The subclass must implement the `load` method, which returns a `list[dict]`, where each element is an evaluation data sample and must contain a unique `id` field.

3. **Evaluator**
- The folder must include an evaluator script, typically named `eval_{benchmark_name}.py`, which implements the evaluation logic.
- This script should define a subclass of `BaseEvaluator` that contains the benchmark's evaluation logic (a combined sketch of the dataset and evaluator follows this list).
- The subclass should implement `set_generation_configs` to determine the default token generation settings for the LLM during evaluation.
- Implement `load_batched_dataset` to load the batched dataset from `dataset.py`.
- Implement `scoring` to evaluate one data item from the datasets, returning the evaluation result in a dictionary format.
- Implement `compute_overall` to aggregate the evaluation results into an overall assessment, returning a dictionary with the final evaluation results.

4. **Registering the Benchmark**
- Add the benchmark to the `__init__.py` file under the `benchs` directory to ensure it is discoverable by the framework.
- Import the benchmark class in the `__init__.py` file and add it to the `__all__` list.

5. **Documentation**
- Create a `README.md` file in the benchmark folder.
- Include any specific instructions or requirements for running the benchmark.
- Add one line to the `README.md` file in the root directory under the `## Eval Suite` section to introduce the new benchmark.
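
Putting steps 2 through 4 together, a new benchmark might look roughly like the sketch below. It follows the method names described above, but the exact signatures of `BaseDataset` and `BaseEvaluator` may differ from this outline; the `MyQA` benchmark, its file paths, and attributes such as `data_path`, `batch_size`, and the model-request call are assumptions made for illustration.

```python
# benchs/myqa/dataset.py: hypothetical dataset for a benchmark called "MyQA"
import json

from ..base_dataset import BaseDataset  # actual import path may differ


class MyQADataset(BaseDataset):
    def load(self) -> list[dict]:
        # Each sample must carry a unique "id" field.
        with open(self.data_path, encoding="utf-8") as f:  # `data_path` is assumed
            samples = json.load(f)
        return [{"id": f"myqa_{i:04d}", **sample} for i, sample in enumerate(samples)]


# benchs/myqa/eval_myqa.py: hypothetical evaluator
from ..base_evaluator import BaseEvaluator
from .dataset import MyQADataset


class MyQAEvaluator(BaseEvaluator):
    def set_generation_configs(self) -> None:
        # Default token-generation settings for the LLM during evaluation.
        self.generation_configs = {"max_new_tokens": 64, "temperature": 0.0}

    def load_batched_dataset(self) -> list[list[dict]]:
        data = MyQADataset(self.data_path).load()
        size = self.batch_size  # attribute name assumed
        return [data[i : i + size] for i in range(0, len(data), size)]

    def scoring(self, data_point: dict) -> dict:
        # How the evaluator queries the model is framework-specific;
        # `self.model.safe_request` is only a placeholder name here.
        response = self.model.safe_request(data_point["question"])
        return {
            "id": data_point["id"],
            "correct": data_point["reference"].strip() in response,
        }

    def compute_overall(self, results: list[dict]) -> dict:
        # Aggregate per-sample results into the final report.
        return {"accuracy": sum(r["correct"] for r in results) / len(results)}
```

Registration (step 4) then amounts to importing `MyQAEvaluator` in the `__init__.py` under `benchs` and adding it to `__all__`.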

## Adding a New Model Loader

You can refer to the `eval/llms/huggingface.py` and `eval/llms/openai_api.py` files as examples for loading LLMs.

1. **Language Model Loader**
- Create a new file under the `llms` directory.
- The file should contain the logic for loading the LLM from a specific source (e.g., Hugging Face, OpenAI API).

2. **Implementation Steps**
- Implement a subclass that inherits from `BaseLLM`.
- The subclass must implement `update_generation_configs` to handle parameter conversions, as different LLM loaders may have varying parameter names (e.g., `max_tokens` vs. `max_new_tokens`).
- The subclass must implement `_request`, which accepts a `str` as input and returns a `str` as the generated output (see the sketch after this list).

3. **Registering the LLM Loader**
- Register the new LLM loader in the `__init__.py` file under the `llms` directory.
- Import the new LLM loader in the `__init__.py` file and add it to the `__all__` list.
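
As a rough sketch of these steps, a loader for an OpenAI-compatible HTTP endpoint might look like the following. The module name, constructor arguments, and the `generation_configs` attribute are assumptions; the actual `BaseLLM` interface under `eval/llms` may differ.

```python
# llms/my_api.py: hypothetical loader for an OpenAI-compatible endpoint
import requests  # the real loaders may use a different HTTP client or SDK

from .base_llm import BaseLLM  # actual import path of BaseLLM may differ


class MyAPILLM(BaseLLM):
    def __init__(self, model: str, base_url: str, api_key: str, **kwargs):
        super().__init__(**kwargs)
        self.model = model
        self.base_url = base_url
        self.api_key = api_key

    def update_generation_configs(self, configs: dict) -> dict:
        # Translate framework-level parameter names into what this backend
        # expects, e.g. `max_new_tokens` (Hugging Face style) -> `max_tokens`.
        configs = dict(configs)
        if "max_new_tokens" in configs:
            configs["max_tokens"] = configs.pop("max_new_tokens")
        return configs

    def _request(self, prompt: str) -> str:
        # Single prompt string in, single generated string out.
        resp = requests.post(
            f"{self.base_url}/chat/completions",
            headers={"Authorization": f"Bearer {self.api_key}"},
            json={
                "model": self.model,
                "messages": [{"role": "user", "content": prompt}],
                **self.generation_configs,  # attribute name assumed
            },
            timeout=60,
        )
        resp.raise_for_status()
        return resp.json()["choices"][0]["message"]["content"]
```

Registering the loader then means importing `MyAPILLM` in the `llms/__init__.py` file and adding it to `__all__`.
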
6 changes: 4 additions & 2 deletions docs/architecture.md
@@ -2,9 +2,11 @@

## Architecture

Eval Suite is a lightweight, extensible framework with three main components: `benchs`, `llms`, and auxiliary modules. The `benchs` component defines benchmarks, each with datasets and evaluators. The `llms` component manages model loading from OpenAI-compatible APIs or Hugging Face. Auxiliary modules handle CLI, logging, and metrics.

A base evaluator and dataset under `benchs` provide default evaluation logic and data loading, which benchmarks inherit and extend.
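
For orientation, a typical run wires these pieces together roughly as follows. Every name in this snippet (the import paths, `OpenAIAPILLM`, `ExampleQAEvaluator`, and the `evaluate` call) is an assumption used to illustrate the flow, not the framework's actual API.

```python
# Illustrative flow only; class and method names are assumptions.
from eval.llms import OpenAIAPILLM          # a loader from the `llms` component
from eval.benchs import ExampleQAEvaluator  # an evaluator from the `benchs` component

llm = OpenAIAPILLM(model="gpt-4o-mini", base_url="https://api.openai.com/v1")
evaluator = ExampleQAEvaluator(llm)

results = evaluator.evaluate()                 # per-sample scoring
overall = evaluator.compute_overall(results)   # aggregated metrics
print(overall)
```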

## Structure

```bash
eval
# … (remainder of the directory tree is collapsed in the diff view)
```
