DCR-Consistency: Divide-Conquer-Reasoning for Consistency Evaluation and Improvement of Large Language Models
DCR-Consistency is a novel framework that uses LLM agents to detect and mitigate inconsistencies, in other words hallucinations. It takes advantage of LLMs' strength in semantic understanding while circumventing known pitfalls such as their relatively poor performance in math. For more details, please see our paper.
Given a `reference` as the ground truth and a `candidate` to evaluate, it outputs a numeric score in [0, 1] indicating consistency, where 0 means no sentence in the `candidate` is consistent with the `reference` and 1 means every sentence is. It also outputs a list of `reasons` explaining why that score was given. Better yet, based on those `reasons`, it can improve the `candidate` and mitigate the detected inconsistencies.
It is composed of three parts:
- DCE (Divide-Conquer Evaluator) takes a `reference` and a `candidate`, evaluates the consistency between the two at the sentence level, and outputs a list of `reasons` describing the consistency check for each sentence in the `candidate`.
- AMC (Auto-Metric Converter) takes the output of DCE and converts it to a numeric score in [0, 1] (see the sketch below).
- RAI (Reason-Assisted Improver) takes the `reasons` output by DCE and generates improved versions of the `candidate` that mitigate the detected inconsistencies.
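For intuition, the score can be read as the fraction of `candidate` sentences judged consistent, which matches the endpoints above (0 = no sentence consistent, 1 = all sentences consistent). Below is a minimal sketch of that reading, assuming boolean per-sentence verdicts; the actual AMC is an LLM agent that works from DCE's `reasons`, so this is an approximation, not the package's implementation.

```python
# Conceptual sketch only: score = fraction of consistent sentences.
# The real AMC derives the score from DCE's per-sentence reasons via an LLM.
def consistency_score(verdicts: list[bool]) -> float:
    """verdicts: one boolean per candidate sentence (True = consistent)."""
    if not verdicts:
        return 0.0
    return sum(verdicts) / len(verdicts)

print(consistency_score([True, True, False]))  # 0.666... -> mostly consistent
```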
We evaluated the DCR-Consistency framework on a wide range of datasets: QQP, PAWS-QQP, SummEval, QAGS-CNN, and QAGS-XSUM.
Below is a comparison of DCR-Consistency with state-of-the-art metrics on the SummEval consistency benchmark. We include established metrics such as BERTScore as well as newer LLM-based (GPT-3.5/4) ones such as G-Eval. DCR-Consistency outperforms these metrics by a large margin.
We also evaluated DCR-Consistency's effectiveness at inconsistency mitigation. Below is an illustration of how the consistency rate changes over iterations of applying DCR-Consistency. We observe effective mitigation on all three datasets, with 100% of detected inconsistencies mitigated within three rounds.
- Ensure you have Python >= 3.9
- Clone this repo and install it with
  ```
  pip install .
  ```
DCR-Consistency can also be installed directly from pip (coming soon!):
```
pip install dcr-consistency
```
The easiest way to start is to play with the example in `examples/example.py`. To do so:
- Install the DCR-Consistency package with the steps above
- Install the necessary packages with the command below (the example uses additional dependencies such as openai):
  ```
  pip install -r examples/requirements_example.txt
  ```
- Update the `api_key` variable with your API key.
- Run the example with
  ```
  python examples/example.py
  ```
To run an evaluation, call `evaluate`:

```
res = evaluate(_your_LLM_, _your_model_config_, data, worker_count=5)
```
- your_LLM: Your own object that handles communication with the LLM. It should follow the contract of the LLM abstract class, which leaves you free to use whatever LLM you desire. An example can be found here
- your_model_config: Whatever parameters your LLM needs. An example can be found here
- worker_count: The number of threads used to run the program
- data: The `data` field is a list of items to run. By default, each item should be a dict containing the fields `id`, `reference`, and `candidate`. Each returned item is the original item passed in, joined with the columns below:
| column | meaning |
|---|---|
| id | Unique identifier for each row |
| score | Final consistency score of the row |
| dce_reasons | Reasons for the final score, given by DCE |
| amc_reasons | Reasons for the scoring of each sentence, given by AMC |
| dce_raw | Raw output from DCE |
| amc_raw | Raw output from AMC |
| decision | Consistency decision based on DCE |
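For illustration, here is a minimal sketch of an `evaluate` call. `MyLLM` and its `generate` method are placeholders, not the package's actual contract: subclass the LLM abstract class as shown in `examples/example.py`, and import `evaluate` from the installed package the same way the example does.

```python
# Sketch only: MyLLM and its method are hypothetical stand-ins for a
# subclass of the package's LLM abstract class (see examples/example.py).

class MyLLM:
    """Hypothetical wrapper around your LLM provider (e.g. openai)."""
    def __init__(self, api_key: str):
        self.api_key = api_key

    def generate(self, prompt: str) -> str:  # assumed interface, for shape only
        raise NotImplementedError("call your provider here")

model_config = {"model": "gpt-4", "temperature": 0}  # whatever your LLM needs

# One dict per row, with the three required fields.
data = [
    {
        "id": "0",
        "reference": "The Eiffel Tower is in Paris. It opened in 1889.",
        "candidate": "The Eiffel Tower, located in Paris, opened in 1890.",
    }
]

# evaluate() returns the input rows joined with the columns above.
res = evaluate(MyLLM(api_key="sk-..."), model_config, data, worker_count=5)
for row in res:
    print(row["id"], row["score"], row["decision"])
```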
To improve a `candidate` based on the evaluation output, call `improve`:

```
res = improve(_your_LLM_, _your_model_config_, data, worker_count=5)
```
- your_LLM: Your own object that handles communication with the LLM. It should follow the contract of the LLM abstract class, which leaves you free to use whatever LLM you desire. An example can be found here
- your_model_config: Whatever parameters your LLM needs. An example can be found here
- worker_count: The number of threads used to run the program
- data: The `data` field is a list of items to run. By default, each item should be a dict containing the fields `id`, `article`, and `sentences`. `article` is the `reference` passed into the evaluator; `sentences` can be extracted from the evaluator's output and is a list recording each original sentence, whether it is consistent with the reference, and the reasons why. Each returned item is the original item passed in, joined with the columns below:
| column | meaning |
|---|---|
| id | Unique identifier for each row |
| improved_version | The improved version, with detected inconsistencies mitigated |
| rai_raw | Raw output from RAI |
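Below is a matching sketch for `improve`, reusing `MyLLM` and `model_config` from the `evaluate` sketch above. The key names inside each `sentences` entry are assumptions for illustration only; in practice you extract `sentences` from the evaluator's output.

```python
# Sketch only: the keys inside each `sentences` entry are hypothetical;
# in practice, extract `sentences` from the evaluator's output.
improve_data = [
    {
        "id": "0",
        "article": "The Eiffel Tower is in Paris. It opened in 1889.",  # the reference
        "sentences": [
            {
                "sentence": "The Eiffel Tower, located in Paris, opened in 1890.",
                "consistent": False,  # hypothetical field name
                "reason": "The reference says it opened in 1889, not 1890.",
            }
        ],
    }
]

res = improve(MyLLM(api_key="sk-..."), model_config, improve_data, worker_count=5)
print(res[0]["improved_version"])  # candidate rewritten to match the reference
```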
See CONTRIBUTING.md.
@misc{cui2023dcr,
      title={DCR-Consistency: Divide-Conquer-Reasoning for Consistency Evaluation and Improvement of Large Language Models},
      author={Wendi Cui and Jiaxin Zhang and Zhuohang Li and Damien Lopez and Kamalika Das and Bradley Malin and Sricharan Kumar},
      year={2023},
      eprint={2401.02132},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}