Showing 21 changed files with 138 additions and 43 deletions.
@@ -0,0 +1,92 @@
---
title: Evaluators and Tests
sidebar:
  badge:
    text: new
    variant: tip
---

## Adding Evaluators and Tests to a Pipeline

You can optionally add `eval` and `tests` to any module whose performance you want to measure.

Use the `eval` field to select the relevant evaluation metrics:
- Select the metrics and specify each one's inputs according to the data fields it requires: `MetricName().use(data_fields)`.
- Metric inputs can reference items from three sources:
  - **From `dataset`**: e.g. `ground_truth_context = dataset.ground_truth_contexts`
  - **From the current module**: e.g. `answer = ModuleOutput()`
  - **From prior modules**: e.g. `retrieved_context = ModuleOutput(DocumentsContent, module=reranker)`, where `DocumentsContent = ModuleOutput(lambda x: [z["page_content"] for z in x])` selects specific items from the prior module's output

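To make the selector in the last bullet concrete, here is a minimal sketch in plain Python of what the lambda does to a module's output. It does not call `continuous_eval`; the helper name `select_page_content`, the sample documents, and the `source` key are all made up for illustration.

```python
from typing import Dict, List

Documents = List[Dict[str, str]]


def select_page_content(docs: Documents) -> List[str]:
    """Same logic as the lambda wrapped in ModuleOutput: keep only each document's text."""
    return [doc["page_content"] for doc in docs]


# Hypothetical retriever output; the "source" key is an assumption for this example.
sample_output: Documents = [
    {"page_content": "Paris is the capital of France.", "source": "doc_1"},
    {"page_content": "Berlin is the capital of Germany.", "source": "doc_2"},
]

selected = select_page_content(sample_output)
print(selected)
```

The metric then receives a plain list of strings as `retrieved_context` instead of the full document dictionaries.
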
Use the `tests` field to define specific performance criteria:
- Select a test class: `GreaterOrEqualThan` runs the test on each datapoint, while `MeanGreaterOrEqualThan` runs it on the mean over the whole dataset.
- Define `test_name`, `metric_name` (must be one of the metric names that `eval` calculates), and `min_value`.

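The difference between a per-datapoint threshold and a mean threshold can be sketched in plain Python. This is an illustration of the two ideas only, not the library's implementation: the function names and the scores are hypothetical, and how `continuous_eval` turns per-datapoint results into a final verdict is not specified here.

```python
from statistics import mean
from typing import List


def per_datapoint_check(scores: List[float], min_value: float) -> List[bool]:
    """Compare every datapoint's score against the threshold."""
    return [score >= min_value for score in scores]


def mean_check(scores: List[float], min_value: float) -> bool:
    """Compare the dataset-level mean score against the threshold."""
    return mean(scores) >= min_value


# Made-up metric scores for four datapoints.
scores = [1.0, 1.0, 1.0, 0.75]

per_point = per_datapoint_check(scores, min_value=0.9)
pass_rate = sum(per_point) / len(per_point)  # fraction of datapoints above the threshold
mean_passes = mean_check(scores, min_value=0.9)

print(per_point, pass_rate, mean_passes)
```

With these scores the mean clears the threshold even though one datapoint does not, which is why the choice of test class matters when a few weak datapoints should not go unnoticed.
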
Below is a full example of a two-step pipeline.

```python
from typing import Dict, List

from continuous_eval.eval import Module, Pipeline, Dataset, ModuleOutput
from continuous_eval.metrics.retrieval import PrecisionRecallF1  # Deterministic metric
from continuous_eval.metrics.generation.text import (
    FleschKincaidReadability,  # Deterministic metric
    DebertaAnswerScores,  # Semantic metric
    LLMBasedFaithfulness,  # LLM-based metric
)
from continuous_eval.eval.tests import GreaterOrEqualThan, MeanGreaterOrEqualThan

dataset = Dataset("data/eval_golden_dataset")

Documents = List[Dict[str, str]]
# Selector that keeps only the text of each retrieved document
DocumentsContent = ModuleOutput(lambda x: [z["page_content"] for z in x])

base_retriever = Module(
    name="base_retriever",
    input=dataset.question,
    output=Documents,
    eval=[
        PrecisionRecallF1().use(  # Reference-based metric that compares the Retrieved Context with the Ground Truths
            retrieved_context=DocumentsContent,
            ground_truth_context=dataset.ground_truth_contexts,
        ),
    ],
    tests=[
        GreaterOrEqualThan(  # Set a test using context_recall, a metric calculated by PrecisionRecallF1()
            test_name="Context Recall", metric_name="context_recall", min_value=0.9
        ),
    ],
)

llm = Module(
    name="answer_generator",
    input=base_retriever,
    output=str,
    eval=[
        FleschKincaidReadability().use(  # Reference-free metric that only uses the output of the module
            answer=ModuleOutput()
        ),
        DebertaAnswerScores().use(  # Reference-based metric that compares the Answer with the Ground Truths
            answer=ModuleOutput(), ground_truth_answers=dataset.ground_truths
        ),
        LLMBasedFaithfulness().use(  # Reference-free metric that uses output from a prior module (Retrieved Context) to evaluate the answer
            answer=ModuleOutput(),
            retrieved_context=ModuleOutput(DocumentsContent, module=base_retriever),  # DocumentsContent from the retriever module
            question=dataset.question,
        ),
    ],
    tests=[
        MeanGreaterOrEqualThan(  # Compares the aggregate result over the dataset against the min_value
            test_name="Readability", metric_name="flesch_reading_ease", min_value=20.0
        ),
        GreaterOrEqualThan(  # Compares each result in the dataset against the min_value, and outputs the mean
            test_name="Deberta Entailment", metric_name="deberta_entailment", min_value=0.8
        ),
    ],
)

pipeline = Pipeline([base_retriever, llm], dataset=dataset)
print(pipeline.graph_repr())  # Visualize the pipeline in Mermaid graph format
```