Adding a New Metric

First, check whether one of the parametrized functions in lighteval.metrics.metrics_corpus or lighteval.metrics.metrics_sample already covers your needs.
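
For example, if plain exact match is all you need, you can parametrize one of the existing sample-level classes instead of writing your own. The class and constructor parameter below come from the current repository layout and may differ in your installed version:

from lighteval.metrics.metrics_sample import ExactMatches

# Parametrize the existing implementation rather than rewriting it; the bound
# compute method can then be used as a sample_level_fn (see below).
exact_match_fn = ExactMatches(strip_strings=True).compute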

If not, you can use the custom_task system to register your new metric:

Tip

To see an example of a custom metric added along with a custom task, look at the IFEval custom task.

Warning

To contribute your custom metric to the lighteval repo, you would first need to install the required dev dependencies by running pip install -e .[dev] and then run pre-commit install to install the pre-commit hooks.

  • Create a new Python file that contains the full logic of your metric.
  • The file also needs to start with these imports:
from aenum import extend_enum
from lighteval.metrics import Metrics
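
Depending on which of the snippets below you use, a few more imports are typically needed. The module paths shown here are best-effort assumptions about the repository layout, so adjust them to your installed version:

import numpy as np

# Assumed locations of the metric building blocks; check your lighteval version.
from lighteval.metrics.utils.metric_utils import (
    MetricCategory,
    MetricUseCase,
    SampleLevelMetric,
    SampleLevelMetricGrouping,
)
from lighteval.tasks.requests import Doc  # Doc is the formatted sample passed to your metric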

You need to define a sample-level metric:

def custom_metric(predictions: list[str], formatted_doc: Doc, **kwargs) -> bool:
    response = predictions[0]
    return response == formatted_doc.choices[formatted_doc.gold_index]

Here the sample-level metric returns a single value per sample. If you want to return multiple metrics per sample, return a dictionary with the metric names as keys and their values as values.

def custom_metric(predictions: list[str], formatted_doc: Doc, **kwargs) -> dict:
    response = predictions[0]
    return {"accuracy": response == formatted_doc.choices[formatted_doc.gold_index], "other_metric": 0.5}

Then, you can define an aggregation function if needed; a common aggregation function is np.mean.

def agg_function(items):
    # Flatten the nested per-sample scores, then average them.
    flat_items = [item for sublist in items for item in sublist]
    score = sum(flat_items) / len(flat_items)
    return score

Finally, you can define your metric. If it's a sample-level metric, you can use the following code:

my_custom_metric = SampleLevelMetric(
    metric_name={custom_metric_name},
    higher_is_better={either True or False},
    category={MetricCategory},
    use_case={MetricUseCase},
    sample_level_fn=custom_metric,
    corpus_level_fn=agg_function,
)
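
As an illustration, a filled-in version might look like this (the MetricCategory and MetricUseCase members are plausible examples rather than prescriptions; pick whatever matches your metric):

my_custom_metric = SampleLevelMetric(
    metric_name="my_accuracy",
    higher_is_better=True,
    category=MetricCategory.GENERATIVE,
    use_case=MetricUseCase.ACCURACY,
    sample_level_fn=custom_metric,
    corpus_level_fn=np.mean,
)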

If your metric defines multiple metrics per sample, you can use the following code:

my_custom_metric = SampleLevelMetricGrouping(
    metric_name={submetric_names},
    higher_is_better={n: {True or False} for n in submetric_names},
    category={MetricCategory},
    use_case={MetricUseCase},
    sample_level_fn=custom_metric,
    corpus_level_fn={
        "accuracy": np.mean,
        "other_metric": agg_function,
    },
)
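
Note that the keys in metric_name, higher_is_better, and corpus_level_fn have to match the keys returned by your sample-level function. Filled in for the two-submetric example above, this could look roughly like the following (category and use case again being illustrative choices):

submetric_names = ["accuracy", "other_metric"]

my_custom_metric = SampleLevelMetricGrouping(
    metric_name=submetric_names,
    higher_is_better={n: True for n in submetric_names},
    category=MetricCategory.GENERATIVE,
    use_case=MetricUseCase.ACCURACY,
    sample_level_fn=custom_metric,
    corpus_level_fn={
        "accuracy": np.mean,
        "other_metric": agg_function,
    },
)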

To finish, add the following so that your metric is added to lighteval's metrics list when the file is loaded as a module.

# Adds the metric to the metric list!
extend_enum(Metrics, "my_custom_metric", my_custom_metric)
if __name__ == "__main__":
    print("Imported metric")

You can then give your custom metric to lighteval by using --custom-tasks path_to_your_file when launching it.
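
The same file typically also defines the custom task that uses the metric (see the IFEval example mentioned above). As a rough, hypothetical sketch, where the dataset, column names, and field values are placeholders and the exact LightevalTaskConfig signature depends on your lighteval version:

from lighteval.tasks.lighteval_task import LightevalTaskConfig
from lighteval.tasks.requests import Doc


def prompt_fn(line, task_name: str = None):
    # Hypothetical dataset columns; adapt to your own dataset schema.
    return Doc(
        task_name=task_name,
        query=line["question"],
        choices=line["choices"],
        gold_index=line["gold_index"],
    )


my_task = LightevalTaskConfig(
    name="my_task",
    prompt_function=prompt_fn,
    suite=["community"],
    hf_repo="my_org/my_dataset",
    hf_subset="default",
    metric=[my_custom_metric],  # or Metrics.my_custom_metric once registered
)

# Custom-task files expose their tasks through TASKS_TABLE.
TASKS_TABLE = [my_task]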