Repo containing code and data for the analysis presented in: "E Pluribus Unum: Guidelines on Multi-Objective Evaluation of Recommender Systems"
Clone this repository and execute the following code from the root of the project to replicate the original evaluation protocol of EvalRS 2022.
import pandas as pd

# load the scores of all EvalRS 2022 submissions
data = pd.read_parquet("./data/evalrs_2022_submissions.parquet.snappy")
MRED is an equality-difference measure and therefore has to be minimized. Since all remaining evaluation metrics have to be maximized, we report the negated MRED scores here, following Eq. 1 in the paper; as a result, the aggregated score has to be maximized.
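As a minimal illustration of this sign convention (the numbers below are made up and are not taken from the dataset):

```python
# Two hypothetical raw MRED values: a smaller disparity (0.004) is better
# than a larger one (0.007), i.e. raw MRED would have to be minimized.
raw_mred_a, raw_mred_b = 0.004, 0.007

# Reported negated, the better submission gets the larger value
# (-0.004 > -0.007), so every metric follows the same "higher is better" direction.
neg_mred_a, neg_mred_b = -raw_mred_a, -raw_mred_b
```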
The score assigned to each first-stage submission is a simple average across all metrics:
METRICS = [
"hit_rate",
"mrr",
"mred_country",
"mred_track_popularity",
"mred_user_activity",
"mred_artist_popularity",
"mred_gender",
"being_less_wrong",
"latent_diversity",
]
def get_score_stage_one(data: pd.DataFrame):
return data[METRICS].mean(axis=1)
data["score_1"] = get_score_stage_one(data)
To compute the final score, we:
- picked the best submission for each team in phase 1 (a minimal sketch of this step follows the list);
- filtered out submissions below a hit-rate threshold;
- computed a min-max normalization using a CBOW baseline (min) and the per-metric best (max); for convenience, the code below hardcodes the metric values for both the baseline and the best results;
- aggregated the metrics with a weighted average.
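The per-team selection in the first step is not shown in the scoring code below; a minimal sketch of it, assuming the submissions table contains a team identifier column (called `team` here purely for illustration) and that `score_1` has already been computed as above, could look like this:

```python
# Hypothetical sketch: keep only each team's highest-scoring phase-1 submission.
# The `team` column name is an assumption and may differ in the released data.
best_per_team = (
    data.sort_values("score_1", ascending=False)
        .groupby("team")
        .head(1)
)
```

For the normalization step, as a concrete example with the hardcoded values below, a hypothetical submission with a hit rate of 0.14 would be normalized to (0.14 - 0.018763) / (0.264642 - 0.018763) ≈ 0.49.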
Use the following code to replicate our process.
from dataclasses import dataclass
@dataclass
class PhaseOne:
@property
def baseline(self):
return self._CBOW_SCORES
@property
def best(self):
return self._BEST_SCORE_P1
HR_THRESHOLD = 0.015 # ~20% below the CBOW hit rate
# scores of our CBOW baseline
_CBOW_SCORES = {
"hit_rate": 0.018763,
"mrr": 0.001654,
"mred_country": -0.006944,
"mred_user_activity": -0.012460,
"mred_track_popularity": -0.006816,
"mred_artist_popularity": -0.003915,
"mred_gender": -0.004354,
"being_less_wrong": 0.2744871, # Original score (0.322926) decreased by 15%
"latent_diversity": -0.324706
}
# best per-metric scores from phase 1, considering only each team's best submission
_BEST_SCORE_P1 = {
"hit_rate": 0.264642,
"mrr": 0.067493,
"mred_country": -0.004490,
"mred_user_activity": -0.006922,
"mred_track_popularity": -0.005865,
"mred_artist_popularity": -0.003623,
"mred_gender": -0.000032,
"being_less_wrong": 0.40635,
"latent_diversity": -0.202812
}
def get_score_stage_two(row):
reference = PhaseOne()
# Disqualify submissions that do not meet the minimum hit-rate requirement
if row["hit_rate"] < reference.HR_THRESHOLD:
return -100.0
normalized_scores = dict()
for test in METRICS:
normalized_scores[test] = (
row[test] - reference.baseline[test]
) / (reference.best[test] - reference.baseline[test])
# Computing meta-scores
# Performance
ms_perf = (normalized_scores["hit_rate"] + normalized_scores["mrr"]) / 2
# Fairness / Slicing
ms_fair = (
normalized_scores["mred_country"] +
normalized_scores["mred_user_activity"] +
normalized_scores["mred_track_popularity"] +
normalized_scores["mred_artist_popularity"] +
normalized_scores["mred_gender"]
) / 5
# Behavioral
ms_behav = (
normalized_scores["being_less_wrong"] + normalized_scores["latent_diversity"]
) / 2
# Meta-score weights: performance, fairness/slicing, behavioral
w = 1, 1.5, 1.5
score = (w[0] * ms_perf + w[1] * ms_fair + w[2] * ms_behav) / sum(w)
return score
data["score_2"] = data.apply(get_score_stage_two, axis=1)