Adds recall@k metric to rank eval API #52577
Conversation
I did an initial pass, but haven't yet done a detailed review as it sounds like you're still sorting out some details.
For me, the new confusion matrix abstraction adds more complexity than it’s worth:
- We compute true negatives, but none of our metrics actually use that value yet. To calculate true negatives, we must add a new 'total hits' parameter to EvaluationMetric#evaluate.
- To me it doesn't make it easier to understand each calculation (although maybe others have a different intuition).
Perhaps we could hold off on adding an abstraction until we add more metrics that would benefit from sharing logic?
Yeah, I stuck it in to see how it would look, but I agree it's more than we need and can take it out.
That's basically why I added it: I find it easier to reason about these calculations with a confusion matrix, thinking about true positives, false positives, etc. instead of more abstract definitions like recall being "the proportion of relevant documents in the top-k vs all possible relevant documents", but maybe that's just me 😄 I'm not very opinionated about that detail so I'm happy either way.
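For concreteness, here is a small standalone sketch of the two framings being discussed. It is not the PR's actual implementation; the RatedHit type and field names are made up for illustration. It shows that the confusion-matrix view and the direct "relevant retrieved over total relevant" view compute the same recall@k value.

```java
import java.util.List;

// Hypothetical minimal representation of a rated search hit, for illustration only.
record RatedHit(boolean relevant) {}

class RecallSketch {

    // Confusion-matrix framing: recall = TP / (TP + FN), where
    // TP = relevant docs retrieved in the top k, and
    // FN = relevant docs that exist but were not retrieved in the top k.
    static double recallFromConfusionMatrix(List<RatedHit> topK, long totalRelevant) {
        long truePositives = topK.stream().filter(RatedHit::relevant).count();
        long falseNegatives = totalRelevant - truePositives;
        return truePositives + falseNegatives == 0 ? 0.0
            : (double) truePositives / (truePositives + falseNegatives);
    }

    // Direct definition: the proportion of all relevant documents that appear
    // in the top k results. Algebraically identical to the method above,
    // since TP + FN is just the total number of relevant documents.
    static double recallDirect(List<RatedHit> topK, long totalRelevant) {
        long relevantRetrieved = topK.stream().filter(RatedHit::relevant).count();
        return totalRelevant == 0 ? 0.0 : (double) relevantRetrieved / totalRelevant;
    }
}
```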
zOMG, builds are finally passing. I'm removing the confusion matrix abstraction, adding some better recall tests, then I'll ping for a review again after that. @sbourke this should make adding MAP a bit easier, so you can base it off this change set.
I'm adding docs as well. I just realised I can do that in the same PR. Will get to that tomorrow though.
This is looking good to me, just left a few small comments. Some higher-level notes:
- There is quite a bit of duplication between PrecisionAtK and RecallAtK in terms of serialization code, getters + setters, etc. (the kind of boilerplate sketched below). We could try a refactor to pull out some shared code, but to me it's best to keep the implementation straightforward for now.
- I think the current approach is fine from a backwards-compatibility perspective. When nodes containing this new metric try to communicate with nodes without the implementation, we will simply fail (because we won't find a matching 'named writeable').
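To illustrate the duplication being discussed, here is a rough sketch of the shape of the per-metric serialization boilerplate. It is not the PR's actual code; the class name, field names, and exact wire format are illustrative, but PrecisionAtK and RecallAtK each end up carrying a near-identical pattern like this.

```java
// Rough shape of the per-metric boilerplate; not copied from the PR.
public class RecallAtKSketch /* implements EvaluationMetric */ {
    public static final String NAME = "recall";

    private final int relevantRatingThreshold;
    private final int k;

    public RecallAtKSketch(int relevantRatingThreshold, int k) {
        this.relevantRatingThreshold = relevantRatingThreshold;
        this.k = k;
    }

    // Read side: a StreamInput constructor mirroring writeTo.
    public RecallAtKSketch(org.elasticsearch.common.io.stream.StreamInput in) throws java.io.IOException {
        this.relevantRatingThreshold = in.readVInt();
        this.k = in.readVInt();
    }

    // Write side: serialize the same fields in the same order.
    public void writeTo(org.elasticsearch.common.io.stream.StreamOutput out) throws java.io.IOException {
        out.writeVInt(relevantRatingThreshold);
        out.writeVInt(k);
    }

    // The name nodes use to look up the matching 'named writeable';
    // an older node without this metric registered fails that lookup.
    public String getWriteableName() {
        return NAME;
    }

    public int getRelevantRatingThreshold() { return relevantRatingThreshold; }
    public int getK() { return k; }
}
```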
(Resolved review comment threads on: modules/rank-eval/src/main/java/org/elasticsearch/index/rankeval/RecallAtK.java, modules/rank-eval/src/test/java/org/elasticsearch/index/rankeval/PrecisionAtKTests.java, modules/rank-eval/src/test/java/org/elasticsearch/index/rankeval/RankEvalSpecTests.java)
Yeah, that's why I tried to refactor it out in the confusion matrix version. It's hard to find well-factored patterns for the serialization code since it relies on a lot of static methods and fields. I would vote to leave it as-is for now; perhaps after we add MAP/GMAP we'll start to see better patterns to refactor to. In general, I would prefer to group these metrics into something like:
Maybe we can find better ways to factor things based on these kinds of abstractions. The confusion matrix metric was a first attempt to do that, and it's also how we have done these metrics in the ML plugin for classification, based on a confusion matrix metric. Of course it's a bit different for rank metrics, since you have to deal with top-k situations (e.g. for recall) that don't come up in the ML use cases, which makes it a bit more effort and not as clean to code.
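As a rough illustration of the kind of shared abstraction being discussed, a top-k-aware confusion matrix might look like the sketch below. This is hypothetical, not code from the PR or the ML plugin, and it reuses the same illustrative RatedHit record as in the earlier snippet.

```java
import java.util.List;

// Same illustrative RatedHit as in the earlier sketch.
record RatedHit(boolean relevant) {}

// Hypothetical top-k-aware confusion matrix that precision@k and recall@k could share.
record TopKConfusionMatrix(long truePositives, long falsePositives, long falseNegatives) {

    static TopKConfusionMatrix of(List<RatedHit> topK, long totalRelevant) {
        long tp = topK.stream().filter(RatedHit::relevant).count();
        long fp = topK.size() - tp;
        // In a top-k setting, relevant documents outside the top k count as false negatives.
        long fn = totalRelevant - tp;
        return new TopKConfusionMatrix(tp, fp, fn);
    }

    double precision() {
        long retrieved = truePositives + falsePositives;
        return retrieved == 0 ? 0.0 : (double) truePositives / retrieved;
    }

    double recall() {
        long relevant = truePositives + falseNegatives;
        return relevant == 0 ? 0.0 : (double) truePositives / relevant;
    }
}
```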
> I would vote to leave it as-is for now and perhaps after we add MAP/GMAP we start to see better patterns to refactor to.
Sounds good!
This looks good to me, I think it's ready to merge after you add a note to the docs. If you agree, it'd be good to remove the 'WIP' label, and add the appropriate labels (area, type of change, plus target versions).
Pinging @elastic/es-search (:Search/Ranking)
This change adds the recall@k metric and refactors precision@k to match the new metric.
Recall@k is an important metric to use for learning to rank (LTR). Candidate generation / first phase ranking functions are often optimized for high recall, in order to generate as many relevant candidates in the top-k as possible for a second phase of ranking using LTR or a less efficient ranking function. Adding this metric allows tuning the candidate generation for LTR.
See: #51676
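For reference, a usage sketch with the Java high-level REST client is below. It assumes the rank eval client classes (RankEvalRequest, RankEvalSpec, RatedRequest, RatedDocument) and a RecallAtK constructor taking a relevant-rating threshold and k; treat the exact constructor arguments, index name, field name, and document ids as assumptions for illustration rather than confirmed details of this change.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.index.rankeval.RankEvalRequest;
import org.elasticsearch.index.rankeval.RankEvalResponse;
import org.elasticsearch.index.rankeval.RankEvalSpec;
import org.elasticsearch.index.rankeval.RatedDocument;
import org.elasticsearch.index.rankeval.RatedRequest;
import org.elasticsearch.index.rankeval.RecallAtK;
import org.elasticsearch.search.builder.SearchSourceBuilder;

public class RecallAtKExample {

    // Evaluate recall@20 of a candidate-generation query against labelled documents.
    // Index name, field name, and document ids are placeholders.
    static double evaluateRecall(RestHighLevelClient client) throws IOException {
        List<RatedDocument> ratings = Arrays.asList(
            new RatedDocument("my_index", "doc-1", 1),   // relevant
            new RatedDocument("my_index", "doc-2", 0));  // not relevant

        SearchSourceBuilder candidateQuery = new SearchSourceBuilder()
            .query(QueryBuilders.matchQuery("title", "example search"));

        RatedRequest ratedRequest = new RatedRequest("query_1", ratings, candidateQuery);

        // Assumed constructor arguments: (relevant rating threshold, k).
        RecallAtK metric = new RecallAtK(1, 20);

        RankEvalSpec spec = new RankEvalSpec(Arrays.asList(ratedRequest), metric);
        RankEvalRequest request = new RankEvalRequest(spec, new String[] { "my_index" });

        RankEvalResponse response = client.rankEval(request, RequestOptions.DEFAULT);
        return response.getMetricScore();
    }
}
```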