Adds recall@k metric to rank eval API #52577

Merged
merged 2 commits into elastic:master from joshdevins/rank-eval-recall-at-k on Feb 27, 2020
Conversation

@joshdevins (Member) commented Feb 20, 2020

This change adds the recall@k metric and refactors precision@k to match the new metric.

Recall@k is an important metric for learning to rank (LTR). Candidate generation / first-phase ranking functions are often optimized for high recall, in order to surface as many relevant candidates in the top-k as possible for a second ranking phase that uses LTR or a less efficient ranking function. Adding this metric allows tuning the candidate generation for LTR.

See: #51676
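
For readers coming to this PR without an IR background, here is a minimal, self-contained sketch of the recall@k calculation described above. It is illustrative only; the class and method names are invented for this example and are not the implementation added by this PR.

```java
import java.util.List;
import java.util.Set;

// Illustrative sketch only, not the rank eval module's code.
final class RecallAtKSketch {

    /** recall@k = relevant documents retrieved in the top k / all relevant documents. */
    static double recallAtK(List<String> rankedDocIds, Set<String> relevantDocIds, int k) {
        if (relevantDocIds.isEmpty()) {
            return 0.0; // no relevant documents at all; treated as 0 here for simplicity
        }
        long relevantInTopK = rankedDocIds.stream()
            .limit(k)
            .filter(relevantDocIds::contains)
            .count();
        return (double) relevantInTopK / relevantDocIds.size();
    }

    public static void main(String[] args) {
        List<String> ranked = List.of("d1", "d7", "d3", "d9", "d2");
        Set<String> relevant = Set.of("d1", "d2", "d5");
        // 2 of the 3 relevant documents appear in the top 5 -> 0.666...
        System.out.println(recallAtK(ranked, relevant, 5));
    }
}
```

In the LTR setting described above, this is the number you would try to push towards 1.0 when tuning the candidate-generation query, so that the second-phase ranker sees as many relevant documents as possible.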

@jtibshirani (Contributor) left a comment

I did an initial pass, but haven't yet done a detailed review as it sounds like you're still sorting out some details.

For me, the new confusion matrix abstraction adds more complexity than it’s worth:

  • We compute true negatives, but none of our metrics actually use that value yet. Calculating true negatives also forces a new 'total hits' parameter onto EvaluationMetric#evaluate (see the sketch after this comment).
  • To me the abstraction doesn't make each calculation easier to understand (although maybe others have a different intuition).

Perhaps we could hold off on adding an abstraction until we add more metrics that would benefit from sharing logic?
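
To make the first bullet concrete, here is a rough sketch of the confusion matrix cells at a rank cutoff k (hypothetical helper, not the abstraction from this PR). True negatives are the only cell that cannot be derived from the top-k hits and the relevance judgments alone, which is what forces the extra 'total hits' parameter; precision@k and recall@k never touch that cell.

```java
// Illustrative sketch only, not the PR's confusion matrix or EvaluationMetric code.
final class ConfusionAtKSketch {

    // Derivable from the top-k hits and the judgments alone:
    //   tp = relevant documents retrieved in the top k
    //   fp = documents retrieved in the top k that are not relevant
    //   fn = relevant documents not retrieved in the top k
    // tn is the only cell that needs the size of the whole candidate set,
    // hence a "total hits" style parameter on the evaluate method.
    static long trueNegatives(long totalHits, long tp, long fp, long fn) {
        return totalHits - tp - fp - fn;
    }

    // Neither metric uses tn, so both work without knowing total hits:
    static double precisionAtK(long tp, long fp) { return (double) tp / (tp + fp); }
    static double recallAtK(long tp, long fn)    { return (double) tp / (tp + fn); }
}
```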

@joshdevins (Member, Author) commented Feb 21, 2020

We compute true negatives, but none of our metrics actually use that value yet

Yeah, I stuck it in to see how it would look, but I agree it's more than we need and can take it out.

To me it doesn’t make it easier to understand each calculation

That's basically why I added it: I find it easier to reason about these calculations with a confusion matrix, in terms of true positives, false positives, etc., rather than the more abstract definitions like recall being “the proportion of relevant documents in the top-k vs all possible relevant documents”. But maybe that's just me 😄 I'm not very opinionated about that detail, so I'm happy either way.

@joshdevins (Member, Author) commented

zOMG, builds are finally passing. I'm removing the confusion matrix abstraction, adding some better recall tests, then I'll ping for a review again after that. @sbourke this should make adding MAP a bit easier, so you can base it off this change set.

@joshdevins joshdevins marked this pull request as ready for review February 24, 2020 15:57
@joshdevins joshdevins self-assigned this Feb 24, 2020
@joshdevins (Member, Author) commented Feb 24, 2020

I'm adding Docs as well. I just realised I can do that in the same PR. Will get to that tomorrow though.

@jtibshirani (Contributor) left a comment

This is looking good to me, just left a few small comments. Some higher-level notes:

  • There is quite a bit of duplication between PrecisionAtK and RecallAtK in terms of serialization code, getters + setters, etc. We could try a refactor to pull out some shared code, but to me it's best to keep the implementation straightforward for now.
  • I think the current approach is fine from a backwards-compatibility perspective. When nodes containing this new metric try to communicate with nodes without the implementation, we will simply fail because we won't find a matching 'named writeable' (illustrated in the sketch below).
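
As a rough illustration of the backwards-compatibility point, here is a simplified, hypothetical stand-in for a named-writeable style registry (this is not Elasticsearch's NamedWriteableRegistry code): a node that never registered the new metric name fails fast on deserialization instead of silently mis-reading the request.

```java
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical sketch of the failure mode, not the actual registry implementation.
final class MetricRegistrySketch {

    interface MetricSketch { /* evaluate(...) omitted */ }

    private final Map<String, Supplier<MetricSketch>> readers;

    MetricRegistrySketch(Map<String, Supplier<MetricSketch>> readers) {
        this.readers = readers;
    }

    /** Called on the deserialization path: an older node that knows nothing
     *  about "recall" has no reader registered for it and fails here. */
    MetricSketch read(String metricName) {
        Supplier<MetricSketch> reader = readers.get(metricName);
        if (reader == null) {
            throw new IllegalArgumentException("unknown named writeable [" + metricName + "]");
        }
        return reader.get();
    }
}
```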

@joshdevins (Member, Author) commented Feb 25, 2020

There is quite a bit of duplication between PrecisionAtK and RecallAtK in terms of serialization code, getters + setters, etc. We could try a refactor to pull out some shared code, but to me it's best to keep the implementation straightforward for now.

Yeah, that's why I tried to refactor it out in the confusion matrix version. It's hard to find well-factored patterns for the serialization code since it relies on a lot of static methods and fields. I would vote to leave it as-is for now; perhaps after we add MAP/GMAP we'll start to see better patterns to refactor towards.

In general, I would prefer to group these metrics into something like:

  • Set-based binary metrics (precision, recall)
  • Binary rank metrics (MAP, MRR)
  • Graded rank metrics (DCG, ERR)

Maybe we can find better ways to factor things based on these kinds of abstractions (see the sketch below for one possible grouping). The confusion matrix metric was a first attempt at that. It's also how we have done these metrics in the ML plugin for classification, which are based on a confusion matrix. Of course it's a bit different for rank metrics, since you have to deal with top-k situations (e.g. for recall) that don't come up in the ML use-cases, which makes it a bit more effort and not as clean to code.
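
A rough sketch of what that grouping could look like, using purely hypothetical type names (none of these exist in the rank eval module today):

```java
import java.util.List;

// Hypothetical grouping only; invented for illustration, not part of this PR.

/** Minimal hit representation used by this sketch. */
final class RatedHit {
    final String docId;
    final int rating; // 0 = not relevant; higher grades only matter for graded metrics
    RatedHit(String docId, int rating) { this.docId = docId; this.rating = rating; }
}

/** Common parent: evaluate a single query's top-k hits.
 *  Recall-style metrics would also need the full judgment set; omitted for brevity. */
interface RankedListMetric {
    double evaluate(List<RatedHit> topK);
}

/** Set-based binary metrics (precision@k, recall@k): only membership in the top k matters. */
interface SetBasedBinaryMetric extends RankedListMetric { }

/** Binary rank metrics (MAP, MRR): positions of relevant hits matter, relevance is binary. */
interface BinaryRankMetric extends RankedListMetric { }

/** Graded rank metrics (DCG, ERR): positions matter and graded relevance is used directly. */
interface GradedRankMetric extends RankedListMetric { }
```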

@jtibshirani (Contributor) left a comment

I would vote to leave it as-is for now and perhaps after we add MAP/GMAP we start to see better patterns to refactor to.

Sounds good!

This looks good to me; I think it's ready to merge after you add a note to the docs. If you agree, it'd be good to remove the 'WIP' label and add the appropriate labels (area, type of change, plus target versions).

@joshdevins added the ':Search Relevance/Ranking' (Scoring, rescoring, rank evaluation) and 'v7.7.0' labels on Feb 26, 2020
@elasticmachine (Collaborator) commented

Pinging @elastic/es-search (:Search/Ranking)

This change adds the recall@k metric and refactors precision@k to match
the new metric.

Recall@k is an important metric to use for learning to rank (LTR)
use-cases. Candidate generation or first-phase ranking functions
are often optimized for high recall, in order to generate as many
relevant candidates in the top-k as possible for a second phase of
ranking. Adding this metric allows tuning that base query for LTR.

See: #51676
@joshdevins joshdevins merged commit 4ff5e03 into elastic:master Feb 27, 2020
@joshdevins joshdevins deleted the joshdevins/rank-eval-recall-at-k branch February 27, 2020 09:43
joshdevins added a commit that referenced this pull request Feb 27, 2020
Backports: #52577