Investigate other ranking evaluation metrics #29653
Pinging @elastic/es-search-aggs
I would love to see expected reciprocal rank (ERR) added.
@rpedela great suggestion, I will look into this as well and how it fits into the current design of the API. Do you already use ERR? If so, for which kind of use case, and how does it compare to other metrics (e.g. nDCG) in your experience?
Doug Turnbull from Open Source Connections does a great job answering your question in this talk, starting at 21:18.
One more data point: RankLib is the de facto standard learning-to-rank library, and ERR is the default optimization metric it uses for training.
@rpedela I started looking into ERR and found it to be a great additional metric. I've opened a PR at #31891; maybe you'd like to comment if you are familiar with the calculation of this metric and want to check whether my understanding of the algorithm is correct. In particular I was wondering about the handling of ungraded search results. The paper assumes complete labels, but this is unrealistic in a real-world scenario. For now I opted for an optional, user-supplied "unknown_doc_rating" parameter that gets substituted for search results without a relevance judgment (it could simply be 0 in most cases). If this parameter is not present, unrated documents are just skipped over in the metric calculation. I'm not sure whether that is common practice, but I would like to hear thoughts or get pointers on this.
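To make the open question concrete, here is a minimal sketch in plain Java of how I currently picture the calculation. This is not the PR code: the grade-to-probability mapping follows the exponential gain from the paper, the method and parameter names are only for illustration, and whether a skipped unrated document still advances the rank counter is exactly the kind of detail I'd like feedback on (the sketch assumes it does not).

```java
import java.util.Arrays;
import java.util.List;

public class ErrSketch {

    /**
     * Computes ERR over the ranked hits. Each entry in {@code ratings} is the
     * relevance grade of the hit at that rank, or null if the document has no
     * judgment. If {@code unknownDocRating} is non-null it is substituted for
     * unrated hits; otherwise unrated hits are skipped entirely.
     */
    static double expectedReciprocalRank(List<Integer> ratings, Integer unknownDocRating, int maxRelevance) {
        double err = 0.0;
        double notStoppedYet = 1.0; // probability the user has not stopped before this rank
        int rank = 1;
        for (Integer rating : ratings) {
            if (rating == null) {
                if (unknownDocRating == null) {
                    continue; // skip unrated documents (assumption: they do not occupy a rank)
                }
                rating = unknownDocRating; // substitute the user-supplied default
            }
            // probability that a document with this grade satisfies the user (exponential gain)
            double satisfactionProb = (Math.pow(2, rating) - 1) / Math.pow(2, maxRelevance);
            err += notStoppedYet * satisfactionProb / rank;
            notStoppedYet *= (1 - satisfactionProb);
            rank++;
        }
        return err;
    }

    public static void main(String[] args) {
        // grades 0..3, one unrated hit at rank 3
        List<Integer> ratings = Arrays.asList(3, 2, null, 0, 1);
        System.out.println(expectedReciprocalRank(ratings, null, 3)); // unrated hit skipped
        System.out.println(expectedReciprocalRank(ratings, 0, 3));    // unrated hit counted as grade 0
    }
}
```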
This change adds Expected Reciprocal Rank (ERR) as a ranking evaluation metric, as described in: Chapelle, O., Metzler, D., Zhang, Y., & Grinspan, P. (2009). Expected reciprocal rank for graded relevance. Proceedings of the 18th ACM Conference on Information and Knowledge Management. https://doi.org/10.1145/1645953.1646033

ERR is an extension of the classical reciprocal rank to the graded relevance case and assumes a cascade browsing model. It quantifies the usefulness of a document at rank `i` conditioned on the degree of relevance of the items at ranks less than `i`. ERR seems to be gaining traction as an alternative to (n)DCG, so it looks like a good metric to support. ERR also appears to be the default optimization metric used for training in RankLib, a widely used learning to rank library.

Relates to #29653
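For reference, the ERR definition from the paper, using the exponential mapping the authors suggest from a relevance grade `g` to a probability of relevance (`g_max` is the maximum possible grade):

```latex
% ERR over the top n ranks; R(g) maps a relevance grade to a stopping probability
\mathrm{ERR} = \sum_{r=1}^{n} \frac{1}{r}\, R(g_r) \prod_{i=1}^{r-1} \bigl(1 - R(g_i)\bigr),
\qquad R(g) = \frac{2^{g} - 1}{2^{\,g_{\max}}}
```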
Another possible metric that I recently encountered in a presentation is Average Precision (or, if taken across multiple user needs, Mean Average Precision).
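For reference, a sketch of the usual definitions under binary relevance, where `P(k)` is the precision at cutoff `k`, `rel(k)` is 1 if the hit at rank `k` is relevant and 0 otherwise, `R` is the total number of relevant documents for the query, and `Q` is the set of evaluated queries:

```latex
\mathrm{AP} = \frac{1}{R} \sum_{k=1}^{n} P(k)\,\mathrm{rel}(k),
\qquad
\mathrm{MAP} = \frac{1}{\lvert Q \rvert} \sum_{q \in Q} \mathrm{AP}(q)
```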
Moving some thoughts from #20441 here since this seems a better fit for tracking them: in case users are able to label entire datasets (likely more academic / ML use cases), they might be interested in metrics that incorporate some notion of recall, such as the F-score or the AUC of a ROC curve. However, we are doubtful that this is likely for any practical purpose.
I think the "label entire datasets" case is covered by what we offer today in Machine Learning.
Pinging @elastic/es-search (Team:Search)
There are no concrete plans to work on this issue. Closing.
From an old discussion in our forums I just learned about another interesting-looking ranking evaluation metric used in some TREC competitions, called "bpref", which is advertised to work well with incomplete data.
I'm opening this issue to do some more investigation into this and other evaluation metrics that we haven't considered yet.
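For reference, the bpref definition as it is usually stated (Buckley & Voorhees, 2004); I have not yet checked this against the trec_eval implementation. Here `R` is the number of judged relevant documents, `N` the number of judged non-relevant documents, `r` ranges over the retrieved judged relevant documents, and `n` over the judged non-relevant documents retrieved above `r` (only the first `R` of them are counted):

```latex
\mathrm{bpref} = \frac{1}{R} \sum_{r} \left( 1 - \frac{\lvert n \text{ ranked higher than } r \rvert}{\min(R, N)} \right)
```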
Regarding bpref, it's at the moment unclear to me: