Investigate other ranking evaluation metrics #29653
Pinging @elastic/es-search-aggs
I would love to see expected reciprocal rank (ERR) added.
@rpedela great suggestion, I will look into this as well and how it fits into the current design of the API. Do you already use ERR? If so, for which kind of use case, and how does it compare to other metrics (e.g. nDCG) in your experience?
Doug Turnbull from Open Source Connections does a great job answering your question in this talk, starting at 21:18.
One more data point: RankLib is the de facto standard learning-to-rank library, and ERR is the default optimization metric it uses for training.
@rpedela I started looking into ERR and found it to be a great additional metric. I've opened a PR at #31891; maybe you'd like to comment if you are familiar with the calculation of this metric and want to check whether my understanding of the algorithm is correct. In particular I was wondering about the handling of ungraded search results. The paper assumes complete labels, but this is unrealistic in a real-world scenario. For now I opted for an optional, user-supplied "unknown_doc_rating" parameter that gets substituted for search results without a relevance judgment (it could simply be 0 in most cases). If this parameter is not present, unrated documents are just skipped over in the metric calculation. I'm not sure whether that is common practice, but I would like to hear thoughts or get pointers on this.
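To make the open question concrete, here is a minimal sketch in plain Java of how I currently picture the calculation. This is not the PR code: the grade-to-probability mapping follows the exponential gain from the paper, the method and parameter names are only for illustration, and whether a skipped unrated document still advances the rank counter is exactly the kind of detail I'd like feedback on (the sketch assumes it does not).

```java
import java.util.Arrays;
import java.util.List;

public class ErrSketch {

    /**
     * Computes ERR over the ranked hits. Each entry in {@code ratings} is the
     * relevance grade of the hit at that rank, or null if the document has no
     * judgment. If {@code unknownDocRating} is non-null it is substituted for
     * unrated hits; otherwise unrated hits are skipped entirely.
     */
    static double expectedReciprocalRank(List<Integer> ratings, Integer unknownDocRating, int maxRelevance) {
        double err = 0.0;
        double notStoppedYet = 1.0; // probability the user has not stopped before this rank
        int rank = 1;
        for (Integer rating : ratings) {
            if (rating == null) {
                if (unknownDocRating == null) {
                    continue; // skip unrated documents (assumption: they do not occupy a rank)
                }
                rating = unknownDocRating; // substitute the user-supplied default
            }
            // probability that a document with this grade satisfies the user (exponential gain)
            double satisfactionProb = (Math.pow(2, rating) - 1) / Math.pow(2, maxRelevance);
            err += notStoppedYet * satisfactionProb / rank;
            notStoppedYet *= (1 - satisfactionProb);
            rank++;
        }
        return err;
    }

    public static void main(String[] args) {
        // grades 0..3, one unrated hit at rank 3
        List<Integer> ratings = Arrays.asList(3, 2, null, 0, 1);
        System.out.println(expectedReciprocalRank(ratings, null, 3)); // unrated hit skipped
        System.out.println(expectedReciprocalRank(ratings, 0, 3));    // unrated hit counted as grade 0
    }
}
```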
This change adds Expected Reciprocal Rank (ERR) as a ranking evaluation metric, as described in: Chapelle, O., Metzler, D., Zhang, Y., & Grinspan, P. (2009). Expected reciprocal rank for graded relevance. Proceedings of the 18th ACM Conference on Information and Knowledge Management. https://doi.org/10.1145/1645953.1646033

ERR is an extension of the classical reciprocal rank to the graded relevance case and assumes a cascade browsing model. It quantifies the usefulness of a document at rank `i` conditioned on the degree of relevance of the items at ranks less than `i`. ERR seems to be gaining traction as an alternative to (n)DCG, so it looks like a good metric to support. ERR also appears to be the default optimization metric used for training in RankLib, a widely used learning to rank library.

Relates to #29653
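For reference, the ERR definition from the paper, using the exponential mapping the authors suggest from a relevance grade `g` to a probability of relevance (`g_max` is the maximum possible grade):

```latex
% ERR over the top n ranks; R(g) maps a relevance grade to a stopping probability
\mathrm{ERR} = \sum_{r=1}^{n} \frac{1}{r}\, R(g_r) \prod_{i=1}^{r-1} \bigl(1 - R(g_i)\bigr),
\qquad R(g) = \frac{2^{g} - 1}{2^{\,g_{\max}}}
```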
Another possible metric that I recently encountered in a presentation is Average Precision (or, if taken across multiple user needs, Mean Average Precision).
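For reference, a sketch of the usual definitions under binary relevance, where `P(k)` is the precision at cutoff `k`, `rel(k)` is 1 if the hit at rank `k` is relevant and 0 otherwise, `R` is the total number of relevant documents for the query, and `Q` is the set of evaluated queries:

```latex
\mathrm{AP} = \frac{1}{R} \sum_{k=1}^{n} P(k)\,\mathrm{rel}(k),
\qquad
\mathrm{MAP} = \frac{1}{\lvert Q \rvert} \sum_{q \in Q} \mathrm{AP}(q)
```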
Moving some thoughts from #20441 here since this seems a better fit for tracking them: in case users are able to label entire datasets (likely more academic / ML use cases), they might be interested in metrics that incorporate some notion of recall, such as the F-score or the AUC of a ROC curve. However, we are doubtful that this is likely for any practical purpose.
I think the "label entire datasets" case is covered by what we offer today in Machine Learning.
Pinging @elastic/es-search (Team:Search)
There are no concrete plans to work on this issue. Closing.
From an old discussion in our forums I just learned about another interesting-looking ranking evaluation metric used in some TREC competitions, called "bpref", which is advertised to work well with incomplete data.
I'm opening this issue to do some more investigation into this and other evaluation metrics that we haven't considered yet.
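For reference, the bpref definition as it is usually stated (Buckley & Voorhees, 2004); I have not yet checked this against the trec_eval implementation. Here `R` is the number of judged relevant documents, `N` the number of judged non-relevant documents, `r` ranges over the retrieved judged relevant documents, and `n` over the judged non-relevant documents retrieved above `r` (only the first `R` of them are counted):

```latex
\mathrm{bpref} = \frac{1}{R} \sum_{r} \left( 1 - \frac{\lvert n \text{ ranked higher than } r \rvert}{\min(R, N)} \right)
```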
Regarding bpref, it's at the moment unclear to me: