a problem in the eval.py #5
Comments
It is a simple intuition that triples having the same score should have the same rank.
Hi, I believe this is wrong, @daiquocnguyen, and there are two simple fixes. Currently your results for ConvKB and CapsE are not comparable to the rest of the literature and do not make sense (since a trivial model that scores every triple identically would be optimal).
At the beginning, in order to work with a batch size when evaluating ConvKB for knowledge graph completion, I replicated each correct test triple several times and added the copies to its set of corrupted triples to fill up a batch. That's the reason why I said that triples having the same score should have the same rank: the triples I mean are the correct test triple and its replicated copies. Last year, I found out that someone mentioned an issue on OpenReview for ICLR 2019 that some different triples also share the same score on FB15k-237. I don't know how many triples they observed, but this probably does not happen on WN18RR, WN11, FB13 and SEARCH17.
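The batch-padding scheme described in this comment can be sketched as follows. This is a hypothetical illustration, not the repository's actual code; the function and variable names are assumptions. The key point is that the padded copies necessarily receive the correct triple's own score, which is why equal-scoring triples were expected to share a rank.

```python
# Hypothetical sketch of padding a test triple's candidate set so its
# size is a multiple of the batch size, by replicating the correct triple.
def pad_to_batch(correct_triple, corrupted, batch_size):
    """Return [correct] + corrupted + enough copies of correct to fill batches."""
    total = 1 + len(corrupted)                     # correct triple + corruptions
    n_pad = (batch_size - total % batch_size) % batch_size
    return [correct_triple] + corrupted + [correct_triple] * n_pad

# One correct triple, three corruptions, batch size 8 -> 4 padded copies.
batch = pad_to_batch((0, 1, 2), [(3, 1, 2), (4, 1, 2), (5, 1, 2)], batch_size=8)
print(len(batch))  # -> 8
```

All padded copies tie with the correct triple by construction, so any tie-breaking rule that is not consistent across equal scores would distort the reported rank.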
Hi, regarding "At the beginning, in order to work with a batch size ... the correct test triple and its replicated copies": no. Each triple in the list new_x_batch is unique. There are exactly n_entities elements in new_x_batch, each with a unique head or tail. You then "filter" all correct triples out in lines 195-202, before re-adding the correct triple (which was removed before) in lines 205-206. Currently, the model that performs best under your evaluation is one that gives a score of 0 to all triples. Your numbers for ComplEx are also outdated; please see https://github.com/facebookresearch/kbc for up-to-date numbers.
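The "filter then re-add" protocol this comment describes can be sketched as below. This is a minimal illustration of the standard filtered ranking setting, with assumed names, not the repository's code; it also assumes lower score = better and uses the optimistic tie-breaking being criticized here, so a constant-score model reaches rank 1.

```python
import numpy as np

def filtered_rank(candidates, scores, test_triple, known_correct):
    """Remove known-correct candidates, keep the test triple, rank it.
    Ties are broken optimistically: only strictly better candidates count."""
    keep = [i for i, t in enumerate(candidates)
            if t == test_triple or t not in known_correct]
    kept_scores = np.asarray([scores[i] for i in keep])
    target = [candidates[i] for i in keep].index(test_triple)
    return int(np.sum(kept_scores < kept_scores[target])) + 1

candidates = [("a", "r", "b"), ("c", "r", "b"), ("d", "r", "b")]
scores = [0.3, 0.1, 0.2]
known = {("a", "r", "b"), ("c", "r", "b")}          # ("c","r","b") is filtered out
print(filtered_rank(candidates, scores, ("a", "r", "b"), known))  # -> 2
```

With this optimistic rule, if all kept scores are equal, `np.sum(kept_scores < ...)` is 0 and the reported rank is 1, which is exactly the degenerate behaviour the comment points out.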
By "at the beginning" I mean a year and a half ago: when using a batch size, for each correct test triple, I replicated it several times and added the copies to its set of corrupted triples to fill up a batch. That's what I meant in my previous comment above. After the discussion on OpenReview last year, I made a new implementation that does not use a batch size for the evaluation, as you now see here.
So is the problem already solved? Could we use eval.py to evaluate methods modified from ConvKB?
I would like to clarify that there is no problem in eval.py itself. As the issue does not appear on other datasets, I still don't know what the exact answer is.
@chenwang1701, if you use eval.py from this repo, you'll suffer from the same problems mentioned in this issue. However, the fix to obtain results that are comparable to the state of the art is simple:
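One widely used tie-aware ranking, which removes the advantage of constant-score models, places a tied target at the average of its tied positions. This is a minimal sketch of that general technique (assuming lower score = better), not necessarily the exact fix this commenter had in mind:

```python
import numpy as np

def average_rank(scores, target_idx):
    """Rank of the target: 1 + strictly better candidates,
    plus half of the other candidates tied with the target."""
    better = np.sum(scores < scores[target_idx])     # strictly better
    tied = np.sum(scores == scores[target_idx]) - 1  # exclude target itself
    return float(better) + 1.0 + tied / 2.0

# A constant-score model no longer gets rank 1 among 5 candidates:
print(average_rank(np.zeros(5), 0))  # -> 3.0, the expected random rank
```

Random tie-breaking (sampling the target's position uniformly among the ties) has the same expected value and is the other common choice; either way, a model that scores everything identically is reduced to chance-level metrics.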
@timlacroix @chenwang1701 @zhangzy6 @AndRossi The paper above was accepted to ACL 2020 and advertised via Twitter, which is how I learned of it. I see that the paper contributes some findings on the evaluation protocol. But it is not fair that the paper does not mention the issue appears "only" on FB15k-237, where I still don't know the exact answer. And it does not mean that ConvKB cannot serve as a good baseline at this time. I have found the ACL 2019 paper (and its PyTorch code) where the authors used ConvKB in the decoder; the results are still good for our ConvKB, and I can get better results with suitable initialization. So, technically speaking, a reasonable answer comes from PyTorch (instead of TensorFlow), where ConvKB does not have the issue on FB15k-237. I plan to reimplement our ConvKB in PyTorch to answer you further.
@chenwang1701 I believe that the issue does not come from the implementation itself. In a special case you may have to use a batch size, as discussed above. The issue mentioned here is due to the finding that ConvKB returns the same score for different triples.
@timlacroix @zhangzy6 @AndRossi I have just released the PyTorch implementation of our ConvKB, based on the OpenKE framework, to deal with the issue and to show you that the ACL 2020 paper is not fair in re-evaluating our model. The authors just changed a single line of code. The results obtained by the PyTorch implementation using the existing initialization are as follows:
These results were state-of-the-art in 2017. Note that ConvKB is used as a decoder to improve the task performance. You can get better results if you use the initialization produced by TransE from the OpenKE framework.
I find a problem in eval.py: when the scores of several entities are the same, all of those entities are ranked 1. Can you explain this?
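The behaviour being reported can be reproduced with a few lines. This is a hypothetical illustration (the function name is not from the repo): an "optimistic" rank counts only strictly better candidates, so when every candidate receives the same score, every candidate comes out ranked 1.

```python
import numpy as np

def optimistic_rank(scores, target_idx):
    """Rank of the target among all candidates, ties broken optimistically.
    Lower score = better is assumed."""
    return int(np.sum(scores < scores[target_idx])) + 1

scores = np.zeros(5)  # degenerate model: identical score for every entity
print(optimistic_rank(scores, target_idx=3))  # -> 1, despite the 5-way tie
```

Under this rule a model that outputs a constant score achieves a perfect mean rank, which is why the tie-breaking choice matters for the reported metrics.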