Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix bug in bm25.py #1726

Merged
merged 1 commit into from
Nov 20, 2017
Merged

Fix bug in bm25.py #1726

merged 1 commit into from
Nov 20, 2017

Conversation

souravsingh
Copy link
Contributor

Related to #1718

@piskvorky
Copy link
Owner

piskvorky commented Nov 18, 2017

That looks like a serious / critical difference, thanks! @menshikh-iv can you please verify the new version is correct?

What functionality is affected? How did this bug manifest itself in practice?

@souravsingh
Copy link
Contributor Author

Nothing is mentioned in the issue itself. The person filing the issue just mentions that it is a bug, and didn't say anything about how it affects the functionality.

@piskvorky
Copy link
Owner

piskvorky commented Nov 18, 2017

Probably best to double check and verify. The summarization subpackage is still used very sparsely (I didn't even know it had BM25 in it).

@souravsingh
Copy link
Contributor Author

The person who filed the issue mentioned the following-

Actually, I used the bm25 module independently of the
summarization module just for IR.

Basically, I wanted to quickly compare bm25 with tf-idf, pylucene's api
seemed to be too cumbersome, and I did not want to reinvent the wheel.
Gensim generally has pretty good coding standards that's why I decided to
use bm25 from gensim instead of the various other BM25 implementations on
github. But after implementation I noticed that the results from bm25 were
worse than standard tf-idf and that's why I looked at the code in more
detail. The data that I am working on comes from https://www.ldc.upenn.edu
so its hard to share it.

@piskvorky
Copy link
Owner

piskvorky commented Nov 18, 2017

Yes, I saw.

It's better link to @se4u's answer, instead of copy-pasting.

@souravsingh
Copy link
Contributor Author

So we confirm the behavior of the bug by running both tf-idf and bm25 correct? It might be a bit difficult since bm25 doesn't have any docstrings to help in understanding the behavior.

@menshikh-iv
Copy link
Contributor

menshikh-iv commented Nov 20, 2017

BM25 formula:
image
Current variant (in code) looks correct now.

@menshikh-iv menshikh-iv merged commit 24a692b into piskvorky:develop Nov 20, 2017
VaiyeBe pushed a commit to VaiyeBe/gensim that referenced this pull request Nov 26, 2017
@eliotpbrenner
Copy link

Can you please double check? I believe it should be len(self.corpus[index]), not len(document)
Thanks!

@se4u
Copy link

se4u commented Dec 30, 2017

@eliotpbrenner yes you are right !

@menshikh-iv
Copy link
Contributor

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants