Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove unnecessary assert blocking direct usage of CSC for LSI #1622

Merged
merged 1 commit into from
Oct 16, 2017

Conversation

isamaru
Copy link
Contributor

@isamaru isamaru commented Oct 10, 2017

@piskvorky @janpom @menshikh-iv

Removes the assert which is blocking usage of CSC directly as corpus for LsiModel with multiple power iterations.
I could not figure out why the assert was there in the first place: it doesn't make sense to me: The code path handles CSC exclusively but I don't see why CSC would require a single pass only. It works just fine with multiple power iterations.

@piskvorky
Copy link
Owner

piskvorky commented Oct 10, 2017

@isamaru IIRC, the code path that handles CSC was there for the distributed version only. The master worker prepares batches for workers as CSC matrices; the workers process the batches and send the results back to master over the network.

The code was never meant to be used in this way, as the "main" worker, it's a different use case. I actually wouldn't expect it to work so easily (do all the necessary model setup etc, as if it was the master itself). But if it does, that's great.

Copy link
Contributor

@menshikh-iv menshikh-iv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please check that distributed onepass works properly, look at this tutorial: https://radimrehurek.com/gensim/dist_lsi.html

@isamaru
Copy link
Contributor Author

isamaru commented Oct 11, 2017

@radim Thanks for the explanation! Makes much more sense now.

Distributed algorithm with not work on CSC directly as written now: the dispatcher code is only enabled in the top branch (if not scipy.sparse.issparse(corpus):), and the remaining assert will fail during dispatcher initialization if you try distributed straight on CSC.
Distributed algorithm on corpus (the original use case) will work just fine, it remains unaffected by this change.
Enabling the distributed mode for CSC would take slightly more work and refactoring of the whole method.

To sum up: serial (non-distributed) algorithm will work directly on CSC. Distributed will not work, failing on the very same assert than before ('must be in serial mode to receive jobs').

I could rewrite that assert message to be more meaningful to the user: "Distributed mode not supported for CSC matrices directly. Use csc2corpus."

@piskvorky
Copy link
Owner

I see, thanks. I think that's fine as-is, no changes needed -- nobody really uses the (ancient) distributed mode, AFAIK. It's one of the candidate features for removal.

@menshikh-iv
Copy link
Contributor

@piskvorky ok, I close this PR

@piskvorky
Copy link
Owner

piskvorky commented Oct 13, 2017

Why? We definitely want this PR.

@menshikh-iv
Copy link
Contributor

@piskvorky as I understand, think that's fine as-is, no changes needed -- nobody really uses the (ancient) distributed mode means that we no need this changes

@menshikh-iv menshikh-iv reopened this Oct 13, 2017
@piskvorky
Copy link
Owner

No, it means we don't need to rewrite the assert message.

@menshikh-iv
Copy link
Contributor

Thanks @isamaru 🥇

@menshikh-iv menshikh-iv merged commit e9bbcf3 into piskvorky:develop Oct 16, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants