optimize __len__ of SlicedCorpus #1679

Merged

Conversation

horpto
Contributor

@horpto horpto commented Nov 1, 2017

  • small refactoring of SlicedCorpus and IndexedCorpus (see the sketch below)
  • remove unnecessary file
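
A minimal sketch of the __len__ idea (a simplification for illustration, not the actual patch): when the corpus is restricted by a slice, its length can be computed in O(1) from the slice bounds instead of iterating over the documents.

```python
def sliced_len(s, total):
    # Number of documents a slice selects out of `total`, computed in O(1).
    # slice.indices() clamps start/stop/step to the corpus size.
    return len(range(*s.indices(total)))

# e.g. a corpus of 50 docs sliced as [10:100:3] -> docs 10, 13, ..., 49
assert sliced_len(slice(10, 100, 3), 50) == 14
```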

horpto and others added 2 commits November 1, 2017 04:54
* small refactoring of SlicedCorpus and IndexedCorpus
* remove unnecessary file
@menshikh-iv
Contributor

@horpto thanks, you've been very active lately 👍

@menshikh-iv menshikh-iv merged commit 7ceeda9 into piskvorky:develop Nov 1, 2017
@piskvorky
Owner

Awesome! These types of fixes are sorely needed. Many thanks @horpto.

@horpto
Contributor Author

horpto commented Nov 2, 2017

Thanks, @piskvorky! It's just a little fix. The worse problem is SlicedCorpus.__iter__: docbyoffset will reopen the file for each doc. >< I can't fix that without breaking the IndexedCorpus abstraction, and not all corpus classes support line2doc or something like it.
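
Roughly the access pattern being described, as a hypothetical simplification (not gensim's actual code; parse_doc is a stand-in for whatever deserializes one document):

```python
def iter_docs_reopening(fname, offsets, parse_doc):
    # What docbyoffset-per-document amounts to: one open()/close() per doc,
    # which dominates the iteration cost.
    for offset in offsets:
        with open(fname, 'rb') as fin:
            fin.seek(offset)
            yield parse_doc(fin)

def iter_docs_shared_handle(fname, offsets, parse_doc):
    # The cheaper alternative: one handle for the whole slice. Doing this
    # generically would require cooperation from every IndexedCorpus subclass.
    with open(fname, 'rb') as fin:
        for offset in offsets:
            fin.seek(offset)
            yield parse_doc(fin)
```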

@horpto horpto deleted the refactoring/len-of-SlisedCorpus branch November 2, 2017 14:22
@piskvorky
Owner

piskvorky commented Nov 3, 2017

I think we should also implement a binary format, such as extending our Matrix Market format to handle binary data (.mm, MmCorpus).

It would be space-efficient and much faster to parse, for those who care about speed more than readability. The format is pretty trivial too => an easy task. CC @menshikh-iv.

@menshikh-iv
Contributor

Can you describe your proposal more concretely @piskvorky (I'll create an issue)?

@piskvorky
Owner

piskvorky commented Nov 3, 2017

I mean, store the (sparse) matrix represented by a corpus in a binary format: less space + faster processing.

For example, store each document as its number of non-zeros (int32), followed by its (feature_id, feature_weight) pairs serialized as (int32, float32) = 8 bytes per non-zero element. So, the total serialized corpus size = num_docs * 4 + 8 * nnz bytes.
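
A minimal sketch of that layout (hypothetical helper names, plain struct, little-endian, no gensim integration):

```python
import struct

def save_binary_corpus(fname, corpus):
    # Per document: int32 non-zero count, then (int32 id, float32 weight) pairs.
    with open(fname, 'wb') as fout:
        for doc in corpus:  # doc is a list of (feature_id, weight) 2-tuples
            fout.write(struct.pack('<i', len(doc)))
            for feature_id, weight in doc:
                fout.write(struct.pack('<if', feature_id, weight))

def load_binary_corpus(fname):
    # Stream the documents back; total file size is num_docs * 4 + 8 * nnz bytes.
    with open(fname, 'rb') as fin:
        while True:
            header = fin.read(4)
            if not header:
                break
            nnz, = struct.unpack('<i', header)
            buf = fin.read(8 * nnz)
            yield [struct.unpack_from('<if', buf, 8 * k) for k in range(nnz)]
```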

That's one simple option (doesn't support slicing / indexing though).

More optimized binary schemes are possible, such as encoding elements more cleverly (RLE for the values? Gray code for feature ids? or just zip? something even cleverer?) to save space, or adding indexing capabilities, but I don't think that's strictly necessary. Certainly not as the first step.
