* optimize __len__ of SlicedCorpus #1679
horpto commented Nov 1, 2017
- small refactoring of SlicedCorpus and IndexedCorpus
- remove unnecessary file
@horpto thanks, you have been very active lately 👍
Awesome! These types of fixes are sorely needed. Many thanks, @horpto.
Thanks, @piskvorky! It's just a little fix. The worse problem is SlicedCorpus.__iter__:
I think we should also implement a binary format, such as extending our Matrix Market support to handle binary (.mm, It will be efficient and much faster, for those people who care about speed more than readability. The format is pretty trivial too => easy task. CC @menshikh-iv.
Can you describe your proposal more concretely @piskvorky (I'll create an issue)?
I mean, store the (sparse) matrix represented by a corpus in a binary format: less space + faster processing. For example, store each document as That's one simple option (doesn't support slicing / indexing though). More optimized binary schemes are possible, such as encoding elements more cleverly (RLE for the values? Gray code for feature ids? or just zip? something even cleverer?) to save space, or adding indexing capabilities, but I don't think that's strictly necessary. Certainly not as the first step.
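To make the idea concrete, here is a minimal sketch of one possible sequential binary layout. The layout itself is an assumption for illustration, not something specified in this thread: each document is written as a little-endian `uint32` count of non-zero entries, followed by that many `(uint32 feature_id, float32 value)` pairs. The names `write_corpus` and `read_corpus` are hypothetical and are not part of gensim.

```python
import io
import struct

# Hypothetical layout (an assumption, not taken from the thread):
#   per document: uint32 nnz, then nnz x (uint32 feature_id, float32 value),
#   all little-endian with no padding.
_HEADER = struct.Struct("<I")
_ENTRY = struct.Struct("<If")


def write_corpus(stream, corpus):
    """Serialize an iterable of documents, each a list of (feature_id, value) pairs."""
    for doc in corpus:
        doc = list(doc)
        stream.write(_HEADER.pack(len(doc)))
        for feature_id, value in doc:
            stream.write(_ENTRY.pack(feature_id, value))


def read_corpus(stream):
    """Yield documents one at a time; sequential access only, no slicing or indexing."""
    while True:
        header = stream.read(_HEADER.size)
        if not header:
            return  # clean end of stream
        if len(header) < _HEADER.size:
            raise ValueError("truncated document header")
        (nnz,) = _HEADER.unpack(header)
        payload = stream.read(nnz * _ENTRY.size)
        if len(payload) != nnz * _ENTRY.size:
            raise ValueError("truncated document body")
        yield [_ENTRY.unpack_from(payload, i * _ENTRY.size) for i in range(nnz)]


# Round trip through an in-memory buffer.
buf = io.BytesIO()
write_corpus(buf, [[(0, 0.5), (3, 2.0)], [], [(7, 1.25)]])
buf.seek(0)
docs = list(read_corpus(buf))
```

Because documents have variable length and there is no offset table, reaching document `i` means reading everything before it, which matches the "doesn't support slicing / indexing" caveat above. Note that `float32` loses precision for values that are not exactly representable; a `float64` entry would trade space for accuracy.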