optimize __len__ of SlicedCorpus #1679

Merged

Conversation

horpto
Contributor

@horpto horpto commented Nov 1, 2017

  • small refactoring of SlicedCorpus and IndexedCorpus (see the sketch below)
  • remove unnecessary file
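
A minimal sketch of the __len__ idea (a simplification for illustration, not the actual patch): when the corpus is restricted by a slice, its length can be computed in O(1) from the slice bounds instead of iterating over the documents.

```python
def sliced_len(s, total):
    # Number of documents a slice selects out of `total`, computed in O(1).
    # slice.indices() clamps start/stop/step to the corpus size.
    return len(range(*s.indices(total)))

# e.g. a corpus of 50 docs sliced as [10:100:3] -> docs 10, 13, ..., 49
assert sliced_len(slice(10, 100, 3), 50) == 14
```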

horpto and others added 2 commits November 1, 2017 04:54
* small refactoring of SlicedCorpus and IndexedCorpus
* remove unnecessary file
@menshikh-iv
Contributor

@horpto thanks, you've been very active lately 👍

@menshikh-iv menshikh-iv merged commit 7ceeda9 into piskvorky:develop Nov 1, 2017
@piskvorky
Owner

Awesome! These types of fixes are sorely needed. Many thanks @horpto.

@horpto
Contributor Author

horpto commented Nov 2, 2017

Thanks, @piskvorky! It's just a little fix. The worse problem is SlicedCorpus.__iter__: docbyoffset will reopen the file for each doc. >< I can't fix that without breaking the IndexedCorpus abstraction, and not all corpus classes support line2doc or something like it.
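
Roughly the access pattern being described, as a hypothetical simplification (not gensim's actual code; parse_doc is a stand-in for whatever deserializes one document):

```python
def iter_docs_reopening(fname, offsets, parse_doc):
    # What docbyoffset-per-document amounts to: one open()/close() per doc,
    # which dominates the iteration cost.
    for offset in offsets:
        with open(fname, 'rb') as fin:
            fin.seek(offset)
            yield parse_doc(fin)

def iter_docs_shared_handle(fname, offsets, parse_doc):
    # The cheaper alternative: one handle for the whole slice. Doing this
    # generically would require cooperation from every IndexedCorpus subclass.
    with open(fname, 'rb') as fin:
        for offset in offsets:
            fin.seek(offset)
            yield parse_doc(fin)
```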

@horpto horpto deleted the refactoring/len-of-SlisedCorpus branch November 2, 2017 14:22
@piskvorky
Owner

piskvorky commented Nov 3, 2017

I think we should also implement a binary format, such as extending our Matrix Market format to handle binary data (.mm, MmCorpus).

It would be space-efficient and much faster to parse, for those who care about speed more than readability. The format is pretty trivial too => an easy task. CC @menshikh-iv.

@menshikh-iv
Contributor

Can you describe your proposal more concretely @piskvorky (I'll create an issue)?

@piskvorky
Owner

piskvorky commented Nov 3, 2017

I mean, store the (sparse) matrix represented by a corpus in a binary format: less space + faster processing.

For example, store each document as its number of non-zeros (int32), followed by its (feature_id, feature_weight) pairs serialized as (int32, float32) = 8 bytes per non-zero element. So, the total serialized corpus size = num_docs * 4 + 8 * nnz bytes.
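
A minimal sketch of that layout (hypothetical helper names, plain struct, little-endian, no gensim integration):

```python
import struct

def save_binary_corpus(fname, corpus):
    # Per document: int32 non-zero count, then (int32 id, float32 weight) pairs.
    with open(fname, 'wb') as fout:
        for doc in corpus:  # doc is a list of (feature_id, weight) 2-tuples
            fout.write(struct.pack('<i', len(doc)))
            for feature_id, weight in doc:
                fout.write(struct.pack('<if', feature_id, weight))

def load_binary_corpus(fname):
    # Stream the documents back; total file size is num_docs * 4 + 8 * nnz bytes.
    with open(fname, 'rb') as fin:
        while True:
            header = fin.read(4)
            if not header:
                break
            nnz, = struct.unpack('<i', header)
            buf = fin.read(8 * nnz)
            yield [struct.unpack_from('<if', buf, 8 * k) for k in range(nnz)]
```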

That's one simple option (doesn't support slicing / indexing though).

More optimized binary schemes are possible, such as encoding elements more cleverly (RLE for the values? Gray code for feature ids? or just zip? something even cleverer?) to save space, or adding indexing capabilities, but I don't think that's strictly necessary. Certainly not as the first step.
