
Summarisation optimisations #441

Merged (13 commits merged into piskvorky:develop on Oct 7, 2015)

Conversation

@erbas (Contributor) commented Aug 29, 2015

Added speed improvements to keyword and summarisation methods.

While the paper showed best performance was obtained with nouns and adjectives, there are applications where including verbs, adverbs, or even prepositions in the keywords list might be desirable.
@piskvorky (Owner)

Looks good, thanks!

Is there a way to make use of sparsity of the adjacency matrix? That would save us a LOT of memory and allow processing much larger inputs.

@fbarrios is the adjacency matrix supposed to be symmetric or not? I didn't check the algo, but I see @erbas added some comments around complex numbers coming out of the eigendecomposition, which would suggest it's not symmetric...?

@erbas (Contributor, Author) commented Aug 29, 2015

Well, it's packed efficiently in the csr_matrix format, but then unpacked and made dense at full dimension by the addition of the probability_matrix. I don't have a good theory on this, but while debugging it looked like the adjacency_matrix was not symmetric.
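For illustration, a minimal sketch of that densification (toy values; the variable names mirror the discussion, not gensim's actual code):

```python
import numpy as np
from scipy.sparse import csr_matrix

adjacency_matrix = csr_matrix(np.eye(4))    # sparse, efficiently CSR-packed
probability_matrix = np.full((4, 4), 0.25)  # dense damping/teleport term

# Adding a dense array to a CSR matrix yields a dense result at full
# dimension, so the sparsity of adjacency_matrix is lost at this point.
combined = adjacency_matrix + probability_matrix
print(type(combined))  # a dense matrix type, no longer sparse
```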

To really scale this up I think you need a different algorithmic approach. All I've done is some obvious optimisations. Some form of count-min-sketch is probably the way to go.

@piskvorky (Owner)

Yes, unless we find a way to make use of the adjacency matrix's sparsity (and I don't see an easy way), there's no point in the CSR step. We should simply construct it as a dense array to start with and avoid the pointless conversions (wasting both memory and CPU).

Or implement the iterative algo for finding the principal component ourselves. It's very simple AFAIR, via power iterations. Then we could perhaps make use of both the sparse adjacency_matrix and the dense probability matrix (scalar?) in an efficient way, without having to combine them explicitly.
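Something along these lines, as a hypothetical sketch (the 0.85 damping factor and all names are assumptions, not the merged code):

```python
import numpy as np
from scipy.sparse import csr_matrix

def principal_eigenvector(adjacency, damping=0.85, max_iter=100, tol=1e-8):
    """Power iteration on damping * A + (1 - damping)/n * ones((n, n)),
    without ever materialising the dense combined matrix."""
    n = adjacency.shape[0]
    v = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        # The dense ones-matrix term collapses to a scalar: (ones @ v)_i == v.sum()
        v_next = damping * adjacency.dot(v) + (1.0 - damping) / n * v.sum()
        v_next /= np.abs(v_next).sum()  # L1-normalise each iteration
        if np.abs(v_next - v).max() < tol:
            return v_next
        v = v_next
    return v

A = csr_matrix(np.array([[0., 1.], [1., 0.]]))
print(principal_eigenvector(A))  # converges to the principal eigenvector
```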

@erbas Can you say a little more about what WEIGHT_THRESHOLD does? What are the trade-offs here? Is it worth making it a user-tunable parameter (rather than a hardwired constant)? Add documentation?

@piskvorky (Owner)

I looked at the graph and, as far as I can see, it should be undirected. That means the adjacency matrix should be symmetric, so we could simply use eigh / eigsh. @fedelopez77 @fbarrios can you please confirm?
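For reference, a minimal sketch with a made-up symmetric matrix (eigsh returns real eigenpairs, so no complex numbers would leak out):

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import eigsh

A = csr_matrix(np.array([[0., 1., 2.],
                         [1., 0., 3.],
                         [2., 3., 0.]]))  # symmetric, as an undirected graph would give
vals, vecs = eigsh(A, k=1, which='LM')   # principal eigenpair, real-valued
```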

@erbas (Contributor, Author) commented Aug 31, 2015

Yes, I agree with your analysis of the code. The graph should be undirected, but when I step through the code the adjacency matrix is not symmetric. Could it be a bug? Or could it be "conceptually symmetric", i.e. numerical issues mean it isn't symmetric to many decimal places? The latter seems unlikely to me, as the differences were ~1, not ~1e-4.
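A quick hypothetical sketch of how to quantify that asymmetry:

```python
import numpy as np

def max_asymmetry(m):
    """Largest absolute difference between a matrix and its transpose."""
    m = np.asarray(m)
    return float(np.abs(m - m.T).max())

# Round-off alone would give values near machine precision (~1e-4 at worst);
# differences on the order of 1 point to genuinely directed edge weights.
```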

You asked about WEIGHT_THRESHOLD. The smallest non-zero entry in the matrix, for a small sample of random texts, was ~0.1, so 0.001 seemed a reasonable threshold. Perhaps it's safer to set it to zero? I can't see much advantage in making it user-tunable.
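As a hypothetical sketch of what the threshold trades off (made-up weights):

```python
import numpy as np

WEIGHT_THRESHOLD = 1e-3  # the hardwired constant under discussion

weights = np.array([0.0, 0.0004, 0.1, 0.35])  # hypothetical edge weights
kept = weights[weights > WEIGHT_THRESHOLD]    # drops only near-zero noise
# With observed non-zero weights around 0.1, a 0.001 cut-off removes nothing
# but numerical noise; a threshold of 0 would keep every non-zero edge.
```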

@piskvorky (Owner)

Let's wait for @fedelopez77's comment, so we don't get ahead of ourselves :)

@fedelopez77 (Contributor)

I'm sorry I can't confirm this until I get home and run some tests, but I think the matrix is not symmetric.
In the original TextRank paper it is symmetric, but in the implementation that was merged we use a variation of the Okapi BM25 function to set the edge weights. Given that this function uses the results of the tf-idf function on the words of only one of the BM25 arguments, the results aren't the same (i.e. if A and B are sentences, BM25(A, B) != BM25(B, A)).
And that is also why we need a directed graph.
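A self-contained BM25 sketch (not gensim's implementation; k1 and b are the usual defaults) makes the asymmetry easy to see:

```python
import math

def bm25(query, doc, corpus, k1=1.5, b=0.75):
    """Score `doc` against the terms of `query`; swapping the arguments
    changes both the term set and the length normalisation."""
    avgdl = sum(len(d) for d in corpus) / len(corpus)
    n = len(corpus)
    score = 0.0
    for term in set(query):
        df = sum(1 for d in corpus if term in d)
        idf = math.log((n - df + 0.5) / (df + 0.5) + 1.0)
        tf = doc.count(term)
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
    return score

A = "the cat sat on the mat".split()
B = "the dog sat".split()
print(bm25(A, B, [A, B]), bm25(B, A, [A, B]))  # generally unequal
```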

@piskvorky (Owner)

Ah ok, that explains it. Cheers @fedelopez77 :)

@erbas (Contributor, Author) commented Sep 1, 2015

@fedelopez77 My unit tests are failing on Travis because different hardware/library combinations give different keyword lists. I would not have expected the numerical differences between platforms to make such a difference; any suggestions?

@piskvorky (Owner)

@erbas Maybe you're hitting some singular inputs (degenerate matrix spectrum)?

Otherwise, we want the algo to be stable, of course: small numerical imprecisions should have only a proportionally small impact on the result.

@erbas (Contributor, Author) commented Sep 1, 2015

@piskvorky I'm developing on OS X and Travis is running Linux inside Docker, so I would expect small differences from the different BLAS libraries at least. But emphasis on "small". The really odd error is the mismatched number of keywords: ask for 15, get 16. I don't understand :(

@piskvorky (Owner)

@erbas what's the status here? We'll make a new release this week; it would be great to get this in.

@erbas (Contributor, Author) commented Sep 22, 2015

Hi @piskvorky, so it all works on Python 2. I don't understand unicode on Python 3 well enough to understand the issue there. The unit tests I wrote are failing because of the previously noted instabilities in keywords. Again, I'm not sure what's going on; I would have expected the largest eigenvector to be pretty robust. It seems more likely to me that there's something in the calculation of the scores which is platform dependent.

@piskvorky (Owner)

@tmylk any ideas? Can you check / fix?

@tmylk mentioned this pull request Sep 30, 2015
@tmylk (Contributor) commented Sep 30, 2015

Merged just a keywords test with Linux keywords in PR #474. It passes in Python 2 pre-BM25, pre-PR #441, and post-PR #441, on my Fedora box and in the Travis docker. This shows that this PR hasn't deviated.

The instability due to platform/BLAS version is a separate issue, but a pretty important one.
For example, test_keywords_ratio passes on my Fedora box but fails on Travis.

And of course @erbas actually gets different keywords on OS X.
@erbas What are the BLAS and numpy versions where the tests are passing? I'll try to reproduce it in an OS X VM.

tmylk added a commit that referenced this pull request Sep 30, 2015
@erbas (Contributor, Author) commented Oct 2, 2015

Hi @tmylk, I'm on numpy 1.9.2, with the BLAS from Xcode 7.1 (contained in the OS X Accelerate framework). It would be worth checking whether the matrix itself has different values on different platforms, too.

Thanks for patching the Python3 issues, not my strong point.

@tmylk (Contributor) commented Oct 6, 2015

@erbas do you have any advice on how to reproduce the keywords in the file test_data/mihalcea_tarau.kw.txt?

I haven't had any luck producing them. Everywhere I try, I get the same set of keywords, which differs from the ones in that file. I created a test with this new set in #474 and it passes on unixes and OS X (python=2.7.9, numpy=1.9.2, Xcode=7.0).

I'm thinking about removing the mihalcea_tarau.kw.txt changes from this PR and using the ones in #474 instead.

@erbas (Contributor, Author) commented Oct 6, 2015

Nice work! I'm afraid I really don't have any better ideas. In some ways it's reassuring to think this sample text is somehow pathological for this algo, but that raises the question of which other texts might be problematic. Replacing the tests seems reasonable to me.


@tmylk mentioned this pull request Oct 7, 2015
@tmylk merged commit 8fdfff7 into piskvorky:develop Oct 7, 2015
tmylk added a commit that referenced this pull request Oct 7, 2015
Summarisation and TextRank optimisations. Updated version of PR #441
@tmylk (Contributor) commented Oct 7, 2015

@erbas All merged in. Thanks for the optimisations!

The ratio parameters in the tests were updated to values more resistant to numeric fluctuations.
The cut-off used to be in the middle of a flat patch at 0.08, which made it very unstable; there were differences even between consecutive Travis runs.
[attached plot: keyword scores, showing the flat patch around the 0.08 cut-off]
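To illustrate with made-up scores (not the actual data):

```python
import numpy as np

# Several words tie on a flat patch at ~0.08; a cut-off landing inside it
# selects a different subset whenever tiny numerical noise reorders the ties.
scores = np.array([0.31, 0.22, 0.15, 0.08, 0.08, 0.08, 0.08, 0.03])
noise = np.random.normal(scale=1e-9, size=scores.shape)
top5 = np.argsort(-(scores + noise))[:5]
print(top5)  # the picks beyond rank 3 can differ from run to run
```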

@erbas (Contributor, Author) commented Oct 8, 2015

@tmylk Thanks for digging into this, and sorry for not figuring it out :(
