
Fix PR #963 : Passing all the params through the apply call in lda.get_document_topics #978

Merged
merged 11 commits into piskvorky:develop on Nov 11, 2016

Conversation

Partho
Contributor

@Partho Partho commented Oct 26, 2016

Hi, I have added a test case to check whether get_document_topics() works on a corpus. However, the build will fail because of this new test case: for some words in the corpus, word_topics and word_phis are not returned.

If the corpus is small, I get the correct values. However, if I take the corpus as defined in test_ldamodel.py, an error is thrown. Let me show this via a toy example called app.py:

from gensim.corpora import mmcorpus, Dictionary
from gensim import models


texts = [['human', 'interface', 'computer'],
         ['survey', 'user', 'computer', 'system', 'response', 'time'],
         ['eps', 'user', 'interface', 'system']]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

lda = models.LdaModel(corpus)

# ab = lda.get_document_topics(corpus)
# for t in ab:
#     print('new doc')
#     print(t)
#
# print(len(ab))
# print(len(corpus))

doc_topics, word_topics, word_phis = lda.get_document_topics(corpus, per_word_topics=True, minimum_probability=0.02)

for t in doc_topics:
    print('new doc_topic')
    print(t)

for t in word_topics:
    print('new word_topic')
    print(t)

for t in word_phis:
    print('new word_phis')
    print(t)

For the above code, I am getting complete output as follows:

new doc_topic
[(96, 0.75249999999999984)]
new doc_topic
[(0, [96]), (1, [96]), (2, [96])]
new doc_topic
[(0, [(96, 0.99999999999999989)]), (1, [(96, 1.0)]), (2, [(96, 1.0)])]
new word_topic
[(2, 0.85857142857143109)]
new word_topic
[(1, [2]), (3, [2]), (4, [2]), (5, [2]), (6, [2]), (7, [2])]
new word_topic
[(1, [(2, 1.0)]), (3, [(2, 1.0)]), (4, [(2, 1.0)]), (5, [(2, 1.0)]), (6, [(2, 1.0)]), (7, [(2, 1.0)])]
new word_phis
[(29, 0.80200000000000238)]
new word_phis
[(0, [29]), (6, [29]), (7, [29]), (8, [29])]
new word_phis
[(0, [(29, 0.99999999999999989)]), (6, [(29, 1.0)]), (7, [(29, 0.99999999999999989)]), (8, [(29, 0.99999999999999989)])]

However, if I take the corpus as:

texts = [['human', 'interface', 'computer'],
         ['survey', 'user', 'computer', 'system', 'response', 'time'],
         ['eps', 'user', 'interface', 'system'],
         ['system', 'human', 'system', 'eps'],
         ['user', 'response', 'time'],
         ['trees'],
         ['graph', 'trees'],
         ['graph', 'minors', 'trees'],
         ['graph', 'minors', 'survey']]

then the following error will be thrown:

/Users/partho/anaconda/bin/python /Users/partho/PycharmProjects/untitled/app.py
Traceback (most recent call last):
  File "/Users/partho/PycharmProjects/untitled/app.py", line 39, in <module>
    doc_topics, word_topics, word_phis = lda.get_document_topics(corpus, per_word_topics=True, minimum_probability=0.02)
ValueError: too many values to unpack

Please let me know how to modify the code for word_topics and word_phis in the returned tuple so that it won't throw the ValueError.
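For context, the failure itself is generic Python unpacking behavior and can be reproduced without gensim; a minimal sketch (the placeholder tuples stand in for the per-document results):

```python
# Tuple unpacking requires the iterable to yield exactly as many items
# as there are targets; any surplus raises ValueError, which is what
# happens when the corpus has more than 3 documents.
items = [("doc1",), ("doc2",), ("doc3",), ("doc4",)]

try:
    a, b, c = items  # 4 items into 3 targets
except ValueError as e:
    print(e)  # "too many values to unpack (expected 3)"

# With exactly 3 items the same statement succeeds:
a, b, c = items[:3]
```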

@tmylk
Contributor

tmylk commented Oct 26, 2016

Is it just that they are filtered?

Feel free to change the minimum_phi_value passed.

@Partho
Contributor Author

Partho commented Oct 26, 2016

Hi @tmylk,

For the code below:


from gensim.corpora import mmcorpus, Dictionary
from gensim import models


texts = [['human', 'interface', 'computer'],
         ['survey', 'user', 'computer', 'system', 'response', 'time'],
         ['eps', 'user', 'interface', 'system'],
         ['system', 'human', 'system', 'eps'],
         ['user', 'response', 'time'],
         ['trees'],
         ['graph', 'trees'],
         ['graph', 'minors', 'trees'],
         ['graph', 'minors', 'survey']]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

lda = models.LdaModel(corpus)

topics = lda.get_document_topics(corpus, per_word_topics=True)

# for index, topic in enumerate(doc_topics):
#     print(len(corpus[index]))
#     print(doc_topics.corpus[index])


for t in topics:
    print('new doc')
    print('Document topics:', t[0])
    print('Word topics:', t[1])
    print('Word phis:', t[2])
    print('--------------')

Should the output be like this?

/Users/partho/anaconda/bin/python /Users/partho/PycharmProjects/untitled/app.py
new doc
Document topics: [(80, 0.75250000000000072)]
Word topics: [(0, [80]), (1, [80]), (2, [80])]
Word phis: [(0, [(80, 0.99999999999999989)]), (1, [(80, 0.99999999999999989)]), (2, [(80, 1.0)])]
--------------
new doc
Document topics: [(56, 0.85857142857142965)]
Word topics: [(1, [56]), (3, [56]), (4, [56]), (5, [56]), (6, [56]), (7, [56])]
Word phis: [(1, [(56, 1.0)]), (3, [(56, 1.0)]), (4, [(56, 1.0)]), (5, [(56, 1.0)]), (6, [(56, 1.0)]), (7, [(56, 1.0)])]
--------------
new doc
Document topics: [(93, 0.80200000000000016)]
Word topics: [(0, [93]), (6, [93]), (7, [93]), (8, [93])]
Word phis: [(0, [(93, 1.0)]), (6, [(93, 1.0)]), (7, [(93, 1.0)]), (8, [(93, 1.0)])]
--------------
new doc
Document topics: [(80, 0.80200000000000049)]
Word topics: [(2, [80]), (6, [80]), (8, [80])]
Word phis: [(2, [(80, 1.0)]), (6, [(80, 2.0)]), (8, [(80, 0.99999999999999989)])]
--------------
new doc
Document topics: [(1, 0.75250000000000394)]
Word topics: [(3, [1]), (4, [1]), (7, [1])]
Word phis: [(3, [(1, 1.0)]), (4, [(1, 0.99999999999999989)]), (7, [(1, 1.0)])]
--------------
new doc
Document topics: [(70, 0.50499999999999989)]
Word topics: [(9, [70])]
Word phis: [(9, [(70, 1.0)])]
--------------
new doc
Document topics: [(86, 0.67000000000000048)]
Word topics: [(9, [86]), (10, [86])]
Word phis: [(9, [(86, 1.0)]), (10, [(86, 1.0)])]
--------------
new doc
Document topics: [(4, 0.75250000000000372)]
Word topics: [(9, [4]), (10, [4]), (11, [4])]
Word phis: [(9, [(4, 1.0)]), (10, [(4, 1.0)]), (11, [(4, 1.0)])]
--------------
new doc
Document topics: [(27, 0.75250000000000283)]
Word topics: [(5, [27]), (10, [27]), (11, [27])]
Word phis: [(5, [(27, 0.99999999999999989)]), (10, [(27, 1.0)]), (11, [(27, 1.0)])]
--------------

Process finished with exit code 0

@tmylk
Contributor

tmylk commented Oct 27, 2016

So what is happening in topic 3?

@Partho
Contributor Author

Partho commented Oct 27, 2016

At this line, we return (document_topics, word_topic, word_phi) as one tuple per document. So if you call
doc_topics, word_topic, word_phis = lda.get_document_topics(corpus, per_word_topics=True),
you are storing each per-document tuple in a separate variable: doc_topics holds the (document_topics, word_topic, word_phi) tuple for document 1, while word_topic and word_phis hold the tuples for documents 2 and 3 respectively. Since lda.get_document_topics(corpus, per_word_topics=True) yields one tuple per document, and there are more than 3 of them here (the corpus size is 9), 3 variables aren't enough to hold them 😃

Below I show the mistake others may be making when calling get_document_topics for the entire corpus:

from gensim.corpora import mmcorpus, Dictionary
from gensim import models


texts = [['human', 'interface', 'computer'],
         ['survey', 'user', 'computer', 'system', 'response', 'time'],
         ['eps', 'user', 'interface', 'system']]

dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

lda = models.LdaModel(corpus)

doc_topics, word_topics, word_phis = lda.get_document_topics(corpus, per_word_topics=True)

print(doc_topics)
print(word_topics)
print(word_phis)

will output :

[[(19, 0.75250000000000328)], [(0, [19]), (1, [19]), (2, [19])], [(0, [(19, 1.0)]), (1, [(19, 1.0)]), (2, [(19, 0.99999999999999989)])]]

[[(86, 0.85857142576872991)], [(1, [86]), (3, [86]), (4, [86]), (5, [86]), (6, [86]), (7, [86])], [(1, [(86, 1.0)]), (3, [(86, 1.0)]), (4, [(86, 1.0)]), (5, [(86, 1.0)]), (6, [(86, 1.0)]), (7, [(86, 1.0)])]]

[[(77, 0.80200000000000071)], [(0, [77]), (6, [77]), (7, [77]), (8, [77])], [(0, [(77, 1.0)]), (6, [(77, 1.0)]), (7, [(77, 1.0)]), (8, [(77, 1.0)])]]

As we can see, each tuple holds 3 lists: document_topics, word_topic, and word_phi. If the corpus length is greater than 3, the above code will throw a ValueError.

So the proper way to access the data for the entire corpus is like this:

all_topics = lda.get_document_topics(corpus, per_word_topics=True)
for doc_topic, word_topic, word_phi in all_topics:
    print('new doc')
    print('Document topics:', doc_topic)
    print('Word topics:', word_topic)
    print('Word phis:', word_phi)
    print('--------------')

which will return:

new doc
Document topics: [(80, 0.75250000000000072)]
Word topics: [(0, [80]), (1, [80]), (2, [80])]
Word phis: [(0, [(80, 0.99999999999999989)]), (1, [(80, 0.99999999999999989)]), (2, [(80, 1.0)])]
--------------
new doc
Document topics: [(56, 0.85857142857142965)]
Word topics: [(1, [56]), (3, [56]), (4, [56]), (5, [56]), (6, [56]), (7, [56])]
Word phis: [(1, [(56, 1.0)]), (3, [(56, 1.0)]), (4, [(56, 1.0)]), (5, [(56, 1.0)]), (6, [(56, 1.0)]), (7, [(56, 1.0)])]
--------------
new doc
Document topics: [(93, 0.80200000000000016)]
Word topics: [(0, [93]), (6, [93]), (7, [93]), (8, [93])]
Word phis: [(0, [(93, 1.0)]), (6, [(93, 1.0)]), (7, [(93, 1.0)]), (8, [(93, 1.0)])]
--------------
new doc
Document topics: [(80, 0.80200000000000049)]
Word topics: [(2, [80]), (6, [80]), (8, [80])]
Word phis: [(2, [(80, 1.0)]), (6, [(80, 2.0)]), (8, [(80, 0.99999999999999989)])]
--------------
new doc
Document topics: [(1, 0.75250000000000394)]
Word topics: [(3, [1]), (4, [1]), (7, [1])]
Word phis: [(3, [(1, 1.0)]), (4, [(1, 0.99999999999999989)]), (7, [(1, 1.0)])]
--------------
                                              .
                                              .
                                              .

And if we want the data for doc_topics in the form
[(80, 0.75250000000000072), (56, 0.85857142857142965), ...], and word_topics in the form
[(0, [80, 93, ...]), (6, [80, 93, ...]), ...], then we will have to add code that restructures the data and returns it in the three variables doc_topic, word_topic, word_phi.
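A minimal sketch of that restructuring, using mock per-document tuples shaped like the (doc_topic, word_topic, word_phi) output instead of a real model (the topic ids and probabilities below are made up):

```python
from collections import defaultdict

# Mock model output: one (doc_topics, word_topics, word_phis) tuple per document.
per_doc = [
    ([(80, 0.75)], [(0, [80]), (1, [80])], [(0, [(80, 1.0)]), (1, [(80, 1.0)])]),
    ([(56, 0.86)], [(1, [56]), (3, [56])], [(1, [(56, 1.0)]), (3, [(56, 1.0)])]),
]

doc_topics = []                  # dominant (topic_id, prob) per document
word_topics = defaultdict(list)  # word_id -> topic ids seen across documents
word_phis = defaultdict(list)    # word_id -> (topic_id, phi) pairs across documents

for doc_topic, word_topic, word_phi in per_doc:
    doc_topics.append(doc_topic[0])
    for word_id, topics in word_topic:
        word_topics[word_id].extend(topics)
    for word_id, phis in word_phi:
        word_phis[word_id].extend(phis)

print(doc_topics)                   # [(80, 0.75), (56, 0.86)]
print(sorted(word_topics.items()))  # [(0, [80]), (1, [80, 56]), (3, [56])]
```

This aggregates by word id across the corpus, which gives the [(0, [80, 93, ...]), ...] shape described above.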

@Partho
Contributor Author

Partho commented Oct 27, 2016

In order to get the doc_topics, word_topics and word_phis for the overall corpus, I restructured the output data with a helper function. I want to confirm whether the aggregated output is desirable. If yes, we can include it in the final code.

The helper function is as follows:

def get_document_topics_for_corpus(topics):
    document_topics, word_topics, word_phis = dict(), dict(), dict()
    doc_topic_list, word_topic_list, word_phi_list = list(), list(), list()

    for doc_index, (doc_topic, word_topic, word_phi) in enumerate(topics):
        # document_topics aggregation
        document_topics.setdefault(doc_index, doc_topic[0])

        # word_topics aggregation
        for key in word_topic:
            word_topics.setdefault(doc_index, []).append(key)

        # word_phis aggregation
        for key in word_phi:
            word_phis.setdefault(doc_index, []).append(key)

    doc_topic_list.extend(document_topics.items())
    word_topic_list.extend(word_topics.items())
    word_phi_list.extend(word_phis.items())

    return doc_topic_list, word_topic_list, word_phi_list

And now, I can call the above function as follows:

topics = lda.get_document_topics(corpus, per_word_topics=True)
doc_topics, word_topics, word_phis = get_document_topics_for_corpus(topics)

print("Document topics: ", doc_topics)
print("Word topics: ", word_topics)
print("Word phis:", word_phis)

The actual result is as follows:

new doc
Document topics: [(96, 0.75249999999999995)]
Word topics: [(0, [96]), (1, [96]), (2, [96])]
Word phis: [(0, [(96, 1.0)]), (1, [(96, 0.99999999999999989)]), (2, [(96, 0.99999999999999989)])]
--------------
new doc
Document topics: [(83, 0.85857142857142887)]
Word topics: [(1, [83]), (3, [83]), (4, [83]), (5, [83]), (6, [83]), (7, [83])]
Word phis: [(1, [(83, 1.0)]), (3, [(83, 1.0)]), (4, [(83, 1.0)]), (5, [(83, 1.0)]), (6, [(83, 1.0)]), (7, [(83, 1.0)])]
--------------
new doc
Document topics: [(28, 0.80199999999858307)]
Word topics: [(0, [28]), (6, [28]), (7, [28]), (8, [28])]
Word phis: [(0, [(28, 1.0)]), (6, [(28, 1.0)]), (7, [(28, 1.0)]), (8, [(28, 1.0)])]
--------------
new doc
Document topics: [(45, 0.80199999999999594)]
Word topics: [(2, [45]), (6, [45]), (8, [45])]
Word phis: [(2, [(45, 1.0)]), (6, [(45, 2.0)]), (8, [(45, 1.0)])]
--------------
new doc
Document topics: [(40, 0.75250000000000217)]
Word topics: [(3, [40]), (4, [40]), (7, [40])]
Word phis: [(3, [(40, 1.0000000000000002)]), (4, [(40, 1.0)]), (7, [(40, 1.0000000000000002)])]
--------------
new doc
Document topics: [(79, 0.50499999999999989)]
Word topics: [(9, [79])]
Word phis: [(9, [(79, 1.0)])]
--------------
new doc
Document topics: [(79, 0.67000000000000082)]
Word topics: [(9, [79]), (10, [79])]
Word phis: [(9, [(79, 1.0)]), (10, [(79, 1.0)])]
--------------
new doc
Document topics: [(79, 0.75249999997567552)]
Word topics: [(9, [79]), (10, [79]), (11, [79])]
Word phis: [(9, [(79, 0.99999999999999989)]), (10, [(79, 1.0)]), (11, [(79, 1.0)])]
--------------
new doc
Document topics: [(75, 0.75250000000000095)]
Word topics: [(5, [75]), (10, [75]), (11, [75])]
Word phis: [(5, [(75, 1.0)]), (10, [(75, 1.0)]), (11, [(75, 1.0)])]
--------------

And the aggregated results are as follows:

Document topics:  [(0, (93, 0.75250000000000017)), (1, (94, 0.85857142857142865)), (2, (64, 0.80200000000000116)), (3, (97, 0.80199999896254648)), (4, (75, 0.75249999999877881)), (5, (85, 0.50499999999999989)), (6, (51, 0.67000000000000226)), (7, (52, 0.75250000000000183)), (8, (49, 0.75250000000000195))]

Word topics:  [(0, [(0, [93]), (1, [93]), (2, [93])]), (1, [(1, [94]), (3, [94]), (4, [94]), (5, [94]), (6, [94]), (7, [94])]), (2, [(0, [64]), (6, [64]), (7, [64]), (8, [64])]), (3, [(2, [97]), (6, [97]), (8, [97])]), (4, [(3, [75]), (4, [75]), (7, [75])]), (5, [(9, [85])]), (6, [(9, [51]), (10, [51])]), (7, [(9, [52]), (10, [52]), (11, [52])]), (8, [(5, [49]), (10, [49]), (11, [49])])]

Word phis: [(0, [(0, [(93, 1.0)]), (1, [(93, 1.0)]), (2, [(93, 1.0)])]), (1, [(1, [(94, 1.0)]), (3, [(94, 1.0)]), (4, [(94, 1.0)]), (5, [(94, 1.0)]), (6, [(94, 1.0)]), (7, [(94, 1.0)])]), (2, [(0, [(64, 1.0)]), (6, [(64, 1.0)]), (7, [(64, 1.0)]), (8, [(64, 1.0)])]), (3, [(2, [(97, 1.0)]), (6, [(97, 2.0)]), (8, [(97, 1.0)])]), (4, [(3, [(75, 0.99999999999999989)]), (4, [(75, 0.99999999999999989)]), (7, [(75, 0.99999999999999989)])]), (5, [(9, [(85, 1.0)])]), (6, [(9, [(51, 1.0)]), (10, [(51, 1.0)])]), (7, [(9, [(52, 1.0)]), (10, [(52, 1.0)]), (11, [(52, 1.0)])]), (8, [(5, [(49, 1.0)]), (10, [(49, 1.0)]), (11, [(49, 1.0)])])]


Is the aggregated result desirable? If yes, then I will work on improving this helper function. Can you also give me some tips on improving it?

In case the document topics are unique throughout the corpus (can that happen?), I can structure the data so that it looks like this:

Document topics:  [(33, 0.80199999999605875), (5, 0.75250000000000383), (39, 0.75250000000000239), (12, 0.50499999998756462), (83, 0.75250000000000061), (52, 0.85857142808535136), (23, 0.67000000000000359), (24, 0.80200000000000249), (90, 0.75250000000000028)]

Word topics:  [(0, [5, 33]), (1, [5, 52]), (2, [5, 24]), (3, [52, 83]), (4, [52, 83]), (5, [52, 90]), (6, [52, 33, 24]), (7, [52, 33, 83]), (8, [33, 24]), (9, [12, 23, 39]), (10, [23, 39, 90]), (11, [39, 90])]

Word phis: [(0, [(5, 1.0), (33, 1.0)]), (1, [(5, 0.99999999999999989), (52, 0.99999999999999989)]), (2, [(5, 0.99999999999999989), (24, 1.0)]), (3, [(52, 1.0), (83, 1.0)]), (4, [(52, 0.99999999999999989), (83, 0.99999999999999989)]), (5, [(52, 1.0), (90, 1.0)]), (6, [(52, 0.99999999999999989), (33, 1.0), (24, 2.0)]), (7, [(52, 0.99999999999999989), (33, 1.0), (83, 0.99999999999999989)]), (8, [(33, 1.0), (24, 1.0)]), (9, [(12, 0.99999999999999989), (23, 1.0000000000000002), (39, 1.0)]), (10, [(23, 1.0), (39, 1.0), (90, 1.0)]), (11, [(39, 1.0), (90, 1.0)])]

@tmylk
Contributor

tmylk commented Oct 28, 2016

I don't see a need for an aggregation function - please convince me why you think it is helpful.

The current API below looks OK, though it needs to be documented in the docstring and in the LDA ipynb tutorial:

all_topics = lda.get_document_topics(corpus, per_word_topics=True)
for doc_topic, word_topic, word_phi in all_topics:

@Partho
Contributor Author

Partho commented Oct 28, 2016

Yes, I agree that the aggregation function is not actually needed, but users were confused and calling it like this:

doc_topics, word_topics, word_phis = lda.get_document_topics(corpus, per_word_topics=True)

instead of like this:

topics = lda.get_document_topics(corpus, per_word_topics=True)

It would be a good idea to document this in the docstring and the LDA tutorial.

Plus, users can store the data for the whole corpus like this:

topics = lda.get_document_topics(corpus, per_word_topics=True)
all_topics = [(doc_topics, word_topics, word_phis) for doc_topics, word_topics, word_phis in topics]

And access the details of a particular document like this:

print(all_topics[doc_index])

and output like this:

[[(19, 0.75250000000000328)], [(0, [19]), (1, [19]), (2, [19])], [(0, [(19, 1.0)]), (1, [(19, 1.0)]), (2, [(19, 0.99999999999999989)])]]
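Materializing the lazy result once is what makes indexing safe; the same pattern can be sketched with a plain generator standing in for the model call (the yielded values below are made up):

```python
def topics_stream():
    # Stand-in for lda.get_document_topics(corpus, per_word_topics=True):
    # it lazily yields one (doc_topics, word_topics, word_phis) tuple per document.
    yield ([(19, 0.75)], [(0, [19])], [(0, [(19, 1.0)])])
    yield ([(86, 0.86)], [(1, [86])], [(1, [(86, 1.0)])])

# list() (equivalent to the comprehension above, just shorter) materializes
# every per-document tuple, after which any doc_index can be looked up.
all_topics = list(topics_stream())
print(all_topics[0])  # ([(19, 0.75)], [(0, [19])], [(0, [(19, 1.0)])])
```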

@tmylk
Contributor

tmylk commented Oct 31, 2016

Sounds great. Let's update docstring and tutorial.

@Partho
Contributor Author

Partho commented Oct 31, 2016

@tmylk I have updated the tutorial in topic_methods.ipynb. However, the code I added to the tutorial will only run once the code in this PR is merged. So can you please review the code to be merged and let me know what other changes should be made?

@@ -977,7 +985,9 @@ def __getitem__(self, bow, eps=None):
Ignore topics with very low probability (below `eps`).

"""
return self.get_document_topics(bow, eps)
#Is eps equivalent to minimum_probability?
eps = self.minimum_probability

@@ -80,7 +80,8 @@ class LdaMulticore(LdaModel):
def __init__(self, corpus=None, num_topics=100, id2word=None, workers=None,
chunksize=2000, passes=1, batch=False, alpha='symmetric',
eta=None, decay=0.5, offset=1.0, eval_every=10, iterations=50,
gamma_threshold=0.001, random_state=None):
gamma_threshold=0.001, random_state=None,
tmylk commented Nov 7, 2016

Where is minimum_probability? Please add it and pass it through.

@@ -249,6 +249,26 @@ def testGetDocumentTopics(self):
self.assertTrue(isinstance(k, int))
self.assertTrue(isinstance(v, float))

#Test case to use the get_document_topic function for the corpus

Please add a new test for minimum_probability and minimum_phi_value. Select values that change the number of topics and docs returned.

#Test case to use the get_document_topic function for the corpus
all_topics = model.get_document_topics(self.corpus, minimum_probability=0.5, minimum_phi_value=0.3, per_word_topics=True)

self.assertEqual(model.state.numdocs, len(corpus))
tmylk commented Nov 8, 2016
So the filtering params have no effect? It is a better test if they change the outcome.

Partho (Contributor Author) replied:

I see the effect if I choose minimum_probability=0.8 and minimum_phi_value=1.0: then most of the doc topic lists and word phi lists are blank. Should I use that as a test case?

tmylk replied:

Yep

@tmylk tmylk merged commit 17a01cf into piskvorky:develop Nov 11, 2016