Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG] Dynamic Topic Models in python. #739

Merged
merged 40 commits into from
Aug 18, 2016
Merged

Conversation

bhargavvader
Copy link
Contributor

@bhargavvader bhargavvader commented Jun 9, 2016

A python translation from C of Dynamic Topic Models, as described by Blei at al in this paper, originally written in C here.

Dynamic Topic Models are used to topic model time-series tagged documents such that the topics evolve over each time-slice. This is useful in finding out how certain key words in topics change over time and in finding document similarity between documents far apart in time but contextually similar.

This was my Google Summer of Code 2016 project.

Future Works:

  • Make the corpus streaming similar to LdaModel
  • Include Document Influence Model (DIM) mode. Most of the infrastructure for this is in place.
  • See if LdaPost can be replaced by LdaModel completely without breaking anything.
    - in particular, a lot of DIM depends on LdaPost being in place.
  • Heavy lifting going on in the sslm class - efforts can be made to cythonise mathematical methods.
    - in particular, update_obs and the optimization takes a lot time.
  • Try and make it distributed, especially around the E and M step.
  • Get rid of all C/C++ coding styles if left behind.

@tmylk
Copy link
Contributor

tmylk commented Jun 21, 2016

Could you please update it to be more up to date with your private fork?

@bhargavvader
Copy link
Contributor Author

@tmylk all the methods are more or less done. I still have to see scipy implementations of the gsl methods and replace them, as well as add tests for the entire module.

@bhargavvader
Copy link
Contributor Author

Note: as of now, the classes and methods are not well arranged, and there are a few mock classes (which will be removed) to help me with testing. Once all the testing is done and a crude DTM is fit, I will rearrange the structure of the DTM code.

import sys

# this is a mock LDA class to help with testing until this is figured out
class mockLDA(utils.SaveLoad):
Copy link
Contributor

@tmylk tmylk Aug 3, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should the mock class be in test folder rather than in prod file?

@bhargavvader
Copy link
Contributor Author

Cleaned up code, added docstrings so that it is more reviewable now. Made changes to many core classes so most of the tests would be failing; will fix them and further clean tomorrow.

totals = numpy.zeros(counts.shape[1])

for w in range(0, W):
self.variance[w], self.fwd_variance[w] = self.compute_post_variance(w, self.chain_variance)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

zip* and list comprehension is easier to read

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do.


1) Include DIM mode. Most of the infrastructure for this is in place.
2) See if LdaPost can be replaces by LdaModel completely without breakign anything.
3) Heavy lifting going on in the sslm class - efforts can be made to cythonise mathematical methods.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how much time is spent there according to line profiler? is it really a perf bottleneck?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fit_sslm takes up majority of the time with update_obs being particularly slow.

@bhargavvader
Copy link
Contributor Author

@tmylk , @piskvorky , I've added a tutorial notebook and tests to accompany the code.
Could you do a review so I can start working on changes?

@tmylk
Copy link
Contributor

tmylk commented Aug 17, 2016

@piskvorky I think this is ready for your review. (Though I still have some comments about the code and notebook to discuss with Bhargav today.)

@piskvorky
Copy link
Owner

piskvorky commented Aug 18, 2016

OK, I'll have a look, probably this weekend. Thanks @bhargavvader !

One thing I see immediately is that this PR needs a better description. When people come here from the CHANGELOG/web/google/blog etc, they shouldn't be greeted with "This is a very, very rough draft". Add some high-level motivational/solution explanation section to the description (possibly taken from the notebook, if it's there).

@bhargavvader bhargavvader changed the title [WIP] DTM sample classes, helper methods. [MRG] Dynamic Topic Models in python. Aug 18, 2016
@bhargavvader
Copy link
Contributor Author

@piskvorky , haha, yes. I've changed the title and description with a brief explanation and TODO for further works.

@tmylk tmylk merged commit 1ae1338 into piskvorky:develop Aug 18, 2016
@piskvorky
Copy link
Owner

piskvorky commented Aug 19, 2016

@tmylk I see a lot of TODOs in the description, and a request for review. Was this really meant to be merged?

@bhargavvader
Copy link
Contributor Author

bhargavvader commented Aug 19, 2016

@piskvorky , the TODOs in the description are more of further works to improve the code- things like improving performance and making it distributed. I've changed it from TODO to Future Works to reflect this.
I am opening another PR (#831) to work on the tutorial notebook, and on incorporating suggestions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants