# Student Projects
A list of Google Summer of Code and student thesis projects for Gensim, a scientific Python package for efficient, large-scale topic modelling.
We offer a financial reward as well as technical and academic assistance for completing these projects. Expectations are high, though; read this general summary before applying.
If you'd like to work on any of the topics below, or have your own ideas, get in touch at student-projects@rare-technologies.com.
## Online Non-negative Matrix Factorization (NNMF)
Background:
Non-negative matrix factorization, NNMF [1], is a popular machine learning algorithm, widely used in collaborative filtering and natural language processing. It can be phrased as an online learning algorithm [2].
While implementations of NNMF in Python exist [3, 4], they only work on small datasets that fit fully into RAM, which is too restrictive for many real-world applications. You will contribute a scalable implementation of NNMF to the Python data science world. A quality implementation will be widely used in industry.
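To make the streaming requirement concrete, here is a minimal sketch of one possible shape for such an algorithm: mini-batch multiplicative updates that keep only the current batch and fixed-size sufficient statistics in memory. The update rule and all names are illustrative assumptions, not the required design.

```python
import numpy as np

def nmf_streamed(batches, n_terms, n_topics, inner_iters=50, eps=1e-9):
    # W is the fixed-size (n_topics, n_terms) topic-term matrix;
    # A and B accumulate sufficient statistics, so memory stays
    # constant no matter how many mini-batches stream past.
    rng = np.random.default_rng(0)
    W = np.abs(rng.standard_normal((n_topics, n_topics * 0 + n_terms)))
    A = np.zeros((n_topics, n_topics))
    B = np.zeros((n_topics, n_terms))
    for V in batches:  # V: dense (batch_size, n_terms) bag-of-words counts
        # infer batch coefficients H with W fixed (multiplicative updates)
        H = np.abs(rng.standard_normal((V.shape[0], n_topics)))
        for _ in range(inner_iters):
            H *= (V @ W.T) / (H @ W @ W.T + eps)
        # fold the batch into the running statistics, then refresh W
        A += H.T @ H
        B += H.T @ V
        W *= B / (A @ W + eps)
    return W
```

Because `batches` can be any generator, the full corpus never has to fit in RAM; only `W`, `A`, `B` and one batch do.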
RaRe Technologies offers a financial reward as well as technical and academic assistance for completing this project. Please get in touch at student-projects@rare-technologies.com.
Goals:
- Demonstrate understanding of matrix factorization theory and practice by describing, implementing and evaluating a scalable version of the NNMF algorithm.
- Implement streamed NNMF [5] that is capable of online (incremental) updates. Model training must proceed in mini-batches of training samples, in constant memory independent of the full training set size. The implementation must rely on Python's NumPy and SciPy libraries for high-performance computing. Optionally, also implement a version that can use multiple cores on the same machine.
- Learn modern, practical distributed project collaboration and engineering tools (git, mailing lists, continuous builds, automated testing).
Deliverables:
- Code: a pull request against gensim [6] on github [7]. Gensim is an open-source Python library for Natural Language Processing. The pull request is expected to contain a robust, well-tested and well-documented, industry-strength implementation, not flimsy academic code. Check corner cases and distill your insights into documentation tips and examples.
- Report: timings and accuracy of your NNMF implementation on the English Wikipedia and on the Lee corpus [8] of human similarity judgements included in gensim. A summary of insights into parameter selection and tuning of your NNMF implementation. Optionally, you can also evaluate the quality of the NNMF factorization against other factorization methods, such as SVD and LDA [9], in collaborative filtering settings.
Resources:
[2] Online algorithm
[3] Christian Thurau et al. "Python Matrix Factorisation"
[4] Sklearn NMF code
[7] Gensim on github
[8] Lee, M., Pincombe, B., & Welsh, M. (2005). An empirical evaluation of models of text document similarity. Proceedings of the 27th Annual Conference of the Cognitive Science Society
[9] Hoffman, Blei, Bach: Online Learning for Latent Dirichlet Allocation, NIPS 2010
[11] Topics extraction with Non-Negative Matrix Factorization in sklearn
[12] Gensim github issue #132.
## Explicit Semantic Analysis (ESA)
Background: Explicit Semantic Analysis [1, 2] is a method of unsupervised document analysis using Wikipedia as a resource. It has many applications, for example event classification on Twitter [3].
While implementations of ESA exist in Python [4] and other languages [5], they only work on small datasets that fit fully into RAM, which is too restrictive for many real-world applications.
You will contribute a scalable implementation of ESA to the Python data science world. A quality implementation will be widely used in industry.
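For intuition, here is a minimal sketch of the core ESA operations, assuming a precomputed sparse TF-IDF concept-term matrix built from Wikipedia articles; the project's real work is building, streaming and querying that matrix scalably.

```python
import numpy as np
from scipy import sparse

def esa_vector(text_tfidf, concept_term):
    # text_tfidf: sparse (1, n_terms) TF-IDF vector of the input text;
    # concept_term: sparse (n_concepts, n_terms) TF-IDF matrix of Wikipedia.
    # The ESA interpretation of a text is its similarity to every concept.
    return (concept_term @ text_tfidf.T).toarray().ravel()

def esa_similarity(a, b, concept_term):
    # cosine similarity of two texts in ESA concept space
    va, vb = esa_vector(a, concept_term), esa_vector(b, concept_term)
    denom = np.linalg.norm(va) * np.linalg.norm(vb)
    return float(va @ vb) / denom if denom else 0.0
```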
RaRe Technologies offers a financial reward as well as technical and academic assistance for completing this project. Please get in touch at student-projects@rare-technologies.com.
Goals:
- Demonstrate understanding of semantic interpretation theory and practice by describing, implementing and evaluating a scalable version of the ESA algorithm.
- Implement streamed ESA that is capable of online (incremental) updates. Model training must proceed in mini-batches of training samples, in constant memory independent of the full training set size. The implementation must rely on Python's NumPy and SciPy libraries for high-performance computing. Optionally, implement a version that can use multiple cores on the same machine.
- Learn modern, practical distributed project collaboration and engineering tools (git, mailing lists, continuous builds, automated testing).
Deliverables:
- Code: a pull request against gensim [6] on github [7]. Gensim is an open-source Python library for Natural Language Processing. The pull request is expected to contain a robust, well-tested and well-documented, industry-strength implementation, not flimsy academic code. Check corner cases and distill your insights into documentation tips and examples.
- Report: timings and accuracy of your ESA implementation on the Lee corpus [8] of human similarity judgements included in gensim. A summary of insights into parameter selection and tuning of your ESA implementation. Optionally, you can also evaluate ESA against other methods of semantic analysis, such as Latent Semantic Analysis [9, 10], in an event classification task.
Resources:
[1] Evgeniy Gabrilovich and Shaul Markovitch "Wikipedia-based Semantic Interpretation for Natural Language Processing." Journal of Artificial Intelligence Research, 34:443–498, 2009
[2] Explicit Semantic Analysis.
[3] Musaev, A., Wang, D., Shridhar, S., Lai, C.-A., Pu, C. "Toward a Real-Time Service for Landslide Detection: Augmented Explicit Semantic Analysis and Clustering Composition Approaches." In Web Services (ICWS), 2015 IEEE International Conference on, pp. 511–518, June–July 2015
[4] Python implementation of ESA
[7] Gensim on github
[8] Lee, M., Pincombe, B., & Welsh, M. (2005). An empirical evaluation of models of text document similarity. Proceedings of the 27th Annual Conference of the Cognitive Science Society
[9] "Latent Semantic Analysis" article on Wikipedia
[10] Susan T. Dumais (2005). "Latent Semantic Analysis". Annual Review of Information Science and Technology 38: 188
## Supervised Latent Dirichlet Allocation (sLDA)
Note: consider integration with the existing Python sLDA implementation.
Background: Supervised Latent Dirichlet Allocation (sLDA) [1] is a Natural Language Processing method based on Latent Dirichlet Allocation (LDA) [2]. It can be used to predict, for example, the number of "likes" for a post or the number of stars in a movie review.
In vanilla LDA, we treat the topic proportions of a text document as a draw from a Dirichlet distribution. We obtain the words in the document by repeatedly choosing a topic assignment from those proportions, then drawing a word from the corresponding topic. In sLDA, we add a target variable to this model: for example, the number of stars assigned in a movie review or the number of "likes" of a post.
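To make that generative story concrete, here is a minimal sketch that samples one document and its response under sLDA; the parameter names (`topics`, `eta`, `sigma2`) are illustrative assumptions following [1], not a prescribed interface.

```python
import numpy as np

def slda_generate(alpha, topics, eta, sigma2, doc_len=100,
                  rng=np.random.default_rng()):
    # topics: (K, V) matrix whose rows are topic-word distributions;
    # eta: K regression coefficients; sigma2: response noise variance
    K, V = topics.shape
    theta = rng.dirichlet(alpha * np.ones(K))      # document topic proportions
    z = rng.choice(K, size=doc_len, p=theta)       # per-word topic assignments
    words = [rng.choice(V, p=topics[k]) for k in z]
    z_bar = np.bincount(z, minlength=K) / doc_len  # empirical topic frequencies
    y = rng.normal(eta @ z_bar, np.sqrt(sigma2))   # response, e.g. star rating
    return words, y
```

Inference then works backwards: fit the topics and `eta` so that both the observed words and the observed responses are likely.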
While academic implementations of sLDA exist in C++ and R [3, 4], there is no Python implementation available. You will contribute a scalable implementation of sLDA to the Python data science world. A quality implementation will be widely used in industry.
RaRe Technologies offers a financial reward as well as technical and academic assistance for completing this project. Please get in touch at student-projects@rare-technologies.com.
Goals:
- Demonstrate understanding of topic modelling theory and practice by describing, implementing and evaluating sLDA.
- Implement a streamed sLDA that is capable of online (incremental) updates. Processing must be done in mini-batches of training samples, in constant memory independent of the full training set size. The implementation must rely on Python's NumPy and SciPy libraries for high-performance computing. Optionally, implement a version that can use multiple cores on the same machine.
- Learn modern, practical distributed project collaboration and engineering tools (git, mailing lists, continuous builds, automated testing).
Deliverables:
- Code: a pull request against gensim [5, 6] on github [7]. Gensim is an open-source Python library for Natural Language Processing. The pull request is expected to contain a robust, well-tested and well-documented, industry-strength implementation, not flimsy academic code. Check corner cases and distill your insights into documentation tips and examples.
- Report: timings, memory use and accuracy of your sLDA implementation on the Cornell Movie Review Corpus [8], following the same methodology as in [1]. A summary of insights into parameter selection and tuning of sLDA.
Resources:
[3] sLDA implementation in C++
[4] Implementation of sLDA in R
[7] Gensim on github
[8] Movie Review Dataset from Cornell NLP group
## Online word2vec
Background: Word2Vec [1, 2] is a continuous word representation technique for creating word vectors that capture the syntax and semantics of words. The vectors used to represent the words have many interesting features, for example king − man + woman = queen.
The original Word2Vec algorithm cannot add new words to its vocabulary after the initial training. This is quite limiting for, say, a news recommendation engine that encounters new words every day; many other real-world applications would likewise benefit from adding new words to the vocabulary during training. This modification is called online training [3] of a Word2vec model.
There is no robust implementation of online Word2vec available in any programming language. You will contribute a scalable implementation of online Word2Vec to the data science world in Python. A quality implementation will be widely used in industry.
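The core new mechanic is growing the model between training rounds. Here is a minimal sketch of just the vocabulary-expansion step, assuming the vectors live in a NumPy matrix indexed by a word-to-row dict (the helper name is hypothetical); actual training would then simply continue SGD over the new text, leaving already-trained rows untouched.

```python
import numpy as np

def expand_vocab(vectors, word2row, new_words, rng=np.random.default_rng()):
    # vectors: (vocab_size, dim) embedding matrix; word2row: word -> row index.
    # Grow the matrix with small random rows for unseen words only.
    dim = vectors.shape[1]
    unseen = [w for w in new_words if w not in word2row]
    for w in unseen:
        word2row[w] = len(word2row)            # assign the next free row index
    if unseen:
        fresh = (rng.random((len(unseen), dim)) - 0.5) / dim  # word2vec-style init
        vectors = np.vstack([vectors, fresh])
    return vectors, word2row
```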
RaRe Technologies offers a financial reward as well as technical and academic assistance for completing this project. Please get in touch at student-projects@rare-technologies.com.
Goals:
- Demonstrate understanding of the theory and practice of distributed representations of words by describing, implementing and evaluating online word2vec.
- Implement a streamed online word2vec that is capable of online (incremental) updates. Processing must be done in mini-batches of training samples, in constant memory independent of the full training set size. The implementation must rely on Python's NumPy and SciPy libraries for high-performance computing. Optionally, implement a version that can use multiple cores on the same machine.
- Learn modern, practical distributed project collaboration and engineering tools (git, mailing lists, continuous builds, automated testing).
Deliverables:
- Code: a pull request against gensim [4] on github [5]. Gensim is an open-source Python library for Natural Language Processing. The pull request is expected to contain a robust, well-tested and well-documented, industry-strength implementation, not flimsy academic code. Check corner cases and distill your insights into documentation tips and examples.
- Report: timings, memory use and accuracy of your online word2vec implementation on the Lee corpus [6] of human similarity judgements included in gensim. A summary of insights into parameter selection and tuning of online word2vec.
Resources:
[1] Mikolov, Tomas, et al. "Efficient estimation of word representations in vector space." arXiv preprint arXiv:1301.3781 (2013)
[2] Gensim word2vec tutorial at Kaggle
[3] Online algorithm
[5] Gensim on github
[6] Lee, M., Pincombe, B., & Welsh, M. (2005). An empirical evaluation of models of text document similarity. Proceedings of the 27th Annual Conference of the Cognitive Science Society
## Word Mover's Distance (WMD)
Background: Word2Vec [1, 2] is a continuous word representation technique for creating word vectors that capture the syntax and semantics of words. The vectors used to represent the words have many interesting features, for example king − man + woman = queen.
Many methods have been proposed for measuring the distance between sentences in this vector space. Word Mover's Distance (WMD) [3] is a novel measure of distance between text documents. It outperforms simple combinations of word vectors such as their sum or mean. Intuitively, the distance between two documents is the minimum cumulative distance that the words of document A must travel to exactly match document B.
For example, these two sentences are close with respect to WMD even though they have only one word in common: "The restaurant is loud, we couldn't speak across the table" and "The restaurant has a lot to offer but easy conversation is not there". [4]
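Formally, WMD is a transportation problem over word embedding distances. Below is a minimal sketch that solves that linear program directly with SciPy's generic `linprog`; this is fine for toy documents, whereas a real implementation needs a dedicated optimal-transport solver plus the pruning tricks from [3]. All names are illustrative.

```python
import numpy as np
from scipy.optimize import linprog

def wmd(freq_a, freq_b, emb_a, emb_b):
    # freq_a, freq_b: normalized word frequencies of the two documents;
    # emb_a, emb_b: matching (n_words, dim) word2vec vectors
    n, m = len(freq_a), len(freq_b)
    # cost[i, j] = distance word i of A must "travel" to become word j of B
    cost = np.linalg.norm(emb_a[:, None, :] - emb_b[None, :, :], axis=2)
    # transportation constraints: mass leaving each word of A,
    # and mass arriving at each word of B
    A_eq = np.zeros((n + m, n * m))
    for i in range(n):
        A_eq[i, i * m:(i + 1) * m] = 1.0
    for j in range(m):
        A_eq[n + j, j::m] = 1.0
    b_eq = np.concatenate([freq_a, freq_b])
    res = linprog(cost.ravel(), A_eq=A_eq, b_eq=b_eq,
                  bounds=(0, None), method="highs")
    return res.fun  # minimum cumulative travel distance
```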
While there is an academic implementation in C [5], there is no implementation of WMD available in Python. You will contribute a scalable implementation of WMD to the data science world in Python. A quality implementation will be widely used in industry.
RaRe Technologies offers a financial reward as well as technical and academic assistance for completing this project. Please get in touch at student-projects@rare-technologies.com.
Goals:
- Demonstrate understanding of the theory and practice of document distances by describing, implementing and evaluating WMD.
- Implement WMD. Processing must be done in constant memory independent of the full training set size. The implementation must rely on Python's NumPy and SciPy libraries for high-performance computing. Optionally, implement a version that can use multiple cores on the same machine.
- Learn modern, practical distributed project collaboration and engineering tools (git, mailing lists, continuous builds, automated testing).
Deliverables:
- Code: a pull request against gensim [6] on github [7]. Gensim is an open-source Python library for Natural Language Processing. The pull request is expected to contain a robust, well-tested and well-documented, industry-strength implementation, not flimsy academic code. Check corner cases and distill your insights into documentation tips and examples.
- Report: timings, memory use and accuracy of your WMD implementation on the freely available datasets in [3], for example the "20 newsgroups" corpus [8]. A summary of insights into parameter selection and tuning of document distances.
Resources:
[2] Gensim word2vec tutorial at Kaggle
[3] "From Word Embeddings to Document Distances" Kusner et al 2015
[4] [Sudeep Das, "Navigating themes in restaurant reviews with Word Mover's Distance", 2015](http://tech.opentable.com/2015/08/11/navigating-themes-in-restaurant-reviews-with-word-movers-distance/)
[5] Matthew J Kusner's WMD in C on github
[7] Gensim on github
## Author-topic models
Background: The author-topic model [1] is a Natural Language Processing method that tells us about a person's writing. It can quantify how diverse a range of topics a single author covers, and it can compare two authors and say how similar they are.
(Note: the best existing implementation to build on is the collapsed variational Bayes one, [8] below.)
The author-topic model adds information about a document's authors to the very popular Latent Dirichlet Allocation (LDA) model [2].
While there are academic implementations in Python and other languages [3, 4], they are very slow on large datasets. You will contribute a scalable implementation of author-topic modelling to the data science world in Python. A quality implementation will be widely used in industry.
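As a concrete restatement, here is a minimal sketch of the generative story from [1]: each word first picks one of the document's authors uniformly at random, then a topic from that author's topic mixture. The per-author mixtures are sampled inline purely for illustration; in reality they are what the model learns.

```python
import numpy as np

def author_topic_generate(doc_authors, alpha, topics, doc_len=100,
                          rng=np.random.default_rng()):
    # topics: (K, V) matrix whose rows are topic-word distributions
    K, V = topics.shape
    theta = {a: rng.dirichlet(alpha * np.ones(K)) for a in doc_authors}
    words = []
    for _ in range(doc_len):
        a = rng.choice(doc_authors)      # pick one of the document's authors
        z = rng.choice(K, p=theta[a])    # topic from that author's mixture
        words.append(rng.choice(V, p=topics[z]))
    return words
```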
RaRe Technologies offers a financial reward as well as technical and academic assistance for completing this project. Please get in touch at student-projects@rare-technologies.com.
Goals:
- Demonstrate understanding of the theory and practice of topic modelling by describing, implementing and evaluating author-topic modelling.
- Implement a streamed author-topic model that is capable of online (incremental) updates. Processing must be done in mini-batches of training samples, in constant memory independent of the full training set size. The implementation must rely on Python's NumPy and SciPy libraries for high-performance computing. Optionally, implement a version that can use multiple cores on the same machine.
- Learn modern, practical distributed project collaboration and engineering tools (git, mailing lists, continuous builds, automated testing).
- A very interesting point here is adapting the Gibbs-sampling formulation of the original paper [1] to gensim's variational inference framework.
Deliverables:
- Code: a pull request against gensim on github [6]. Gensim is an open-source Python library for Natural Language Processing. The pull request is expected to contain a robust, well-tested and well-documented, industry-strength implementation, not flimsy academic code. Check corner cases and distill your insights into documentation tips and examples.
- Report: timings, memory use and accuracy of your author-topic model using the NIPS papers dataset [7], following the methodology of [1]. A summary of insights into parameter selection and tuning of the model.
Resources:
[1] Rosen-Zvi, Michal, et al. "The author-topic model for authors and documents." Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence. AUAI Press, 2004
[2] Hoffman, Blei, Bach: Online Learning for Latent Dirichlet Allocation, NIPS 2010.
[3] Author-topic model in Python
[6] Gensim on github
[7] NIPS text corpus in MATLAB format
[8] Collapsed VB implementation
## Distributed LDA
Background: Latent Dirichlet Allocation (LDA) [1] is a very popular algorithm for modelling topics of text documents.
Modern data mining relies on high-level distributed [2] frameworks like Hadoop, Spark [3], Celery [4], Disco [5], Samza [6] and Ibis [7].
While there are implementations of distributed LDA in Scala over Spark and in other languages, there is no established distributed computing framework that contains an LDA implementation in Python. You will contribute a scalable implementation of distributed LDA to the data science world in Python, building on top of one of the existing distributed frameworks. A quality implementation will be widely used in industry.
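Whatever framework you pick, the natural decomposition is online variational LDA [1]: the per-document E-step becomes the parallel map stage and the topic update becomes the reduce stage. Here is a minimal single-machine sketch of those two stages; the framework wiring is omitted and all names are illustrative.

```python
import numpy as np
from scipy.special import digamma

def e_step(docs, lam, alpha, n_iters=50):
    # map stage, run on each worker: infer per-document topic mixtures
    # against fixed topics `lam` (K, V), return this shard's statistics
    K, _ = lam.shape
    E_log_beta = digamma(lam) - digamma(lam.sum(axis=1, keepdims=True))
    sstats = np.zeros_like(lam)
    for word_ids, counts in docs:          # one sparse doc: term ids + counts
        gamma = np.ones(K)
        for _ in range(n_iters):
            phi = np.exp(digamma(gamma))[:, None] * np.exp(E_log_beta[:, word_ids])
            phi /= phi.sum(axis=0, keepdims=True)
            gamma = alpha + phi @ counts
        sstats[:, word_ids] += phi * counts
    return sstats

def m_step(lam, worker_sstats, rho, eta, corpus_size, batch_size):
    # reduce stage, run on the driver: merge the shards' statistics into
    # one stochastic update of the topics with step size rho, as in [1]
    sstats = sum(worker_sstats)
    return (1 - rho) * lam + rho * (eta + corpus_size * sstats / batch_size)
```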
RaRe Technologies offers a financial reward as well as technical and academic assistance for completing this project. Please get in touch at student-projects@rare-technologies.com.
Goals:
- Demonstrate understanding of the theory and practice of distributed computing and topic modelling by describing, implementing and evaluating distributed LDA.
- Implement a streamed distributed LDA model that is capable of online (incremental) updates. Processing must be done in mini-batches of training samples, in constant memory independent of the full training set size. The implementation must rely on Python's NumPy and SciPy libraries for high-performance computing. By integrating with one of the existing distributed frameworks, it must simultaneously use multiple machines and multiple cores on each machine.
- Learn modern, practical distributed project collaboration and engineering tools (git, mailing lists, continuous builds, automated testing).
Deliverables:
- Code: a pull request against gensim [8] on github [9]. Gensim is an open-source Python library for Natural Language Processing. The pull request is expected to contain a robust, well-tested and well-documented, industry-strength implementation, not flimsy academic code. Check corner cases and distill your insights into documentation tips and examples. Gensim already contains a very manual, low-level distributed implementation of LDA [10] that you can build on.
- Report: timings, memory use and accuracy of your distributed LDA implementation on the English Wikipedia corpus. A summary of insights into parameter selection and tuning of the model; in particular, how performance changes as cores and machines are added to the cluster.
Resources:
[1] Hoffman, Blei, Bach: Online Learning for Latent Dirichlet Allocation, NIPS 2010
[2] MapReduce: Simplified Data Processing on Large Clusters
[3] Spark distributed computing framework
[4] Celery
[5] Disco
[6] Storm, Samza.
[7] Ibis
[9] Gensim on github
[10] Low-level distributed LDA in gensim
## Distributed LSI
Background: Latent Semantic Indexing (LSI) [1] is a very popular algorithm for modelling topics of text documents.
Modern data mining relies on high-level distributed [2] frameworks like Hadoop, Spark [3], Celery [4], Disco [5], Samza [6] and Ibis [7].
While there are implementations of distributed LSI in Scala over Spark and in other languages, there is no established distributed computing framework that contains an LSI implementation in Python. You will contribute a scalable implementation of distributed LSI to the data science world in Python, building on top of one of the existing distributed frameworks. A quality implementation will be widely used in industry.
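One key building block is merging truncated SVDs computed independently on separate document chunks, similar in spirit to what gensim's one-pass LSI [10] already does on a single machine. A minimal sketch of such a merge, assuming each worker returns its left singular vectors and singular values:

```python
import numpy as np

def merge_svd(u1, s1, u2, s2, k):
    # u1, u2: (n_terms, k_i) left singular vectors from two workers'
    # truncated SVDs of their own document chunks; s1, s2: singular values.
    # Stack the scaled bases, re-orthogonalize, and truncate to rank k.
    combined = np.hstack([u1 * s1, u2 * s2])
    q, r = np.linalg.qr(combined)
    u_r, s_r, _ = np.linalg.svd(r)
    return q @ u_r[:, :k], s_r[:k]
```

Applied pairwise in a tree, this reduce step lets arbitrarily many workers contribute to a single rank-k factorization.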
RaRe Technologies offers a financial reward as well as technical and academic assistance for completing this project. Please get in touch at student-projects@rare-technologies.com.
Goals:
- Demonstrate understanding of the theory and practice of distributed computing and topic modelling by describing, implementing and evaluating distributed LSI.
- Implement a streamed distributed LSI model that is capable of online (incremental) updates. Processing must be done in mini-batches of training samples, in constant memory independent of the full training set size. The implementation must rely on Python's NumPy and SciPy libraries for high-performance computing. By integrating with one of the existing distributed frameworks, it must simultaneously use multiple machines and multiple cores on each machine.
- Learn modern, practical distributed project collaboration and engineering tools (git, mailing lists, continuous builds, automated testing).
Deliverables:
- Code: a pull request against gensim [8] on github [9]. Gensim is an open-source Python library for Natural Language Processing. The pull request is expected to contain a robust, well-tested and well-documented, industry-strength implementation, not flimsy academic code. Check corner cases and distill your insights into documentation tips and examples. Gensim already contains a very manual, low-level distributed implementation of LSI [10] that you can build on.
- Report: timings, memory use and accuracy of your distributed LSI implementation on the English Wikipedia corpus. A summary of insights into parameter selection and tuning of the model.
Resources:
[1] Susan T. Dumais (2005). "Latent Semantic Analysis". Annual Review of Information Science and Technology 38: 188
[2] MapReduce: Simplified Data Processing on Large Clusters
[3] Spark distributed computing framework
[4] Celery
[5] Disco
[6] Storm, Samza.
[7] Ibis
[9] Gensim on github
[10] Low-level distributed LSI in gensim
[11] LSI on Spark
## Distributed word2vec
Background: Word2Vec [1, 2] is a continuous word representation technique for creating word vectors that capture the syntax and semantics of words. The vectors used to represent the words have many interesting features, for example king − man + woman = queen.
Modern data mining relies on high-level distributed [3] frameworks like Hadoop, Spark [4], Celery [5], Disco [6], Samza [7] and Ibis [8].
While there are implementations of distributed word2vec in Scala over Spark [9] and in other languages [10], there is no established distributed computing framework that contains a word2vec implementation in Python. You will contribute a scalable implementation of distributed word2vec to the data science world in Python, building on top of one of the existing distributed frameworks. A quality implementation will be widely used in industry.
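One naive baseline worth benchmarking against: train an independent model per shard, then combine the shards by count-weighted averaging over a shared vocabulary. A minimal sketch, under the assumption that all shards use the same word-to-row mapping; a serious implementation would more likely use asynchronous parameter-server style updates instead.

```python
import numpy as np

def average_shards(shard_vectors, shard_word_counts):
    # shard_vectors: list of (vocab_size, dim) matrices, one per machine,
    # all indexed by the same word-to-row mapping; weight each shard's
    # vectors by how many training words that shard actually saw
    w = np.asarray(shard_word_counts, dtype=float)
    w /= w.sum()
    return np.tensordot(w, np.stack(shard_vectors), axes=1)
```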
RaRe Technologies offers a financial reward as well as technical and academic assistance for completing this project. Please get in touch at student-projects@rare-technologies.com.
Goals:
- Demonstrate understanding of the theory and practice of distributed computing and word representations by describing, implementing and evaluating distributed word2vec.
- Implement a streamed distributed word2vec model that is capable of online (incremental) updates. Processing must be done in mini-batches of training samples, in constant memory independent of the full training set size. The implementation must rely on Python's NumPy and SciPy libraries for high-performance computing. By integrating with one of the existing distributed frameworks, it must simultaneously use multiple machines and multiple cores on each machine.
- Learn modern, practical distributed project collaboration and engineering tools (git, mailing lists, continuous builds, automated testing).
Deliverables:
- Code: a pull request against gensim [11] on github [12]. Gensim is an open-source Python library for Natural Language Processing. The pull request is expected to contain a robust, well-tested and well-documented, industry-strength implementation, not flimsy academic code. Check corner cases and distill your insights into documentation tips and examples. Gensim already contains a very manual, low-level distributed implementation of word2vec that you can build on.
- Report: timings, memory use and accuracy of your distributed word2vec implementation on the English Wikipedia corpus. A summary of insights into parameter selection and tuning of the model.
Resources:
[2] Gensim word2vec tutorial at Kaggle
[3] MapReduce: Simplified Data Processing on Large Clusters
[4] Spark distributed computing framework
[5] Celery
[6] Disco
[7] Storm, Samza.
[8] Ibis
[10] word2vec in DeepLearning4J
[12] Gensim on github
## WordRank
WordRank is a new word embedding algorithm.
Investigate how it compares to word2vec by expanding on the approach in this blog.
See https://github.com/RaRe-Technologies/gensim/issues/665
## LargeVis
A technique for visualizing large-scale, high-dimensional data. Faster than t-SNE!
Code in https://github.com/lferry007/LargeVis
https://arxiv.org/abs/1602.00370
Very useful in non-English languages.
The paper mentions some Recurrent Neural Network code using the blocks package.
http://arxiv.org/pdf/1608.01056.pdf
Much better performance than the current variational-inference way of fitting LDA.
Either implement in Python or find a way to load the model trained on Spark.
Shows how good your word2vec model is on specific syntactic and semantic tasks. A wrapper around this code: https://github.com/ytsvetko/qvec
A sense embedding is able to learn multiple representations per word, capturing different word meanings.
Integrate one of the existing word sense embeddings into gensim. AdaGram is currently the best one.
Low priority, as this rarely appears in production.
Consider:
https://research.googleblog.com/2016/08/text-summarization-with-tensorflow.html
Translate from R into Python using existing Gensim code. Medium difficulty.
From a gensim issue suggestion: "Hi, it seems that wordspace model is very useful (http://infomap-nlp.sourceforge.net/doc/algorithm.html and https://cran.r-project.org/web/packages/wordspace/index.html). It is similar to the lsa model except that wordspace decomposes a co-occurrence matrix instead of term-document matrix."
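A minimal sketch of the quoted idea, assuming tokenized input and SciPy's truncated sparse SVD; the windowing and weighting are kept deliberately naive.

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import svds

def wordspace(corpus, window=5, k=100):
    # corpus: iterable of token lists; count word-word co-occurrences
    # within a +/- `window` context, then decompose with truncated SVD
    vocab, rows, cols = {}, [], []
    for tokens in corpus:
        ids = [vocab.setdefault(t, len(vocab)) for t in tokens]
        for i, w in enumerate(ids):
            for c in ids[max(0, i - window):i + window + 1]:
                if c != w:
                    rows.append(w)
                    cols.append(c)
    n = len(vocab)
    cooc = sparse.coo_matrix((np.ones(len(rows)), (rows, cols)),
                             shape=(n, n)).tocsr()  # duplicates are summed
    u, s, _ = svds(cooc.asfptype(), k=min(k, n - 1))  # assumes n > 1
    return u * s, vocab  # word vectors (one row per word), word -> row map
```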
Change HashDictionary to use cuckoo hashing.
Hat-tip to A. Mueller
See the "Bidirectional LSTM Recurrent Neural Network" paper.
See the paper.