Latent Dirichlet allocation (LDA) is a topic model that infers topics from the word frequencies in a set of documents. It is particularly useful for estimating the mixture of topics within each document of a collection. LDA builds a topics-per-document model and a words-per-topic model, each given a Dirichlet prior.
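To make the two distributions concrete, here is a minimal sketch of the generative story that LDA assumes, using only the Python standard library. The vocabulary, the number of topics, and the Dirichlet parameter of 0.5 are all illustrative assumptions and do not come from the headline data. Fitting LDA runs in the opposite direction: it starts from the observed words and estimates the distributions.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topic_word, doc_alpha, vocab, length, rng):
    """Generate one document: pick a topic for each word, then a word from that topic."""
    # Topics-per-document distribution for this document.
    theta = sample_dirichlet(doc_alpha, rng)
    words = []
    for _ in range(length):
        topic = rng.choices(range(len(theta)), weights=theta)[0]
        word = rng.choices(vocab, weights=topic_word[topic])[0]
        words.append(word)
    return theta, words


rng = random.Random(0)

# Hypothetical toy vocabulary and topic count (assumptions for illustration).
vocab = ["rain", "flood", "storm", "vote", "election", "senate"]
num_topics = 2

# Words-per-topic distributions: one Dirichlet draw over the vocabulary per topic.
topic_word = [sample_dirichlet([0.5] * len(vocab), rng) for _ in range(num_topics)]

theta, words = generate_document(topic_word, [0.5] * num_topics, vocab, 8, rng)
print("topic mixture:", theta)
print("document:", " ".join(words))
```

Small Dirichlet parameters such as 0.5 tend to produce sparse mixtures, so a document usually leans toward a few topics and a topic toward a few words.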
The data set used here is a list of over one million news headlines published over a period of 15 years. It can be downloaded from Kaggle: https://www.kaggle.com/therohk/million-headlines/data
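As a rough sketch of how the download might be read, the snippet below pulls the headline text out of CSV rows with the standard `csv` module. The file name `abcnews-date-text.csv` and the column names `publish_date` and `headline_text` are assumptions about the Kaggle file layout, so check them against your copy. The sample rows are made up so that the sketch runs without the download.

```python
import csv


def read_headlines(lines, column="headline_text"):
    """Return the non-empty headline strings from CSV lines that start with a header row."""
    reader = csv.DictReader(lines)
    headlines = []
    for row in reader:
        # Short rows yield None for missing fields, so fall back to an empty string.
        text = (row.get(column) or "").strip()
        if text:
            headlines.append(text)
    return headlines


# Made-up sample rows in the assumed layout (not taken from the real data set).
sample = [
    "publish_date,headline_text",
    "20030219,example headline about local council",
    "20030219,another example headline",
]
headlines = read_headlines(sample)
print(len(headlines), "headlines loaded")

# With the real download (file name is an assumption):
# with open("abcnews-date-text.csv", newline="", encoding="utf-8") as f:
#     headlines = read_headlines(f)
```

`read_headlines` accepts any iterable of lines, so the same function works for an open file and for an in-memory sample.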
The following Python packages will be used:
- gensim
- nltk
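Before going further, it can help to confirm that both packages are importable. The helper below is a small sketch that uses only the standard library; `missing_packages` is a hypothetical name introduced here, not part of gensim or nltk.

```python
import importlib.util


def missing_packages(names):
    """Return the names from `names` that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


required = ["gensim", "nltk"]
missing = missing_packages(required)
if missing:
    # Both packages are published on PyPI under these names.
    print("Missing packages, try: pip install " + " ".join(missing))
else:
    print("All required packages are available.")
```

Using `find_spec` checks availability without actually importing the packages, so the check stays fast even when they are installed.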