Project Summary: This project focuses on applying Topic Modeling to a BBC News dataset. Topic Modeling is a statistical modeling technique used to uncover the main themes and topics present in a collection of documents or textual data. The project was carried out in several parts, as follows:
Analysis and Preprocessing
- Performed an in-depth analysis of the dataset and its textual content using descriptive statistics.
- Used visualizations, including word clouds, to identify the most frequently used words and patterns in word occurrence.
- Combined the titles and descriptions of the news articles to extract more information from the text (see the sketch after this list).
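A minimal sketch of this exploration step, assuming the dataset is loaded into a pandas DataFrame with hypothetical `title` and `description` columns (the file name `bbc_news.csv` is also an assumption):

```python
import pandas as pd
from collections import Counter
from wordcloud import WordCloud
import matplotlib.pyplot as plt

df = pd.read_csv("bbc_news.csv")                       # hypothetical file name
df["text"] = df["title"] + " " + df["description"]     # combine title and description

# Basic descriptive statistics on document length
df["n_words"] = df["text"].str.split().str.len()
print(df["n_words"].describe())

# Most common words (purely exploratory, before stopword removal)
counts = Counter(" ".join(df["text"]).lower().split())
print(counts.most_common(20))

# Word cloud of the combined text
wc = WordCloud(width=800, height=400, background_color="white")
wc.generate(" ".join(df["text"]).lower())
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()
```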
Text Preprocessing
- Performed text preprocessing to clean the text and prepare it for analysis.
- Removed stopwords, punctuation, extra spaces, and other characters that add no semantic meaning and could interfere with the accuracy of the analysis (a minimal sketch follows below).
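A minimal preprocessing sketch, assuming NLTK English stopwords and the `df` DataFrame from the exploration sketch above; the `preprocess` helper is illustrative, not the project's exact implementation:

```python
import re
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)
STOPWORDS = set(stopwords.words("english"))

def preprocess(text: str) -> str:
    """Lowercase, strip punctuation/digits, drop stopwords, collapse extra spaces."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)           # remove punctuation, digits, stray characters
    tokens = [t for t in text.split() if t not in STOPWORDS and len(t) > 2]
    return " ".join(tokens)                         # split/join also collapses extra whitespace

df["clean_text"] = df["text"].apply(preprocess)     # 'df' from the exploration sketch above
```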
Text Vectorization using FastText and Embedding Visualization using UMAP
- Converted the preprocessed text into vector representations using FastText embeddings, capturing the semantic meaning of words in numerical form.
- Visualized the resulting embeddings in two dimensions using UMAP (see the sketch after this list).
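One possible sketch of the vectorization and visualization steps, training a small FastText model with gensim and projecting averaged document vectors with umap-learn; the hyperparameters (vector size, window, neighbors, etc.) are assumptions, and the project may equally have used pre-trained FastText vectors:

```python
import numpy as np
from gensim.models import FastText
import umap
import matplotlib.pyplot as plt

tokenized = [doc.split() for doc in df["clean_text"]]   # 'df' from the preprocessing sketch

# Train a small FastText model on the cleaned corpus
ft_model = FastText(sentences=tokenized, vector_size=100, window=5,
                    min_count=2, epochs=10)

def doc_vector(tokens):
    """Average the FastText vectors of a document's tokens."""
    vecs = [ft_model.wv[t] for t in tokens if t in ft_model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(ft_model.wv.vector_size)

doc_vectors = np.vstack([doc_vector(t) for t in tokenized])

# 2-D UMAP projection of the document embeddings
reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=42)
coords = reducer.fit_transform(doc_vectors)

plt.scatter(coords[:, 0], coords[:, 1], s=3, alpha=0.5)
plt.title("UMAP projection of FastText document embeddings")
plt.show()
```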
Topic Modeling with LDA
- Applied the Latent Dirichlet Allocation (LDA) algorithm, a popular topic modeling technique, to identify underlying topics within the dataset.
- Analyzed patterns of word co-occurrence in the documents to uncover latent themes or topics (a minimal sketch follows below).
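A minimal LDA sketch using gensim, continuing from the preprocessing sketch above; the number of topics and the vocabulary-filtering thresholds are assumed parameters, not values from the project:

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

tokenized = [doc.split() for doc in df["clean_text"]]   # cleaned corpus from earlier

dictionary = Dictionary(tokenized)
dictionary.filter_extremes(no_below=5, no_above=0.5)    # drop very rare / very common terms
corpus = [dictionary.doc2bow(doc) for doc in tokenized]

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=5,
               passes=10, random_state=42)
```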
Analysis of LDA Results
- Interpreted the results obtained from the LDA model.
- Identified the most significant terms within each topic to gain insights into the main themes present in the dataset (see the sketch below).
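A short sketch of how the fitted model from the previous step could be inspected, printing the top terms per topic and the dominant topic of a single document; `lda` and `corpus` are reused from the LDA sketch above:

```python
# Top 10 terms for every topic
for topic_id, terms in lda.show_topics(num_topics=-1, num_words=10, formatted=False):
    print(f"Topic {topic_id}: {[word for word, _ in terms]}")

# Dominant topic for the first document
doc_topics = lda.get_document_topics(corpus[0])
print(max(doc_topics, key=lambda x: x[1]))
```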