In Natural Language Processing (NLP), recovering context from raw text is a hard problem. Word embeddings address it by mapping words to dense vectors in which semantically related words lie close together. This project builds domain-specific medical word embeddings in Python using Word2Vec and FastText.
The project uses a clinical trials dataset focused on Covid-19, obtained from Dimensions COVID-19 Publications, Datasets, and Clinical Trials. The dataset comprises 10,666 rows and 21 columns; the work concentrates on the 'Title' and 'Abstract' columns.
The primary objective is to train Skip-gram and FastText models for word embeddings and subsequently develop a search engine alongside a Streamlit UI.
- Language: Python
- Libraries: Pandas, NumPy, Matplotlib, Plotly, Gensim, Streamlit, NLTK
- Environment: Jupyter Notebook
- Import Essential Libraries.
- Read the Dataset.
- Pre-process the Data:
  - Remove URLs
  - Convert text to lowercase
  - Remove numerical values
  - Remove punctuation
  - Tokenization
  - Remove stop words
  - Lemmatization
  - Remove '\n' character from columns
- Conduct Exploratory Data Analysis (EDA):
  - Word cloud visualization
- Train the 'Skip-gram' Model.
- Train the 'FastText' Model.
- Model Embeddings and Assess Similarity.
- Generate PCA Plots for Skip-gram and FastText Models.
- Convert Abstract and Title to Vectors using the Skip-gram and FastText Models.
- Utilize the Cosine Similarity Function.
- Pre-process the Input Query.
- Define a Function to Return Top 'n' Similar Results.
- Evaluate Results.
- Deploy the Streamlit Application.
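The pre-processing steps listed above can be sketched with the standard library alone. This is a minimal illustration, not the project's actual code: the `preprocess` function, the tiny `STOP_WORDS` set, and the sample sentence are all hypothetical, and lemmatization is left out because it would normally come from NLTK, which the project lists as a dependency.

```python
import re
import string

# Hypothetical, deliberately small stop-word set for illustration only;
# NLTK's English stop-word corpus would be the usual choice in practice.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "on", "for", "to", "with", "is", "are"}

URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
# Map every punctuation character to a space so that joined words split apart.
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def preprocess(text):
    """Return a list of cleaned tokens for one title or abstract."""
    text = text.replace("\n", " ")        # remove '\n' characters
    text = URL_PATTERN.sub(" ", text)     # remove URLs
    text = text.lower()                   # convert to lowercase
    text = re.sub(r"\d+", " ", text)      # remove numerical values
    text = text.translate(PUNCT_TABLE)    # remove punctuation
    tokens = text.split()                 # simple whitespace tokenization
    # Lemmatization is intentionally omitted in this sketch.
    return [token for token in tokens if token not in STOP_WORDS]


sample = "Efficacy of Remdesivir in 1062 COVID-19 patients:\nsee https://example.org/trial"
print(preprocess(sample))
```

One side effect worth knowing: because numbers and punctuation are both stripped, a term such as "COVID-19" is reduced to the single token "covid".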
- Understanding the business problem.
- Grasping the architecture to build the Streamlit application.
- Mastery of Word2Vec and FastText models.
- Importing datasets and necessary libraries.
- Data Pre-processing.
- Basic Exploratory Data Analysis (EDA).
- Training Skip-gram model with varying parameters.
- Training FastText model with varying parameters.
- Understanding and implementing embedding models.
- Plotting PCA plots.
- Obtaining vectors for each attribute.
- Executing the Cosine similarity function.
- Input query pre-processing.
- Result evaluation.
- Building a function to return top 'n' similar results for a given query.
- Understanding the Streamlit application code.
- Deployment of the Streamlit application.
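The search steps named above (vectors for each title or abstract, cosine similarity, and a top 'n' lookup) can be sketched as follows. Everything here is an assumption made for illustration: the three-dimensional `EMBEDDINGS` values are invented, the document titles are made up, and averaging word vectors is one common way to get a document vector rather than a confirmed detail of this project. A trained Skip-gram or FastText model would supply the real vectors.

```python
import math

# Hypothetical 3-dimensional embeddings with invented values.
EMBEDDINGS = {
    "vaccine": [0.9, 0.1, 0.0],
    "immunization": [0.8, 0.2, 0.1],
    "ventilator": [0.0, 0.9, 0.3],
    "oxygen": [0.1, 0.8, 0.4],
    "trial": [0.3, 0.3, 0.3],
}
DIM = 3


def document_vector(tokens):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [EMBEDDINGS[t] for t in tokens if t in EMBEDDINGS]
    if not known:
        return [0.0] * DIM
    return [sum(column) / len(known) for column in zip(*known)]


def cosine_similarity(a, b):
    """Cosine of the angle between a and b; 0.0 if either is a zero vector."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_n(query_tokens, documents, n=2):
    """Return the n (title, score) pairs most similar to the query."""
    query_vec = document_vector(query_tokens)
    scored = [
        (title, cosine_similarity(query_vec, document_vector(tokens)))
        for title, tokens in documents.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:n]


# Hypothetical pre-processed documents keyed by title.
DOCUMENTS = {
    "Vaccine trial": ["vaccine", "trial"],
    "Oxygen therapy": ["oxygen", "ventilator"],
    "Immunization study": ["immunization"],
}

results = top_n(["vaccine"], DOCUMENTS, n=2)
for title, score in results:
    print(f"{title}: {score:.3f}")
```

The query tokens should go through the same pre-processing as the documents, otherwise casing or punctuation differences cause lookups in the embedding table to miss.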
For more insightful projects and collaboration, connect with me on: