Word2Vec and FastText Word Embeddings with Gensim in Python

🚀 Business Objective

In Natural Language Processing (NLP), extracting context from raw text is a hard problem. Word embeddings address it by mapping words to dense vectors in which semantically related words lie close together. This project builds domain-specific medical word embeddings using Word2Vec and FastText in Python.

📊 Data Description

The project uses a clinical trials dataset focused on COVID-19, obtained from Dimensions COVID-19 Publications, Datasets, and Clinical Trials. The dataset comprises 10,666 rows and 21 columns; the work here relies on the 'Title' and 'Abstract' columns.

🎯 Aim

The primary objective is to train Skip-gram (Word2Vec) and FastText models for word embeddings, then build a search engine on top of them with a Streamlit UI.

🛠️ Tech Stack

Language: Python
Libraries: Pandas, NumPy, Matplotlib, Plotly, Gensim, Streamlit, NLTK
Environment: Jupyter Notebook

🔍 Approach

Import Essential Libraries.
Read the Dataset.
Pre-process the Data:
- Remove URLs
- Convert text to lowercase
- Remove numerical values
- Remove punctuation
- Tokenization
- Remove stop words
- Lemmatization
- Remove '\n' character from columns
Conduct Exploratory Data Analysis (EDA):
- Word cloud visualization
Train the 'Skip-gram' Model.
Train the 'FastText' Model.
Inspect the Model Embeddings and Assess Word Similarity.
Generate PCA Plots for Skip-gram and FastText Models.
Convert Abstract and Title to Vectors using the Skip-gram and FastText Models.
Utilize the Cosine Similarity Function.
Pre-process the Input Query.
Define a Function to Return Top 'n' Similar Results.
Evaluate Results.
Deploy the Streamlit Application.
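The pre-processing steps listed above could be sketched roughly as follows. This is a minimal standard-library version under stated assumptions: the stop-word set is a tiny hypothetical stand-in for NLTK's English list, and lemmatization is left as an optional callable (`lemmatize`, a hypothetical parameter) so the sketch runs without downloading NLTK data. The function name `preprocess` is also an assumption.

```python
import re
import string

# Tiny illustrative stop-word set (an assumption); NLTK's English
# stop-word list would be the natural replacement in the project.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "on", "for", "to", "with", "is", "are"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
DIGIT_RE = re.compile(r"\d+")
# Map every punctuation character to a space so "covid-19" does not
# collapse into a single glued token.
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def preprocess(text, lemmatize=None):
    """Return a list of cleaned tokens for one title or abstract."""
    text = text.replace("\n", " ")       # remove '\n' characters
    text = URL_RE.sub(" ", text)         # remove URLs
    text = text.lower()                  # convert to lowercase
    text = DIGIT_RE.sub(" ", text)       # remove numerical values
    text = text.translate(PUNCT_TABLE)   # remove punctuation
    tokens = text.split()                # whitespace tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]  # remove stop words
    if lemmatize is not None:            # optional lemmatization hook
        tokens = [lemmatize(t) for t in tokens]
    return tokens


sample = "Effects of COVID-19 on\nthe lungs: see https://example.org/trial1"
tokens = preprocess(sample)
```

The same function can be applied to the search query so that queries and documents share one vocabulary. Passing a callable such as the `lemmatize` method of an NLTK `WordNetLemmatizer` instance would add the lemmatization step.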

📝 Project Takeaways

Understanding the business problem.
Grasping the architecture to build the Streamlit application.
Working knowledge of the Word2Vec and FastText models.
Importing datasets and necessary libraries.
Data Pre-processing.
Basic Exploratory Data Analysis (EDA).
Training Skip-gram model with varying parameters.
Training FastText model with varying parameters.
Understanding and implementing embedding models.
Plotting PCA plots.
Obtaining vectors for each attribute.
Executing the Cosine similarity function.
Input query pre-processing.
Result evaluation.
Building a function to return top 'n' similar results for a given query.
Understanding the Streamlit application code.
Deployment of the Streamlit application.
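The vectorization, cosine similarity, and top 'n' steps mentioned above might look like the sketch below. It assumes each text is represented by the average of its word vectors, which is a common choice but not confirmed by this README. The toy embeddings, documents, and function names are hypothetical; a plain dict stands in for a trained model, and Gensim's `model.wv` supports the same `word in wv` and `wv[word]` lookups.

```python
import numpy as np


def text_vector(tokens, vectors, dim):
    """Average the vectors of known tokens; zero vector if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)


def cosine_similarity(a, b):
    """Cosine of the angle between a and b; 0.0 if either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom else 0.0


def top_n(query_tokens, docs, vectors, dim, n=3):
    """Return the n (title, score) pairs most similar to the query."""
    query_vec = text_vector(query_tokens, vectors, dim)
    scored = [
        (title, cosine_similarity(query_vec, text_vector(tokens, vectors, dim)))
        for title, tokens in docs
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]


# Toy 2-dimensional embeddings and documents (hypothetical data).
toy_vectors = {
    "vaccine": np.array([1.0, 0.0]),
    "trial": np.array([0.8, 0.2]),
    "mask": np.array([0.0, 1.0]),
}
toy_docs = [
    ("Vaccine trial", ["vaccine", "trial"]),
    ("Mask usage", ["mask"]),
]

results = top_n(["vaccine"], toy_docs, toy_vectors, dim=2, n=1)
```

In a search application, the document vectors would normally be computed once and cached rather than recomputed for every query as this sketch does.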

🔗 Get Connected

For more insightful projects and collaboration, connect with me on:

Vidhi1290/Word2Vec-and-FastText-Word-Embedding-with-Gensim-in-Python
