Contrastive Learning for Sentence Embeddings

This repository provides a simple implementation of the unsupervised contrastive learning framework for generating sentence embeddings, as described in the paper "SimCSE: Simple Contrastive Learning of Sentence Embeddings".

Setup Environment Requirements

Python 3.9
transformers 4.16.2
tensorflow 2.8
tensorflow-addons 0.16.1
scikit-learn 1.0.2
numpy 1.22.2
pandas 1.4.1
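
To confirm that an environment matches the versions listed above, a small check along these lines may help. This is only a sketch: the `REQUIRED` mapping and the `check_versions` helper are hypothetical names introduced here and are not part of the repository, and a version is treated as matching when it equals the listed one or extends it (for example, 2.8.0 for 2.8).

```python
from importlib.metadata import PackageNotFoundError, version

# Versions copied from the requirements list in this README.
REQUIRED = {
    "transformers": "4.16.2",
    "tensorflow": "2.8",
    "tensorflow-addons": "0.16.1",
    "scikit-learn": "1.0.2",
    "numpy": "1.22.2",
    "pandas": "1.4.1",
}


def check_versions(required=REQUIRED):
    """Return {package: (wanted, installed_or_None)} for every mismatch."""
    problems = {}
    for name, wanted in required.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None  # package is not installed at all
        matches = installed is not None and (
            installed == wanted or installed.startswith(wanted + ".")
        )
        if not matches:
            problems[name] = (wanted, installed)
    return problems


if __name__ == "__main__":
    for name, (wanted, installed) in check_versions().items():
        print(f"{name}: wanted {wanted}, found {installed}")
```

An empty result means every listed package is installed at the listed version.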

Usage

Data

The dataset can be downloaded by running the command given in the data/data.txt file in Windows PowerShell. It comprises 1 million sentences randomly sampled from English Wikipedia. Before running the code, make sure the data folder contains the wiki1m_for_simcse.txt file.
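
The file check described above can be sketched as follows. The `load_sentences` helper is a hypothetical name, not a function from main.py or model.py, and the sketch assumes the file holds one sentence per line.

```python
from pathlib import Path

# Location taken from this README; adjust it if your checkout differs.
DATA_FILE = Path("data") / "wiki1m_for_simcse.txt"


def load_sentences(path, limit=None):
    """Return the non-empty lines of a text file, assuming one sentence per line."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"Expected the dataset at {path}; download it first.")
    sentences = []
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            if limit is not None and len(sentences) >= limit:
                break  # stop early once enough sentences are collected
            line = line.strip()
            if line:
                sentences.append(line)
    return sentences


if __name__ == "__main__":
    if DATA_FILE.is_file():
        sample = load_sentences(DATA_FILE, limit=3)
        print(f"Loaded {len(sample)} sample sentences from {DATA_FILE}")
    else:
        print(f"{DATA_FILE} is missing; see data/data.txt for the download command.")
```

Passing `limit` keeps a quick sanity check from reading all 1 million sentences into memory.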

Training

To train the unsupervised contrastive learning approach, run main.py.
All hyperparameters that control model training, along with the paths to the input and output data directories, are set in main.py. Change these values to see how the approach performs under different hyperparameter settings.
model.py contains the implementation of the unsupervised contrastive learning-based language model; main.py imports it to train and test the approach.
The approach can also be used to compute text similarity with the pretrained language model; an example is included in main.py.
The average loss of the model is printed after every epoch.
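
For readers unfamiliar with the objective, the unsupervised SimCSE loss can be sketched in a few lines. This is an illustration only, written in NumPy rather than TensorFlow, and it is not the code in model.py. It assumes each sentence in a batch is encoded twice with different dropout masks (giving `z1` and `z2`), that the other sentences in the batch act as negatives, and that the temperature is 0.05 as in the SimCSE paper. The name `simcse_unsup_loss` is hypothetical.

```python
import numpy as np


def simcse_unsup_loss(z1, z2, temperature=0.05):
    """Contrastive (InfoNCE) loss with in-batch negatives.

    z1, z2: arrays of shape (batch, dim). Row i of each array is assumed to be
    the same sentence encoded twice with different dropout masks.
    """
    z1 = np.asarray(z1, dtype=np.float64)
    z2 = np.asarray(z2, dtype=np.float64)
    # L2-normalise rows so that dot products are cosine similarities.
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = (z1 @ z2.T) / temperature  # shape (batch, batch)
    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The positive for row i sits in column i, so average the diagonal.
    return float(-np.mean(np.diag(log_probs)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    z = rng.normal(size=(8, 16))
    noisy = z + 0.01 * rng.normal(size=z.shape)
    print("matched pairs:   ", simcse_unsup_loss(z, noisy))
    print("mismatched pairs:", simcse_unsup_loss(z, z[::-1]))
```

The loss is low when each embedding is closest to its own second encoding and high when the pairing is scrambled, which is the behaviour the per-epoch average loss should reflect as training progresses.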

Text Similarity

Input Sentences

"chocolates are my favourite items.",
"The fish dreamed of escaping the fishbowl and into the toilet where he saw his friend go.",
"The person box was packed with jelly many dozens of months later.",
"white chocolates and dark chocolates are favourites for many people.",
"I love chocolates.",
"Let me help you.",
"There are some who influenced many.",
"Chips are getting more popular these days.",
"There are tools which help us get our work done.",
"Electric vehicles are worth buying given their mileage on the road.",
"NATO is the most powerful military alliance.",
"Gone are the days when people got worry about their diets."

Sentence to be compared with the other given sentences

"chocolates are my favourite items."

Similarity Scores

0.84871584, 0.8847387, 0.874104, 0.96159446, 0.87748206,
0.88612396, 0.9087229, 0.86401033, 0.90140533, 0.8532164,
0.8539922

Analysis

The scores above show that the 1st sentence is more similar to the 5th sentence (highest score, 0.96159446) than to any of the remaining 10 input sentences.
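
The comparison above can be sketched as follows. The README does not state which similarity measure main.py uses, so this sketch assumes cosine similarity, the measure used in the SimCSE paper. The `cosine_scores` helper and the toy vectors are hypothetical. The last part reuses the scores listed above and assumes they are ordered from the 2nd to the 12th input sentence.

```python
import numpy as np


def cosine_scores(query, candidates):
    """Cosine similarity between one query vector and each candidate row."""
    query = np.asarray(query, dtype=np.float64)
    candidates = np.asarray(candidates, dtype=np.float64)
    query = query / np.linalg.norm(query)
    candidates = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    return candidates @ query


# Toy 3-dimensional "embeddings", invented for illustration only.
toy_query = [1.0, 0.0, 0.0]
toy_candidates = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [2.0, 0.1, 0.0],  # nearly parallel to the query
    [1.0, 1.0, 0.0],  # 45 degrees from the query
]
toy_scores = cosine_scores(toy_query, toy_candidates)
toy_best = int(np.argmax(toy_scores))

# Scores copied from this README, assumed to cover sentences 2 to 12 in order.
readme_scores = np.array([
    0.84871584, 0.8847387, 0.874104, 0.96159446, 0.87748206,
    0.88612396, 0.9087229, 0.86401033, 0.90140533, 0.8532164,
    0.8539922,
])
# +2 converts a 0-based index over sentences 2..12 into a 1-based sentence number.
most_similar_sentence = int(np.argmax(readme_scores)) + 2

if __name__ == "__main__":
    print("toy best candidate index:", toy_best)
    print("most similar README sentence:", most_similar_sentence)
```

Under that ordering assumption, the highest score falls on the 5th input sentence, "I love chocolates.", which agrees with the analysis.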
