ℹ️ About | 📖 More About CoRNStack | 🚀 Quick Start | 👀 Running Evaluation
- 🌽 CoRNStack is a large-scale, high-quality dataset of (text, code) pairs for training and fine-tuning embedding models and re-rankers for code retrieval via contrastive learning.
- We train CodeRankEmbed, a 137M-parameter bi-encoder, on 🌽 CoRNStack and demonstrate substantial gains over current state-of-the-art code embedding models across a variety of code retrieval benchmarks (a usage sketch follows this list).
- By leveraging 🌽 CoRNStack, we are the first to fine-tune LLMs as code re-rankers. CodeRankLLM, our 7B code re-ranker, considerably improves performance over the retriever alone.
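For a quick sense of how the bi-encoder is used at retrieval time, here is a minimal sketch with `sentence-transformers`. The Hugging Face model ID and the `trust_remote_code=True` flag below are assumptions for illustration; check the model card for the exact ID and any required query prefix.

```python
# Minimal bi-encoder retrieval sketch. Assumptions: the model ID and
# trust_remote_code=True are illustrative, not specified in this README.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("nomic-ai/CodeRankEmbed", trust_remote_code=True)

query = "how to reverse a linked list"
candidates = [
    "def reverse(head):\n    prev = None\n    while head:\n        head.next, prev, head = prev, head, head.next\n    return prev",
    "def add(a, b):\n    return a + b",
]

# Embed the query and code snippets independently, then rank by cosine similarity.
q_emb = model.encode([query])
c_embs = model.encode(candidates)
scores = util.cos_sim(q_emb, c_embs)  # shape: (1, num_candidates)
best = scores.argmax().item()
print(f"best match (score {scores[0, best]:.3f}):\n{candidates[best]}")
```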
The performance of code embedding models is highly contingent on the quality of the large-scale data used for contrastive training. Effective contrastive training hinges on satisfying two primary conditions (see the loss sketch after this list):
- The positives are highly relevant to the query and not noisy.
- The negatives are semantically similar to the positives but do not directly address the query, a.k.a. hard negatives.
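To make these two conditions concrete, here is a minimal sketch of an InfoNCE-style contrastive loss in PyTorch, where each query has one clean positive and one mined hard negative, and the other in-batch positives also act as negatives. This is our illustration of the general technique, not the project's actual training code:

```python
import torch
import torch.nn.functional as F

def info_nce_loss(q, pos, hard_neg, temperature=0.05):
    """InfoNCE with in-batch negatives plus one mined hard negative per query.

    q, pos, hard_neg: (batch, dim) L2-normalized embeddings.
    """
    # Each query is scored against every positive in the batch (in-batch
    # negatives) and every mined hard negative.
    logits = torch.cat([q @ pos.T, q @ hard_neg.T], dim=1) / temperature
    # The correct (query, positive) pair sits on the diagonal of the first block.
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)

# Toy usage with random, normalized embeddings.
b, d = 4, 128
q = F.normalize(torch.randn(b, d), dim=-1)
pos = F.normalize(torch.randn(b, d), dim=-1)
neg = F.normalize(torch.randn(b, d), dim=-1)
print(info_nce_loss(q, pos, neg))
```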
Existing approaches heuristically source contrastive examples from large-scale open-source code data with limited filtering and mining. As a result, they retain irrelevant or incorrectly labeled <query, positive> pairs, which impairs the models’ ability to learn robust and accurate representations. To address these challenges, we introduce curriculum-based hard negative mining and consistency filtering techniques, and we apply them to the de-duplicated version of The Stack v2. More details on these curation techniques, and on how we use them to train embedding models and re-rankers, will appear in our paper, coming soon!
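As a sketch of what consistency filtering can look like in practice (our reading of the general technique, not the exact curation code): score each (query, positive) pair with an existing embedding model and keep the pair only if the labeled positive ranks within the top-k of a candidate pool.

```python
import numpy as np

def consistency_filter(query_embs, pos_embs, top_k=2):
    """Keep pair i only if positive i ranks within top_k candidates for query i,
    as scored by an existing (already trained) embedding model.

    query_embs, pos_embs: (n, dim) L2-normalized embeddings; here all positives
    double as the candidate pool.
    """
    sims = query_embs @ pos_embs.T  # (n, n) cosine similarities
    # Rank of the labeled positive = number of candidates scoring strictly higher.
    ranks = (sims > sims.diagonal()[:, None]).sum(axis=1)
    return ranks < top_k  # boolean keep-mask over the pairs

# Toy usage with random, normalized embeddings.
rng = np.random.default_rng(0)
q = rng.normal(size=(5, 16)); q /= np.linalg.norm(q, axis=1, keepdims=True)
p = rng.normal(size=(5, 16)); p /= np.linalg.norm(p, axis=1, keepdims=True)
print(consistency_filter(q, p))
```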
Install the required dependencies:
```bash
pip install -r requirements.txt
```
To reproduce the performance of CodeRankEmbed on popular code retrieval benchmarks, run the following commands:
CoIR:
```bash
cd src/
python evaluation/eval_coir.py
```
CSN (CodeSearchNet):
```bash
cd src/
python create/csn.py
python evaluation/eval_csn.py
```
SWE-bench:
```bash
cd src/
python create/swebench.py
python evaluation/eval_swebench.py
python evaluation/eval_localization.py --level file      # print file localization top-k results
python evaluation/eval_localization.py --level function  # print function localization top-k results
```
We plan to release the full training and dataset curation code soon!