Contributions are welcome! If you have any resources, tools, papers, or insights related to Code LLMs, feel free to submit a pull request. Let's work together to make this project better!
- [2024-11-12] The Qwen2.5-Coder series is released, offering six model sizes (0.5B, 1.5B, 3B, 7B, 14B, 32B), with Qwen2.5-Coder-32B-Instruct now the most powerful open-source code model.
- [2024-11-08] OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models is released.
## Table of Contents

- Top Code LLMs
- Evaluation Toolkit
- Awesome Code LLMs Leaderboard
- Awesome Code LLMs Papers
- Contributors
- Cite as
- Acknowledgement
- Star History
## Top Code LLMs
Rank | Model | Params | HumanEval | MBPP | Source |
---|---|---|---|---|---|
1 | o1-mini-2024-09-12 | - | 97.6 | 93.9 | paper |
2 | o1-preview-2024-09-12 | - | 95.1 | 93.4 | paper |
3 | Qwen2.5-Coder-32B-Instruct | 32B | 92.7 | 90.2 | github |
4 | Claude-3.5-Sonnet-20241022 | - | 92.1 | 91.0 | paper |
5 | GPT-4o-2024-08-06 | - | 92.1 | 86.8 | paper |
6 | Qwen2.5-Coder-14B-Instruct | 14B | 89.6 | 86.2 | github |
7 | Claude-3.5-Sonnet-20240620 | - | 89.0 | 87.6 | paper |
8 | GPT-4o-mini-2024-07-18 | - | 87.8 | 86.0 | paper |
9 | Qwen2.5-Coder-7B-Instruct | 7B | 88.4 | 83.5 | github |
10 | DS-Coder-V2-Instruct | 21/236B | 85.4 | 89.4 | github |
11 | Qwen2.5-Coder-3B-Instruct | 3B | 84.1 | 73.6 | github |
12 | DS-Coder-V2-Lite-Instruct | 2.4/16B | 81.1 | 82.8 | github |
13 | CodeQwen1.5-7B-Chat | 7B | 83.5 | 70.6 | github |
14 | DeepSeek-Coder-33B-Instruct | 33B | 79.3 | 70.0 | github |
15 | DeepSeek-Coder-6.7B-Instruct | 6.7B | 78.6 | 65.4 | github |
16 | GPT-3.5-Turbo | - | 76.2 | 70.8 | github |
17 | CodeLlama-70B-Instruct | 70B | 72.0 | 77.8 | paper |
18 | Qwen2.5-Coder-1.5B-Instruct | 1.5B | 70.7 | 69.2 | github |
19 | StarCoder2-15B-Instruct-v0.1 | 15B | 67.7 | 78.0 | paper |
20 | Qwen2.5-Coder-0.5B-Instruct | 0.5B | 61.6 | 52.4 | github |
21 | Pangu-Coder2 | 15B | 61.6 | - | paper |
22 | WizardCoder-15B | 15B | 57.3 | 51.8 | paper |
23 | CodeQwen1.5-7B | 7B | 51.8 | 61.8 | github |
24 | CodeLlama-34B-Instruct | 34B | 48.2 | 61.1 | paper |
25 | Code-Davinci-002 | - | 47.0 | - | paper |
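The HumanEval and MBPP columns report execution-based scores (typically pass@1): the model completes a function from its signature and docstring, and a sample counts as solved only if the benchmark's hidden unit tests pass when the code is run. A minimal sketch of that check follows; the problem, completion, and tests are made up for illustration and are not actual benchmark items.

```python
# Illustrative only: a HumanEval/MBPP-style task pairs a prompt
# (signature + docstring) with hidden unit tests; the model supplies the body.
prompt = '''
def running_max(xs):
    """Return a list where element i is the max of xs[:i+1]."""
'''

completion = '''
    out, cur = [], float("-inf")
    for x in xs:
        cur = max(cur, x)
        out.append(cur)
    return out
'''

tests = '''
assert running_max([1, 3, 2, 5]) == [1, 3, 3, 5]
assert running_max([-2, -5]) == [-2, -2]
'''

namespace = {}
exec(prompt + completion, namespace)  # real harnesses sandbox this with timeouts
exec(tests, namespace)                # any exception counts as a failed sample
print("sample passed")
```

Real evaluation harnesses run generated code in an isolated sandbox with timeouts, since model output is untrusted.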
## Evaluation Toolkit
- bigcode-evaluation-harness: A framework for the evaluation of autoregressive code generation language models.
- code-eval: A framework for the evaluation of autoregressive code generation language models on HumanEval.
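Both toolkits above execute generated programs against the benchmark's unit tests and typically report the unbiased pass@k estimator from the HumanEval paper (Chen et al., 2021): with n samples per problem of which c pass, pass@k = 1 - C(n-c, k) / C(n, k), averaged over problems. A self-contained sketch (the pass_at_k helper and the example counts are illustrative, not taken from either toolkit's API):

```python
# Unbiased pass@k estimator (Chen et al., 2021): with n samples per problem
# and c of them passing the unit tests, pass@k = 1 - C(n-c, k) / C(n, k).
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples drawn from n is correct."""
    if n - c < k:  # fewer than k incorrect samples: a correct one is always drawn
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical example: 20 samples generated for one problem, 5 pass the tests.
print(f"pass@1  = {pass_at_k(20, 5, 1):.3f}")   # 0.250
print(f"pass@10 = {pass_at_k(20, 5, 10):.3f}")  # ~0.984
```

The leaderboard numbers above are usually pass@1, often with greedy decoding, in which case n is small and the estimator reduces to the fraction of problems solved on the first attempt.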
## Awesome Code LLMs Leaderboard
Leaderboard | Description |
---|---|
Evalperf Leaderboard | Evaluating LLMs for Efficient Code Generation. |
Aider Code Editing Leaderboard | Measures an LLM's coding ability and whether it can write new code that integrates into existing code. |
BigCodeBench Leaderboard | BigCodeBench evaluates LLMs with practical and challenging programming tasks. |
LiveCodeBench Leaderboard | Holistic and Contamination Free Evaluation of Large Language Models for Code. |
Big Code Models Leaderboard | Compare performance of base multilingual code generation models on HumanEval benchmark and MultiPL-E. |
BIRD Leaderboard | BIRD contains over 12,751 unique question-SQL pairs and 95 big databases with a total size of 33.4 GB, covering more than 37 professional domains such as blockchain, hockey, healthcare, and education. |
CanAiCode Leaderboard | Results of the CanAiCode self-evaluating interview benchmark for AI coding models. |
Coding LLMs Leaderboard | Coding LLMs Leaderboard |
CRUXEval Leaderboard | CRUXEval is a benchmark complementary to HumanEval and MBPP measuring code reasoning, understanding, and execution capabilities! |
EvalPlus Leaderboard | EvalPlus evaluates AI Coders with rigorous tests. |
InfiBench Leaderboard | InfiBench is a comprehensive benchmark for code large language models evaluating model ability on answering freeform real-world questions in the code domain. |
InterCode Leaderboard | InterCode is a benchmark for evaluating language models on the interactive coding task. Given a natural language request, an agent is asked to interact with a software system (e.g., database, terminal) with code to resolve the issue. |
Program Synthesis Models Leaderboard | Ranks open-source code models by capability and market adoption, with an intuitive leadership-quadrant graph to help researchers identify the best open-source model. |
Spider Leaderboard | Spider is a large-scale complex and cross-domain semantic parsing and text-to-SQL dataset annotated by 11 Yale students. The goal of the Spider challenge is to develop natural language interfaces to cross-domain databases. |
## Awesome Code LLMs Papers
Title | Venue | Date | Code | Resources |
---|---|---|---|---|
Magicoder: Source Code Is All You Need | ICML'24 | 2023.12 | Github | HF |
OctoPack: Instruction Tuning Code Large Language Models | ICLR'24 | 2023.08 | Github | HF |
WizardCoder: Empowering Code Large Language Models with Evol-Instruct | Preprint | 2023.07 | Github | HF |
Code Alpaca: An Instruction-following LLaMA Model trained on code generation instructions | Preprint | 2023.xx | Github | HF |

Title | Venue | Date | Code | Resources |
---|---|---|---|---|
ProSec: Fortifying Code LLMs with Proactive Security Alignment | Preprint | 2024.11 | - | - |
PLUM: Preference Learning Plus Test Cases Yields Better Code Language Models | Preprint | 2024.06 | - | - |
PanGu-Coder2: Boosting Large Language Models for Code with Ranking Feedback | Preprint | 2023.07 | - | - |
RLTF: Reinforcement Learning from Unit Test Feedback | Preprint | 2023.07 | Github | - |
Execution-based Code Generation using Deep Reinforcement Learning | TMLR'23 | 2023.01 | Github | - |
CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning | NeurIPS'22 | 2022.07 | Github | - |

Title | Venue | Date | Code | Resources |
---|---|---|---|---|
From Code to Correctness: Closing the Last Mile of Code Generation with Hierarchical Debugging | Preprint | 2024.10 | Github | - |
Debug like a Human: A Large Language Model Debugger via Verifying Runtime Execution Step-by-step | ACL'24 | 2024.02 | Github | - |
SelfEvolve: A Code Evolution Framework via Large Language Models | Preprint | 2023.06 | - | - |
Demystifying GPT Self-Repair for Code Generation | ICLR'24 | 2023.06 | Github | - |
Teaching Large Language Models to Self-Debug | ICLR'24 | 2023.06 | - | - |
LEVER: Learning to Verify Language-to-Code Generation with Execution | ICML'23 | 2023.02 | Github | - |
Coder Reviewer Reranking for Code Generation | ICML'23 | 2022.11 | Github | - |
CodeT: Code Generation with Generated Tests | ICLR'23 | 2022.07 | Github | - |
## Contributors
This is an active repository and your contributions are always welcome! If you have any questions about this opinionated list, do not hesitate to contact me at huybery@gmail.com.
## Cite as
@software{awesome-code-llm,
  author = {Binyuan Hui and Lei Zhang},
  title = {An awesome and curated list of best code-LLM for research},
  howpublished = {\url{https://github.com/huybery/Awesome-Code-LLM}},
  year = {2023},
}
## Acknowledgement
This project is inspired by Awesome-LLM.