Does Few-Shot Learning Help LLM Performance in Code Synthesis? #960

Open
1 task
ShellLM opened this issue Dec 18, 2024 · 1 comment
Labels

• AI-Agents: Autonomous AI agents using LLMs
• code-generation: code generation models and tools like copilot and aider
• llm: Large Language Models
• MachineLearning: ML Models, Training and Inference
• New-Label: Choose this option if the existing labels are insufficient to describe the content accurately
• Papers: Research papers
• prompt-engineering: Developing and optimizing prompts to efficiently use language models for various applications and re
• Research: personal research notes for a topic
• software-engineering: Best practice for software engineering

Comments


ShellLM commented Dec 18, 2024

Does Few-Shot Learning Help LLM Performance in Code Synthesis?

Derek Xu¹*, Tong Xie¹*, Botao Xia¹*, Haoyu Li²*
Yunsheng Bai³, Yizhou Sun¹, Wei Wang¹

¹University of California Los Angeles
²University of Illinois Urbana-Champaign
³Nvidia

*Equal Contribution

Abstract

Large language models (LLMs) have made significant strides in code generation through improved model design, training, and chain-of-thought prompting. However, prompt-level optimization remains an important yet under-explored aspect of LLMs for coding. This work focuses on the few-shot examples present in most code generation prompts, offering a systematic study of whether few-shot examples improve LLMs' coding capabilities, which few-shot examples have the largest impact, and how to select impactful examples. Our work offers two approaches for selecting few-shot examples: a model-free method, CodeExemplar-Free, and a model-based method, CodeExemplar-Base. The two methods trade off performance against interpretability and reliance on training data. Both methods significantly improve CodeLlama's coding ability on the popular HumanEval+ coding benchmark. In summary, our work provides valuable insights into how to pick few-shot examples in code generation prompts to improve LLM code generation capabilities.

1 Introduction

Recently, large language models (LLMs) (Radford et al., 2019) have demonstrated impressive capabilities outside of natural language processing, including in mathematics (Touvron et al., 2023; Luo et al., 2023), time series forecasting (Ansari et al., 2024; Woo et al., 2024; Das et al., 2023), tabular data understanding (Hegselmann et al., 2023; Hollmann et al., 2022; Xu et al., 2024), and multi-modal understanding (Liu et al., 2024a). Among these capabilities, LLMs' application to software engineering is particularly exciting. In just a few months, LLMs have exhibited zero-shot abilities in code completion (Peng et al., 2023), code generation (Roziere et al., 2023; Guo et al., 2024; Dubey et al., 2024), test case generation (Vikram et al., 2023; Lemieux et al., 2023; Schäfer et al., 2023), debugger interaction (Islam et al., 2024; Ho et al., 2024), and repository-level generation (Shrivastava et al., 2023; Bui et al., 2024; Wang et al., 2024a).

In this work, we study the important task of code generation (Roziere et al., 2023), where an LLM agent generates code described by a prompt consisting of a natural language function description and few-shot input-output examples. Current LLMs owe their code generation capabilities to improved model design (Guo et al., 2024), training (Roziere et al., 2023), and chain-of-thought prompting (Li et al., 2023). Our work aims to enhance existing LLMs for code generation by improving the prompt itself, which remains an under-explored research problem. In fact, existing techniques are evaluated on predefined prompt templates with little to no modification (Austin et al., 2021; Liu et al., 2024b).

To improve the prompt itself, we break the prompt template down into two components: (1) a natural language description, providing a high-level description of the code, and (2) few-shot input-output examples, describing function inputs and outputs to disambiguate the natural language description. For example, the description "return a list with elements incremented by 1" can be disambiguated by the example "incr_list([1,2,3]) == [2,3,4]", which shows that each numerical element of the list should be incremented by 1.
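To make the two components concrete, here is a minimal sketch of how such a prompt might look for the incr_list example above, in the HumanEval+ style of a function signature, a docstring description, and doctest-like few-shot examples. The exact template and the added empty-list example are illustrative assumptions, not the benchmark's verbatim text:

```python
# Illustrative sketch of a code generation prompt: a natural language
# description plus few-shot input-output examples that disambiguate it.
# The wording and the extra example are assumptions for illustration.
PROMPT = '''\
def incr_list(l: list) -> list:
    """Return a list with elements incremented by 1.

    >>> incr_list([1, 2, 3])
    [2, 3, 4]
    >>> incr_list([])
    []
    """
'''

# The LLM is asked to complete the function body given this prompt.
print(PROMPT)
```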

Inspired by existing work (Liu et al., 2021, 2024c) showing that an LLM's in-context learning (ICL) ability is shaped by which examples are included in the ICL prompt, we hypothesize that an LLM's coding ability is likewise shaped by which few-shot examples are included in the code generation prompt [1]. We confirm our hypothesis on several language models (T5-Small, T5-Base, Mistral, and CodeLlama) on the HumanEval+ benchmark. Furthermore, we analyze which examples contribute most to LLMs' coding capabilities.

Given that few-shot examples greatly affect LLM coding capability, we provide two methods for selecting few-shot examples: (1) a model-free algorithm, CodeExemplar-Free, that picks examples based on an input metric, and (2) a model-based algorithm, CodeExemplar-Base, that picks examples based on a bootstrapped training dataset. The former offers an interpretable, data-agnostic algorithm, while the latter offers a better-performing, data-driven model. Both approaches support arbitrary token cost constraints and substantially improve CodeLlama's coding capabilities on the HumanEval+ benchmark under fixed token budgets. We summarize our contributions:

• We demonstrate that the choice of few-shot examples in the LLM prompt has a significant effect on LLM coding capabilities across five different LLMs.
• We propose an interpretable, model-free algorithm, requiring no training, that improves LLM code generation ability by only modifying the input prompt.
• We propose a data-driven neural network, trained on a dataset of code generation prompts, that improves LLM code generation by only modifying the input prompt. Both algorithms are gray-box: they do not require access to the model's weights, only the logits for a given input (as sketched below).
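To illustrate what gray-box access means in practice, the following sketch scores a candidate completion using only the model's output logits via Hugging Face transformers. The checkpoint name and the average-log-probability scoring rule are assumptions for illustration; this is not the paper's CodeExemplar method:

```python
# Minimal sketch of gray-box access: only logits are read from the model;
# weights and gradients are never inspected. Checkpoint and scoring rule
# are illustrative assumptions, not the paper's method.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "codellama/CodeLlama-7b-hf"  # assumed checkpoint for illustration
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.float16, device_map="auto"
)

def completion_logprob(prompt: str, completion: str) -> float:
    """Average log-probability the model assigns to `completion` given `prompt`."""
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(prompt + completion, return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        logits = model(full_ids).logits  # gray-box: logits only
    # The token at position j is predicted by the logits at position j - 1.
    # (Splitting at prompt_len is approximate; tokenization may merge the boundary.)
    log_probs = torch.log_softmax(logits[0, prompt_len - 1:-1].float(), dim=-1)
    completion_ids = full_ids[0, prompt_len:]
    token_lp = log_probs[torch.arange(completion_ids.shape[0]), completion_ids]
    return token_lp.mean().item()

# Usage sketch (assumes PROMPT from the earlier snippet):
# score = completion_logprob(PROMPT, "    return [x + 1 for x in l]\n")
```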

[1] Unlike in-context learning (ICL), code synthesis prompts contain few-shot examples that are not in the same input-output space as the task itself. Hence, existing ICL techniques for example selection (Liu et al., 2021) do not work in our setting.
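As a rough illustration of selecting few-shot examples under a token budget, a greedy loop might look like the sketch below. The scoring function and token-cost estimate are hypothetical stand-ins for illustration, not the actual CodeExemplar-Free metric:

```python
# Hypothetical greedy selection of few-shot examples under a token budget.
# The scoring and token-counting functions are stand-ins for illustration
# only; they are NOT the metric used by CodeExemplar-Free.
from typing import Callable, List, Tuple

Example = Tuple[str, str]  # (function call, expected output)

def select_examples(candidates: List[Example],
                    score: Callable[[Example], float],
                    token_len: Callable[[Example], int],
                    budget: int) -> List[Example]:
    """Greedily keep high-scoring examples whose total token cost fits the budget."""
    chosen, used = [], 0
    for ex in sorted(candidates, key=score, reverse=True):
        cost = token_len(ex)
        if used + cost <= budget:
            chosen.append(ex)
            used += cost
    return chosen

# Usage sketch with a toy "character diversity" score and a whitespace token count.
candidates = [("incr_list([1, 2, 3])", "[2, 3, 4]"),
              ("incr_list([])", "[]"),
              ("incr_list([-1, 0])", "[0, 1]")]
picked = select_examples(candidates,
                         score=lambda ex: len(set(ex[0] + ex[1])),
                         token_len=lambda ex: len((ex[0] + " == " + ex[1]).split()),
                         budget=12)
print(picked)
```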

Suggested labels

{'label-name': 'few-shot-learning', 'label-description': "Study of optimizing large language models' coding capabilities using few-shot examples in prompts.", 'gh-repo': 'Papers', 'confidence': 50.0}

ShellLM added the AI-Agents, code-generation, llm, MachineLearning, New-Label, Papers, prompt-engineering, Research, and software-engineering labels on Dec 18, 2024

ShellLM commented Dec 18, 2024

Related content

#706 similarity score: 0.9
#551 similarity score: 0.9
#681 similarity score: 0.88
#750 similarity score: 0.87
#734 similarity score: 0.87
#332 similarity score: 0.87
