Does Few-Shot Learning Help LLM Performance in Code Synthesis? #960
Labels
AI-Agents
Autonomous AI agents using LLMs
code-generation
code generation models and tools like copilot and aider
llm
Large Language Models
MachineLearning
ML Models, Training and Inference
New-Label
Choose this option if the existing labels are insufficient to describe the content accurately
Papers
Research papers
prompt-engineering
Developing and optimizing prompts to efficiently use language models for various applications and re
Research
personal research notes for a topic
software-engineering
Best practice for software engineering
Does Few-Shot Learning Help LLM Performance in Code Synthesis?
Derek Xu1*, Tong Xie1*, Botao Xia1*, Haoyu Li2*
Yunsheng Bai3, Yizhou Sun1, Wei Wang1
1University of California Los Angeles
2University of Illinois Urbana-Champaign
3Nvidia
Abstract
Large language models (LLMs) have made significant strides in code generation through improved model design, training, and chain-of-thought prompting. However, prompt-level optimizations remain an important yet under-explored aspect of LLMs for coding. This work focuses on the few-shot examples present in most code generation prompts, offering a systematic study of whether few-shot examples improve an LLM's coding capabilities, which few-shot examples have the largest impact, and how to select impactful examples. We offer two approaches for selecting few-shot examples: a model-free method, CodeExemplar-Free, and a model-based method, CodeExemplar-Base. The two methods offer a trade-off between improved performance on one hand and interpretability and independence from training data on the other. Both methods significantly improve CodeLlama's coding ability on the popular HumanEval+ coding benchmark. In summary, our work provides valuable insights into how to pick few-shot examples in code generation prompts to improve LLM code generation capabilities.
1 Introduction
Recently, large language models (LLMs) (Radford et al., 2019) have demonstrated impressive capabilities outside of natural language processing, including in mathematics (Touvron et al., 2023; Luo et al., 2023), time series forecasting (Ansari et al., 2024; Woo et al., 2024; Das et al., 2023), tabular data understanding (Hegselmann et al., 2023; Hollmann et al., 2022; Xu et al., 2024), and multi-modal understanding (Liu et al., 2024a). Among these capabilities, the application of LLMs to software engineering is particularly exciting. Within just a few months, LLMs have exhibited zero-shot abilities in code completion (Peng et al., 2023), code generation (Roziere et al., 2023; Guo et al., 2024; Dubey et al., 2024), test case generation (Vikram et al., 2023; Lemieux et al., 2023; Schäfer et al., 2023), debugger interaction (Islam et al., 2024; Ho et al., 2024), and repository-level generation (Shrivastava et al., 2023; Bui et al., 2024; Wang et al., 2024a).
In this work, we study the important task of code generation (Roziere et al., 2023), in which an LLM agent generates code described by a prompt consisting of a natural language function description and few-shot input-output examples. Current LLMs achieve their code generation capabilities through improved model design (Guo et al., 2024), training (Roziere et al., 2023), and chain-of-thought prompting (Li et al., 2023). Our work aims to enhance existing LLMs for code generation by improving the prompt itself, which remains an under-explored research problem. In fact, existing techniques are evaluated on predefined prompt templates with little to no modification (Austin et al., 2021; Liu et al., 2024b).
To improve the prompt itself, we break the prompt template down into two components: (1) a natural language description, providing a high-level description of the code, and (2) few-shot input-output examples, which disambiguate the natural language description by showing concrete function inputs and outputs. For example, the description "return a list with elements incremented by 1" can be disambiguated by the example "incr_list([1,2,3]) == [2,3,4]", which shows that each numerical element of the list should be incremented by 1.
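To make the decomposition concrete, the minimal sketch below assembles a HumanEval-style prompt from the two components; the helper name and exact formatting are illustrative assumptions, not the paper's actual template.

```python
def build_prompt(signature: str, description: str, examples: list[tuple[str, str]]) -> str:
    """Assemble a HumanEval-style prompt from a function signature, a natural
    language description, and few-shot input-output examples."""
    doc_lines = [description] + [f">>> {call}\n    {result}" for call, result in examples]
    docstring = "\n    ".join(doc_lines)
    return f'{signature}\n    """{docstring}\n    """\n'


prompt = build_prompt(
    "def incr_list(l: list):",
    "Return a list with elements incremented by 1.",
    [("incr_list([1, 2, 3])", "[2, 3, 4]"),
     ("incr_list([5, 3, 5, 2])", "[6, 4, 6, 3]")],
)
print(prompt)
```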
Inspired by existing work (Liu et al., 2021, 2024c) showing that an LLM's in-context learning (ICL) ability is shaped by which examples are included in the ICL prompt, we hypothesize that an LLM's coding ability is likewise shaped by which few-shot examples are included in the code generation prompt.[1] We confirm this hypothesis on several language models (T5-Small, T5-Base, Mistral, and CodeLlama) on the HumanEval+ benchmark. Furthermore, we analyze which examples contribute most to an LLM's coding capabilities.
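One way such sensitivity can be measured (a sketch only, reusing the `build_prompt` helper above; `generate` and `passes_tests` are hypothetical stand-ins for an LLM call and a HumanEval+ test harness) is to compare pass rates across different few-shot example subsets:

```python
import itertools

def pass_rate_per_subset(task, candidate_examples, k, generate, passes_tests, trials=10):
    """Estimate how coding accuracy varies with the choice of k few-shot examples.
    A large spread across subsets indicates strong sensitivity to example choice."""
    rates = {}
    for subset in itertools.combinations(candidate_examples, k):
        prompt = build_prompt(task["signature"], task["description"], list(subset))
        completions = [generate(prompt) for _ in range(trials)]
        rates[subset] = sum(passes_tests(task, c) for c in completions) / trials
    return rates
```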
Given that few-shot examples greatly affect LLM coding capability, we provide two methods for selecting them: (1) a model-free algorithm, CodeExemplar-Free, which picks examples based on an input metric, and (2) a model-based algorithm, CodeExemplar-Base, which picks examples based on a bootstrapped training dataset (a minimal sketch of the model-free selection idea appears after the contribution list below). The former offers an interpretable, data-agnostic algorithm; the latter offers a better-performing, data-driven model. Both approaches support arbitrary token cost constraints and substantially improve CodeLlama's coding capabilities on the HumanEval+ benchmark under fixed token budgets. We summarize our contributions:
• We demonstrate that the choice of few-shot examples in the LLM prompt has a significant effect on LLM coding capabilities across five different LLMs.
• We propose an interpretable model-free algorithm, requiring no training, that improves LLM code generation ability by only modifying the input prompt.
• We propose a data-driven neural network, trained on a dataset of code generation prompts, that improves LLM code generation by only modifying the input prompt. Both algorithms are gray-box: they do not require access to the model's weights, only its output logits for a given input.
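As a rough illustration of the model-free idea (not the paper's actual metric), few-shot examples can be greedily selected under a token budget by ranking them with some scoring function; the `score` and `count_tokens` hooks below are assumptions for the sketch.

```python
def select_examples_model_free(examples, score, count_tokens, token_budget):
    """Training-free greedy selection of few-shot examples under a token budget.
    `score` stands in for whatever input metric ranks example usefulness, and
    `count_tokens` stands in for the target LLM's tokenizer."""
    chosen, used = [], 0
    for ex in sorted(examples, key=score, reverse=True):
        cost = count_tokens(ex)
        if used + cost <= token_budget:
            chosen.append(ex)
            used += cost
    return chosen

# Toy usage with stand-in metric and tokenizer:
examples = [
    "incr_list([1, 2, 3]) == [2, 3, 4]",
    "incr_list([]) == []",
    "incr_list([5, 3, 5, 2, 3]) == [6, 4, 6, 3, 4]",
]
picked = select_examples_model_free(
    examples,
    score=len,                               # toy metric: prefer longer, more informative examples
    count_tokens=lambda e: len(e.split()),   # crude whitespace "tokenizer"
    token_budget=15,
)
```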
[1] Unlike standard in-context learning (ICL), where demonstrations share the task's input-output space, the few-shot examples in code synthesis prompts do not. Hence, existing ICL example-selection techniques (Liu et al., 2021) do not apply in our setting.
Suggested labels
{'label-name': 'few-shot-learning', 'label-description': "Study of optimizing large language models' coding capabilities using few-shot examples in prompts.", 'gh-repo': 'Papers', 'confidence': 50.0}