self-speculative-decoding/README.md at main · dilab-zju/self-speculative-decoding #680

irthomasthomas · 2024-03-04T12:39:16Z


Self-Speculative Decoding

Code associated with the paper:

Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding

Self-Speculative Decoding is a novel inference scheme for accelerating Large Language Models (LLMs) without additional neural network training or extra memory footprint. It maintains consistent output quality and preserves model compatibility, making it a plug-and-play, cost-effective solution for LLM inference acceleration.

Self-Speculative Decoding involves a two-stage process:

Drafting stage: Generates draft tokens by selectively skipping certain intermediate layers.

Verification stage: Employs the original LLM to validate draft tokens in one forward pass.
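The two stages can be illustrated with a minimal sketch under greedy decoding. Everything here is an assumption made for illustration and is not the repository's code: the toy `LAYERS`, the `next_token` helper, and the `draft_len` parameter are hypothetical, and the verification loop stands in for what a real LLM would do in a single batched forward pass.

```python
from typing import AbstractSet, Callable, List, Sequence

VOCAB_SIZE = 11

# Hypothetical toy "layers": each maps an integer hidden state to a new one.
LAYERS: List[Callable[[int], int]] = [
    lambda h: h + 3,
    lambda h: h * 2,
    lambda h: h + (1 if h % 4 == 0 else 0),
    lambda h: h + 5,
]


def next_token(context: Sequence[int], skip: AbstractSet[int] = frozenset()) -> int:
    """Greedy next token; layers whose index is in `skip` are bypassed."""
    h = sum(context) + len(context)
    for i, layer in enumerate(LAYERS):
        if i not in skip:
            h = layer(h)
    return h % VOCAB_SIZE


def greedy_decode(prompt: Sequence[int], max_new_tokens: int) -> List[int]:
    """Baseline: ordinary token-by-token decoding with the full model."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(next_token(tokens))
    return tokens


def self_speculative_decode(prompt: Sequence[int], max_new_tokens: int,
                            skip: AbstractSet[int], draft_len: int = 4) -> List[int]:
    tokens = list(prompt)
    target_len = len(tokens) + max_new_tokens
    while len(tokens) < target_len:
        # Drafting stage: generate tokens cheaply with some layers skipped.
        draft: List[int] = []
        for _ in range(min(draft_len, target_len - len(tokens))):
            draft.append(next_token(tokens + draft, skip))
        # Verification stage: check each draft token against the full model.
        for i, drafted in enumerate(draft):
            verified = next_token(tokens + draft[:i])
            if verified != drafted:
                # Keep the accepted prefix and the full model's correction.
                draft = draft[:i] + [verified]
                break
        tokens.extend(draft)
    return tokens


prompt = [1, 2, 3]
print(self_speculative_decode(prompt, 20, skip={2}) == greedy_decode(prompt, 20))
```

Because every kept token is either confirmed or replaced by the full model, the output matches plain greedy decoding by construction. In this sketch the choice of skipped layers only affects how many draft tokens are accepted per round.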

Cite Our Paper

If you find this code and paper useful in your research, please consider citing:

@article{zhang2023draft,
      title={Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding}, 
      author={Jun Zhang and Jue Wang and Huan Li and Lidan Shou and Ke Chen and Gang Chen and Sharad Mehrotra},
      year={2023},
      eprint={2309.08168},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Requirements

PyTorch
Transformers
NumPy
More in ssd.yml

Files

searching.py: Selection of skipped layers by Bayesian optimization
decoding.py: Core process of self-speculative decoding
modeling_llama.py: Model structure with self-speculative decoding
search.ipynb: Main script that searches for the layers to skip
evaluate_sum.ipynb: Main script that evaluates self-speculative decoding on a text generation task
evaluate_code.ipynb: Main script that evaluates self-speculative decoding on a code generation task
skip_layers.json: Layers skipped by draft models corresponding to different base models
ssd.yml: Relevant environment

Usage

Configure the relevant environment according to ssd.yml;
Execute search.ipynb to get skipped layers to generate a draft model;
Execute evaluate_sum.ipynb to evaluate self-speculative decoding on summarization;
Execute evaluate_code.ipynb to evaluate self-speculative decoding on code generation.


Suggested labels

{'label-name': 'Inference-Scheme', 'label-description': 'Describes a novel approach for accelerating Large Language Models without additional training or memory footprint.', 'confidence': 71.69}


irthomasthomas · 2024-03-04T12:39:18Z

Related issues

#495: Paper page - Accelerating LLM Inference with Staged Speculative Decoding

### Details

Similarity score: 0.89 - [ ] [Paper page - Accelerating LLM Inference with Staged Speculative Decoding](https://huggingface.co/papers/2308.04623)

Paper Page - Accelerating LLM Inference with Staged Speculative Decoding

Published on Aug 9, 2023 | Featured in Daily Papers on Aug 10, 2023

Authors: Benjamin Spector, Chris Re

Abstract

Recent advances with large language models (LLM) have highlighted their diverse capabilities. This paper proposes a novel algorithm, staged speculative decoding, to accelerate LLM inference in small-batch, on-device scenarios. We address the low arithmetic intensity of small-batch inference by improving upon previous work in speculative decoding. The algorithm restructures the speculative batch as a tree, reducing generation costs and increasing the expected tokens per batch. Additionally, it introduces a second stage of speculative decoding, further decreasing single-batch decoding latency by 3.16x with a 762M parameter GPT-2-L model, all while perfectly preserving output quality.

Read the Paper »

Suggested labels

{ "label-name": "Algorithm", "description": "Staged speculative decoding algorithm for LLM inference acceleration", "confidence": 91.15 }

#391: Speculative Decoding in Exllama v2 and llama.cpp comparison : r/LocalLLaMA

### Details

Similarity score: 0.89 - [ ] [Speculative Decoding in Exllama v2 and llama.cpp comparison : r/LocalLLaMA](https://www.reddit.com/r/LocalLLaMA/comments/17h4rqz/speculative_decoding_in_exllama_v2_and_llamacpp/)

Speculative Decoding in Exllama v2 and llama.cpp Comparison

Discussion

We discussed speculative decoding (SD) in a previous thread. For those who are not aware of this feature, it allows LLM loaders to use a smaller "draft" model to help predict tokens for a larger model. In that thread, someone asked for tests of speculative decoding for both Exllama v2 and llama.cpp. Although I generally only run models in GPTQ, AWQ, or exl2 formats, I was interested in doing the exl2 vs. llama.cpp comparison.

Test Setup

The tests were run on a 2x 4090, 13900K, DDR5 system. The screen captures of the terminal output of both are available below. If someone has experience with making llama.cpp speculative decoding work better, please share.

Exllama v2 Results

Model: Xwin-LM-70B-V0.1-4.0bpw-h6-exl2

Draft Model: TinyLlama-1.1B-1T-OpenOrca-GPTQ

Performance can be highly variable, but it goes from ~20 t/s without SD to 40-50 t/s with SD.

No SD

Prompt processed in 0.02 seconds, 4 tokens, 200.61 tokens/second
Response generated in 10.80 seconds, 250 tokens, 23.15 tokens/second

With SD

Prompt processed in 0.03 seconds, 4 tokens, 138.80 tokens/second
Response generated in 5.10 seconds, 250 tokens, 49.05 tokens/second
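As a quick check on the figures above, the speedup implied by the two reported generation rates can be computed directly. This is only arithmetic on the numbers quoted in this post; the function name `speedup` is an illustrative choice.

```python
def speedup(baseline_tps: float, accelerated_tps: float) -> float:
    """Ratio of accelerated throughput to baseline throughput."""
    return accelerated_tps / baseline_tps


# Generation rates copied from the Exllama v2 runs quoted above.
no_sd_tps = 23.15    # tokens/second without speculative decoding
with_sd_tps = 49.05  # tokens/second with speculative decoding

ratio = speedup(no_sd_tps, with_sd_tps)
print(f"Speedup from speculative decoding: {ratio:.2f}x")  # prints 2.12x
```

That is a little over 2x for this single run, consistent with the "~20 t/s to 40-50 t/s" range mentioned, though the post notes that performance is highly variable.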

Suggested labels

{ "key": "speculative-decoding", "value": "Technique for using a smaller 'draft' model to help predict tokens for a larger model" }

#492: speculative decoding in llama.cpp : PoC for speeding-up inference via speculative sampling by ggerganov · Pull Request #2926 · ggerganov/llama.cpp

### Details

Similarity score: 0.88 - [ ] [speculative : PoC for speeding-up inference via speculative sampling by ggerganov · Pull Request #2926 · ggerganov/llama.cpp](https://github.com/ggerganov/llama.cpp/pull/2926)

Title: speculative : PoC for speeding-up inference via speculative sampling #292

Suggested labels

{ "label-name": "LLM-speed-optimization", "description": "Optimizing LLama model inference speed", "confidence": 80.85 }

#383: deepseek-ai/deepseek-coder-5.7bmqa-base · Hugging Face

### Details

Similarity score: 0.88 - [ ] [deepseek-ai/deepseek-coder-5.7bmqa-base · Hugging Face](https://huggingface.co/deepseek-ai/deepseek-coder-5.7bmqa-base)

Deepseek Coder Introduction

Deepseek Coder is a series of code language models, each trained from scratch on 2T tokens with a composition of 87% code and 13% natural language in both English and Chinese. We provide various sizes of the code model, ranging from 1B to 33B versions. Each model is pre-trained on a project-level code corpus with a window size of 16K and an extra fill-in-the-blank task, supporting project-level code completion and infilling. Deepseek Coder achieves state-of-the-art performance among open-source code models on multiple programming languages and various benchmarks.

Key Features

Massive Training Data: Trained from scratch on 2T tokens, including 87% code and 13% linguistic data in both English and Chinese languages.
Highly Flexible & Scalable: Offered in model sizes of 1.3B, 5.7B, 6.7B, and 33B, enabling users to choose the setup most suitable for their requirements.
Superior Model Performance: State-of-the-art performance among publicly available code models on HumanEval, MultiPL-E, MBPP, DS-1000, and APPS benchmarks.
Advanced Code Completion Capabilities: A window size of 16K and a fill-in-the-blank task, supporting project-level code completion and infilling tasks.

Model Summary

deepseek-coder-5.7bmqa-base: A 5.7B parameter model with Multi Query Attention, trained on 2 trillion tokens.
Home Page: DeepSeek
Repository: deepseek-ai/deepseek-coder
Chat With DeepSeek Coder: DeepSeek-Coder

How to Use

This section provides examples of how to use the Deepseek Coder model for code completion, code insertion, and repository-level code completion tasks.

Code Completion

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-coder-5.7bmqa-base", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("deepseek-ai/deepseek-coder-5.7bmqa-base", trust_remote_code=True).cuda()

input_text = "#write a quick sort algorithm"
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_length=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Code Insertion

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-coder-5.7bmqa-base", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("deepseek-ai/deepseek-coder-5.7bmqa-base", trust_remote_code=True).cuda()

input_text = """<｜fim▁begin｜>def quick_sort(arr):
    if len(arr) <= 1:
        return arr
    pivot = arr[0]
    left = []
    right = []
<｜fim▁hole｜>
        if arr[i] < pivot:
            left.append(arr[i])
        else:
            right.append(arr[i])
    return quick_sort(left) + [pivot] + quick_sort(right)<｜fim▁end｜>"""

inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_length=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True)[len(input_text):])

Repository Level Code Completion

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-coder-5.7bmqa-base", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("deepseek-ai/deepseek-coder-5.7bmqa-base", trust_remote_code=True).cuda()

input_text = """#utils.py
import torch
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

def load_data():
    iris = datasets.load_iris()
    X = iris.data
    y = iris.target

    # Standardize the data
    scaler = StandardScaler()
    X = scaler.fit_transform(X)

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

    # Convert numpy data to PyTorch tensors
    X_train = torch.tensor(X_train, dtype=torch.float32)
    X_test = torch.tensor(X_test, dtype=torch.float32)
    y_train = torch.tensor(y_train, dtype=torch.int64)
    y_test = torch.tensor(y_test, dtype=torch.int64)

    return X_train, X_test, y_train, y_test

def evaluate_predictions(y_test, y_pred):
    return accuracy_score(y_test, y_pred)
#model.py
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

class IrisClassifier(nn.Module):
    def __init__(self):
        super(IrisClassifier, self).__init__()
        self.fc = nn.Sequential(
            nn.Linear(4, 16),
            nn.ReLU(),
            nn.Linear(16, 3)
        )

    def forward(self, x):
        return self.fc(x)

    def train_model(self, X_train, y_train, epochs, lr, batch_size):
        criterion = nn.CrossEntropyLoss()
        optimizer = optim.Adam(self.parameters(), lr=lr)

        # Create DataLoader for batches
        dataset = TensorDataset(X_train, y_train)
        dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

        for epoch in range(epochs):
            for batch_X, batch_y in dataloader:
                optimizer.zero_grad()
                outputs = self(batch_X)
                loss = criterion(outputs, batch_y)
                loss.backward()
                optimizer.step()

    def predict(self, X_test):
        with torch.no_grad():
            outputs = self(X_test)
            _, predicted = outputs.max(1)
        return predicted.numpy()
#main.py
from utils import load_data, evaluate_predictions
from model import IrisClassifier as Classifier

def main():
    # Model training and evaluation
"""

inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=140)
print(tokenizer.decode(outputs[0]))

License

This code repository is licensed under the MIT License. The use of Deepseek Coder models is subject to the Model License. DeepSeek Coder supports commercial use.

See the LICENSE-MODEL for more details.

Contact

If you have any questions, please raise an issue or contact us at agi_code@deepseek.com.

Suggested labels

{ "key": "llm-experiments", "value": "Experiments and results related to Large Language Models" } { "key": "AI-Chatbots", "value": "Topics related to advanced chatbot platforms integrating multiple AI models" }

#494: Awesome-Efficient-LLM: A curated list for Efficient Large Language Models

### Details

Similarity score: 0.88 - [ ] [horseee/Awesome-Efficient-LLM: A curated list for Efficient Large Language Models](https://github.com/horseee/Awesome-Efficient-LLM#inference-acceleration)

Awesome-Efficient-LLM

A curated list for Efficient Large Language Models:

Knowledge Distillation
Network Pruning
Quantization
Inference Acceleration
Efficient MOE
Text Compression
Low-Rank Decomposition
Hardware/System Tuning
Survey
Leaderboard
🚀 Updates
Contributing

Inference Acceleration

…
Add your paper here, generate the required format, and submit a pull request.

Updates

Sep 27, 2023: Add tag for papers accepted at NeurIPS'23.
Sep 6, 2023: Add a new subdirectory project/ to organize those projects designed for developing a lightweight LLM.
July 11, 2023: Create a new subdirectory efficient_plm/ for papers applicable to PLMs (such as BERT, BART) but have yet to be verified for their effectiveness on LLMs.

Contributing

If you'd like to include your paper or need to update any details, please feel free to submit a pull request. You can generate the required markdown format for each paper by filling in the information in generate_item.py and running python generate_item.py. We warmly appreciate your contributions to this list. Alternatively, you can email me the links to your paper and code, and I will add your paper to the list at my earliest convenience.

URL: https://github.com/horseee/Awesome-Efficient-LLM#inference-acceleration

Suggested labels

{ "label-name": "efficient-llm-acceleration", "description": "Inference acceleration techniques for efficient large language models.", "repo": "horseee/Awesome-Efficient-LLM", "confidence": 70.8 }

irthomasthomas mentioned this issue Mar 16, 2024

Self-Retrieval: An LLM-Driven Information Retrieval Architecture for the Era of Large Language Models #768


ShellLM mentioned this issue Apr 12, 2024

Inference with Reference: Lossless Acceleration of Large Language Models by Nan Yang et al. #803
