[ICLR 2024] Efficient Streaming Language Models with Attention Sinks


Efficient Streaming Language Models with Attention Sinks, with Retrieval-Augmented Generation

Fork of https://github.com/mit-han-lab/streaming-llm for the MIT 6.5940 final project.

Demo

Watch the demo here

Usage

Environment Setup

conda create -yn streaming python=3.8
conda activate streaming

pip install torch torchvision torchaudio
pip install transformers==4.33.0 accelerate datasets evaluate wandb scikit-learn scipy sentencepiece

conda install -c conda-forge faiss-gpu

python setup.py develop

Run Streaming Llama Chatbot

CUDA_VISIBLE_DEVICES=0 python examples/run_streaming_llama.py --enable_streaming
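To give a sense of what the streaming mode does under the hood, here is a minimal, illustrative sketch of the attention-sink eviction policy from the paper: keep the first few tokens ("attention sinks") plus a recent window, and evict everything in between. Plain Python lists stand in for the real per-layer key/value tensors, and the function name and sizes are illustrative, not the repo's actual API.

```python
# Illustrative sketch of attention-sink KV cache eviction (not the repo's
# actual implementation): retain `start_size` sink tokens plus the
# `recent_size` most recent tokens; drop the middle of the sequence.

def evict_for_space(cache, start_size=4, recent_size=8):
    """Trim `cache` so it holds at most start_size + recent_size entries."""
    if len(cache) <= start_size + recent_size:
        return cache
    return cache[:start_size] + cache[-recent_size:]

# Stream 20 token ids through a 4-sink + 8-recent budget:
cache = []
for token_id in range(20):
    cache.append(token_id)
    cache = evict_for_space(cache)

print(cache)  # sink tokens 0-3 survive; the middle of the stream is evicted
```

The key idea is that the sink tokens at the start of the sequence are never evicted, which the paper finds is what keeps generation stable as the window slides.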

Run Streaming Llama Chatbot with RAG

CUDA_VISIBLE_DEVICES=0 python examples/run_streaming_llama.py --enable_streaming --enable_rag
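The fork's RAG internals aren't shown in this README, but as a rough, self-contained illustration of the retrieval step behind a flag like `--enable_rag`: embed the query and documents as vectors, pick the nearest document by cosine similarity, and prepend it to the prompt. Real code would use an embedding model and a FAISS index (hence the `faiss-gpu` dependency above); the fixed toy vectors and function names here are assumptions for the sketch.

```python
# Toy retrieval-augmented prompting sketch (illustrative only).
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve(query_vec, docs):
    """Return the document whose embedding is closest to the query."""
    return max(docs, key=lambda d: cosine(query_vec, d["vec"]))

docs = [
    {"text": "Attention sinks stabilize streaming LLMs.", "vec": [1.0, 0.1, 0.0]},
    {"text": "FAISS performs fast vector similarity search.", "vec": [0.0, 1.0, 0.2]},
]
query_vec = [0.1, 0.9, 0.3]
best = retrieve(query_vec, docs)

# The retrieved passage is prepended as context for the chatbot:
prompt = f"Context: {best['text']}\nQuestion: ..."
print(best["text"])  # the FAISS document is closest to this query
```

In the actual chatbot, the retrieved context is fed into the streaming model alongside the user's turn, so the sliding KV cache and the retrieval step compose.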
