Replies: 4 comments 8 replies
-
Yes, tagging @XenonMolecule @arnavsinghvi11 (mainly Arnav, but Michael has set up SGLang, which might be a lot faster)
-
Yes, DSPy is compatible with vLLM! You can find the vLLM client at line 119 of dsp/modules/hf_client.py in afdf353: https://github.com/stanfordnlp/dspy/blob/afdf3539794b3f4b1f3d85dc74fec8254e4b0e1c/dsp/modules/hf_client.py#L119
I have used this same class with SGLang before and it has been quite efficient.
Example usage: `llama = dspy.HFClientVLLM(model="meta-llama/Llama-2-13b-chat-hf", port=None, url=["http://URL:7000", "http://URL:7001"], max_tokens=150)`
I'm unsure of any tutorials/colab notebooks. It is mostly a drop-in replacement for HF TGI! It could still be nice to have a tutorial or some docs about this; I'll defer to @arnavsinghvi11 to point out if any such documentation exists!
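A minimal end-to-end sketch building on that example (the `URL:port` values are placeholders, and the `dspy.settings.configure` / `dspy.Predict` lines are illustrative additions, not something confirmed in this thread):
```python
import dspy

# Point DSPy at one or more already-running vLLM servers (URLs are placeholders).
llama = dspy.HFClientVLLM(
    model="meta-llama/Llama-2-13b-chat-hf",
    port=None,
    url=["http://URL:7000", "http://URL:7001"],
    max_tokens=150,
)

# Register it as the default LM and run a simple module against it.
dspy.settings.configure(lm=llama)
qa = dspy.Predict("question -> answer")
print(qa(question="What is vLLM?").answer)
```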
-
Nice!
My main question is: I'm only familiar with loading the model in vLLM itself, not via URLs, e.g., as in the quick start tutorial https://docs.vllm.ai/en/latest/getting_started/quickstart.html :
```python
# https://github.com/brando90/snap-cluster-setup/blob/main/src/test_vllm.py
# copy pasted from https://docs.vllm.ai/en/latest/getting_started/quickstart.html
# do export VLLM_USE_MODELSCOPE=True
from vllm import LLM, SamplingParams

def test_vllm():
    prompts = [
        "Hello, my name is",
        "The president of the United States is",
        "The capital of France is",
        "The future of AI is",
    ]
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
    llm = LLM(model="facebook/opt-125m")
    outputs = llm.generate(prompts, sampling_params)
    # Print the outputs.
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

if __name__ == "__main__":
    import time
    start_time = time.time()
    test_vllm()
    print(f"Time taken: {time.time() - start_time:.2f} seconds, or {(time.time() - start_time) / 60:.2f} minutes, or {(time.time() - start_time) / 3600:.2f} hours.\a")
```
How do I do it with that?
Or alternatively, can you show me how you're running your vLLM server so that your code works?
I'm also puzzled: some of DSPy claims to fine-tune/change the weights. How would that work if my model is local and served with vLLM? Perhaps that vLLM + weight-changing feature is not supported?
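For reference, a sketch of the missing server-side step, under the assumption that `HFClientVLLM` talks to a standalone vLLM HTTP server rather than to an in-process `vllm.LLM`; the launch command, entrypoint, and localhost URL below are assumptions, not taken from this thread:
```python
# The quickstart above runs the model in-process via vllm.LLM. DSPy's client instead
# expects a vLLM HTTP server that you launch separately, e.g. from a shell
# (entrypoint and port are assumptions; some setups may use the OpenAI-compatible
# server, vllm.entrypoints.openai.api_server, instead):
#
#   python -m vllm.entrypoints.api_server --model facebook/opt-125m --port 7000
#
import dspy

# The url passed to HFClientVLLM then points at that server; no vllm.LLM is built here.
lm = dspy.HFClientVLLM(model="facebook/opt-125m", port=None, url=["http://localhost:7000"], max_tokens=150)
dspy.settings.configure(lm=lm)
```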
-
Finetuning happens only with BootstrapFinetune. The others are prompt optimizers. This isn't done through vLLM but uses HF trainers, as @sutyum says. We have two nice research projects building more finetuning-based optimizers, but they won't be out until NeurIPS.
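For concreteness, a rough sketch of that distinction; the metric, training example, and the exact constructor/`compile` arguments here are placeholders and assumptions about the DSPy API of the time, not taken from this thread:
```python
import dspy
from dspy.teleprompt import BootstrapFewShot, BootstrapFinetune

# Placeholder metric, data, and program, purely for illustration.
def exact_match(example, pred, trace=None):
    return example.answer.lower() == pred.answer.lower()

trainset = [dspy.Example(question="What is 2 + 2?", answer="4").with_inputs("question")]
program = dspy.Predict("question -> answer")

# Prompt optimization: only prompts/demos change, so a vLLM-served model can be used as-is.
fewshot = BootstrapFewShot(metric=exact_match)
prompted_program = fewshot.compile(program, trainset=trainset)

# Weight finetuning: BootstrapFinetune trains through HF trainers on a local HF checkpoint;
# it does not update the weights inside a running vLLM server.
finetune = BootstrapFinetune(metric=exact_match)
finetuned_program = finetune.compile(program, trainset=trainset)
```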
-
I wanted to use vLLM and DSPy together, and since vLLM is among the fastest inference frameworks for open-source LLMs, I thought to ask: is it possible to use them together? Is there a tutorial/colab for it that I may build upon and contribute to?
Related: