```mermaid
flowchart TD;
PretrainedLlama3[Pretrained Llama3]-->LLaMAFactory["LLaMA-Factory (FT)"];
PretrainedLlama3-->PEFT["PEFT (FT)"];
PretrainedLlama3-->Unsloth["Unsloth (FT)"];
LLaMAFactory-->llamacpp-Q;
LLaMAFactory-->AutoAWQ["AutoAWQ (Q)"];
LLaMAFactory-->vLLM["vLLM (D)"];
LLaMAFactory-->TensorRT-LLM["TensorRT-LLM (D)"];
LLaMAFactory-->AutoGPTQ["AutoGPTQ (Q)"];
llamacpp-Q["llama.cpp (Q)"]-->llamacpp-D["llama.cpp (D)"];
llamacpp-Q-->ollama["ollama (D)"];
llamacpp-D-->LangChain-RAG["LangChain (RAG)"];
llamacpp-D-->LangChain-Agent["LangChain (Agent)"];
llamacpp-D-->LlamaIndex["LlamaIndex (RAG)"];
```
Note: FT = Fine-tuning, Q = Quantization, D = Deployment.

- **LLaMA-Factory**

  Specify `OUTPUT_DIR` and `EXPORT_DIR` when executing the script; the default values are `./Meta-Llama-3-8B-Instruct-Adapter` and `./Meta-Llama-3-8B-Instruct-zh-10k`.

  ```bash
  $ source ./finetune_llama-factory_lora.sh
  ```
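
  After the script finishes, `EXPORT_DIR` should contain the merged model in standard Hugging Face format (LLaMA-Factory's export merges the LoRA adapter into the base weights), so it can be sanity-checked with `transformers` before quantization. A minimal sketch, assuming the default `./Meta-Llama-3-8B-Instruct-zh-10k` export path; the prompt is illustrative:

  ```python
  from transformers import AutoModelForCausalLM, AutoTokenizer

  export_dir = "./Meta-Llama-3-8B-Instruct-zh-10k"  # default EXPORT_DIR

  tokenizer = AutoTokenizer.from_pretrained(export_dir)
  model = AutoModelForCausalLM.from_pretrained(
      export_dir, torch_dtype="auto", device_map="auto"
  )

  # Build a chat-formatted prompt and generate a short reply.
  messages = [{"role": "user", "content": "用一句话介绍你自己。"}]
  inputs = tokenizer.apply_chat_template(
      messages, add_generation_prompt=True, return_tensors="pt"
  ).to(model.device)
  outputs = model.generate(inputs, max_new_tokens=64)
  print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
  ```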
- **llama.cpp**

  Example using `./Meta-Llama-3-8B-Instruct-zh-10k`:

  ```bash
  $ source ./quantize_llama.cpp.sh
  ```
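
  Once the quantized GGUF file exists, it can be loaded with the `llama-cpp-python` bindings to confirm the model still responds sensibly. A minimal sketch, assuming the Q8_0 output path referenced in the deployment section below:

  ```python
  from llama_cpp import Llama

  # Load the quantized GGUF produced by the quantization script.
  llm = Llama(
      model_path="./Meta-Llama-3-8B-Instruct-zh-10k/meta-llama-3-8b-instruct-zh-10k.Q8_0.gguf",
      n_ctx=4096,       # context window
      n_gpu_layers=-1,  # offload all layers to the GPU if available; set 0 for CPU-only
  )

  result = llm.create_chat_completion(
      messages=[{"role": "user", "content": "用一句话介绍大语言模型。"}],
      max_tokens=64,
  )
  print(result["choices"][0]["message"]["content"])
  ```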
- **AutoAWQ**

  Adjust the quantization settings as needed.

  ```bash
  $ python3 quantize_autoawq.py \
      --pretrained_model_dir /path/to/your-pretrain-model-dir \
      --quantized_model_dir /path/to/your-quantized_model_dir
  ```
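
  For reference, `quantize_autoawq.py` presumably wraps AutoAWQ's standard quantize-and-save flow. A minimal sketch of that flow (the `quant_config` values are illustrative AWQ defaults, not necessarily the repository's settings):

  ```python
  from awq import AutoAWQForCausalLM
  from transformers import AutoTokenizer

  pretrained_model_dir = "/path/to/your-pretrain-model-dir"
  quantized_model_dir = "/path/to/your-quantized_model_dir"

  # Typical 4-bit AWQ settings; adjust as needed.
  quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

  model = AutoAWQForCausalLM.from_pretrained(pretrained_model_dir)
  tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, trust_remote_code=True)

  # Calibrate and quantize, then write the quantized weights and tokenizer.
  model.quantize(tokenizer, quant_config=quant_config)
  model.save_quantized(quantized_model_dir)
  tokenizer.save_pretrained(quantized_model_dir)
  ```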
- **AutoGPTQ**

  Modify the quantization settings and calibration examples according to your requirements.

  ```bash
  $ python3 quantize_autogptq.py \
      --pretrained_model_dir /path/to/your-pretrain-model-dir \
      --quantized_model_dir /path/to/your-quantized_model_dir
  ```
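
  Likewise, `quantize_autogptq.py` presumably follows AutoGPTQ's quantize-and-save flow, where the calibration examples drive the GPTQ weight search. A minimal sketch under that assumption (the settings and the single calibration sample are illustrative only):

  ```python
  from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
  from transformers import AutoTokenizer

  pretrained_model_dir = "/path/to/your-pretrain-model-dir"
  quantized_model_dir = "/path/to/your-quantized_model_dir"

  tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)

  # Calibration examples: tokenized text used to measure activation statistics.
  examples = [tokenizer("AutoGPTQ 是一个基于 GPTQ 算法的大模型量化工具包。")]

  quantize_config = BaseQuantizeConfig(
      bits=4,          # quantize weights to 4-bit
      group_size=128,  # per-group quantization granularity
      desc_act=False,  # skip activation-order reordering for faster inference
  )

  model = AutoGPTQForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)
  model.quantize(examples)
  model.save_quantized(quantized_model_dir, use_safetensors=True)
  tokenizer.save_pretrained(quantized_model_dir)
  ```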
- **llama.cpp**

  Assuming the GGUF file path is `./Meta-Llama-3-8B-Instruct-zh-10k/meta-llama-3-8b-instruct-zh-10k.Q8_0.gguf`:

  Deploy via the command line:

  ```bash
  $ source ./deploy_llama.cpp_cli.sh
  ```

  Or deploy using Docker (untested):

  ```bash
  $ source ./deploy_llama.cpp_docker.sh
  ```

  Test the deployment:

  ```bash
  $ source ./deploy_llama.cpp_test.sh
  ```
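
  If the deployment runs llama.cpp's built-in HTTP server (`llama-server`), it exposes an OpenAI-compatible `/v1/chat/completions` endpoint, by default on port 8080. A minimal Python check under that assumption (host, port, and prompt are illustrative):

  ```python
  import requests

  # Assumes a llama.cpp server is listening on localhost:8080.
  url = "http://localhost:8080/v1/chat/completions"
  payload = {
      # The server answers with whichever GGUF it was started with.
      "model": "meta-llama-3-8b-instruct-zh-10k",
      "messages": [{"role": "user", "content": "你好，请简单介绍一下你自己。"}],
      "temperature": 0.7,
  }

  response = requests.post(url, json=payload, timeout=120)
  response.raise_for_status()
  print(response.json()["choices"][0]["message"]["content"])
  ```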
- **ollama**

  Prepare for deployment:

  ```bash
  $ source ./deploy_ollama_prepare.sh
  ```

  For the initial deployment of a custom model:

  ```bash
  $ source ./deploy_ollama_create.sh
  ```

  This step involves configuring the `Modelfile`. An example is provided for guidance; customize it as needed.

  Host the LLM locally:

  ```bash
  $ source ./deploy_ollama.sh
  ```

  Single-turn chat test:

  ```bash
  $ source ./deploy_ollama_test_chat.sh
  ```

  Multi-turn chat test:

  ```bash
  $ python3 deploy_ollama_test_chat-multi-turn.py
  ```

  Note: this `.py` file uses the OpenAI-style API call format to interact with the model (a sketch of a comparable exchange is shown at the end of this item). For sequential conversations, start the server first:

  ```bash
  $ source ./deploy_ollama_server.sh
  ```

  Then run:

  ```bash
  $ source ./deploy_ollama.sh
  ```

  The subsequent steps remain the same.
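
  For reference, a multi-turn exchange along these lines can be driven through ollama's OpenAI-compatible endpoint with the `openai` client. A minimal sketch, assuming the server listens on the default port 11434 and the custom model was created under the hypothetical name `llama3-zh-10k`:

  ```python
  from openai import OpenAI

  # ollama exposes an OpenAI-compatible API; the api_key is required by the client but unused.
  client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
  model_name = "llama3-zh-10k"  # hypothetical name passed to `ollama create`

  messages = [{"role": "system", "content": "You are a helpful bilingual assistant."}]

  for user_turn in ["你好！", "请把上一句话翻译成英文。"]:
      messages.append({"role": "user", "content": user_turn})
      reply = client.chat.completions.create(model=model_name, messages=messages)
      answer = reply.choices[0].message.content
      # Keep the assistant's reply in the history so the next turn has full context.
      messages.append({"role": "assistant", "content": answer})
      print(f"User: {user_turn}\nAssistant: {answer}\n")
  ```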
- Fine-tuning:
  - PEFT
  - Unsloth
- Quantization: N/A
- Deployment:
  - TensorRT-LLM & Triton
  - vLLM
- RAG:
  - LangChain
  - LlamaIndex