python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
All interview scripts accept the following common options:
--interview    directly run an instruct-completion interview (default: senior)
--input        run a pre-prepared interview, used for completion and fim
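For example, to run the default senior interview directly, or to replay a pre-prepared interview (the model id and input path below are placeholders):
python ./interview_litellm.py --model <provider>/<model_id> --interview senior
python ./interview_litellm.py --model <provider>/<model_id> --input <prepared_interview>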
python ./interview_litellm.py --model <provider>/<model_id> --apikey <key>
See LiteLLM documentation for the full list of supported providers.
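As a concrete sketch, assuming an OpenAI account (the gpt-4o model id is only an example):
python ./interview_litellm.py --model openai/gpt-4o --apikey $OPENAI_API_KEY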
python ./interview_litellm.py --model openai/<model_id> --apibase http://<host>:<port>/
If the runtime cannot be inferred from the endpoint, you will be asked to provide --runtime
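In that case, pass it explicitly (see the oobabooga example below for one concrete --runtime value):
python ./interview_litellm.py --model openai/<model_id> --apibase http://<host>:<port>/ --runtime <runtime>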
ollama pull <model_id>
ollama serve
python ./interview_litellm.py --model ollama_chat/<model_id>
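A concrete sketch (the llama3.1 tag is illustrative; use whatever model you have pulled):
python ./interview_litellm.py --model ollama_chat/llama3.1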
llama-server -m /home/mike/models/Meta-Llama-3.1-8B-Instruct-Q6_K.gguf -c 8192 -fa -ngl 99 --host 0.0.0.0 --port 8080
Note: -fa enables flash attention, -ngl 99 enables GPU offloading.
python3 ./interview_litellm.py --model openai/Meta-Llama-3.1-8B-Instruct-Q6_K.gguf --apibase http://127.0.0.1:8080
koboldcpp /home/mike/models/Meta-Llama-3.1-8B-Instruct-Q6_K.gguf --contextsize 8192 --flashattention --gpulayers 99 --usecublas 1 --host 0.0.0.0 --port 8080
Note: --flashattention enables flash attention, --gpulayers 99 --usecublas 1 enables GPU offloading.
python3 ./interview_litellm.py --model openai/Meta-Llama-3.1-8B-Instruct-Q6_K.gguf --apibase http://127.0.0.1:8080
See Ooba Docs for how to launch.
python3 ./interview_litellm.py --model openai/<model_id> --apibase http://127.0.0.1:8080 --runtime oobabooga
The local CUDA executor uses all available GPUs by default; set CUDA_VISIBLE_DEVICES if you have attached accelerators you don't want used.
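For example, to restrict the executor to the first GPU only (the vllm runtime is just one of the options below):
CUDA_VISIBLE_DEVICES=0 python ./interview_cuda.py --model <model> --runtime vllm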
pip install -r requirements.txt -r requirements-transformers.txt
python ./interview_cuda.py --model <model> --runtime transformers
pip install -r requirements.txt -r requirements-vllm.txt
python ./interview_cuda.py --model <model> --runtime vllm
pip install wheel && pip install -r requirements.txt -r requirements-exl2.txt
python ./interview_cuda.py --model <model> --runtime exllama2
python ./interview_modal.py --model <model> --runtime <runtime> --gpu <gpu>
See modal docs for valid GPUs.
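For example (the vllm runtime and A100 GPU type are illustrative; check the Modal docs for the current GPU list):
python ./interview_modal.py --model <model> --runtime vllm --gpu A100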
TODO
TODO
bulk-eval.sh is a quick and easy way to run the evaluate.py script for all results/interview* files it finds.
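Assuming the script is executable, run it from the repository root:
./bulk-eval.sh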
streamlit run app.py "results/eval*" will then show you local results only.