Cache optimized routing ("PrefixHash" load balancing - i.e. CHWBL) #333
Merged
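For readers unfamiliar with CHWBL (consistent hashing with bounded loads), the idea named in the title can be sketched roughly as follows. This is a minimal Python illustration, not the code in this PR: the class name `BoundedLoadRing`, the `replication` and `load_factor` parameters, and the endpoint names are all assumptions made for the example.

```python
import bisect
import hashlib
import math


def _hash(value: str) -> int:
    # Stable 64-bit hash (Python's built-in hash() is salted per process).
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")


class BoundedLoadRing:
    """Hypothetical CHWBL ring; names and defaults are illustrative only."""

    def __init__(self, endpoints, replication=100, load_factor=1.25):
        self.load_factor = load_factor
        self.loads = {ep: 0 for ep in endpoints}
        # Each endpoint is placed on the ring many times (virtual nodes).
        self.ring = sorted(
            (_hash(f"{ep}#{i}"), ep) for ep in endpoints for i in range(replication)
        )
        self.keys = [h for h, _ in self.ring]

    def pick(self, prefix: str) -> str:
        """Return the endpoint for this prefix and count the request as in flight."""
        total = sum(self.loads.values()) + 1  # include the incoming request
        capacity = math.ceil(self.load_factor * total / len(self.loads))
        start = bisect.bisect_left(self.keys, _hash(prefix))
        # Walk clockwise until an endpoint under the load bound is found.
        for step in range(len(self.ring)):
            _, ep = self.ring[(start + step) % len(self.ring)]
            if self.loads[ep] + 1 <= capacity:
                self.loads[ep] += 1
                return ep
        raise RuntimeError("no endpoint available")

    def done(self, endpoint: str) -> None:
        """Mark one in-flight request on the endpoint as finished."""
        self.loads[endpoint] -= 1


ring = BoundedLoadRing(["pod-a", "pod-b"])
endpoint = ring.pick("first 100 characters of a chat thread")
print(endpoint)
ring.done(endpoint)
```

In this sketch, requests sharing a prefix land on the same endpoint as long as it stays under the load bound, and spill over to the next endpoint on the ring once it does not.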
20 commits
d658966 Rebase main (nstogner)
a23c497 Checkpoint (nstogner)
219bd17 Checkpoint (nstogner)
7a8e3eb Refactor request parsing into apiutils (nstogner)
2a96444 Set load balancing default (nstogner)
0a6470b Rename vars (nstogner)
4f42cde Fix unit tests (nstogner)
b930776 Fix Messenger integration test (nstogner)
32b8f5e Fix unit test (nstogner)
56aac4b Pass through prefix in request to load balancer (nstogner)
d279796 Fix messenger int test (nstogner)
005686a Hopefully fix panic when removing endpoint (nstogner)
4c9361a Add tests for load balancing strategies (nstogner)
214816f Update lb behavior tests to include both strategies (nstogner)
ccbbe53 Fix test (nstogner)
47cce1b Fix prefix parsing (nstogner)
83433cc Add k6 benchmark for chat threads (nstogner)
c91eb71 Add benchmark and docs (nstogner)
58a7a06 Address comments (nstogner)
95c0427 Update benchmark - remove big data file (nstogner)
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
data/ShareGPT_V3_unfiltered_cleaned_split.json |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
data/*.json |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,14 @@ | ||
FROM ubuntu:20.04 | ||
|
||
RUN apt-get update && apt-get install -y build-essential make python3 wget vim | ||
|
||
# Install k6 binary. | ||
ENV K6_VERSION=v0.55.0 | ||
RUN wget https://github.com/grafana/k6/releases/download/${K6_VERSION}/k6-${K6_VERSION}-linux-amd64.tar.gz && tar -zxvf k6-${K6_VERSION}-linux-amd64.tar.gz && mv k6-${K6_VERSION}-linux-amd64/k6 /usr/local/bin && rm k6-${K6_VERSION}-linux-amd64.tar.gz | ||
|
||
WORKDIR /work | ||
|
||
COPY ./k6.js . | ||
COPY ./Makefile . | ||
COPY ./data ./data | ||
COPY ./scenarios ./scenarios |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,10 @@ | ||
data/ShareGPT_V3_unfiltered_cleaned_split.json: | ||
	cd data && wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json | ||
|
||
.PHONY: data | ||
data: data/ShareGPT_V3_unfiltered_cleaned_split.json | ||
	cd data && python prepare-message-threads.py | ||
|
||
run: | ||
	ls scenarios/${SCENARIO} | ||
	CONFIG_DIR=scenarios/${SCENARIO} DATA_DIR=data MODEL_ADDR=kubeai/openai k6 run ./k6.js |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,43 @@ | ||
import json | ||
|
||
|
||
def main(): | ||
    with open("./ShareGPT_V3_unfiltered_cleaned_split.json", "r") as f: | ||
        data = json.load(f) | ||
|
||
    # Select a subset of the conversations that start with a human message. | ||
    max_threads = 2000 | ||
    output = [] | ||
    for entry in data: | ||
        conv = entry.get("conversations") | ||
        if conv and conv[0]["from"] == "human" and len(conv[0]["value"]) != 0: | ||
            # Filter the conversation to only include messages from a human using a for loop. | ||
            # entry["userMessages"] = [c["value"] for c in conv if c["from"] == "human"] | ||
            totalContentLength = 0 | ||
            userMessages = [] | ||
            for c in conv: | ||
                if c["from"] == "human": | ||
                    content = c["value"] | ||
                    userMessages.append(content) | ||
                    totalContentLength += len(content) | ||
|
||
            if totalContentLength < 2500: | ||
                continue | ||
|
||
            if len(userMessages) < 5: | ||
                continue | ||
|
||
            # Replace the original conversation with the user messages only. | ||
            entry["userMessages"] = userMessages | ||
            del entry["conversations"] | ||
            output.append(entry) | ||
|
||
        if len(output) >= max_threads: | ||
            break | ||
|
||
    with open("./message-threads.json", "w") as f: | ||
        json.dump(output, f, indent=4) | ||
|
||
|
||
if __name__ == "__main__": | ||
    main() |
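The selection rules in the script above (thread starts with a non-empty human message, at least 5 human messages, at least 2500 characters of human text in total) can be checked on small in-memory data. The sketch below restates them as a predicate; the function names `user_messages` and `keep_thread` and the helper `make_entry` are hypothetical and exist only for this example.

```python
def user_messages(entry):
    """Return the human-authored messages of a ShareGPT-style entry."""
    conv = entry.get("conversations") or []
    return [c["value"] for c in conv if c["from"] == "human"]


def keep_thread(entry, min_chars=2500, min_messages=5):
    """Restate the script's selection rules as a single predicate."""
    conv = entry.get("conversations")
    if not conv or conv[0]["from"] != "human" or len(conv[0]["value"]) == 0:
        return False
    msgs = user_messages(entry)
    return sum(len(m) for m in msgs) >= min_chars and len(msgs) >= min_messages


def make_entry(turns, chars_per_message):
    """Build a synthetic thread with alternating human/gpt messages."""
    conv = []
    for _ in range(turns):
        conv.append({"from": "human", "value": "x" * chars_per_message})
        conv.append({"from": "gpt", "value": "ok"})
    return {"conversations": conv}


print(keep_thread(make_entry(turns=5, chars_per_message=500)))
print(keep_thread(make_entry(turns=4, chars_per_message=1000)))
```

The first synthetic thread meets both thresholds exactly; the second has enough text but only four human messages.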
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,71 @@ | ||
import { check } from 'k6'; | ||
import { scenario } from 'k6/execution'; | ||
import http from 'k6/http'; | ||
import { Trend, Counter } from 'k6/metrics'; | ||
|
||
const model_addr = __ENV.MODEL_ADDR; | ||
const config_dir = __ENV.CONFIG_DIR; | ||
const data_dir = __ENV.DATA_DIR; | ||
|
||
const timePerToken = new Trend('time_per_token', true); | ||
const tokens = new Counter('tokens'); | ||
const new_tokens = new Counter('new_tokens'); | ||
const input_tokens = new Counter('input_tokens'); | ||
|
||
const k6Options = JSON.parse(open(`${config_dir}/k6.json`)); | ||
const baseRequest = JSON.parse(open(`${config_dir}/base-request.json`)); | ||
const messageThreads = JSON.parse(open(`${data_dir}/message-threads.json`)); | ||
|
||
export const options = k6Options; | ||
|
||
export default function run() { | ||
const headers = { 'Content-Type': 'application/json' }; | ||
const msgThread = messageThreads[scenario.iterationInTest % messageThreads.length]; | ||
// Deep-copy the base request so each iteration starts from a clean payload. | ||
const payload = JSON.parse(JSON.stringify(baseRequest)); | ||
|
||
// console.log(`Message thread: ${JSON.stringify(msgThread)}`); | ||
|
||
// Iterate over all the messages in the thread, appending the completions to the same payload. | ||
for (let i = 0; i < msgThread["userMessages"].length; i++) { | ||
payload.messages.push({ | ||
"role": "user", | ||
"content": msgThread["userMessages"][i] | ||
}); | ||
//console.log(`Payload: ${JSON.stringify(payload)}`); | ||
|
||
const res = http.post(`http://${model_addr}/v1/chat/completions`, JSON.stringify(payload), { | ||
headers, | ||
}); | ||
if (res.status >= 400 && res.status < 500) { | ||
return; | ||
} | ||
|
||
check(res, { | ||
'Post status is 200': (res) => res.status === 200, | ||
}); | ||
const duration = res.timings.duration; | ||
|
||
if (res.status === 200) { | ||
// console.log(`Status: ${res.status}`); | ||
const body = res.json(); | ||
|
||
const completion_tokens = body.usage.completion_tokens; | ||
const prompt_tokens = body.usage.prompt_tokens; | ||
const latency_ms_per_token = duration / completion_tokens; | ||
|
||
new_tokens.add(completion_tokens); | ||
input_tokens.add(prompt_tokens); | ||
timePerToken.add(latency_ms_per_token); | ||
tokens.add(completion_tokens + prompt_tokens); | ||
|
||
const msg0 = body.choices[0].message; | ||
payload.messages.push({ | ||
"role": msg0.role, | ||
"content": msg0.content | ||
}); | ||
} else { | ||
console.log(`Error Status: ${res.status}`); | ||
console.log(`Response: ${res.body}`); | ||
} | ||
} | ||
} |
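Because the k6 script appends each completion to the same payload, every request in a thread re-sends everything that came before it. The Python sketch below illustrates how much of each prompt was already part of the previous turn, which is presumably the portion a prefix cache can reuse when the thread keeps hitting the same replica. It counts characters rather than tokens and ignores whatever `base-request.json` contributes; the function name `reused_history_chars` is hypothetical.

```python
def reused_history_chars(user_msgs, replies):
    """For each turn, return (prompt_chars, chars_already_in_the_thread).

    The prompt of turn N is the full history (user messages and replies of
    turns 1..N-1) plus the new user message of turn N.
    """
    history_chars = 0
    turns = []
    for user, reply in zip(user_msgs, replies):
        reused = history_chars            # everything sent or generated so far
        prompt = reused + len(user)       # history plus the new user message
        turns.append((prompt, reused))
        history_chars = prompt + len(reply)
    return turns


for prompt, reused in reused_history_chars(
    ["a" * 600, "b" * 500, "c" * 700], ["r" * 40, "r" * 40, "r" * 40]
):
    print(f"prompt={prompt} chars, reused={reused} chars ({reused / prompt:.0%})")
```

The reused share grows with every turn, which fits the benchmark's choice of threads with at least five fairly long user messages.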
131 changes: 131 additions & 0 deletions
benchmarks/chat/scenarios/least-load-vs-prefix-hash/README.md
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,131 @@ | ||
# Results | ||
|
||
Under specific conditions: | ||
|
||
* Restricted GPU memory | ||
* Low `max_tokens` to be generated | ||
* Chat threads with decently long user messages | ||
|
||
Prefix hashing showed a `34%` decrease in average time per token. | ||
|
||
`712.11ms (LeastLoad) --> 469.34ms (PrefixHash)` | ||
|
||
## Steps taken | ||
|
||
```bash | ||
gcloud container clusters create-auto cluster-1 \ | ||
--location=us-central1 | ||
skaffold run -f ./skaffold.yaml --tail --port-forward --profile kubeai-only-gke --default-repo us-central1-docker.pkg.dev/substratus-dev | ||
|
||
cd ./benchmarks/chat | ||
make data | ||
export IMG=us-central1-docker.pkg.dev/substratus-dev/default/kubeai-benchmark-chat:v0.0.2 | ||
docker build -t $IMG . && docker push $IMG | ||
|
||
kubectl apply -f ./scenarios/least-load-vs-prefix-hash/model.yaml | ||
kubectl apply -f ./scenarios/least-load-vs-prefix-hash/pod.yaml | ||
|
||
# Run 2x (to ensure both cases start with a preloaded cache) | ||
kubectl exec -it chat-benchmark -- env SCENARIO=least-load-vs-prefix-hash make run | ||
|
||
kubectl patch model llama-3.1-8b-instruct-fp8-l4 --type='merge' -p '{"spec": {"loadBalancing": {"strategy": "PrefixHash"}}}' | ||
kubectl exec -it chat-benchmark -- env SCENARIO=least-load-vs-prefix-hash make run | ||
``` | ||
|
||
## Next Steps | ||
|
||
* Rerun with increased replicas (e.g. 10 instead of 2) | ||
|
||
## Benchmark Output | ||
|
||
### LeastLoad | ||
|
||
``` | ||
/\ Grafana /‾‾/ | ||
/\ / \ |\ __ / / | ||
/ \/ \ | |/ / / ‾‾\ | ||
/ \ | ( | (‾) | | ||
/ __________ \ |_|\_\ \_____/ | ||
|
||
execution: local | ||
script: ./k6.js | ||
output: - | ||
|
||
scenarios: (100.00%) 1 scenario, 80 max VUs, 10m30s max duration (incl. graceful stop): | ||
* chat: 1000 iterations shared among 80 VUs (maxDuration: 10m0s, gracefulStop: 30s) | ||
|
||
|
||
✓ Post status is 200 | ||
|
||
checks.........................: 100.00% 7341 out of 7341 | ||
data_received..................: 4.7 MB 7.9 kB/s | ||
data_sent......................: 25 MB 42 kB/s | ||
http_req_blocked...............: avg=161.4µs min=2.83µs med=5.8µs max=16.67ms p(90)=8.06µs p(95)=10.19µs | ||
http_req_connecting............: avg=55.73µs min=0s med=0s max=8.41ms p(90)=0s p(95)=0s | ||
http_req_duration..............: avg=6.31s min=165.25ms med=6.66s max=11.65s p(90)=8.55s p(95)=9.07s | ||
{ expected_response:true }...: avg=6.31s min=165.25ms med=6.66s max=11.65s p(90)=8.55s p(95)=9.07s | ||
✓ http_req_failed................: 0.00% 0 out of 7341 | ||
http_req_receiving.............: avg=84.64µs min=29.4µs med=74.05µs max=732.69µs p(90)=129.94µs p(95)=154.19µs | ||
http_req_sending...............: avg=68µs min=12.1µs med=32.3µs max=1.38ms p(90)=144.04µs p(95)=173.19µs | ||
http_req_tls_handshaking.......: avg=0s min=0s med=0s max=0s p(90)=0s p(95)=0s | ||
http_req_waiting...............: avg=6.31s min=165.04ms med=6.66s max=11.65s p(90)=8.55s p(95)=9.07s | ||
http_reqs......................: 7341 12.422953/s | ||
input_tokens...................: 4990223 8444.803735/s | ||
iteration_duration.............: avg=46.39s min=6.73s med=41.26s max=4m13s p(90)=1m8s p(95)=1m28s | ||
iterations.....................: 1000 1.69227/s | ||
new_tokens.....................: 68062 115.179268/s | ||
time_per_token.................: avg=712.11ms min=39.56ms med=703.28ms max=2.69s p(90)=928.58ms p(95)=1.09s | ||
tokens.........................: 5058285 8559.983003/s | ||
vus............................: 1 min=0 max=80 | ||
vus_max........................: 80 min=21 max=80 | ||
|
||
|
||
running (09m50.9s), 00/80 VUs, 1000 complete and 0 interrupted iterations | ||
chat ✓ [======================================] 80 VUs 09m50.9s/10m0s 1000/1000 shared iters | ||
``` | ||
|
||
### PrefixHash | ||
|
||
``` | ||
/\ Grafana /‾‾/ | ||
/\ / \ |\ __ / / | ||
/ \/ \ | |/ / / ‾‾\ | ||
/ \ | ( | (‾) | | ||
/ __________ \ |_|\_\ \_____/ | ||
|
||
execution: local | ||
script: ./k6.js | ||
output: - | ||
|
||
scenarios: (100.00%) 1 scenario, 80 max VUs, 10m30s max duration (incl. graceful stop): | ||
* chat: 1000 iterations shared among 80 VUs (maxDuration: 10m0s, gracefulStop: 30s) | ||
|
||
|
||
✓ Post status is 200 | ||
|
||
checks.........................: 100.00% 7341 out of 7341 | ||
data_received..................: 4.7 MB 12 kB/s | ||
data_sent......................: 25 MB 65 kB/s | ||
http_req_blocked...............: avg=268.24µs min=2.94µs med=5.76µs max=28.19ms p(90)=8.17µs p(95)=10.41µs | ||
http_req_connecting............: avg=136.33µs min=0s med=0s max=17.7ms p(90)=0s p(95)=0s | ||
http_req_duration..............: avg=4.08s min=151.9ms med=2.45s max=12.32s p(90)=9.63s p(95)=10.26s | ||
{ expected_response:true }...: avg=4.08s min=151.9ms med=2.45s max=12.32s p(90)=9.63s p(95)=10.26s | ||
✓ http_req_failed................: 0.00% 0 out of 7341 | ||
http_req_receiving.............: avg=81.81µs min=28.68µs med=72.08µs max=786.09µs p(90)=125.04µs p(95)=148.6µs | ||
http_req_sending...............: avg=63.61µs min=11.85µs med=31.65µs max=1.59ms p(90)=136.85µs p(95)=161.88µs | ||
http_req_tls_handshaking.......: avg=0s min=0s med=0s max=0s p(90)=0s p(95)=0s | ||
http_req_waiting...............: avg=4.08s min=151.81ms med=2.45s max=12.32s p(90)=9.63s p(95)=10.26s | ||
http_reqs......................: 7341 19.230625/s | ||
input_tokens...................: 4990576 13073.409349/s | ||
iteration_duration.............: avg=29.98s min=2.37s med=20.29s max=2m53s p(90)=1m1s p(95)=1m18s | ||
iterations.....................: 1000 2.619619/s | ||
new_tokens.....................: 68218 178.705191/s | ||
time_per_token.................: avg=469.34ms min=44.2ms med=257.72ms max=3.86s p(90)=1s p(95)=1.1s | ||
tokens.........................: 5058794 13252.11454/s | ||
vus............................: 3 min=0 max=80 | ||
vus_max........................: 80 min=19 max=80 | ||
|
||
|
||
running (06m21.7s), 00/80 VUs, 1000 complete and 0 interrupted iterations | ||
chat ✓ [======================================] 80 VUs 06m21.7s/10m0s 1000/1000 shared iters | ||
``` |
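The headline figure can be re-derived from the two k6 summaries. The short Python sketch below computes the relative change for a few of the reported metrics; the values are copied from the output above (run times converted to seconds, and the PrefixHash p(90) taken as the printed `1s`), and the helper name `pct_change` is just for this example.

```python
def pct_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent (negative = decrease)."""
    return (after - before) / before * 100.0


# (LeastLoad, PrefixHash) pairs copied from the k6 summaries.
metrics = {
    "time_per_token avg (ms)": (712.11, 469.34),
    "time_per_token med (ms)": (703.28, 257.72),
    "time_per_token p(90) (ms)": (928.58, 1000.0),
    "http_reqs per second": (12.422953, 19.230625),
    "total run time (s)": (590.9, 381.7),
}

for name, (least_load, prefix_hash) in metrics.items():
    print(f"{name}: {pct_change(least_load, prefix_hash):+.1f}%")
```

Beyond the 34% drop in the mean, the median time per token falls considerably further, while the printed p(90) is slightly higher under PrefixHash, so the gain appears concentrated in the typical request rather than the tail.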
Is this ignoring the system prompt when using chat completion?
It is.
Yep
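The code this thread refers to is not shown on this page, so the following is only a guess at the behavior being confirmed: a Python sketch that derives the routing prefix from the first user message of a chat completion request and skips any system prompt. The function name `chat_prefix` and the `prefix_chars` default are assumptions, not the PR's actual API.

```python
def chat_prefix(body: dict, prefix_chars: int = 100) -> str:
    """Return the leading characters of the first user message.

    System (and any other non-user) messages are skipped, so two threads
    that differ only in their system prompt yield the same prefix.
    """
    for message in body.get("messages", []):
        if message.get("role") == "user":
            content = message.get("content", "")
            # Only plain-string content is handled in this sketch.
            return content[:prefix_chars] if isinstance(content, str) else ""
    return ""


request = {
    "model": "example-model",  # placeholder name for illustration
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize the attached report in three bullet points."},
    ],
}
print(chat_prefix(request, prefix_chars=20))
```

One plausible reason to skip the system prompt is that it is often identical across all threads of an application, in which case hashing it would send every request to the same endpoint.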