Cache optimized routing ("PrefixHash" load balancing - i.e. CHWBL) #333

Merged (20 commits) on Dec 18, 2024

nstogner (Contributor) commented Dec 3, 2024

Implementation of proposal: #314

  • Add .spec.loadBalancing field to Model
  • Add PrefixHash (i.e. "Consistent Hashing with Bounded Loads" - CHWBL) load balancing strategy
  • Rename endpoints package to loadbalancer
  • Rename modelscaler package to modelclient
  • Refactor request parsing logic out of modelproxy and messenger and into apiutils as a shared library
  • Add Load Balancing concepts doc
  • Add benchmark showing 34% improvement in time per generated token using PrefixHash over LeastLoad in specific circumstances

TODO:

  • File an issue about making PrefixHash the default strategy in the future if the benchmarks look good
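For readers new to CHWBL: PrefixHash builds on a consistent-hash ring. The sketch below shows only that ring, under stated assumptions: each endpoint is placed on the ring `replication` times (virtual nodes), keys are hashed with FNV-1a, and a prefix is routed to the first virtual node at or after its hash. The names `ring`, `newRing`, and `lookup` are hypothetical, the bounded-load check is omitted, and this is not the code merged in this PR.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// hash64 hashes a string with FNV-1a (an assumed choice of hash function).
func hash64(s string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(s))
	return h.Sum64()
}

// ring is a hypothetical consistent-hash ring. Each endpoint appears
// `replication` times (virtual nodes) so keys spread more evenly.
type ring struct {
	sortedHashes []uint64
	owner        map[uint64]string // virtual-node hash -> endpoint name
}

func newRing(endpoints []string, replication int) *ring {
	r := &ring{owner: make(map[uint64]string)}
	for _, ep := range endpoints {
		for i := 0; i < replication; i++ {
			h := hash64(fmt.Sprintf("%s-%d", ep, i))
			r.sortedHashes = append(r.sortedHashes, h)
			r.owner[h] = ep
		}
	}
	sort.Slice(r.sortedHashes, func(a, b int) bool {
		return r.sortedHashes[a] < r.sortedHashes[b]
	})
	return r
}

// lookup routes a prompt prefix to the first virtual node whose hash is >= the
// prefix hash, wrapping to the start of the ring. Assumes a non-empty ring.
func (r *ring) lookup(prefix string) string {
	key := hash64(prefix)
	i := sort.Search(len(r.sortedHashes), func(i int) bool {
		return r.sortedHashes[i] >= key
	})
	if i == len(r.sortedHashes) {
		i = 0 // wrap around
	}
	return r.owner[r.sortedHashes[i]]
}

func main() {
	r := newRing([]string{"pod-a", "pod-b", "pod-c"}, 100)
	// The same prefix always maps to the same endpoint, so that endpoint's
	// prefix (KV) cache is more likely to be reused.
	fmt.Println(r.lookup("Summarize the following document"))
}
```

Adding or removing an endpoint only moves the keys adjacent to its virtual nodes, which is why a ring is used rather than `hash % n`.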

@@ -144,6 +148,33 @@ type Adapter struct {
	URL string `json:"url"`
}

type LoadBalancing struct {
	Strategy LoadBalancingStrategy `json:"strategy"`
	CHWBL    CHWBL                 `json:"chwbl"`
Contributor

nit: prefer the full name over abbreviations that most people won't understand. It's fine if it's long.

Contributor Author

Changed it to PrefixHash, since this is a user-facing API and that name is probably easier to understand.

api/v1/model_types.go (outdated, resolved)
@nstogner nstogner changed the title WIP: Cache optimized routing (CHWBL) WIP: Cache optimized routing ("PrefixHash" load balancing - i.e. CHWBL) Dec 10, 2024
alpe (Contributor) left a comment

I haven't read all the code, but very nice work!

var buf bytes.Buffer
mw := multipart.NewWriter(&buf)
// Keep the same boundary as the initial request (probably not necessary)
mw.SetBoundary(boundary)
Contributor

nit: handle the error. SetBoundary can fail (for example, on an invalid boundary), and ignoring the error can lead to unexpected behavior.

internal/apiutils/request.go (resolved)

r.LoadBalancing = model.Spec.LoadBalancing

if r.LoadBalancing.Strategy == v1.PrefixHashStrategy && r.bodyPayload != nil {
Contributor

bodyPayload is always nil

Contributor Author (nstogner, Dec 12, 2024)

Good catch. I'll push a fix with tests.

Contributor Author

Done


type Request struct {
	Body        []byte
	bodyPayload map[string]interface{}
Contributor

This field is never set

Contributor Author

That's a big problem. Good catch.

Contributor Author

Fixed
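A sketch of the fix discussed in these two threads, assuming the struct shown in the diff and a hypothetical constructor `parseRequest`: decode `Body` once with `encoding/json` and keep the result in `bodyPayload`, so that the `r.bodyPayload != nil` guard on the PrefixHash path can actually pass. This is an illustration, not the merged code.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Request mirrors the fields shown in the diff.
type Request struct {
	Body        []byte
	bodyPayload map[string]interface{}
}

// parseRequest (hypothetical) stores the raw body for retries and also
// decodes it, so bodyPayload is populated for later routing decisions.
func parseRequest(body []byte) (*Request, error) {
	r := &Request{Body: body}
	// Unmarshal allocates the map because bodyPayload starts out nil.
	if err := json.Unmarshal(body, &r.bodyPayload); err != nil {
		return nil, fmt.Errorf("decoding request body: %w", err)
	}
	return r, nil
}

func main() {
	r, err := parseRequest([]byte(`{"model":"example-model","prompt":"Hello"}`))
	if err != nil {
		panic(err)
	}
	fmt.Println(r.bodyPayload["prompt"]) // Hello
}
```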

)

type Request struct {
	Body []byte
Contributor

Would be nice to avoid buffering but clearly out of scope for this PR.

Contributor Author

At the moment we always buffer the body so that requests can be retried. It would probably be nice to be able to disable this.

// endpoint is found with acceptable load.
defaultEndpoint = &ep
}
if chwblLoadOK(ep.inFlight.Load(), g.totalInFlight.Load(), len(g.endpoints), loadFactor) {
Contributor

nit: this can be done before setting the default endpoint.

Contributor Author

The default endpoint is one that can serve the request (it has the adapter) but might not meet the load requirement. It is used only if no endpoint meets the load requirement after all of them have been checked.

Adding a comment.
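The fallback behavior described above can be sketched as follows. The `endpoint` type, `pickEndpoint`, and the `loadOK` callback are hypothetical stand-ins for the PR's types. The sketch checks load before recording the default, as the reviewer suggested; since the default is only ever the first endpoint that has the adapter, this should pick the same endpoint as setting the default first.

```go
package main

import "fmt"

type endpoint struct {
	name     string
	adapters map[string]bool
	inFlight int64
}

// pickEndpoint (hypothetical) walks candidates in ring order. It returns the
// first endpoint that has the adapter and passes the load check; otherwise it
// falls back to the first endpoint that has the adapter at all.
func pickEndpoint(candidates []endpoint, adapter string, loadOK func(endpoint) bool) (endpoint, bool) {
	var defaultEndpoint *endpoint
	for i := range candidates {
		ep := &candidates[i]
		if adapter != "" && !ep.adapters[adapter] {
			continue // cannot serve this request at all
		}
		if loadOK(*ep) {
			return *ep, true
		}
		if defaultEndpoint == nil {
			defaultEndpoint = ep // can serve the request, but is overloaded
		}
	}
	if defaultEndpoint != nil {
		return *defaultEndpoint, true
	}
	return endpoint{}, false
}

func main() {
	eps := []endpoint{
		{name: "pod-a", adapters: map[string]bool{"lora-1": true}, inFlight: 10},
		{name: "pod-b", adapters: map[string]bool{}, inFlight: 0},
		{name: "pod-c", adapters: map[string]bool{"lora-1": true}, inFlight: 8},
	}
	ep, ok := pickEndpoint(eps, "lora-1", func(e endpoint) bool { return e.inFlight < 5 })
	fmt.Println(ep.name, ok) // pod-a true (overloaded, but first with the adapter)
}
```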

}
}

i++
Contributor

personal preference: i = (i + 1) % len(g.chwblSortedHashes) // wrap around

right := len(g.chwblSortedHashes) - 1
for left <= right {
middle := (left + right) / 2
if g.chwblSortedHashes[middle] == val {
Contributor

nit: read g.chwblSortedHashes[middle] just once.

Contributor Author

Good recommendation, done
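A sketch of the binary search from the diff with the nit applied. It assumes (the thread does not confirm this) that the search should return the index of the first hash greater than or equal to `val`, wrapping to index 0 past the end of the ring. `searchRing` is a hypothetical name; the standard library's `sort.Search` plus the same wrap check would be an equivalent alternative.

```go
package main

import "fmt"

// searchRing (hypothetical) reads sortedHashes[middle] once per probe and
// returns the first index whose hash is >= val, wrapping to 0 when val is
// larger than every hash. Callers must handle an empty ring separately.
func searchRing(sortedHashes []uint64, val uint64) int {
	left, right := 0, len(sortedHashes)-1
	for left <= right {
		middle := (left + right) / 2
		h := sortedHashes[middle] // read once
		if h == val {
			return middle
		}
		if h < val {
			left = middle + 1
		} else {
			right = middle - 1
		}
	}
	if left == len(sortedHashes) {
		return 0 // wrap around the ring
	}
	return left
}

func main() {
	hashes := []uint64{10, 20, 30, 40}
	fmt.Println(searchRing(hashes, 25)) // 2
	fmt.Println(searchRing(hashes, 45)) // 0 (wrapped)
}
```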

return true
}

avgLoad := float64(totalLoad+1) / float64(n)
Contributor

Can you add a comment explaining why +1 is used here and below?

Contributor Author

Done
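One possible form of the bounded-load check with the requested comments, as a hypothetical `loadOK`. It assumes the `+1` terms count the request currently being routed: that request will raise the total in-flight load by one (so it belongs in the average) and the chosen endpoint's load by one (so the endpoint must still be within the bound after accepting it). Rounding the bound up with `math.Ceil` is also an assumption here; it guarantees that an idle endpoint always passes.

```go
package main

import (
	"fmt"
	"math"
)

// loadOK (hypothetical) reports whether an endpoint with `load` in-flight
// requests may accept one more, given `totalLoad` in flight across `n`
// endpoints and a load factor such as 1.25.
func loadOK(load, totalLoad int64, n int, loadFactor float64) bool {
	if n == 0 {
		return false
	}
	if totalLoad == 0 {
		return true // nothing in flight anywhere: any endpoint is acceptable
	}
	// +1: include the incoming request in the average load.
	avgLoad := float64(totalLoad+1) / float64(n)
	threshold := math.Ceil(avgLoad * loadFactor)
	// +1: the endpoint must stay within the bound after taking the request.
	return float64(load+1) <= threshold
}

func main() {
	// 3 endpoints, 3 requests in flight, factor 1.25:
	// bound = ceil((3+1)/3 * 1.25) = ceil(1.67) = 2.
	fmt.Println(loadOK(1, 3, 3, 1.25)) // true: 1+1 <= 2
	fmt.Println(loadOK(2, 3, 3, 1.25)) // false: 2+1 > 2
}
```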

internal/loadbalancer/balance_least_load.go (resolved)
@nstogner nstogner changed the title WIP: Cache optimized routing ("PrefixHash" load balancing - i.e. CHWBL) Cache optimized routing ("PrefixHash" load balancing - i.e. CHWBL) Dec 16, 2024
@nstogner nstogner requested a review from samos123 December 16, 2024 02:54
	Replication int `json:"replication,omitempty"`
	// PrefixCharLength is the number of characters to count when building the prefix to hash.
	// +kubebuilder:validation:Optional
	// +kubebuilder:default=100
Contributor

Is this ignoring the system prompt when using chat completion?

Contributor

It is.

Contributor Author

Yep
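A sketch of prefix extraction that ignores the system prompt, as confirmed in this thread. Which message is used is an assumption here: the hypothetical `prefixForChat` takes the first `user` message and truncates it to `prefixCharLength` characters, counting runes so that multi-byte characters are not split.

```go
package main

import "fmt"

type message struct {
	Role    string
	Content string
}

// prefixForChat (hypothetical) skips system (and any other non-user)
// messages and returns at most prefixCharLength characters of the first user
// message. It returns "" if there is no user message.
func prefixForChat(msgs []message, prefixCharLength int) string {
	for _, m := range msgs {
		if m.Role != "user" {
			continue
		}
		r := []rune(m.Content)
		if len(r) > prefixCharLength {
			r = r[:prefixCharLength]
		}
		return string(r)
	}
	return ""
}

func main() {
	msgs := []message{
		{Role: "system", Content: "You are a helpful assistant."},
		{Role: "user", Content: "Summarize the following document: ..."},
	}
	fmt.Println(prefixForChat(msgs, 9)) // Summarize
}
```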

@nstogner nstogner merged commit 3beb635 into main Dec 18, 2024
16 checks passed
@nstogner nstogner deleted the chwbl branch December 18, 2024 01:41
3 participants