Better concurrent request handling for model host address #38

Merged (6 commits) on Jan 11, 2024


@alpe alpe commented Dec 15, 2023

As in #36, reconciliation may be affected by external requests. This refactoring helps by reducing lock contention.

  • Handle request context timeout
  • Optimise concurrent access in the endpoints type by separating the read/write lock from the notification lock
  • Added some tests and Go doc
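A minimal sketch of the idea in the list above, not the PR's actual code: the host list is guarded by a read/write lock, wake-ups go through a separately locked broadcast channel, and waiters give up when their context ends. The names `endpointGroup`, `setHosts`, `awaitHost`, and `demo` are hypothetical and exist only for illustration.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
	"time"
)

// endpointGroup is a hypothetical stand-in for the endpoints type in the PR.
type endpointGroup struct {
	mtx   sync.RWMutex // guards hosts only
	hosts []string

	bmtx  sync.Mutex    // guards bcast only
	bcast chan struct{} // closed and replaced on every update
}

func newEndpointGroup() *endpointGroup {
	return &endpointGroup{bcast: make(chan struct{})}
}

// setHosts replaces the host list, then wakes all current waiters.
func (g *endpointGroup) setHosts(hosts []string) {
	g.mtx.Lock()
	g.hosts = hosts
	g.mtx.Unlock()

	g.bmtx.Lock()
	close(g.bcast)
	g.bcast = make(chan struct{})
	g.bmtx.Unlock()
}

// awaitHost returns a host as soon as one exists,
// or ctx.Err() if the context ends first.
func (g *endpointGroup) awaitHost(ctx context.Context) (string, error) {
	for {
		// Take the notification channel before reading the state,
		// so an update between the two steps is not missed.
		g.bmtx.Lock()
		ch := g.bcast
		g.bmtx.Unlock()

		g.mtx.RLock()
		found := len(g.hosts) > 0
		var host string
		if found {
			host = g.hosts[0]
		}
		g.mtx.RUnlock()
		if found {
			return host, nil
		}

		select {
		case <-ch: // hosts changed, check again
		case <-ctx.Done():
			return "", ctx.Err()
		}
	}
}

// demo returns the host seen by a waiter that is released by a later
// update, and whether a waiter on an empty group hit its deadline.
func demo() (string, bool) {
	g := newEndpointGroup()
	go func() {
		time.Sleep(10 * time.Millisecond)
		g.setHosts([]string{"10.0.0.1:8080"})
	}()
	host, _ := g.awaitHost(context.Background())

	empty := newEndpointGroup()
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Millisecond)
	defer cancel()
	_, err := empty.awaitHost(ctx)
	return host, errors.Is(err, context.DeadlineExceeded)
}

func main() {
	host, timedOut := demo()
	fmt.Println(host, timedOut)
}
```

Because a waiter blocks in `select` rather than on a lock, it holds neither mutex while waiting and returns as soon as its context is done.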

I have also added a benchmark showing that the new rwlock takes roughly a third less time per operation than before on my box (154.6 vs. 234.2 ns/op). This is all within nanoseconds, though, and does not really matter:

new: BenchmarkEndpointGroup-12    	 7667690	       154.6 ns/op
old: BenchmarkEndpointGroup-12    	 4968279	       234.2 ns/op
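The benchmark from the PR is not shown on this page. As a rough illustration only, a read-heavy comparison between a plain mutex and a read/write mutex could be sketched as follows. The types `mutexStore` and `rwStore` and the helper `benchReads` are hypothetical, and the numbers depend heavily on the machine and the read/write mix, so a read/write mutex is not guaranteed to come out ahead.

```go
package main

import (
	"fmt"
	"sync"
	"testing"
)

// mutexStore guards reads with an exclusive lock (hypothetical "old" variant).
type mutexStore struct {
	mu    sync.Mutex
	hosts []string
}

func (s *mutexStore) first() string {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.hosts[0]
}

// rwStore lets readers share the lock (hypothetical "new" variant).
type rwStore struct {
	mu    sync.RWMutex
	hosts []string
}

func (s *rwStore) first() string {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return s.hosts[0]
}

// benchReads measures concurrent calls to read from several goroutines.
func benchReads(read func() string) testing.BenchmarkResult {
	return testing.Benchmark(func(b *testing.B) {
		b.RunParallel(func(pb *testing.PB) {
			for pb.Next() {
				_ = read()
			}
		})
	})
}

func main() {
	m := &mutexStore{hosts: []string{"10.0.0.1:8080"}}
	rw := &rwStore{hosts: []string{"10.0.0.1:8080"}}
	fmt.Println("mutex:  ", benchReads(m.first))
	fmt.Println("rwmutex:", benchReads(rw.first))
}
```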

The key benefit of this PR is handling request timeouts.

@nstogner nstogner left a comment

Thanks for the PR! This is a tricky part of the codebase. I think we need a solution that waits until an endpoint is available but also respects the context. I think there might be a leak in here as implemented. I would love to get your thoughts.

pkg/endpoints/manager.go (outdated, resolved)
pkg/endpoints/endpoints.go (outdated, resolved)

alpe commented Dec 21, 2023

Thank you very much for the feedback! I applied some changes, but I think this needs better testing before it can be merged. It is a much more complex beast than I initially thought.


samos123 commented Dec 21, 2023

Agree on the additional testing. Looks like the larger scale (300 concurrent requests) system test is catching some kind of issue: https://github.com/substratusai/lingo/actions/runs/7286882573/job/19856443938?pr=38#step:5:1098

I can help with additions to the system tests if you have specific scenarios that should be tested. The original system tests were added after concurrent request handling broke; they ensure that scale-up and scale-down work as expected and that requests and responses are returned for a realistic backend.

Edit: I've triggered a re-run of the system tests to ensure it wasn't just a flaky test.

manager.getEndpoints(myService).
	setIPs(map[string]struct{}{myService: {}}, map[string]int32{myPort: 1})

testCases := map[string]struct {
Contributor

I think these would be better written as individual tests instead of a table of test cases. They each test different things. For example, for the timeout case it would be good to assert that the returned error is due to context cancellation, and that assertion would only be used for that one test case.

Contributor Author

I agree that the error check was a bit vague, but IMHO it makes sense to have a spec for the methods that defines all cases. I find it more readable.
But to be fair, I use table tests as my default structure for unit tests and may be biased. If this is very important to you, I can refactor. The error type is checked now.

Contributor

Not a strong opinion, good with this.
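One way to keep the table while still asserting the error type per case is to add an expected-error field to each spec and compare it with `errors.Is`. The sketch below illustrates that pattern under assumptions: `awaitHost` is a hypothetical stand-in for the function under test, the field names are invented, and the cases run from a plain function rather than the PR's real test file.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// awaitHost is a hypothetical stand-in for the function under test.
func awaitHost(ctx context.Context, ready <-chan string) (string, error) {
	select {
	case h := <-ready:
		return h, nil
	case <-ctx.Done():
		return "", ctx.Err()
	}
}

// spec describes one table entry, including the exact error expected.
type spec struct {
	host      string        // sent on the ready channel when non-empty
	timeout   time.Duration // context deadline
	cancelNow bool          // cancel the context before calling
	expHost   string
	expErr    error // nil means success is expected
}

// runCases executes every spec and returns the names of failing cases.
func runCases() []string {
	specs := map[string]spec{
		"host available": {
			host: "10.0.0.1:8080", timeout: time.Second,
			expHost: "10.0.0.1:8080",
		},
		"deadline exceeded": {
			timeout: 10 * time.Millisecond,
			expErr:  context.DeadlineExceeded,
		},
		"cancelled by caller": {
			timeout: time.Second, cancelNow: true,
			expErr: context.Canceled,
		},
	}
	var failed []string
	for name, s := range specs {
		ctx, cancel := context.WithTimeout(context.Background(), s.timeout)
		if s.cancelNow {
			cancel()
		}
		ready := make(chan string, 1)
		if s.host != "" {
			ready <- s.host
		}
		got, err := awaitHost(ctx, ready)
		cancel()
		// errors.Is(nil, nil) is true, so success cases need no special branch.
		if !errors.Is(err, s.expErr) || got != s.expHost {
			failed = append(failed, name)
		}
	}
	return failed
}

func main() {
	fmt.Println("failed cases:", runCases())
}
```

In a real test file the loop body would typically sit inside `t.Run(name, ...)`, so each case is reported separately.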

@@ -56,7 +56,13 @@ func (h *Handler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
defer complete()

log.Println("Waiting for IPs", id)
- host := h.Endpoints.GetHost(r.Context(), deploy, "http")
+ host, err := h.Endpoints.AwaitHostAddress(r.Context(), deploy, "http")
Contributor

It might be more robust to check this error instead of assuming it is a timeout error. In the future, the error logic in the invoked function might be updated to return different error types without this call site being reconsidered. Also, it is not always a timeout today: if the caller cancels the request, the context is cancelled (not technically a timeout).

log.Printf("error while finding the host address %v", err)
switch {
case errors.Is(err, context.Canceled):
	w.WriteHeader(http.StatusInternalServerError)
Contributor Author

I did some research online into which HTTP status code makes sense, but it looks like this case is rarely handled explicitly. Alternatively, 499 was suggested, although it is not part of the Go stdlib.
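To make the options in this thread concrete, here is a hedged sketch of mapping the error to a status code with `errors.Is`. It is not the merged code: the snippet above answers a cancelled context with 500, whereas this sketch tries the 499 alternative mentioned here. The constant `statusClientClosedRequest`, the function `statusForAwaitError`, and the choice of 504 for an exceeded deadline are all assumptions made for illustration; 499 is a non-standard code used by nginx, and `net/http` defines no constant for it.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net/http"
)

// statusClientClosedRequest is the non-standard 499 code used by nginx.
// net/http has no constant for it, so it is declared here (assumption).
const statusClientClosedRequest = 499

// statusForAwaitError picks a response status for an error returned while
// waiting for a host address. The mapping is illustrative only.
func statusForAwaitError(err error) int {
	switch {
	case err == nil:
		return http.StatusOK
	case errors.Is(err, context.Canceled):
		// The caller went away; nobody is likely to read the response.
		return statusClientClosedRequest
	case errors.Is(err, context.DeadlineExceeded):
		// No backend became available in time (assumed mapping).
		return http.StatusGatewayTimeout
	default:
		// Unknown errors stay visible instead of being reported as timeouts.
		return http.StatusInternalServerError
	}
}

// demoStatuses returns the status for success, cancellation, deadline,
// and an unrelated error. Wrapped errors are still matched by errors.Is.
func demoStatuses() [4]int {
	return [4]int{
		statusForAwaitError(nil),
		statusForAwaitError(fmt.Errorf("await: %w", context.Canceled)),
		statusForAwaitError(fmt.Errorf("await: %w", context.DeadlineExceeded)),
		statusForAwaitError(errors.New("no such deployment")),
	}
}

func main() {
	fmt.Println(demoStatuses())
}
```

The `default` branch addresses the review concern: if the awaited function later returns new error types, they are not silently treated as timeouts.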

@nstogner nstogner merged commit cbfa863 into substratusai:main Jan 11, 2024
3 checks passed
@alpe alpe deleted the req_context branch January 11, 2024 17:42
3 participants