Storage controller proxies requests by intent leading to unavailability #9062

Closed
VladLazar opened this issue Sep 19, 2024 · 0 comments · Fixed by #9065
Labels
c/storage/controller Component: Storage Controller c/storage Component: storage t/bug Issue Type: Bug

VladLazar commented Sep 19, 2024

I looked at the cloudbench run on 2024-09-18 (link).

Timeline

It failed to create a few branches:

2024-09-18T22:06:38.858Z
ERROR
cloudbench	creating branch for project failed: decode response: error: code 500: {Code: Message:unknown error}	
{"unit": 5252, "project_id": "wild-frog-19741510", "times": 6, 
 "error": "creating branch for project failed: decode response: error: code 500: {Code: Message:unknown error}"}

Control plane proxied the request to the storage controller but got a 404 error (logs)

{"level":"ERR","ts":"2024-09-18T22:05:45.992Z","logger":"publicapiv2","message":"incoming request finished with error: internal: UNKNOWN: could not create project-branch: pageserver error","http_meth":"POST","http_path":"/api/v2/projects/wild-frog-19741510/branches","route":"CreateProjectBranch","request_id":"b3be7202-08c8-4623-b113-23383b397795","trace_id":"QDXxW2SC4esgx4jLzLQgsp","project_id":"wild-frog-19741510","account_id":"3eeaaef0-50fa-4074-8ed7-0a20f097d9fb","ingress_duration_ms":693,"status":500,"account_id":"3eeaaef0-50fa-4074-8ed7-0a20f097d9fb","status":404,"message":"NotFound: tenant fa8e211cd9784317f0143c713e3cbb09","error":"incoming request finished with error: internal: UNKNOWN: could not create project-branch: pageserver error"}

Storage controller received the request and proxied it to a pageserver, but got a 404 error back (logs)

2024-09-18T22:05:45.471294Z  INFO request{method=GET path=/v1/tenant/fa8e211cd9784317f0143c713e3cbb09/timeline/4a0c742862f8ec32f703684a4f7f3e97 request_id=b3be7202-08c8-4623-b113-23383b397795}: Proxying request for tenant fa8e211cd9784317f0143c713e3cbb09 (/v1/tenant/fa8e211cd9784317f0143c713e3cbb09/timeline/4a0c742862f8ec32f703684a4f7f3e97)
	
2024-09-18T22:05:45.991205Z  INFO request{method=GET path=/v1/tenant/fa8e211cd9784317f0143c713e3cbb09/timeline/4a0c742862f8ec32f703684a4f7f3e97 request_id=b3be7202-08c8-4623-b113-23383b397795}: Request handled, status: 404 Not Found

Pageserver received the request, but didn't have the tenant attached to it (logs):

2024-09-18T22:05:45.487653Z  INFO request{method=GET path=/v1/tenant/fa8e211cd9784317f0143c713e3cbb09/timeline/4a0c742862f8ec32f703684a4f7f3e97 request_id=17f9d525-64e1-4244-b743-362706975271}: Error processing HTTP request: NotFound: tenant fa8e211cd9784317f0143c713e3cbb09

What happened?

Pageserver holding shard 0 for tenant fa8e211cd9784317f0143c713e3cbb09 was briefly marked as offline:

2024-09-18T22:05:44.162464Z  INFO spawn_heartbeat_driver: Node 9355 transition to offline
...
2024-09-18T22:05:52.695691Z  INFO spawn_heartbeat_driver: Node 9355 transition to active

In response to this, the intent state was updated for shard 0 of tenant fa8e211cd9784317f0143c713e3cbb09 and reconciles
were triggered. The reconcile for shard 0 of the tenant in question got stuck waiting on the semaphore:

2024-09-18T22:05:44.179449Z  INFO spawn_heartbeat_driver: Concurrency limited: enqueued for reconcile later tenant_id=fa8e211cd9784317f0143c713e3cbb09 shard_id=0000

When proxying requests to pageservers we use the intent state and hope that it matches reality (code).

That was not the case here: we had changed the intent state in response to the node going offline, but the reconcile that
would make reality match it was still queued, so we proxied the request to the wrong pageserver.
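
To make the mismatch concrete, here is a minimal sketch of the situation in Rust. It is not the storage controller's code: `ShardState`, its fields, and node 9356 (the replacement node) are hypothetical names and values chosen for illustration. Only node 9355 comes from the logs above.

```rust
// Hypothetical, simplified model; these names are not taken from the
// storage controller source.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct NodeId(u64);

struct ShardState {
    /// Where the scheduler wants the shard to be attached.
    intent_attached: Option<NodeId>,
    /// Where the shard is actually attached, as of the last completed reconcile.
    observed_attached: Option<NodeId>,
}

impl ShardState {
    /// Proxy target chosen from intent alone, as described above.
    fn proxy_target_by_intent(&self) -> Option<NodeId> {
        self.intent_attached
    }

    /// True while a reconcile is still needed to make reality match intent.
    fn reconcile_pending(&self) -> bool {
        self.intent_attached != self.observed_attached
    }
}

/// State after node 9355 was marked offline: intent has moved to another
/// node (9356 is an assumed ID), but the reconcile is still queued behind
/// the concurrency limit, so the shard is still attached to 9355.
fn shard_after_node_offline() -> ShardState {
    ShardState {
        intent_attached: Some(NodeId(9356)),
        observed_attached: Some(NodeId(9355)),
    }
}
```

In this state `proxy_target_by_intent()` returns the node that does not have the tenant attached yet, which is the 404 seen in the pageserver log.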

@VladLazar VladLazar added c/storage Component: storage c/storage/controller Component: Storage Controller t/bug Issue Type: Bug labels Sep 19, 2024
jcsp added a commit that referenced this issue Sep 25, 2024
…9065)

## Problem

These commits are split off from
https://github.com/neondatabase/neon/pull/8971/commits where I was
fixing this to make a better scale test pass -- Vlad also independently
recognized these issues with cloudbench in
#9062.

1. The storage controller proxies GET requests to pageservers based on
their intent, not the ground truth of where they're really attached.
2. Proxied requests can race with scheduling to tenants, resulting in
404 responses if the request hits the wrong pageserver.

Closes: #9062

## Summary of changes

1. If a shard has a running reconciler, then use the database
generation_pageserver to decide who to proxy the request to
2. If such a request gets a 404 response and its scheduled node has
changed since the request was dispatched.
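
The two changes can be sketched roughly as follows. This is an illustration under assumptions, not the code from the pull request: `Shard`, `proxy_target`, and `is_suspect_404` are hypothetical names, and `generation_pageserver` is modelled as a plain field standing in for the database column. What the controller does once the second condition holds is not stated above, so the sketch only detects it.

```rust
// Hypothetical, simplified model; these names are not taken from the
// storage controller source.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct NodeId(u64);

struct Shard {
    /// Where the scheduler wants the shard to be attached.
    intent_attached: Option<NodeId>,
    /// Stand-in for the database `generation_pageserver` value.
    generation_pageserver: Option<NodeId>,
    /// Whether a reconciler is currently running for this shard.
    reconciler_running: bool,
}

/// Change 1: while a reconciler is running, intent may be ahead of reality,
/// so prefer the persisted `generation_pageserver` as the proxy target.
fn proxy_target(shard: &Shard) -> Option<NodeId> {
    if shard.reconciler_running {
        shard.generation_pageserver
    } else {
        shard.intent_attached
    }
}

/// Change 2 (condition only): a 404 is suspect if the shard's scheduled node
/// changed between dispatching the request and receiving the response.
fn is_suspect_404(status: u16, node_at_dispatch: NodeId, node_now: Option<NodeId>) -> bool {
    status == 404 && node_now != Some(node_at_dispatch)
}
```

Note that a 404 from a node that is still the scheduled node is treated as genuine here; only a 404 that coincides with a scheduling change is flagged.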