Storage controller proxies requests by intent leading to unavailability #9062
Labels: c/storage/controller (Component: Storage Controller), c/storage (Component: storage), t/bug (Issue Type: Bug)
I looked at the cloudbench run on 2024-09-18 (link).
Timeline
It failed to create a few branches:
Control plane proxied the request to the storage controller but got a 404 error (logs)
Storage controller received the request and proxied it to a pageserver, but got a 404 error back (logs)
Pageserver received the request, but didn't have the tenant attached to it (logs):
What happened?
The pageserver holding shard 0 of tenant fa8e211cd9784317f0143c713e3cbb09 was briefly marked as offline. In response, the intent state was updated for shard 0 of that tenant and reconciles were triggered. The reconcile for shard 0 of the tenant in question got stuck waiting on the semaphore.
When proxying requests to pageservers we use the intent state and hope that it matches reality (code).
That was not the case since we changed the intent state in response to the node going offline, so we proxied the request to
the wrong pageserver.
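To make the failure mode concrete, here is a minimal sketch in Python. It is a toy model, not the storage controller's actual code: `Pageserver`, `ShardState`, `Controller`, the node names `ps-a` and `ps-b`, and the `-shard0` suffix are all hypothetical names chosen for illustration. It shows that once intent is updated but the reconcile has not yet run, routing by intent hits a node that does not have the shard attached, while the previously observed location would still have served the request.

```python
from dataclasses import dataclass, field


@dataclass
class Pageserver:
    """Hypothetical pageserver that only serves shards attached to it."""
    name: str
    attached: set = field(default_factory=set)

    def handle(self, shard: str) -> int:
        # HTTP-like status: 404 if the shard is not attached to this node.
        return 200 if shard in self.attached else 404


@dataclass
class ShardState:
    intent: str    # node the controller wants the shard to live on
    observed: str  # node the shard was last known to be attached to


class Controller:
    """Toy controller that tracks intent and observed state per shard."""

    def __init__(self, nodes, shards):
        self.nodes = nodes
        self.shards = shards
        self.pending_reconciles = []

    def mark_offline(self, node, fallback):
        # Intent changes immediately; the actual migration is only queued.
        for shard, state in self.shards.items():
            if state.intent == node:
                state.intent = fallback
                self.pending_reconciles.append(shard)

    def reconcile(self, shard):
        # Make reality match intent: detach from the old node, attach to the new one.
        state = self.shards[shard]
        self.nodes[state.observed].attached.discard(shard)
        self.nodes[state.intent].attached.add(shard)
        state.observed = state.intent
        self.pending_reconciles.remove(shard)

    def proxy_by_intent(self, shard):
        return self.nodes[self.shards[shard].intent].handle(shard)

    def proxy_by_observed(self, shard):
        return self.nodes[self.shards[shard].observed].handle(shard)


def demo():
    shard = "fa8e211cd9784317f0143c713e3cbb09-shard0"  # suffix is illustrative
    nodes = {"ps-a": Pageserver("ps-a", {shard}), "ps-b": Pageserver("ps-b")}
    ctl = Controller(nodes, {shard: ShardState(intent="ps-a", observed="ps-a")})

    ctl.mark_offline("ps-a", fallback="ps-b")          # reconcile is queued, not run
    while_stuck_intent = ctl.proxy_by_intent(shard)     # routed to ps-b
    while_stuck_observed = ctl.proxy_by_observed(shard) # routed to ps-a
    ctl.reconcile(shard)                                # reconcile finally runs
    after_reconcile = ctl.proxy_by_intent(shard)
    return while_stuck_intent, while_stuck_observed, after_reconcile
```

In this model, `demo()` returns `(404, 200, 200)`: the request routed by intent fails only in the window between the intent update and the reconcile. The sketch assumes the briefly offline node still has the shard attached, as the pageserver in this incident was only marked offline for a short time.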