Attempting to delete a topic which has recently been created sometimes fails with a 500 error, with an internal broker error of "Maximum redirect reached" preceded by "Topic was already not existing" and "MetadataNotFoundException: Managed ledger not found" #12555
Comments
The following issues were all observed in response to similar testing: The condition that caused these issues appears to be interaction with various Pulsar entities (e.g. creating/deleting things in the management API, or attempting to create consumers) immediately after those entities were created, or immediately after entities with the same name were deleted. I think the number of issues observed points to a defect in the management API functionality in general. Considering the severity of these issues (in many cases it is possible to force a topic/namespace into a permanently corrupted state), I hope a resolution can be found for the common root cause rather than for each individual bug-inducing condition. I suspect that the common root cause is that many management API operations are asynchronous when they should be synchronous. Ideally, the resolution of all of these issues would be the same: a management API operation (any operation) should not return successfully until all observable side effects of that operation across a Pulsar cluster (including brokers, proxies, bookies, and ZK) have completed. All caches of metadata related to the operation (e.g. on all brokers/proxies in the cluster) should be cleared, and all persistent state (including ledger deletion, bookie cleanup, ZooKeeper metadata, etc.) should be updated during the management API operation, not afterwards. If that means that management API operations take many seconds or minutes, that's still vastly preferable to not knowing when it is safe to interact with a cluster again after performing "DDL"-type changes.
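The blocking behavior asked for in the comment above can be loosely approximated on the client side by polling after a delete. The following is only a sketch of that workaround, not a fix for the broker-side problem: `delete_topic` and `topic_exists` are hypothetical callables the caller would supply (for example, thin wrappers around whatever admin client is in use), and a poll answered by one broker does not prove that caches on other brokers or proxies have been cleared.

```python
import time
from typing import Callable


def wait_until(
    predicate: Callable[[], bool],
    timeout: float = 30.0,
    interval: float = 0.5,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `predicate` until it returns True or `timeout` seconds elapse.

    Returns True if the predicate succeeded, False on timeout.
    """
    deadline = clock() + timeout
    while True:
        if predicate():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


def delete_topic_and_wait(
    delete_topic: Callable[[str], None],   # hypothetical: issues the delete request
    topic_exists: Callable[[str], bool],   # hypothetical: checks whether the topic is still visible
    topic: str,
    **wait_kwargs,
) -> bool:
    """Issue a delete, then block until the topic is no longer visible.

    This only observes what `topic_exists` can see; it is an assumption
    that this is a good enough signal for the caller's purposes.
    """
    delete_topic(topic)
    return wait_until(lambda: not topic_exists(topic), **wait_kwargs)
```

A caller would treat a `False` return as "do not recreate or reuse this topic name yet" rather than as a hard failure.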
Additional info: when deleting a topic via a proxy rather than directly through the broker, the proxy returns a 502: Bad Gateway error when the broker prints out the stacktrace in this issue.
When connecting through a proxy, this error persists; it doesn't go away after a period of time.
Thank you @zbentley. This is great analysis. @merlimat @codelipenghui @Jason918 Please take a look. ^^^^^
@lhotari thanks! I haven't gone back through the repro steps of this family of issues in several months. In March of '22 I'll set aside some time to try to repro these on master, and will close things up if the linked fixes have helped. Thanks for taking a look!
The issue had no activity for 30 days, mark with Stale label.
This no longer repros on 2.9.1. Thanks for the fix!
Describe the bug
Identical to #12554, except the error that occurs first in the broker's log is different ("MetadataNotFoundException: Managed ledger not found"); the NoNodeError referenced in #12554 is not observed in the logs in this issue.
To Reproduce
Run the reproduction plan for #12551; sometimes no error will occur, sometimes the error described in that issue or in one of the related issues will occur, and sometimes this error will occur.
Expected behavior
Environment
Same environment as #12551
What my client sees
First (earliest) stacktrace in broker log that coincides with this error
Second (last in time) stacktrace in broker that coincides with this error