Safely stop (close replication-producer) and remove replicator #152
Conversation
👍 Nice catch on this one
@@ -666,8 +666,7 @@ public void deleteCursorComplete(Object ctx) {
    @Override
    public void deleteCursorFailed(ManagedLedgerException exception, Object ctx) {
        log.error("[{}] Failed to delete cursor {}", topic, name);
        // Connect the producers back
        replicators.get(remoteCluster).startProducer();
        replicators.remove(remoteCluster);
Just one thing here. If the cursor delete fails during the topic load, then the topic load fails and that is fine, since it will be re-tried anyway. If it fails after a policies change, though, it will not be retried.

I think the whole PersistentTopic.checkReplication() should be rescheduled after a while to make sure we clean up the cursor properly. Just retrying the cursor delete could be dangerous, since the remote cluster could be added again before we retry, but checkReplication() is idempotent and uses the latest configuration.
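For illustration, a minimal sketch of that rescheduling idea (not the exact PR code; the constant CURSOR_DELETE_RETRY_TIME_SECONDS is a hypothetical placeholder, and the use of brokerService.executor() mirrors the snippets quoted later in this review):

```java
// Sketch only: on cursor-delete failure, drop the replicator and reschedule the
// idempotent checkReplication() instead of retrying the bare cursor delete.
@Override
public void deleteCursorFailed(ManagedLedgerException exception, Object ctx) {
    log.error("[{}] Failed to delete cursor {}", topic, name, exception);
    replicators.remove(remoteCluster);
    // checkReplication() re-reads the current policies, so a later run will
    // either recreate the replicator or finish the cleanup.
    brokerService.executor().schedule(this::checkReplication,
            CURSOR_DELETE_RETRY_TIME_SECONDS, TimeUnit.SECONDS);
}
```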
Yes, made the change to schedule checkReplication() when deleteCursorFailed is triggered from onPoliciesUpdate().
7d41c04 to efc2de2
Looks good, just a minor thing in the log messages.
    return checkReplication();
    CompletableFuture<Void> result = new CompletableFuture<Void>();
    checkReplication().thenAccept(res -> {
        log.info("Policies updated successfully {}", data);
We should include the topic name here. Also, I don't think that data will print the JSON content, and anyway it would be printing too much.
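Something along these lines, for example (a suggestion only, not the final code):

```java
// Identify the topic instead of serializing the whole policies object.
log.info("[{}] Policies updated successfully", topic);
```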
        log.info("Policies updated successfully {}", data);
        result.complete(null);
    }).exceptionally(th -> {
        log.error("Policies update failed {} {}, scheduled retry in {} seconds", data, th.getMessage(),
Add topic name here as well
    }).exceptionally(th -> {
        log.error("Policies update failed {} {}, scheduled retry in {} seconds", data, th.getMessage(),
                POLICY_UPDATE_FAILURE_RETRY_TIME_SECONDS, th);
        brokerService.executor().schedule(this::checkReplication, POLICY_UPDATE_FAILURE_RETRY_TIME_SECONDS,
Actually, this will only retry once and then give up. Should we keep retrying instead? There might be some intermittent error (e.g. a failure to write to ZK) that will get fixed after a while.
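As a standalone illustration of the keep-retrying pattern being suggested (hypothetical wiring; only the idea of rescheduling the idempotent check after every failure mirrors the PR):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class RetryUntilSuccess {
    private static final long RETRY_SECONDS = 60;
    private static final ScheduledExecutorService executor =
            Executors.newSingleThreadScheduledExecutor();

    // Stand-in for the real, idempotent PersistentTopic.checkReplication().
    static CompletableFuture<Void> checkReplication() {
        return CompletableFuture.completedFuture(null);
    }

    // On every failure, schedule another full check instead of giving up after
    // a single retry; an intermittent error (e.g. a ZK write failure) eventually
    // clears and the next scheduled run succeeds.
    static void checkReplicationWithRetry() {
        checkReplication().exceptionally(th -> {
            executor.schedule(RetryUntilSuccess::checkReplicationWithRetry,
                    RETRY_SECONDS, TimeUnit.SECONDS);
            return null;
        });
    }

    public static void main(String[] args) {
        checkReplicationWithRetry();
        executor.shutdown(); // the broker's shared executor would stay alive instead
    }
}
```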
Yes, updated the change to retry on failure.
cc7c9c0 to 4b73e9a
CLA is valid!