Safely stop (close replication-producer) and remove replicator #152

Merged — 1 commit merged into apache:master on Jan 10, 2017

Conversation

rdhabalia (Contributor)

Motivation

  • Safely close the replication producer while disconnecting the replicator.
  • When a replication cluster is removed from a topic, the broker sometimes fails to delete the replicator cursor and then tries to restart the replicator even though the cursor has been closed. This causes the broker to keep retrying cursor recovery for a replication cluster that has already been removed.

Modifications

  • On replicator disconnect: close the producer even if it has not finished connecting to the remote cluster yet.
  • Do not restart the replicator when deleting its cursor fails (see the sketch after the Result list).

Result

  • The producer can no longer reconnect after the replicator has been disconnected.
  • The broker no longer keeps retrying cursor recovery for a replication cluster that has already been deleted.
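
A rough, self-contained sketch of those two behaviors (the `Producer`, `Replicator`, and `ReplicatedTopic` types below are simplified stand-ins, not the actual Pulsar classes):

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical, simplified types; not the actual Pulsar classes.
interface Producer {
    CompletableFuture<Void> closeAsync();
}

class Replicator {
    // The producer may still be in the middle of connecting to the remote cluster.
    final CompletableFuture<Producer> producerFuture = new CompletableFuture<>();

    CompletableFuture<Void> disconnect() {
        // Close the producer even if the connection has not completed yet,
        // so that a late connection cannot bring the replicator back to life.
        if (!producerFuture.isDone()) {
            producerFuture.cancel(false);
            return CompletableFuture.completedFuture(null);
        }
        return producerFuture.thenCompose(Producer::closeAsync)
                // The connection attempt failed earlier: nothing to close.
                .exceptionally(ex -> null);
    }
}

class ReplicatedTopic {
    final Map<String, Replicator> replicators = new ConcurrentHashMap<>();

    // Called when deleting the cursor of a removed replication cluster fails.
    void deleteCursorFailed(String remoteCluster) {
        // Before the fix the broker restarted the producer here, which kept
        // retrying cursor recovery for a cluster that was already removed;
        // now the replicator is simply dropped.
        replicators.remove(remoteCluster);
    }
}
```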

@rdhabalia rdhabalia added the type/bug label Dec 23, 2016
@rdhabalia rdhabalia added this to the 1.16 milestone Dec 23, 2016
@rdhabalia rdhabalia self-assigned this Dec 23, 2016
@merlimat merlimat left a comment

👍 Nice catch on this one

@@ -666,8 +666,7 @@ public void deleteCursorComplete(Object ctx) {
     @Override
     public void deleteCursorFailed(ManagedLedgerException exception, Object ctx) {
         log.error("[{}] Failed to delete cursor {}", topic, name);
-        // Connect the producers back
-        replicators.get(remoteCluster).startProducer();
+        replicators.remove(remoteCluster);

Just one thing here. If the delete cursor fails during the topic load, then the topic load fails and that is fine since it will be re-tried anyway.

If it fails after a policies change though, it will not be retried.
I think the whole PersistentTopic.checkReplication() should be rescheduled after a while to make sure we clean up the cursor properly.

Just retrying the cursor delete could be dangerous, since the remote cluster could be added again before we retry, but the checkReplication() is idempotent and uses the latest configuration.

rdhabalia (Contributor, Author)

Yes, made the change to schedule checkReplication() when deleteCursorFailed is hit during onPoliciesUpdate().

@merlimat merlimat left a comment

Looks good, just a minor thing in the log messages.

-        return checkReplication();
+        CompletableFuture<Void> result = new CompletableFuture<Void>();
+        checkReplication().thenAccept(res -> {
+            log.info("Policies updated successfully {}", data);

We should include the topic name here. Also, I don't think that `data` will print the JSON content, and in any case it would be printing too much.
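
For example, the log line could carry the topic name and drop the policy payload (a suggested form only, not necessarily the exact line that was merged):

```java
log.info("[{}] Policies updated successfully", topic);
```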

log.info("Policies updated successfully {}", data);
result.complete(null);
}).exceptionally(th -> {
log.error("Policies update failed {} {}, scheduled retry in {} seconds", data, th.getMessage(),

Add topic name here as well

+        }).exceptionally(th -> {
+            log.error("Policies update failed {} {}, scheduled retry in {} seconds", data, th.getMessage(),
+                    POLICY_UPDATE_FAILURE_RETRY_TIME_SECONDS, th);
+            brokerService.executor().schedule(this::checkReplication, POLICY_UPDATE_FAILURE_RETRY_TIME_SECONDS,

Actually, this will only retry one time and then give up. Should we keep retrying instead? There might be some intermittent error (e.g. failures to write to ZK) that will get fixed after a while.

@rdhabalia rdhabalia Jan 10, 2017

Yes, updated the change to retry on failure.
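
For illustration, a self-rescheduling retry can look roughly like the sketch below; the `ReplicationChecker` class and `checkReplicationWithRetry()` method are invented for this sketch, while the merged change wires the equivalent logic into PersistentTopic's policies-update path around `checkReplication()`:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

class ReplicationChecker {
    static final long RETRY_SECONDS = 60; // stand-in for POLICY_UPDATE_FAILURE_RETRY_TIME_SECONDS

    final ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();

    // Stand-in for PersistentTopic.checkReplication(): idempotent and always
    // reads the latest replication configuration.
    CompletableFuture<Void> checkReplication() {
        return CompletableFuture.completedFuture(null);
    }

    void checkReplicationWithRetry() {
        checkReplication().exceptionally(th -> {
            // Each failed attempt schedules the next one, so a transient error
            // (e.g. a ZooKeeper write failure) is retried until it succeeds
            // rather than being given up on after a single attempt.
            executor.schedule(this::checkReplicationWithRetry, RETRY_SECONDS, TimeUnit.SECONDS);
            return null;
        });
    }
}
```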

@rdhabalia rdhabalia force-pushed the del_cursor branch 3 times, most recently from cc7c9c0 to 4b73e9a on January 10, 2017 at 01:34
@yahoocla: CLA is valid!

@merlimat merlimat merged commit a6186b2 into apache:master Jan 10, 2017
@rdhabalia rdhabalia deleted the del_cursor branch January 23, 2017 22:10
sijie pushed a commit to sijie/pulsar that referenced this pull request Mar 4, 2018
* changing publish function to accept string for classname

* cleaning up

* making change for python api
dlg99 added a commit to dlg99/pulsar that referenced this pull request Nov 8, 2022
…tion (apache#17915) (apache#152)

(cherry picked from commit 0854032)

Fixes apache#9962 

### Motivation

Offloaded ledgers can be orphaned on topic deletion. 

This is a redo of apache#15914, which conflicted with the concurrently merged apache#17736, resulting in apache#17889.

apache#17736 decided not to allow managed ledger trimming for fenced managed ledgers, because in many cases fencing indicates a problem that should stop all operations on the managed ledger. At the same time, fencing is also applied before deletion starts, so the trimming that is part of the deletion process could not proceed.
After discussion with @eolivelli I introduced a new state, FencedForDeletion, which acts like the Fenced state except for trimming/deletion purposes.
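
As a rough illustration of that state (the enum values and checks below are simplified, not the actual ManagedLedgerImpl code):

```java
// Simplified sketch of the managed-ledger states relevant to this change.
enum State {
    OPEN, FENCED, FENCED_FOR_DELETION, CLOSED
}

class ManagedLedgerSketch {
    volatile State state = State.OPEN;

    boolean isFenced() {
        // Both fenced states block normal reads and writes.
        return state == State.FENCED || state == State.FENCED_FOR_DELETION;
    }

    boolean isTrimmingAllowed() {
        // Plain FENCED usually signals a problem that should stop all work on
        // the ledger, but FENCED_FOR_DELETION is set by the deletion path
        // itself, so trimming (and offloaded-ledger cleanup) may proceed.
        return state == State.OPEN || state == State.FENCED_FOR_DELETION;
    }
}
```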

### Modifications

The topic is truncated before deletion so that offloaded ledgers are deleted properly; the deletion fails if truncation fails.

### Verifying this change

local fork tests: #1

- [ ] Make sure that the change passes the CI checks.

This change added integration tests

### Does this pull request potentially affect one of the following parts:

Nothing changed in the options, but the admin CLI will implicitly run truncate before topic delete.

### Documentation

- [x] `doc-not-needed`