[fix][tiered-storage] Don't cleanup data when offload met Metastore exception #17512

zymap · 2022-09-07T08:55:43Z

Motivation

There are two ways the offloaded data can end up being cleaned up. One is hitting an offload conflict exception; the other is completeLedgerInfoForOffloaded reaching the max retry count and throwing a ZooKeeper exception.

We retry the ZooKeeper operation on a connection loss exception. We should be careful with this exception, because we may lose data if the metadata was in fact updated successfully.

When a MetaStore exception happens, we cannot be sure whether the metadata update failed or not. Because we retry on connection loss, it is possible to get a BadVersion or another exception after retrying.

So we don't clean up the data if this happens.
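To make the ambiguity concrete, here is a minimal toy model of it. Everything in it (`ToyStore`, `updateWithRetry`, the exception classes) is hypothetical and stands in for a versioned metadata store in general; it is not the ZooKeeper client API or Pulsar's MetaStore. The first write is applied but its acknowledgement is lost, so the retry with the same expected version fails with a version mismatch even though the metadata already says "offloaded".

```java
// Toy model only: all names here are hypothetical, not ZooKeeper or Pulsar classes.
class RetryAmbiguitySketch {
    static class ConnectionLossException extends Exception {}
    static class BadVersionException extends Exception {}

    /** Versioned store: a write is applied only if expectedVersion matches. */
    static class ToyStore {
        private int version = 0;
        private String value = "not-offloaded";
        private boolean dropNextAck = true;

        void compareAndSet(String newValue, int expectedVersion)
                throws ConnectionLossException, BadVersionException {
            if (expectedVersion != version) {
                throw new BadVersionException();
            }
            value = newValue;
            version++;
            if (dropNextAck) {
                dropNextAck = false;
                // The write was applied, but the caller never sees the ack.
                throw new ConnectionLossException();
            }
        }

        String value() {
            return value;
        }
    }

    /** Retries once on connection loss, reusing the same expected version. */
    static String updateWithRetry(ToyStore store, String newValue, int expectedVersion) {
        for (int attempt = 0; attempt < 2; attempt++) {
            try {
                store.compareAndSet(newValue, expectedVersion);
                return "OK";
            } catch (ConnectionLossException e) {
                // Outcome unknown: fall through and retry.
            } catch (BadVersionException e) {
                return "BadVersion";
            }
        }
        return "ConnectionLoss";
    }

    public static void main(String[] args) {
        ToyStore store = new ToyStore();
        String outcome = updateWithRetry(store, "offloaded", 0);
        // prints "BadVersion, stored value = offloaded"
        System.out.println(outcome + ", stored value = " + store.value());
    }
}
```

In this model the caller sees a failure while the store already points at the offloaded copy, which is the case where deleting the offloaded data would lose it.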

Modifications

Don't delete the data if a MetaStore exception occurs.
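The rule can be sketched as a small decision function. This is an illustration under assumptions, not the code in ManagedLedgerImpl.java: the class name, the method `shouldCleanupOffloadedData`, and the two exception classes below are hypothetical stand-ins, and treating every non-MetaStore failure as safe to clean up is a simplification.

```java
import java.util.concurrent.CompletionException;

// Hypothetical, simplified stand-ins; not the real Pulsar classes.
class OffloadCleanupSketch {

    /** Stand-in for a metadata store failure whose real outcome is unknown. */
    static class MetaStoreException extends Exception {
        MetaStoreException(String message) {
            super(message);
        }
    }

    /** Stand-in for an offload that conflicted with another offload. */
    static class OffloadConflictException extends Exception {
        OffloadConflictException(String message) {
            super(message);
        }
    }

    /** Returns true if the offloaded copy may be deleted after a failed offload. */
    static boolean shouldCleanupOffloadedData(Throwable error) {
        // Unwrap exceptions that were wrapped by a CompletableFuture stage.
        Throwable cause = error;
        while (cause instanceof CompletionException && cause.getCause() != null) {
            cause = cause.getCause();
        }
        if (cause instanceof MetaStoreException) {
            // The metadata update may have been applied; keep the data and log it.
            System.err.println("Metadata update outcome unknown, skipping cleanup: "
                    + cause.getMessage());
            return false;
        }
        // Assumption: any other failure means the metadata does not reference the copy.
        return true;
    }

    public static void main(String[] args) {
        // prints "false"
        System.out.println(shouldCleanupOffloadedData(
                new MetaStoreException("BadVersion after retry")));
        // prints "true"
        System.out.println(shouldCleanupOffloadedData(
                new OffloadConflictException("conflict")));
    }
}
```

Keeping the data in the uncertain case trades a possible orphaned copy in the offload store for not deleting data that the ledger metadata may already reference.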

Verifying this change

Make sure that the change passes the CI checks.

Does this pull request potentially affect one of the following parts:

If yes was chosen, please highlight the changes

Dependencies (does it add or upgrade a dependency): (yes / no)
The public API: (yes / no)
The schema: (yes / no / don't know)
The default values of configurations: (yes / no)
The wire protocol: (yes / no)
The rest endpoints: (yes / no)
The admin cli options: (yes / no)
Anything that affects deployment: (yes / no / don't know)

Documentation

Check the box below or label this PR directly.

Need to update docs?

doc-required
(Your PR needs to update docs and you will update later)
doc-not-needed
(Please explain why)
doc
(Your PR contains doc changes)
doc-complete
(Docs have been already added)

eolivelli

LGTM
but we must log something that explains what happened

managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/ManagedLedgerImpl.java

zymap · 2022-09-07T12:32:12Z

@eolivelli Added. PTAL

eolivelli

LGTM

hangc0276 · 2022-09-08T01:17:17Z

Nice catch!

codelipenghui

I have left 2 minor comments about the logs.
Please provide more information in the logs so that people such as the operations team, who don't have this detailed knowledge, can understand the behavior.

managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/ManagedLedgerImpl.java

zymap · 2022-09-09T00:50:53Z

@codelipenghui Done. PTAL, thanks

…xception (#17512)
* [fix][tiered-storage] Don't cleanup data when offload met BadVersion
* log error when skip deleting
* improve logs
(cherry picked from commit c2588ba)

…xception (apache#17512)
* [fix][tiered-storage] Don't cleanup data when offload met BadVersion
* log error when skip deleting
* improve logs
(cherry picked from commit c2588ba)
(cherry picked from commit 917f997)

…xception (#17512)
* [fix][tiered-storage] Don't cleanup data when offload met BadVersion
* log error when skip deleting
* improve logs
(cherry picked from commit c2588ba)

…xception (#17512)
* [fix][tiered-storage] Don't cleanup data when offload met BadVersion
* log error when skip deleting
* improve logs

…xception

---

### Motivation

apache#17915 changed the fix from apache#17512, which led to the offloaded data being deleted when a metadata store exception happened. The ledger could then no longer be read.

The logs show:

```
Failed to update offloaded metadata for the ledgerId 6197907, the offloaded data will not be cleaned up
```

But the ledger was deleted, and the managed ledger then failed to open it:

```
Error opening ledger for reading at position 6197907:0
```

zymap added the type/enhancement, area/tieredstorage, release/2.11.1 and release/2.10.3 labels Sep 7, 2022

zymap added this to the 2.12.0 milestone Sep 7, 2022

zymap requested review from hangc0276, eolivelli, codelipenghui and gaoran10 September 7, 2022 08:55

zymap self-assigned this Sep 7, 2022

zymap changed the title ~~[fix][tiered-storage] Don't cleanup data when offload met BadVersion~~ [fix][tiered-storage] Don't cleanup data when offload met Metastore exception Sep 7, 2022

Technoboy- approved these changes Sep 7, 2022

eolivelli requested changes Sep 7, 2022

managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/ManagedLedgerImpl.java (resolved)

zymap force-pushed the fix-cleanup-offload branch from 45d78d2 to 7fe6bcd September 7, 2022 12:31

eolivelli approved these changes Sep 7, 2022

github-actions bot added the doc-not-needed label (Your PR changes do not impact docs) Sep 7, 2022

hangc0276 approved these changes Sep 8, 2022

zymap added release/2.8.5 release/2.9.4 labels Sep 8, 2022

zymap added 2 commits September 8, 2022 10:36

log error when skip deleting

23b3af2

Technoboy- force-pushed the fix-cleanup-offload branch from 7fe6bcd to 23b3af2 September 8, 2022 02:36

codelipenghui reviewed Sep 8, 2022

managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/ManagedLedgerImpl.java (outdated, resolved)

managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/ManagedLedgerImpl.java (outdated, resolved)

improve logs

ff1eac4

Merge branch 'master' into fix-cleanup-offload

49bba36

codelipenghui approved these changes Sep 9, 2022

zymap merged commit c2588ba into apache:master Sep 14, 2022

zymap added the cherry-picked/branch-2.8 Archived: 2.8 is end of life label Sep 15, 2022

Jason918 mentioned this pull request Sep 17, 2022

[fix][storage] refresh the ledgers map when the offload complete failed #17228

Closed

5 tasks

Jason918 added cherry-picked/branch-2.10 release/2.10.2 and removed release/2.10.3 labels Sep 26, 2022

congbobo184 added the cherry-picked/branch-2.9 Archived: 2.9 is end of life label Nov 9, 2022

Technoboy- added the cherry-picked/branch-2.11 label Feb 8, 2023

zymap mentioned this pull request Dec 7, 2023

[fix][offload] Don't cleanup data when offload met MetaStore exception #21686

Merged

15 tasks
