[fix][tiered-storage] Don't cleanup data when offload met Metastore exception #17512

Merged
merged 4 commits into from
Sep 14, 2022

Member

@zymap zymap commented Sep 7, 2022


Motivation

There are two ways the offloaded data can end up being cleaned up. One is hitting an offload conflict exception; the other is completeLedgerInfoForOffloaded reaching the maximum retry count and throwing a ZooKeeper exception.

We retry the ZooKeeper operation on a connection loss exception. We need to be careful with this exception, because we may lose data if the metadata was in fact updated successfully.

When a MetaStore exception happens, we cannot be sure whether the metadata update failed. Because we retry on connection loss, it is possible to get a BadVersion or another exception after retrying, even though an earlier attempt was applied.

So we don't clean up the data when this happens.
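
To make the ambiguity concrete, here is a minimal sketch of how a retry after connection loss can surface as BadVersion even though the first write was applied. `FakeMetaStore`, `AmbiguousRetryDemo`, and their exception types are hypothetical stand-ins written for this illustration; they are not Pulsar or ZooKeeper classes, and the real retry logic may differ.

```java
/** Hypothetical in-memory stand-in for a versioned metadata store (not a real API). */
class FakeMetaStore {
    static class ConnectionLossException extends Exception {}
    static class BadVersionException extends Exception {}

    private int version = 0;
    private String value = "not-offloaded";
    private boolean dropNextResponse;

    FakeMetaStore(boolean dropNextResponse) {
        this.dropNextResponse = dropNextResponse;
    }

    /** Conditional write: applied only if expectedVersion matches the stored version. */
    void put(String newValue, int expectedVersion)
            throws ConnectionLossException, BadVersionException {
        if (expectedVersion != version) {
            throw new BadVersionException();
        }
        value = newValue;
        version++;
        if (dropNextResponse) {
            dropNextResponse = false;
            // The write was applied, but the acknowledgement never reaches the client.
            throw new ConnectionLossException();
        }
    }

    String value() {
        return value;
    }
}

class AmbiguousRetryDemo {
    /** Retries on connection loss with the same expected version and reports what the client saw. */
    static String updateWithRetry(FakeMetaStore store, String newValue,
                                  int expectedVersion, int maxRetries) {
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                store.put(newValue, expectedVersion);
                return "OK";
            } catch (FakeMetaStore.ConnectionLossException e) {
                // Outcome unknown: the write may or may not have been applied, so retry.
            } catch (FakeMetaStore.BadVersionException e) {
                return "BAD_VERSION";
            }
        }
        return "RETRIES_EXHAUSTED";
    }

    public static void main(String[] args) {
        FakeMetaStore store = new FakeMetaStore(true);
        String outcome = updateWithRetry(store, "offloaded", 0, 3);
        // prints "client saw: BAD_VERSION, store holds: offloaded"
        System.out.println("client saw: " + outcome + ", store holds: " + store.value());
    }
}
```

In this sketch the client sees a failure while the store already records the ledger as offloaded, which is the situation where deleting the offloaded data would lose it.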

Modifications

  • Don't delete the offloaded data if a MetaStore exception occurs
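
The decision described above could be sketched roughly as follows. This is an assumption-laden illustration, not the code changed in this PR: `OffloadCleanupSketch`, `onMetadataUpdateFailed`, and both exception classes are hypothetical names, and cleanup and logging are modeled with plain lists so the example runs on its own.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; class and method names are hypothetical, not Pulsar's. */
class OffloadCleanupSketch {
    static class MetaStoreException extends Exception {
        MetaStoreException(String message) {
            super(message);
        }
    }

    static class OffloadConflictException extends Exception {
        OffloadConflictException(String message) {
            super(message);
        }
    }

    final List<Long> cleanedUpLedgers = new ArrayList<>();
    final List<String> errorLog = new ArrayList<>();

    /** Invoked when recording the offload result in metadata has failed with the given cause. */
    void onMetadataUpdateFailed(long ledgerId, Exception cause) {
        if (cause instanceof MetaStoreException) {
            // The update may have been applied despite the error, so keep the
            // offloaded data and only log what happened.
            errorLog.add("Failed to update offloaded metadata for the ledgerId " + ledgerId
                    + ", the offloaded data will not be cleaned up: " + cause.getMessage());
            return;
        }
        // Any other failure (for example an offload conflict) still triggers cleanup.
        cleanedUpLedgers.add(ledgerId);
    }

    public static void main(String[] args) {
        OffloadCleanupSketch sketch = new OffloadCleanupSketch();
        sketch.onMetadataUpdateFailed(1L, new MetaStoreException("BadVersion"));
        sketch.onMetadataUpdateFailed(2L, new OffloadConflictException("conflict"));
        System.out.println("cleaned up: " + sketch.cleanedUpLedgers);
        System.out.println("errors logged: " + sketch.errorLog.size());
    }
}
```

The trade-off is that a genuinely failed update leaves orphaned offloaded data behind, which costs storage but does not lose messages.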

Verifying this change

  • Make sure that the change passes the CI checks.

@zymap zymap added the type/enhancement, area/tieredstorage, release/2.11.1, and release/2.10.3 labels Sep 7, 2022
@zymap zymap added this to the 2.12.0 milestone Sep 7, 2022
@zymap zymap self-assigned this Sep 7, 2022
@zymap zymap changed the title [fix][tiered-storage] Don't cleanup data when offload met BadVersion [fix][tiered-storage] Don't cleanup data when offload met Metastore exception Sep 7, 2022
Contributor

@eolivelli eolivelli left a comment

LGTM
but we must log something that explains what happened

Member Author

zymap commented Sep 7, 2022

@eolivelli Added. PTAL

Contributor

@eolivelli eolivelli left a comment

LGTM

@github-actions github-actions bot added the doc-not-needed Your PR changes do not impact docs label Sep 7, 2022
@hangc0276
Contributor

Nice catch!

Contributor

@codelipenghui codelipenghui left a comment

I have left 2 minor comments about the logs.
Please provide more information in the logs so that people such as the operations team, who don't have this detailed knowledge, can understand the behavior.

Member Author

zymap commented Sep 9, 2022

@codelipenghui Done. PTAL, thanks

@zymap zymap merged commit c2588ba into apache:master Sep 14, 2022
zymap added a commit that referenced this pull request Sep 15, 2022
…xception (#17512)

* [fix][tiered-storage] Don't cleanup data when offload met BadVersion

* log error when skip deleting

* improve logs

(cherry picked from commit c2588ba)
@zymap zymap added the cherry-picked/branch-2.8 Archived: 2.8 is end of life label Sep 15, 2022
Jason918 pushed a commit that referenced this pull request Sep 26, 2022
…xception (#17512)

* [fix][tiered-storage] Don't cleanup data when offload met BadVersion

* log error when skip deleting

* improve logs

(cherry picked from commit c2588ba)
nicoloboschi pushed a commit to datastax/pulsar that referenced this pull request Sep 28, 2022
…xception (apache#17512)

* [fix][tiered-storage] Don't cleanup data when offload met BadVersion

* log error when skip deleting

* improve logs

(cherry picked from commit c2588ba)
(cherry picked from commit 917f997)
congbobo184 pushed a commit that referenced this pull request Nov 9, 2022
…xception (#17512)

* [fix][tiered-storage] Don't cleanup data when offload met BadVersion

* log error when skip deleting

* improve logs

(cherry picked from commit c2588ba)
@congbobo184 congbobo184 added the cherry-picked/branch-2.9 Archived: 2.9 is end of life label Nov 9, 2022
congbobo184 pushed a commit that referenced this pull request Nov 26, 2022
…xception (#17512)

* [fix][tiered-storage] Don't cleanup data when offload met BadVersion

* log error when skip deleting

* improve logs

(cherry picked from commit c2588ba)
Technoboy- pushed a commit that referenced this pull request Feb 8, 2023
…xception (#17512)

* [fix][tiered-storage] Don't cleanup data when offload met BadVersion

* log error when skip deleting

* improve logs
zymap added a commit to zymap/pulsar that referenced this pull request Dec 7, 2023
…xception

---

### Motivation

apache#17915 changed the fix from apache#17512, which leads to the offloaded
data being deleted when a metadata store exception happens. The
ledger can then no longer be read.

The logs show
```
Failed to update offloaded metadata for the ledgerId 6197907, the offloaded data will not be cleaned up
```
But the ledger was deleted, and the managed ledger then failed to open it
```
Error opening ledger for reading at position 6197907:0
```
Labels
  • area/tieredstorage
  • cherry-picked/branch-2.8 (Archived: 2.8 is end of life)
  • cherry-picked/branch-2.9 (Archived: 2.9 is end of life)
  • cherry-picked/branch-2.10
  • cherry-picked/branch-2.11
  • doc-not-needed (Your PR changes do not impact docs)
  • release/2.8.5
  • release/2.9.4
  • release/2.10.2
  • release/2.11.1
  • type/enhancement (The enhancements for the existing features or docs, e.g. reduce memory usage of the delayed messages)