
[fix] [meta] Switch to the metadata store thread after zk operation #20303

Merged
merged 3 commits on May 30, 2023

Conversation

poorbarcode
Contributor

@poorbarcode poorbarcode commented May 11, 2023

Motivation

#5358 switched execution to the metadata store thread after each ZK operation.
#13043 added batch operations to ZKMetadataStore, but after a batch operation completes, execution is not switched back.

Now, if we call `put /a`, the callback runs on the ZK client's event thread after the operation; but if we call `put /a/b/c` (where the node `/a` does not exist), the callback runs on the metadata store thread after the operation.

Modifications

  • Switch to the metadata store thread after the batch operation completes (see the sketch below).
  • If any exception occurs, error handling also runs on the metadata store thread.
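
As a rough illustration of the thread hop (a minimal, hypothetical sketch; the real change lives in `ZKMetadataStore` and differs in detail), a completion callback can be forced onto the metadata store executor with `thenApplyAsync`:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ThreadSwitchSketch {
    // Hypothetical single-threaded executor standing in for the metadata store thread.
    private static final ExecutorService metadataExecutor =
            Executors.newSingleThreadExecutor(r -> new Thread(r, "metadata-store"));

    // Simulates an async ZK operation whose future is completed on the
    // ZK client's event thread.
    static CompletableFuture<byte[]> zkGetData(String path) {
        CompletableFuture<byte[]> future = new CompletableFuture<>();
        new Thread(() -> future.complete(new byte[0]), "zk-event-thread").start();
        return future;
    }

    public static void main(String[] args) {
        zkGetData("/a")
                // Without the executor argument this continuation could run on the
                // completing thread (the ZK event thread); passing the executor
                // switches it onto the metadata store thread.
                .thenApplyAsync(data -> {
                    System.out.println("continuation on: " + Thread.currentThread().getName());
                    return data;
                }, metadataExecutor)
                .join();
        metadataExecutor.shutdown();
    }
}
```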

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

Matching PR in forked repository

PR in forked repository:

  • x

@github-actions github-actions bot added the doc-not-needed Your PR changes do not impact docs label May 11, 2023
@poorbarcode poorbarcode self-assigned this May 11, 2023
@poorbarcode poorbarcode added this to the 3.1.0 milestone May 11, 2023
@lhotari
Member

lhotari commented May 12, 2023

@poorbarcode does this also resolve the deadlock issue reported in #20148?

@poorbarcode poorbarcode requested a review from nodece May 12, 2023 14:04
@poorbarcode
Contributor Author

@lhotari

does this also resolve the deadlock issue reported in #20148?

It only solves the error `ZooKeeper session reconnection timeout. Notifying session is lost.` The deadlock still exists (the locked thread switches from the ZK client event thread to the metadata store thread).

I will try to fix #20148 later. Thank you for reminding me; you are very thorough.
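
For context, here is a minimal sketch of the single-threaded-executor self-deadlock pattern described above (hypothetical names, not Pulsar code): a task running on the metadata store thread blocks on work that can only run on that same thread. A timeout is used so the demo terminates instead of hanging:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class SelfDeadlockSketch {
    public static void main(String[] args) throws Exception {
        ExecutorService metadataExecutor =
                Executors.newSingleThreadExecutor(r -> new Thread(r, "metadata-store"));

        Future<String> outer = metadataExecutor.submit(() -> {
            // The inner task needs the same single-threaded executor that this
            // task is currently occupying, so it can never start...
            Future<String> inner = metadataExecutor.submit(() -> "metadata");
            try {
                // ...and a blocking wait here would deadlock forever without the timeout.
                return inner.get(2, TimeUnit.SECONDS);
            } catch (TimeoutException e) {
                return "deadlocked: waiting on our own executor";
            }
        });
        System.out.println(outer.get());
        metadataExecutor.shutdownNow();
    }
}
```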

Member

@nodece nodece left a comment

LGTM

@michaeljmarshall
Member

michaeljmarshall commented May 17, 2023

Now, if we call `put /a`, the callback runs on the ZK client's event thread after the operation; but if we call `put /a/b/c` (where the node `/a` does not exist), the callback runs on the metadata store thread after the operation.

What is the consequence of using different threads? Did you see out-of-order events?

EDIT: it looks like the consequence is the error `ZooKeeper session reconnection timeout. Notifying session is lost.`

@poorbarcode poorbarcode force-pushed the fix/meta_store_thread_switch branch from 0ede03e to e498f17 Compare May 18, 2023 03:32
@poorbarcode
Contributor Author

Rebased onto master.

@Technoboy- Technoboy- force-pushed the fix/meta_store_thread_switch branch from cbf91d1 to 305cf9c Compare May 30, 2023 08:11
poorbarcode added a commit that referenced this pull request May 30, 2023
liangyepianzhou pushed a commit that referenced this pull request Jul 7, 2023
nicoloboschi pushed a commit to datastax/pulsar that referenced this pull request Jul 17, 2023
poorbarcode added a commit that referenced this pull request Aug 28, 2023
…ata when doing lookup (#21063)

Motivation: If we set `allowAutoTopicCreationType` to `PARTITIONED`, the topic creation flow is as follows:
1. `Client-side`: Look up the topic to get the partitioned topic metadata before creating a producer.
1. `Broker-side`: Create the partitioned topic metadata.
1. `Broker-side`: Respond with `{"partitions":3}`.
1. `Client-side`: Create a separate connection for each partition of the topic.
1. `Broker-side`: Receive 3 connect requests and create 3 partition topics.

In step 2 above, the flow is as follows:
1. Check the topic auto-creation policy (here, `{allowAutoTopicCreationType=PARTITIONED, defaultNumPartitions=3}`).
1. Check whether the partitioned topic metadata already exists.
1. Try to create the partitioned topic metadata if it does not exist.
1. If creation failed because the partitioned topic metadata already exists (perhaps another broker is creating it concurrently), read the partitioned topic metadata from the metadata store and respond to the client (sketched below).
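
A condensed sketch of this pre-fix flow (a hypothetical interface for illustration; the actual Pulsar code paths differ). The important detail is that the conflict read-back in the last step may be served from the cache:

```java
import java.util.Optional;
import java.util.concurrent.CompletableFuture;

// Hypothetical cache-backed store abstraction, for illustration only.
interface PartitionedTopicStore {
    CompletableFuture<Optional<Integer>> getPartitions(String topic); // may serve a cached entry
    CompletableFuture<Void> createPartitions(String topic, int n);    // fails if it already exists
}

class AutoCreateFlowBeforeFix {
    static CompletableFuture<Optional<Integer>> lookup(
            PartitionedTopicStore store, String topic, int defaultPartitions) {
        return store.getPartitions(topic).thenCompose(existing -> {
            if (existing.isPresent()) {
                return CompletableFuture.completedFuture(existing);
            }
            return store.createPartitions(topic, defaultPartitions)
                    .thenApply(ignore -> Optional.of(defaultPartitions))
                    // Conflict read-back: this read can hit the empty optional cached
                    // by the earlier getPartitions call -- the race described below.
                    .exceptionallyCompose(conflict -> store.getPartitions(topic));
        });
    }
}
```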

There is a race condition that can make the client receive non-partitioned metadata for the topic:
| time | `broker-1` | `broker-2` |
| --- | --- | --- |
| 1 | Get policy: `PARTITIONED, 3` | Get policy: `PARTITIONED, 3` |
| 2 | Check whether the partitioned topic metadata already exists | Check whether the partitioned topic metadata already exists |
| 3 | It does not exist; the metadata cache caches an empty optional for the path | It does not exist; the metadata cache caches an empty optional for the path |
| 4 |  | Successfully creates the partitioned topic metadata |
| 5 | Receives a ZK node-changed event that invalidates the cached partitioned topic metadata |  |
| 6 | Creation fails because the metadata already exists |  |
| 7 | Reads the partitioned topic metadata again |  |

If step 5 executes later than step 7, `broker-1` will get an empty optional from the partitioned topic metadata cache and respond with non-partitioned metadata to the client.

**What could make step 5 execute later than step 7?**
One scenario: the issue that PR #20303 fixed, which made `zk operation` and `zk node changed notifications` execute on different threads (the ZK client's main thread and the metadata store thread).

Therefore, the partitioned topic metadata lookup mechanism is fragile, and we need to make it more robust.

Modifications: Before reading the partitioned topic metadata again, refresh the cache first.
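
In terms of the hypothetical sketch above, only the conflict read-back changes: refresh the cache entry before reading it back, assuming the store exposes a refresh that re-reads the path from ZooKeeper (again, illustrative names, not the actual Pulsar API):

```java
import java.util.Optional;
import java.util.concurrent.CompletableFuture;

// Same hypothetical store shape, now with a cache refresh (illustrative names only).
interface RefreshablePartitionedTopicStore {
    CompletableFuture<Optional<Integer>> getPartitions(String topic); // cached read
    CompletableFuture<Void> refresh(String topic); // bypass the cache and re-read from ZK
}

class ConflictReadBackAfterFix {
    static CompletableFuture<Optional<Integer>> readBackAfterConflict(
            RefreshablePartitionedTopicStore store, String topic) {
        // Refresh first, so the empty optional cached in step 3 of the table
        // above can no longer be returned to the client.
        return store.refresh(topic).thenCompose(ignore -> store.getPartitions(topic));
    }
}
```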
poorbarcode added a commit that referenced this pull request Aug 29, 2023
…ata when doing lookup (#21063)
poorbarcode added a commit that referenced this pull request Aug 31, 2023
…ata when doing lookup (#21063)
poorbarcode added a commit that referenced this pull request Aug 31, 2023
…ata when doing lookup (#21063)
Demogorgon314 pushed a commit to streamnative/pulsar-archived that referenced this pull request Dec 14, 2023
…ata when doing lookup (apache#21063)
mukesh-ctds pushed a commit to datastax/pulsar that referenced this pull request Apr 15, 2024
…ata when doing lookup (apache#21063)
mukesh-ctds pushed a commit to datastax/pulsar that referenced this pull request Apr 19, 2024
…ata when doing lookup (apache#21063)
srinath-ctds pushed a commit to datastax/pulsar that referenced this pull request Apr 23, 2024
…ata when doing lookup (apache#21063)