fix:dup conflict with balance #1590

ninsmiracle · 2023-08-29T08:06:17Z

What problem does this PR solve?

What is changed and how does it work?

Two conflicts may occur when duplication and balance are both enabled:
1. Load (one stage of duplication) conflicts with replica close
Add an atomic flag and, when the replica is being closed, check whether it is still loading logs for duplication.
So how does the load end so that the replica close can continue? There are two ways:
- duplication steps to the next stage (shipping)
- wait for duplication to run update_duplication_map, which calls remove_all_duplications when the replica loses its primary identity.

In remove_all_duplications, duplication uses a map named _replica to check the replica identity, so I protected it while the replica is closing.

2. GC of a useless replica conflicts with replica close (one replica gets closed twice)
After the replica is connected with meta, meta requests config from the replica every 500 ms, and the replica server updates its own config when meta replies. In this logic, on_node_query_reply_scatter2 will GC useless replicas (those whose status is PS_INACTIVE); to be precise, it sets the status from PS_INACTIVE to PS_ERROR.
When a replica runs update_local_configuration, it executes the close action whenever the status changes and the new status is PS_INACTIVE or PS_ERROR.
To deal with this, before executing the close I check whether the replica is already closed or is in the process of closing.
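The "already closed or closing" check from item 2 can be sketched roughly as below. This is only an illustration and not the code in this PR: `replica_sketch`, `close_state`, `try_begin_close` and `finish_close` are hypothetical names, and the real replica keeps much more state. The point is that whichever caller flips the state first wins, so a second close request (for example from GC) becomes a no-op.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical stand-in for the replica lifecycle; not the Pegasus types.
enum class close_state { OPEN, CLOSING, CLOSED };

class replica_sketch
{
public:
    // Returns true only for the first caller that moves OPEN -> CLOSING.
    // Any later caller sees CLOSING or CLOSED and must not close again.
    bool try_begin_close()
    {
        close_state expected = close_state::OPEN;
        return _state.compare_exchange_strong(expected, close_state::CLOSING);
    }

    // Called once by the caller that won try_begin_close().
    void finish_close()
    {
        assert(_state.load() == close_state::CLOSING);
        _state.store(close_state::CLOSED);
    }

    bool is_closing_or_closed() const { return _state.load() != close_state::OPEN; }

private:
    std::atomic<close_state> _state{close_state::OPEN};
};
```

With this shape, both the update_local_configuration path and the GC path could call `try_begin_close()` and only the winner would enqueue the actual close.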

Tests

Cluster test (mentioned in issue #1589)

In summary:

The earliest zlock coredump was caused by the 'dup' load stage needing to acquire a lock that is a member variable of the 'replica' class. However, at this point the 'replica' had already been closed and its related members had been destructed, so zlock got an invalid _lock value.

The root cause of the replica being closed actually stems from the logic where the meta requests 'replica' for configuration, triggering a 'replica gc' process that sets the replica itself to 'PS_ERROR'. Based on the existing logic, both 'PS_INACTIVE' and 'PS_ERROR' trigger the 'begin_close' logic, with 'inactive' having a 10-minute delay and 'error' being executed immediately.
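To make the timing above concrete, here is a minimal sketch of the status-to-delay mapping as described in this comment. The enum and the function `close_delay_for` are hypothetical and only mirror the text (10 minutes for inactive, immediate for error); the real values live in the replica server code and may be configurable.

```cpp
#include <chrono>
#include <optional>

// Hypothetical mirror of the partition statuses mentioned in this PR.
enum class partition_status_sketch { PS_INACTIVE, PS_ERROR, PS_PRIMARY, PS_SECONDARY };

// Returns the delay before begin_close runs, or std::nullopt when the
// status does not trigger a close at all.
std::optional<std::chrono::seconds> close_delay_for(partition_status_sketch s)
{
    switch (s) {
    case partition_status_sketch::PS_INACTIVE:
        // Inactive replicas are closed after a 10-minute delay.
        return std::chrono::seconds(std::chrono::minutes(10));
    case partition_status_sketch::PS_ERROR:
        // Error replicas are closed immediately.
        return std::chrono::seconds(0);
    default:
        return std::nullopt;
    }
}
```

This is why the GC step matters: moving a replica from PS_INACTIVE to PS_ERROR turns a close that was still 10 minutes away into one that runs right now.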

A double safeguard was implemented for the modification. Firstly, when enqueuing the close operation, it is checked whether the replica has already been closed or is in the process of closing. Secondly, during the execution of the close operation, in case it encounters a 'dup load', it is configured to enter the task queue with a delay of one minute.
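The second safeguard (postponing the close while a 'dup load' is running) could look roughly like the sketch below. All names here (`replica_close_sketch`, `close_decision`, `on_close_task`) are made up for illustration, and the task queue itself is left out; the sketch only shows the decision made each time the close task runs.

```cpp
#include <atomic>
#include <chrono>

// Result of one attempt to run the close task (hypothetical type).
struct close_decision
{
    bool close_now;                   // true: proceed with the close
    std::chrono::seconds retry_after; // otherwise: re-enqueue after this delay
};

class replica_close_sketch
{
public:
    // Set by the duplication load stage when it starts and finishes.
    void set_dup_loading(bool loading) { _dup_loading.store(loading); }

    // Called when the close task is dequeued.
    close_decision on_close_task() const
    {
        if (_dup_loading.load()) {
            // Load still holds replica members; try again in one minute.
            return {false, std::chrono::seconds(60)};
        }
        return {true, std::chrono::seconds(0)};
    }

private:
    std::atomic<bool> _dup_loading{false};
};
```

The caller would re-enqueue the close task with `retry_after` until `close_now` is true, which happens once duplication moves on to shipping or the duplications are removed.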

acelyc111 · 2024-01-30T06:49:01Z

src/replica/duplication/duplication_sync_timer.h

+    // duplication_sync_timer is async with replica close,so replica_duplicator_manager may already
+    // release
+    // dup should stop right now
+    bool replica_is_cloing_or_closed(gpid id);


Same as in #1608: you can determine the state of the replica from the replica itself.

ninsmiracle and others added 3 commits August 29, 2023 14:37

delay close replica when replica doing duplication load

53bfffb

aviod gc replica(make close replica twice)

f53f9da

format code

ffdd302

github-actions bot added the cpp label Aug 29, 2023

format2

1d4524a

acelyc111 requested review from foreverneverer and neverchanje October 16, 2023 16:25

ninsmiracle and others added 4 commits January 18, 2024 16:23

Optimized the replica close method

242d0f2

delete uncessary changes

16ff9e8

format code

8813fca

Merge branch 'master' into fix_dup_conflict_with_balance

34639f6

acelyc111 reviewed Jan 30, 2024


ninsmiracle and others added 6 commits January 31, 2024 14:56

move replica_is_closing_or_closed

5da414f

small fix

f3e5d9e

Merge branch 'apache:master' into fix_dup_conflict_with_balance

a4090b0

format 3.9

0327900

Merge branch 'apache:master' into fix_dup_conflict_with_balance

7d16128

pass iwyu

3c746b7
