improve region reload strategy #1122
Conversation
@crazycs520 PTAL
if reloadOnAccess {
	lr, err = c.loadRegion(bo, key, isEndKey)
} else {
	lr, err = c.loadRegionByID(bo, r.GetID())
}
What's the difference here between loading by ID and loading by key?
This is the replacement for the previous async reload introduced by #843, which uses `loadRegionByID`. I just tried to keep the original implementation, but I'm not very sure why we use `loadRegionByID` here. Maybe it's more efficient than `loadRegionByKey`.
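For readers skimming this thread, here is a minimal, self-contained sketch (hypothetical types and data, not the real client-go API) of why looking a region up by key and by ID can behave differently after a split or merge: the by-key path finds whichever region currently covers the key, while the by-ID path fails if that exact region no longer exists.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical region descriptor used only for this illustration.
type regionInfo struct {
	id       uint64
	startKey string
	endKey   string // empty means +inf
}

// pd stands in for the PD routing table in this sketch.
var pd = []regionInfo{
	{id: 11, startKey: "", endKey: "m"},
	{id: 12, startKey: "m", endKey: ""},
}

// loadRegionByKey returns whichever region currently covers the key.
func loadRegionByKey(key string) (regionInfo, error) {
	for _, r := range pd {
		if key >= r.startKey && (r.endKey == "" || key < r.endKey) {
			return r, nil
		}
	}
	return regionInfo{}, errors.New("no region covers key")
}

// loadRegionByID only succeeds if that exact region still exists.
func loadRegionByID(id uint64) (regionInfo, error) {
	for _, r := range pd {
		if r.id == id {
			return r, nil
		}
	}
	return regionInfo{}, errors.New("region not found (e.g. merged away)")
}

func main() {
	// Suppose the cached region (id 10) covering "k" was merged into region 11.
	if _, err := loadRegionByID(10); err != nil {
		fmt.Println("by id:", err) // the stale id no longer resolves
	}
	r, _ := loadRegionByKey("k")
	fmt.Println("by key: region", r.id) // still finds the current owner of the key
}
```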
internal/locate/region_cache.go (outdated)
updated int32 = iota // region is updated and no need to reload.
needSync             // need sync new region info.
needReloadOnAccess int32 = 1 << iota // indicates the region will be reloaded on next access
needExpireAfterTTL                   // indicates the region will expire after RegionCacheTTL (even when it's accessed continuously)
The strategy is to expire these regions under certain conditions, like down peers. We may need to consider how to make the region reloads smoother, as the current TTL is a constant value.
I think it's relatively fine for regions that have stale or unreachable peers, since the cache GC only scans 50 regions per second. For regions that have down peers, maybe we can add a random value to `lastAccess` when constructing them in `newRegion`?
Maybe we could make some improvements to smooth region reloads in another PR, considering both expiration reloading and region-cache-miss reloading. I remember we've encountered lots of region reloads caused by sudden TTL expiration or something like it.
Agreed. One idea I'm thinking about is that we may change `lastAccess` to a TTL and push it to `now + RegionCacheTTL + RandomJitter` on access.
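A minimal sketch of that jittered-expiry idea, with assumed names and TTL value (not the actual client-go implementation): store an absolute expire time instead of `lastAccess` and push it to now + TTL + jitter on each access, so cached regions don't all expire at the same instant.

```go
package main

import (
	"fmt"
	"math/rand"
	"sync/atomic"
	"time"
)

const regionCacheTTL = 10 * time.Minute // assumed value, for illustration only

type cachedRegion struct {
	expireAt atomic.Int64 // absolute expiry time in unix nanoseconds
}

// touch extends the lifetime on access: now + TTL + random jitter.
func (r *cachedRegion) touch() {
	jitter := time.Duration(rand.Int63n(int64(regionCacheTTL / 10)))
	r.expireAt.Store(time.Now().Add(regionCacheTTL + jitter).UnixNano())
}

// expired reports whether the region has passed its jittered deadline.
func (r *cachedRegion) expired() bool {
	return time.Now().UnixNano() > r.expireAt.Load()
}

func main() {
	r := &cachedRegion{}
	r.touch()
	fmt.Println("expired now?", r.expired())
	fmt.Println("expires at:", time.Unix(0, r.expireAt.Load()))
}
```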
LGTM
rest LGTM
internal/locate/region_cache.go (outdated)
@@ -1123,9 +1093,18 @@ func (c *RegionCache) findRegionByKey(bo *retry.Backoffer, key []byte, isEndKey
		c.insertRegionToCache(r, true, true)
		c.mu.Unlock()
	}
} else if r.checkNeedReloadAndMarkUpdated() {
} else if flags := r.resetSyncFlags(needReloadOnAccess | needAsyncReloadReady); flags > 0 {
Seems the async reload becomes sync here.
Yes, this is a potential regression. In the current implementation, reloading is divided into three levels:
- invalidate: the region cannot be used any more, so reloading is required.
- reload on access: we need to reload the region on access, but we can tolerate reload failures (it's equivalent to the previous sync flag).
- async reload: the region needs to be reloaded later, but it's not too urgent.
So it's not a real "async" reload and the name is somewhat misleading. I've considered keeping the async reload goroutine or doing the reload in cache GC, but gave up for simplicity.
Currently, we mark a region as `NeedAsyncReload` only when we find it has a store which is stale but reachable, or its leader store is stale. Please let me know if a real async reload matters here, and I'll change it. Or maybe we just rename the flag (e.g. `DelayedReload`)?
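To make the three levels concrete, here is a rough sketch; the constant names and layout are illustrative, not the exact code in this PR.

```go
package main

import "fmt"

const (
	// reload on next access; a failed reload can be tolerated and retried.
	needReloadOnAccess uint32 = 1 << iota
	// reload eventually (e.g. a follower store looks stale but reachable);
	// not urgent, so it can be deferred to a later access or background scan.
	needDelayedReload
)

type region struct {
	invalid   bool   // invalidate: the region cannot be used at all, reload is required
	syncFlags uint32 // softer hints: reload on access / delayed reload
}

// describe maps the region state to the three reload levels discussed above.
func describe(r region) string {
	switch {
	case r.invalid:
		return "invalidate: must reload before use"
	case r.syncFlags&needReloadOnAccess != 0:
		return "reload on access: refresh now, tolerate failure"
	case r.syncFlags&needDelayedReload != 0:
		return "delayed reload: refresh later, not urgent"
	}
	return "up to date"
}

func main() {
	fmt.Println(describe(region{syncFlags: needDelayedReload}))
}
```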
Renaming or commenting the flag are both OK to me.
The current comment says it will be reloaded in async mode, which is inaccurate:
needAsyncReloadPending // indicates the region will be reloaded in async mode
needAsyncReloadReady // indicates the region is ready to be reloaded in async mode
Renamed, PTAL.
internal/locate/region_cache.go (outdated)
	continue
}
if syncFlags&needAsyncReloadPending > 0 {
	region.setSyncFlags(needAsyncReloadReady)
Do I understand correctly that the flag change here is to avoid too frequent, meaningless reloading?
Yes, it tries to limit the rate of reloading here.
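A simplified sketch of that rate-limiting idea (names and the scan quota are assumptions, not the exact code): the periodic cache GC scans only a bounded number of regions per tick and promotes "pending" regions to "ready" there, and the actual reload happens on the next access of a "ready" region, so the number of reloads triggered per unit of time stays bounded.

```go
package main

import "fmt"

const (
	pendingReload uint32 = 1 << iota // a problem was noticed, reload eventually
	readyReload                      // GC has cleared this region for reload on next access
)

type region struct {
	id    uint64
	flags uint32
}

// gcScan promotes at most `quota` pending regions per tick, bounding how many
// reloads can be triggered between two GC rounds.
func gcScan(regions []*region, quota int) {
	for _, r := range regions {
		if quota == 0 {
			return
		}
		if r.flags&pendingReload != 0 {
			r.flags = (r.flags &^ pendingReload) | readyReload
			quota--
		}
	}
}

func main() {
	regs := []*region{{id: 1, flags: pendingReload}, {id: 2, flags: pendingReload}}
	gcScan(regs, 1) // only one region becomes ready this tick
	for _, r := range regs {
		fmt.Printf("region %d ready=%v\n", r.id, r.flags&readyReload != 0)
	}
}
```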
@zyguan
I'll submit another PR soon, which tries to make reloading caused by TTL smoother (#1122 (comment)) and adds some metrics to observe the reasons for reloading.
This PR adds a new reload strategy: if a region is marked as `needExpireAfterTTL`, we no longer update `lastAccess`, so the region will expire after RegionCacheTTL. Currently, we set this flag for regions under certain conditions (e.g. regions with down peers). This ensures that those regions will be reloaded every RegionCacheTTL, so issues like #879 and pingcap/tidb#35418 should be resolved (see the sketch after this description).
This PR also optimizes region fields and replaces async-reload with delayed-reload.
Also ref: #1104
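A minimal sketch of the `needExpireAfterTTL` strategy described above, with assumed field names and TTL value: when the flag is set, accesses stop refreshing `lastAccess`, so the region falls out of the cache after RegionCacheTTL even if it is accessed continuously.

```go
package main

import (
	"fmt"
	"time"
)

const regionCacheTTL = 10 * time.Minute // assumed value, for illustration only

const needExpireAfterTTL uint32 = 1 << 1 // illustrative flag value

type region struct {
	syncFlags  uint32
	lastAccess time.Time
}

// onAccess is called on every cache hit.
func (r *region) onAccess(now time.Time) {
	if r.syncFlags&needExpireAfterTTL == 0 {
		// normal regions stay alive as long as they keep being accessed
		r.lastAccess = now
	}
	// flagged regions keep their old lastAccess and will expire after the TTL
}

func (r *region) expired(now time.Time) bool {
	return now.Sub(r.lastAccess) > regionCacheTTL
}

func main() {
	now := time.Now()
	r := &region{
		syncFlags:  needExpireAfterTTL,
		lastAccess: now.Add(-regionCacheTTL - time.Minute),
	}
	r.onAccess(now)
	fmt.Println("expired despite access:", r.expired(now))
}
```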
Regression test for #879
Set region-cache-ttl and max-store-down-time to 5m, and see the following timeline:
Regression test for #1029
Set region-cache-ttl and max-store-down-time to 5m, and see the following timeline (`show table regions` was used to reload regions manually):
Regression test for #843