server: adopt for settings rangefeed-backed settingswatcher, remove g… #69269
Conversation
Actually there's a bit more to do here.
Grafted from cockroachdb#69269. This seems like a useful primitive for users of this library. We intend to use it in cockroachdb#69661 and cockroachdb#69614.

Release note: None

Co-authored-by: irfan sharif <irfanmahmoudsharif@gmail.com>
Force-pushed from dec8f50 to 1c00077
…ossip

This commit removes the code which connected the settings to their backing table via the gossiped system config. Instead it unconditionally enables the rangefeed-backed `settingswatcher` which was developed to support tenants. Note that it is rather well-tested code that has been used in multi-tenant SQL pods for about a year now and all the existing tests still pass.

Release note: None
Force-pushed from 1c00077 to 2e8c2c4
@RaduBerinde I've rebased this and I think it's ready for a pass. I suspect there's missing testing somewhere. Please let me know what you're looking for.
Thanks for working on this.
It feels like we're processing each rangefeed event in two different places and in two different ways (one indirectly, after buffering). What's the benefit of buffering events? Why not just keep `mu.data` up to date in the main rangefeed callback? That would make everything a lot simpler. I don't think we'd even need to keep track of frontier timestamps anymore - whenever we get an event, we either spawn the async storage task, or if it is running already, we set a flag indicating that it needs to run again (and the task can check that flag and restart).
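For illustration, a minimal sketch of that flag-based scheme; the type and field names (`watcher`, `taskRunning`, `runAgain`, `writeSnapshot`) are hypothetical stand-ins, not code from this PR:

```go
package settingsexample

import (
	"context"
	"sync"
)

// watcher is a toy stand-in for the settings watcher.
type watcher struct {
	mu struct {
		sync.Mutex
		taskRunning bool // an async storage task is currently in flight
		runAgain    bool // an event arrived while the task was running
	}
}

// writeSnapshot stands in for persisting the current in-memory settings state.
func (w *watcher) writeSnapshot(ctx context.Context) {}

// onEvent sketches the idea above: either spawn the async storage task, or,
// if one is already running, set a flag so it runs once more before exiting.
func (w *watcher) onEvent(ctx context.Context) {
	w.mu.Lock()
	defer w.mu.Unlock()
	if w.mu.taskRunning {
		w.mu.runAgain = true
		return
	}
	w.mu.taskRunning = true
	go func() {
		for {
			w.writeSnapshot(ctx)
			w.mu.Lock()
			if !w.mu.runAgain {
				w.mu.taskRunning = false
				w.mu.Unlock()
				return
			}
			w.mu.runAgain = false
			w.mu.Unlock()
		}
	}()
}
```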
pkg/server/settingswatcher/settings_watcher.go, line 66 at r3 (raw file):
// bootstrap settings state.
type Storage interface {
	WriteKVs(ctx context.Context, kvs []roachpb.KeyValue) error
[nit] This suggests that we may be writing KVs in batches, whereas IIUC each call is a full snapshot. Maybe `SaveKVs` or `SnapshotKVs`?
We guarantee that only one instance of the call will be running at any one time, right? We should advertise that here.
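If that rename and the single-caller guarantee land, the interface might read roughly like this; `SnapshotKVs` and the comment wording are just the suggestions from this thread, not the actual code:

```go
package settingsexample

import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/roachpb"
)

// Storage is implemented by the caller to persist a copy of the settings for
// use at bootstrap.
type Storage interface {
	// SnapshotKVs writes out a complete snapshot of the settings table; it is
	// not an incremental batch. The watcher guarantees that at most one call
	// is in flight at any given time.
	SnapshotKVs(ctx context.Context, kvs []roachpb.KeyValue) error
}
```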
pkg/server/settingswatcher/settings_watcher.go, line 255 at r3 (raw file):
		return
	}
}
[nit] add a comment here saying that a call was already running and we need to try again.
This singleflight+retry mechanism feels awkward to me (perhaps because each call still spawns an async task separately waiting for what should really be a single process). Wouldn't it be simpler to have at most one async task running, along the lines of:
- invariant: if `frontierToSave > frontierSaved` then there is an async task running or starting. If `frontierToSave <= frontierSaved` the async task is not running (or it's exiting).
- Before we forward `frontierToSave`, we check the above condition and if we didn't have an async task running, we start it after the forward.
- In the async task, we run a loop until `frontierSaved >= frontierToSave`. The latter can change during the loop, causing more iterations.
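A rough sketch of that scheme, assuming hypothetical `frontierToSave`/`frontierSaved` fields guarded by the watcher's mutex and a `SnapshotKVs` storage hook (illustrative only, not the PR's code):

```go
package settingsexample

import (
	"context"
	"sync"

	"github.com/cockroachdb/cockroach/pkg/roachpb"
	"github.com/cockroachdb/cockroach/pkg/util/hlc"
)

// frontierWatcher is a toy model of the invariant described above; all names
// are illustrative.
type frontierWatcher struct {
	storage interface {
		SnapshotKVs(ctx context.Context, kvs []roachpb.KeyValue) error
	}
	mu struct {
		sync.Mutex
		frontierToSave hlc.Timestamp // highest frontier we still need to persist
		frontierSaved  hlc.Timestamp // highest frontier known to be persisted
	}
}

// snapshotLocked stands in for materializing the settings state as of ts.
func (w *frontierWatcher) snapshotLocked(ts hlc.Timestamp) []roachpb.KeyValue { return nil }

// onFrontierAdvance forwards frontierToSave and ensures that exactly one async
// storage task runs whenever frontierSaved lags behind it.
func (w *frontierWatcher) onFrontierAdvance(ctx context.Context, frontier hlc.Timestamp) {
	w.mu.Lock()
	defer w.mu.Unlock()
	// Invariant: a task is running (or starting) iff frontierSaved < frontierToSave.
	taskRunning := w.mu.frontierSaved.Less(w.mu.frontierToSave)
	w.mu.frontierToSave.Forward(frontier)
	if taskRunning {
		return // the running task will observe the new frontier in its loop
	}
	go func() {
		for {
			w.mu.Lock()
			toSave := w.mu.frontierToSave
			if !w.mu.frontierSaved.Less(toSave) {
				w.mu.Unlock()
				return // caught up; per the invariant the task may exit
			}
			kvs := w.snapshotLocked(toSave)
			w.mu.Unlock()
			if err := w.storage.SnapshotKVs(ctx, kvs); err != nil {
				return // retry/error handling elided in this sketch
			}
			w.mu.Lock()
			w.mu.frontierSaved.Forward(toSave)
			w.mu.Unlock()
		}
	}()
}
```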
pkg/server/settingswatcher/settings_watcher_external_test.go, line 186 at r3 (raw file):
func (f *fakeStorage) WriteKVs(ctx context.Context, kvs []roachpb.KeyValue) error {
	f.Lock()
	defer f.Unlock()
[nit] should we introduce a random delay here? I want to make sure to test the situation where an event comes in while the async storage task is running. If we do that, we should also assert that only one instance of the method is running at a time (we can increment and defer(decrement) an atomic counter and check that it was 0)
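Something along those lines might look like this; the random sleep and the atomic-counter assertion are the suggestion above, and `numWrites` is a made-up field, not existing test code:

```go
package settingsexample

import (
	"context"
	"math/rand"
	"sync"
	"sync/atomic"
	"time"

	"github.com/cockroachdb/cockroach/pkg/roachpb"
)

// fakeStorage is a toy version of the test double discussed above.
type fakeStorage struct {
	sync.Mutex
	inFlight  int32 // concurrent WriteKVs calls; must never exceed 1
	numWrites int
	kvs       []roachpb.KeyValue
}

func (f *fakeStorage) WriteKVs(ctx context.Context, kvs []roachpb.KeyValue) error {
	// Assert the watcher's promise that calls never overlap: the counter must
	// have been 0 before we incremented it.
	if n := atomic.AddInt32(&f.inFlight, 1); n != 1 {
		panic("concurrent WriteKVs calls")
	}
	defer atomic.AddInt32(&f.inFlight, -1)

	// Random delay so that rangefeed events have a chance to arrive while a
	// storage write is still in progress.
	time.Sleep(time.Duration(rand.Intn(10)) * time.Millisecond)

	f.Lock()
	defer f.Unlock()
	f.kvs = kvs
	f.numWrites++
	return nil
}
```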
What's the benefit of buffering events? Why not just keep `mu.data` up to date in the main rangefeed callback? That would make everything a lot simpler.
The complexity exists to ensure that whenever we write out a snapshot of settings, it corresponds to a snapshot which actually existed in the settings table at some point in time. The problem is that updates may come out-of-order. The buffer is a hack to avoid needing to maintain a versioned store for data. I'm not sure it's saving much in the way of complexity.
Huh, the range feed events don't have monotonically increasing timestamps?? That is a huge thing, it should be documented in at least 3 places (the RangeFeed Internal API itself, …).

If we care about the ordering in the settings snapshot, it feels like we should care about it in memory as well (even though the window of potential problems would be much smaller in practice). I assume that more often than not, we will care about the proper ordering when we write a range feed "client" (which means each use will have to get involved with buffering and frontiers). At the very least, it's much easier to reason about things if you can just get the nice semantics. So I think we should build a small layer into …
It has very specific ordering guarantees. It guarantees that for any individual key you will see writes for the first time in increasing timestamp order. It makes no statement at all about the ordering of events corresponding to different rows. In cockroach, you can have a txn write to row A at t2 and then subsequently a txn write to row B at t1. Holding back the t2 event until a t1 event becomes impossible would effectively negate the design of rangefeeds.
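To make that ordering concrete, here's a simplified sketch of buffering events and only folding in those at or below the latest frontier, which is essentially the role the buffer plays here; types and names are simplified, not the actual rangefeedbuffer API:

```go
package settingsexample

import "github.com/cockroachdb/cockroach/pkg/util/hlc"

// event is a simplified rangefeed event: a key, a value, and a write timestamp.
type event struct {
	key, value string
	ts         hlc.Timestamp
}

// snapshotter buffers possibly out-of-order events and only applies them once
// the frontier passes their timestamps, so the resulting map is a state that
// actually existed in the table as of the frontier.
type snapshotter struct {
	buffered []event
	state    map[string]string
}

// onEvent may see events for different keys in any timestamp order, e.g. row A
// at t2 before row B at t1. Events for the same key do arrive in ts order.
func (s *snapshotter) onEvent(ev event) {
	s.buffered = append(s.buffered, ev)
}

// onFrontierAdvance folds every buffered event with ts <= frontier into the
// snapshot; later events stay buffered until a future frontier covers them.
// Applying in arrival order is safe because per-key arrival order is ts order.
func (s *snapshotter) onFrontierAdvance(frontier hlc.Timestamp) map[string]string {
	if s.state == nil {
		s.state = make(map[string]string)
	}
	var remaining []event
	for _, ev := range s.buffered {
		if ev.ts.LessEq(frontier) {
			s.state[ev.key] = ev.value
		} else {
			remaining = append(remaining, ev)
		}
	}
	s.buffered = remaining
	return s.state // consistent as of frontier
}
```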
I don't disagree that it is under-documented. Here's a comment which says some things but is far afield from where one might look: cockroach/pkg/ccl/changefeedccl/sink_cloudstorage.go, lines 84–100 at 707af75.
It's a tradeoff. If we wanted to wait for a snapshot, we'd have to wait for the closed timestamp, which is on the order of seconds. Even today we don't update the settings atomically with the gossip update, though the window for things to be out of sync is extremely small; the updater is non-atomic. One approach to reducing the delay is #73399.
This is what @irfansharif was setting out to do with cockroach/pkg/kv/kvclient/rangefeed/rangefeedbuffer/buffer.go, lines 32–36 at 24465b8.
Yeah. The data structure can in principle maintain exactly one entry per key while waiting for a checkpoint. Hitting a memory limit for the use cases we're intending this for ought to be extremely rare and indicative of something pathological. I'm all for it existing as a guard rail, but at least for settings, it feels like the sort of thing where if we use too much RAM, we ought to be crashing the server or something drastic like that. Thinking through more complex handling doesn't seem worth it.
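A guard rail along those lines could be as blunt as this; the limit, error, and fatal handling are illustrative, not the rangefeedbuffer package's actual API:

```go
package settingsexample

import (
	"errors"
	"log"
)

// errBufferLimitExceeded signals that the buffer has grown past its guard
// rail. Names here are illustrative.
var errBufferLimitExceeded = errors.New("rangefeed buffer limit exceeded")

// boundedBuffer holds pending events up to a fixed limit.
type boundedBuffer struct {
	limit   int
	entries []interface{}
}

func (b *boundedBuffer) add(e interface{}) error {
	if len(b.entries) >= b.limit {
		return errBufferLimitExceeded
	}
	b.entries = append(b.entries, e)
	return nil
}

// For the settings use case, blowing past the limit is treated as pathological:
// rather than degrading gracefully, the caller just treats it as fatal.
func onSettingsEvent(b *boundedBuffer, e interface{}) {
	if err := b.add(e); err != nil {
		log.Fatalf("settings watcher: %v", err) // deliberately drastic, per the discussion
	}
}
```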
If we wanted to wait for a snapshot, we'd have to wait for the closed timestamp which is on the order of seconds.
This is another critical detail that should be better advertised, e.g. in the comments for `WithOnFrontierAdvance` / `OnFrontierAdvance`.
pkg/server/settingswatcher/settings_watcher.go, line 53 at r4 (raw file):
// State used to store settings values to disk.
buffer *rangefeedbuffer.Buffer
Can you add a comment here explaining that we need the buffer because the range feed callbacks can be out-of-order? Also mention that the buffer will hold a few seconds' worth of changes in practice.
For posterity, another approach would be setting a more aggressive …
That's a particularly legit idea for the …
Replacing this with some form of #74612. It's much cleaner.
74612: rangefeedcache,settingswatcher,systemcfgwatcher: lose gossip deps r=ajwerner a=ajwerner

This is a pile of commits which supersedes #69269 and pretty much puts in place all of the pieces to move on #70560.

75726: sql: refactor pg_has_role to remove privilege.GRANT usage r=RichardJCai a=ecwall

refs #73129. Also combines some layers of privilege checking code.

Release note: None

75770: vendor: bump cockroachdb/apd to v3.1.0, speed up decimal division r=nvanbenschoten a=nvanbenschoten

Picks up two PRs that improved the performance of `Quo`, `Sqrt`, `Cbrt`, `Exp`, `Ln`, `Log`, and `Pow`:
- cockroachdb/apd#114
- cockroachdb/apd#115

Almost all of the testing changes here are due to the rounding behavior in cockroachdb/apd#115. This brings us closer to PG's behavior, but also creates a lot of noise in this diff. To verify that this noise wasn't hiding any correctness regressions caused by the rewrite of `Context.Quo` in the first PR, I created #75757, which only includes the first PR. #75757 passes CI with minimal testing changes. The testing changes that PR did require all have to do with trailing zeros, and most of them are replaced in this PR.

Release note (performance improvement): The performance of many DECIMAL arithmetic operators has been improved by as much as 60%. These operators include division (`/`), `sqrt`, `cbrt`, `exp`, `ln`, `log`, and `pow`.

----

### Speedup on TPC-DS dataset

The TPC-DS dataset is full of decimal columns, so it's a good playground to test this change. Unfortunately, the variance in the runtime performance of the TPC-DS queries themselves is high (many queries varied by 30-40% per attempt), so it was hard to get signal out of them. Instead, I imported the TPC-DS dataset with a scale factor of 10 and ran some custom aggregation queries against the largest table (web_sales, row count = 7,197,566):

```sql
# q1
select sum(ws_wholesale_cost / ws_ext_list_price) from web_sales;

# q2
select sum(ws_wholesale_cost / ws_ext_list_price - sqrt(ws_net_paid_inc_tax)) from web_sales;
```

Here's the difference in runtime of these two queries before and after this change on an `n2-standard-8` instance:

```
name              old s/op   new s/op   delta
TPC-DS/custom/q1  22.4 ± 0%   8.6 ± 0%  -61.33%  (p=0.002 n=6+6)
TPC-DS/custom/q2  91.4 ± 0%  32.1 ± 0%  -64.85%  (p=0.008 n=5+5)
```

75771: colexec: close the ordered sync used by the external sorter r=yuzefovich a=yuzefovich

**colexec: close the ordered sync used by the external sorter**

Previously, we forgot to close the ordered synchronizer that is used by the external sorter to merge already sorted partitions. This could result in some tracing spans never being finished and is now fixed.

Release note: None

**colexec: return an error rather than logging it on Close in some cases**

This error eventually will be logged anyway, but we should try to propagate it up the stack as much as possible.

Release note: None

75807: ui: Add CircleFilled component r=ericharmeling a=ericharmeling

Previously, there was no component for a filled circle icon in the `ui` package. Recent product designs have requested this icon for the DB Console (see #67510, #73463). This PR adds a `CircleFilled` component to the `ui` package.

Release note: None

75813: sql: fix flakey TestShowRangesMultipleStores r=ajwerner a=ajwerner

Fixes #75699

Release note: None

75836: dev: add generate protobuf r=ajwerner a=ajwerner

Convenient, fast.

Release note: None

Co-authored-by: Andrew Werner <awerner32@gmail.com>
Co-authored-by: Evan Wall <wall@cockroachlabs.com>
Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
Co-authored-by: Eric Harmeling <eric.harmeling@cockroachlabs.com>
…ossip
This commit removes the code which connected the settings to their backing
table via the gossiped system config. Instead it unconditionally enables the
rangefeed-backed `settingswatcher` which was developed to support tenants.

Note that it is rather well-tested code that has been used in multi-tenant SQL
pods for about a year now and all the existing tests still pass.
Release justification: Low risk, high benefit change to existing functionality
Release note: None