schematelemetry,eventpb: add schema telemetry #84761
Conversation
This change requires
Force-pushed from 78b5204 to 3ccaaef.
The overall architecture of this change looks good to me. Although I am raising my eyebrows at the introduction of the two SQL built-in functions: what do you suggest the UX will be? Are we expecting users to call that themselves? What happens if they are called two times in a row?
As to this:
didn't worry about redaction at all (yet), my understanding is that the logical schema does not contain any PII.
This is, alas, incorrect. Even though we've made progress on schema identifiers (table/column names etc.), you need to edit the literals out of scalar expressions - DEFAULT, CHECK, scalar expressions inside view queries, etc.
As to the actual code review, I'd invite you to ask Andrew for a more detailed scrutiny.
Reviewed 2 of 2 files at r1, 12 of 12 files at r2, 29 of 29 files at r3, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @postamar)
Thanks for taking a look on short notice! Much appreciated.
Right now I'm the target audience: I'm using them mainly for prototyping and testing. I expect both will eventually be privilege-gated to admin roles. Perhaps the job creation one will eventually disappear and tests will instead rely on a programmatic interface to trigger its behaviour. The schedule creation builtin I expect to remain but never really be used, as is the case for the stats compaction schedule creation.
The schedule creation builtin should be idempotent. What's missing is a hook in the server startup to ensure that the schedule exists, like the auto stats compaction schedule's
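For illustration, here is a minimal sketch of what such a startup hook could look like, assuming a hypothetical ensureScheduleExists helper; the package placement, function names, and task name are placeholders, not the PR's actual code.

```go
package server // illustrative placement only

import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/util/log"
	"github.com/cockroachdb/cockroach/pkg/util/stop"
)

// startSchemaTelemetryScheduleHook asynchronously ensures that the schema
// telemetry schedule exists, in the spirit of the auto stats compaction
// schedule. Because schedule creation is idempotent, running this on every
// node at every startup is safe.
func startSchemaTelemetryScheduleHook(
	ctx context.Context,
	stopper *stop.Stopper,
	ensureScheduleExists func(context.Context) error, // hypothetical helper
) error {
	return stopper.RunAsyncTask(ctx, "ensure-sql-schema-telemetry-schedule",
		func(ctx context.Context) {
			if err := ensureScheduleExists(ctx); err != nil {
				log.Warningf(ctx, "failed to ensure SQL schema telemetry schedule: %v", err)
			}
		})
}
```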
Ah, yes! I'd wondered about those. Will do.
I will in due time. At this time, you've covered exactly the questions I was interested in and which blocked me. Thanks!
Force-pushed from 3ccaaef to 1818111.
Force-pushed from 1818111 to dcf7d5f.
This is ready for review for a broader audience at this point.
Force-pushed from dcf7d5f to 1a4ce7b.
Force-pushed from fb556ea to 662d1f9.
Very nice! This is a welcome change. It's done. Ready for another look.
Seems like we've still got dependency cycles in the tests :(
This is getting close.
Reviewed 6 of 38 files at r29, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @andreimatei, @herkolategan, @knz, and @postamar)
pkg/sql/catalog/schematelemetry/schematelemetrycontroller/controller.go
line 117 at r27 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
why make the channel buffered if you block when it's full? I think you either want the channel to always be synchronous, or you want to never queue up more than one notification.
If an update comes while we're mid updateSchedule, we should update the schedule again. To achieve that, I think you do want the channel to have a buffer of size 1, and we don't want to block here. Below is the idiom to not block when the channel is full.
select {
case ch <- struct{}{}:
default:
}
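To spell out the suggestion, here is a small self-contained sketch of the whole pattern: a buffer-of-one channel, a non-blocking send on the producer side, and a consumer that runs the update once per pending notification. Names like updateSchedule are placeholders, not the PR's actual code.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()

	// Buffer of 1: at most one update notification is ever queued.
	notifyCh := make(chan struct{}, 1)

	// Producer side: signal that the schedule needs updating, without blocking.
	notify := func() {
		select {
		case notifyCh <- struct{}{}:
		default: // a notification is already pending; coalesce with it
		}
	}

	// Consumer side: run the (placeholder) update once per pending notification.
	go func() {
		for {
			select {
			case <-notifyCh:
				fmt.Println("updateSchedule() would run here")
			case <-ctx.Done():
				return
			}
		}
	}()

	notify()
	notify() // coalesced with the first call if the update hasn't started yet
	<-ctx.Done()
}
```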
pkg/sql/catalog/schematelemetry/schematelemetrycontroller/controller.go
line 45 at r30 (raw file):
"sql.schema.telemetry.recurrence", "cron-tab recurrence for SQL schema telemetry job", "@weekly", /* defaultValue */
I'm sad that this @weekly is going to end up being at literally the same time for every pod in the entire cluster. It'll mean that once a week, we create a ton of load on the logging infrastructure. Ideally we ought to find a way to spread out when in the week we do this logging. See #54537
pkg/sql/catalog/schematelemetry/schematelemetrycontroller/controller.go
line 139 at r30 (raw file):
log.Infof(ctx, "failed to update SQL schema telemetry schedule: %s", ErrVersionGate)
}

const maxRetryAttempts = 5
Honestly, I don't see a great reason to ever give up so long as you back off enough. I'd just turn this whole thing into a for loop using the retrier, log on every failure, and set the initial retry to something on the order of seconds, backing off up to something on the order of 5-10 minutes.
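A hedged sketch of that suggestion, assuming CockroachDB's pkg/util/retry and pkg/util/log packages (exact option and function names may differ) and a hypothetical updateSchedule callback:

```go
package schematelemetrycontroller // illustrative only

import (
	"context"
	"time"

	"github.com/cockroachdb/cockroach/pkg/util/log"
	"github.com/cockroachdb/cockroach/pkg/util/retry"
)

// updateScheduleWithRetry keeps retrying with exponential backoff instead of
// giving up after a fixed number of attempts, logging on every failure.
func updateScheduleWithRetry(
	ctx context.Context, updateSchedule func(context.Context) error, // hypothetical
) {
	opts := retry.Options{
		InitialBackoff: 1 * time.Second,  // start on the order of seconds
		MaxBackoff:     10 * time.Minute, // back off up to 5-10 minutes
		Multiplier:     2,
	}
	for r := retry.StartWithCtx(ctx, opts); r.Next(); {
		if err := updateSchedule(ctx); err != nil {
			log.Warningf(ctx, "failed to update SQL schema telemetry schedule: %v", err)
			continue
		}
		return
	}
}
```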
pkg/sql/sem/builtins/builtins.go
line 6798 at r25 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
a built-in for tests? Is there no other way? Can't the code backing this built-in be copied to the tests?
Did you want to do anything about this?
Force-pushed from 7357589 to 36c89f1.
This commit adds:
- the event definitions and logic for generating them,
- the scheduling and jobs boilerplate to periodically log them.

Care is taken to redact all strings present in descriptors which might unintentionally be leaking PII.

The event generation logic is tested on the schema of a bootstrapped test cluster: the test checks that the events match expectations.

Fixes cockroachdb#84284.

Release note (general change): CRDB will now collect schema info if phoning home is enabled. This schema info is added to the telemetry log by a built-in scheduled job which runs on a weekly basis by default. This recurrence can be changed via the sql.schema.telemetry.recurrence cluster setting. The schedule can also be paused via PAUSE SCHEDULE followed by its ID, which can be retrieved by querying SELECT * FROM [SHOW SCHEDULES] WHERE label = 'sql-schema-telemetry'.
This breaks an import dependency cycle. Release note: None
This new EventPayload implementation makes it possible to create instances of this interface in tests without needing to import the eventpb package. This helps break import dependency cycles. Release note: None
Force-pushed from 36c89f1 to 7370772.
This needs to ship now, it's time. I'm out of office after today until next week. It seems to me that what's still up for debate at this point is neither important nor urgent. I got rid of the import cycles; it wasn't too bad, though it was an unwelcome waste of time considering the rather meager benefits of not having an
Done! I didn't quite get what you meant the other day when you mentioned a select-default. This makes sense, thanks.
If the logging breaks down because of a measly <1M events then we have bigger problems than not jittering the schedule, even though jittering is in fact desirable for other, better reasons.
Done. Out of curiosity: why is it better to retry indefinitely? That feels mildly dangerous somehow, more than the user's SET CLUSTER SETTING remaining without effect at any rate.
@andreimatei is on vacation. I'm not doing anything about this now. If this builtin is particularly offensive for whatever reason, a release blocker issue can be filed and addressed later on.
Done. Out of curiosity: why is it better to retry indefinitely? That feels mildly dangerous somehow, more than the user's SET CLUSTER SETTING remaining without effect at any rate.
I was stupidly thinking this code runs periodically, also to make sure the schedule exists. I'm fine with it either way. I guess a reason to not keep retrying would be if the expression were somehow deemed to be invalid.
If the logging breaks down because of a measly <1M events then we have bigger problems than not jittering the schedule, even though jittering is in fact desirable for other, better reasons.
In terms of the burst size, isn't it a function of the number of clusters? So we're talking about 3-4 orders of magnitude of change in the size of the burst of messages when doing this aligned vs. unaligned. Elsewhere I've read an assumption that we try to keep these events at < 10/s. This is going to be way more than that. I've stated my unease, but will leave it at that.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @andreimatei, @herkolategan, @knz, and @postamar)
pkg/sql/catalog/schematelemetry/schema_telemetry_event.go
line 178 at r32 (raw file):
k := keys[i]
keys[i] = keys[j]
keys[j] = k
nit: go supports in-place swap correctly: keys[i], keys[j] = keys[j], keys[i]
Code quote:
k := keys[i]
keys[i] = keys[j]
keys[j] = k
pkg/sql/catalog/schematelemetry/schema_telemetry_event.go
line 187 at r32 (raw file):
descIDsInSnapshot.Add(k.id)
if k.nsKey == nil {
	orphanedDescIDs.Add(k.id)
aren't these just going to generally be dropped tables waiting for their GC TTL? The code seems to be treating them like they're important or likely to be interesting when in fact I think they're the least likely to be interesting. It'd be nice if the commentary talked about the fact that we expect all of the dropped tables to be in this set.
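As an illustration of the requested commentary (the identifiers mirror the quoted excerpt; the comment wording is mine, not the PR's):

```go
descIDsInSnapshot.Add(k.id)
if k.nsKey == nil {
	// Descriptors without a namespace entry are expected to mostly be
	// dropped tables waiting out their GC TTL, rather than anything unusual.
	orphanedDescIDs.Add(k.id)
}
```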
pkg/sql/catalog/schematelemetry/schema_telemetry_event.go
line 199 at r32 (raw file):
}

func redactTableDescriptor(d *descpb.TableDescriptor) (errs []error) {
My reading of the winds is that it'd be desirable at some future point to redact constants as opposed to literally everything. Can you leave a TODO or file a follow-up issue?
I'm still not sure what's up with CI; the failures are consistent, but not the same in bazel vs make...
If the beta gives cause for worry I won't have any problems reverting this PR if it's what it takes.
I hope so, but don't trust that this is always going to be the case.
I'll file a follow-up issue as soon as this is merged.
If the beta gives cause for worry I won't have any problems reverting this PR if it's what it takes.
I don't understand this comment. I'm suggesting that if we make the default some variation of the following, maybe by making the random choices deterministic based on cluster ID, then it'd be weekly but uniformly spread out.
fmt.Sprintf("%d %d * * %d", rand.Intn(60), rand.Intn(24), rand.Intn(7))
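A self-contained sketch of that idea, deriving the random choices deterministically from the cluster ID so the slot is stable per cluster but uniformly spread across clusters. The uuid package and helper name here are illustrative, not what the PR would actually use.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math/rand"

	"github.com/google/uuid"
)

// weeklyCrontabFor derives a deterministic "minute hour * * weekday" schedule
// from the cluster ID: same cluster, same slot; different clusters, spread out.
func weeklyCrontabFor(clusterID uuid.UUID) string {
	h := fnv.New64a()
	h.Write(clusterID[:])
	rng := rand.New(rand.NewSource(int64(h.Sum64())))
	return fmt.Sprintf("%d %d * * %d", rng.Intn(60), rng.Intn(24), rng.Intn(7))
}

func main() {
	// Example with an arbitrary cluster ID.
	fmt.Println(weeklyCrontabFor(uuid.MustParse("8a9f59a2-3c26-4e9e-b0aa-5c5ae33d5bd2")))
}
```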
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @andreimatei, @herkolategan, @knz, and @postamar)
We've spoken offline to leave that as a follow-up
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner, @andreimatei, @herkolategan, @knz, and @postamar)
👍 We definitely need to be cleverer about when to collect the snapshot, there's no doubt about that.
bors r+
Build failed (retrying...):
Build failed (retrying...):
Build succeeded: