schematelemetry,eventpb: add schema telemetry #84761
Conversation
This change requires
Force-pushed from 78b5204 to 3ccaaef.
The overall architecture of this change looks good to me. Although I am raising my eyebrows at the introduction of the two SQL built-in functions: what do you suggest the UX will be? Are we expecting users to call that themselves? What happens if they are called two times in a row?
As to this:
didn't worry about redaction at all (yet), my understanding is that the logical schema does not contain any PII.
This is, alas, incorrect. Even though we've made progress on schema identifiers (table/column names etc.), you need to edit the literals out of scalar expressions - DEFAULT, CHECK, scalar expressions inside view queries, etc.
As to the actual code review, I'd invite you to ask Andrew for a more detailed scrutiny.
Reviewed 2 of 2 files at r1, 12 of 12 files at r2, 29 of 29 files at r3, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @postamar)
Thanks for taking a look on short notice! Much appreciated.
Right now I'm the target audience: I'm using them mainly for prototyping and testing. I expect both will eventually be privilege-gated to admin roles. Perhaps the job creation one will eventually disappear and tests will instead rely on a programmatic interface to trigger its behaviour. The schedule creation builtin I expect to remain but never really be used, as is the case for the stats compaction schedule creation.
The schedule creation builtin should be idempotent. What's missing is a hook in the server startup to ensure that the schedule exists, like the auto stats compaction schedule's
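For illustration, here is a minimal sketch of what such a startup hook could look like, assuming a hypothetical ensureScheduleExists helper; the package placement, function names, and task name are placeholders, not the PR's actual code.

```go
package server // illustrative placement only

import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/util/log"
	"github.com/cockroachdb/cockroach/pkg/util/stop"
)

// startSchemaTelemetryScheduleHook asynchronously ensures that the schema
// telemetry schedule exists, in the spirit of the auto stats compaction
// schedule. Because schedule creation is idempotent, running this on every
// node at every startup is safe.
func startSchemaTelemetryScheduleHook(
	ctx context.Context,
	stopper *stop.Stopper,
	ensureScheduleExists func(context.Context) error, // hypothetical helper
) error {
	return stopper.RunAsyncTask(ctx, "ensure-sql-schema-telemetry-schedule",
		func(ctx context.Context) {
			if err := ensureScheduleExists(ctx); err != nil {
				log.Warningf(ctx, "failed to ensure SQL schema telemetry schedule: %v", err)
			}
		})
}
```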
Ah, yes! I'd wondered about those. Will do.
I will in due time. At this time, you've covered exactly the questions I was interested in and which blocked me. Thanks!
Force-pushed from 3ccaaef to 1818111.
Force-pushed from 1818111 to dcf7d5f.
This is ready for review for a broader audience at this point.
Force-pushed from dcf7d5f to 1a4ce7b.
Force-pushed from fb556ea to 662d1f9.
Very nice! This is a welcome change. It's done. Ready for another look.
Seems like we've still got dependency cycles in the tests :(
This is getting close.
Reviewed 6 of 38 files at r29, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @andreimatei, @herkolategan, @knz, and @postamar)
pkg/sql/catalog/schematelemetry/schematelemetrycontroller/controller.go
line 117 at r27 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
why make the channel buffered if you block when it's full? I think you either want the channel to always be synchronous, or you want to never queue up more than one notification.
If an update comes while we're mid updateSchedule, we should update the schedule again. To achieve that, I think you do want the channel to have a buffer of size 1, and we don't want to block here. Below is the idiom to not block when the channel is full.
select {
case ch <- struct{}{}:
default:
}
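To spell out the suggestion, here is a small self-contained sketch of the whole pattern: a buffer-of-one channel, a non-blocking send on the producer side, and a consumer that runs the update once per pending notification. Names like updateSchedule are placeholders, not the PR's actual code.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()

	// Buffer of 1: at most one update notification is ever queued.
	notifyCh := make(chan struct{}, 1)

	// Producer side: signal that the schedule needs updating, without blocking.
	notify := func() {
		select {
		case notifyCh <- struct{}{}:
		default: // a notification is already pending; coalesce with it
		}
	}

	// Consumer side: run the (placeholder) update once per pending notification.
	go func() {
		for {
			select {
			case <-notifyCh:
				fmt.Println("updateSchedule() would run here")
			case <-ctx.Done():
				return
			}
		}
	}()

	notify()
	notify() // coalesced with the first call if the update hasn't started yet
	<-ctx.Done()
}
```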
pkg/sql/catalog/schematelemetry/schematelemetrycontroller/controller.go
line 45 at r30 (raw file):
"sql.schema.telemetry.recurrence", "cron-tab recurrence for SQL schema telemetry job", "@weekly", /* defaultValue */
I'm sad that this @weekly is going to end up being at literally the same time for every pod in the entire cluster. It'll mean that once a week, we create a ton of load on the logging infrastructure. Ideally we ought to find a way to spread out when in the week we do this logging. See #54537
pkg/sql/catalog/schematelemetry/schematelemetrycontroller/controller.go
line 139 at r30 (raw file):
log.Infof(ctx, "failed to update SQL schema telemetry schedule: %s", ErrVersionGate)
}

const maxRetryAttempts = 5
Honestly, I don't see a great reason to ever give up so long as you back off enough. I'd just turn this whole thing into a for loop using the retrier, log on every failure, and set the initial retry to something on the order of seconds, backing off up to something on the order of 5-10 minutes.
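A hedged sketch of that suggestion, assuming CockroachDB's pkg/util/retry and pkg/util/log packages (exact option and function names may differ) and a hypothetical updateSchedule callback:

```go
package schematelemetrycontroller // illustrative only

import (
	"context"
	"time"

	"github.com/cockroachdb/cockroach/pkg/util/log"
	"github.com/cockroachdb/cockroach/pkg/util/retry"
)

// updateScheduleWithRetry keeps retrying with exponential backoff instead of
// giving up after a fixed number of attempts, logging on every failure.
func updateScheduleWithRetry(
	ctx context.Context, updateSchedule func(context.Context) error, // hypothetical
) {
	opts := retry.Options{
		InitialBackoff: 1 * time.Second,  // start on the order of seconds
		MaxBackoff:     10 * time.Minute, // back off up to 5-10 minutes
		Multiplier:     2,
	}
	for r := retry.StartWithCtx(ctx, opts); r.Next(); {
		if err := updateSchedule(ctx); err != nil {
			log.Warningf(ctx, "failed to update SQL schema telemetry schedule: %v", err)
			continue
		}
		return
	}
}
```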
pkg/sql/sem/builtins/builtins.go
line 6798 at r25 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
a built-in for tests? Is there no other way? Can't the code backing this built-in be copied to the tests?
Did you want to do anything about this?
Force-pushed from 7357589 to 36c89f1.
This commit adds:
- the event definitions and logic for generating them,
- the scheduling and jobs boilerplate to periodically log them.

Care is taken to redact all strings present in descriptors which might unintentionally be leaking PII.

The event generation logic is tested on the schema of a bootstrapped test cluster: the test checks that the events match expectations.

Fixes cockroachdb#84284.

Release note (general change): CRDB will now collect schema info if phoning home is enabled. This schema info is added to the telemetry log by a built-in scheduled job which runs on a weekly basis by default. This recurrence can be changed via the sql.schema.telemetry.recurrence cluster setting. The schedule can also be paused via PAUSE SCHEDULE followed by its ID, which can be retrieved by querying SELECT * FROM [SHOW SCHEDULES] WHERE label = 'sql-schema-telemetry'.
This breaks an import dependency cycle. Release note: None
This new EventPayload implementation makes it possible to create instances of this interface in tests without needing to import the eventpb package. This helps break import dependency cycles. Release note: None
Force-pushed from 36c89f1 to 7370772.
This needs to ship now, it's time. I'm out of office after today until next week. It seems to me that what's still up for debate at this point is neither important nor urgent. I got rid of the import cycles; it wasn't too bad, though it was an unwelcome waste of time considering the rather meager benefits of not having an
Done! I didn't quite get what you meant the other day when you mentioned a select-default. This makes sense, thanks.
If the logging breaks down because of a measly <1M events then we have bigger problems than not jittering the schedule, even though jittering is in fact desirable for other, better reasons.
Done. Out of curiosity: why is it better to retry indefinitely? That feels mildly dangerous somehow, more than the user's SET CLUSTER SETTING remaining without effect at any rate.
@andreimatei is on vacation. I'm not doing anything about this now. If this builtin is particularly offensive for whatever reason, a release blocker issue can be filed and addressed later on.
Done. Out of curiosity: why is it better to retry indefinitely? That feels mildly dangerous somehow, more than the user's SET CLUSTER SETTING remaining without effect at any rate.
I was stupidly thinking this code runs periodically, also to make sure the schedule exists. I'm fine with it either way. I guess a reason to not keep retrying would be if the expression were somehow deemed to be invalid.
If the logging breaks down because of a measly <1M events then we have bigger problems than not jittering the schedule, even though jittering is in fact desirable for other, better reasons.
In terms of the burst size, isn't it a function of the number of clusters? So we're talking about 3-4 orders of magnitude of change in the size of the burst of messages when doing this aligned vs. unaligned. Elsewhere I've read an assumption that we try to keep these events at < 10/s. This is going to be way more than that. I've stated my unease, but will leave it at that.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @andreimatei, @herkolategan, @knz, and @postamar)
pkg/sql/catalog/schematelemetry/schema_telemetry_event.go
line 178 at r32 (raw file):
k := keys[i]
keys[i] = keys[j]
keys[j] = k
nit: go supports in-place swap correctly: keys[i], keys[j] = keys[j], keys[i]
Code quote:
k := keys[i]
keys[i] = keys[j]
keys[j] = k
pkg/sql/catalog/schematelemetry/schema_telemetry_event.go
line 187 at r32 (raw file):
descIDsInSnapshot.Add(k.id)
if k.nsKey == nil {
	orphanedDescIDs.Add(k.id)
aren't these just going to generally be dropped tables waiting for their GC TTL? The code seems to be treating them like they're important or likely to be interesting when in fact I think they're the least likely to be interesting. It'd be nice if the commentary talked about the fact that we expect all of the dropped tables to be in this set.
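As an illustration of the requested commentary (the identifiers mirror the quoted excerpt; the comment wording is mine, not the PR's):

```go
descIDsInSnapshot.Add(k.id)
if k.nsKey == nil {
	// Descriptors without a namespace entry are expected to mostly be
	// dropped tables waiting out their GC TTL, rather than anything unusual.
	orphanedDescIDs.Add(k.id)
}
```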
pkg/sql/catalog/schematelemetry/schema_telemetry_event.go
line 199 at r32 (raw file):
}

func redactTableDescriptor(d *descpb.TableDescriptor) (errs []error) {
My reading of the winds is that it'd be desirable at some future point to redact constants as opposed to literally everything. Can you leave a TODO or file a follow-up issue?
I'm still not sure what's up with CI; the failures are consistent, but not the same in bazel vs make...
If the beta gives cause for worry I won't have any problems reverting this PR if it's what it takes.
I hope so, but don't trust that this is always going to be the case.
I'll file a follow-up issue as soon as this is merged.
If the beta gives cause for worry I won't have any problems reverting this PR if it's what it takes.
I don't understand this comment. I'm suggesting that if we make the default some variation of the following, maybe by making the random choices deterministic based on cluster ID, then it'd be weekly but uniformly spread out.
fmt.Sprintf("%d %d * * %d", rand.Intn(60), rand.Intn(24), rand.Intn(7))
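A self-contained sketch of that idea, deriving the random choices deterministically from the cluster ID so the slot is stable per cluster but uniformly spread across clusters. The uuid package and helper name here are illustrative, not what the PR would actually use.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math/rand"

	"github.com/google/uuid"
)

// weeklyCrontabFor derives a deterministic "minute hour * * weekday" schedule
// from the cluster ID: same cluster, same slot; different clusters, spread out.
func weeklyCrontabFor(clusterID uuid.UUID) string {
	h := fnv.New64a()
	h.Write(clusterID[:])
	rng := rand.New(rand.NewSource(int64(h.Sum64())))
	return fmt.Sprintf("%d %d * * %d", rng.Intn(60), rng.Intn(24), rng.Intn(7))
}

func main() {
	// Example with an arbitrary cluster ID.
	fmt.Println(weeklyCrontabFor(uuid.MustParse("8a9f59a2-3c26-4e9e-b0aa-5c5ae33d5bd2")))
}
```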
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @andreimatei, @herkolategan, @knz, and @postamar)
We've spoken offline to leave that as a follow-up
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner, @andreimatei, @herkolategan, @knz, and @postamar)
👍 We definitely need to be cleverer about when to collect the snapshot, there's no doubt about that.
bors r+
Build failed (retrying...):
Build failed (retrying...):
Build succeeded: