Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

*: new logging channels OPS and HEALTH #57171

Merged
merged 3 commits into from
Dec 16, 2020
Merged

Conversation

knz
Copy link
Contributor

@knz knz commented Nov 26, 2020

These two commits introduce the OPS and HEALTH channel.

See the individual commits for details.

@knz knz requested a review from itsbilal November 26, 2020 11:36
@knz knz requested a review from a team as a code owner November 26, 2020 11:36
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@knz
Copy link
Contributor Author

knz commented Dec 3, 2020

Rebased, RFAL

@knz
Copy link
Contributor Author

knz commented Dec 14, 2020

I have rebased this off master so as to not make it dependent on #57170. Hopefully this makes it easier to review. @itsbilal can you have a look? Thanks

@knz knz force-pushed the 20201126-log-ops branch 2 times, most recently from 0d6fb5d to f7e9558 Compare December 14, 2020 16:07
Copy link
Member

@itsbilal itsbilal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: save for some minor comments

Reviewed 15 of 15 files at r1, 48 of 48 files at r2, 4 of 4 files at r3, 8 of 8 files at r4.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @knz)


pkg/gossip/gossip.go, line 495 at r2 (raw file):

	if newResolverFound {
		if log.V(1) {
			log.Health.Infof(ctx, "found new resolvers from storage; signaling bootstrap")

This seems more like an Ops event to me?


pkg/server/status/runtime.go, line 452 at r2 (raw file):

	mem := gosigar.ProcMem{}
	if err := mem.Get(pid); err != nil {
		log.Health.Errorf(ctx, "unable to get mem usage: %v", err)

Also seems like more of an Ops error, as we can probably stay healthy without this? (Same for all the other calls in this method). I don't feel strongly either way though.


pkg/util/log/logconfig/testdata/yaml, line 99 at r2 (raw file):

----
ERROR: yaml: unmarshal errors:
  line 26: field fluent-servers not found in type logconfig.SinkConfig

Might want to remove the fluent-servers part from this PR.

Copy link
Contributor Author

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @itsbilal)


pkg/gossip/gossip.go, line 495 at r2 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

This seems more like an Ops event to me?

yep, good catch


pkg/server/status/runtime.go, line 452 at r2 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

Also seems like more of an Ops error, as we can probably stay healthy without this? (Same for all the other calls in this method). I don't feel strongly either way though.

All good points. Done.


pkg/util/log/logconfig/testdata/yaml, line 99 at r2 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

Might want to remove the fluent-servers part from this PR.

Good point. Done.

@knz
Copy link
Contributor Author

knz commented Dec 15, 2020

TFYR!

bors r=itsbilal

@craig
Copy link
Contributor

craig bot commented Dec 15, 2020

Merge conflict.

Release note (cli change): Logging events that are relevant to cluster
operators are now categorized under the new OPS and HEALTH logging
channels. These can now be redirected separately from other logging
events.

The OPS channel is the channel used to report "point" operational events,
initiated by user operators or automation:

- operator or system actions on server processes: process starts,
  stops, shutdowns, crashes (if they can be logged),
  including each time: command-line parameters, current version being run.
- actions that impact the topology of a cluster: node additions,
  removals, decommissions, etc.
- cluster setting changes.
- zone configuration changes.

The HEALTH channel is the channel used to report "background" operational
events, initiated by CockroachDB or reporting on automatic processes:

- current resource usage, including critical resource usage.
- node-node connection events, including connection errors and
  gossip details.
- range and table leasing events.
- up-, down-replication; range unavailability.
Release note (cli change): Server terminations that are triggered when
a node encounters an internal fatal error are now reported on the OPS
channel. The exact text of the error is not reported on the OPS
channel however, as it may be complex (e.g. when there is a replica
inconsistency) and the OPS channels is typically monitored by tools
that just detect irregularities. The text of the message refers
instead to the channel where the additional details can be found.
Zone config changes interest the DBAs of the cluster as a whole more
than they interest the DBA of the individual SQL application.

Release note (cli change): The notable events `set_zone_config` and
`remove_zone_config` are now sent to the OPS channel.
@knz
Copy link
Contributor Author

knz commented Dec 16, 2020

bors r=itsbilal

@craig
Copy link
Contributor

craig bot commented Dec 16, 2020

Build succeeded:

@craig craig bot merged commit 265122d into cockroachdb:master Dec 16, 2020
@knz knz deleted the 20201126-log-ops branch December 16, 2020 12:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants