Add a guide to metrics for monitoring Teleport #46645

ptgott · 2024-09-16T17:26:57Z

This change turns the Metrics guide in admin-guides into a conceptual guide to the most important metrics for monitoring a Teleport cluster.

Since Agent metrics have inconsistent comprehensiveness across Teleport services--and to reduce the scope of this change--this guide focuses on self-hosted clusters.

To make this a conceptual guide instead of a reference, this change removes the reference table from the admin-guides metrics page. There is already a table in the dedicated metrics reference guide.

Note that, while the new metrics guide is specific to self-hosted clusters, this change does not move the guide to the subsection of Admin Guides related to self-hosting Teleport. Doing this would mean having one subsection of Admin Guides for diagnostics-related guides and one subsection for self-hosted-specific diagnostics, which is potentially confusing. We may also want to add Agent-specific metrics eventually.

Finally, this change does not include alert thresholds for the metrics it describes. We can define these in a subsequent change.

github-actions · 2024-09-16T17:37:33Z

🤖 Vercel preview here: https://docs-hwwmn4qux-goteleport.vercel.app/docs/ver/preview

github-actions · 2024-09-17T17:50:22Z

🤖 Vercel preview here: https://docs-3vr59qbq9-goteleport.vercel.app/docs/ver/preview

mmcallister · 2024-09-18T07:11:20Z

docs/pages/admin-guides/management/diagnostics/metrics.mdx

+The backend throughput metrics discussed in the previous section map on to
+latency metrics. Whenever the Auth Service increments one of the throughput
+metrics, it reports one of the corresponding latency metrics. See the table
+below for which throughput metrics miap to which latency metrics. Each metric


Suggested change

-below for which throughput metrics miap to which latency metrics. Each metric
+below for which throughput metrics map to which latency metrics. Each metric

Fixed in b6b8d4f

github-actions · 2024-09-23T18:33:25Z

🤖 Vercel preview here: https://docs-et5246prw-goteleport.vercel.app/docs/ver/preview

github-actions · 2024-09-27T17:38:04Z

🤖 Vercel preview here: https://docs-civ0o1ngr-goteleport.vercel.app/docs/ver/preview

github-actions · 2024-10-02T15:54:53Z

🤖 Vercel preview here: https://docs-g60stgyxr-goteleport.vercel.app/docs/ver/preview

ptgott · 2024-10-03T20:23:12Z

@evanfreed I've added new information based on your feedback. Checking to make sure it's accurate. Thanks!

evanfreed

Looks good, one spelling find

docs/pages/admin-guides/management/diagnostics/metrics.mdx

Closes #40664

Respond to evanfreed feedback

- Describe `backend_write_requests_failed_precondition_total`.
- Include the precondition metric in the write availability formula.
- Turn the `registered_servers` discussion into a discussion of Teleport instance version, since it's not possible to group this metric by service and subtract the count of Auth Service/Proxy Service instances from the count of all registered services.

github-actions · 2024-10-09T16:06:25Z

🤖 Vercel preview here: https://docs-qjwtm0d7k-goteleport.vercel.app/docs/ver/preview

github-actions · 2024-10-09T16:06:47Z

🤖 Vercel preview here: https://docs-jj68b6zag-goteleport.vercel.app/docs/ver/preview

public-teleport-github-review-bot · 2024-10-09T18:42:02Z

@ptgott See the table below for backport results.

| Branch | Result |
| --- | --- |
| branch/v14 | Create PR |
| branch/v15 | Create PR |
| branch/v16 | Create PR |

ptgott temporarily deployed to vercel September 16, 2024 17:27 — with GitHub Actions Inactive

ptgott added the no-changelog, backport/branch/v14, backport/branch/v15, and backport/branch/v16 labels Sep 16, 2024

ptgott requested a review from jimbishopp September 16, 2024 17:27

github-actions bot added the documentation and size/md labels Sep 16, 2024

github-actions bot requested review from mmcallister, r0mant, xinding33 and zmb3 September 16, 2024 17:27

ptgott force-pushed the paul.gottschling/40664-monitoring branch from 7f35bb1 to 7443fd8 Compare September 17, 2024 17:39

ptgott temporarily deployed to vercel September 17, 2024 17:39 — with GitHub Actions Inactive

mmcallister reviewed Sep 18, 2024

mmcallister approved these changes Sep 18, 2024

ptgott force-pushed the paul.gottschling/40664-monitoring branch from 7443fd8 to b6b8d4f Compare September 18, 2024 20:22

ptgott had a problem deploying to vercel September 18, 2024 20:23 — with GitHub Actions Failure

mmcallister approved these changes Sep 19, 2024

strideynet self-requested a review September 19, 2024 22:41

ptgott force-pushed the paul.gottschling/40664-monitoring branch from b6b8d4f to 6a52832 Compare September 23, 2024 18:22

ptgott temporarily deployed to vercel September 23, 2024 18:22 — with GitHub Actions Inactive

ptgott force-pushed the paul.gottschling/40664-monitoring branch from 6a52832 to 6f602be Compare September 27, 2024 17:27

ptgott temporarily deployed to vercel September 27, 2024 17:27 — with GitHub Actions Inactive

ptgott force-pushed the paul.gottschling/40664-monitoring branch from 6f602be to 4d69a4f Compare September 30, 2024 13:19

ptgott temporarily deployed to vercel September 30, 2024 13:19 — with GitHub Actions Inactive

ptgott temporarily deployed to vercel October 2, 2024 15:44 — with GitHub Actions Inactive

ptgott requested a review from evanfreed October 2, 2024 15:44

evanfreed approved these changes Oct 8, 2024

docs/pages/admin-guides/management/diagnostics/metrics.mdx (outdated)

public-teleport-github-review-bot bot removed request for r0mant, jimbishopp, zmb3, strideynet and xinding33 October 8, 2024 22:22

ptgott force-pushed the paul.gottschling/40664-monitoring branch from c888cf2 to dd0d4f9 Compare October 9, 2024 15:56

ptgott temporarily deployed to vercel October 9, 2024 15:56 — with GitHub Actions Inactive

ptgott enabled auto-merge October 9, 2024 15:56

ptgott added 2 commits October 9, 2024 11:56

ptgott force-pushed the paul.gottschling/40664-monitoring branch from dd0d4f9 to da46d40 Compare October 9, 2024 15:56

ptgott temporarily deployed to vercel October 9, 2024 15:56 — with GitHub Actions Inactive

ptgott added this pull request to the merge queue Oct 9, 2024

github-merge-queue bot removed this pull request from the merge queue due to failed status checks Oct 9, 2024

ptgott added this pull request to the merge queue Oct 9, 2024

Merged via the queue into master with commit 7b38d5b Oct 9, 2024
40 checks passed

ptgott deleted the paul.gottschling/40664-monitoring branch October 9, 2024 18:39

This was referenced Oct 9, 2024

[v14] Add a guide to metrics for monitoring Teleport #47411

Merged

[v15] Add a guide to metrics for monitoring Teleport #47412

Merged

[v16] Add a guide to metrics for monitoring Teleport #47413

Merged
