Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Collect system info and metrics for Azure #591

Merged
merged 19 commits into from
Sep 27, 2024
Merged

Collect system info and metrics for Azure #591

merged 19 commits into from
Sep 27, 2024

Conversation

keiko713
Copy link
Contributor

Collect system info and metrics for Azure Flexible Server and Cosmos DB for PostgreSQL. Note that we won't support this for Single Server as it's close to EOL.

This uses 3 different SDK to collect information:

  • Flexible Server SDK: collect database server information, such as location
  • Cosmos DB for PostgreSQL SDK: collect database server and cluster information
  • Azure Monitor Query client module: collect system metrics, like CPU usage

To start using this, the customer needs to supply a new config variable AZURE_SUBSCRIPTION_ID to the collector. Also, they'll need to assign the managed identity to the collector VM (if not already with Log Insights), then grant the Read permission of the database to that managed identity. This requires docs update, likely including some setup process change.

When AZURE_SUBSCRIPTION_ID is not supplied, the collector doesn't error out and simply won't collect the system info.

input/system/azure/system.go Show resolved Hide resolved
input/system/azure/system.go Show resolved Hide resolved
return
}

// Server metrics
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Starting from here, it's getting the metrics. Available metrics are different between Cosmos DB and Flexible Server, but we can use the same logic here.

input/system/azure/system.go Show resolved Hide resolved
input/system/azure/system.go Show resolved Hide resolved
input/system/azure/system.go Show resolved Hide resolved
} else if system.Info.Azure.CoordinatorStorageMB != 0 {
// Cosmos DB
// TODO: check if we need to sum up the node storage MB too
// (e.g. system.Info.Azure.NodeStorageMB * system.Info.Azure.NodeCount)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lfittl Do you know the answer for this, by any chance? I haven't tried it since making a node with Cosmos DB is really expensive 🙈

Copy link
Member

@lfittl lfittl Sep 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My assumption is no, i.e. we don't need to consider the worker node storage as part of the coordinator's storage (they can be considered separate servers, and my assumption is that Azure Monitor reports the per-server metrics, not anything across the whole cluster).

input/system/azure/system.go Show resolved Hide resolved
util/null.go Show resolved Hide resolved
@keiko713 keiko713 marked this pull request as ready for review August 29, 2024 09:31
@keiko713 keiko713 requested a review from a team September 3, 2024 23:10
Copy link
Contributor

@msakrejda msakrejda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some questions, and did not actually try this out, but looks good.

input/system/azure/system.go Outdated Show resolved Hide resolved
input/system/azure/system.go Show resolved Hide resolved
input/system/azure/system.go Show resolved Hide resolved
input/system/azure/system.go Show resolved Hide resolved
input/system/azure/system.go Show resolved Hide resolved
@keiko713
Copy link
Contributor Author

Redeployed this version of collector to test instances, looking good including tags:
image

This is ready for re-review!
cc/ @lfittl


if resourceID == "" {
server.SelfTest.MarkCollectionAspectWarning(state.CollectionAspectSystemStats, "unable to find the database server info")
server.SelfTest.HintCollectionAspect(state.CollectionAspectSystemStats, "Make sure the Monitoring Reader permission of the database is granted to the managed identity.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this need to be moved to the earlier parts where we actually make the list API calls? (since we return early on error) -- or do we just not get any results if the permissions are insufficient?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to close the loop: I added the comment above, but this can be reached when the managed identity is assigned to some db but not this db, so I think the wording is still accurate. When it's not assigned to any db resources, it'll raise an error when we do page, err := pager.NextPage(ctx) earlier in line 46.

I tweaked that part of code, to make sure that it'll early return when it receives errors (and show the same hint). It'll look like the following. It's not super good looking sadly (I think %v is just showing too much), but I think it does show good info for debugging so I'd say it's a good start.

image

@@ -102,6 +102,7 @@ func getCollectorConfig(c config.ServerConfig) state.CollectorConfig {
AzureAdTenantId: c.AzureADTenantID,
AzureAdClientId: c.AzureADClientID,
AzureHasAdCertificate: c.AzureADCertificatePath != "",
AzureSubscriptionID: c.AzureSubscriptionID,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This also needs to be emitted on output

// value is total of 1min
diffedNetworkStats.ReceiveThroughputBytesPerSecond = uint64(*metricValue.Average / 60)
case "read_iops":
diffedDiskStats.ReadOperationsPerSecond = *metricValue.Average
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sentry reported a nil pointer dereference here https://sentry.io/issues/5910271246

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah it's already fixed actually, but thanks for the ping!

@keiko713
Copy link
Contributor Author

Addressed Lukas's feedback, also double checked that this works with both Flexible Server and Cosmos DB. Merging.

@keiko713 keiko713 merged commit fb3f7b8 into main Sep 27, 2024
3 checks passed
@keiko713 keiko713 deleted the azure-metrics branch September 27, 2024 00:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants