
fix(log): Some improvements to the logs #537

Merged
merged 4 commits into master from log-improvements on May 16, 2024

Conversation

@james-w commented on May 14, 2024

The current logs are somewhat confusing and are missing some very useful information. This is an attempt
to clean them up in a few ways:

  • Log the number of items from each datagatherer. Each datagatherer can
    return the number of items it collected, so the logs tell us a bit more
    about what was found in the cluster and can help find where any items
    have been missed.

  • Log a more informative error message when giving up on uploading
    readings to the server. This would be the last message before
    the pod exits, so if the pod ends up in CrashLoopBackOff it's
    important to highlight this as the reason.

  • Remove the "Running Agent" log message, it's not clear what it means

  • Put [] around the body, because if the body is empty the log message
    can read as though the following message is the body, e.g. (see the
    sketch after this list):

```
2024/05/10 16:33:43 retrying in 25.555756126s after error: received response with status code 404. Body:
W0510 16:33:43.832278   10875 reflector.go:535] pkg/mod/k8s.io/client-go@v0.28.3/tools/cache/reflector.go:229: failed to list route.openshift.io/v1, Resource=routes: the server could not find the requested resource
```

  • Remove "using parent stop channel" from a log message as it's not
    clear what it means. It used to be conditional, and now it isn't,
    so it's not providing any information, and it's confusing as to
    what it might be trying to tell you if you aren't familiar with that
    code.
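
For illustration, a minimal Go sketch of the bracketed-body idea from the list above; the helper name and exact message format are mine, not the agent's actual code:

```go
package main

import "log"

// logRetryError is a hypothetical helper showing the bracketed-body idea:
// wrapping the body in [] makes an empty body show up as "Body: []" rather
// than letting the next log line read as if it were the body.
func logRetryError(delay string, status int, body []byte) {
	log.Printf("retrying in %s after error: received response with status code %d. Body: [%s]", delay, status, body)
}

func main() {
	// With a nil body this prints "... Body: []" instead of a dangling "Body:".
	logRetryError("25.555756126s", 404, nil)
}
```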

There are a couple of code cleanups included as well:

  • Fix a crash bug when the REST config can't be loaded.
    discoveryClient is left uninitialized, so the pointer can't be
    dereferenced; return an empty struct instead if there's an error
    (sketched after this list).

  • Remove the informer context in the dynamic datagatherer. The informer
    context is created but not passed to anything, so this code has no effect.
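
A sketch of the crash fix under assumed types (the real gatherer wraps a Kubernetes discovery client built from the REST config; these stand-ins are only illustrative):

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical stand-ins for the agent's types.
type discoveryClient struct{ serverVersion string }

type DiscoveryData struct{ ServerVersion string }

type gatherer struct{ client *discoveryClient }

// Fetch returns an empty struct and an error instead of dereferencing a
// nil client, which is the shape of the crash fix described above.
func (g *gatherer) Fetch() (DiscoveryData, error) {
	if g.client == nil {
		return DiscoveryData{}, errors.New("discovery client not initialized (REST config failed to load)")
	}
	return DiscoveryData{ServerVersion: g.client.serverVersion}, nil
}

func main() {
	var g gatherer // client is nil, as when kubeconfig loading fails
	if _, err := g.Fetch(); err != nil {
		fmt.Println("fetch failed:", err) // no panic, just an error
	}
}
```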

The changes are split into a few logical commits in case we want to break them up
and review them independently, as I know these might not be the best approaches
for all of these things.

@maelvls (Member) commented on May 14, 2024

Hey, I've taken a look at your PR. I went ahead with reviewing this as part of the handover effort started this morning. I'll let Olu do a final review as I don't know much about this code base.

> Log a more informative error message when giving up on uploading readings to the server. This would be the last message before the pod exits, so if the pod ends up in CrashLoopBackOff it's important to highlight this as the reason.

~~I wasn't able to reproduce the agent giving up~~ I was able to reproduce and test that the last log line is the reason why the agent stopped, by using a dummy key and waiting for the 15 min backoff to hit:

```
$ export HTTPS_PROXY=foo
$ go run . agent -c config.yaml --client-id XXXXX -k /tmp/key --venafi-cloud -p 5s 2>&1 | grep api.venafi.cloud
2024/05/14 14:26:57 Posting data to: https://api.venafi.cloud/
2024/05/14 14:26:58 retrying in 41.388139367s after error: post to server failed: failed to execute http request to VaaS. Request https://api.venafi.cloud/v1/oauth/token/serviceaccount, status code: 400, body: [{"error":"invalid_grant","error_description":"token_signature_verification_error"}
2024/05/14 14:27:39 Posting data to: https://api.venafi.cloud/
2024/05/14 14:27:40 retrying in 35.360750201s after error: post to server failed: failed to execute http request to VaaS. Request https://api.venafi.cloud/v1/oauth/token/serviceaccount, status code: 400, body: [{"error":"invalid_grant","error_description":"token_signature_verification_error"}
2024/05/14 14:28:15 Posting data to: https://api.venafi.cloud/
2024/05/14 14:28:15 retrying in 1m38.081853176s after error: post to server failed: failed to execute http request to VaaS. Request https://api.venafi.cloud/v1/oauth/token/serviceaccount, status code: 400, body: [{"error":"invalid_grant","error_description":"token_signature_verification_error"}
2024/05/14 14:29:53 Posting data to: https://api.venafi.cloud/
2024/05/14 14:29:54 retrying in 1m37.808502147s after error: post to server failed: failed to execute http request to VaaS. Request https://api.venafi.cloud/v1/oauth/token/serviceaccount, status code: 400, body: [{"error":"invalid_grant","error_description":"token_signature_verification_error"}
2024/05/14 14:31:31 Posting data to: https://api.venafi.cloud/
2024/05/14 14:31:32 retrying in 3m17.774884905s after error: post to server failed: failed to execute http request to VaaS. Request https://api.venafi.cloud/v1/oauth/token/serviceaccount, status code: 400, body: [{"error":"invalid_grant","error_description":"token_signature_verification_error"}
2024/05/14 14:34:50 Posting data to: https://api.venafi.cloud/
2024/05/14 14:34:50 retrying in 3m34.920269392s after error: post to server failed: failed to execute http request to VaaS. Request https://api.venafi.cloud/v1/oauth/token/serviceaccount, status code: 400, body: [{"error":"invalid_grant","error_description":"token_signature_verification_error"}
2024/05/14 14:38:25 Posting data to: https://api.venafi.cloud/
2024/05/14 14:38:25 Exiting due to fatal error uploading: post to server failed: failed to execute http request to VaaS. Request https://api.venafi.cloud/v1/oauth/token/serviceaccount, status code: 400, body: [{"error":"invalid_grant","error_description":"token_signature_verification_error"}
```

(I've hidden all the useless messages about missing resources)
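
For reference, a minimal Go sketch of the retry-then-exit behaviour seen in the session above; the function names and backoff policy are my assumptions, not the agent's exact implementation:

```go
package main

import (
	"errors"
	"log"
	"time"
)

// post stands in for the real upload call; here it always fails.
func post() error {
	return errors.New("post to server failed: ...")
}

// postWithRetry sketches the behaviour shown above: retry with growing
// delays, and once the budget is spent, make the very last log line name
// the upload error before the process exits.
func postWithRetry(budget time.Duration) {
	deadline := time.Now().Add(budget)
	delay := 30 * time.Second
	for {
		err := post()
		if err == nil {
			return
		}
		if time.Now().After(deadline) {
			// The line a CrashLoopBackOff investigation should land on first.
			log.Fatalf("Exiting due to fatal error uploading: %v", err)
		}
		log.Printf("retrying in %s after error: %v", delay, err)
		time.Sleep(delay)
		delay *= 2 // simple doubling; the agent's real policy may differ
	}
}

func main() {
	postWithRetry(15 * time.Minute) // the 15 min backoff mentioned in this thread
}
```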

About the REST config bug, well spotted. I wasn't able to reproduce the panic (if that's how you spotted this).

Anyways, I think all the changes you made are sensible.

@dbarranco (Member) left a comment

Looks good to me, just added a couple of minor suggestions.

We should think about structuring these logs and converting them into events soon 🤔

Suggestions on pkg/client/client_venafi_cloud.go and pkg/datagatherer/k8s/dynamic.go (outdated; resolved)
@james-w (Author) commented on May 14, 2024

> Hey, I've taken a look at your PR. I went ahead with reviewing this as part of the handover effort started this morning. I'll let Olu do a final review as I don't know much about this code base.

Thanks!

> Log a more informative error message when giving up on uploading readings to the server. This would be the last message before the pod exits, so if the pod ends up in CrashLoopBackOff it's important to highlight this as the reason.

> I wasn't able to reproduce the agent giving up, or I haven't waited long enough, I guess the backoff threshold is high

I believe the default timeout is 15m before it gives up.

> About the REST config bug, well spotted. I wasn't able to reproduce the panic (if that's how you spotted this).

It was. Running locally without a kubeconfig file should reproduce.

> Anyways, I think all the changes you made are sensible.

Thanks.

@tfadeyi (Contributor) left a comment

LGTM 👍

James Westby added 4 commits on May 15, 2024:
  * Log a more informative error message when giving up on uploading
    readings to the server. This would be the last message before
    the pod exits, so if the pod ends up in CrashLoopBackOff it's
    important to highlight this as the reason.

  * Remove the "Running Agent" log message, it's not clear what it means

  * Put [] around the body, because if the body is empty the log message
    can read as though the following message is the body, e.g.:

```
2024/05/10 16:33:43 retrying in 25.555756126s after error: received response with status code 404. Body:
W0510 16:33:43.832278   10875 reflector.go:535] pkg/mod/k8s.io/client-go@v0.28.3/tools/cache/reflector.go:229: failed to list route.openshift.io/v1, Resource=routes: the server could not find the requested resource
```

  * Remove "using parent stop channel" from a log message as it's not
    clear what it means. It used to be conditional, and now it isn't,
    so it's not providing any information, and it's confusing as to
    what it might be trying to tell you if you aren't familiar with that
    code.

`discoveryClient` is uninitialized, so the pointer can't be dereferenced;
return an empty struct if there's an error.

The informer context is created, but not passed to anything, so this
code doesn't do anything.

This returns the number of items collected by each datagatherer so that
the logs tell us a bit more about what was found in the cluster, and
can help find where any items have been missed.
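
A rough Go sketch of that interface change under an assumed signature (the repo's actual DataGatherer interface may differ): Fetch also returns how many items it collected, and the caller logs the count per gatherer.

```go
package main

import "log"

// DataGatherer is an assumed shape of the interface after this change:
// Fetch also returns the number of items it collected.
type DataGatherer interface {
	Fetch() (data any, count int, err error)
}

// staticGatherer is a toy gatherer used only to exercise the interface.
type staticGatherer struct{ items []string }

func (s staticGatherer) Fetch() (any, int, error) {
	return s.items, len(s.items), nil
}

func main() {
	gatherers := map[string]DataGatherer{
		"k8s/pods": staticGatherer{items: []string{"pod/a", "pod/b"}},
	}
	for name, dg := range gatherers {
		_, count, err := dg.Fetch()
		if err != nil {
			log.Printf("error fetching with datagatherer %q: %v", name, err)
			continue
		}
		log.Printf("gathered %d items from %q datagatherer", count, name)
	}
}
```
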
@james-w force-pushed the log-improvements branch from 90d5906 to e84e2e2 on May 15, 2024
@james-w (Author) commented on May 15, 2024

Thanks all, I pushed an update with the suggested changes.

@tfadeyi (Contributor) commented on May 16, 2024

Thank you 👍 LGTM, I'll merge the changes to master

@tfadeyi changed the title from "Some improvements to the logs" to "fix(log): Some improvements to the logs" on May 16, 2024
@tfadeyi merged commit be5fdba into master on May 16, 2024
8 checks passed
@inteon deleted the log-improvements branch on September 16, 2024