
fix(log): Some improvements to the logs #537

Merged
merged 4 commits into master from log-improvements on May 16, 2024

Conversation

@james-w commented on May 14, 2024

The current logs are somewhat confusing and are missing some very useful information. This is an attempt
to clean them up in a few ways:

  • Log the number of items from each datagatherer. Each datagatherer can
    return the number of items it collected, so the logs tell us a bit more
    about what was found in the cluster and can help find where any items
    have been missed.

  • Log a more informative error message when giving up on uploading
    readings to the server. This would be the last message before
    the pod exits, so if the pod ends up in CrashLoopBackOff it's
    important to highlight this as the reason.

  • Remove the "Running Agent" log message, it's not clear what it means

  • Put [] around the body, because if the body is empty the log message
    can read as though the following message is the body, e.g. (see the
    sketch after this list):

```
2024/05/10 16:33:43 retrying in 25.555756126s after error: received response with status code 404. Body:
W0510 16:33:43.832278   10875 reflector.go:535] pkg/mod/k8s.io/client-go@v0.28.3/tools/cache/reflector.go:229: failed to list route.openshift.io/v1, Resource=routes: the server could not find the requested resource
```

  • Remove "using parent stop channel" from a log message as it's not
    clear what it means. It used to be conditional, and now it isn't,
    so it's not providing any information, and it's confusing as to
    what it might be trying to tell you if you aren't familiar with that
    code.
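
For illustration, a minimal Go sketch of the bracketed-body idea from the list above; the helper name and exact message format are mine, not the agent's actual code:

```go
package main

import "log"

// logRetryError is a hypothetical helper showing the bracketed-body idea:
// wrapping the body in [] makes an empty body show up as "Body: []" rather
// than letting the next log line read as if it were the body.
func logRetryError(delay string, status int, body []byte) {
	log.Printf("retrying in %s after error: received response with status code %d. Body: [%s]", delay, status, body)
}

func main() {
	// With a nil body this prints "... Body: []" instead of a dangling "Body:".
	logRetryError("25.555756126s", 404, nil)
}
```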

There are a couple of code cleanups included as well:

  • Fix a crash bug when the REST config can't be loaded.
    discoveryClient is left uninitialized, so the pointer can't be
    dereferenced; return an empty struct instead if there's an error
    (sketched after this list).

  • Remove the informer context in the dynamic datagatherer. The informer
    context is created but not passed to anything, so this code has no effect.
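
A sketch of the crash fix under assumed types (the real gatherer wraps a Kubernetes discovery client built from the REST config; these stand-ins are only illustrative):

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical stand-ins for the agent's types.
type discoveryClient struct{ serverVersion string }

type DiscoveryData struct{ ServerVersion string }

type gatherer struct{ client *discoveryClient }

// Fetch returns an empty struct and an error instead of dereferencing a
// nil client, which is the shape of the crash fix described above.
func (g *gatherer) Fetch() (DiscoveryData, error) {
	if g.client == nil {
		return DiscoveryData{}, errors.New("discovery client not initialized (REST config failed to load)")
	}
	return DiscoveryData{ServerVersion: g.client.serverVersion}, nil
}

func main() {
	var g gatherer // client is nil, as when kubeconfig loading fails
	if _, err := g.Fetch(); err != nil {
		fmt.Println("fetch failed:", err) // no panic, just an error
	}
}
```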

The changes are split into a few logical commits in case we want to break them up
and review them independently, as I know these might not be the best approaches
for all of these things.

@maelvls (Member) commented on May 14, 2024

Hey, I've taken a look at your PR. I went ahead with reviewing this as part of the handover effort started this morning. I'll let Olu do a final review as I don't know much about this code base.

> Log a more informative error message when giving up on uploading readings to the server. This would be the last message before the pod exits, so if the pod ends up in CrashLoopBackOff it's important to highlight this as the reason.

~~I wasn't able to reproduce the agent giving up~~ I was able to reproduce and test that the last log line is the reason why the agent stopped, by using a dummy key and waiting for the 15 min backoff to hit:

```
$ export HTTPS_PROXY=foo
$ go run . agent -c config.yaml --client-id XXXXX -k /tmp/key --venafi-cloud -p 5s 2>&1 | grep api.venafi.cloud
2024/05/14 14:26:57 Posting data to: https://api.venafi.cloud/
2024/05/14 14:26:58 retrying in 41.388139367s after error: post to server failed: failed to execute http request to VaaS. Request https://api.venafi.cloud/v1/oauth/token/serviceaccount, status code: 400, body: [{"error":"invalid_grant","error_description":"token_signature_verification_error"}
2024/05/14 14:27:39 Posting data to: https://api.venafi.cloud/
2024/05/14 14:27:40 retrying in 35.360750201s after error: post to server failed: failed to execute http request to VaaS. Request https://api.venafi.cloud/v1/oauth/token/serviceaccount, status code: 400, body: [{"error":"invalid_grant","error_description":"token_signature_verification_error"}
2024/05/14 14:28:15 Posting data to: https://api.venafi.cloud/
2024/05/14 14:28:15 retrying in 1m38.081853176s after error: post to server failed: failed to execute http request to VaaS. Request https://api.venafi.cloud/v1/oauth/token/serviceaccount, status code: 400, body: [{"error":"invalid_grant","error_description":"token_signature_verification_error"}
2024/05/14 14:29:53 Posting data to: https://api.venafi.cloud/
2024/05/14 14:29:54 retrying in 1m37.808502147s after error: post to server failed: failed to execute http request to VaaS. Request https://api.venafi.cloud/v1/oauth/token/serviceaccount, status code: 400, body: [{"error":"invalid_grant","error_description":"token_signature_verification_error"}
2024/05/14 14:31:31 Posting data to: https://api.venafi.cloud/
2024/05/14 14:31:32 retrying in 3m17.774884905s after error: post to server failed: failed to execute http request to VaaS. Request https://api.venafi.cloud/v1/oauth/token/serviceaccount, status code: 400, body: [{"error":"invalid_grant","error_description":"token_signature_verification_error"}
2024/05/14 14:34:50 Posting data to: https://api.venafi.cloud/
2024/05/14 14:34:50 retrying in 3m34.920269392s after error: post to server failed: failed to execute http request to VaaS. Request https://api.venafi.cloud/v1/oauth/token/serviceaccount, status code: 400, body: [{"error":"invalid_grant","error_description":"token_signature_verification_error"}
2024/05/14 14:38:25 Posting data to: https://api.venafi.cloud/
2024/05/14 14:38:25 Exiting due to fatal error uploading: post to server failed: failed to execute http request to VaaS. Request https://api.venafi.cloud/v1/oauth/token/serviceaccount, status code: 400, body: [{"error":"invalid_grant","error_description":"token_signature_verification_error"}
```

(I've hidden all the useless messages about missing resources)
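
For reference, a minimal Go sketch of the retry-then-exit behaviour seen in the session above; the function names and backoff policy are my assumptions, not the agent's exact implementation:

```go
package main

import (
	"errors"
	"log"
	"time"
)

// post stands in for the real upload call; here it always fails.
func post() error {
	return errors.New("post to server failed: ...")
}

// postWithRetry sketches the behaviour shown above: retry with growing
// delays, and once the budget is spent, make the very last log line name
// the upload error before the process exits.
func postWithRetry(budget time.Duration) {
	deadline := time.Now().Add(budget)
	delay := 30 * time.Second
	for {
		err := post()
		if err == nil {
			return
		}
		if time.Now().After(deadline) {
			// The line a CrashLoopBackOff investigation should land on first.
			log.Fatalf("Exiting due to fatal error uploading: %v", err)
		}
		log.Printf("retrying in %s after error: %v", delay, err)
		time.Sleep(delay)
		delay *= 2 // simple doubling; the agent's real policy may differ
	}
}

func main() {
	postWithRetry(15 * time.Minute) // the 15 min backoff mentioned in this thread
}
```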

About the REST config bug, well spotted. I wasn't able to reproduce the panic (if that's how you spotted this).

Anyways, I think all the changes you made are sensible.

@dbarranco (Member) left a comment

Looks good to me, just added a couple of minor suggestions.

We should think about structuring these logs and converting them into events soon 🤔

Suggestions on pkg/client/client_venafi_cloud.go and pkg/datagatherer/k8s/dynamic.go (outdated; resolved)
@james-w (Author) commented on May 14, 2024

> Hey, I've taken a look at your PR. I went ahead with reviewing this as part of the handover effort started this morning. I'll let Olu do a final review as I don't know much about this code base.

Thanks!

> Log a more informative error message when giving up on uploading readings to the server. This would be the last message before the pod exits, so if the pod ends up in CrashLoopBackOff it's important to highlight this as the reason.

> I wasn't able to reproduce the agent giving up, or I haven't waited long enough, I guess the backoff threshold is high

I believe the default timeout is 15m before it gives up.

> About the REST config bug, well spotted. I wasn't able to reproduce the panic (if that's how you spotted this).

It was. Running locally without a kubeconfig file should reproduce.

> Anyways, I think all the changes you made are sensible.

Thanks.

@tfadeyi (Contributor) left a comment

LGTM 👍

James Westby added 4 commits on May 15, 2024:
  * Log a more informative error message when giving up on uploading
    readings to the server. This would be the last message before
    the pod exits, so if the pod ends up in CrashLoopBackOff it's
    important to highlight this as the reason.

  * Remove the "Running Agent" log message, it's not clear what it means

  * Put [] around the body, because if the body is empty the log message
    can read as though the following message is the body, e.g.:

```
2024/05/10 16:33:43 retrying in 25.555756126s after error: received response with status code 404. Body:
W0510 16:33:43.832278   10875 reflector.go:535] pkg/mod/k8s.io/client-go@v0.28.3/tools/cache/reflector.go:229: failed to list route.openshift.io/v1, Resource=routes: the server could not find the requested resource
```

  * Remove "using parent stop channel" from a log message as it's not
    clear what it means. It used to be conditional, and now it isn't,
    so it's not providing any information, and it's confusing as to
    what it might be trying to tell you if you aren't familiar with that
    code.

`discoveryClient` is uninitialized, so the pointer can't be dereferenced;
return an empty struct if there's an error.

The informer context is created, but not passed to anything, so this
code doesn't do anything.

This returns the number of items collected by each datagatherer so that
the logs tell us a bit more about what was found in the cluster, and
can help find where any items have been missed.
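
A rough Go sketch of that interface change under an assumed signature (the repo's actual DataGatherer interface may differ): Fetch also returns how many items it collected, and the caller logs the count per gatherer.

```go
package main

import "log"

// DataGatherer is an assumed shape of the interface after this change:
// Fetch also returns the number of items it collected.
type DataGatherer interface {
	Fetch() (data any, count int, err error)
}

// staticGatherer is a toy gatherer used only to exercise the interface.
type staticGatherer struct{ items []string }

func (s staticGatherer) Fetch() (any, int, error) {
	return s.items, len(s.items), nil
}

func main() {
	gatherers := map[string]DataGatherer{
		"k8s/pods": staticGatherer{items: []string{"pod/a", "pod/b"}},
	}
	for name, dg := range gatherers {
		_, count, err := dg.Fetch()
		if err != nil {
			log.Printf("error fetching with datagatherer %q: %v", name, err)
			continue
		}
		log.Printf("gathered %d items from %q datagatherer", count, name)
	}
}
```
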
@james-w force-pushed the log-improvements branch from 90d5906 to e84e2e2 on May 15, 2024
@james-w (Author) commented on May 15, 2024

Thanks all, I pushed an update with the suggested changes.

@tfadeyi (Contributor) commented on May 16, 2024

Thank you 👍 LGTM, I'll merge the changes to master

@tfadeyi changed the title from "Some improvements to the logs" to "fix(log): Some improvements to the logs" on May 16, 2024
@tfadeyi merged commit be5fdba into master on May 16, 2024
8 checks passed
@inteon deleted the log-improvements branch on September 16, 2024