Investigate and remediate certificate renewal failure #57

jcscottiii · 2022-07-08T16:59:31Z

@foolip reported that the certificate for wpt.live was getting too close to expiration.

Problem

Diagnosis

Screenshot 1 - CPU Pegged

Screenshot 2 - No space left on the device

Screenshot 3 - Bucket has not been touched in a while

Screenshot 4 - Logs do not indicate anything wrong

Screenshot 5 - Unable to log into the instance

Summary

CPU was pegged. Some process must have gotten out of control, since the cert renewal is only a cert bot script that runs once a day. Something is also causing the server to use up all of its disk space. As a result, I could not log in to do further diagnosis. The instance needs to be restarted or recreated.
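For future diagnosis, a check along these lines could report how full the disk is before the instance becomes unreachable. This is only a minimal sketch: the function names and the 90% threshold are assumptions for illustration, not something that exists in the repo.

```python
import shutil


def disk_usage_percent(path: str = "/") -> float:
    """Return the percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (in bytes)
    return 100.0 * usage.used / usage.total


def is_disk_nearly_full(path: str = "/", threshold_percent: float = 90.0) -> bool:
    """Return True when usage is at or above the (hypothetical) threshold."""
    return disk_usage_percent(path) >= threshold_percent


if __name__ == "__main__":
    print(f"/ is {disk_usage_percent('/'):.1f}% full")
```

Running something like this from the daily script and logging the result would at least show whether usage grows steadily or jumps suddenly.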

Remediation Steps

Restart the instance
- Upon restarting, ran into a problem where us-central1 had no capacity left for the f1-micro instance type
- I attempted to recreate the instance in every zone in us-central1 and got the same error.
- The error persisted across two days
Find a new instance type that will successfully provision: e2-micro
Discuss the increase in billing with @past
Verify that the certificate is indeed updated
~~Create a PR with the new instance type to the terraform to reflect the new state of infrastructure~~ No longer needed
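The "verify that the certificate is indeed updated" step can be scripted. Below is a minimal sketch using only the Python standard library; the function names are made up for illustration. Note that the default TLS context rejects a certificate that has already expired, so in that case the connection raises an error instead of returning a date.

```python
import socket
import ssl


def seconds_until_expiry(not_after: str, now: float) -> float:
    """Seconds between `now` (epoch seconds) and a certificate's notAfter time.

    `not_after` uses the format returned by getpeercert(),
    e.g. "Jan  1 00:00:00 2024 GMT".
    """
    return ssl.cert_time_to_seconds(not_after) - now


def fetch_not_after(host: str, port: int = 443, timeout: float = 10.0) -> str:
    """Connect to host:port over TLS and return the certificate's notAfter string."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


# Example usage (requires network access):
#   import time
#   days = seconds_until_expiry(fetch_not_after("wpt.live"), time.time()) / 86400
#   print(f"certificate expires in {days:.1f} days")
```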

I will close this issue after finishing these steps.

Other recommendations

If we want to save money, move the cert renewal to a Cloud Run instance that starts up, runs, and terminates on a cron schedule. This will remove the need to keep an instance running constantly.

cc: @DanielRyanSmith

DanielRyanSmith · 2022-07-08T17:48:37Z

Great write-up @jcscottiii, thanks for looking into this 😊 The thorough documentation and info is a big help to understand the problem well.

Edit: Also just learning now that adding these checkboxes to an issue tracks the number of steps that need to be taken on the issue - very cool!

past · 2022-07-10T08:34:03Z

There is some investigation of an earlier instance of the disk running out of space in #35 that could be related.

jcscottiii · 2023-02-02T14:39:48Z

There was another failure to renew the cert recently. I did not check the serial console for the out-of-space error before restarting, but I suspect that it is the same problem. I have added some alerts. If this happens again, we should migrate to the Cloud Run instance described in the "Other recommendations" section above. That change would be easier than chasing down what is causing the instance to eventually run out of disk space.

past · 2023-06-13T02:23:57Z

There was another such failure over the weekend and I confirmed that it was the same issue. After restarting the service the cert was renewed again.

Previously, the cert-renewal process was a long-running instance that ran the script once and then slept. The problem is that this sleep eventually breaks and the instance never recovers. This migrates the job to a Cloud Run job that runs the script once and exits. The sleep is now handled by a cron schedule, which ensures the instance is always fresh. Fixes #57 Fixes #77
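The run-once pattern described above can be sketched roughly as follows. This is not the actual code from the PR: `RENEW_COMMAND` is a hypothetical placeholder for the real renewal script, and the scheduling is assumed to live entirely outside the process.

```python
import subprocess
import sys
from typing import Sequence

# Hypothetical placeholder; the real job would invoke the renewal script here.
RENEW_COMMAND = [sys.executable, "-c", "print('renewal placeholder')"]


def run_once(command: Sequence[str]) -> int:
    """Run the command a single time and return its exit code.

    There is no loop and no sleep: the external cron schedule decides when
    the next run happens, and each run starts in a fresh environment.
    """
    result = subprocess.run(list(command), check=False)
    return result.returncode


# Entry point for the job would be: raise SystemExit(run_once(RENEW_COMMAND))
```

Propagating the exit code matters here: a non-zero exit lets the scheduler record the run as failed instead of the failure going unnoticed inside a sleeping process.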

jcscottiii mentioned this issue Jun 13, 2023

Stabilize the wpt-live cert renewal process #77

Closed

jcscottiii mentioned this issue Oct 23, 2023

Migrate cert-renewal process to cloud run job #79

Merged

jcscottiii closed this as completed in #79 Oct 24, 2023
