Investigate and remediate certificate renewal failure #57
Comments
Great write-up @jcscottiii, thanks for looking into this 😊 The thorough documentation and info is a big help to understand the problem well. Edit: Also just learning now that adding these checkboxes to an issue tracks the number of steps that need to be taken on the issue - very cool!
There is some investigation of an earlier instance of the disk running out of space in #35 that could be related.
There was another failure to renew the cert recently. I did not check the serial console for the out-of-space error before restarting, but I suspect it is the same problem. I have added some alerts. If this happens again, we should migrate to the Cloud Run instance described in the "Other recommendations" section above. That change would be easier than chasing down what is causing the instance to eventually run out of disk space.
There was another such failure over the weekend and I confirmed that it was the same issue. After restarting the service the cert was renewed again.
Previously, the cert-renewal process was a long-running instance that ran the script once and then slept. The problem is that this sleep eventually breaks and the instance never recovers. This change migrates the job to a Cloud Run job that runs the script once and exits. The wait between runs is now handled by a cron schedule, which ensures each run starts on a fresh instance. Fixes #57 Fixes #77
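The run-once pattern described in the commit message can be sketched roughly as follows. This is only an illustration, not the project's actual entrypoint: the `run_once` and `main` names and the placeholder command are hypothetical, and the real renewal script and its arguments are not shown in this thread.

```python
import subprocess
import sys
from typing import Sequence


def run_once(command: Sequence[str]) -> int:
    """Run the given command a single time and return its exit code.

    There is deliberately no sleep or retry loop here: the scheduler
    (a cron schedule, per the commit message) decides when the next
    run happens, so every run starts in a fresh process.
    """
    completed = subprocess.run(command, check=False)
    return completed.returncode


def main() -> int:
    # Hypothetical placeholder standing in for the real renewal script.
    renew_command = [sys.executable, "-c", "print('renewal would run here')"]
    return run_once(renew_command)

# A job entrypoint would typically finish with: sys.exit(main())
```

Propagating the exit code lets the scheduler see a failed renewal as a failed job run, instead of the failure being hidden inside a process that never exits.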
@foolip reported that the certificate for wpt.live was getting too close to expiration.
Problem
Diagnosis
Screenshot 1 - CPU Pegged
Screenshot 2 - No space left on the device
Screenshot 3 - Bucket has not been touched in a while
Screenshot 4 - Logs do not indicate anything wrong
Screenshot 5 - Unable to log into the instance
Summary
The CPU was pegged. Some process must have gotten out of control, since the cert renewal is only a certbot script that runs once a day. Something is also causing the server to use up all of its disk space. As a result, I could not log in to do further diagnosis. The instance needs to be restarted or recreated.
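Since the instance became unreachable once the disk filled up, a periodic disk-usage check could surface the problem earlier. The sketch below uses Python's standard `shutil.disk_usage`. The 90% threshold and the function names are assumptions for illustration, and this is not the alerting that was actually configured for this project.

```python
import shutil


def percent_used(total_bytes: int, used_bytes: int) -> float:
    """Return used space as a percentage of total space."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 100.0 * used_bytes / total_bytes


def is_nearly_full(total_bytes: int, used_bytes: int, threshold: float = 90.0) -> bool:
    """True when usage is at or above the threshold (assumed default: 90%)."""
    return percent_used(total_bytes, used_bytes) >= threshold


def check_path(path: str, threshold: float = 90.0) -> bool:
    """Check the filesystem that contains `path` against the threshold."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free
    return is_nearly_full(usage.total, usage.used, threshold)
```

For example, `check_path("/")` could be called from a scheduled task and its result sent to whatever alerting channel is in use.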
Remediation Steps
Create a PR with the new instance type to the terraform to reflect the new state of infrastructure (No longer needed)
Will close this issue after finishing these steps
Other recommendations
If we want to save money, move the cert renewal to a Cloud Run instance that starts up, runs, and terminates on a cron schedule. This would remove the need to keep an instance running constantly.
cc: @DanielRyanSmith