Investigate and remediate certificate renewal failure #57

Closed
5 tasks done
jcscottiii opened this issue Jul 8, 2022 · 4 comments · Fixed by #79

Comments

@jcscottiii
Collaborator

jcscottiii commented Jul 8, 2022

@foolip reported that the certificate for wpt.live was getting too close to expiration.
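For anyone wanting to check this kind of report independently, here is a minimal sketch of measuring how many days a served certificate has left. It uses only the Python standard library. The function names are hypothetical, and nothing here is taken from the project's own tooling.

```python
import socket
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days between `now` and a certificate's notAfter timestamp.

    `not_after` uses the format returned by SSLSocket.getpeercert(),
    for example "Aug  1 00:00:00 2022 GMT".
    """
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).total_seconds() / 86400


def fetch_not_after(host: str, port: int = 443, timeout: float = 10.0) -> str:
    """Fetch the notAfter field of the certificate served by host:port.

    The default context verifies the certificate, so the handshake raises
    an ssl.SSLError if the certificate has already expired.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

A check could then call `days_until_expiry(fetch_not_after("wpt.live"), datetime.now(timezone.utc))` and compare the result against whatever warning threshold the maintainers choose.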


Problem

Screenshot 2022-07-07 2 59 58 PM


Diagnosis

Screenshot 1 - CPU Pegged

Screenshot 2022-07-07 2 51 34 PM

Screenshot 2 - No space left on the device

Screenshot 2022-07-07 2 58 54 PM

Screenshot 3 - Bucket has not been touched in a while

Screenshot 2022-07-07 3 00 48 PM (1)

Screenshot 4 - Logs do not indicate anything wrong

Screenshot 2022-07-07 2 57 54 PM

Screenshot 5 - Unable to log into the instance

image

Summary

CPU was pegged. Some process must have gotten out of control, since the cert renewal is only a certbot script that runs once a day. Something is also causing the server to use up all of its disk space. As a result, I could not log in to diagnose further. We need to restart or recreate the instance.
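Since a full disk was one of the symptoms, a small check like the following could flag the condition before the instance becomes unreachable. This is a sketch using only the standard library. The function names and the 90% default threshold are assumptions, not values from this project.

```python
import shutil


def usage_fraction(used: int, total: int) -> float:
    """Fraction of the filesystem in use, between 0.0 and 1.0."""
    if total <= 0:
        raise ValueError("total must be positive")
    return used / total


def disk_nearly_full(path: str = "/", threshold: float = 0.9) -> bool:
    """True when the filesystem holding `path` is at or above `threshold`."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return usage_fraction(usage.used, usage.total) >= threshold
```

Run periodically, `disk_nearly_full("/")` returning `True` would be the signal to investigate while logging in is still possible.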


Remediation Steps

  • Restart the instance
    • Upon restarting, hit an error: us-central1 had run out of capacity for the f1-micro instance type
    • image
    • I attempted to recreate the instance in all of the zones in us-central1 and got the same error.
    • The error persisted across two days
  • Find a new instance type that will successfully provision: e2-micro
  • Discuss the increase in billing with @past
  • Verify that the certificate is indeed updated
    • image
  • Create a PR updating the Terraform config with the new instance type, to reflect the new state of the infrastructure (no longer needed)

Will close this issue after finishing these steps

Other recommendations

If we want to save money, move the cert renewal to a Cloud Run instance that starts up, runs, and terminates on a cron schedule. This removes the need to keep an instance running constantly.
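To illustrate the "start, run, terminate" shape described above, here is a minimal sketch of a run-once entrypoint in Python. It contains no loop and no sleep, so an external scheduler decides when the next run happens. The `RENEW_COMMAND` value is an assumption (the actual renewal script and its flags are not shown in this issue), and `run_once` is a hypothetical helper name.

```python
import subprocess
from typing import Sequence

# Assumed command for illustration only; substitute the real renewal script.
RENEW_COMMAND = ["certbot", "renew"]


def run_once(command: Sequence[str], timeout_seconds: float = 900.0) -> int:
    """Run the command a single time and return its exit code.

    There is no loop and no sleep here: scheduling is left to whatever
    triggers the process (for example, a cron schedule).
    """
    try:
        completed = subprocess.run(
            list(command), timeout=timeout_seconds, check=False
        )
    except subprocess.TimeoutExpired:
        return 124  # same exit code that timeout(1) uses for a timed-out command
    except FileNotFoundError:
        return 127  # same exit code a shell uses for "command not found"
    return completed.returncode
```

An entrypoint would call `sys.exit(run_once(RENEW_COMMAND))`, so a failed renewal surfaces as a failed run rather than a silently stuck process.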

cc: @DanielRyanSmith

@DanielRyanSmith
Collaborator

DanielRyanSmith commented Jul 8, 2022

Great write-up @jcscottiii, thanks for looking into this 😊 The thorough documentation and info is a big help to understand the problem well.

Edit: Also just learning now that adding these checkboxes to an issue tracks the number of steps that need to be taken on the issue - very cool!

@past
Member

past commented Jul 10, 2022

There is some investigation of an earlier instance of the disk running out of space in #35 that could be related.

@jcscottiii
Collaborator Author

There was another failure to renew the cert recently. I did not check the serial console for the out-of-space error before restarting, but I suspect it is the same problem. I have added some alerts. If this happens again, we should migrate to the Cloud Run instance described in the "Other recommendations" section above. That change would be easier than chasing down what is causing the instance to eventually run out of disk space.

@past
Member

past commented Jun 13, 2023

There was another such failure over the weekend, and I confirmed that it was the same issue. After restarting the service, the cert was renewed again.

jcscottiii added a commit that referenced this issue Oct 23, 2023
Previously, the cert-renewal process was a long standing instance that
ran the script once and slept. The problem arises that this sleep
eventually breaks and the instance never recovers. This migrates the
job to be a cloud run job that runs the script and that is it. The sleep
is now handled by a cron schedule. This ensures the instance is always
fresh.

Fixes #57
Fixes #77
jcscottiii added a commit that referenced this issue Oct 24, 2023
3 participants