Readiness Checklist #1

Open
61 tasks
SMR39 opened this issue Sep 23, 2022 · 0 comments

This template is the Production Readiness Checklist (PRC) for Level A microservices. Please make sure you have read the PRC guidelines.

Production Readiness Review has two phases: a Design phase review and a Pre-production phase review.

Please hold the Design phase review before beginning development of your microservice, and the Pre-production phase review before rolling out the production release.

Design checklist

This checklist contains items that must be considered during the design phase and verified before implementation starts.

☀️ General

🔒 Security

  • Authentication - It is protected by an authentication service.
  • Authorization - Access is restricted to the appropriate level. Consider who should have access to each exposed API and what they are allowed to do.
  • Transport Security - It uses TLS to communicate with other services over the Internet.

🍀 Sustainability

  • No short-term transfer - Its team members are not expected to be moved to another team in the short term.
  • OnCall considered team - Its team follows OnCall practices.
  • Dependency SLA - Its team knows the SLAs of its service dependencies.
  • SLOs - Its SLOs and SLO owner are defined.

Pre-production checklist (Mercari and Merpay common)

This checklist contains points that must be satisfied during implementation and verified prior to release.

It is recommended to deploy your service in production (without it receiving production traffic) before requesting the PRC, because some of the points below (e.g. capacity estimation, dashboards, screenboards, alerting, profiling) can only be validated once the service is deployed in production and can receive some non-production traffic. Do this only if your service will not impact other production services or datasets. Please let us know in the issue if you think this would be a problem for your service.

🔧 Maintainability

  • Unit test - It has unit tests, and they run in a CI system.
  • Test coverage - Its test coverage is reported to Codecov from the CI system.
  • High Test coverage - Its test coverage is over 80%.
  • Config in env-var - Its config can be overridden via environment variable.
  • dockerignore - It has a .dockerignore file to reduce the Docker image size.
  • No latest tag - Its Docker image tag is not latest or master.
  • Dependabot - Its dependencies are automatically updated.
  • Automated build - Its build process is automated (binary build and Docker image build is in this scope).
  • Automatic build - Its automated build process is running in CI/CD system.
  • Automated deploy - Its deploy process is automated.
  • Automatic deploy - Its automated deploy process is running in CI/CD system.
  • Gradual deploy - Its deploy can be rolled out gradually when needed.
  • Automated rollback - Its rollback process is automated.
  • Automatic rollback - Its rollback process is automatic.
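
The "Config in env-var" item above could look roughly like the sketch below. It is a minimal illustration, not the implementation this template prescribes: the `Config` class, the `load_config` helper, the variable names `APP_PORT` and `APP_TIMEOUT_SECONDS`, and the default values are all hypothetical.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    """Service settings with built-in defaults (all values are hypothetical)."""
    port: int = 8080
    timeout_seconds: float = 5.0


def load_config(env=None) -> Config:
    """Build a Config, letting environment variables override the defaults."""
    # Accept an explicit mapping so the function can be exercised without
    # touching the real process environment.
    env = os.environ if env is None else env
    defaults = Config()
    return Config(
        port=int(env.get("APP_PORT", defaults.port)),
        timeout_seconds=float(env.get("APP_TIMEOUT_SECONDS", defaults.timeout_seconds)),
    )
```

Calling `load_config()` with no argument reads from the process environment, so the same image can be deployed with different settings per environment without a rebuild.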

📉 Observability

  • Tracing - Its requests are traced by Datadog APM.
  • Timeboard - Its Datadog Timeboard is created.
  • Screenboard - Its Datadog Screenboard is created.
  • GCP metrics - Its GCP projects are integrated with Datadog.
  • Actionable alert - Its Datadog Monitors are created, and their alerts are actionable.
  • Warning alert - Its warning alerts are sent to Slack or a ticket system instead of PagerDuty.
  • Critical alert - Its critical alerts are sent to PagerDuty.
  • OnCall rotation - It has a PagerDuty team, escalation policy, and schedules.
  • OnCall playbooks - It has OnCall playbooks.
  • Log to STDOUT - Its logs are output to STDOUT/STDERR.
  • Log as JSON - Its logs are emitted in container log format.
  • Log with annotation - Its logs carry a Request ID annotation.
  • Profiling - It is profiled by GCP Stackdriver Profiler.
  • Error tracking - Its errors are tracked by Sentry.
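
The three logging items above (STDOUT, JSON, Request ID annotation) can be combined in one small formatter. The following is a sketch using only Python's standard `logging` module; the field names (`severity`, `message`, `logger`, `request_id`) and the `build_logger` helper are assumptions for illustration, and the exact container log format expected by your platform may differ.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "severity": record.levelname,
            "message": record.getMessage(),
            "logger": record.name,
            # request_id arrives via the `extra` argument; None when absent.
            "request_id": getattr(record, "request_id", None),
        }
        return json.dumps(payload)


def build_logger(name: str = "service", stream=None) -> logging.Logger:
    """Create a logger that writes JSON lines to STDOUT (or a given stream)."""
    handler = logging.StreamHandler(stream if stream is not None else sys.stdout)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]  # replace handlers to avoid duplicate lines
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

A request handler would then log with `build_logger().info("order created", extra={"request_id": "req-123"})`, which makes every line for one request searchable by its ID.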

✈️ Reliability

  • Auto Scale - It automatically scales horizontally to handle fluctuating workloads. Its HPA is set as described in the Resource Requests and Limits documentation, and it can be scaled manually if needed.
  • CPU req/limit - Its CPU limit and request are set as described in the Resource Requests and Limits documentation.
  • Memory req/limit - Its memory resource request value is the same as its limit value.
  • Capacity planning - It can handle the expected load: either load test has been performed, or the expected traffic is under control (e.g., by Gateway).
  • Zero downtime deploy - Its deploy process does not cause service degradation or downtime (e.g. error rate does not increase during deploy).
  • Graceful shutdown - It can stop gracefully.
  • Graceful degradation - It keeps working, at least partially, while dependencies (e.g. another service or a database) are partially or completely unavailable.
  • PreStop - It has a preStop. See more on Configure PreStop.
  • PDB - It has a PodDisruptionBudget set as described in Configure Pod Disruption Budget.
  • Liveness Probe - It has a health check (endpoint) for the liveness probe, and the liveness probe is configured. See more on Configure Liveness Probe.
  • Readiness Probe - It has a health check (endpoint) for the readiness probe, and the readiness probe is configured.
  • Timeout - It sets an appropriate timeout for requests over a network.
  • Smart retry - It performs smart retries when interacting with dependencies (e.g. other services or database).
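
The template does not define "smart retry". One common reading is to retry only failures that are likely transient, cap the number of attempts, and wait with exponential backoff plus jitter between attempts. The sketch below follows that reading; `TransientError`, `call_with_retry`, and all default values are hypothetical, and the wrapped call is assumed to enforce its own network timeout.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def call_with_retry(func, max_attempts=3, base_delay=0.1, max_delay=2.0,
                    sleep=time.sleep):
    """Call func(), retrying TransientError with capped exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except TransientError:
            if attempt == max_attempts:
                raise  # retry budget exhausted: surface the error
            # The backoff ceiling doubles each attempt, capped at max_delay.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Full jitter spreads retries so clients do not retry in lockstep.
            sleep(random.uniform(0, ceiling))
```

Any other exception type propagates immediately, so non-retryable errors (for example, validation failures) are not repeated against a dependency that is already struggling.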

🔒 Security

  • Security review - It has completed the security design review by the security team.
  • Non-root user - Its Docker container runs as a non-root user.
  • Secrets - Its sensitive configuration is stored in Kubernetes secrets.
  • Non-sensitive log - It does not write sensitive information to app logs (STDOUT/STDERR).
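
For the "Non-sensitive log" item, one possible safeguard is to mask known sensitive fields before a structure is written to the log. This is only a sketch: the `SENSITIVE_KEYS` set and the `redact` helper are hypothetical, and key-based masking does not catch sensitive values embedded in free-form message strings.

```python
# Hypothetical list of field names that must never reach the logs.
SENSITIVE_KEYS = {"password", "token", "authorization", "secret"}


def redact(data):
    """Return a copy of data with sensitive values masked.

    Recurses into nested dicts and lists; other values are returned as-is.
    """
    if isinstance(data, dict):
        return {
            key: "[REDACTED]" if str(key).lower() in SENSITIVE_KEYS else redact(value)
            for key, value in data.items()
        }
    if isinstance(data, list):
        return [redact(item) for item in data]
    return data
```

The key comparison is case-insensitive, so `Authorization` and `authorization` headers are both masked.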

📋 Accessibility

  • Design Doc - Its design doc is up to date with the implementation.
  • Description - It has a service description.
  • Contact - It has contact info about the owners.
  • Source repo - It has links to source repo.
  • Docs - It has links to docs for users.
  • SLOs - Its dashboard shows SLOs.