
Administering our production environment


Install tools

  1. Install kubectl: https://kubernetes.io/docs/tasks/tools/install-kubectl/
  2. Install GCloud: https://cloud.google.com/sdk/docs/downloads-interactive
  3. Install jq and make sure it's in your $PATH: https://stedolan.github.io/jq/
  4. Run gcloud init to authenticate with Google Cloud
  5. Run gcloud container clusters get-credentials workbench --zone us-central1-b --project workbenchdata-production to make kubectl work

To test that everything is well-wired: kubectl logs -lapp=frontend-app should show you the latest log messages.
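A fuller sanity check of the toolchain (the last command is the same one as above):

kubectl version --client
jq --version
gcloud config get-value project    # should print workbenchdata-production
kubectl logs -lapp=frontend-app --tail=5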

Deploy a new feature

  1. Code, test, integration-test, commit and push the feature to master.
  2. Wait for tests to pass and for auto-deploy to staging -- https://github.com/CJWorkbench/cjworkbench/commits shows the latest successes and failures with dots, checkmarks and Xs.
  3. Test the feature on staging: https://app.workbenchdata-staging.com
  4. Run deploy/update-production-to-staging to make production match staging.
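Put together, a typical deploy session looks roughly like this (script names are from this repo's deploy/ directory):

git push origin master                 # CI runs tests, then auto-deploys to staging
# wait for the checkmark on the commit, then smoke-test staging
deploy/update-production-to-staging    # promote what staging is running to production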

Revert a deployment

In case of disaster: deploy/advanced-deploy production [SHA1] reverts to a previous version. But we don't revert database migrations, so anticipate chaos if you revert to a SHA1 from before a migration that breaks old code.
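For example (the SHA1 below is a placeholder -- use a real published image tag, i.e. a commit that passed integration tests, ideally one from after the last migration):

git log --oneline -10                            # find the last known-good commit
deploy/advanced-deploy production 0a1b2c3d4e5f   # placeholder SHA1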

List running pods

kubectl -n production get pods or kubectl -n staging get pods

Reboot a server

Use the provided script:

deploy/restart-because-you-know-what-you-are-doing ENV SERVICE

where ENV is staging or production and SERVICE is cron, fetcher, frontend or renderer.
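For example, to restart the production fetchers:

deploy/restart-because-you-know-what-you-are-doing production fetcher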

To do this manually, from the Google Cloud project console, navigate to the pod (not the deployment) that is having problems and delete it.
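The kubectl equivalent, sketched with an assumed pod name (copy a real one from kubectl get pods):

kubectl -n production get pods
kubectl -n production delete pod frontend-deployment-644866b6d4-fcmh7    # assumed name; the Deployment starts a replacement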

To restart the WordPress server's services: gcloud compute ssh [wordpress]; systemctl restart apache2; systemctl restart mysql

Clear the render cache

Run deploy/clear-render-cache to clear the render cache.

This forces a re-execution of every workflow. That can get expensive: users will notice a slowdown for a few minutes.

View logs

Use Google Cloud's Logs Explorer: https://console.cloud.google.com/logs

Or, to tail logs from a terminal: kubectl logs -f [pod id] [container name]

To follow one container across every pod that shares a label (e.g. all web servers' frontend containers): kubectl logs -f -lapp=frontend-app frontend
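Or tail one container in one specific pod (the pod name is an assumed example -- tab-complete or copy from kubectl get pods):

kubectl -n production logs -f frontend-deployment-644866b6d4-fcmh7 frontend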

Get a database shell

kubectl exec -it frontend-deployment-[tab-complete] -- python ./manage.py dbshell
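That drops you at a psql prompt (the database is PostgreSQL). For example, a harmless smoke test -- auth_user is Django's stock users table, shown purely for illustration:

kubectl -n production exec -it frontend-deployment-644866b6d4-fcmh7 -- python ./manage.py dbshell
# then, at the psql prompt:
#   SELECT COUNT(*) FROM auth_user;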

Change environment variables

Environment variables are set per-pod in individual yaml files, e.g. frontend-deployment.yaml

To view current values for a particular pod, you can run kubectl -n production exec -it frontend-deployment-644866b6d4-fcmh7 -- env.

Many environment variables are secrets. Set those through kubectl, e.g. kubectl edit secret cjw-intercom-secret --namespace production, or through the Google Cloud console.
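To inspect a secret before editing it (SOME_KEY is a placeholder for one of the secret's keys; values are stored base64-encoded):

kubectl -n production get secret cjw-intercom-secret -o json | jq '.data | keys'
kubectl -n production get secret cjw-intercom-secret -o json | jq -r '.data.SOME_KEY' | base64 --decode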

Architecture Notes

  • Each namespace has these services: frontend (the website), cron, fetcher, renderer, database (Postgresql, for most data), rabbitmq (which powers Websockets and fetch+render queues), and minio (which stores all files -- on Google Cloud Storage).
  • Images are stored publicly at gcr.io/workbenchdata-ci and tagged by Git commit sha1.
  • We only publish images after they pass integration tests. We only deploy images that have been published. We auto-deploy to staging.
  • "deploy" means:
    1. Run migrate
    2. Rolling-deploy frontend, fetcher and renderer. Kill-deploy cron (because it's a singleton).
    3. Wait for kubernetes rolling deploys to finish.
  • There's a race: migrate runs while old versions of frontend, cron, renderer and fetcher are reading and writing the database. Approaches for handling the race:
    1. When deleting columns/tables, try a two-phase deploy (sketched after this list):
      1. Commit and deploy code without a migration that will work both before and after the migration is applied. For instance, if the migration deletes a table, deploy code that ignores the table.
      2. Commit and deploy the migration.
    2. When adding columns with NOT NULL, make sure they're optional for a while:
      1. Commit and deploy a migration and code that allow NULL in the column. The old code can ignore the migration; the new code won't write NULLs.
      2. Commit and deploy a migration that rewrites NULL in the column. The code from the previous step won't misbehave.
    3. Alternatively, test very carefully and plan for the downtime. (It may only last a few seconds.)
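A minimal sketch of approach 1's two phases, expressed with the normal deploy flow from above:

# phase 1: ship code that no longer touches the doomed table (commit contains no migration)
git push origin master                 # tests pass, staging auto-deploys
deploy/update-production-to-staging
# phase 2: ship the migration that actually drops the table
git push origin master
deploy/update-production-to-staging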

RabbitMQ stuck (seen 2019-01-25)

RabbitMQ runs on a high-availability cluster, with three nodes. Soon after we deployed on staging (but not production), one of these nodes became "stuck" (deadlocked) on 2019-01-25.

"Stuck" means:

  • One RabbitMQ node's heartbeat checks continued to succeed.
  • It accepted TCP connections, but it did not complete auth attempts.
  • It did not log any new messages. (As of 2019-02-06, the last log message was from 2019-01-25. It was a long week.)
  • kubectl -n staging exec -it rabbitmq-1-rabbitmq-0 -- rabbitmq-diagnostics maybe_stuck revealed thousands of stuck processes.
  • Workbench did not try to reconnect, because the TCP connection never closed.
  • Deleting the pod (kubectl -n staging delete pod rabbitmq-1-rabbitmq-0) caused it to restart and solved the issue. (Workbench reconnected correctly.)

Should this appear to happen again, diagnose it with the maybe_stuck command above on each of the three rabbitmq nodes -- rabbitmq-1-rabbitmq-0, rabbitmq-1-rabbitmq-1 and rabbitmq-1-rabbitmq-2 -- in the environment in question (production or staging). Only after confirming a pod is indeed stuck should you delete it with the kubectl delete pod command above.
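As a sketch (staging shown; substitute -n production as needed):

# check all three nodes; a stuck node shows thousands of stuck processes
for i in 0 1 2; do
  kubectl -n staging exec -it rabbitmq-1-rabbitmq-$i -- rabbitmq-diagnostics maybe_stuck
done
# only after confirming a node is stuck:
kubectl -n staging delete pod rabbitmq-1-rabbitmq-0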

How we set up Billing (on Stripe)

  1. Sign in at https://dashboard.stripe.com
  2. Create an account, named after the company
  3. Create a Product (Premium Plan), and a monthly Price.
  4. Go to Settings -> Customer Portal:
    • Allow customers to view their billing history
    • Allow customers to update their billing address
    • Allow customers to update their payment methods
    • Allow customers to cancel subscriptions -> Cancel Immediately -> Prorate canceled subscriptions
    • Set Headline, and set links to https://workbenchdata.com/terms-of-service and https://workbenchdata.com/privacy
    • Set the default redirect link to https://app.workbenchdata.com/settings/billing (well, https://app.workbenchdata-staging.com/settings/billing on staging)
  5. Go to Settings -> Branding. Adjust.
  6. Go to Settings -> Invoice Template. Adjust.
  7. Go to Settings -> Subscriptions and Emails:
    • Send emails about expiring cards
    • Use Smart Retries
    • Send emails when card payment fails
    • Send a Stripe-hosted link for cardholders to authenticate when required
    • Send reminders after 3, 5 and 7 days
    • Don't send invoices to customers
    • Click lots of Save buttons
  8. Go to Settings -> Emails. Add "workbenchdata.com" and verify it. Email customers about Successful payments and Refunds.
  9. Copy everything to production.
  10. Configure staging and production secrets in Kubernetes deployments and redeploy. (Staging secrets are Stripe's "test mode"; Production secrets are its non-test mode.)
    1. On https://dashboard.stripe.com/test/webhooks (or non-"test" in production), add the endpoint https://app.workbenchdata.com/stripe/webhook. Make the description point to this wiki page. For "Events to send", see the docstrings in cjworkbench/views/stripe.py.
    2. Look up the signing secret of the webhook. Let's call it $STRIPE_WEBHOOK_SIGNING_SECRET
    3. At https://dashboard.stripe.com/test/apikeys (or non-"test" in production), copy/paste $STRIPE_PUBLIC_API_KEY and $STRIPE_API_KEY (the "Publishable key" and "Secret key", respectively)
    4. kubectl --context="$KUBECTL_CONTEXT" create secret generic cjw-stripe-secret --from-literal=STRIPE_PUBLIC_API_KEY="$STRIPE_PUBLIC_API_KEY" --from-literal=STRIPE_API_KEY="$STRIPE_API_KEY" --from-literal=STRIPE_WEBHOOK_SIGNING_SECRET="$STRIPE_WEBHOOK_SIGNING_SECRET"
    5. Restart the frontend pods with the new environment variables (derived from the secret)
  11. Synchronize: kubectl exec -it SOME_FRONTEND -c frontend -- python ./manage.py import-plans-from-stripe (where SOME_FRONTEND is any frontend pod)
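Steps 10.2-10.5 condensed into one hedged sketch (every value is a placeholder -- copy the real ones from the Stripe dashboard; use the staging context and test-mode keys there, the production context and live-mode keys in production):

STRIPE_PUBLIC_API_KEY="pk_test_PLACEHOLDER"          # "Publishable key"
STRIPE_API_KEY="sk_test_PLACEHOLDER"                 # "Secret key"
STRIPE_WEBHOOK_SIGNING_SECRET="whsec_PLACEHOLDER"    # the webhook's signing secret
kubectl --context="$KUBECTL_CONTEXT" create secret generic cjw-stripe-secret \
  --from-literal=STRIPE_PUBLIC_API_KEY="$STRIPE_PUBLIC_API_KEY" \
  --from-literal=STRIPE_API_KEY="$STRIPE_API_KEY" \
  --from-literal=STRIPE_WEBHOOK_SIGNING_SECRET="$STRIPE_WEBHOOK_SIGNING_SECRET"
# restart frontend so pods pick up environment variables derived from the new secret
# (deployment name assumed from the pod names above)
kubectl --context="$KUBECTL_CONTEXT" rollout restart deployment/frontend-deployment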