# dcp-monitoring

`dcp-monitoring` configures the following for component applications of the Human Cell Atlas Data Coordination Platform (DCP):

- Health checks
- Log-based metric configuration
- Alerts (AKA alarms)
- Grafana metrics data sources (for more see https://github.com/HumanCellAtlas/metrics)
- Metric dashboards

Further, this repository templates all of this configuration using terraform and fogg to generalize it across multiple clouds (AWS, GCP) and deployment environments (`dev`, `integration`, `staging`, and `prod`).
The term "monitoring" is used here as defined in the Google SRE handbook.
`dcp-monitoring` defines terraform modules that are templated on two dimensions: cloud accounts and environments.

DCP's cloud accounts are the following.

- `hca` (id `hca` or `humancellatlas`) - the dev account
- `hca-prod` - the production account

DCP's deployment environments are `dev`, `integration`, `staging`, and `prod`. `./fogg.json` defines which modules are deployed to which accounts and environments.
The configuration for health checks, log-based metrics, alerts (alarms), and Grafana dashboards is located in terraform modules in the `terraform/modules` directory.

Modules that are prefixed with `account` are deployed once per cloud account.

- `account-health-checks` configures AWS Route53 health checks (e.g. "The logs system is deployed once per AWS account. Ping http://logs.data.humancellatlas.org/health once every 30 seconds from around the world and tell me if it's healthy")
- `account-alerts` configures alerts (e.g. "Raise an alert if we have lambda errors over X in account 123")
- `account-dashboards` generates JSON templates for Grafana dashboards (e.g. the Account Status dashboard)
- `account-metrics` configures log-based metrics that you can use in CloudWatch Metrics or Grafana
Modules that are prefixed with `env` are deployed once per deployment environment (`dev`, `integration`, `staging`, and `prod`).

- `env-health-checks` configures AWS Route53 health checks (e.g. "DSS is deployed once per deployment environment. Ping https://dss.data.humancellatlas.org/internal/health once every 30 seconds from around the world and tell me if it's healthy")
- `env-alerts` configures alerts (e.g. "Raise an alert if we have APIGateway errors on DSS over X per minute.")
- `env-dashboards` generates JSON templates for Grafana dashboards (e.g. the DSS dashboard)
- `env-metrics` configures log-based metrics that you can use in CloudWatch Metrics or Grafana (a sketch of one follows this list)
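For illustration, a log-based metric in one of these modules might look like the following minimal sketch. The application name `myapp`, log group, filter pattern, and metric namespace are hypothetical, not taken from the DCP modules.

```hcl
# Hypothetical log-based metric: count ERROR lines in an application's
# CloudWatch log group so the count can be graphed or alerted on.
resource "aws_cloudwatch_log_metric_filter" "myapp_errors" {
  name           = "myapp-error-count"
  pattern        = "ERROR"
  log_group_name = "/aws/lambda/myapp"

  metric_transformation {
    name      = "myapp-error-count"
    namespace = "DCP/myapp"
    value     = "1"
  }
}
```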
## Deploying a module into an environment

Once you've specified that terraform code for deployments be generated in the `terraform/envs` directory for the environments you've specified, you must parameterize each module for each environment by filling out the `variables.tf` file. Once this is complete you can deploy into that environment with the following steps:

- `cd terraform/envs/<deployment_env>/<your_module>` to select the environment and module you want to deploy
- `make apply` to deploy/apply the changes to that environment
## Deploying a health check

- If this health check will be deployed once per account, `cd` to `terraform/modules/account-health-checks`
- If this health check will be deployed once per environment, `cd` to `terraform/modules/env-health-checks`
- Create a file of the format `<application_name>.tf`
- Add your health checks using the `aws_route53_health_check` terraform resource (see the sketch after this list)
- Add your health check's id to `outputs.tf` following the format of the other outputs
- If your application is user facing or supports user-facing features of DCP, add your health check's id to the `child_healthchecks` aggregation in `dcp.tf`
- `cd` back to the project root directory
- Follow the instructions outlined in the Deploying a module into an environment section to deploy your health check
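For illustration, a health check resource and its output might look like the following sketch. The application name `myapp` and the endpoint are hypothetical placeholders.

```hcl
# Hypothetical Route53 health check that pings an application's
# health endpoint every 30 seconds.
resource "aws_route53_health_check" "myapp" {
  fqdn              = "myapp.data.humancellatlas.org"
  port              = 443
  type              = "HTTPS"
  resource_path     = "/health"
  failure_threshold = 3
  request_interval  = 30
}

# Expose the health check id so alerts (and the child_healthchecks
# aggregation in dcp.tf) can reference it.
output "myapp_health_check_id" {
  value = "${aws_route53_health_check.myapp.id}"
}
```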
## Deploying an alert

- If this alert will be deployed once per account, `cd` to `terraform/modules/account-alerts`
- If this alert will be deployed once per environment, `cd` to `terraform/modules/env-alerts`
- Create a file of the format `<application_name>.tf`
- Add your alerts using the `aws_cloudwatch_metric_alarm` terraform resource. Each of your alarms must follow the alarm template below.
- `cd` back to the project root directory and run `fogg apply`
- Follow the instructions outlined in the Deploying a module into an environment section to deploy your alert
resource "aws_cloudwatch_metric_alarm" "matrix" {
alarm_name = "<application_name>-<optional alert name>-${<'var.aws_profile' for account alerts 'var.env' for env alerts>}"
...
alarm_description = <<EOF
{
"slack_channel": "<your slack channel>",
"description": "<description of failure state>"
}
EOF
alarm_actions = ["${data.aws_sns_topic.alarms.arn}"]
ok_actions = ["${data.aws_sns_topic.alarms.arn}"]
}
If you are alerting on a health check, use the following dimensions template.

```hcl
dimensions {
  HealthCheckId = "${var.<your health check's id>}"
}
```
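Putting the template and the dimensions together, a concrete env alert might look like the following sketch. The resource name, variable name, and Slack channel are hypothetical placeholders; Route53 does publish health check results to CloudWatch as the `HealthCheckStatus` metric in the `AWS/Route53` namespace (1 when healthy, 0 when not).

```hcl
# Hypothetical env alert that fires when the myapp health check fails.
resource "aws_cloudwatch_metric_alarm" "myapp" {
  alarm_name          = "myapp-health-${var.env}"
  namespace           = "AWS/Route53"
  metric_name         = "HealthCheckStatus"
  statistic           = "Minimum"
  comparison_operator = "LessThanThreshold"
  threshold           = "1"
  period              = "60"
  evaluation_periods  = "2"

  dimensions {
    HealthCheckId = "${var.myapp_health_check_id}"
  }

  alarm_description = <<EOF
{
  "slack_channel": "myapp-alerts",
  "description": "The myapp health check is failing"
}
EOF

  alarm_actions = ["${data.aws_sns_topic.alarms.arn}"]
  ok_actions    = ["${data.aws_sns_topic.alarms.arn}"]
}
```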
## Deploying a dashboard

Use `dcp-monitoring` if you would like to template your dashboards across deployment environments and accounts and upload them to our Grafana deployments (dev or prod).

Note, you don't need to use `dcp-monitoring` to develop dashboards. You can simply edit them by hand in either of the Grafana deployments. However, the changes you make in Grafana will not appear in this repo and will not be reusable by other deployments of your software.

Here are the steps to deploying a dashboard via `dcp-monitoring`.
- If this dashboard will be deployed once per account, `cd` to `terraform/modules/account-dashboards`
- If this dashboard will be deployed once per environment, `cd` to `terraform/modules/env-dashboards`
- Create a file of the format `<sys_name>-dashboard.tf`
- Go to Grafana and create a dashboard for one account or environment
- Go to your dashboard on Grafana, click settings, then JSON Model, and copy the JSON into your terraform file
- Put the JSON in a `locals` variable in `<sys_name>-dashboard.tf`; remove the `"id"` key; for environment dashboards, namespace the `"uid"` key with `var.env` and the `"title"` key with the suffix `[${upper(var.env)}]`
- Add your dashboard JSON to the `"dashboards"` output in `outputs.tf`
- Define the data sources you need according to the Grafana data source API in `datasources.tf`; define one variable for the name of the data source of the format `<cloud>_<name>_datasource_name` and one variable of the format `<cloud>_<name>_datasource` with the JSON for the data source
- Replace `datasource` keys in your dashboard JSON with the names of your new data sources
- Add your data source JSON to the `datasources` output array and the `<cloud>_<name>_datasource_name` variable as an output in `outputs.tf`
- `cd` to the project root directory
- Run `fogg apply`
- Follow the instructions outlined in the Deploying a module into an environment section to generate the templated JSON for your dashboard deployment
- `cd` back to the project root directory and follow the steps outlined in the Upload to Grafana section
In order to let Grafana fetch data from Google Cloud projects, you need to give the Grafana service account `Monitoring Viewer` permissions on that Google project. Note that the Grafana service account should already exist in another DCP Google project; it just has to be given the right level of permissions on whichever project you want to connect. Contact the DCP ops team to get the specific service account email.

To grant the `Monitoring Viewer` permission, go to the GCloud console and ADD the member through the IAM & admin -> IAM section in the sidebar.
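Alternatively, if you prefer to manage this binding in terraform rather than the console, it might look like the following sketch. The project id and service account email are placeholders; get the real email from the DCP ops team.

```hcl
# Hypothetical IAM binding granting the Grafana service account
# read access to a project's monitoring data.
resource "google_project_iam_member" "grafana_monitoring_viewer" {
  project = "my-gcp-project"
  role    = "roles/monitoring.viewer"
  member  = "serviceAccount:grafana@example-dcp-project.iam.gserviceaccount.com"
}
```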
The following section shows an example dashboard configuration using an AWS CloudWatch data source.

Here is where you write your templated dashboard JSON.
```hcl
locals {
  dcp_dashboard = <<EOF
{
  "panels": [
    {
      ...
      "datasource": "${var.aws_cloudwatch_data_source_name}",
      ...
    },
    ...
  ],
  ...
  "title": "DCP Health [${upper(var.env)}]",
  "uid": "dcp-health-${var.env}"
}
EOF
}
```
This datasource configuration is specific to AWS CloudWatch. See the Grafana documentation for other data sources.
```hcl
locals {
  aws_cloudwatch_data_source_name = "account-cloudwatch"

  aws_cloudwatch_datasource = <<EOF
{
  "name": "${local.aws_cloudwatch_data_source_name}",
  "type": "cloudwatch",
  "url": "http://monitoring.${var.region}.amazonaws.com",
  "access": "proxy",
  "jsonData": {
    "authType": "keys",
    "defaultRegion": "${var.region}"
  },
  "secureJsonData": {
    "accessKey": "${aws_iam_access_key.grafana_datasource.id}",
    "secretKey": "${aws_iam_access_key.grafana_datasource.secret}"
  },
  "readOnly": false
}
EOF
}
```
output "dashboards" {
value = <<EOF
[
${local.myapp_dashboard},
...
]
EOF
}
output "aws_cloudwatch_data_source_name" {
value = "${local.aws_cloudwatch_data_source_name}"
}
...
output "datasources" {
value = <<EOF
[
${local.aws_cloudwatch_datasource},
...
]
EOF
}
## Upload to Grafana

To upload dashboards and data sources to Grafana, use `./upload-to-grafana`. Execute `./upload-to-grafana -h` for usage details.
Copyright 2017-2018, The Human Cell Atlas Consortium
For license, see LICENSE.