diff --git a/README.md b/README.md index d2f7ab3e..05429d85 100644 --- a/README.md +++ b/README.md @@ -128,7 +128,7 @@ formal writing that it has come to represent.) | publish | [RFD 83 Triton `http_proxy` support](./rfd/0083/README.md) | | predraft | [RFD 84 Providing Manta access on multiple networks](./rfd/0084/README.md) | | draft | [RFD 85 Tactical improvements for Manta alarms](./rfd/0085/README.md) | -| predraft | [RFD 86 ContainerPilot 3](./rfd/0086/README.md) | +| draft | [RFD 86 ContainerPilot 3](./rfd/0086/README.md) | | predraft | [RFD 87 Docker Events for Triton](./rfd/0087/README.md) | | draft | [RFD 88 DC and Hardware Management Futures](./rfd/0088/README.md) | draft | [RFD 89 Project Tiresias](./rfd/0089/README.md) diff --git a/rfd/0086/config.md b/rfd/0086/config.md index 3fbda400..dca93fdb 100644 --- a/rfd/0086/config.md +++ b/rfd/0086/config.md @@ -2,174 +2,250 @@ JSON as a configuration language leaves much to be desired. It has no comments, is unforgiving in editing (we've specifically written code in ContainerPilot to point out extraneous trailing commas in the config!). Any configuration language change should also support those users we know who are generating ContainerPilot configurations automatically. -We will abandon JSON in favor of the more human-friendly YAML configuration language. It has a particular advantage for those users who are generating configuration because of the ubiquity of YAML-generating libraries (this is its major advantage over Hashicorp HCL). +We will abandon JSON in favor of the somewhat more human-friendly [JSON5](https://github.com/json5/json5) configuration language. It has a particular advantage for those users who are generating configuration because JSON documents are valid JSON5 documents. 
YAML and [Hashicorp's HCL](https://github.com/hashicorp/hcl) were possible alternatives, but feedback from the community resulted in pushback on both due to either the difficulty of correctly hand-writing it (in the case of YAML) or the lack of library support (in the case of HCL). The `CONTAINERPILOT` environment variable and `-config` command line flag will no longer support passing in the contents of the configuration file as a string. Instead they will now indicate the directory location for configuration files, with a default value of `/etc/containerpilot.d` (note that we're removing the `file://` prefix as well). During ContainerPilot configuration loading, we can check for files in the config directory and merge them together. The merging process is as follows: - Lexicographically sort all the config files. -- Multiple `service`, `health`, `sensor` blocks are unioned. +- Multiple `jobs` (formerly `services`), `health`, `sensor` blocks are unioned. - Keys with the same name replace those that occurred previously. -For example consider the two YAML files below. - -1.yml +For example, consider the two JSON5 files below. 
+ +1.json5 +```json5 +{ + consul: { + host: "localhost:8500" + }, + jobs: [ + { + name: "nginx", + port: 80 + }, + { + name: "appA", + port: 8000 + } + ], + health: [ + { + name: "checkA", + job: "nginx", + exec: "curl -s --fail localhost/health" + } + ] +} ``` -consul: - host: localhost:8500 - -services: - nginx: - port: 80 - - app-A: - port: 8000 - -health: - nginx: - check-A: - command: curl -s --fail localhost/health -``` - -2.yml -``` -consul: - host: consul.svc.triton.zone - -services: - app-A: - port: 9000 - - app-B: - port: 8000 - -health: - nginx: - check-B: - command: curl -s --fail localhost/otherhealth +2.json5 +```json5 +{ + consul: { + host: "consul.svc.triton.zone:8500" + }, + jobs: [ + { + name: "appA", + port: 9000 + }, + { + name: "appB", + port: 8000 + } + ], + health: [ + { + name: "checkB", + job: "nginx", + exec: "curl -s --fail localhost/otherhealth" + } + ] +} ``` These will be merged as follows: -``` -consul: - host: consul.svc.triton.zone - -services: - nginx: - port: 80 - - app-A: - port: 9000 - - app-B: - port: 8000 - -health: - nginx: - check-A: - command: curl -s --fail localhost/health - check-B: - command: curl -s --fail localhost/otherhealth - +```json5 +{ + consul: { + host: "consul.svc.triton.zone:8500" + }, + jobs: [ + { + name: "nginx", + port: 80 + }, + { + name: "appA", + port: 9000 + }, + { + name: "appB", + port: 8000 + } + ], + health: [ + { + name: "checkA", + job: "nginx", + exec: "curl -s --fail localhost/health" + }, + { + name: "checkB", + job: "nginx", + exec: "curl -s --fail localhost/otherhealth" + } + ] +} ``` The full example configuration for ContainerPilot found in the existing docs would look like the following: -``` -consul: - host: localhost:8500 - -logging: - level: INFO - format: default - output: stdout - -service: - app: - port: 80 - heartbeat: 5 - tll: 10 - tags: - - app - - prod - interfaces: - - eth0 - - eth1[1] - - 192.168.0.0/16 - - 2001:db8::/64 - - eth2:inet - - eth2:inet6 - - inet - - inet6 - - 
static:192.168.1.100" - - stopTimeout: 5 - preStop: /usr/local/bin/preStop-script.sh - postStop: /usr/local/bin/postStop-script.sh - - depends: - setup: - wait: success - nginx: - onChange: /usr/local/bin/reload-nginx.sh - poll: 30 - timeout: "30s" - consul-agent: - wait: healthy - app: - onChange: /usr/local/bin/reload-app.sh - poll: 10 - timeout: "10s" - - setup: - command: /usr/local/bin/preStart-script.sh {{.ENV_VAR_NAME}} - advertise: false - restart: never - - consul-agent: - port: 8500 - command: consul -agent -join consul - advertise: false - restart: always - - consul-template: - command: > - consul-template -consul consul - -template /tmp/template.ctmpl:/tmp/result - advertise: false - restart: always - - task1: - command: /usr/local/bin/tash.sh arg1 - frequency: 1500ms - timeout: 100ms - advertise: false - - -health: - nginx: - check-A: - command: > - /usr/bin/curl --fail -s -o /dev/null http://localhost/app +```json5 +{ + consul: { + host: "localhost:8500" + }, + logging: { + level: "INFO", + format: "default", + output: "stdout" + }, + jobs: [ + { + name: "app", + // we want to start this job when the "setup" job has exited + // with success but give up after 60 sec + when: { + source: "setup", + event: "exitSuccess", + timeout: "60s" + }, + exec: "/bin/app", + restart: "never", + port: 80, + heartbeat: 5, + ttl: 10, + stopTimeout: 5, + tags: [ + "app", + "prod" + ], + interfaces: [ + "eth0", + "eth1[1]", + "192.168.0.0/16", + "2001:db8::/64", + "eth2:inet", + "eth2:inet6", + "inet", + "inet6", + "static:192.168.1.100", // a trailing comma isn't an error!
+ ] + }, + { + name: "setup", + // we can create a chain of "prestart" events + when: { + source: "consul-agent", + event: "healthy" + }, + exec: "/usr/local/bin/preStart-script.sh", + restart: "never" + }, + { + name: "preStop", + when: { + source: "app", + event: "stopping" + }, + exec: "/usr/local/bin/preStop-script.sh", + restart: "never", + }, + { + name: "postStop", + when: { + source: "app", + event: "stopped" + }, + exec: "/usr/local/bin/postStop-script.sh", + }, + { + // a service that doesn't have a "when" field starts up on the + // global "startup" event by default + name: "consul-agent", + // note we don't have a port here because we don't intend to + // advertise one to the service discovery backend + exec: "consul -agent -join consul", + restart: "always" + }, + { + name: "consul-template", + exec: ["consul-template", "-consul", "consul", + "-template", "/tmp/template.ctmpl:/tmp/result"], + restart: "always", + }, + { + name: "task1", + exec: "/usr/local/bin/tash.sh arg1", + frequency: "1500ms", + timeout: "100ms", + }, + { + name: "reload-app", + when: "watch.app changes", + exec: "/usr/local/bin/reload-app.sh", + }, + { + name: "reload-nginx", + when: "watch.nginx changes", + exec: "/usr/local/bin/reload-nginx.sh", + } + ], + health: { + { + name: "checkA", + job: "nginx", + exec: "/usr/bin/curl --fail -s -o /dev/null http://localhost/app", + poll: 5, + timeout: "5s", + } + } + // see "multiprocess.md" for more details on this section + watches: { + { + name: "app", + poll: 10, + timeout: "10s" + }, + { + name: "nginx", + poll: 30, + timeout: "30s", + } + }, + control: { + // see "mariposa.md" for details on this section + socket: "/var/run/containerpilot.socket" + }, + telemetry: { + port: 9090, + interfaces: "eth0" + }, + sensors: [ + { + name: "metric_id" + help: "help text" + type: "counter" poll: 5 - timeout: "5s" - -telemetry: - port: 9090 - interfaces: - - eth0 - -sensor: - name: metric_id - help: help text - type: counter - poll: 5 - check: 
/usr/local/bin/sensor.sh - + exec: "/usr/local/bin/sensor.sh" + } + ] +} ``` _Related GitHub issues:_ diff --git a/rfd/0086/multiprocess.md b/rfd/0086/multiprocess.md index c962c899..beaa25dd 100644 --- a/rfd/0086/multiprocess.md +++ b/rfd/0086/multiprocess.md @@ -1,44 +1,52 @@ ## First-class support for multi-process containers -ContainerPilot currently "shims" a single application -- it blocks until this main application exits and spins up concurrent threads to perform the various lifecycle hooks. ContainerPilot was originally dedicated to the control of a single persistent application, and this differentiated it from a typical init system. +ContainerPilot v2 "shims" a single application -- it blocks until this main application exits and spins up concurrent threads to perform the various lifecycle hooks. ContainerPilot was originally dedicated to the control of a single persistent application, and this differentiated it from a typical init system. -In v2 we expanded the polling behaviors of `health` checks and `onChange` handlers to include periodic `tasks` [#27](https://github.com/joyent/containerpilot/issues/27). Correctly supporting Consul or etcd on Triton (or any CaaS/PaaS) means requiring an agent running inside the container, and so we also expanded to include persistent `coprocesses`. Users have reported problems with ContainerPilot (via GitHub issues) that stem from two areas of confusion around these features: +Later in v2 we expanded the polling behaviors of `health` checks and `onChange` handlers to include periodic `tasks` [#27](https://github.com/joyent/containerpilot/issues/27). Correctly supporting Consul or etcd on Triton (or any CaaS/PaaS) means requiring an agent running inside the container, and so we also expanded to include persistent `coprocesses`. 
Users have reported problems with ContainerPilot (via GitHub issues) that stem from three areas of confusion around these features: - The configuration of the main process is via the command line whereas the configuration of the various supporting processes is via the config file. - The timing of when each process starts is specific to which configuration block it's in (main vs `task` vs `coprocess`) rather than its dependencies. +- The timing of `preStart` relative to `task` and `coprocess`. #### Multiple services In v3 we'll eliminate the concept of a "main" application and embrace the notion that ContainerPilot is an init system for running inside containers. Each process will have its own health check(s), dependencies, frequency of run or restarting ("run forever", "run once", "run every N seconds"), and lifecycle hooks for startup and shutdown. -For each application managed, the command will be included in the ContainerPilot `services` block. This change eliminates the `task` and `coprocess` config sections. For running the applications we can largely reuse the existing process running code, which includes `SIGCHLD` handlers for process reaping. Below is an example configuration, assuming we use YAML rather than JSON (see [config updates](config.md) for more details). - -```yml -services: - nginx: - command: nginx -g daemon off; - port: 80 - heartbeat: 5 - ttl: 10 - interfaces: - - eth0 - - eth1 - depends: - - consul_agent - - consul_agent: - command: consul -agent - port: 8500 - interfaces: - - localhost - advertise: false - -health: - nginx: - check: curl --fail http://localhost/app - poll: 5 +For each application managed, the command will be included in the ContainerPilot `jobs` block. This change merges the `services`, `task`, and `coprocess` config sections. For running the applications we can largely reuse the existing process running code, which includes `SIGCHLD` handlers for process reaping. 
Below is an example configuration, using the JSON5 configuration syntax described in the [config updates](config.md) section. + +```json5 +{ + jobs: [ + { + name: "nginx", + when: { + source: "consul-agent", + event: "healthy" + }, + exec: "nginx", + port: 80, + heartbeat: 5, + ttl: 10, + interfaces: [ + "eth0", + "eth1", + ] + }, + { + name: "consul-agent", + exec: "consul -agent", + interfaces: ["localhost"], + }, + ], + health: [ + { + name: "nginx-check", + job: "nginx", + exec: "curl --fail http://localhost/app", + poll: 5, timeout: "5s" - + } + ] +} ``` _Related GitHub issues:_ @@ -53,35 +61,44 @@ _Related GitHub issues:_ #### Multiple health checks -ContainerPilot 3 will support the ability for a service to have multiple health checks. All health checks for a service must be in a passing state before a service can be marked as healthy. The service definition will be separated from the health check definition in order to support proposed [configuration improvements](config.md). +ContainerPilot 3 will support the ability for a job to have multiple health checks. All health checks for a job must be in a passing state before a job can be marked as healthy. The job definition will be separated from the health check definition in order to support proposed [configuration improvements](config.md). We'll make the following changes: -- ContainerPilot will maintain state for all services, which can be either `healthy` or `unhealthy`, and all health checks, which can be either `passing` or `failing`. -- All health checks must be marked `passing` before ContainerPilot will mark the service a `healthy`. -- A service in a `healthy` state will send heartbeats (a `Pass` message) to the discovery backend every `heartbeat` seconds with a TTL of `ttl`. -- A health check will poll every `poll` seconds, with a timeout of `timeout`. 
If any health check fails (returns a non-zero exit code or times out), the associated service is marked `unhealthy` and a `Fail` message is sent to the discovery service. -- Once any health check fails, _all_ health checks need to pass before the service will be marked healthy again. This is required to avoid service flapping. - -**Important note:** end users should not provide a health check with a long polling time to perform some supporting task like backends. This will cause "slow startup" as their service will not be marked healthy until the first polling window expires. Instead they should create another service for this task. - -``` -consul: - host: consul.svc.triton.zone - -services: - nginx: - port: 80 - heartbeat: 5 - ttl: 10 - -health: - nginx: - check-A: - command: curl -s --fail localhost/health - check-B: - command: curl -s --fail localhost/otherhealth - +- ContainerPilot will maintain state for all jobs, which can be either `healthy` or `unhealthy`, and all health checks, which can be either `passing` or `failing`. +- All health checks must be marked `passing` before ContainerPilot will mark the job as `healthy`. +- A job in a `healthy` state will send heartbeats (a `Pass` message) to the discovery backend every `heartbeat` seconds with a TTL of `ttl`. +- A health check will poll every `poll` seconds, with a timeout of `timeout`. If any health check fails (returns a non-zero exit code or times out), the associated job is marked `unhealthy` and a `Fail` message is sent to the discovery backend. +- Once any health check fails, _all_ health checks need to pass before the job will be marked healthy again. This is required to avoid service flapping. + +**Important note:** end users should not provide a health check with a long polling time to perform some supporting task like backends. This will cause "slow startup" as their job will not be marked healthy until the first polling window expires. Instead they should create another job for this task. 
+ +```json5 +{ + consul: { + host: "consul.svc.triton.zone" + }, + jobs: [ + { + name: "nginx", + port: 80, + heartbeat: 5, + ttl: 10 + } + ], + health: [ + { + name: "check-A", + job: "nginx", + exec: "curl -s --fail localhost/health" + }, + { + name: "check-B", + job: "nginx", + exec: "curl -s --fail localhost/otherhealth" + } + ] +} ``` @@ -90,9 +107,9 @@ _Related GitHub issues:_ - [Allow multiple health checks per service](https://github.com/joyent/containerpilot/issues/245) -#### "No advertise" +#### Non-advertising jobs -Some applications are intended only for internal consumption by other applications in the container not should not be advertised to the discovery layer (ex. Consul agent). These applications can mark themselves as "no advertise." ContainerPilot will track the state of non-advertising applications but not register them with service discovery, so it can fire `onChange` handlers for other services that depend on it. In the example above, the Consul Agent will not be advertised but Nginx can still have it marked as a dependency. +Some applications are intended only for internal consumption by other applications in the container and should not be advertised to the discovery backend (ex. Consul agent). These applications can mark themselves as "non-advertising" simply by not providing a `port` configuration. ContainerPilot will track the state of non-advertising applications but not register them with service discovery, so it can fire `watch` handlers for other jobs that depend on it. In the example above, the Consul Agent will not be advertised but Nginx can still have it marked as a dependency. 
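+
+For illustration, here is a minimal sketch of a non-advertising job (the names here are hypothetical, using the schema proposed above): a job with no `port` is tracked internally but never registered with the discovery backend.
+
+```json5
+{
+  jobs: [
+    {
+      // no "port" here: consul-agent is tracked and emits events,
+      // but is never registered with service discovery
+      name: "consul-agent",
+      exec: "consul -agent"
+    },
+    {
+      // nginx advertises port 80 and can still depend on the
+      // non-advertising consul-agent's "healthy" event
+      name: "nginx",
+      when: {
+        source: "consul-agent",
+        event: "healthy"
+      },
+      exec: "nginx",
+      port: 80
+    }
+  ]
+}
+```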
_Related GitHub issues:_ - [Coprocess hooks](https://github.com/joyent/containerpilot/issues/175) @@ -117,54 +134,136 @@ ContainerPilot hasn't eliminated the complexity of dependency management -- that That being said, a more expressive configuration of event handlers may more gracefully handle all the above situations and reduce the end-user confusion. Rather than surfacing just changes to dependency membership lists, we'll expose changes to the overall state as ContainerPilot sees it. -ContainerPilot will provide the following events: +ContainerPilot will provide events, and each service can opt in to starting on a `when` condition tied to one of these events. Because the life-cycle of each service triggers new events, the user can create a dependency chain among all the services in a container (and their external dependencies). This effectively replaces the `preStart`, `preStop`, and `postStop` behaviors. + +#### "When" + +The configuration for `when` includes an `event`, a sometimes-optional `source`, and an optional `timeout`. ContainerPilot will provide the following events: -- `onSuccess`: when a service exits with exit code 0. -- `onFail`: when a service exits with a non-zero exit code. -- `onHealthy`: when ContainerPilot receives notice that the dependency has been marked healthy. This is only triggered when there were previously no healthy instances (typically when the application first starts but also if all instances have previously failed). -- `onChange`: when ContainerPilot receives notice of a change to the membership of a service. -- `onUnhealthy`: when ContainerPilot receives notices that the dependency has been marked unhealthy (no instances available of a previously healthy service). +- `startup`: when ContainerPilot has completed all configuration and has started its telemetry server and control socket. This event also signals the start of all timers used for the optional timeout. This event cannot have an event source. 
If no `when` is configured for a service, starting on this event is the default behavior. +- `exitSuccess`: when a service exits with exit code 0. This event requires an event source. +- `exitFailed`: when a service exits with a non-zero exit code. This event requires an event source. +- `healthy`: when ContainerPilot determines that a dependency has been marked healthy. This can be determined by either a `watch` for an external service (registered with the discovery backend) or by a passing health check for another service in the same container. This is only triggered when there were previously no healthy instances (typically when the application first starts but also if all instances have previously failed). This event requires an event source. +- `changed`: when ContainerPilot receives notice of a change to the membership of a service. This event requires an event source. +- `unhealthy`: when ContainerPilot receives notice that the dependency has been marked unhealthy (no instances available of a previously healthy service). This event requires an event source. +- `stopping`: when a service in the same container receives SIGTERM. This event requires an event source. +- `stopped`: when the process for a service in the same container exits. This event requires an event source. -Services can have additional options for their dependencies: +The event source is the service that is emitting the event; it is required for all events except `startup`. The optional timeout (in the format `timeout: "60s"`) indicates that ContainerPilot will give up on starting this service if the timeout elapses. -- `wait`: do not start the service until the desired state has been reached. Fire any other event handlers associated with the state before starting the application. -- `timeout`: if the dependency is unhealthy for this period of time, mark this service as failed as well. +Some example `when` configurations: -In the example below, we have a Node.js service `app`.
The Node app needs configuration from the environment before it can start, so it must wait until a one-time service named `setup` has completed successfully. It will wait until `consul_agent` and `database` have been marked healthy, and will automatically reconfigure itself on changes to the `database` and `redis`. +- `when: {event: "startup"}`: start immediately (this is the default behavior if unspecified). +- `when: {source: "myPreStart", event: "exitSuccess", timeout: "60s"}`: wait up to 60 seconds for the service in the same container named `myPreStart` to exit successfully. +- `when: {source: "myDb", event: "healthy"}`: wait forever, until the service `myDb` has a healthy instance. This service could be in the same container or external. +- `when: {source: "myDb", event: "stopped"}`: start after the service in this container named `myDb` stops. This could be useful for copying a backup of the data off the instance. -```yml -services: - app: - depends: +#### Watches - # `app` will not start if `setup` fails - setup: - wait: onSuccess +ContainerPilot will generate events for all jobs internally, but the user can create a `watch` to query service discovery periodically and generate `changed` events. This replaces the existing `backends` feature, except that `watches` don't fire their own executables. Instead the user should create a job that watches for events that the `watch` fires. A `watch` event source will be prefixed with `watch.` (for example, `watch.app`) to differentiate it from job events with the same name. - # ContainerPilot will give the agent 60 sec to become healthy, - # otherwise mark `app` as failed - consul_agent: - wait: onHealthy - timeout: "60s" +#### Ordering - # `app` needs this DB and also needs to configure itself when the DB - # is healthy. It reloads its config if the DB changes.
- database: - wait: onHealthy - timeout: "60s" - onHealthy: configure-db-connnection.sh - onChange: reload-db-connections.sh +Each job, health check, or watch handler runs independently (in its own goroutine) and publishes its own events. Events are broadcast to all handlers and a handler will handle events in the order they are received (buffering events as necessary). This means events from multiple publishers can be interleaved, but events for a single publisher will arrive in the order they were sent; e.g. a handler won't receive a `stopped` before a `stopping`. (In practice, handlers will receive messages in the same order as all other handlers but this isn't going to be an invariant of the system in case we need to change the internals later.) + +#### Example + +In the example below, we have a Node.js service `app`. It needs to get some configuration data from the environment in a one-time `setup` job. The Node app has to make requests to redis and a database. The app can gracefully handle a missing redis but can't safely start without the database (this is an intentionally arbitrary example). We also need a consul-agent job to be running so that we can get the configuration for all of the above. + +The diagram below roughly describes the dependencies we have. - # `app` gracefully handles missing redis so we just update the config - # whenever the list of members changes - redis: - onChange: reload-configuration.sh +``` + +----> setup ------+ + | | +app ---+----> database ---+----> consul-agent ----> container start + | | + +~~~~> redis ------+ ``` -Note that to avoid stalled containers, ContainerPilot will need to automatically detect unresolvable states. In the example above, if `setup` fails ContainerPilot will never start `app`, so it should mark the `app` service state as failed as well. If ContainerPilot reaches a state where no services can continue, it will exit. 
+The configuration syntax for `when` doesn't permit multiple dependencies, but we can describe a chain of dependencies by forcing one of the hard dependencies to depend on the other as shown in the configuration below. The user could also accomplish the same dependency chain by merging the `configure-db-connection` and `setup` service executables, or even by having the `setup` executable poll the ContainerPilot status endpoint for more fine-grained control. + +```json5 +{ + jobs: [ + { + name: "consul-agent", + when: { + event: "startup" // this is the default + }, + exec: "consul -agent -join {{ .CONSUL }}" + }, + { + // the db is a hard dependency for 'app' + // the 'healthy' event for an external db implies we are also waiting for + // consul-agent so we meet that requirement too + name: "configure-db-connection", + when: { + source: "database", + event: "healthy" + }, + exec: "/bin/configure-db.sh" + }, + { + name: "setup", + when: { + source: "configure-db-connection", + event: "exitSuccess" + }, + exec: "/bin/setup.sh" + }, + { + name: "app", + // 'app' will never start if 'setup' fails + when: { + source: "setup", + event: "exitSuccess" + }, + exec: "node /bin/app.js" + }, + { + name: "reconfigure-db-connection.sh", + when: { + source: "watch.database", + event: "changed" + }, + exec: "reconfigure-db-connection.sh", + }, + { + name: "reconfigure-redis-connection.sh", + when: { + source: "watch.redis", + event: "changed" + }, + exec: "reconfigure-redis-connection.sh", + } + ], + health: [ + { + // because we haven't provided a port for consul-agent this health check + // won't result in it registering with service discovery, but we still + // get its health events inside this container + name: "consul-agent-check", + job: "consul-agent", + exec: "consul info | grep peers", + poll: 5, + timeout: "5s" + } + ], + watches: [ + { + name: "database", + }, + { + // we gracefully handle missing redis so this is a soft dependency.
+ // we update the config whenever the list of members changes + name: "redis", + } + ] +} +``` -This change eliminates the current `preStart` configuration. Instead a user will have a one-time service that all other services depend on (like `setup` above). With multi-application support we can create a chain of dependencies. +In the example above, if `setup` fails ContainerPilot will never start `app`, so it should mark the `app` job state as failed as well. If ContainerPilot reaches a state where no jobs can continue, it should exit. To avoid stalled containers, ContainerPilot will need a way to automatically detect unresolvable states. _Related GitHub issues:_ - [Startup dependency sequencing](https://github.com/joyent/containerpilot/issues/273)