
RFD86 updates: use json5 and enhance dependency resolution #31

Merged (7 commits) on Apr 5, 2017

Conversation

@tgross (Contributor) commented Apr 3, 2017:

These changes follow discussion at last week's meetings around RFD36 and friends, as well as the previous week's work on multi-process support and some of the discoveries made along the way of implementing it.

In this PR:

  • following a lot of negative reaction to YAML, I've changed the config language update to use JSON5 (see the short sketch after this list)
  • removed explicit "non-advertise" clause because we get that for free just by omitting the info required to register with Consul (i.e. the port)
  • removed the clunky "depends" arrays in favor of a nicer upstart-like syntax which will be more flexible for us in future development
  • pulled watches out into their own config section because the external services we're watching and the events that services are waiting for are actually separate concerns -- we can respond to either internal services or external ones.
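
For reference, here's a minimal sketch of what JSON5 buys us over strict JSON; the config keys shown are illustrative only, not the settled schema:

```json5
{
  // JSON5 permits comments,
  jobs: [
    {
      name: "app",        // unquoted object keys,
      port: 8080,
      tags: ['blue',],    // single-quoted strings, and trailing commas
    },
  ],
}
```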

The big conceptual change is the realization that, by merging the various service/prestart/task/coprocess configurations together, we end up with non-advertised services that still need to respond to events inside the same container. We might want to be able to health check the Consul agent without advertising it, for example.

One item I'd like to discuss again is whether "services" is the right word for the config now. It includes periodic tasks, non-advertised processes, and prestart/poststop/etc. The term is also overloaded with "services" as Consul sees them and "services" as an RFD36 scheduler might see them. (resolved in 4bea5a4)

cc @jasonpincin @misterbisson @geek

```json5
watches: [
  {
    name: "database",
    exec: "reconfigure-db-connection.sh",
```
@tgross (Author) commented:

In a sidebar discussion with @jasonpincin and @geek, we've decided that it would make more sense to move the exec into its own job. This way a watch merely watches and fires events, and then jobs react to those events.

@tgross (Author) added:

Also, watches will produce events in the form `watch.SomeWatchName`, and will have an explicit field for the service (or, in the future, key) they want to watch.
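
Putting those two comments together, a hedged sketch of the split (the `service` field name on the watch and the `changed` event name are assumptions, and the `when` object form comes from later in this thread):

```json5
watches: [
  {
    name: "database",
    service: "database",   // assumed field naming the service to watch
  },
],
jobs: [
  {
    name: "reload-db-config",
    exec: "reconfigure-db-connection.sh",
    // react to the event the watch fires; the watch's events arrive
    // in the form watch.database, and the event name is hypothetical
    when: { source: "database", event: "changed" },
  },
]
```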


```json5
{
  services: [
```
@tgross (Author) commented:

In a discussion with @jasonpincin and @geek, we decided that `jobs` would be better terminology here, as we're including non-advertised and one-off tasks.

```diff
@@ -117,54 +131,101 @@ ContainerPilot hasn't eliminated the complexity of dependency management -- that

 That being said, a more expressive configuration of event handlers may more gracefully handle all the above situations and reduce the end-user confusion. Rather than surfacing just changes to dependency membership lists, we'll expose changes to the overall state as ContainerPilot sees it.

-ContainerPilot will provide the following events:
+ContainerPilot will provide events and each service can opt-in to having a `start` condition on one of these events. Because the life-cycle of each service triggers new events, the user can create a dependency chain among all the services in a container (and their external dependencies). This effectively replaces the `preStart`, `preStop`, and `postStop` behaviors.
```
@tgross (Author) commented:

In a sidebar discussion we decided that `when` is more ergonomic than `start` for jobs that run more than once or on multiple triggered events. This gives us syntax like:

  • `when: "startup"`
  • `when: "myPreStart exitSuccess timeout 60s"`
  • `when: "myDb healthy"`
  • `when: "myDb stopped"`

```json5
health: [
  {
    name: "checkA",
    service: "nginx"
```
A contributor commented:

Is this supposed to match the job name above? Should the field name be `job`?

@tgross (Author) replied:

It does have to match the job, so yes, that field name should probably be `job` too.
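
A minimal sketch of the rename; only `name` and the `job` field are grounded in this exchange, while `poll` and `timeout` follow the health-check notes later in the diff:

```json5
health: [
  {
    name: "checkA",
    job: "nginx",     // must match a name in the jobs array
    poll: 5,          // seconds between checks
    timeout: "5s",    // per-check timeout
  },
]
```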

```json5
    // this is upstart-like syntax indicating we want to start this
    // service when the "setup" service has exited with success but
    // give up after 60 sec
    when: "setup exitSuccess timeout 60s",
```
A contributor commented:

Where should I look to see the syntax for the events described? Specifically, I'm looking for more background on the choice of this syntax:

```json5
when: "setup exitSuccess timeout 60s"
```

over something like:

```json5
when: {
    job: "setup",
    event: "exitSuccess",
    timeout: "60s",
}
```

I'm not sure I have an opinion; I was just trying to better understand the reasoning for using a different syntax to define those components, rather than the JSON5 we're using elsewhere.

@tgross (Author) replied:

There's no background other than existing systems like upstart, which handle this nicely. Having "stringly-typed" events is an unfortunate problem either way, but one we're stuck with.

@tgross (Author) added:

But on further reflection, the more clever syntax proposed above will just get screwed up by end users, which is problematic given that most of these changes are designed to fix poorly-considered config. So let's make it as straightforward as humanly possible and just use the field names.

@tgross (Author) added:

Although rather than `job` we should use `source` here, as it's likely we'll have non-job sources of events (like watches).


```json5
    timeout: "5s",
  }
}
watches: {
```
A contributor commented:

Perhaps a comment here explaining that watches replace `upstreams` from v2, and that they trigger events that can be hooked by jobs defined above.

There's a lot of depth in TritonDataCenter/containerpilot#227, but summarizing it here (which will probably be the basis of our docs) might help clarify the conclusions.

@tgross (Author) replied:

It replaces `backends`, but yeah, agreed. There's some commentary in the multiprocess.md file already, but it couldn't hurt to repeat it here.

@tgross (Author) added:

> which will probably be the basis of our docs

This doc is largely about why we're changing things, which I'm not sure belongs in the shipped docs.

- A health check will poll every `poll` seconds, with a timeout of `timeout`. If any health check fails (returns a non-zero exit code or times out), the associated job is marked `unhealthy` and a `Fail` message is sent to the discovery job.
- Once any health check fails, _all_ health checks need to pass before the job will be marked healthy again. This is required to avoid service flapping.

**Important note:** end users should not provide a health check with a long polling time as a way to perform some supporting task (like updating backends). This will cause "slow startup", as their job will not be marked healthy until the first polling window expires. Instead they should create another job for this task.
A contributor commented:

To clarify my understanding: health checks should be short, and end users should never mix concerns between health checks and cron events. This is a good principle, but it's also a practical limitation, because a job will not be marked as healthy if the job runs long doing other work unrelated to the healthiness of the app. Yes?

@tgross (Author) replied:

Correct!
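
To make that split concrete, a hedged sketch: keep the health check short, and give the supporting work its own job (the `exec` values are illustrative, and the `healthy` trigger follows the event syntax discussed above):

```json5
health: [
  {
    name: "checkApp",
    job: "app",
    exec: "curl -sf http://localhost/health",   // illustrative check command
    poll: 5,        // short poll, so "app" is marked healthy quickly
    timeout: "5s",
  },
],
jobs: [
  {
    name: "refreshBackends",        // the supporting task, as its own job
    exec: "refresh-backends.sh",    // illustrative
    when: { source: "app", event: "healthy" },
  },
]
```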

```json5
jobs: [
  {
    name: "app",
    // this is upstart-like syntax indicating we want to start this
```
A contributor commented:

Still want to call this "upstart-like syntax"?

@tgross (Author) replied:

Missed that! Fixed!

```json5
    // service when the "setup" service has exited with success but
    // give up after 60 sec
    when: {
      source: "setup",
```
A contributor commented:

So, a `source` can match a job or a watch name?

@tgross (Author) replied:

Yes.
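
So both of these would be valid triggers; a short sketch (the watch's `changed` event name is an assumption):

```json5
// source naming a job:
when: { source: "setup", event: "exitSuccess" },

// source naming a watch (its events arrive as watch.database,
// per the earlier comment):
when: { source: "database", event: "changed" },
```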

```json5
    // give up after 60 sec
    when: {
      source: "setup",
      event: "exitSuccess",
```
A contributor commented:

I haven't read the whole thing yet, but I'll ask: is the dictionary of events enumerated here?

@tgross (Author) replied:

Yes, in `multiprocess.md`.
