RFD86 updates: use json5 and enhance dependency resolution #31
Conversation
rfd/0086/multiprocess.md
Outdated
watches: [
  {
    name: "database",
    exec: "reconfigure-db-connection.sh",
In a sidebar discussion with @jasonpincin and @geek, we've decided that it would make more sense to move the `exec` into its own job. This way a watch merely watches and fires events, and then jobs react to those events.
Also, watches will produce events in the form `watch.SomeWatchName`, and will have an explicit field for the service (or, in the future, key) they want to watch.
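A rough sketch of what that separation could look like, using the field-style `when` settled on later in this review (the `service`, `poll`, and `changed` names here are assumptions, not final syntax):

```json5
{
  watches: [
    {
      // the watch only observes the service and emits "watch.database"
      // events; it no longer carries an exec of its own
      name: "database",
      service: "database",
      poll: 10,
    },
  ],
  jobs: [
    {
      // a separate job reacts to the watch's events and runs the script
      name: "reconfigure-db",
      exec: "reconfigure-db-connection.sh",
      when: { source: "watch.database", event: "changed" },
    },
  ],
}
```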
rfd/0086/multiprocess.md
Outdated
```json5
{
  services: [
```
In a discussion with @jasonpincin and @geek, we decided that `jobs` would be better terminology here, as we're including non-advertised and one-off tasks.
rfd/0086/multiprocess.md
Outdated
@@ -117,54 +131,101 @@ ContainerPilot hasn't eliminated the complexity of dependency management -- that

That being said, a more expressive configuration of event handlers may more gracefully handle all the above situations and reduce the end-user confusion. Rather than surfacing just changes to dependency membership lists, we'll expose changes to the overall state as ContainerPilot sees it.

ContainerPilot will provide the following events:
ContainerPilot will provide events and each service can opt-in to having a `start` condition on one of these events. Because the life-cycle of each service triggers new events, the user can create a dependency chain among all the services in a container (and their external dependencies). This effectively replaces the `preStart`, `preStop`, and `postStop` behaviors.
In a sidebar discussion we've decided that `when` is more ergonomic than `start` for jobs that run more than once on multiple triggered events. This gives us syntax like the following (a fuller sketch follows these examples):
when: "startup"
when: "myPreStart exitSuccess timeout 60s"
when: "myDb healthy"
when: "myDb stopped"
rfd/0086/config.md
Outdated
health: [
  {
    name: "checkA",
    service: "nginx"
Is this supposed to match the job name above? Should the field name be `job`?
It does have to match the job, so yes, that field name should probably be `job` too.
rfd/0086/config.md
Outdated
// this is upstart-like syntax indicating we want to start this
// service when the "setup" service has exited with success but
// give up after 60 sec
when: "setup exitSuccess timeout 60s",
Where should I look to see the syntax for the events described? Specifically, I'm looking for more background on the choice of this syntax:
when: "setup exitSuccess timeout 60s"
Over something like:

when: {
  job: "setup",
  event: "exitSuccess",
  timeout: "60s",
}

I'm not sure I have an opinion; I was just trying to better understand the reasoning for using a different syntax for defining those components, rather than the JSON5 we're using elsewhere.
There's no background other than existing systems like upstart which handle this nicely. Having "stringly-typed" events is an unfortunate problem in either case but one we're stuck with either way.
But actually, on further reflection, the more clever proposed syntax will just get screwed up by end users, which is problematic given that most of these changes are designed to fix poorly-considered config. So let's make it as straightforward as humanly possible and just use the field names.
Although rather than "job" we should use "source" here as it's likely we'll have non-job sources of events (like watches).
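With that decision, the earlier string example might translate to something like this (a sketch; the diff hunks later in this review show the actual proposed form):

```json5
when: {
  source: "setup",      // the job (or, later, watch) emitting the event
  event: "exitSuccess", // which of its events to react to
  timeout: "60s",       // give up waiting after 60 seconds
}
```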
Further question moved to https://github.com/joyent/rfd/pull/31/files#r109735602
rfd/0086/config.md
Outdated
timeout: "5s", | ||
} | ||
} | ||
watches: { |
Perhaps a comment here explaining that `watches` replace `upstreams` from v2, and they trigger events that can be hooked with `jobs` defined above.

There's a lot of depth in TritonDataCenter/containerpilot#227, but summarizing it here (which will probably be the basis of our docs) might help clarify the conclusions.
It replaces `backends`, but yeah, agreed. There's some commentary in the multiprocess.md file already but it couldn't hurt to repeat it here.
> which will probably be the basis of our docs

This doc is largely about why we're changing things, which I'm not sure belongs in the shipped docs.
- A health check will poll every `poll` seconds, with a timeout of `timeout`. If any health check fails (returns a non-zero exit code or times out), the associated job is marked `unhealthy` and a `Fail` message is sent to the discovery job.
- Once any health check fails, _all_ health checks need to pass before the job will be marked healthy again. This is required to avoid service flapping.

**Important note:** end users should not provide a health check with a long polling time to perform some supporting task like backends. This will cause "slow startup" as their job will not be marked healthy until the first polling window expires. Instead they should create another job for this task.
To clarify my understanding: health checks should be short, and end users should never mix concerns between health checks and cron events. This is a good principle, but it's also a practical limitation, because a job will not be marked as healthy if the job runs long doing other work unrelated to the healthiness of the app. Yes?
Correct!
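A minimal sketch of a health check that follows these rules, using the `job` field name agreed on earlier in this review (the command and timing values are illustrative):

```json5
health: [
  {
    name: "checkA",
    job: "nginx",                              // must match the job being checked
    exec: "curl -sf http://localhost/health",  // quick check, no unrelated side work
    poll: 5,                                   // run every 5 seconds...
    timeout: "3s",                             // ...and fail if it takes longer than 3s
  },
]
```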
rfd/0086/config.md
Outdated
jobs: [
  {
    name: "app",
    // this is upstart-like syntax indicating we want to start this
Still want to call this "upstart-like syntax"?
Missed that! Fixed!
rfd/0086/config.md
Outdated
// service when the "setup" service has exited with success but
// give up after 60 sec
when: {
  source: "setup",
So, a `source` can match a job or a watch name?
Yes.
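For illustration, both of these would be valid `source` values under that rule (the `changed` event name for watches is an assumption):

```json5
// waiting on another job in the same container
when: { source: "setup", event: "exitSuccess" }

// waiting on a change event from a watch on an external service
when: { source: "watch.database", event: "changed" }
```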
rfd/0086/config.md
Outdated
// give up after 60 sec
when: {
  source: "setup",
  event: "exitSuccess",
I haven't read the whole thing yet, but I'll ask: is the dictionary of events enumerated here?
Yes, in multiprocess.md
These changes follow discussion at last week's meetings around RFD36 and friends, as well as the previous week's work on multi-process support and some of the discoveries along the way of implementing that.
In this PR:

- `port`
- `watches` pulled out into their own config section, because the external services we're watching and the events that services are waiting for are actually separate concerns -- we can respond to either internal services or external ones.

The big conceptual change is the realization that by merging all the various service/prestart/task/coprocess concepts together, we have non-advertised services for which we want to respond to events inside the same container. We might want to be able to health check the Consul agent without advertising it, for example.
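A sketch of that Consul-agent case under the proposed config (field values are hypothetical; the point is that a job with no advertised port can still be health checked without being registered for discovery):

```json5
{
  jobs: [
    {
      // run the Consul agent as an ordinary job; with no advertised port
      // it is never registered for discovery
      name: "consul-agent",
      exec: "consul agent -config-dir=/etc/consul",
      when: "startup",
    },
  ],
  health: [
    {
      // we can still health check the agent even though it isn't advertised
      name: "check-consul-agent",
      job: "consul-agent",
      exec: "consul info",
      poll: 10,
      timeout: "5s",
    },
  ],
}
```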
One item I'd like to discuss again is whether "services" is the right word for the config now. It includes periodic tasks, non-advertised processes, and prestart/poststop/etc. The term is also overloaded with "services" as Consul sees them and "services" as an RFD36 scheduler might see them. (resolved in 4bea5a4)

cc @jasonpincin @misterbisson @geek