Design document #1
Left a few comments but overall looks like a solid start!
}

message JobId {
  string id = 1;
How is the job id going to be generated?
I'll go with UUIDs probably.
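For illustration, a stdlib-only sketch of generating such an id as a random (version 4) UUID — in practice a library like github.com/google/uuid would probably be used:

```go
package main

import (
	"crypto/rand"
	"fmt"
)

// newJobID returns a random (version 4) UUID string. This is a sketch of
// the idea only; a real implementation would likely use a UUID library.
func newJobID() (string, error) {
	var b [16]byte
	if _, err := rand.Read(b[:]); err != nil {
		return "", err
	}
	b[6] = (b[6] & 0x0f) | 0x40 // set version 4
	b[8] = (b[8] & 0x3f) | 0x80 // set RFC 4122 variant
	return fmt.Sprintf("%x-%x-%x-%x-%x", b[0:4], b[4:6], b[6:8], b[8:10], b[10:16]), nil
}

func main() {
	id, _ := newJobID()
	fmt.Println(id)
}
```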
// The job exited normally (with any exit code), or was terminated via signal.
Completed = 3;
Nit: Would it make sense to let clients differentiate between jobs that completed themselves vs were stopped by a user?
Sure, I think it would be reasonable to include a field in the job status that would indicate if it was stopped by the user.
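For example (field names purely hypothetical), the status could carry a flag set by the Stop handler:

```go
package main

import "fmt"

// jobStatus sketches one way the terminal state could record a
// user-initiated stop; all field names here are hypothetical.
type jobStatus struct {
	state         string // e.g. "Running", "Completed"
	stoppedByUser bool   // set by the server when Stop() caused the termination
}

// describe renders a human-readable status, distinguishing a natural exit
// from a user-initiated stop.
func describe(s jobStatus) string {
	if s.state == "Completed" && s.stoppedByUser {
		return "Completed (stopped by user)"
	}
	return s.state
}

func main() {
	fmt.Println(describe(jobStatus{state: "Completed", stoppedByUser: true})) // prints "Completed (stopped by user)"
}
```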
docs/rfd.md
// Optional additional environment variables to set for the command.
// These will be merged with the job server's environment variables.
repeated string env = 3;
// The working directory for the command. If not specified, the job server's
// working directory will be used.
optional string cwd = 4;
// Optional user id to run the command as.
// If not specified, the job server's user id will be used.
optional int32 uid = 5;
// Optional group id to run the command as.
// If not specified, the job server's group id will be used.
optional int32 gid = 6;
// Optional data to be written to the process's stdin pipe when it starts.
// If not specified, the process's stdin will be set to /dev/null.
optional bytes stdin = 7;
Let's omit these from the scope. Command and arguments are enough for this challenge 👍
Authorization uses a very simple RBAC system. The RBAC configuration is defined as follows:

syntax = "proto3";
Do these need to be protobuf objects?
They don't, but (hypothetically, if this were a real project) defining these types as protos would make it much easier to create an admin API to allow users to update these rules on the fly, e.g.
service RBAC {
  rpc GetConfig(google.protobuf.Empty) returns (rbac.v1.Config);
  rpc UpdateConfig(rbac.v1.Config) returns (google.protobuf.Empty);
  rpc ListRoles(google.protobuf.Empty) returns (RoleList);
  // etc...
}
The RBAC configuration will simply be loaded from a config file on server startup. An example config file might look like:
Nit: It's ok to just hardcode this structure in the code to reduce the scope a little bit.
roles:
  - id: adminRole
    service: job.v1.Job
    allowedMethods:
      - Start
      - Stop
      - Status
      - List
      - Output
roleBindings:
  - id: adminRoleBinding
    roleId: adminRole
    subjects:
      - admin
How will the server determine the client's permissions? You mention X.509 certificate common name below in authz section - is that what "subjects" here is referring to?
# starting a job:
# for commands that don't require flag args, they can be passed as-is:
$ jobserver run kubectl logs pod/example
Nit: It's a little strange that both server and client are the same binary `jobserver`. Think it would make sense to have a separate tool like `jobctl` for the client?
Sure, I can do that. 👍 Building everything into one binary is usually my personal preference (although it is a bit unorthodox), but it probably doesn't make much of a difference in this case.
#### Process Lifecycle

For jobs that do not specify resource limits, they will be run as a child process of the job server itself without any additional isolation. For jobs that do specify resource limits, they will be run in their own cgroup with all of the requested limits.
Let's simplify and just run any job in a cgroup with some default limits. You can even simplify the API and remove ability for clients to provide custom limits via API.
That works, I can have all jobs be run in cgroups. The defaults could just be the (unlimited) inherited defaults from the parent.
#### Authentication and Authorization

Authentication is implemented using simple mTLS, with a subject name encoded in the client certificate's common name field.
Nit: Can you specify what exactly will be encoded in the subject? I think it's pretty clear that it will be the "subject" from the authz map but I didn't see it explicitly mentioned anywhere so wanted to make sure my understanding is correct.
Let me know if that answers #1 (comment) as well.
I'd use a different word since the "subject" in X.509 is the whole RDNSequence of fields, not just the string in the CN. Maybe "user" or "username"?
$ jobserver list
JOB ID   COMMAND   CREATED         STATUS
<id-1>   kubectl   2006-01-02...   Running
<id-2>   go        2006-01-02...   Completed
nit: Right now the `List` API returns only `JobIdList` (`rpc List(google.protobuf.Empty) returns (JobIdList)`), where additional information needs to be fetched separately from the `Status` endpoint (`rpc Status(JobId) returns (JobStatus)`). Does it make sense to return basic information about each job from the `List` endpoint?
It could work, but it would mean the List and Status methods have overlapping functionality. Also, if there are a very large number of jobs, calling List() could potentially become a very expensive operation for the server, and would make rate limiting less effective if that were to be added.
All right, let's leave the current version.
roles:
  - id: adminRole
    service: job.v1.Job
    allowedMethods:
      - Start
      - Stop
      - Status
      - List
      - Output
roleBindings:
  - id: adminRoleBinding
    roleId: adminRole
    subjects:
      - admin
This RBAC model design appears to be oriented around API methods. Is it possible to grant a user the permissions to read, stop, and view jobs that they have started without granting permissions for other users' jobs?
Yeah, that's a great idea, do you think I should add that capability to the existing design, or replace the existing design altogether (if that would still count towards the requirement)?
What do you think about extending the current design to include the ability to restrict `allowedMethods` either to the job owner's jobs or to all jobs? The current design should be easy to extend with this capability.
Good idea 👍
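A sketch of what that extension might look like in the server's authorization check — all type and field names are illustrative, not final:

```go
package main

import "fmt"

// Scope restricts a method grant either to jobs the caller started ("owner")
// or to all jobs ("all").
type Scope string

const (
	ScopeOwner Scope = "owner"
	ScopeAll   Scope = "all"
)

type MethodGrant struct {
	Method string
	Scope  Scope
}

type Role struct {
	ID             string
	AllowedMethods []MethodGrant
}

type RoleBinding struct {
	RoleID   string
	Subjects []string
}

// authorize reports whether subject may call method on a job owned by jobOwner.
func authorize(roles map[string]Role, bindings []RoleBinding, subject, method, jobOwner string) bool {
	for _, rb := range bindings {
		bound := false
		for _, s := range rb.Subjects {
			if s == subject {
				bound = true
			}
		}
		if !bound {
			continue
		}
		role, ok := roles[rb.RoleID]
		if !ok {
			continue
		}
		for _, g := range role.AllowedMethods {
			// Owner-scoped grants only apply to the caller's own jobs.
			if g.Method == method && (g.Scope == ScopeAll || subject == jobOwner) {
				return true
			}
		}
	}
	return false
}

func main() {
	demoRoles := map[string]Role{
		"userRole": {ID: "userRole", AllowedMethods: []MethodGrant{{Method: "Stop", Scope: ScopeOwner}}},
	}
	demoBindings := []RoleBinding{{RoleID: "userRole", Subjects: []string{"alice"}}}
	fmt.Println(authorize(demoRoles, demoBindings, "alice", "Stop", "alice")) // prints "true"
	fmt.Println(authorize(demoRoles, demoBindings, "alice", "Stop", "bob"))   // prints "false"
}
```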
I must say that this doc appears both brief and solid. I made one comment about the RBAC model that I want to clarify: RBAC. Otherwise, LGTM once the comments are addressed.
I agree with @r0mant, solid design overall.
int32 pid = 3;
// The exit code of the job's process.
// Only present if the job is in the Completed state.
optional int32 exitCode = 4;
`optional` scalar fields in proto3 are a bit awkward; I'd just document that these fields are only meaningful under certain conditions instead.
1. Ensure its euid is 0 so it can manage cgroups, and ensure that cgroups v2 is enabled.
2. Create a parent cgroup in `/user.slice/user-<uid>.slice/jobserver` if it doesn't exist. The specific path can be determined by first identifying its own cgroup from `/proc/self/cgroup` (which, if run as `sudo jobserver serve`, will be something like `/user.slice/user-<uid>.slice/<terminal emulator's cgroup>/vte-spawn-<some uuid>.scope`), then walking the tree up until it finds a cgroup matching `user-*.slice`.
Since we seem to be doing systemd things I can't help but mention that `systemd-run --user --scope` will execute a process in a dedicated systemd scope (unmanaged cgroup), and you can configure systemd to delegate cgroup management to your specific user, so you don't actually need to be uid 0 for this.
I'd just work inside the cgroup that the process is spawned in rather than get locked into systemd-isms -- or just use the hierarchy rooted at `/kralicky-jobserver-<nonce>/` instead, if you require superuser privileges anyway.
Good points. I thought about using the server's own cgroup as the parent, but it seems like (at least on my system, not sure if this is always the case?) the only controllers enabled in it are memory and pids, all the way up to `/user@<uid>.service`, which adds cpuset, cpu, and io.
I think the top-level cgroup idea is good, it would be a simple solution and works fine for the purposes of this challenge 👍
2. The server will write the requested resource limits to the new cgroup's `cpu.max`, `memory.max`, and `io.max` files (for whichever limits were specified).
3. The server will create a file descriptor to the new cgroup by opening its path (`/sys/fs/cgroup/user.slice/user-<uid>.slice/jobserver/<job-id>`).
4. The server will configure the `exec.Cmd`'s `SysProcAttr` with `CgroupFD` set to the file descriptor. When this is set, Go will call `clone3` with `CLONE_INTO_CGROUP` so that the new child process is started directly in the new cgroup.
5. When the process exits, the server will close the file descriptor and delete the job's cgroup.
So the "job" is tied to the continued existence of the first child, right? Could you elaborate on what is the plan to deal with other processes in the cgroup and/or processes that have inherited the same stdio pipes (and thus are holding them open)?
(subprocesses that deliberately escape the cgroup are out of scope for the challenge)
Ah, good point - when the first child exits, we'll need to clean up any other processes in the cgroup, since the cgroups won't have their own pid namespaces. So step 5 could instead be:
- When the process exits, the server will, in order:
  - close the file descriptor
  - kill any remaining processes in the job's cgroup by writing "1" to its `cgroup.kill` file
  - remove the job's cgroup directory

Would that work?
Yeah, that looks correct! 👍
int32 pid = 3;
// The exit code of the job's process.
// Only present if the job is in the Completed state.
optional int32 exitCode = 4;
nit: the raw exit code is technically platform-dependent, so it might be worth splitting it into an actual exit code and an exit signal instead
// Returns a list of all job IDs that are currently known to the server.
//
// All jobs, regardless of state, are included. No guarantees are made about
// the order of ids in the list.
rpc List(google.protobuf.Empty) returns (JobIdList);
This is not in the spec of the challenge - it's easy to implement from an API perspective but getting a table printed out well in a terminal is fraught with peril. I'd drop this 😅
Haha, I went back and forth about whether or not to include this, it was just slightly ambiguous enough in the spec where I thought it could have been deliberately left out as an 'exercise to the reader', so to speak 😆
For printing tables, I am a fan of `github.com/jedib0t/go-pretty/v6/table`; the implementation would look like:
list, err := client.List(ctx, &emptypb.Empty{})
if err != nil {
    return err
}
tab := table.NewWriter()
tab.AppendHeader(table.Row{"JOB ID", "COMMAND", "CREATED", "STATUS"})
for _, id := range list.Items {
    stat, err := client.Status(ctx, id)
    // handle err
    tab.AppendRow(table.Row{
        id.GetId(),
        stat.GetSpec().GetCommand().GetCommand(),
        stat.GetStartTime().AsTime(),
        stat.GetMessage(),
    })
}
fmt.Println(tab.Render())
I can omit List if you want, but I do think it's rather important to the overall UX.
It's your call, it's not required for the challenge but it will be reviewed if it's in there. 😇
To authorize a user for a specific API method, the server will match the client's subject name against the RBAC rules it loaded from the config file, then check to see if that subject is named in a role binding for which the associated role allows access to the method the client is calling.
just food for thought (as https://github.com/kralicky/jobserver/pull/1/files#r1425022556 will likely lead to a simplification of this model anyway): is there some other source of trusted information already in play that could be used here to decide which roles any given client is supposed to have, rather than having a dedicated configuration mechanism tied to the server?
Do you mean encoding role information into certificates?
Yep! 👍 You're delegating authn for the identity to the issuer of the client certs, why not also roles?
message IODeviceLimits {
  // Limits for individual devices
  repeated DeviceLimit devices = 1;
Feel free to have a single I/O limit on a hardcoded/configurable device ID here (it would be strange for the client to have that information anyway).
True - what if the client provides a path like /dev/xyz and the server looks up the ID itself?
I'd just go for a static device ID (for this challenge, at least).
Would that work, since the IDs are hardware specific?
// If 'follow' is set to true, the output will continue to be streamed in
// real-time until the job completes. Otherwise, only the currently accumulated
// output will be written, and the stream will be closed even if the job is
// still running.
Reading the challenge text I think that we should clarify that it's only necessary to implement the "follow" mode of getting the output - it will likely not change much in the implementation anyway, so it'd just be added noise.
Output streaming will be implemented as follows:

1. When a job is started, we will pipe the combined stdout and stderr of its process to a simple in-memory buffer.
2. When a client initiates a stream by calling `Output()`, we will write the current contents of the buffer to the stream in 4MB chunks. Then, if the client requested to follow the output, the server will keep the stream open, and any subsequent reads from the process's output will be duplicated to all clients that are following the stream, in addition to the server's own buffer.
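A minimal sketch of that buffer with follower fan-out (illustrative only — a real implementation also needs to handle slow followers, bounded buffering, and unsubscribes; here `Write` would block once a follower's buffered channel fills):

```go
package main

import (
	"fmt"
	"sync"
)

// outputBuffer accumulates a job's combined stdout/stderr and fans out new
// chunks to every follower, as described in the steps above.
type outputBuffer struct {
	mu        sync.Mutex
	data      []byte
	followers []chan []byte
}

// Write appends p to the in-memory buffer and copies the chunk to every
// follower channel.
func (b *outputBuffer) Write(p []byte) (int, error) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.data = append(b.data, p...)
	for _, f := range b.followers {
		chunk := make([]byte, len(p))
		copy(chunk, p)
		f <- chunk
	}
	return len(p), nil
}

// Follow returns a snapshot of everything written so far plus a channel
// that will receive all subsequent writes.
func (b *outputBuffer) Follow() ([]byte, <-chan []byte) {
	b.mu.Lock()
	defer b.mu.Unlock()
	snapshot := make([]byte, len(b.data))
	copy(snapshot, b.data)
	ch := make(chan []byte, 64)
	b.followers = append(b.followers, ch)
	return snapshot, ch
}

func main() {
	var buf outputBuffer
	buf.Write([]byte("hello "))
	snapshot, ch := buf.Follow()
	buf.Write([]byte("world"))
	fmt.Println(string(snapshot) + string(<-ch)) // prints "hello world"
}
```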
4MiB is the default message size limit but it's far larger than the recommended size for messages in a stream; "oral tradition" puts that at 16-64KiB (grpc/grpc.github.io#371) apparently.
Message size limits are tricky. The way I understand it is that regardless of grpc message size, the underlying http/2 transport will chunk it into frames (16KiB is the default frame size and as far as I know grpc doesn't change it), but grpc itself adjusts the transport's flow control windows (unless you change the default window size manually).
From my own testing, using larger messages results in significantly higher throughput, ostensibly because reading and writing larger messages triggers grpc to increase the size of the flow control windows, which results in fewer window update frames required per rpc.
This is all undocumented of course, and I've never been able to find a comprehensive guide on this topic, but that's what I have gathered by looking through the grpc and net/http2 sources.
That's pretty neat, although 4MiB still feels uncomfortably large for a buffer that will be handled as an atomic entity for network purposes - and you're also going to be sacrificing latency for other requests in the same h2 connection. It would be interesting (but definitely out of scope for this) to figure out the size at which the returns in throughput start diminishing: I can believe that the fragmentation at 16KiB makes the overhead noticeable, but is that still the case at 64? at 256?
Yeah, I agree with you there. This topic interests me a lot - so just for fun, some graphs! Each run is 60 seconds of a stream (with 1 client) sending messages as fast as it can, and the server echoing the response back. Would need way more data to get any real conclusions of course, but looks like maybe the best size could lie somewhere between 512KiB and 1 MiB. 1MiB is about where things drop off, the difference between 1 and 4 MiB is pretty small too.
| Payload Size (KiB) | Mbps | msgs/s |
|---|---|---|
| 16 | 580.7 | 278731 |
| 24 | 672.7 | 215249 |
| 32 | 844 | 202569 |
| 48 | 901.1 | 144174 |
| 64 | 996.9 | 119625 |
| 96 | 1266.2 | 101294 |
| 128 | 1300.5 | 78030 |
| 192 | 1431.7 | 57267 |
| 256 | 1665.3 | 49959 |
| 384 | 1782.5 | 35650 |
| 512 | 1968.7 | 29530 |
| 768 | 2373 | 23730 |
| 1024 | 2364.1 | 17731 |
| 2048 | 2582.4 | 9684 |
| 4096 | 2686.9 | 5038 |
| 5120 | 2723.3 | 4085 |
| 6144 | 2771.2 | 3464 |
| 7168 | 2685.2 | 2877 |
| 8192 | 2621.9 | 2458 |
| 16384 | 2596.3 | 1217 |
| 24576 | 2556.8 | 799 |
| 32768 | 2781.9 | 652 |
| 49152 | 2796.8 | 437 |
| 65536 | 3029.3 | 355 |
// Stops a running job. Once stopped, this method will wait until the job has
// completed before returning.
If it makes things easier feel free to have `Stop` mean "begin stopping the job".
// This will first attempt to stop the job's process using SIGTERM, but if the
// process does not exit within a short grace period, it will be forcefully
// killed with SIGKILL.
Leave this in if you feel like it'd be an interesting addition code-wise, otherwise just kill the process immediately.
I had this in here to account for being able to run jobs that run forever until you send SIGINT or SIGTERM (tail -f, journalctl -f, etc) without needing to always kill them. But it might end up being more trouble than it's worth, so I will keep that in mind.
LGTM, just a reminder to update the RBAC model: https://github.com/kralicky/jobserver/pull/1/files#r1425022556, but I don't think this is a blocker.
I based the format on the RFD template, omitted some sections that didn't seem relevant though. I think everything should be covered here, but let me know if you would like additional clarification on any topics.
Also, let me know if any of the things I have assumed to be out of scope are not in fact out of scope. e.g. I can implement cpuset limits if you like, or add some of the listed possible optimizations.