
Design document #1

Merged 2 commits into main on Dec 14, 2023
Conversation

kralicky (Owner):
I based the format on the RFD template, though I omitted some sections that didn't seem relevant. I think everything should be covered here, but let me know if you would like additional clarification on any topics.

Also, let me know if any of the things I have assumed to be out of scope are not in fact out of scope. e.g. I can implement cpuset limits if you like, or add some of the listed possible optimizations.

@r0mant left a comment:

Left a few comments but overall looks like a solid start!

```protobuf
}

message JobId {
  string id = 1;
```

How is the job id going to be generated?

kralicky (Owner, Author):

I'll go with UUIDs probably.
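For illustration, a version-4 UUID can be generated with just the standard library; this is a sketch, and a real implementation would more likely use a library such as github.com/google/uuid:

```go
package main

import (
	"crypto/rand"
	"fmt"
)

// newJobID generates a random version-4 UUID string to use as a job ID.
func newJobID() string {
	var b [16]byte
	if _, err := rand.Read(b[:]); err != nil {
		panic(err)
	}
	b[6] = (b[6] & 0x0f) | 0x40 // set version 4
	b[8] = (b[8] & 0x3f) | 0x80 // set RFC 4122 variant
	return fmt.Sprintf("%x-%x-%x-%x-%x", b[0:4], b[4:6], b[6:8], b[8:10], b[10:16])
}

func main() {
	fmt.Println(newJobID())
}
```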

Comment on lines +137 to +138
```protobuf
// The job exited normally (with any exit code), or was terminated via signal.
Completed = 3;
```

Nit: Would it make sense to let clients differentiate between jobs that completed themselves vs were stopped by a user?

kralicky (Owner, Author):

Sure, I think it would be reasonable to include a field in the job status that would indicate if it was stopped by the user.

docs/rfd.md Outdated
Comment on lines 171 to 185
```protobuf
// Optional additional environment variables to set for the command.
// These will be merged with the job server's environment variables.
repeated string env = 3;
// The working directory for the command. If not specified, the job server's
// working directory will be used.
optional string cwd = 4;
// Optional user id to run the command as.
// If not specified, the job server's user id will be used.
optional int32 uid = 5;
// Optional group id to run the command as.
// If not specified, the job server's group id will be used.
optional int32 gid = 6;
// Optional data to be written to the process's stdin pipe when it starts.
// If not specified, the process's stdin will be set to /dev/null.
optional bytes stdin = 7;
```

Let's omit these from the scope. Command and arguments are enough for this challenge 👍

Authorization uses a very simple RBAC system. The RBAC configuration is defined as follows:

```protobuf
syntax = "proto3";
```

Do these need to be protobuf objects?

kralicky (Owner, Author):

They don't, but (hypothetically, if this were a real project) defining these types as protos would make it much easier to create an admin API to allow users to update these rules on the fly, e.g.

```protobuf
service RBAC {
  rpc GetConfig(google.protobuf.Empty) returns (rbac.v1.Config);
  rpc UpdateConfig(rbac.v1.Config) returns (google.protobuf.Empty);
  rpc ListRoles(google.protobuf.Empty) returns (RoleList);
  // etc...
}
```

The RBAC configuration will simply be loaded from a config file on server startup. An example config file might look like:

Nit: It's ok to just hardcode this structure in the code to reduce the scope a little bit.
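If the structure were hardcoded, it might look something like this in Go. This is a sketch; the type and field names are illustrative, not taken from the RFD:

```go
package main

import "fmt"

// Role grants access to a set of methods on a service.
type Role struct {
	ID             string
	Service        string
	AllowedMethods []string
}

// RoleBinding associates subjects (client names) with a role.
type RoleBinding struct {
	ID       string
	RoleID   string
	Subjects []string
}

// Config is the full RBAC configuration.
type Config struct {
	Roles        []Role
	RoleBindings []RoleBinding
}

// Allowed reports whether subject may call method, by finding a role
// binding naming the subject whose role lists the method.
func (c *Config) Allowed(subject, method string) bool {
	for _, rb := range c.RoleBindings {
		for _, s := range rb.Subjects {
			if s != subject {
				continue
			}
			for _, r := range c.Roles {
				if r.ID != rb.RoleID {
					continue
				}
				for _, m := range r.AllowedMethods {
					if m == method {
						return true
					}
				}
			}
		}
	}
	return false
}

func main() {
	cfg := &Config{
		Roles: []Role{{
			ID:      "adminRole",
			Service: "job.v1.Job",
			AllowedMethods: []string{"Start", "Stop", "Status", "List", "Output"},
		}},
		RoleBindings: []RoleBinding{{
			ID:       "adminRoleBinding",
			RoleID:   "adminRole",
			Subjects: []string{"admin"},
		}},
	}
	fmt.Println(cfg.Allowed("admin", "Start")) // true
	fmt.Println(cfg.Allowed("guest", "Start")) // false
}
```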

docs/rfd.md Outdated
Comment on lines 295 to 308
```yaml
roles:
  - id: adminRole
    service: job.v1.Job
    allowedMethods:
      - Start
      - Stop
      - Status
      - List
      - Output
roleBindings:
  - id: adminRoleBinding
    roleId: adminRole
    subjects:
      - admin
```

How will the server determine the client's permissions? You mention X.509 certificate common name below in authz section - is that what "subjects" here is referring to?

docs/rfd.md Outdated

```shell
# starting a job:
# for commands that don't require flag args, they can be passed as-is:
$ jobserver run kubectl logs pod/example
```

Nit: It's a little strange that both server and client are the same binary jobserver. Think it would make sense to have a separate tool like jobctl for the client?

kralicky (Owner, Author):

Sure, I can do that. 👍 Building everything into one binary is usually my personal preference (although it is a bit unorthodox), but it probably doesn't make much of a difference in this case.


#### Process Lifecycle

Jobs that do not specify resource limits will run as child processes of the job server itself, without any additional isolation. Jobs that do specify resource limits will run in their own cgroup with all of the requested limits applied.

Let's simplify and just run any job in a cgroup with some default limits. You can even simplify the API and remove ability for clients to provide custom limits via API.

kralicky (Owner, Author):

That works, I can have all jobs be run in cgroups. The defaults could just be the (unlimited) inherited defaults from the parent.

docs/rfd.md Outdated

#### Authentication and Authorization

Authentication is implemented using simple mTLS, with a subject name encoded in the client certificate's common name field.

Nit: Can you specify what exactly will be encoded in the subject? I think it's pretty clear that it will be the "subject" from the authz map but I didn't see it explicitly mentioned anywhere so wanted to make sure my understanding is correct.

kralicky (Owner, Author):

That is correct, the client certificate's Common Name is used as the subject name for authz purposes. So the user "admin" will authenticate with a client cert that looks like:
[image: example client certificate with CN=admin]
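For illustration, the following sketch generates a self-signed certificate with CN=admin and reads the common name back, the same field the server would read from the verified peer certificate in the TLS connection state (the helper name and layout here are assumptions):

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"time"
)

// selfSignedCN creates a throwaway self-signed certificate whose
// common name is set to name, then parses it and returns the CN.
func selfSignedCN(name string) string {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		panic(err)
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(1),
		Subject:      pkix.Name{CommonName: name},
		NotBefore:    time.Now(),
		NotAfter:     time.Now().Add(time.Hour),
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		panic(err)
	}
	cert, err := x509.ParseCertificate(der)
	if err != nil {
		panic(err)
	}
	return cert.Subject.CommonName
}

func main() {
	// The server would read this from the verified peer certificate.
	fmt.Println("subject name:", selfSignedCN("admin"))
}
```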

kralicky (Owner, Author):

Let me know if that answers #1 (comment) as well.


I'd use a different word since the "subject" in X.509 is the whole RDNSequence of fields, not just the string in the CN. Maybe "user" or "username"?

docs/rfd.md Outdated
Comment on lines 413 to 416
```shell
$ jobserver list
JOB ID   COMMAND   CREATED        STATUS
<id-1>   kubectl   2006-01-02...  Running
<id-2>   go        2006-01-02...  Completed
```
Collaborator:

nit: Right now the List API returns only `JobIdList` (`rpc List(google.protobuf.Empty) returns (JobIdList)`), so additional information needs to be fetched separately from the Status endpoint (`rpc Status(JobId) returns (JobStatus)`).

Does it make sense to return basic information about each job from the List endpoint?

kralicky (Owner, Author):

It could work, but it would mean the List and Status methods have overlapping functionality. Also, if there are a very large number of jobs, calling List() could potentially become a very expensive operation for the server, and would make rate limiting less effective if that were to be added.

Collaborator:

All right, let's leave the current version.

docs/rfd.md Outdated
Comment on lines 295 to 309
```yaml
roles:
  - id: adminRole
    service: job.v1.Job
    allowedMethods:
      - Start
      - Stop
      - Status
      - List
      - Output
roleBindings:
  - id: adminRoleBinding
    roleId: adminRole
    subjects:
      - admin
```
Collaborator:

This RBAC model design appears to be oriented around API methods. Is it possible to grant a user the permissions to read, stop, and view jobs that they have started, without granting permissions for other users' jobs?

kralicky (Owner, Author):

Yeah, that's a great idea. Do you think I should add that capability to the existing design, or replace the existing design altogether (if that would still count towards the requirement)?

Collaborator:

What do you think about extending the current design to include the ability to restrict each 'allowedMethod' either to the job owner's jobs or to all jobs? The current design should be easy to extend with this capability.

kralicky (Owner, Author):

Good idea 👍
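That extension might look something like the following. This is purely an illustrative sketch; the `scope` field name and its values are assumptions, not part of the RFD:

```yaml
roles:
  - id: userRole
    service: job.v1.Job
    allowedMethods:
      - name: Start
      - name: Stop
        scope: owner   # may only stop jobs this subject started
      - name: Status
        scope: owner
      - name: Output
        scope: owner
roleBindings:
  - id: userRoleBinding
    roleId: userRole
    subjects:
      - alice
```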

@smallinsky (Collaborator) left a comment:

I must say this doc appears both brief and solid.

I made one comment about the RBAC model that I want to clarify.

Otherwise, LGTM once the comments are addressed.

@espadolini left a comment:

I agree with @r0mant, solid design overall.

```protobuf
int32 pid = 3;
// The exit code of the job's process.
// Only present if the job is in the Completed state.
optional int32 exitCode = 4;
```


Optional scalar fields in proto3 are a bit awkward; I'd just document that these fields are only meaningful under certain conditions instead.

docs/rfd.md Outdated
Comment on lines 434 to 435
1. Ensure its euid is 0 so it can manage cgroups, and ensure that cgroups v2 is enabled.
2. Create a parent cgroup in `/user.slice/user-<uid>.slice/jobserver` if it doesn't exist. The specific path can be determined by first identifying its own cgroup from `/proc/self/cgroup` (which, if run as `sudo jobserver serve`, will be something like `/user.slice/user-<uid>.slice/<terminal emulator's cgroup>/vte-spawn-<some uuid>.scope`), then walking the tree up until it finds a cgroup matching `user-*.slice`.


Since we seem to be doing systemd things I can't help but mention that systemd-run --user --scope will execute a process in a dedicated systemd scope (unmanaged cgroup), and you can configure systemd to delegate cgroup management to your specific user, so you don't actually need to be uid 0 for this.

I'd just work inside the cgroup that the process is spawned in rather than get locked into systemd-isms -- or just use the hierarchy rooted at /kralicky-jobserver-<nonce>/ instead, if you require superuser privileges anyway.

kralicky (Owner, Author):

Good points. I thought about using the server's own cgroup as the parent but it seems like (at least on my system, not sure if this is always the case?) the only controllers enabled in it are memory and pids, all the way up to /user@<uid>.service which adds cpuset, cpu, and io.

I think the top level cgroup idea is good, it would be a simple solution and works fine for the purposes of this challenge 👍

docs/rfd.md Outdated
2. The server will write the requested resource limits to the new cgroup's `cpu.max`, `memory.max`, and `io.max` files (for whichever limits were specified).
3. The server will create a file descriptor to the new cgroup by opening its path (`/sys/fs/cgroup/user.slice/user-<uid>.slice/jobserver/<job-id>`).
4. The server will configure the `exec.Cmd`'s `SysProcAttr` with `CgroupFd` set to the file descriptor. When this is set, Go will call `clone3` with `CLONE_INTO_CGROUP` so that the new child process is started directly in the new cgroup.
5. When the process exits, the server will close the file descriptor and delete the job's cgroup.


So the "job" is tied to the continued existence of the first child, right? Could you elaborate on what is the plan to deal with other processes in the cgroup and/or processes that have inherited the same stdio pipes (and thus are holding them open)?

(subprocesses that deliberately escape the cgroup are out of scope for the challenge)

kralicky (Owner, Author):

Ah, good point - when the first child exits, we'll need to clean up any other processes in the cgroup since the cgroups won't have their own pid namespaces. So step 5 could instead be:

1. When the process exits, the server will, in order:
   - close the file descriptor
   - kill any remaining processes in the job's cgroup by writing "1" to its cgroup.kill file
   - remove the job's cgroup directory

Would that work?


Yeah, that looks correct! 👍
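A minimal sketch of the `CgroupFd` wiring discussed in this thread, assuming Go 1.20+ on Linux. A temp directory stands in for the job's cgroup so the snippet runs unprivileged, and the command is never actually started:

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"syscall"
)

// buildCmd configures (but does not start) a command so that, when
// started, Go calls clone3 with CLONE_INTO_CGROUP and the child is
// placed directly into the cgroup referred to by cgroupFD.
func buildCmd(cgroupFD int) *exec.Cmd {
	cmd := exec.Command("sleep", "1")
	cmd.SysProcAttr = &syscall.SysProcAttr{
		UseCgroupFD: true,
		CgroupFD:    cgroupFD,
	}
	return cmd
}

func main() {
	// In the real server this would be the job's cgroup directory
	// (e.g. a path under /sys/fs/cgroup); a temp dir stands in here so
	// the sketch runs without root.
	dir, err := os.MkdirTemp("", "cgroup-stand-in")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)
	f, err := os.Open(dir)
	if err != nil {
		panic(err)
	}
	defer f.Close()

	cmd := buildCmd(int(f.Fd()))
	fmt.Println("UseCgroupFD:", cmd.SysProcAttr.UseCgroupFD)
}
```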

```protobuf
int32 pid = 3;
// The exit code of the job's process.
// Only present if the job is in the Completed state.
optional int32 exitCode = 4;
```


nit: the raw exit code is technically platform-dependent, so it might be worth splitting into actual exit code and signal exit instead

Comment on lines +95 to +99
```protobuf
// Returns a list of all job IDs that are currently known to the server.
//
// All jobs, regardless of state, are included. No guarantees are made about
// the order of ids in the list.
rpc List(google.protobuf.Empty) returns (JobIdList);
```


This is not in the spec of the challenge - it's easy to implement from an API perspective but getting a table printed out well in a terminal is fraught with peril. I'd drop this 😅

kralicky (Owner, Author), Dec 13, 2023:

Haha, I went back and forth about whether or not to include this, it was just slightly ambiguous enough in the spec where I thought it could have been deliberately left out as an 'exercise to the reader' so to speak 😆

For printing tables, I am a fan of "github.com/jedib0t/go-pretty/v6/table", the implementation would look like:

```go
list, err := client.List(ctx, &emptypb.Empty{})
if err != nil {
	return err
}
tab := table.NewWriter()
tab.AppendHeader(table.Row{"JOB ID", "COMMAND", "CREATED", "STATUS"})
for _, id := range list.Items {
	stat, err := client.Status(ctx, id)
	// handle err
	tab.AppendRow(table.Row{
		id.GetId(),
		stat.GetSpec().GetCommand().GetCommand(),
		stat.GetStartTime().AsTime(),
		stat.GetMessage(),
	})
}
fmt.Println(tab.Render())
```

I can omit List if you want, but I do think it's rather important to the overall UX.


It's your call, it's not required for the challenge but it will be reviewed if it's in there. 😇


docs/rfd.md Outdated
Comment on lines 460 to 461
To authorize a user for a specific api method, the server will match the client's subject name against the rbac rules it loaded from the config file,
then check to see if that subject is named in a role binding for which the associated role allows access to the method the client is calling.


just food for thought (as https://github.com/kralicky/jobserver/pull/1/files#r1425022556 will likely lead to a simplification of this model anyway): is there some other source of trusted information already in play that could be used here to decide which roles any given client is supposed to have, rather than having a dedicated configuration mechanism tied to the server?

kralicky (Owner, Author):

Do you mean encoding role information into certificates?


Yep! 👍 You're delegating authn for the identity to the issuer of the client certs, why not also roles?


```protobuf
message IODeviceLimits {
  // Limits for individual devices
  repeated DeviceLimit devices = 1;
```


Feel free to have a single I/O limit on a hardcoded/configurable device ID here (it would be strange for the client to have that information anyway).

kralicky (Owner, Author):

True - what if the client provides a path like /dev/xyz and the server looks up the ID itself?


I'd just go for a static device ID (for this challenge, at least).

kralicky (Owner, Author):

Would that work, since the IDs are hardware specific?

docs/rfd.md Outdated
Comment on lines 106 to 109
```protobuf
// If 'follow' is set to true, the output will continue to be streamed in
// real-time until the job completes. Otherwise, only the currently accumulated
// output will be written, and the stream will be closed even if the job is
// still running.
```


Reading the challenge text I think that we should clarify that it's only necessary to implement the "follow" mode of getting the output - it will likely not change much in the implementation anyway, so it'd just be added noise.

docs/rfd.md Outdated
Output streaming will be implemented as follows:

1. When a job is started, we will pipe the combined stdout and stderr of its process to a simple in-memory buffer.
2. When a client initiates a stream by calling `Output()`, we will write the current contents of the buffer to the stream in 4MB chunks. Then, if the client requested to follow the output, the server will keep the stream open, and any subsequent reads from the process's output will be duplicated to all clients that are following the stream, in addition to the server's own buffer.


4MiB is the default message size limit but it's far larger than the recommended size for messages in a stream; "oral tradition" puts that at 16-64KiB (grpc/grpc.github.io#371) apparently.

kralicky (Owner, Author):

Message size limits are tricky. The way I understand it is that regardless of grpc message size, the underlying http/2 transport will chunk it into frames (16KiB is the default frame size and as far as I know grpc doesn't change it), but grpc itself adjusts the transport's flow control windows (unless you change the default window size manually)

From my own testing, using larger messages results in significantly higher throughput, ostensibly because reading and writing larger messages triggers grpc to increase the size of the flow control windows, which results in fewer window update frames required per rpc.

This is all undocumented of course, and I've never been able to find a comprehensive guide on this topic, but that's what I have gathered by looking through the grpc and net/http2 sources.


That's pretty neat, although 4MiB still feels uncomfortably large for a buffer that will be handled as an atomic entity for network purposes - and you're also going to be sacrificing latency for other requests in the same h2 connection. It would be interesting (but definitely out of scope for this) to figure out the size at which the returns in throughput start diminishing: I can believe that the fragmentation at 16KiB makes the overhead noticeable, but is that still the case at 64? at 256?

kralicky (Owner, Author):

Yeah, I agree with you there. This topic interests me a lot - so just for fun, some graphs! Each run is 60 seconds of a stream (with 1 client) sending messages as fast as it can, and the server echoing the response back. Would need way more data to get any real conclusions of course, but looks like maybe the best size could lie somewhere between 512KiB and 1 MiB. 1MiB is about where things drop off, the difference between 1 and 4 MiB is pretty small too.

[Figure_1 and Figure_2: throughput and message rate vs. payload size]

| Payload Size (KiB) | Mbps | msgs/s |
|---:|---:|---:|
| 16 | 580.7 | 278731 |
| 24 | 672.7 | 215249 |
| 32 | 844 | 202569 |
| 48 | 901.1 | 144174 |
| 64 | 996.9 | 119625 |
| 96 | 1266.2 | 101294 |
| 128 | 1300.5 | 78030 |
| 192 | 1431.7 | 57267 |
| 256 | 1665.3 | 49959 |
| 384 | 1782.5 | 35650 |
| 512 | 1968.7 | 29530 |
| 768 | 2373 | 23730 |
| 1024 | 2364.1 | 17731 |
| 2048 | 2582.4 | 9684 |
| 4096 | 2686.9 | 5038 |
| 5120 | 2723.3 | 4085 |
| 6144 | 2771.2 | 3464 |
| 7168 | 2685.2 | 2877 |
| 8192 | 2621.9 | 2458 |
| 16384 | 2596.3 | 1217 |
| 24576 | 2556.8 | 799 |
| 32768 | 2781.9 | 652 |
| 49152 | 2796.8 | 437 |
| 65536 | 3029.3 | 355 |
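Whatever chunk size ends up being chosen, the chunking of accumulated output before writing it to the stream is straightforward. A sketch (the helper name and the 64 KiB size are arbitrary assumptions):

```go
package main

import "fmt"

// chunked splits buf into slices of at most size bytes, as a server
// might when writing accumulated job output to a gRPC stream in
// fixed-size messages.
func chunked(buf []byte, size int) [][]byte {
	var out [][]byte
	for len(buf) > 0 {
		n := size
		if len(buf) < n {
			n = len(buf)
		}
		out = append(out, buf[:n])
		buf = buf[n:]
	}
	return out
}

func main() {
	data := make([]byte, 150)
	for _, c := range chunked(data, 64) {
		fmt.Println(len(c)) // 64, 64, 22
	}
}
```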

Comment on lines +74 to +75
```protobuf
// Stops a running job. Once stopped, this method will wait until the job has
// completed before returning.
```


If it makes things easier feel free to have Stop mean "begin stopping the job".

Comment on lines +77 to +79
```protobuf
// This will first attempt to stop the job's process using SIGTERM, but if the
// process does not exit within a short grace period, it will be forcefully
// killed with SIGKILL.
```


Leave this in if you feel like it'd be an interesting addition code-wise, otherwise just kill the process immediately.

kralicky (Owner, Author), Dec 13, 2023:

I had this in here to account for being able to run jobs that run forever until you send SIGINT or SIGTERM (tail -f, journalctl -f, etc) without needing to always kill them. But it might end up being more trouble than it's worth, so I will keep that in mind.

@smallinsky (Collaborator) left a comment:

LGTM, just a reminder to update the RBAC model: https://github.com/kralicky/jobserver/pull/1/files#r1425022556, but I don't think this is a blocker.

@kralicky kralicky merged commit fa1d91a into main Dec 14, 2023
@kralicky kralicky deleted the design-doc branch December 14, 2023 22:14