
Design document #1

Merged 2 commits into main on Dec 14, 2023
Conversation

kralicky (Owner):
I based the format on the RFD template, though I omitted some sections that didn't seem relevant. I think everything should be covered here, but let me know if you would like additional clarification on any topics.

Also, let me know if any of the things I have assumed to be out of scope are not in fact out of scope. e.g. I can implement cpuset limits if you like, or add some of the listed possible optimizations.

@r0mant left a comment:

Left a few comments but overall looks like a solid start!

```protobuf
}

message JobId {
  string id = 1;
```

How is the job id going to be generated?

kralicky (Owner, Author):

I'll go with UUIDs probably.
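For illustration, a version-4 UUID can be generated with just the standard library; this is a sketch, and a real implementation would more likely use a library such as github.com/google/uuid:

```go
package main

import (
	"crypto/rand"
	"fmt"
)

// newJobID generates a random version-4 UUID string to use as a job ID.
func newJobID() string {
	var b [16]byte
	if _, err := rand.Read(b[:]); err != nil {
		panic(err)
	}
	b[6] = (b[6] & 0x0f) | 0x40 // set version 4
	b[8] = (b[8] & 0x3f) | 0x80 // set RFC 4122 variant
	return fmt.Sprintf("%x-%x-%x-%x-%x", b[0:4], b[4:6], b[6:8], b[8:10], b[10:16])
}

func main() {
	fmt.Println(newJobID())
}
```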

Comment on lines +137 to +138
```protobuf
// The job exited normally (with any exit code), or was terminated via signal.
Completed = 3;
```

Nit: Would it make sense to let clients differentiate between jobs that completed themselves vs were stopped by a user?

kralicky (Owner, Author):

Sure, I think it would be reasonable to include a field in the job status that would indicate if it was stopped by the user.

docs/rfd.md Outdated
Comment on lines 171 to 185
```protobuf
// Optional additional environment variables to set for the command.
// These will be merged with the job server's environment variables.
repeated string env = 3;
// The working directory for the command. If not specified, the job server's
// working directory will be used.
optional string cwd = 4;
// Optional user id to run the command as.
// If not specified, the job server's user id will be used.
optional int32 uid = 5;
// Optional group id to run the command as.
// If not specified, the job server's group id will be used.
optional int32 gid = 6;
// Optional data to be written to the process's stdin pipe when it starts.
// If not specified, the process's stdin will be set to /dev/null.
optional bytes stdin = 7;
```

Let's omit these from the scope. Command and arguments are enough for this challenge 👍

Authorization uses a very simple RBAC system. The RBAC configuration is defined as follows:

```protobuf
syntax = "proto3";
```

Do these need to be protobuf objects?

kralicky (Owner, Author):

They don't, but (hypothetically, if this were a real project) defining these types as protos would make it much easier to create an admin API to allow users to update these rules on the fly, e.g.

```protobuf
service RBAC {
  rpc GetConfig(google.protobuf.Empty) returns (rbac.v1.Config);
  rpc UpdateConfig(rbac.v1.Config) returns (google.protobuf.Empty);
  rpc ListRoles(google.protobuf.Empty) returns (RoleList);
  // etc...
}
```

The RBAC configuration will simply be loaded from a config file on server startup. An example config file might look like:

Nit: It's ok to just hardcode this structure in the code to reduce the scope a little bit.
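If the structure were hardcoded, it might look something like this in Go. This is a sketch; the type and field names are illustrative, not taken from the RFD:

```go
package main

import "fmt"

// Role grants access to a set of methods on a service.
type Role struct {
	ID             string
	Service        string
	AllowedMethods []string
}

// RoleBinding associates subjects (client names) with a role.
type RoleBinding struct {
	ID       string
	RoleID   string
	Subjects []string
}

// Config is the full RBAC configuration.
type Config struct {
	Roles        []Role
	RoleBindings []RoleBinding
}

// Allowed reports whether subject may call method, by finding a role
// binding naming the subject whose role lists the method.
func (c *Config) Allowed(subject, method string) bool {
	for _, rb := range c.RoleBindings {
		for _, s := range rb.Subjects {
			if s != subject {
				continue
			}
			for _, r := range c.Roles {
				if r.ID != rb.RoleID {
					continue
				}
				for _, m := range r.AllowedMethods {
					if m == method {
						return true
					}
				}
			}
		}
	}
	return false
}

func main() {
	cfg := &Config{
		Roles: []Role{{
			ID:      "adminRole",
			Service: "job.v1.Job",
			AllowedMethods: []string{"Start", "Stop", "Status", "List", "Output"},
		}},
		RoleBindings: []RoleBinding{{
			ID:       "adminRoleBinding",
			RoleID:   "adminRole",
			Subjects: []string{"admin"},
		}},
	}
	fmt.Println(cfg.Allowed("admin", "Start")) // true
	fmt.Println(cfg.Allowed("guest", "Start")) // false
}
```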

docs/rfd.md Outdated
Comment on lines 295 to 308
```yaml
roles:
  - id: adminRole
    service: job.v1.Job
    allowedMethods:
      - Start
      - Stop
      - Status
      - List
      - Output
roleBindings:
  - id: adminRoleBinding
    roleId: adminRole
    subjects:
      - admin
```

How will the server determine the client's permissions? You mention X.509 certificate common name below in authz section - is that what "subjects" here is referring to?

docs/rfd.md Outdated

```shell
# starting a job:
# for commands that don't require flag args, they can be passed as-is:
$ jobserver run kubectl logs pod/example
```

Nit: It's a little strange that both server and client are the same binary jobserver. Think it would make sense to have a separate tool like jobctl for the client?

kralicky (Owner, Author):

Sure, I can do that. 👍 Building everything into one binary is usually my personal preference (although it is a bit unorthodox), but it probably doesn't make much of a difference in this case.


#### Process Lifecycle

Jobs that do not specify resource limits will run as child processes of the job server itself, without any additional isolation. Jobs that do specify resource limits will run in their own cgroup with all of the requested limits applied.

Let's simplify and just run any job in a cgroup with some default limits. You can even simplify the API and remove ability for clients to provide custom limits via API.

kralicky (Owner, Author):

That works, I can have all jobs be run in cgroups. The defaults could just be the (unlimited) inherited defaults from the parent.

docs/rfd.md Outdated

#### Authentication and Authorization

Authentication is implemented using simple mTLS, with a subject name encoded in the client certificate's common name field.

Nit: Can you specify what exactly will be encoded in the subject? I think it's pretty clear that it will be the "subject" from the authz map but I didn't see it explicitly mentioned anywhere so wanted to make sure my understanding is correct.

kralicky (Owner, Author):

That is correct, the client certificate's Common Name is used as the subject name for authz purposes. So the user "admin" will authenticate with a client cert that looks like:
[image: example client certificate with CN=admin]
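For illustration, the following sketch generates a self-signed certificate with CN=admin and reads the common name back, the same field the server would read from the verified peer certificate in the TLS connection state (the helper name and layout here are assumptions):

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"time"
)

// selfSignedCN creates a throwaway self-signed certificate whose
// common name is set to name, then parses it and returns the CN.
func selfSignedCN(name string) string {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		panic(err)
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(1),
		Subject:      pkix.Name{CommonName: name},
		NotBefore:    time.Now(),
		NotAfter:     time.Now().Add(time.Hour),
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		panic(err)
	}
	cert, err := x509.ParseCertificate(der)
	if err != nil {
		panic(err)
	}
	return cert.Subject.CommonName
}

func main() {
	// The server would read this from the verified peer certificate.
	fmt.Println("subject name:", selfSignedCN("admin"))
}
```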

kralicky (Owner, Author):

Let me know if that answers #1 (comment) as well.


I'd use a different word since the "subject" in X.509 is the whole RDNSequence of fields, not just the string in the CN. Maybe "user" or "username"?

docs/rfd.md Outdated
Comment on lines 413 to 416
```shell
$ jobserver list
JOB ID   COMMAND   CREATED        STATUS
<id-1>   kubectl   2006-01-02...  Running
<id-2>   go        2006-01-02...  Completed
```
Collaborator:

nit: Right now the List API returns only `JobIdList` (`rpc List(google.protobuf.Empty) returns (JobIdList)`), so additional information needs to be fetched separately from the Status endpoint (`rpc Status(JobId) returns (JobStatus)`).

Does it make sense to return basic information about each job from the List endpoint?

kralicky (Owner, Author):

It could work, but it would mean the List and Status methods have overlapping functionality. Also, if there are a very large number of jobs, calling List() could potentially become a very expensive operation for the server, and would make rate limiting less effective if that were to be added.

Collaborator:

All right, let's leave the current version.

docs/rfd.md Outdated
Comment on lines 295 to 309
```yaml
roles:
  - id: adminRole
    service: job.v1.Job
    allowedMethods:
      - Start
      - Stop
      - Status
      - List
      - Output
roleBindings:
  - id: adminRoleBinding
    roleId: adminRole
    subjects:
      - admin
```
Collaborator:

This RBAC model design appears to be oriented around API methods. Is it possible to grant a user the permissions to read, stop, and view jobs that they have started, without granting permissions for other users' jobs?

kralicky (Owner, Author):

Yeah, that's a great idea. Do you think I should add that capability to the existing design, or replace the existing design altogether (if that would still count towards the requirement)?

Collaborator:

What do you think about extending the current design to include the ability to restrict each 'allowedMethod' either to the job owner's jobs or to all jobs? The current design should be easy to extend with this capability.

kralicky (Owner, Author):

Good idea 👍
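That extension might look something like the following. This is purely an illustrative sketch; the `scope` field name and its values are assumptions, not part of the RFD:

```yaml
roles:
  - id: userRole
    service: job.v1.Job
    allowedMethods:
      - name: Start
      - name: Stop
        scope: owner   # may only stop jobs this subject started
      - name: Status
        scope: owner
      - name: Output
        scope: owner
roleBindings:
  - id: userRoleBinding
    roleId: userRole
    subjects:
      - alice
```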

@smallinsky (Collaborator) left a comment:

I must say this doc appears both brief and solid.

I made one comment about the RBAC model that I want to clarify.

Otherwise, LGTM once the comments are addressed.

@espadolini left a comment:

I agree with @r0mant, solid design overall.

```protobuf
int32 pid = 3;
// The exit code of the job's process.
// Only present if the job is in the Completed state.
optional int32 exitCode = 4;
```


Optional scalar fields in proto3 are a bit awkward; I'd just document that these fields are only meaningful under certain conditions instead.

docs/rfd.md Outdated
Comment on lines 434 to 435
1. Ensure its euid is 0 so it can manage cgroups, and ensure that cgroups v2 is enabled.
2. Create a parent cgroup in `/user.slice/user-<uid>.slice/jobserver` if it doesn't exist. The specific path can be determined by first identifying its own cgroup from `/proc/self/cgroup` (which, if run as `sudo jobserver serve`, will be something like `/user.slice/user-<uid>.slice/<terminal emulator's cgroup>/vte-spawn-<some uuid>.scope`), then walking the tree up until it finds a cgroup matching `user-*.slice`.


Since we seem to be doing systemd things I can't help but mention that systemd-run --user --scope will execute a process in a dedicated systemd scope (unmanaged cgroup), and you can configure systemd to delegate cgroup management to your specific user, so you don't actually need to be uid 0 for this.

I'd just work inside the cgroup that the process is spawned in rather than get locked into systemd-isms -- or just use the hierarchy rooted at /kralicky-jobserver-<nonce>/ instead, if you require superuser privileges anyway.

kralicky (Owner, Author):

Good points. I thought about using the server's own cgroup as the parent but it seems like (at least on my system, not sure if this is always the case?) the only controllers enabled in it are memory and pids, all the way up to /user@<uid>.service which adds cpuset, cpu, and io.

I think the top level cgroup idea is good, it would be a simple solution and works fine for the purposes of this challenge 👍

docs/rfd.md Outdated
2. The server will write the requested resource limits to the new cgroup's `cpu.max`, `memory.max`, and `io.max` files (for whichever limits were specified).
3. The server will create a file descriptor to the new cgroup by opening its path (`/sys/fs/cgroup/user.slice/user-<uid>.slice/jobserver/<job-id>`).
4. The server will configure the `exec.Cmd`'s `SysProcAttr` with `CgroupFd` set to the file descriptor. When this is set, Go will call `clone3` with `CLONE_INTO_CGROUP` so that the new child process is started directly in the new cgroup.
5. When the process exits, the server will close the file descriptor and delete the job's cgroup.


So the "job" is tied to the continued existence of the first child, right? Could you elaborate on what is the plan to deal with other processes in the cgroup and/or processes that have inherited the same stdio pipes (and thus are holding them open)?

(subprocesses that deliberately escape the cgroup are out of scope for the challenge)

kralicky (Owner, Author):

Ah, good point - when the first child exits, we'll need to clean up any other processes in the cgroup since the cgroups won't have their own pid namespaces. So step 5 could instead be:

1. When the process exits, the server will, in order:
   - close the file descriptor
   - kill any remaining processes in the job's cgroup by writing "1" to its cgroup.kill file
   - remove the job's cgroup directory

Would that work?


Yeah, that looks correct! 👍
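A minimal sketch of the `CgroupFd` wiring discussed in this thread, assuming Go 1.20+ on Linux. A temp directory stands in for the job's cgroup so the snippet runs unprivileged, and the command is never actually started:

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"syscall"
)

// buildCmd configures (but does not start) a command so that, when
// started, Go calls clone3 with CLONE_INTO_CGROUP and the child is
// placed directly into the cgroup referred to by cgroupFD.
func buildCmd(cgroupFD int) *exec.Cmd {
	cmd := exec.Command("sleep", "1")
	cmd.SysProcAttr = &syscall.SysProcAttr{
		UseCgroupFD: true,
		CgroupFD:    cgroupFD,
	}
	return cmd
}

func main() {
	// In the real server this would be the job's cgroup directory
	// (e.g. a path under /sys/fs/cgroup); a temp dir stands in here so
	// the sketch runs without root.
	dir, err := os.MkdirTemp("", "cgroup-stand-in")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)
	f, err := os.Open(dir)
	if err != nil {
		panic(err)
	}
	defer f.Close()

	cmd := buildCmd(int(f.Fd()))
	fmt.Println("UseCgroupFD:", cmd.SysProcAttr.UseCgroupFD)
}
```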

```protobuf
int32 pid = 3;
// The exit code of the job's process.
// Only present if the job is in the Completed state.
optional int32 exitCode = 4;
```


nit: the raw exit code is technically platform-dependent, so it might be worth splitting into actual exit code and signal exit instead

Comment on lines +95 to +99
```protobuf
// Returns a list of all job IDs that are currently known to the server.
//
// All jobs, regardless of state, are included. No guarantees are made about
// the order of ids in the list.
rpc List(google.protobuf.Empty) returns (JobIdList);
```


This is not in the spec of the challenge - it's easy to implement from an API perspective but getting a table printed out well in a terminal is fraught with peril. I'd drop this 😅

kralicky (Owner, Author), Dec 13, 2023:

Haha, I went back and forth about whether or not to include this, it was just slightly ambiguous enough in the spec where I thought it could have been deliberately left out as an 'exercise to the reader' so to speak 😆

For printing tables, I am a fan of "github.com/jedib0t/go-pretty/v6/table", the implementation would look like:

```go
list, err := client.List(ctx, &emptypb.Empty{})
if err != nil {
	return err
}
tab := table.NewWriter()
tab.AppendHeader(table.Row{"JOB ID", "COMMAND", "CREATED", "STATUS"})
for _, id := range list.Items {
	stat, err := client.Status(ctx, id)
	// handle err
	tab.AppendRow(table.Row{
		id.GetId(),
		stat.GetSpec().GetCommand().GetCommand(),
		stat.GetStartTime().AsTime(),
		stat.GetMessage(),
	})
}
fmt.Println(tab.Render())
```

I can omit List if you want, but I do think it's rather important to the overall UX.


It's your call, it's not required for the challenge but it will be reviewed if it's in there. 😇


docs/rfd.md Outdated
Comment on lines 460 to 461
To authorize a user for a specific api method, the server will match the client's subject name against the rbac rules it loaded from the config file,
then check to see if that subject is named in a role binding for which the associated role allows access to the method the client is calling.


just food for thought (as https://github.com/kralicky/jobserver/pull/1/files#r1425022556 will likely lead to a simplification of this model anyway): is there some other source of trusted information already in play that could be used here to decide which roles any given client is supposed to have, rather than having a dedicated configuration mechanism tied to the server?

kralicky (Owner, Author):

Do you mean encoding role information into certificates?


Yep! 👍 You're delegating authn for the identity to the issuer of the client certs, why not also roles?


```protobuf
message IODeviceLimits {
  // Limits for individual devices
  repeated DeviceLimit devices = 1;
```


Feel free to have a single I/O limit on a hardcoded/configurable device ID here (it would be strange for the client to have that information anyway).

kralicky (Owner, Author):

True - what if the client provides a path like /dev/xyz and the server looks up the ID itself?


I'd just go for a static device ID (for this challenge, at least).

kralicky (Owner, Author):

Would that work, since the IDs are hardware specific?

docs/rfd.md Outdated
Comment on lines 106 to 109
```protobuf
// If 'follow' is set to true, the output will continue to be streamed in
// real-time until the job completes. Otherwise, only the currently accumulated
// output will be written, and the stream will be closed even if the job is
// still running.
```


Reading the challenge text I think that we should clarify that it's only necessary to implement the "follow" mode of getting the output - it will likely not change much in the implementation anyway, so it'd just be added noise.

docs/rfd.md Outdated
Output streaming will be implemented as follows:

1. When a job is started, we will pipe the combined stdout and stderr of its process to a simple in-memory buffer.
2. When a client initiates a stream by calling `Output()`, we will write the current contents of the buffer to the stream in 4MB chunks. Then, if the client requested to follow the output, the server will keep the stream open, and any subsequent reads from the process's output will be duplicated to all clients that are following the stream, in addition to the server's own buffer.


4MiB is the default message size limit but it's far larger than the recommended size for messages in a stream; "oral tradition" puts that at 16-64KiB (grpc/grpc.github.io#371) apparently.

kralicky (Owner, Author):

Message size limits are tricky. The way I understand it is that regardless of grpc message size, the underlying http/2 transport will chunk it into frames (16KiB is the default frame size and as far as I know grpc doesn't change it), but grpc itself adjusts the transport's flow control windows (unless you change the default window size manually)

From my own testing, using larger messages results in significantly higher throughput, ostensibly because reading and writing larger messages triggers grpc to increase the size of the flow control windows, which results in fewer window update frames required per rpc.

This is all undocumented of course, and I've never been able to find a comprehensive guide on this topic, but that's what I have gathered by looking through the grpc and net/http2 sources.


That's pretty neat, although 4MiB still feels uncomfortably large for a buffer that will be handled as an atomic entity for network purposes - and you're also going to be sacrificing latency for other requests in the same h2 connection. It would be interesting (but definitely out of scope for this) to figure out the size at which the returns in throughput start diminishing: I can believe that the fragmentation at 16KiB makes the overhead noticeable, but is that still the case at 64? at 256?

kralicky (Owner, Author):

Yeah, I agree with you there. This topic interests me a lot - so just for fun, some graphs! Each run is 60 seconds of a stream (with 1 client) sending messages as fast as it can, and the server echoing the response back. Would need way more data to get any real conclusions of course, but looks like maybe the best size could lie somewhere between 512KiB and 1 MiB. 1MiB is about where things drop off, the difference between 1 and 4 MiB is pretty small too.

[Figure_1 and Figure_2: throughput and message rate vs. payload size]

| Payload Size (KiB) | Mbps | msgs/s |
|---:|---:|---:|
| 16 | 580.7 | 278731 |
| 24 | 672.7 | 215249 |
| 32 | 844 | 202569 |
| 48 | 901.1 | 144174 |
| 64 | 996.9 | 119625 |
| 96 | 1266.2 | 101294 |
| 128 | 1300.5 | 78030 |
| 192 | 1431.7 | 57267 |
| 256 | 1665.3 | 49959 |
| 384 | 1782.5 | 35650 |
| 512 | 1968.7 | 29530 |
| 768 | 2373 | 23730 |
| 1024 | 2364.1 | 17731 |
| 2048 | 2582.4 | 9684 |
| 4096 | 2686.9 | 5038 |
| 5120 | 2723.3 | 4085 |
| 6144 | 2771.2 | 3464 |
| 7168 | 2685.2 | 2877 |
| 8192 | 2621.9 | 2458 |
| 16384 | 2596.3 | 1217 |
| 24576 | 2556.8 | 799 |
| 32768 | 2781.9 | 652 |
| 49152 | 2796.8 | 437 |
| 65536 | 3029.3 | 355 |
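Whatever chunk size ends up being chosen, the chunking of accumulated output before writing it to the stream is straightforward. A sketch (the helper name and the 64 KiB size are arbitrary assumptions):

```go
package main

import "fmt"

// chunked splits buf into slices of at most size bytes, as a server
// might when writing accumulated job output to a gRPC stream in
// fixed-size messages.
func chunked(buf []byte, size int) [][]byte {
	var out [][]byte
	for len(buf) > 0 {
		n := size
		if len(buf) < n {
			n = len(buf)
		}
		out = append(out, buf[:n])
		buf = buf[n:]
	}
	return out
}

func main() {
	data := make([]byte, 150)
	for _, c := range chunked(data, 64) {
		fmt.Println(len(c)) // 64, 64, 22
	}
}
```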

Comment on lines +74 to +75
```protobuf
// Stops a running job. Once stopped, this method will wait until the job has
// completed before returning.
```


If it makes things easier feel free to have Stop mean "begin stopping the job".

Comment on lines +77 to +79
```protobuf
// This will first attempt to stop the job's process using SIGTERM, but if the
// process does not exit within a short grace period, it will be forcefully
// killed with SIGKILL.
```


Leave this in if you feel like it'd be an interesting addition code-wise, otherwise just kill the process immediately.

kralicky (Owner, Author), Dec 13, 2023:

I had this in here to account for being able to run jobs that run forever until you send SIGINT or SIGTERM (tail -f, journalctl -f, etc) without needing to always kill them. But it might end up being more trouble than it's worth, so I will keep that in mind.

@smallinsky (Collaborator) left a comment:

LGTM, just a reminder to update the RBAC model: https://github.com/kralicky/jobserver/pull/1/files#r1425022556, but I don't think this is a blocker.

@kralicky kralicky merged commit fa1d91a into main Dec 14, 2023
@kralicky kralicky deleted the design-doc branch December 14, 2023 22:14