control: add build API and builder daemon service #408
Conversation
- The API is no longer fallible, and the `Exists::Implicit` case should be matched over instead.
- Fix a number of deprecation warnings.
- Fix linker errors related to Go <=> Rust sqlite usage.
The most interesting bit is the mechanism for dequeing / updating builds on behalf of a forthcoming builder daemon service. It uses transaction-scoped postgres advisory locks to offer a build queue which can be processed in parallel, and which is tolerant to builder failures. Issue #400
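To make that mechanism concrete, here's a hedged sketch of the dequeue step (the table, column names, and sqlx usage are illustrative stand-ins, not the actual schema):

```rust
// Illustrative sketch of a transaction-scoped advisory-lock dequeue.
// pg_try_advisory_xact_lock() returns false for rows another builder has
// already locked, so parallel builders skip past each other's work, and
// the lock is released automatically when the transaction commits, rolls
// back, or the builder's connection dies -- which is what makes the queue
// tolerant to builder failures. `txn` is an open sqlx transaction.
let dequeued: Option<(i64,)> = sqlx::query_as(
    "SELECT id FROM builds
     WHERE state = 'queued' AND pg_try_advisory_xact_lock(id)
     LIMIT 1",
)
.fetch_optional(&mut *txn)
.await?;
```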
The builder daemon periodically polls the database for a build to run. On finding one, it builds it in a local temporary directory using `flowctl api build`, and then tests it using `flowctl temp-data-plane` and `flowctl api test`. This commit conceptualizes each of these activities as "tasks", where log lines of each task are directly streamed into the build database as they occur. The build DB can then be its own source-of-truth for all of its outputs during the build process. We also preserve structured log information from the build process. Looking forward, activations and other background tasks will produce similar types of logs, and we can use the same mechanism to capture / persist / query them. For now, the build daemon is directly integrated into the control plane server. We'll probably want to separate these in the future. Issue #400
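As an illustration of that task/log-streaming shape (a sketch; the flags and the insert are stand-ins, not the actual implementation):

```rust
use tokio::io::{AsyncBufReadExt, BufReader};

// Sketch: run one build "task" as a child process and stream each stdout
// line into the build database as it occurs. Arguments are illustrative.
let mut child = tokio::process::Command::new("flowctl")
    .args(["api", "build"]) // plus build-specific flags
    .stdout(std::process::Stdio::piped())
    .spawn()?;

let stdout = child.stdout.take().expect("stdout was piped");
let mut lines = BufReader::new(stdout).lines();
while let Some(line) = lines.next_line().await? {
    // INSERT (task, source, line) into the build DB's `task_logs` table.
    let _ = line;
}
let _status = child.wait().await?;
```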
Force-pushed from 90b2c7e to 9b1c28e.
```rust
let out: Result<_, anyhow::Error> = futures::try_join!(server, builder_daemon);
out?;
```
nit: can this be

```rust
futures::try_join!(server, builder_daemon);
```
- We want to return a `Result::Err` here rather than discard it.
- It needs the return-value type hint to figure out how to `Into::into` the respective future error types (see the sketch below).
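A minimal sketch of the inference at play (the function and error-type names here are assumed for illustration):

```rust
use futures::TryFutureExt;

// Each `map_err(Into::into)` needs a concrete target error type, and only
// the `Result<_, anyhow::Error>` annotation below supplies it.
let server = run_server().map_err(Into::into); // e.g. hyper::Error -> anyhow::Error
let builder_daemon = serve_builds().map_err(Into::into); // builder error -> anyhow::Error
let out: Result<_, anyhow::Error> = futures::try_join!(server, builder_daemon);
out?;
```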
Also add an index on `builds.account_id`.
- Use new `url_for` pattern.
- `State::Done` => `State::Success`.
- Terminology: "task" => "job" for background jobs of the control plane.
Force-pushed from cdcb502 to fc1706c.
This looks great. 🥇
I have a bunch of questions about naming, but a few things stuck out that we probably want to fix before merging this.
- The crash loop bringing the whole server down seems like something we want to address.
- The unavoidable blocking calls hint that we should probably be running the Builder with `spawn_blocking`.
```rust
std::fs::create_dir(&builds_dir).map_err(|err| Error::CreateDir(err))?;

// Write our catalog source file within the build directory.
std::fs::File::create(&builds_dir.join(&format!("{}.flow.yaml", id)))
```
I believe you'll want to use non-blocking equivalents for creating directories and files. I don't think it's safe to use `std::fs` operations in a Tokio task. I don't think there's any particular reason we want the Builder daemon to run on its own thread/thread_pool right now, so we should be able to swap in the `tokio::fs` equivalents without harm.
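For illustration, the drop-in async equivalents might look like this (a sketch mirroring the snippet above; error mapping for the second call is elided):

```rust
// tokio::fs mirrors std::fs, but awaiting these yields to the executor
// instead of blocking it.
tokio::fs::create_dir(&builds_dir).await.map_err(Error::CreateDir)?;
let file = tokio::fs::File::create(builds_dir.join(format!("{}.flow.yaml", id))).await;
```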
Hmm, upon further reading, we may want to make sure we use `spawn_blocking` when spawning Builder daemons. It doesn't look like `rusqlite` is async safe. I think we might be blocking the executor when issuing calls to the `Connection`.
The other alternative would be to switch to SQLx for manipulating the sqlite databases. I'm not sure what features it can't/won't support, but I think there's some advantage to having a single way to talk to SQL databases in the project.
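For comparison, a hedged sketch of what the SQLx route could look like for the build database (the connection string, table, and bindings are assumptions):

```rust
use sqlx::sqlite::SqlitePoolOptions;

// Async sqlite access via sqlx, so log inserts don't block the executor.
let pool = SqlitePoolOptions::new()
    .connect("sqlite://path/to/build.db") // path is illustrative
    .await?;

sqlx::query("INSERT INTO task_logs (task, source, line) VALUES (?1, ?2, ?3)")
    .bind(task)
    .bind(source)
    .bind(line)
    .execute(&pool)
    .await?;
```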
> I don't think it's safe to use std::fs operations in a Tokio task.

It's not a safety or correctness issue, but it can block the executor.
I dug into this a bit. The tokio adapters are wrappers of the std library which claim (in docs) to merely hint to the tokio runtime that this is a blocking call. Go's runtime does a very similar thing, where it starts the blocking operation on the current event thread (called a P) but associates a timer, and if it takes too long it marks the thread as background and moves that P's tasks to a new system thread. I thought that was what tokio was doing too... but it's not.
Instead it's always spawning a new background thread for the operation, which is quite a bit slower than just using the standard library's blocking versions.
To be honest, I'm not really clear why developers should ever choose tokio::fs given this. I would think users would be better off managing their own background tasks for blocking jobs in almost all cases...
...
In any case this is probably moot because I agree that spawn_blocking is going to be more appropriate for builder daemons. I believe it's not an issue currently because the top-level main() does a synchronous join over the future, so we never send it through tokio::spawn.
> It's not a safety or correctness issue, but it can block the executor.

Yep, sorry for the imprecise language here. That's what I was implying, not a memory safety issue.

> I dug into this a bit. The tokio adapters are wrappers of the std library which claim (in docs) to merely hint to the tokio runtime that this is a blocking call. Go's runtime does a very similar thing, where it starts the blocking operation on the current event thread (called a P) but associates a timer, and if it takes too long it marks the thread as background and moves that P's tasks to a new system thread. I thought that was what tokio was doing too... but it's not.
>
> Instead it's always spawning a new background thread for the operation, which is quite a bit slower than just using the standard library's blocking versions.

Interesting. That's not what I'd expected it to be doing under the hood.

> To be honest, I'm not really clear why developers should ever choose tokio::fs given this. I would think users would be better off managing their own background tasks for blocking jobs in almost all cases...

I agree with you here. I mean, I guess it's better to avoid blocking the executor by spawning yet another thread, but if you care, you probably do want to manage this yourself.

> In any case this is probably moot because I agree that spawn_blocking is going to be more appropriate for builder daemons. I believe it's not an issue currently because the top-level main() does a synchronous join over the future, so we never send it through tokio::spawn.

Yeah, I think we may still want to use `spawn_blocking` and then pass that handle to `try_join`. There is a warning in the docs about making sure you spawn if you want parallelism or if one of the joined futures blocks. I think this would currently manifest as the Builder blocking and preventing Axum from handling new requests.
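Something like this hypothetical shape (where `run_builder` is a stand-in name for the blocking builder loop, assumed to return `anyhow::Result<()>`):

```rust
// Run the blocking builder loop on tokio's blocking thread pool, then
// join it with the server so an error from either side still surfaces.
let builder = tokio::task::spawn_blocking(move || run_builder(ctx));
let out: Result<_, anyhow::Error> = futures::try_join!(
    server,
    async { builder.await? }, // `?` maps a JoinError into anyhow::Error
);
out?;
```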
```rust
// capture_task_logs consumes lines from the AsyncRead and adds each as a log
// entry to a well-known `task_logs` table within the given SQLite database.
// Each entry is identified by its task and source.
// By convention, stdout is source=0 and stderr is source=1.
```
We can probably use the type system to enforce this convention. An enum with explicit discriminant values would do exactly what you want here.
Additional thought: we might line these values up with the standard unix file descriptor values: stdin = 0, stdout = 1, stderr = 2. I don't think there's any cost to this, and there's a certain symmetry to the conceptual mapping you're doing here. It doesn't look like there's any enum/constants defining these fds in the standard library, but this would be trivial.
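Something like this sketch (the enum name is hypothetical):

```rust
// Log sources pinned to the conventional unix file descriptor numbers
// via explicit discriminants.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LogSource {
    Stdin = 0,
    Stdout = 1,
    Stderr = 2,
}

// Persisting stays a cheap cast: `LogSource::Stderr as i32` is 2.
```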
I updated to 1 & 2 and added a TODO but would prefer to defer pinning down meaning with types for the moment.
```rust
use models;
use serde::{Deserialize, Serialize};
use sqlx::{types::Json, FromRow};
use std::fmt;

use crate::models::accounts;
use crate::models::id::Id;
```
This isn't specifically about this file, but I think it's a good example of what I'm talking about below. Nothing needs to change in this PR, I'm just gathering your thoughts.
I'm definitely feeling a naming clash between the `models` crate and the `control::models` module. I think it may be best to rename `control::models` to some other name; it would help with conceptually mapping which `models` module something is referring to.
I'd originally used `models` simply as a pull from the Rails lingo for this. Phoenix uses `schemas` to refer to structs which map to database entities.
What do you think? `control::schemas`? `control::resources`? Other?
I like `control::schemas` 👍. Resources is too easily conflated with resources-as-in-URLs.
```rust
let builder_daemon =
    crate::services::builder::serve_builds(ctx.clone(), shutdown::signal()).map_err(Into::into);
```
We probably need to do better error handling here. Right now, an error encountered by the `builder` will stop the entire server. There are a lot of errors that can bubble up to this point, and having the server exit is not really a great way to handle them. This is doubly bad if something is consistently wrong, as the builder will dequeue the problem Build and fail again, immediately. My local server is currently in a crashed state that I cannot get it out of. 😬
Okay, so I've tracked down what's causing the crash loop. We don't currently assert that the "local" BuildsRoot directory actually exists. We fail with a generic "file/directory does not exist" error when the Builder attempts to call `put_builds.put_build()`. I spent time looking for a bug with the temp directories we're making.
This isn't an issue that you introduced, but it is a footgun for someone trying to run a control plane build for the first time. At the very least, it's an easy reproduction of the crash loop problem I was talking about above.
Discussed over VC, but to recap:
It should never be the case that user input can cause an error here. Returned errors ought to reflect configuration problems, or implementation bugs we need to address.
Before releasing this for real, we also intend to split these into separate API server pod(s) and job runner pod(s); the current mixing in a single process is temporary. Assuming these are deployed as separate concerns, it's fine to let errors bubble up and let Kubernetes restart pods, given these are errors we want to know about and fix.
Force-pushed from 8164802 to c00fa97.
Thanks, PTAL
This looks good to go. I think we want to wrap the Builder job runner in `spawn_blocking()`, but no other nits.
Force-pushed from c00fa97 to c62e0d6.
Thanks. Punting for right now, as empirically it only blocks the executor for the time required to a) create a directory, or b) create, serialize, write, and close the source catalog file, or c) run an INSERT transaction against the build database.
Description:
This PR introduces new `/builds` endpoints:
- `GET /builds` lists all builds which you're permitted to see (currently: owned by your account).
- `GET /builds/:build_id` fetches a specific build, including its catalog definition.
- `POST /builds` takes a JSON `models::Catalog`, and creates a new queued build.

There's also now a builder daemon which will look for queued builds and attempt to build them. The status of a build is updated and retrievable through `/builds` etc., and the build database itself is put to the builds root and contains all related task logs of the build process.
Workflow steps:
Migrate your CP DB.
Here's a test catalog you can post to `/builds`:

```
$ curl -s -H "Authorization: Basic $AUTH" -H "Content-Type: application/json" --data-binary @/home/johnny/test_catalog.flow.json http://127.0.0.1:3000/builds | jq .
```

When a build completes, you'll see it under `/var/tmp/flow-builds`:

```
/var/tmp/flow-builds$ sqlite3 * 'select * from task_logs';
```
Documentation links affected:
None yet.
Notes for reviewers:
Recommend reviewing commit by commit.
There are also some commits for dependency updates and other minor fixups.