
control: add build API and builder daemon service #408

Merged: jgraettinger merged 7 commits into master from johnny/build-api on Mar 11, 2022

Conversation

jgraettinger (Member) commented Mar 10, 2022:

Description:

This PR introduces new /builds endpoints:

  • GET /builds lists all builds you're permitted to see (currently: those owned by your account).
  • GET /builds/:build_id fetches a specific build, including its catalog definition.
  • POST /builds takes a JSON models::Catalog and creates a new queued build.

There's also a new builder daemon which looks for queued builds and attempts to build them.
A build's status is updated as it progresses and is retrievable through /builds,
and the finished build database is written to the builds root, containing all task logs of the build process.

Workflow steps:

Migrate your CP DB.

Here's a test catalog you can post to `/builds`.

```json
{
	"$schema": "https://raw.githubusercontent.com/estuary/flow/master/flow.schema.json",
	"collections": {
		"acmeCo/collection": {
			"key": [
				"/key"
			],
			"schema": {
				"type": "object",
				"required": [
					"key"
				],
				"properties": {
					"key": {
						"type": "string"
					}
				}
			}
		}
	},
	"storageMappings": {
		"": {
			"stores": [
				{
					"bucket": "my-bucket",
					"provider": "S3"
				}
			]
		}
	},
	"tests": {
		"acmeCo/conversions/test": [
			{
				"ingest": {
					"collection": "acmeCo/collection",
					"description": "Description of the ingestion.",
					"documents": [
						{
							"key": "one"
						},
						{
							"key": "two"
						}
					]
				}
			},
			{
				"verify": {
					"collection": "acmeCo/collection",
					"description": "Description of the verification.",
					"documents": [
						{
							"key": "one"
						},
						{
							"key": "two"
						}
					]
				}
			}
		]
	}
}
```

```console
$ curl -s -H "Authorization: Basic $AUTH" -H "Content-Type: application/json" --data-binary @/home/johnny/test_catalog.flow.json http://127.0.0.1:3000/builds | jq .
```

When a build completes, you'll see it under /var/tmp/flow-builds.

```console
/var/tmp/flow-builds$ sqlite3 * 'select * from task_logs';
```

Documentation links affected:

None yet.

Notes for reviewers:

Recommend reviewing commit by commit.
There are also some commits for dependency updates and other minor fixups.



The API is no longer fallible, and the Exists::Implicit case should be
matched over instead.
* Fix a number of deprecation warnings.
* Fix linker errors related to Go <=> Rust sqlite usage.
The most interesting bit is the mechanism for dequeuing / updating builds
on behalf of a forthcoming builder daemon service.

It uses transaction-scoped postgres advisory locks to offer a build
queue which can be processed in parallel, and which is tolerant of
builder failures.

Issue #400
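For context, here's a minimal sketch of that dequeue pattern, assuming a hypothetical builds table with id and state columns; the schema, state values, and SQLx usage are illustrative, not the PR's actual code.

```rust
// Claim and process one queued build. The advisory lock is transaction-
// scoped: it's released automatically on commit OR rollback, so a builder
// that crashes mid-build frees its build for another worker to claim.
async fn dequeue_one(pool: &sqlx::PgPool) -> anyhow::Result<()> {
    let mut txn = pool.begin().await?;

    // Skip builds whose locks are already held by other builders, which is
    // what lets the queue be processed in parallel.
    let row: Option<(i64,)> = sqlx::query_as(
        "SELECT id FROM builds
         WHERE state = 'queued' AND pg_try_advisory_xact_lock(id)
         ORDER BY id LIMIT 1",
    )
    .fetch_optional(&mut txn)
    .await?;

    if let Some((id,)) = row {
        // ... run the build while holding the lock, then record the outcome.
        sqlx::query("UPDATE builds SET state = 'success' WHERE id = $1")
            .bind(id)
            .execute(&mut txn)
            .await?;
    }
    // Committing releases the advisory lock along with the status update.
    txn.commit().await?;
    Ok(())
}
```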
The builder daemon periodically polls the database for a build to run.
On finding one, it builds it in a local temporary directory using
`flowctl api build`, and then tests it using `flowctl temp-data-plane`
and `flowctl api test`.

This commit conceptualizes each of these activities as "tasks", where
log lines of each task are streamed directly into the build database as
they occur. The build DB can then be its own source of truth for all of
its outputs during the build process. We also preserve structured log
information from the build.

Looking forward, activations and other background tasks will produce
similar types of logs, and we can use the same mechanism to capture /
persist / query them.

For now, the build daemon is directly integrated into the control plane
server. We'll probably want to separate these in the future.

Issue #400
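To make that concrete, a minimal sketch of the capture loop; the table shape, function signature, and source numbering (the fd-style 1 = stdout / 2 = stderr convention settled on later in review) are illustrative, not the PR's exact code.

```rust
use tokio::io::{AsyncBufReadExt, AsyncRead, BufReader};

// Stream each line of a task's output into the build database as it occurs,
// so the build DB is the source of truth for the build's own outputs.
// Note that the rusqlite call is synchronous, which is exactly the
// executor-blocking concern raised in the review below.
async fn capture_task_logs<R: AsyncRead + Unpin>(
    db: &rusqlite::Connection,
    task: &str,
    source: i64, // fd-style convention: 1 = stdout, 2 = stderr
    reader: R,
) -> anyhow::Result<()> {
    let mut lines = BufReader::new(reader).lines();
    while let Some(line) = lines.next_line().await? {
        db.execute(
            "INSERT INTO task_logs (task, source, line) VALUES (?1, ?2, ?3)",
            rusqlite::params![task, source, line],
        )?;
    }
    Ok(())
}
```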
Comment on lines +45 to +46:
```rust
let out: Result<_, anyhow::Error> = futures::try_join!(server, builder_daemon);
out?;
```
Member:

nit: can this be

Suggested change:
```diff
-let out: Result<_, anyhow::Error> = futures::try_join!(server, builder_daemon);
-out?;
+futures::try_join!(server, builder_daemon);
```

jgraettinger (Member Author):

  1. We want to return a Result::Error
  2. It needs the return value type hint to figure out how to Into::into the respective future error types.
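An annotated version of the snippet, spelling out the inference (the comments are editorial):

```rust
// Both futures end in `.map_err(Into::into)`, which leaves their error types
// generic. The explicit `Result<_, anyhow::Error>` annotation pins the target
// type, telling the compiler which `Into` conversions to apply; without it,
// inference has nothing to anchor on.
let out: Result<_, anyhow::Error> = futures::try_join!(server, builder_daemon);
// Propagate the first error from either future, if any.
out?;
```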

Also add an index on builds.account_id.
* Use new url_for pattern
* State::Done => State::Success
* Terminology: "task" => "job" for background jobs of the control plane.
saterus (Contributor) left a comment:

This looks great. 🥇

I have a bunch of questions about naming, but a few things stuck out that we probably want to fix before merging this.

  1. The crash loop bringing the whole server down seems like something we want to address.
  2. The unavoidable blocking calls hint that we should probably be running the Builder with spawn_blocking.

(Resolved review threads on: Makefile, crates/control/src/controllers/builds.rs, crates/control/src/controllers/builds/routes.rs, and crates/control/src/models/builds.rs.)
Comment on lines +106 to +109:
```rust
std::fs::create_dir(&builds_dir).map_err(|err| Error::CreateDir(err))?;

// Write our catalog source file within the build directory.
std::fs::File::create(&builds_dir.join(&format!("{}.flow.yaml", id)))
```
saterus (Contributor):

I believe you'll want to use non-blocking equivalents for creating directories and files. I don't think it's safe to use std::fs operations in a Tokio task. I don't think there's any particular reason we want the Builder daemon to run on its own thread/thread pool right now, so I think we should be able to swap these in without harm.

saterus (Contributor):

Hmm, upon further reading, we may want to make sure we use spawn_blocking when spawning Builder daemons. It doesn't look like rusqlite is async safe. I think we might be blocking the executor when issuing calls to the Connection.

The other alternative would be to switch to SQLx for manipulating the sqlite databases. I'm not sure what features it can't/won't support, but I think there's some advantage to having a single way to talk to SQL databases in the project.

jgraettinger (Member Author), Mar 11, 2022:

> I don't think it's safe to use std::fs operations in a Tokio task.

It's not a safety or correctness issue, but it can block the executor.

I dug into this a bit. The tokio adapters are wrappers over the std library which claim (in the docs) to merely hint to the tokio runtime that this is a blocking call. Go's runtime does something very similar: it starts the blocking operation on the current event thread (called a P) but associates a timer, and if the operation takes too long it marks the thread as background and moves that P's tasks to a new system thread. I thought that was what tokio was doing too... but it's not.

Instead it's always spawning a new background thread for the operation, which is quite a bit slower than just using the standard library blocking versions.

To be honest, I'm not really clear on why developers should ever choose to use tokio::fs given this. I would think users would be better off managing their own background tasks for blocking jobs in almost all cases...

...

In any case this is probably moot because I agree that spawn_blocking is going to be more appropriate for builder daemons. I believe it's not an issue currently because the top-level main() does a synchronous join over the future, so we never send it through tokio::spawn.

saterus (Contributor):

> It's not a safety or correctness issue, but it can block the executor.

Yep, sorry for the imprecise language here. That's what I was implying, not a memory safety issue.

> I dug into this a bit. The tokio adapters are wrappers over the std library which claim (in the docs) to merely hint to the tokio runtime that this is a blocking call. Go's runtime does something very similar: it starts the blocking operation on the current event thread (called a P) but associates a timer, and if the operation takes too long it marks the thread as background and moves that P's tasks to a new system thread. I thought that was what tokio was doing too... but it's not.
>
> Instead it's always spawning a new background thread for the operation, which is quite a bit slower than just using the standard library blocking versions.

Interesting. That's not what I'd expected it to be doing under the hood.

> To be honest, I'm not really clear on why developers should ever choose to use tokio::fs given this. I would think users would be better off managing their own background tasks for blocking jobs in almost all cases...

I agree with you here. I mean, I guess it's better to avoid blocking the executor by spawning yet another thread, but if you care, you probably do want to manage this yourself.

> In any case this is probably moot because I agree that spawn_blocking is going to be more appropriate for builder daemons. I believe it's not an issue currently because the top-level main() does a synchronous join over the future, so we never send it through tokio::spawn.

Yeah, I think we may still want to use spawn_blocking and then pass that handle to try_join. There is a warning in the docs about making sure you spawn if you want parallelism or if one of the joined futures blocks. I think this would currently manifest itself as the Builder blocking and preventing Axum from handling new requests.
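A sketch of that suggestion, assuming hypothetical run_server (async) and run_builder (blocking) functions; this is not the PR's actual wiring.

```rust
async fn serve_all() -> anyhow::Result<()> {
    // The API server stays on the async executor...
    let server = tokio::spawn(run_server());
    // ...while the builder moves to the blocking thread pool, so its
    // synchronous std::fs and rusqlite calls can't stall async tasks.
    let builder = tokio::task::spawn_blocking(run_builder);

    // try_join! over the JoinHandles surfaces a panic or cancellation first,
    // then each task's own Result is checked.
    let (server_out, builder_out) = futures::try_join!(server, builder)?;
    server_out?;
    builder_out?;
    Ok(())
}

async fn run_server() -> anyhow::Result<()> {
    Ok(()) // stand-in for the Axum server
}

fn run_builder() -> anyhow::Result<()> {
    Ok(()) // stand-in for the builder's blocking poll loop
}
```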

(Resolved review thread on crates/control/src/services/builder/mod.rs.)
```rust
// capture_task_logs consumes lines from the AsyncRead and adds each as a log
// entry to a well-known `task_logs` table within the given SQLite database.
// Each entry is identified by its task and source.
// By convention, stdout is source=0 and stderr is source=1.
```
saterus (Contributor):

We can probably use the type system to enforce this convention. An enum with explicit discriminant values would do exactly what you want here.

Additional thought: we might line these values up with the standard unix file descriptor values for stdin = 0, stdout = 1, stderr = 2. I don't think there's any cost to this and there's a certain symmetry to the conceptual mapping you're doing here. It doesn't look like there's any enum/constants defining these fds in the standard library, but this would be trivial.

jgraettinger (Member Author):

I updated to 1 & 2 and added a TODO but would prefer to defer pinning down meaning with types for the moment.
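For illustration, the suggested enum might look like this (hypothetical, since the PR defers it with a TODO):

```rust
// Explicit discriminants lined up with the unix file descriptor numbers.
#[derive(Copy, Clone, Debug, PartialEq, Eq)]
#[repr(i64)]
pub enum LogSource {
    Stdin = 0,
    Stdout = 1,
    Stderr = 2,
}

fn main() {
    // The discriminant is the value stored in the task_logs `source` column.
    assert_eq!(LogSource::Stderr as i64, 2);
}
```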

Comment on lines +2 to +8:
```rust
use models;
use serde::{Deserialize, Serialize};
use sqlx::{types::Json, FromRow};
use std::fmt;

use crate::models::accounts;
use crate::models::id::Id;
```
saterus (Contributor):

This isn't specifically about this file, but I think it's a good example of what I'm talking about below. Nothing needs to change in this PR, I'm just gathering your thoughts.

I'm definitely feeling a naming clash between the models crate and the control::models module. I think it may be best to rename control::models to some other name; this would help when trying to work out which models module something refers to.

I'd originally used models simply by pulling from the Rails lingo for this. Phoenix uses schemas to refer to structs which map to database entities.

What do you think? control::schemas? control::resources? Other?

jgraettinger (Member Author):

I like control::schemas 👍. Resources is too easily conflated with resources-as-in-URLs.

Comment on lines +37 to +38:
```rust
let builder_daemon =
    crate::services::builder::serve_builds(ctx.clone(), shutdown::signal()).map_err(Into::into);
```
saterus (Contributor):

We probably need to do better error handling here. Right now, an error encountered by the builder will stop the entire server. There are a lot of errors that can bubble up to this point and having the server exit is not really a great way to handle those. This is doubly bad if something is consistently wrong, as the builder will dequeue the problem Build and fail again, immediately. My local server is currently in a crashed state that I cannot get it out of. 😬

saterus (Contributor):

Okay, so I've tracked down what's causing the crash loop. We don't currently assert that the "local" BuildsRoot directory actually exists. We fail with a generic "file/directory does not exist" error when the Builder attempts to call put_builds.put_build(). I spent time looking for a bug with the temp directories we're making.

This isn't an issue that you introduced, but it is a footgun for someone trying to run a control plane build for the first time. At the very least, it's an easy reproduction of the crash loop problem I was talking about above.
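A hypothetical startup assertion along these lines would surface the misconfiguration immediately, rather than as a build-time crash loop (the function and message are illustrative):

```rust
fn check_builds_root(root: &std::path::Path) -> anyhow::Result<()> {
    // Fail fast at startup if the local BuildsRoot directory is missing,
    // instead of failing inside put_build() on the first dequeued build.
    anyhow::ensure!(
        root.is_dir(),
        "builds root {} does not exist or is not a directory; create it before starting the server",
        root.display(),
    );
    Ok(())
}
```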

jgraettinger (Member Author):

Discussed over VC, but to recap:

It should never be the case that user input can cause an error here. Returned errors ought to reflect configuration problems, or implementation bugs we need to address.

Before releasing this for real, we also intend to split these into separate API server pod(s) and job runner pod(s); the current mixing in a single process is temporary. Assuming they're deployed as separate concerns, it's fine to let errors bubble up and let Kubernetes restart pods, given these are errors we want to know about and fix.

jgraettinger (Member Author) left a comment:

Thanks, PTAL


saterus (Contributor) left a comment:

This looks good to go. I think we want to wrap the Builder job runner in spawn_blocking(), but no other nits.

jgraettinger (Member Author) commented Mar 11, 2022:

> This looks good to go. I think we want to wrap the Builder job runner in spawn_blocking(), but no other nits.

Thanks. Punting for right now, as empirically it only blocks the executor for the time required to (a) create a directory, (b) create, serialize, write, and close the source catalog file, or (c) run an INSERT transaction against the job_logs table.

jgraettinger merged commit 5093c3b into master on Mar 11, 2022.
jgraettinger deleted the johnny/build-api branch on March 11, 2022, 21:13.
oliviamiannone added the docs complete / NA label ("No (more) doc work related to this PR") on Mar 14, 2022.