control: add build API and builder daemon service #408
Conversation
- The API is no longer fallible, and the `Exists::Implicit` case should be matched over instead.
- Fix a number of deprecation warnings.
- Fix linker errors related to Go <=> Rust sqlite usage.
The most interesting bit is the mechanism for dequeing / updating builds on behalf of a forthcoming builder daemon service. It uses transaction-scoped postgres advisory locks to offer a build queue which can be processed in parallel, and which is tolerant to builder failures. Issue #400
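To make that mechanism concrete, here's a hedged sketch of the dequeue step (the table, column names, and sqlx usage are illustrative stand-ins, not the actual schema):

```rust
// Illustrative sketch of a transaction-scoped advisory-lock dequeue.
// pg_try_advisory_xact_lock() returns false for rows another builder has
// already locked, so parallel builders skip past each other's work, and
// the lock is released automatically when the transaction commits, rolls
// back, or the builder's connection dies -- which is what makes the queue
// tolerant to builder failures. `txn` is an open sqlx transaction.
let dequeued: Option<(i64,)> = sqlx::query_as(
    "SELECT id FROM builds
     WHERE state = 'queued' AND pg_try_advisory_xact_lock(id)
     LIMIT 1",
)
.fetch_optional(&mut *txn)
.await?;
```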
The builder daemon periodically polls the database for a build to run. On finding one, it builds it in a local temporary directory using `flowctl api build`, and then tests it using `flowctl temp-data-plane` and `flowctl api test`. This commit conceptualizes each of these activities as "tasks", where log lines of each task are directly streamed into the build database as they occur. The build DB can then be its own source-of-truth for all of its outputs during the build process. We also preserve structured log information from the build process. Looking forward, activations and other background tasks will produce similar types of logs, and we can use the same mechanism to capture / persist / query them. For now, the build daemon is directly integrated into the control plane server. We'll probably want to separate these in the future. Issue #400
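As an illustration of that task/log-streaming shape (a sketch; the flags and the insert are stand-ins, not the actual implementation):

```rust
use tokio::io::{AsyncBufReadExt, BufReader};

// Sketch: run one build "task" as a child process and stream each stdout
// line into the build database as it occurs. Arguments are illustrative.
let mut child = tokio::process::Command::new("flowctl")
    .args(["api", "build"]) // plus build-specific flags
    .stdout(std::process::Stdio::piped())
    .spawn()?;

let stdout = child.stdout.take().expect("stdout was piped");
let mut lines = BufReader::new(stdout).lines();
while let Some(line) = lines.next_line().await? {
    // INSERT (task, source, line) into the build DB's `task_logs` table.
    let _ = line;
}
let _status = child.wait().await?;
```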
Force-pushed from 90b2c7e to 9b1c28e.
```rust
let out: Result<_, anyhow::Error> = futures::try_join!(server, builder_daemon);
out?;
```
nit: can this be

```rust
futures::try_join!(server, builder_daemon);
```
- We want to return a `Result::Err` here rather than discard it.
- It needs the return-value type hint to figure out how to `Into::into` the respective future error types (see the sketch below).
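A minimal sketch of the inference at play (the function and error-type names here are assumed for illustration):

```rust
use futures::TryFutureExt;

// Each `map_err(Into::into)` needs a concrete target error type, and only
// the `Result<_, anyhow::Error>` annotation below supplies it.
let server = run_server().map_err(Into::into); // e.g. hyper::Error -> anyhow::Error
let builder_daemon = serve_builds().map_err(Into::into); // builder error -> anyhow::Error
let out: Result<_, anyhow::Error> = futures::try_join!(server, builder_daemon);
out?;
```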
Also add an index on `builds.account_id`.
- Use new `url_for` pattern.
- `State::Done` => `State::Success`.
- Terminology: "task" => "job" for background jobs of the control plane.
Force-pushed from cdcb502 to fc1706c.
This looks great. 🥇
I have a bunch of questions about naming, but a few things stuck out that we probably want to fix before merging this.
- The crash loop bringing the whole server down seems like something we want to address.
- The unavoidable blocking calls hint that we should probably be running the Builder with `spawn_blocking`.
```rust
std::fs::create_dir(&builds_dir).map_err(|err| Error::CreateDir(err))?;

// Write our catalog source file within the build directory.
std::fs::File::create(&builds_dir.join(&format!("{}.flow.yaml", id)))
```
I believe you'll want to use non-blocking equivalents for creating directories and files. I don't think it's safe to use `std::fs` operations in a Tokio task. I don't think there's any particular reason we want the Builder daemon to run on its own thread/thread_pool right now, so we should be able to swap in the `tokio::fs` equivalents without harm.
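For illustration, the drop-in async equivalents might look like this (a sketch mirroring the snippet above; error mapping for the second call is elided):

```rust
// tokio::fs mirrors std::fs, but awaiting these yields to the executor
// instead of blocking it.
tokio::fs::create_dir(&builds_dir).await.map_err(Error::CreateDir)?;
let file = tokio::fs::File::create(builds_dir.join(format!("{}.flow.yaml", id))).await;
```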
Hmm, upon further reading, we may want to make sure we use `spawn_blocking` when spawning Builder daemons. It doesn't look like `rusqlite` is async safe. I think we might be blocking the executor when issuing calls to the `Connection`.
The other alternative would be to switch to SQLx for manipulating the sqlite databases. I'm not sure what features it can't/won't support, but I think there's some advantage to having a single way to talk to SQL databases in the project.
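For comparison, a hedged sketch of what the SQLx route could look like for the build database (the connection string, table, and bindings are assumptions):

```rust
use sqlx::sqlite::SqlitePoolOptions;

// Async sqlite access via sqlx, so log inserts don't block the executor.
let pool = SqlitePoolOptions::new()
    .connect("sqlite://path/to/build.db") // path is illustrative
    .await?;

sqlx::query("INSERT INTO task_logs (task, source, line) VALUES (?1, ?2, ?3)")
    .bind(task)
    .bind(source)
    .bind(line)
    .execute(&pool)
    .await?;
```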
> I don't think it's safe to use std::fs operations in a Tokio task.

It's not a safety or correctness issue, but it can block the executor.
I dug into this a bit. The tokio adapters are wrappers of the std library which claim (in docs) to merely hint to the tokio runtime that this is a blocking call. Go's runtime does a very similar thing, where it starts the blocking operation on the current event thread (called a P) but associates a timer, and if it takes too long it marks the thread as background and moves that P's tasks to a new system thread. I thought that was what tokio was doing too... but it's not.
Instead it's always spawning a new background thread for the operation, which is quite a bit slower than just using the standard library's blocking versions.
To be honest, I'm not really clear why developers should ever choose tokio::fs given this. I would think users would be better off managing their own background tasks for blocking jobs in almost all cases...
...
In any case this is probably moot because I agree that spawn_blocking is going to be more appropriate for builder daemons. I believe it's not an issue currently because the top-level main() does a synchronous join over the future, so we never send it through tokio::spawn.
> It's not a safety or correctness issue, but it can block the executor.

Yep, sorry for the imprecise language here. That's what I was implying, not a memory safety issue.

> I dug into this a bit. The tokio adapters are wrappers of the std library which claim (in docs) to merely hint to the tokio runtime that this is a blocking call. Go's runtime does a very similar thing, where it starts the blocking operation on the current event thread (called a P) but associates a timer, and if it takes too long it marks the thread as background and moves that P's tasks to a new system thread. I thought that was what tokio was doing too... but it's not.
>
> Instead it's always spawning a new background thread for the operation, which is quite a bit slower than just using the standard library's blocking versions.

Interesting. That's not what I'd expected it to be doing under the hood.

> To be honest, I'm not really clear why developers should ever choose tokio::fs given this. I would think users would be better off managing their own background tasks for blocking jobs in almost all cases...

I agree with you here. I mean, I guess it's better to avoid blocking the executor by spawning yet another thread, but if you care, you probably do want to manage this yourself.

> In any case this is probably moot because I agree that spawn_blocking is going to be more appropriate for builder daemons. I believe it's not an issue currently because the top-level main() does a synchronous join over the future, so we never send it through tokio::spawn.

Yeah, I think we may still want to use `spawn_blocking` and then pass that handle to `try_join`. There is a warning in the docs about making sure you spawn if you want parallelism or if one of the joined futures blocks. I think this would currently manifest as the Builder blocking and preventing Axum from handling new requests.
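Something like this hypothetical shape (where `run_builder` is a stand-in name for the blocking builder loop, assumed to return `anyhow::Result<()>`):

```rust
// Run the blocking builder loop on tokio's blocking thread pool, then
// join it with the server so an error from either side still surfaces.
let builder = tokio::task::spawn_blocking(move || run_builder(ctx));
let out: Result<_, anyhow::Error> = futures::try_join!(
    server,
    async { builder.await? }, // `?` maps a JoinError into anyhow::Error
);
out?;
```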
```rust
// capture_task_logs consumes lines from the AsyncRead and adds each as a log
// entry to a well-known `task_logs` table within the given SQLite database.
// Each entry is identified by its task and source.
// By convention, stdout is source=0 and stderr is source=1.
```
We can probably use the type system to enforce this convention. An enum with explicit discriminant values would do exactly what you want here.
Additional thought: we might line these values up with the standard unix file descriptor values: stdin = 0, stdout = 1, stderr = 2. I don't think there's any cost to this, and there's a certain symmetry to the conceptual mapping you're doing here. It doesn't look like there's any enum/constants defining these fds in the standard library, but this would be trivial.
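Something like this sketch (the enum name is hypothetical):

```rust
// Log sources pinned to the conventional unix file descriptor numbers
// via explicit discriminants.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LogSource {
    Stdin = 0,
    Stdout = 1,
    Stderr = 2,
}

// Persisting stays a cheap cast: `LogSource::Stderr as i32` is 2.
```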
I updated to 1 & 2 and added a TODO but would prefer to defer pinning down meaning with types for the moment.
```rust
use models;
use serde::{Deserialize, Serialize};
use sqlx::{types::Json, FromRow};
use std::fmt;

use crate::models::accounts;
use crate::models::id::Id;
```
This isn't specifically about this file, but I think it's a good example of what I'm talking about below. Nothing needs to change in this PR, I'm just gathering your thoughts.
I'm definitely feeling a naming clash between the `models` crate and the `control::models` module. I think it may be best to rename `control::models` to some other name; it would help with conceptually mapping which `models` module something is referring to.
I'd originally used `models` simply as a pull from the Rails lingo for this. Phoenix uses `schemas` to refer to structs which map to database entities.
What do you think? `control::schemas`? `control::resources`? Other?
I like `control::schemas` 👍. Resources is too easily conflated with resources-as-in-URLs.
```rust
let builder_daemon =
    crate::services::builder::serve_builds(ctx.clone(), shutdown::signal()).map_err(Into::into);
```
We probably need to do better error handling here. Right now, an error encountered by the `builder` will stop the entire server. There are a lot of errors that can bubble up to this point, and having the server exit is not really a great way to handle them. This is doubly bad if something is consistently wrong, as the builder will dequeue the problem Build and fail again, immediately. My local server is currently in a crashed state that I cannot get it out of. 😬
Okay, so I've tracked down what's causing the crash loop. We don't currently assert that the "local" BuildsRoot directory actually exists. We fail with a generic "file/directory does not exist" error when the Builder attempts to call `put_builds.put_build()`. I spent time looking for a bug with the temp directories we're making.
This isn't an issue that you introduced, but it is a footgun for someone trying to run a control plane build for the first time. At the very least, it's an easy reproduction of the crash loop problem I was talking about above.
Discussed over VC, but to recap:
It should never be the case that user input can cause an error here. Returned errors ought to reflect configuration problems, or implementation bugs we need to address.
Before releasing this for real, we also intend to split these into separate API server pod(s) and job runner pod(s); the current mixing in a single process is temporary. Assuming these are deployed as separate concerns, it's fine to let errors bubble up and let Kubernetes restart pods, given these are errors we want to know about and fix.
Force-pushed from 8164802 to c00fa97.
Thanks, PTAL
This looks good to go. I think we want to wrap the Builder job runner in `spawn_blocking()`, but no other nits.
Force-pushed from c00fa97 to c62e0d6.
Thanks. Punting for right now, as empirically it only blocks the executor for the time required to a) create a directory, or b) create, serialize, write, and close the source catalog file, or c) run an INSERT transaction against the build database.
Description:
This PR introduces new `/builds` endpoints:
- `GET /builds` lists all builds which you're permitted to see (currently: owned by your account).
- `GET /builds/:build_id` fetches a specific build, including its catalog definition.
- `POST /builds` takes a JSON `models::Catalog`, and creates a new queued build.

There's also now a builder daemon which will look for queued builds and attempt to build them. The status of a build is updated and retrievable through `/builds` etc., and the build database itself is put to the builds root and contains all related task logs of the build process.
Workflow steps:
Migrate your CP DB.
Here's a test catalog you can post to `/builds`:

```
$ curl -s -H "Authorization: Basic $AUTH" -H "Content-Type: application/json" --data-binary @/home/johnny/test_catalog.flow.json http://127.0.0.1:3000/builds | jq .
```

When a build completes, you'll see it under `/var/tmp/flow-builds`:

```
/var/tmp/flow-builds$ sqlite3 * 'select * from task_logs';
```
Documentation links affected:
None yet.
Notes for reviewers:
Recommend reviewing commit by commit.
There are also some commits for dependency updates and other minor fixups.