
Phil/custom s3 endpoints #901

Merged: 3 commits merged into master on Jan 30, 2023

Conversation

@psFried psFried (Member) commented Jan 30, 2023

Description:

Resolves #892

Adds support for custom storage endpoints in catalog specs. This is not yet a user-facing change, since there isn't yet a user-facing edit capability for storage mappings. But I've tested and verified that custom storage endpoints can be used in storage_mappings.

Workflow steps:

Note that these steps apply to a control plane operator, not an end user.

  1. Ensure that the Gazette brokers have an appropriate AWS profile that specifies both the credentials and a region. The profile must be named exactly the same as the tenant-name portion of the storage mapping prefix.
  2. In the storage_mappings table, you may now add a custom store to the spec. For example: `{"provider":"CUSTOM","bucket":"the-bucket","endpoint":"some.storage.endpoint.com"}`
  3. Create a publication that includes any affected tasks and collections. The easiest way to do this is to run `flowctl catalog pull-specs --prefix <storage-mapping-prefix>` in an empty directory, and then run `flowctl catalog publish --source flow.yaml` (see the sketch below).
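
As a sketch of step 3, assuming a hypothetical storage-mapping prefix of `acmeCo/` (the prefix is a placeholder for illustration):

```
mkdir specs && cd specs
flowctl catalog pull-specs --prefix acmeCo/
flowctl catalog publish --source flow.yaml
```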

Documentation links affected:

We don't yet have this documented, and I'm thinking it's best to hold off on that until we figure out how we want to handle end-user edits to storage mappings.



There's a race condition in the agent publication handler, where the
temp directory that's used for storing build outputs can sometimes
get deleted before the activation of a build, which then errors out
because it's unable to find the build database. This introduces an
explicit call to `std::mem::drop` to ensure that the `TempDir` is not
dropped until after the activation is completed.

Adds a new `CUSTOM` variant of storage mappings, which allows catalogs
to use a variety of S3-compatible storage services by specifying the
`endpoint` explicitly. This is not yet directly exposed to end-users,
since `storageMappings` are handled by the control plane. But it does
give us the ability to use custom storage endpoints by configuring the
storage mappings using something like:

```
{"stores":[{"provider":"CUSTOM","bucket":"the-bucket","endpoint":"some.storage.endpoint.com"}]}
```

Credentials are handled by using the tenant name of each task or
collection as a `profile` in the journal storage URI. This profile
is understood by the brokers and is looked up in `~/.aws/credentials` and
`~/.aws/config` to provide the credentials and region configuration.
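
For example, for a hypothetical tenant named `acmeCo`, the brokers' AWS files would need a matching profile along these lines (a sketch; all values are placeholders):

```
# ~/.aws/credentials -- the profile name matches the tenant name
[acmeCo]
aws_access_key_id = AKIA...
aws_secret_access_key = ...

# ~/.aws/config
[profile acmeCo]
region = us-east-1
```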

In order to prevent any `CUSTOM` storage endpoints from using the
`default` aws config values, an additional validation was added
to ensure that tenant names cannot be `default`.

One thing to point out is that the catalog JSON schema now isn't able to
mark the "provider" field as required, due to schemars lacking
[support for internally tagged enums](GREsau/schemars#39).
I'm thinking this isn't actually a huge deal, since end users don't edit
storage mappings in catalog specs anyway, so I'm inclined to leave it
as-is for right now.

The oauth edge function had been hard-coded to call the production
config encryption service. This fixes that so that local installs
now exclusively use the local config encryption service that's started
by `start-flow.sh`. This unfortunately required a rather dirty
hack to allow the oauth function, which is run by the supabase
cli inside their docker container, to connect to the local config
encryption service. That hack will likely remain until we can change
how the oauth function is deployed.
@jshearer jshearer (Contributor) left a comment

This LGTM! Had a couple of questions, but nothing that should block merging

@@ -328,6 +328,8 @@ impl PublishHandler {
return stop_with_errors(errors, JobStatus::PublishFailed, row, txn).await;
}

// ensure that this tempdir doesn't get dropped before `deploy_build` is called, which depends on the files being there.
std::mem::drop(tmpdir_handle);
jshearer (Contributor):

Is this because process is an async function, so tmpdir_handle would go out of scope at the next .await point after it's declared? Or what was the reason behind this? It looks good, just curious

psFried (Member, Author):

I'm actually not certain exactly why this ever worked in the first place. The TempDir has all contents deleted synchronously when it's dropped, so this should have been failing consistently, since tmpdir should have been dropped as soon as the binding was overwritten (literally, the next line). I'm also not 100% clear on when the first tmpdir should be dropped, especially given the generated state machine for async functions. But I can confirm that this change did fix the no such file errors I was getting during publish operations locally 🤷‍♂️
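
For illustration, a minimal sketch of the drop ordering at play (`run_build` and `deploy_build` are hypothetical stand-ins, not the actual handler code):

```
// tempfile::TempDir deletes its directory synchronously when dropped, so the
// handle must outlive every step that reads files from that directory.
use std::path::Path;
use tempfile::TempDir;

async fn run_build(_dir: &Path) { /* writes the build database */ }
async fn deploy_build(_dir: &Path) { /* reads the build database */ }

async fn publish() -> std::io::Result<()> {
    let tmpdir_handle = TempDir::new()?;
    run_build(tmpdir_handle.path()).await;
    deploy_build(tmpdir_handle.path()).await;
    // Explicitly drop only now; rebinding the variable or letting it fall out
    // of scope before `deploy_build` would delete the directory too early.
    std::mem::drop(tmpdir_handle);
    Ok(())
}
```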

#[validate]
#[serde(default)]
pub prefix: Option<Prefix>,
#[serde(tag = "provider", rename_all = "SCREAMING_SNAKE_CASE")]
jshearer (Contributor):

📢🐍!

.split_once('/')
.expect("invalid catalog_name passed to Store::to_url")
.0;
url.query_pairs_mut()
jshearer (Contributor):

So if you have a custom store, we'll pass a url like s3://my-bucket/my-prefix?profile=my-tenant&endpoint=https://minio.mydomain.com?

psFried (Member, Author):

Yep, that's right. And Gazette already knows how to deal with those URL query parameters.
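
Putting the diff fragments together, a hedged sketch of how such a URL could be assembled with the `url` crate (the function shape is assumed and simplified to the s3 scheme; the real `Store::to_url` differs, but the `profile`/`endpoint` query parameters match this discussion):

```
fn to_url(bucket: &str, prefix: &str, endpoint: &str, catalog_name: &str) -> url::Url {
    let mut url = url::Url::parse(&format!("s3://{}/{}", bucket, prefix))
        .expect("parsing as URL should never fail");
    // The tenant is the first path segment of the (already validated) catalog name.
    let tenant = catalog_name
        .split_once('/')
        .expect("invalid catalog_name passed to to_url")
        .0;
    url.query_pairs_mut()
        .append_pair("profile", tenant)
        .append_pair("endpoint", endpoint);
    url
}
```

Note that `append_pair` percent-encodes values, so in the serialized URL the endpoint appears as `endpoint=https%3A%2F%2Fminio.mydomain.com`.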

let mut url = url::Url::parse(&format!("{}://{}/{}", scheme, bucket, prefix))
.expect("parsing as URL should never fail");
if let Store::Custom(cfg) = self {
let tenant = catalog_name
jshearer (Contributor):

Is assuming that the first segment of a catalog name will always be the tenant name risky/an assumption we're comfortable encoding here? For example, I believe right now we have a bunch of resources named trial/..., and we don't have a tenant named trial. Maybe we should just create that tenant?

psFried (Member, Author):

It would be risky to assume that there's a row in the tenants table, but we don't actually do that here. We only assume that the name is a valid catalog name, and thus contains at least one slash. And that is explicitly validated prior to reaching this point.

use superslice::Ext;

pub fn walk_all_storage_mappings(
storage_mappings: &[tables::StorageMapping],
errors: &mut tables::Errors,
) {
for m in storage_mappings {
for store in m.stores.iter() {
// TODO: it seems like we should also be calling `walk_name` for the bucket and prefix, right?
jshearer (Contributor):

Is this still TODO? Seems like it might be relevant to validation/security.

psFried (Member, Author):

I left this as a TODO because we already aren't performing this validation, and I'm not entirely certain that our regex validation even matches what's accepted by cloud providers. TBH I really can't think of any security concerns related to this, but please LMK if you can. I think we should introduce that validation, but I'd like to decouple it from this PR if possible, so I can check that all our existing values would even pass such validation.

# This container exists to do nothing other than to attach to the supabase docker network and expose port 8765, which
# is what config-encryption listens on. The pause container exists for just these kinds of shenanigans.
# Per: https://stackoverflow.com/a/44739847 the `docker start` will return 0 if the container is already running
docker start config_encryption_hack_proxy || \
jshearer (Contributor):

I heard you talk about this but didn't really think too much about it... currently the oauth edge function talks to prod config-encryption, right? Is that bad because we'll have secrets encrypted with prod credentials... and then need those credentials locally to decrypt, or something like that?

psFried (Member, Author):

yes, that's right

-- the credentials. But the `default` AWS profile is special, and is configured with Flow's own credentials, so if a malicious
-- user created a `default` tenant with a custom storage endpoint, then we could end up sending our credentials to that endpoint.
-- This prevents a user from being able to create such a tenant.
insert into internal.illegal_tenant_names (name) values ('default') on conflict do nothing;
@jshearer jshearer (Contributor) commented Jan 30, 2023:

Should we be reading from this table up in walk_all_storage_mappings()? I'm not seeing illegal_tenant_names used anywhere in this PR.

Edit: Oh, this table already exists, so we're probably already checking it somewhere
Edit 2: Oh, this is illegal tenant names, not illegal bucket names/prefixes. Derp

psFried (Member, Author):

yeah, `illegal_tenant_names` is already checked as part of new tenant creation.

@psFried psFried merged commit 7bb5e98 into master Jan 30, 2023
@oliviamiannone oliviamiannone added the "docs pending" label (Improvements or additions to documentation noted or in progress) on Feb 6, 2023