-
Notifications
You must be signed in to change notification settings - Fork 173
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Vault CSI support v2 #828
Vault CSI support v2 #828
Conversation
* Ensure VaultCSI secrets are sourcing from a directory * Utilize the parser mechanisms * Update error messages
6742582
to
a9015bf
Compare
baseplate/lib/secrets.py
Outdated
if options.provider == "vault_csi": | ||
parser = parse_vault_csi | ||
return DirectorySecretsStore(options.path, parser, timeout=timeout, backoff=backoff) | ||
return VaultCSISecretsStore( | ||
options.path, parser=parse_vault_csi, timeout=timeout, backoff=backoff | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@chriskuehl We are a little torn here WRT backward incompatibility. WDYT about removing the old implementation straight away?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PS If we decide to replace the current implementation, we can probably remove DirectorySecretsStore & associated tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would be transparent to users, right? Looking in Sourcegraph, I don't see any references to DirectorySecretsStore
outside of this repo, so that seems reasonable to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Internally we seem fine.
The issue is that theoretically there are open source users of baseplate.py that might see breakage. Is it ok to risk that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's OK to remove the old unused implementation. I did some cursory checks (searching GitHub for public users of baseplate) and also checking in with the rest of the team, and our general impression is that nobody is really using Baseplate.py outside of Reddit.
This isn't necessarily great open source hygiene, but I don't think it's worth increasing our maintenance burden for a benefit we're pretty sure isn't actually there. We should call out in the release notes that it's technically a breaking change though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I've ripped out the old implementation based on this discussion.
@@ -119,7 +120,9 @@ def _decode_secret(path: str, encoding: str, value: str) -> bytes: | |||
raise CorruptSecretError(path, f"unknown encoding: {encoding!r}") | |||
|
|||
|
|||
SecretParser = Callable[[Dict[str, Any], str], Dict[str, str]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Second parameter has a default which can't be expressed with Callable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for getting this together!
Co-authored-by: Tyler Lubeck <tyler@tylerlubeck.com>
simulate_secret_update(self.csi_dir) | ||
assert original_data_path != self.csi_dir.joinpath("..data").resolve() | ||
data = secrets_store.get_credentials("secret/example-service/example-secret") | ||
assert data.username == "reddit" | ||
assert data.password == "password" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's important to validate more than one secret update -- historically, the main failure mode of our naive implementations that watch the file would work correctly for one update but will fail to notice the second.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, also with secret value updates.
Also, meta question, did we validate this in snoodev with the real Vault CSI yet? |
Yes! Tested pretty extensively in snoodev |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left a comment about the mutex
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this looks workable to me, but I am definitely neither a python nor baseplate expert so please also wait for the other reviewers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks reasonable to me, just one question about whether a race condition can actually happen.
def new_fake_csi(data: typing.Dict[str, SecretType]) -> Path: | ||
"""Creates a simulated CSI directory with data and symlinks. | ||
Note that this would already be configured before the pod starts.""" | ||
csi_dir = Path(tempfile.mkdtemp()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit (optional): since we're using pytest as our test runner already, it might be nice to use pytest's tmp_path
fixture rather than creating temporary directories manually: https://docs.pytest.org/en/latest/how-to/tmp_path.html
You wouldn't need to manually clean it up this way either, since pytest handles cleanup (by default it leaves the past couple test run outputs around which is helpful in case you want to inspect failures).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While I would like to apply this suggestion, I think this would require overhauling the test suite pretty significantly. So I'll omit this change, but TIL about this feature
def test_secret_updated(self): | ||
secrets_store = get_secrets_store(str(self.csi_dir)) | ||
data = secrets_store.get_credentials("secret/example-service/example-secret") | ||
gevent.sleep(0.1) # prevent gevent shenanigans |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the purpose of these sleeps?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tests were flaking if they weren't present. My theory is that gevent is passing control back to the tests. By sleeping, I ensure the IO is complete before the remainder of the test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Were the flakes locally or in CI? I've been running the tests a bunch locally and can't seem to reproduce any flakes. I'd like to try to dig into this if possible before we merge.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I only saw it in CI
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I looked into this a little bit and was able to replicate the issues frequently in GitHub Actions CI but never locally.
I added some debugging prints and noticed cases where the file mtime did not change but the file contents did, which caused the test failures:
Not sure if this is something weird with the CI environment (maybe like low-resolution timestamps in the filesystem or with the system clock?) or something with gevent I'm not understanding, but I think it's safe to go ahead and merge.
Co-authored-by: Chris Kuehl <chris.kuehl@reddit.com>
Co-authored-by: Chris Kuehl <chris.kuehl@reddit.com>
💸 TL;DR
This PR fixes Vault CSI support by reimplementing the CSI driver. The new driver is a bit simpler and has been tested against mocked CSI behavior (provided in test cases), as well as integration tested against a real Vault CSI driver.
📜 Details
This driver leverages this convenient tidbit from the Vault CSI code:
https://github.com/kubernetes-sigs/secrets-store-csi-driver/blob/c697863c35d5431ec048b440d36550eb3ceb338f/pkg/util/fileutil/atomic_writer.go#L60-L62
By taking advantage of this behavior (modification time of the
..data
symlink) we simplify caching logic greatly by observing modification time of the symlink for cache entry invalidation. Furthermore, since the CSI driver initializes the volume before start, we have forgone the filewatcher on this symlink.Secret files are rewritten atomically using this algorithm. The upshot is that if we don't resolve the
..data
symlink we can use it each time we read files to get the most recent version from disk.In my testing environment I was observing 2 minutes between refreshes. I haven't hunted down the configuration for this yet but that appears to be the default.
🧪 Testing Steps / Validation
This was tested with some print statements in a test environment where a Baseplate.py thrift service was serving testing traffic with Vault CSI running providing secrets. I monitored cache hits vs failures to ensure that the files weren't being reloaded more than expected.
Cache hits vs misses log
✅ Checks