Vault CSI support v2 #828

Merged
merged 17 commits into from
Mar 14, 2024

Conversation

Contributor

@pnovotnak pnovotnak commented Feb 27, 2024

💸 TL;DR

This PR fixes Vault CSI support by reimplementing the CSI driver. The new driver is a bit simpler and has been tested against mocked CSI behavior (provided in test cases), as well as integration tested against a real Vault CSI driver.

📜 Details

This driver leverages this convenient tidbit from the Vault CSI code:

https://github.com/kubernetes-sigs/secrets-store-csi-driver/blob/c697863c35d5431ec048b440d36550eb3ceb338f/pkg/util/fileutil/atomic_writer.go#L60-L62

By taking advantage of this behavior (the modification time of the ..data symlink), we greatly simplify the caching logic: a cache entry is invalidated whenever the symlink's modification time differs from the one recorded for that entry. Furthermore, since the CSI driver initializes the volume before startup, we have dropped the file watcher on this symlink.

Secret files are rewritten atomically using this algorithm. The upshot is that, as long as we don't resolve the ..data symlink ahead of time, reading files through it always returns the most recent version on disk.

In my testing environment I was observing 2 minutes between refreshes. I haven't hunted down the configuration for this yet but that appears to be the default.
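
The symlink-mtime invalidation described above can be sketched roughly as follows. This is a minimal illustration rather than the code in this PR: `SymlinkMtimeCache`, `publish`, the `v1`/`v2` directory names, and the `example-secret` file name are all hypothetical, and the mtime is pinned with `os.utime` only so the demo does not depend on clock resolution.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Any, Dict, Tuple


class SymlinkMtimeCache:
    """Hypothetical reader: re-parse a secret only when ..data has been replaced."""

    def __init__(self, root: Path) -> None:
        self.data_link = root / "..data"
        self.entries: Dict[str, Tuple[int, Dict[str, Any]]] = {}
        self.misses = 0

    def get(self, name: str) -> Dict[str, Any]:
        # lstat() inspects the symlink itself, not the directory it points at.
        mtime_ns = os.lstat(self.data_link).st_mtime_ns
        entry = self.entries.get(name)
        if entry is not None and entry[0] == mtime_ns:
            return entry[1]  # cache hit
        # Cache miss: read through the unresolved symlink to get the current version.
        self.misses += 1
        with open(self.data_link / name) as f:
            data = json.load(f)
        self.entries[name] = (mtime_ns, data)
        return data


def publish(root: Path, version: str, payload: Dict[str, Any], mtime_ns: int) -> None:
    """Simulate an atomic update: write a new directory, then swap ..data onto it."""
    (root / version).mkdir()
    (root / version / "example-secret").write_text(json.dumps(payload))
    tmp_link = root / "..data_tmp"
    os.symlink(version, tmp_link)
    # Pin the link's own mtime so the demo does not depend on clock resolution.
    os.utime(tmp_link, ns=(mtime_ns, mtime_ns), follow_symlinks=False)
    os.replace(tmp_link, root / "..data")


root = Path(tempfile.mkdtemp())
publish(root, "v1", {"current": "a"}, 1_000_000_000)
store = SymlinkMtimeCache(root)
first = store.get("example-secret")
again = store.get("example-secret")    # served from the cache
publish(root, "v2", {"current": "b"}, 2_000_000_000)
updated = store.get("example-secret")  # symlink mtime changed, so this re-reads
```

Because the reader never resolves `..data`, the second `publish` is picked up on the next `get` without any file watcher.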

🧪 Testing Steps / Validation

This was tested with some print statements in a test environment where a Baseplate.py Thrift service was serving test traffic, with Vault CSI running and providing secrets. I monitored cache hits vs. misses to ensure that the files weren't being reloaded more often than expected.

Cache hits vs misses log
Feb 27 14:35:45 test-environment: INFO     Listening on ('0.0.0.0', 9090)
Feb 27 14:42:58 test-environment: cache miss: secret/example/secret-value-1@mtime=1709073701.164487, cache_entry=None, secret_data={'current': 'xxxxxxxxxxx==', 'encoding': 'base64', 'type': 'versioned'}
Feb 27 14:42:58 test-environment: cache miss: secret/example/secret-value-2@mtime=1709073701.164487, cache_entry=None, secret_data={'current': 'xxxxxxxx', 'encoding': 'base64', 'type': 'versioned'}
Feb 27 14:42:58 test-environment: cache hit: secret/example/secret-value-2@mtime=1709073701.164487, cache_entry=VaultCSIEntry(mtime=1709073701.164487, data={'current': 'xxxxxxxx', 'encoding': 'base64', 'type': 'versioned'}, updating=False)
Feb 27 14:42:58 test-environment: cache hit: secret/example/secret-value-2@mtime=1709073701.164487, cache_entry=VaultCSIEntry(mtime=1709073701.164487, data={'current': 'xxxxxxxx', 'encoding': 'base64', 'type': 'versioned'}, updating=False)
Feb 27 14:42:58 test-environment: cache hit: secret/example/secret-value-2@mtime=1709073701.164487, cache_entry=VaultCSIEntry(mtime=1709073701.164487, data={'current': 'xxxxxxxx', 'encoding': 'base64', 'type': 'versioned'}, updating=False)
Feb 27 14:42:58 test-environment: cache hit: secret/example/secret-value-2@mtime=1709073701.164487, cache_entry=VaultCSIEntry(mtime=1709073701.164487, data={'current': 'xxxxxxxx', 'encoding': 'base64', 'type': 'versioned'}, updating=False)
Feb 27 14:42:58 test-environment: cache hit: secret/example/secret-value-2@mtime=1709073701.164487, cache_entry=VaultCSIEntry(mtime=1709073701.164487, data={'current': 'xxxxxxxx', 'encoding': 'base64', 'type': 'versioned'}, updating=False)
Feb 27 14:43:19 test-environment: cache hit: secret/example/secret-value-1@mtime=1709073701.164487, cache_entry=VaultCSIEntry(mtime=1709073701.164487, data={'current': 'xxxxxxxxxxx==', 'encoding': 'base64', 'type': 'versioned'}, updating=False)
Feb 27 14:43:19 test-environment: cache hit: secret/example/secret-value-2@mtime=1709073701.164487, cache_entry=VaultCSIEntry(mtime=1709073701.164487, data={'current': 'xxxxxxxx', 'encoding': 'base64', 'type': 'versioned'}, updating=False)
Feb 27 14:43:19 test-environment: cache hit: secret/example/secret-value-2@mtime=1709073701.164487, cache_entry=VaultCSIEntry(mtime=1709073701.164487, data={'current': 'xxxxxxxx', 'encoding': 'base64', 'type': 'versioned'}, updating=False)
Feb 27 14:43:19 test-environment: cache hit: secret/example/secret-value-2@mtime=1709073701.164487, cache_entry=VaultCSIEntry(mtime=1709073701.164487, data={'current': 'xxxxxxxx', 'encoding': 'base64', 'type': 'versioned'}, updating=False)
Feb 27 14:43:19 test-environment: cache hit: secret/example/secret-value-2@mtime=1709073701.164487, cache_entry=VaultCSIEntry(mtime=1709073701.164487, data={'current': 'xxxxxxxx', 'encoding': 'base64', 'type': 'versioned'}, updating=False)
Feb 27 14:43:19 test-environment: cache hit: secret/example/secret-value-2@mtime=1709073701.164487, cache_entry=VaultCSIEntry(mtime=1709073701.164487, data={'current': 'xxxxxxxx', 'encoding': 'base64', 'type': 'versioned'}, updating=False)
Feb 27 14:43:19 test-environment: cache hit: secret/example/secret-value-2@mtime=1709073701.164487, cache_entry=VaultCSIEntry(mtime=1709073701.164487, data={'current': 'xxxxxxxx', 'encoding': 'base64', 'type': 'versioned'}, updating=False)
Feb 27 14:43:19 test-environment: cache hit: secret/example/secret-value-1@mtime=1709073701.164487, cache_entry=VaultCSIEntry(mtime=1709073701.164487, data={'current': 'xxxxxxxxxxx==', 'encoding': 'base64', 'type': 'versioned'}, updating=False)
Feb 27 14:43:19 test-environment: cache hit: secret/example/secret-value-2@mtime=1709073701.164487, cache_entry=VaultCSIEntry(mtime=1709073701.164487, data={'current': 'xxxxxxxx', 'encoding': 'base64', 'type': 'versioned'}, updating=False)
Feb 27 14:43:19 test-environment: cache hit: secret/example/secret-value-2@mtime=1709073701.164487, cache_entry=VaultCSIEntry(mtime=1709073701.164487, data={'current': 'xxxxxxxx', 'encoding': 'base64', 'type': 'versioned'}, updating=False)
Feb 27 14:43:19 test-environment: cache hit: secret/example/secret-value-2@mtime=1709073701.164487, cache_entry=VaultCSIEntry(mtime=1709073701.164487, data={'current': 'xxxxxxxx', 'encoding': 'base64', 'type': 'versioned'}, updating=False)
Feb 27 14:43:19 test-environment: cache hit: secret/example/secret-value-2@mtime=1709073701.164487, cache_entry=VaultCSIEntry(mtime=1709073701.164487, data={'current': 'xxxxxxxx', 'encoding': 'base64', 'type': 'versioned'}, updating=False)
Feb 27 14:43:19 test-environment: cache hit: secret/example/secret-value-2@mtime=1709073701.164487, cache_entry=VaultCSIEntry(mtime=1709073701.164487, data={'current': 'xxxxxxxx', 'encoding': 'base64', 'type': 'versioned'}, updating=False)
Feb 27 14:43:19 test-environment: cache hit: secret/example/secret-value-2@mtime=1709073701.164487, cache_entry=VaultCSIEntry(mtime=1709073701.164487, data={'current': 'xxxxxxxx', 'encoding': 'base64', 'type': 'versioned'}, updating=False)
Feb 27 14:43:19 test-environment: cache hit: secret/example/secret-value-1@mtime=1709073701.164487, cache_entry=VaultCSIEntry(mtime=1709073701.164487, data={'current': 'xxxxxxxxxxx==', 'encoding': 'base64', 'type': 'versioned'}, updating=False)
Feb 27 14:43:19 test-environment: cache hit: secret/example/secret-value-2@mtime=1709073701.164487, cache_entry=VaultCSIEntry(mtime=1709073701.164487, data={'current': 'xxxxxxxx', 'encoding': 'base64', 'type': 'versioned'}, updating=False)
Feb 27 14:43:19 test-environment: cache hit: secret/example/secret-value-2@mtime=1709073701.164487, cache_entry=VaultCSIEntry(mtime=1709073701.164487, data={'current': 'xxxxxxxx', 'encoding': 'base64', 'type': 'versioned'}, updating=False)
Feb 27 14:43:19 test-environment: cache hit: secret/example/secret-value-2@mtime=1709073701.164487, cache_entry=VaultCSIEntry(mtime=1709073701.164487, data={'current': 'xxxxxxxx', 'encoding': 'base64', 'type': 'versioned'}, updating=False)
Feb 27 14:43:19 test-environment: cache hit: secret/example/secret-value-2@mtime=1709073701.164487, cache_entry=VaultCSIEntry(mtime=1709073701.164487, data={'current': 'xxxxxxxx', 'encoding': 'base64', 'type': 'versioned'}, updating=False)
Feb 27 14:43:19 test-environment: cache hit: secret/example/secret-value-2@mtime=1709073701.164487, cache_entry=VaultCSIEntry(mtime=1709073701.164487, data={'current': 'xxxxxxxx', 'encoding': 'base64', 'type': 'versioned'}, updating=False)
Feb 27 14:43:19 test-environment: cache hit: secret/example/secret-value-2@mtime=1709073701.164487, cache_entry=VaultCSIEntry(mtime=1709073701.164487, data={'current': 'xxxxxxxx', 'encoding': 'base64', 'type': 'versioned'}, updating=False)
Feb 27 14:44:10 test-environment: cache hit: secret/example/secret-value-1@mtime=1709073820.1701584, cache_entry=VaultCSIEntry(mtime=1709073701.164487, data={'current': 'xxxxxxxxxxx==', 'encoding': 'base64', 'type': 'versioned'}, updating=False)
Feb 27 14:44:10 test-environment: cache miss: secret/example/secret-value-1@mtime=1709073820.1701584, cache_entry=VaultCSIEntry(mtime=1709073701.164487, data={'current': 'xxxxxxxxxxx==', 'encoding': 'base64', 'type': 'versioned'}, updating=False), secret_data={'current': 'xxxxxxxxxxx==', 'encoding': 'base64', 'type': 'versioned'}
Feb 27 14:44:10 test-environment: cache hit: secret/example/secret-value-2@mtime=1709073820.1701584, cache_entry=VaultCSIEntry(mtime=1709073701.164487, data={'current': 'xxxxxxxx', 'encoding': 'base64', 'type': 'versioned'}, updating=False)
Feb 27 14:44:10 test-environment: cache miss: secret/example/secret-value-2@mtime=1709073820.1701584, cache_entry=VaultCSIEntry(mtime=1709073701.164487, data={'current': 'xxxxxxxx', 'encoding': 'base64', 'type': 'versioned'}, updating=False), secret_data={'current': 'xxxxxxxx', 'encoding': 'base64', 'type': 'versioned'}
Feb 27 14:44:10 test-environment: cache hit: secret/example/secret-value-2@mtime=1709073820.1701584, cache_entry=VaultCSIEntry(mtime=1709073820.1701584, data={'current': 'xxxxxxxx', 'encoding': 'base64', 'type': 'versioned'}, updating=False)
Feb 27 14:44:10 test-environment: cache hit: secret/example/secret-value-2@mtime=1709073820.1701584, cache_entry=VaultCSIEntry(mtime=1709073820.1701584, data={'current': 'xxxxxxxx', 'encoding': 'base64', 'type': 'versioned'}, updating=False)
Feb 27 14:44:10 test-environment: cache hit: secret/example/secret-value-2@mtime=1709073820.1701584, cache_entry=VaultCSIEntry(mtime=1709073820.1701584, data={'current': 'xxxxxxxx', 'encoding': 'base64', 'type': 'versioned'}, updating=False)
Feb 27 14:44:10 test-environment: cache hit: secret/example/secret-value-2@mtime=1709073820.1701584, cache_entry=VaultCSIEntry(mtime=1709073820.1701584, data={'current': 'xxxxxxxx', 'encoding': 'base64', 'type': 'versioned'}, updating=False)

✅ Checks

  • CI tests (if present) are passing
  • Adheres to code style for repo
  • Contributor License Agreement (CLA) completed if not a Reddit employee

@pnovotnak pnovotnak mentioned this pull request Feb 27, 2024
3 tasks
TylerLubeck and others added 3 commits February 27, 2024 14:05
* Ensure VaultCSI secrets are sourcing from a directory

* Utilize the parser mechanisms

* Update error messages
@pnovotnak pnovotnak marked this pull request as ready for review February 27, 2024 23:36
@pnovotnak pnovotnak requested a review from a team as a code owner February 27, 2024 23:36
Comment on lines 597 to 600
 if options.provider == "vault_csi":
-    parser = parse_vault_csi
-    return DirectorySecretsStore(options.path, parser, timeout=timeout, backoff=backoff)
+    return VaultCSISecretsStore(
+        options.path, parser=parse_vault_csi, timeout=timeout, backoff=backoff
+    )
Contributor Author

@chriskuehl We are a little torn here WRT backward compatibility. WDYT about removing the old implementation straight away?

Contributor Author

PS If we decide to replace the current implementation, we can probably remove DirectorySecretsStore & associated tests.

Member

This would be transparent to users, right? Looking in Sourcegraph, I don't see any references to DirectorySecretsStore outside of this repo, so that seems reasonable to me.

Contributor Author

Internally we seem fine.

The issue is that theoretically there are open source users of baseplate.py that might see breakage. Is it ok to risk that?

Member

I think it's OK to remove the old unused implementation. I did some cursory checks (searching GitHub for public users of baseplate) and also checked in with the rest of the team, and our general impression is that nobody is really using Baseplate.py outside of Reddit.

This isn't necessarily great open source hygiene, but I don't think it's worth increasing our maintenance burden for a benefit we're pretty sure isn't actually there. We should call out in the release notes that it's technically a breaking change though.

Contributor Author

Ok, I've ripped out the old implementation based on this discussion.

@@ -119,7 +120,9 @@ def _decode_secret(path: str, encoding: str, value: str) -> bytes:
raise CorruptSecretError(path, f"unknown encoding: {encoding!r}")


SecretParser = Callable[[Dict[str, Any], str], Dict[str, str]]
Contributor Author

@pnovotnak pnovotnak Feb 27, 2024

Second parameter has a default which can't be expressed with Callable
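
One way to give such a callable a precise type is a `typing.Protocol` with a `__call__` method, which can carry the default value. The following is a minimal sketch and not the change made in this PR; the parameter names, the empty-string default, and `parse_example` are assumptions for illustration only.

```python
from typing import Any, Dict, Protocol


class SecretParser(Protocol):
    """Callable type whose second argument is optional."""

    def __call__(self, data: Dict[str, Any], secret_path: str = "") -> Dict[str, str]:
        ...


def parse_example(data: Dict[str, Any], secret_path: str = "") -> Dict[str, str]:
    # Hypothetical parser, used only to show that the Protocol accepts it.
    return {"path": secret_path, "type": str(data.get("type", ""))}


# A type checker accepts this assignment because the signatures match,
# including the default on the second parameter.
parser: SecretParser = parse_example

one_arg = parser({"type": "versioned"})
two_args = parser({"type": "versioned"}, "secret/example")
```

The trade-off is a little more boilerplate than a `Callable[...]` alias, in exchange for a type that lets callers omit the second argument.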

Contributor

@KTAtkinson KTAtkinson left a comment

Thanks for getting this together!

Co-authored-by: Tyler Lubeck <tyler@tylerlubeck.com>
Comment on lines 147 to 151
simulate_secret_update(self.csi_dir)
assert original_data_path != self.csi_dir.joinpath("..data").resolve()
data = secrets_store.get_credentials("secret/example-service/example-secret")
assert data.username == "reddit"
assert data.password == "password"
Contributor

It's important to validate more than one secret update. Historically, the main failure mode of our naive implementations that watch the file was that they worked correctly for one update but failed to notice the second.

Contributor Author

Done, also with secret value updates.

@kylelemons
Contributor

kylelemons commented Feb 29, 2024

Also, meta question, did we validate this in snoodev with the real Vault CSI yet?

@pnovotnak
Contributor Author

Yes! Tested pretty extensively in snoodev

Contributor

@kylelemons kylelemons left a comment

Left a comment about the mutex

Contributor

@kylelemons kylelemons left a comment

I think this looks workable to me, but I am definitely neither a python nor baseplate expert so please also wait for the other reviewers.

Member

@chriskuehl chriskuehl left a comment

Looks reasonable to me, just one question about whether a race condition can actually happen.

def new_fake_csi(data: typing.Dict[str, SecretType]) -> Path:
    """Creates a simulated CSI directory with data and symlinks.
    Note that this would already be configured before the pod starts."""
    csi_dir = Path(tempfile.mkdtemp())
Member

nit (optional): since we're using pytest as our test runner already, it might be nice to use pytest's tmp_path fixture rather than creating temporary directories manually: https://docs.pytest.org/en/latest/how-to/tmp_path.html

You wouldn't need to manually clean it up this way either, since pytest handles cleanup (by default it leaves the past couple test run outputs around which is helpful in case you want to inspect failures).
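
For reference, a test using the `tmp_path` fixture might look roughly like the sketch below. It is not taken from this PR's test suite: `make_fake_csi` is a hypothetical helper, and the `..v1` directory layout is a simplification of what the real CSI driver writes.

```python
import json
import os
from pathlib import Path
from typing import Any, Dict


def make_fake_csi(root: Path, secrets: Dict[str, Dict[str, Any]]) -> Path:
    """Hypothetical helper: lay out a versioned directory and a ..data symlink."""
    version_dir = root / "..v1"
    version_dir.mkdir()
    for name, payload in secrets.items():
        (version_dir / name).write_text(json.dumps(payload))
    # Relative symlink, so the layout stays valid wherever root lives.
    os.symlink(version_dir.name, root / "..data")
    return root


def test_reads_secret_through_data_symlink(tmp_path: Path) -> None:
    # pytest injects tmp_path, a fresh per-test directory that it also cleans up.
    csi_dir = make_fake_csi(tmp_path, {"example-secret": {"type": "versioned", "current": "x"}})
    with open(csi_dir / "..data" / "example-secret") as f:
        assert json.load(f)["current"] == "x"
```

No import from pytest is needed for the fixture itself; pytest matches the `tmp_path` argument by name when it collects the test.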

Contributor Author

While I would like to apply this suggestion, I think this would require overhauling the test suite pretty significantly. So I'll omit this change, but TIL about this feature

def test_secret_updated(self):
    secrets_store = get_secrets_store(str(self.csi_dir))
    data = secrets_store.get_credentials("secret/example-service/example-secret")
    gevent.sleep(0.1)  # prevent gevent shenanigans
Member

What's the purpose of these sleeps?

Contributor Author

Tests were flaking if they weren't present. My theory is that gevent is passing control back to the tests. By sleeping, I ensure the IO is complete before the remainder of the test.

Member

Were the flakes locally or in CI? I've been running the tests a bunch locally and can't seem to reproduce any flakes. I'd like to try to dig into this if possible before we merge.

Contributor Author

I only saw it in CI

Member

I looked into this a little bit and was able to replicate the issues frequently in GitHub Actions CI but never locally.

I added some debugging prints and noticed cases where the file mtime did not change but the file contents did, which caused the test failures:
[Screenshot: 2024-03-07, 11:53:17 AM]

Not sure if this is something weird with the CI environment (maybe like low-resolution timestamps in the filesystem or with the system clock?) or something with gevent I'm not understanding, but I think it's safe to go ahead and merge.
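
The observation above (contents changed while the mtime stayed the same) can be reproduced in isolation with the sketch below, which forces two successive symlinks to carry an identical mtime. It also shows one possible mitigation, which is not something this PR does: including the symlink target in the cache key. The sketch assumes each update points ..data at a differently named directory; `cache_key`, `swap_link`, and the `v1`/`v2` names are hypothetical.

```python
import os
import tempfile
from pathlib import Path
from typing import Tuple


def cache_key(data_link: Path) -> Tuple[int, str]:
    """Combine the symlink's own mtime with the name of the directory it points at."""
    st = os.lstat(data_link)
    return (st.st_mtime_ns, os.readlink(data_link))


def swap_link(root: Path, target: str, mtime_ns: int) -> None:
    """Point ..data at target, forcing a chosen mtime to mimic a coarse clock."""
    tmp_link = root / "..data_tmp"
    os.symlink(target, tmp_link)
    os.utime(tmp_link, ns=(mtime_ns, mtime_ns), follow_symlinks=False)
    os.replace(tmp_link, root / "..data")


root = Path(tempfile.mkdtemp())
swap_link(root, "v1", 1_000_000_000)
first = cache_key(root / "..data")
swap_link(root, "v2", 1_000_000_000)  # same mtime, different target
second = cache_key(root / "..data")

# An mtime-only comparison would treat these as the same version.
mtime_only_sees_change = first[0] != second[0]
combined_key_sees_change = first != second
```

Whether this matters outside CI depends on the timestamp granularity of the filesystem backing the volume, which this sketch does not measure.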

pnovotnak and others added 3 commits March 5, 2024 09:10
Co-authored-by: Chris Kuehl <chris.kuehl@reddit.com>
Co-authored-by: Chris Kuehl <chris.kuehl@reddit.com>
@chriskuehl chriskuehl merged commit 82dd952 into develop Mar 14, 2024
5 checks passed
@chriskuehl chriskuehl deleted the vault-csi-support-v2 branch March 14, 2024 16:17
5 participants