-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
KEP-3314: CSI Changed Block Tracking #4082
Conversation
ihcsim
commented
Jun 12, 2023
•
edited
Loading
edited
- One-line PR description: KEP for CSI changed block tracking (CBT)
- Issue link: CSI Differential Snapshot for Block Volumes #3314
- Other comments: Supersedes KEP-3314: Changed Block Tracking With CSI VolumeSnapshotDelta #3367.
Signed-off-by: Ivan Sim <ivan.sim@dell.com>
Signed-off-by: Ivan Sim <ivan.sim@dell.com>
Filled in the Proposal, Caveats and Risks. Put in the CSI spec in the Details section.
Clarified the proposal.
Initial structure. Filled in the Proposal, Caveats and Risks. Put in the CSI spec in the Details section.
|
|
||
The `BlockMetadataType` enumeration specifies the style used: `FIXED_LENGTH` or `VARIABLE_LENGTH`. | ||
When the *block-based* style (`FIXED_LENGTH`) is used it is up to the SP plugin to define the | ||
block size. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How will backup clients make use of FIXED_LENGTH
/VARIABLE_LENGTH
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A backup client would, in general, have to map the "view" of the block device returned by the API to its own view of the block device. For example, a client could have its own concept of block size and would have to map, say, a VMware extent map, to a range of blocks; conversely, it may have to map a range of AWS EBS blocks to an extent map.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The purpose of FIXED_LENGTH
/VARIABLE_LENGTH
is still not clear to me.
The API returns an ascending, non-overlapping sequence of BlockMetadata
elements. Backup clients may accumulate adjacent BlockMetadata
elements (i.e. convert from blocks to extents) and then align offsets/lengths to the backup client's internal block size, but this does not require BlockMetadataType
.
Can you give a concrete example of how a backup client will use BlockMetadataType
to clarify?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When we discussed this in the CSI WG, the argument was thus:
- If the SP uses FIXED_LENGTH, it's a promise that lengths will be constant for the given pair of snapshots, which may unlock certain optimizations on the backup software side which wouldn't be possible with VARIABLE_LENGTH
- If the SP uses VARIABLE_LENGTH, it remains free to return whatever lengths it wants -- possible folding consecutive change blocks and thereby compressing the metadata, and the backup software has to cope with it.
If these details aren't clear from the CSI spec, then it's a failure of spec writing, and we should amend the spec to be more clear about the exact semantics of that field.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bswartz While the spec covers the effect of BlockMetadataType
on the Length
field, it's still unclear to me how a backup client would use this information.
The one example I can think of, making decisions about backup data layout (e.g. block size and the backup client's internal metadata parameters), doesn't seem to be well-served by BlockMetadataType
:
BlockMetadataType
only applies for the current API call. Maybe it could even change between successive calls on the same snapshot pair. Definitely between different snapshot pairs from the same volume.
Is a backup client that uses this information making the assumption that BlockMetadataType
and the block size does not change? If yes, then either backup clients cannot use BlockMetadataType
for this purpose or the API design needs adjustment.
An API that captures permanent attributes like block size would be something like GetVolumeMetadataHints(volume_id) -> VolumeMetadataHints
where the spec guarantees the stability of the hints across snapshots of the volume.
Or maybe I'm still missing how backup clients could use BlockMetadataType
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Recall that these new snapshot metadata APIs are stream-oriented, so "one API call" is actually the whole stream of metadata for a given pair of snapshots. If the SP commits to always use a Length of 4096 (for example) then the backup software can safely allocate a fixed-size bitmap where each bit represents 4k of change for the whole snasphot.
If an SP returns variable length blocks, then the backup software has to be prepared to use something other than a bitmap to represent the delta, or be prepared to guess a optimal bitmap granularity and be prepared to adjust delta blocks to fit the bitmap, possibly wasting small amounts of space by backing up unchanged data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FIXED_LENGTH
is expensive over the wire because adjacent BlockMetadata
elements cannot be sent merged. It would be more efficient for the SP to include an optional block size hint field in the response and allow the Length
field to be vary instead of having BlockMetadataType
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good thing the API is alpha and we can evolve it. I agree this isn't the most efficient transport mechanism, but it's dramatically more efficient that what's being done in the absence of this feature. The designers were optimizing for simplicity rather than efficiency, which is the right place to focus in an alpha version of anything.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for considering this feedback!
repeated BlockMetadata block_metadata = 3; | ||
} | ||
|
||
message GetMetadataDeltaRequest { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a difference between GetMetadataAllocatedRequest
and GetMetadataDeltaRequest
aside from the snapshot_id
vs base_snapshot_id
/target_snapshot_id
fields?
If there is no difference, maybe make base_snapshot_id optional (like the CSI spec's Snapshot
group_snapshot_id
field) and unify the two requests to reduce duplication. When base_snapshot_id
is empty then allocated data is reported. When base_snapshot_id
is non-empty then the delta is reported.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a difference between GetMetadataAllocatedRequest and GetMetadataDeltaRequest aside from the snapshot_id vs base_snapshot_id/target_snapshot_id fields?
No, not really. We had considered your suggestion and decided to explicitly document separate methods for clarity.
``` | ||
cbt.storage.k8s.io/driver: NAME_OF_THE_CSI_DRIVER | ||
``` | ||
The presence of this label allows a backup application to efficiently locate |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you have multiple CRs for the same driver?
Instead of using labels, can the name of the object == the name of the driver, which is what we require for the CSIDriver
object: https://kubernetes.io/docs/reference/kubernetes-api/config-and-storage-resources/csi-driver-v1/
Or should we put it in the CSIDriver object itself?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I don't see why we can't use the object name, instead of labels. We explored using the CSIDriver object to store this information, but was told that it was mainly used for kubelet/storage related spec, not things like HTTP endpoints.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be ambiguous to find multiple CRs for a given driver; which one should the application pick?
I agree the object could be named for the driver but the spec doesn't mandate that - it would totally depend on the driver installer. The reason is that the spec does not require these CRs to be in any particular namespace; the app finds them via a label search.
I think there were discussions on modifying the CSIDriver
object in the WG and the concept was shot down.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which spec doesn't mandate the name of the CR? We could define this requirement as such like we did for the CSIDriver object.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds reasonable. I'll add something on the name of the CR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had forgotten to remove the CR label requirement - fixed now!
existing tests to make this code solid enough prior to committing the changes necessary | ||
to implement this enhancement. | ||
|
||
##### Prerequisite testing updates |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to make sure that our standard k8s-csi logging methods are not logging any tokens/secrets.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with the sentiment, but should we be stating that in the spec?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, I don't think this needs to be explicit in the spec. It's just an implementation requirement for our common libraries/sidecars as this has been a source of CVEs in the past
Signed-off-by: Ivan Sim <ihcsim@gmail.com>
/assign @liggitt |
@@ -0,0 +1,3 @@ | |||
kep-number: 3314 | |||
alpha: | |||
approver: "@johnbelamaric" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've been looking at this before as shadow, and John is overbooked. So let's switch to me with PRR approval on this one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
/label tide/merge-method-squash |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The PRR is mostly good, I'll wait for approval from sig-storage and sig-auth.
- Components depending on the feature gate: | ||
- [x] Other | ||
- Describe the mechanism: | ||
The new components will be implemented as part of the out-of-tree CSI framework. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This still holds, not a blocking, but would be nice to answer this and questions around monitoring and troubleshooting.
to Kubernetes backup applications | ||
by creating a [SnapshotMetadataService CR](#snapshot-metadata-service-custom-resource) | ||
that contains the service's TCP endpoint address, CA certificate and | ||
an audience string needed for token authentication. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's follow up offline on this, but we may want to enforce some syntax/restrictions on the allowed audience strings. The primary goal is to prevent reusing some reserved audience for kube-apiserver. cc @liggitt
/lgtm |
@soltysh could you provide approval for PRR? Thanks. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/approve
for PRR
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: ihcsim, msau42, soltysh The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |