generated from cybozu-go/neco-template
Signed-off-by: zeroalphat <taichi-takemura@cybozu.co.jp>
1 parent b008988, commit dc9f480
Showing 1 changed file with 152 additions and 0 deletions.
@@ -0,0 +1,152 @@
Design Document
===============

## Context and Scope

Profiling containers that run on Kubernetes with perf is possible, but it requires strong permissions and a lot of manual work.
NecoPerf provides an easy way to profile running containers by automating many of these manual operations.

## Goals

- Provide an easy way for tenant teams to run perf and profile their workloads
- Allow users to specify options when profiling

## Non-goals

- Support for various operating systems (the initial implementation supports Flatcar only)
- TLS support (to be implemented in the future)
- Profiling of child processes
  - e.g. a container using [tini](https://github.com/krallin/tini)
- Continuous profiling
- Processing and visualization of acquired profile data, including conversion to [FlameGraph](https://github.com/brendangregg/FlameGraph)

## Proposal

### User Stories

This section describes what happens today when a user profiles a workload with perf.

- The Kubernetes cluster in these user stories is assumed to be a multi-tenant environment
  - One team manages the cluster and several teams use it
  - A team that uses the cluster is called a tenant team
  - Tenant teams do not have strong privileges

- A tenant team is aware of performance issues in its workloads and wants to profile them with perf to identify bottlenecks. However, running perf requires the following manual steps:
  1. Install a perf binary that is compatible with the kernel version of the host operating system in the container image
  2. Modify the manifest to add a sidecar or ephemeral container with the permissions needed to run perf
  3. Enter the sidecar or ephemeral container and run perf against the target container to retrieve the profile

- The team managing the Kubernetes cluster wants to minimize the permissions granted to tenant teams.
  However, to run perf, a tenant team must be able to grant capabilities such as `CAP_SYS_ADMIN` and `CAP_SYS_PTRACE`, which violates the principle of least privilege.

NecoPerf removes these manual operations and makes it easy to profile containers with perf.

### Constraints

- Restrictions on resolving symbols
  - perf needs debug symbols to resolve symbols.
    These debug symbols must be included in the container image to be profiled
- Possible failure to retrieve a profile due to pod status
  - Because NecoPerf profiles by PID, profiling may fail if the target process terminates while it is being profiled

### Risks and Mitigations

- Security Risk
  - `CAP_SYSLOG` is required to read kernel addresses that are otherwise hidden from unprivileged users (`kptr_restrict`)
  - `CAP_SYS_ADMIN` and `CAP_SYS_CHROOT` are required so that perf can resolve addresses to symbols in a container environment
  - Using NecoPerf removes the need to give tenant teams strong capabilities such as `CAP_SYS_ADMIN` and `CAP_SYS_CHROOT` to run perf
  - hostPID must be enabled so that NecoPerf can look up host PIDs (process IDs) from within its pod
  - NecoPerf converts a container ID to a PID via the CRI (Container Runtime Interface) API.
    Therefore, NecoPerf needs to mount the socket of the container runtime, which leaves NecoPerf with more functionality than it needs.
    If a read-only CRI API is added in the future, we would like to switch to that API.
- Performance Risk
  - To prevent tenant teams from running perf for long periods, NecoPerf validates the values in the user request

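The container ID to PID conversion mentioned under Security Risk could look roughly like the following Go sketch. It does not talk to a real CRI socket. It only decodes the extra `info` map that a verbose `ContainerStatus` response carries, and it assumes a containerd-style payload in which the `"info"` entry is a JSON document with a `pid` field. The function name `pidFromVerboseInfo`, the struct, and the sample payload are hypothetical; other runtimes may lay this data out differently.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// containerInfo mirrors only the field this sketch needs from the
// runtime-specific JSON payload (assumed layout, see the text above).
type containerInfo struct {
	Pid int `json:"pid"`
}

// pidFromVerboseInfo extracts the host PID from the "info" entry of a
// verbose CRI ContainerStatus response. The key name and the JSON layout
// are assumptions based on containerd's behaviour.
func pidFromVerboseInfo(info map[string]string) (int, error) {
	raw, ok := info["info"]
	if !ok {
		return 0, errors.New(`verbose info has no "info" entry`)
	}
	var ci containerInfo
	if err := json.Unmarshal([]byte(raw), &ci); err != nil {
		return 0, fmt.Errorf("decode container info: %w", err)
	}
	if ci.Pid <= 0 {
		// A stopped container has no usable PID to profile.
		return 0, fmt.Errorf("invalid pid %d", ci.Pid)
	}
	return ci.Pid, nil
}

func main() {
	// Hypothetical payload; a real one would come from the CRI socket.
	info := map[string]string{"info": `{"pid": 4321}`}
	pid, err := pidFromVerboseInfo(info)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("host PID:", pid)
}
```

Rejecting a non-positive PID early ties in with the constraint that profiling is PID-based: a container that is not running cannot be profiled.
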
## The actual design

The first implementation is a gRPC server that simply runs perf against the specified container ID and returns the profiling results.
The server uses the perf command both to record the profile and to convert the recorded data.

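As a rough illustration of the record and convert steps, the Go sketch below only builds the argument lists that a server might pass to `perf`. It assumes the common pattern of attaching to a PID with `perf record -p` and bounding the run with a `sleep` workload, followed by `perf script -i` for conversion. The function names, the file path, and the exact set of flags are hypothetical and are not taken from the NecoPerf implementation.

```go
package main

import (
	"fmt"
	"strconv"
)

// recordArgs builds arguments for `perf record` attached to a single PID.
// The trailing `sleep <seconds>` workload limits how long perf records.
func recordArgs(pid int, seconds int, out string) []string {
	return []string{
		"record",
		"-p", strconv.Itoa(pid), // attach to an existing process
		"-o", out, // write the raw profile to this file
		"--", "sleep", strconv.Itoa(seconds),
	}
}

// scriptArgs builds arguments for `perf script`, which converts the
// recorded data into a textual trace.
func scriptArgs(in string) []string {
	return []string{"script", "-i", in}
}

func main() {
	// Hypothetical values for illustration only.
	data := "/tmp/necoperf.data"
	fmt.Println("perf", recordArgs(4321, 10, data))
	fmt.Println("perf", scriptArgs(data))
}
```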
We also provide a command-line tool that acts as a client of the gRPC server.
The tool queries the Kubernetes API server with the pod and container names entered by the user to retrieve the container ID.
It then sends a profiling request with that container ID to the gRPC server.

```console
necoperf-client -n <namespace> <pod-name> -c <container name> -o <output directory>
```

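Kubernetes reports a container ID in pod status in the form `<runtime>://<id>` (for example `containerd://...`), so a client typically has to separate the two parts before using the ID. The Go sketch below shows one way to do that. The function name `splitContainerID` and the sample ID are hypothetical, and whether NecoPerf sends the prefixed or the bare ID is not stated in this document.

```go
package main

import (
	"fmt"
	"strings"
)

// splitContainerID splits a pod-status container ID of the form
// "<runtime>://<id>" into its runtime name and bare ID.
func splitContainerID(s string) (string, string, error) {
	rt, id, found := strings.Cut(s, "://")
	if !found || rt == "" || id == "" {
		return "", "", fmt.Errorf("unexpected container ID format: %q", s)
	}
	return rt, id, nil
}

func main() {
	// Hypothetical value, shortened for readability.
	rt, id, err := splitContainerID("containerd://0123abcd")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("runtime:", rt, "id:", id)
}
```

An empty container ID usually means the container has not started yet, which is why the sketch treats it as an error.
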
### API

```protobuf
service NecoPerf {
  rpc Record(PerfRecordRequest) returns (PerfRecordResponse);
}

message PerfRecordRequest {
  string container_id = 1;
  int64 interval = 2;
}

message PerfRecordResponse {
  bytes data = 1;
}
```

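The `interval` field is the natural place for the request validation mentioned under Performance Risk. The Go sketch below shows one possible check. It assumes `interval` is expressed in seconds and uses a hypothetical upper bound of 60; neither the unit nor the limit is specified in this document.

```go
package main

import (
	"errors"
	"fmt"
)

// maxIntervalSeconds is a hypothetical upper bound chosen for this sketch.
const maxIntervalSeconds int64 = 60

// errInvalidInterval lets callers detect this failure with errors.Is.
var errInvalidInterval = errors.New("interval out of range")

// validateInterval rejects non-positive values and values above the bound,
// so that a single request cannot keep perf running for a long time.
func validateInterval(interval int64) error {
	if interval <= 0 || interval > maxIntervalSeconds {
		return fmt.Errorf("%w: got %d, want 1..%d",
			errInvalidInterval, interval, maxIntervalSeconds)
	}
	return nil
}

func main() {
	for _, v := range []int64{10, 0, 3600} {
		if err := validateInterval(v); err != nil {
			fmt.Println("rejected:", err)
			continue
		}
		fmt.Println("accepted:", v)
	}
}
```

Performing this check on the server side matters because the client is run by tenant teams and cannot be trusted to enforce the limit.
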
### System Context Diagram

```mermaid
graph TD;
    User-->|exec|necoperf-client
    necoperf-client-->|GET|k8s-api-server[kube-apiserver]
    necoperf-client -->|gRPC call|necoperf-daemon
    subgraph node1
        necoperf-daemon-->|CRI call|CRI
        perf-->|profile|pod[target pod]
        perf-.->|export/read|perf.data((necoperf.data))
        perf-.->|export|perf.script((necoperf.script))
        necoperf-daemon-->|exec|perf
        subgraph daemonset
            necoperf-daemon
        end
    end
    subgraph your-pod
        necoperf-client-.->|export|result((result))
    end
```

## Alternatives

This section lists some existing systems and explains why they are not used.

- [IBM/perf-sidecar-injector](https://github.com/IBM/perf-sidecar-injector)
  - perf-sidecar-injector is a mutating webhook that adds a perf container as a sidecar container
  - perf-sidecar-injector requires privileged access to run the perf container
  - perf-sidecar-injector needs the Pod's `shareProcessNamespace` to be enabled so that the sidecar can access the target container.
    Enabling `shareProcessNamespace` allows other containers in the pod to see environment variables and file systems.
    Some tenant teams may not accept this.
- [yahoo/kubectl-flame](https://github.com/yahoo/kubectl-flame)
  - kubectl-flame is a kubectl plugin that allows profiling of applications on Kubernetes
  - kubectl-flame uses perf to profile Node.js applications.
  - The perf command-line arguments used by kubectl-flame are hard-coded, and only the execution time can be changed.
    <https://github.com/yahoo/kubectl-flame/blob/master/agent/profiler/perf.go#L60>
  - kubectl-flame only supports the Docker runtime and does not support the containerd runtime.
    <https://github.com/yahoo/kubectl-flame/issues/51>
- [iovisor/kubectl-trace](https://github.com/iovisor/kubectl-trace)
  - kubectl-trace is a kubectl plugin that schedules bpftrace programs against Pods on a Kubernetes cluster
  - kubectl-trace only supports tracing against Pods and does not support profiling
- [giannisalinetti/perf-utils](https://github.com/giannisalinetti/perf-utils)
  - The perf-utils container image installs tools for performance analysis and troubleshooting on immutable systems such as Fedora CoreOS
  - perf-utils does not install a perf that is compatible with the host kernel version

The sidecar container method and the ephemeral container method have the following problems.

- The sidecar container method requires the sidecar container to be deployed beforehand.
  If you add the sidecar container later, the pod has to be restarted.
- As of Kubernetes 1.26, an ephemeral container cannot be changed or removed once it has been added to a Pod
  > Like regular containers, you may not change or remove an ephemeral container after you have added it to a Pod.
  [Ephemeral Container](https://kubernetes.io/docs/concepts/workloads/pods/ephemeral-containers/#understanding-ephemeral-containers)
- Tenant teams must be allowed to grant capabilities such as `CAP_SYS_ADMIN` to a Pod
- It is difficult for tenant teams to prepare a version of perf that is compatible with the host OS