
Action Details


Turbonomic Action Execution

Want to learn more about how KubeTurbo works to execute actions? Read on, but first please visit the Use Cases Overview page to have context on how our unique analysis is driving value.

Resizing (Vertical Scaling) of Containerized Workloads

Resizing action execution can be accomplished in any one of the following ways:

  1. In the running environment by KubeTurbo
  2. The user gets the action (via UI, or API) and then manually makes the change
  3. Integrate Turbonomic resize actions into your CD Pipeline / GitOps process to have your pipeline execute the change

When KubeTurbo resizes a running workload, this is essentially the equivalent of running kubectl edit workloadType foo and modifying the container specs. KubeTurbo will apply all container spec changes at the same time, minimizing the number of restarts to only 1. The workload will follow your specified deployment strategy of rolling update, etc. Once successful, you will be able to see the history of executed actions on the Workload Controller.
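As an illustration, a successful resize shows up directly in the workload controller's pod template. The fragment below is a hypothetical Deployment after a resize action (the workload name and values are made up); requests and limits for all containers are changed in a single edit, so the controller performs only one rollout:

spec:
  template:
    spec:
      containers:
        - name: app
          resources:
            requests:
              cpu: 250m        # raised from 100m
              memory: 512Mi    # raised from 256Mi
            limits:
              cpu: "1"         # lowered from 2
              memory: 1Gi      # unchanged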

Is your workload managed by an Operator? You can still leverage Turbonomic to execute the resizing actions in the running environment by leveraging an Operator Resource Mapping. Samples provided here.

The Turbonomic Action Execution Framework also supports:

  • Scheduling actions
  • Integration to workflow orchestration such as ServiceNow
  • Action modes of Automated, Manual, and Recommend only
  • Policies that can be applied to any scope: a single workload, or groups of workloads by namespace and/or cluster
  • A declarative k8s native approach to integrate into your GitOps process

StatefulSet and DaemonSet execution in a running environment is supported as of version 8.3.5, but the user needs to determine the level of disruption this action may cause and whether to leverage a pipeline schedule instead.

The key differentiator of Turbonomic resizing actions is that ALL replicas that ever ran for a workload are included in the rightsizing analysis, not just the currently running pod. This history feeds our unique analysis to generate actionable decisions that manage the tradeoffs in limits and requests for performance and efficiency.

Turbonomic provides some out-of-the-box policies for rightsizing workloads that may require special handling such as side-cars, operator-controlled workloads, and system namespace workloads. For more information go here.

Workloads with manual triggers in Red Hat OpenShift Environments

DeploymentConfigs in Red Hat OpenShift can be configured with triggers, and in some cases the specified trigger requires rollouts to be performed manually (see Considerations for manual vs automatic rollouts here). Additionally, the imageChange trigger type prevents a rollout of the updated DeploymentConfig when only spec fields change. Turbonomic's container resize action execution updates the parent controller only, with the expectation that the cluster's resource controllers will keep the running pods in sync with the pod template specified in the parent controller. DeploymentConfigs with the triggers listed above will therefore not update their pods after the Turbonomic action is executed.
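For context, the fragment below is a minimal, hypothetical DeploymentConfig trigger section of the kind this behavior applies to; the names are illustrative:

apiVersion: apps.openshift.io/v1
kind: DeploymentConfig
metadata:
  name: example-dc
spec:
  triggers:
    # Roll out only when the referenced image stream tag changes; edits that
    # touch only spec fields (such as container resources) do not trigger a
    # rollout on their own.
    - type: ImageChange
      imageChangeParams:
        automatic: true
        containerNames:
          - example-container
        from:
          kind: ImageStreamTag
          name: example-image:latest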

Although this behavior is consistent both with the triggers' semantics and with Turbonomic's definition of an executed action, users can configure kubeturbo to additionally roll out the updated DeploymentConfig even when kubeturbo finds triggers of the types stated above.

To enable this behavior, the below feature gate needs to be enabled in kubeturbo (disabled by default). Identify your deployment method and make the change:

  • straight yamls - modify the ConfigMap resource (a fuller ConfigMap sketch follows this list)
      "featureGates": {
        "ForceDeploymentConfigRollout": true
      }
  • helm chart - provide this parameter --set featureGates.ForceDeploymentConfigRollout=true

  • operator - add to the CR

spec:
  featureGates:
    ForceDeploymentConfigRollout: true  # default: false
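For the straight-yaml method, the featureGates snippet sits inside the JSON held by kubeturbo's ConfigMap. The fragment below is a hedged sketch only; the ConfigMap name, the turbo.config data key, and the surrounding keys are assumptions based on typical kubeturbo deployment yamls, so verify them against your own resources:

apiVersion: v1
kind: ConfigMap
metadata:
  name: turbo-config          # name varies by deployment
data:
  turbo.config: |-
    {
      "communicationConfig": {
        "serverMeta": {
          "turboServer": "https://<your-turbonomic-server>"
        }
      },
      "featureGates": {
        "ForceDeploymentConfigRollout": true
      }
    }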

Node Provision and Suspend (Cluster Scaling)

This wiki is not the source for this information anymore and is no longer updated

NOTE: Please see official IBM Documentation for up-to-date information

Turbonomic analyzes the demand of all workloads and decides when clusters need more or fewer nodes by weighing the tradeoffs of efficiency (can workloads safely consolidate?), performance (avoiding node congestion), and compliance (understanding node selection policies). Relying on pending pods is too late and wastes resources. Turbonomic allows you to save money by not over-provisioning, while at the same time assuring that pods can run on nodes that are not congested.

For more information on default node scaling policies based on Node Roles, and how to manage which nodes will suspend by role, see Node Role Scaling and HA.

You can also execute Node Provision (create a new node from a specific node pool or machineset) or Node Suspend (delete a node from a specific node pool or machineset) actions directly from Turbonomic if you have one of the following capabilities:

  1. Red Hat OpenShift 4.x Machine API Operator
  2. AKS node pool (with the Azure subscription as a target)
  3. EKS node pool (with AWS account as a target)

Coming Soon:

  1. GKE node pool (with Google Cloud as a target)

Turbonomic uses the APIs available from these solutions to invoke the creation or deletion of a node.
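As an illustration for Red Hat OpenShift, a node provision executed through the Machine API Operator is roughly equivalent to raising the replica count of a MachineSet; the MachineSet name below is hypothetical, and Turbonomic drives the change through the API rather than the CLI:

# Hypothetical example: add one worker node by scaling a MachineSet.
oc scale machineset example-worker-machineset \
  -n openshift-machine-api --replicas=4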

Node provision and suspend actions can also be seen at a Node Pool level to understand the cost impact of managing cluster capacity.

Turbonomic also allows a user to execute any action through our action framework by making calls to third-party workflow orchestrators such as ServiceNow, Ansible, and Terraform. Consult the Turbonomic User Guide for more information.

For more information on how to configure min and max number of nodes, refer to the article here.

Coming soon: Turbonomic is working on Google support for a future release to scale GKE node pools.

SLO Horizontal Scaling

KubeTurbo can also execute actions to horizontally scale a deployment that has an SLO policy defined. KubeTurbo will execute the equivalent of kubectl scale deployment foo --replicas=x when there is an action to provision or suspend (delete) a pod.

Requirements:

  • SLO metrics of response time or transaction throughput/rate are captured either with Instana or Dynatrace as a target (Dynatrace starting with Turbonomic version 8.10.4), or with custom metrics queried from Prometheus via Prometurbo
  • Define an SLO policy to set your targeted SLO and min/max replicas
  • Set your policy either through the Turbo UI, or leverage the Kubeturbo SLO Custom Resource Definitions: define your SLO policy in one CR and the services you want to bind it to in another policy binding CR (a hedged sketch follows this list)
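The fragment below is a hedged sketch of the two CRs. The API group, kinds, and field names are assumptions drawn from the Kubeturbo SLO CRD samples and may differ in your release, so treat it as illustrative and compare against the samples shipped with your version:

apiVersion: policy.turbonomic.io/v1alpha1
kind: SLOHorizontalScale
metadata:
  name: example-slo-policy
spec:
  minReplicas: 1
  maxReplicas: 10
  objectives:
    - name: ResponseTime       # target response time (ms), illustrative
      value: 300
  behavior:
    scaleUp: Automatic
    scaleDown: Manual
---
apiVersion: policy.turbonomic.io/v1alpha1
kind: PolicyBinding
metadata:
  name: example-slo-binding
spec:
  policyRef:
    kind: SLOHorizontalScale
    name: example-slo-policy
  targets:
    - name: example-service    # service(s) the SLO policy binds to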

The key differentiator is that Turbonomic will not only recommend and automate the number of replicas needed to maintain the SLO, but its analysis will also tell you how many nodes are required to run those replicas alongside the other workloads in the cluster. Smart scaling with intelligent pod placement.

Read the blog here

  • Manage horizontal scaling of services without thresholds
  • Manage the trade-offs of performance, availability of resources, and compliance
  • Leverage your SLO data, adding response time and throughput from the telemetry data you already collect (Istio, Prometheus, etc.)

Turbonomic Pod Moves (continuous rescheduling)

Mitigate congestion while defragmenting cluster resources. This capability is unique to Turbonomic and warrants some detail.

The Problem

The Kubernetes scheduler does the work of scheduling a pod based on the workload specification (requests, tolerations, etc) and the assessment of the cluster’s available resources to find the best possible, compliant node. This decision is made every time a workload is placed in the scheduler’s queue.

But after the pod is scheduled and workload demand fluctuates, there is nothing that asks the question "Is this node still the best place to run this workload?" The only recourse for node congestion is to wait for pods to get evicted, placing them back into the queue. Eviction is not a good strategy for ensuring application performance. Stateful workloads in particular may lose availability as pods are killed, and if the pressure is high enough, not only do all workloads suffer, but the node itself can become unusable, forcing all of its pods to be rescheduled elsewhere, if there is capacity.

Turbonomic Continuous Placement Solution

Analysis with prescriptive actions is needed to manage workloads before node pressure causes eviction, while understanding fluctuating demand to determine where and when additional resources are needed. Turbonomic uniquely solves this problem by continuously analyzing workload demand to drive placement decisions that assure performance and drive efficiency while remaining compliant with placement rules. Turbonomic uses five data dimensions (memory usage, memory requests, CPU usage, CPU requests, and pod density), along with discovered compliance policies for node selection (node labels, taints/tolerations, and explicit affinity or anti-affinity rules), to assess which pod to move, when, and where (which node).

Turbonomic generates Pod Move actions that show the user the main performance, compliance, or efficiency risk being mitigated, along with the impact of this and other actions on the affected nodes.

Figure 1: Turbonomic Pod Move Action to Address Node VMEM Congestion

The user can also see the benefits across all the compute nodes in the cluster through a before-and-after simulation of the executed actions, providing further evidence of the benefit of taking them.

Figure 2: Turbonomic Projection of Node Utilization Improvement Achieved Through Taking Actions

In the event that there is no available compliant node capacity left, Turbonomic will generate a preventative and pre-emptive Node provision action that, when executed, will allow Turbonomic to move pods to this new node, assuring the node’s usage without waiting for pods to get evicted.

Pod Move Actions that Assure Availability

Turbonomic’s Pod Move actions can be executed through Turbonomic, and are designed to coordinate with the Kubernetes Controller that is managing the workload desired state and number of replicas. Turbonomic uses a mediation pod called KubeTurbo that is running within the Kubernetes Cluster to discover, gather data, and execute this action.

KubeTurbo performs pod moves through a series of steps that validate that the workload is running and ready. When Turbonomic executes a move, KubeTurbo first launches a copy of the pod on the determined destination node. KubeTurbo waits for this copy to be running and ready, meeting liveness and readiness probe goals. If the copy does not reach the Ready state, the action is gracefully failed and the reason is logged. When the copy does reach the Ready state, the workload briefly has N+1 replicas running, assuring availability. KubeTurbo then orchestrates deleting the original pod and introducing the copy to the controller, through the use of labels, as the pod that satisfies the desired replica count of N, resulting in the original controller "adopting" the pod spun up by KubeTurbo.

This sequence of events assures availability of the workload, and smoothly handles action failure because the original pod is not deleted until the copy is ready for business!

For workloads that cannot handle an additional replica being spun up, because of RWO data access on a PV or because a backend service will not allow another replica to connect, Turbonomic supports an alternative delete-then-copy move action. Every action executed by Turbonomic is logged, with an audit trail of who executed the action, whether manually invoked or done under automation.

Pod Move Actions Technical Details

Want to know how it works? Review the GitHub project here for details on how the action is executed, how we use labels and leverage the adoption process.

Turbonomic created this move mechanism with the intent to assure there is always a pod ready, by not relying on evictions which are disruptive, and ensure high availability for even singletons. The following picture provides an overview of what we do: 1) make a copy of the pod, 2) wait for the copy to be ready and 3) delete the original and have the parent controller adopt the copy that is running.

(Image: overview of the pod move sequence: copy the pod, wait for the copy to be ready, delete the original and let the parent controller adopt the copy)
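For intuition, the sequence below is the rough manual equivalent of those three steps, expressed with kubectl; kubeturbo performs them programmatically through the Kubernetes API, and the pod names are hypothetical:

# 1) Create a copy of the pod on the destination node (same spec and labels,
#    a new metadata.name, and spec.nodeName set to the target node).
kubectl apply -f pod-copy.yaml

# 2) Wait for the copy to pass its readiness (and liveness) probes.
kubectl wait --for=condition=Ready pod/pod-copy --timeout=120s

# 3) Delete the original; because the copy carries labels matching the
#    controller's selector, the controller adopts it as the Nth replica.
kubectl delete pod original-pod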

Analysis and Execution: Turbonomic will recommend workload move actions taking the following into account:

  • DaemonSet and Mirror (Node controller) pods will not move. The analysis will never recommend moving these pods.
  • StatefulSets by default will fail to move because the attached PV will likely not allow two consumers to read-write at the same time, and today kubeturbo cannot determine whether a StatefulSet can function with two copies running simultaneously. The user can either set up a Turbo Automation Policy on Container Pods, select the default group of all StatefulSets (per cluster), and set Move actions to "Do Not Generate", or let the actions automatically fail due to unsupported workload kind. Turbonomic is working on a definition of support for moving StatefulSets.
  • Deployment and DeploymentConfig kinds with PVs: these moves will fail for the same reason, because we cannot determine whether the PV is RWO or RWX. Consider the alternate Move action execution order of operations described in Pods with PVs below.
  • Custom Controllers may fail execution if Kubeturbo cannot spin up a copy and may require support for that controller. The error message will state that the Custom Controller is not supported. You may open an IBM Idea here: https://ideas.ibm.com/ to request support as a future enhancement.
  • K8s constraints of taints/tolerations, node affinity (node label selector, namespace annotations for nodes), pod to pod affinity/anti-affinity policies. Turbonomic imports these as constraints in the analysis to place the workload.

Turbonomic waits for the copy of the pod being moved to reach readiness before deleting the original pod; only then is the action considered successful. If for any reason the copy does not reach readiness, the move gracefully fails and the original pod is left untouched, running in place.

OpenShift Environments

In OpenShift environments, you will need to supply an argument to kubeturbo for SCC context, or all moves will fail.

  • straight yamls - modify the deployment
spec:
  template:
    spec:
      containers:
        - args:
            - --sccsupport=*
  • helm chart - provide this parameter --set args.sccsupport=*

  • operator - add to the CR

spec:
  args:
    sccsupport: '*' 

Pods with PVs

Pods with PVs that are RWO need an alternative mechanism to relocate the pod onto another compliant node, since two copies of the pod cannot attach the same PV at the same time. For those workloads, we first delete the original pod, which allows the copy to bind to the PV and reach a ready state.

NOTE: StatefulSets currently are not supported by our Pod moves, but will be in a later release as we build out support for moving them.

  • straight yamls - modify the deployment
spec:
  template:
    spec:
      containers:
        - args:
            - --fail-volume-pod-moves=false
  • helm chart - provide this parameter --set args.failVolumePodMoves=false

  • operator - edit and add to the kubeturbo-release CR

spec:
  args:
    failVolumePodMoves: 'false' 
