Custom prediction queries, and replacing metric-adapter with prometheus-adapter #520

Merged
merged 15 commits into from
Sep 17, 2022

@saikey0379 saikey0379 commented Aug 23, 2022

What type of PR is this?

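  • Support custom query statements for Resource-type metrics
  • Use prometheus-adapter in place of metric-adapter to predict custom metrics

Use in multi-tier Prometheus setups

Point prometheus-adapter at the first-tier Prometheus and craned at the second-tier Prometheus.
Note: if a label such as {cluster="test"} is added when data from the first-tier Prometheus is aggregated into the second-tier Prometheus, the data source for the model may stop working.
In that case, take the expressionQuery from the TSP, modify it, and add it as an annotation.

Usage example

APIService changes

Point the external metrics APIService at the native prometheus-adapter: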

apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
  name: v1beta1.external.metrics.k8s.io
spec:
  group: external.metrics.k8s.io
  groupPriorityMinimum: 100
  insecureSkipTLSVerify: true
  service:
    name: prometheus-adapter
    namespace: default
    port: 443
  version: v1beta1
  versionPriority: 100

Querying craned metrics

# curl -sL 10.244.0.134:8080/metrics | grep ^crane | grep -E "cron|predi"| grep -v ^#
crane_autoscaling_cron{resourceIdentifier="cron",targetKind="Deployment",targetName="project-test",targetNamespace="test"} 7 1660202258972
crane_prediction_tsp{algorithm="dsp",resourceIdentifier="external.server_picker",targetKind="Deployment",targetName="project-test",targetNamespace="test"} 124.85 1660202265000
crane_autoscaling_prediction{algorithm="dsp",resourceIdentifier="external.server_picker",targetKind="Deployment",targetName="project-test",targetNamespace="test"} 140.5433349609375 1660202258972
crane_prediction_tsp{algorithm="dsp",resourceIdentifier="resource.cpu",targetKind="Deployment",targetName="project-test",targetNamespace="test"} 3.79116 1660202265000
crane_autoscaling_prediction{algorithm="dsp",resourceIdentifier="resource.cpu",targetKind="Deployment",targetName="project-test",targetNamespace="test"} 4.085559844970703 1660202258972

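Note: resourceIdentifier has the form metricType.metricName, where metricName is the original metric.

  • crane_autoscaling_cron is the metric for cron-based scheduled scaling
  • crane_autoscaling_prediction is the prediction metric for resource-type metrics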
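
The metricType.metricName convention in the note above can be sketched in Go. This is only an illustration: metricIdentifier is a hypothetical helper (the PR's own helper, utils.GetMetricIdentifier, is not reproduced here), and lowercasing the metric type is an assumption based on the sample labels resource.cpu and external.server_picker.

```go
package main

import (
	"fmt"
	"strings"
)

// metricIdentifier joins a metric source type and a metric name as
// "<type>.<name>". Hypothetical helper for illustration only; lowercasing
// the type is an assumption drawn from the sample label values.
func metricIdentifier(metricType, metricName string) string {
	return strings.ToLower(metricType) + "." + metricName
}

func main() {
	fmt.Println(metricIdentifier("Resource", "cpu"))
	fmt.Println(metricIdentifier("External", "server_picker"))
}
```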

prometheus-adapter configuration example

apiVersion: v1
data:
  config.yaml: |
    externalRules:
    - seriesQuery: '{__name__="crane_autoscaling_cron",pod_name!=""}'
      metricsQuery: 'max(<<.Series>>{<<.LabelMatchers>>})'
      resources:
        namespaced: false
    - seriesQuery: '{__name__="crane_autoscaling_prediction",pod_name!=""}'
      metricsQuery: 'max(<<.Series>>{<<.LabelMatchers>>})'
      resources:
        namespaced: false
    - seriesQuery: '{__name__="server_picker",pod_name!=""}'
      metricsQuery: 'sum by(node, route)(rate(<<.Series>>{<<.LabelMatchers>>}[1m]))'
      resources:
        namespaced: false

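Queries for TSP model metrics

  • For the generic CPU query, and for metrics that fit the default external query statement, see the EHPA configuration example below
  • For complex statements, or for metrics that gain labels when data is forwarded, add annotations based on the original statement; the original statement can be read from the TSP
    Reference configuration:
  annotations:
    metric-query.autoscaling.crane.io/resource.cpu: |
      sum(irate(container_cpu_usage_seconds_total{cluster="test",container!="",job=~"^kubernetes-nodes-resource-.*$",container!="POD",namespace="cloud",pod=~"^project-test-.*$"}[3m]))
    metric-query.autoscaling.crane.io/external.server_picker: |
      sum by(node, route)(rate(server_picker{cluster="test",node="test",name="server_picker"}[1m]))
# key:   metric-query.autoscaling.crane.io/ is a fixed prefix, followed by {metricType}.{metricName}
# value: must be the same as the query statement for the corresponding metric in prometheus-adapter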
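
As a rough sketch of how such an annotation could be looked up, the Go snippet below builds the key from the fixed prefix plus the metric identifier and returns the custom query if one is set. The function name expressionQueryFromAnnotations is hypothetical and is not the helper used in the PR; trimming whitespace is an assumption, added because a YAML block scalar written with | ends in a newline.

```go
package main

import (
	"fmt"
	"strings"
)

// Fixed annotation prefix, as given in the PR description.
const metricQueryPrefix = "metric-query.autoscaling.crane.io/"

// expressionQueryFromAnnotations returns the custom query stored under
// prefix + identifier and whether a non-empty query was found.
// Hypothetical helper for illustration only.
func expressionQueryFromAnnotations(identifier string, annotations map[string]string) (string, bool) {
	query, ok := annotations[metricQueryPrefix+identifier]
	// Block scalars written with "|" carry a trailing newline.
	query = strings.TrimSpace(query)
	if !ok || query == "" {
		return "", false
	}
	return query, true
}

func main() {
	annotations := map[string]string{
		metricQueryPrefix + "external.server_picker": "sum by(node, route)(rate(server_picker{cluster=\"test\"}[1m]))\n",
	}
	if q, found := expressionQueryFromAnnotations("external.server_picker", annotations); found {
		fmt.Println("custom query:", q)
	}
	if _, found := expressionQueryFromAnnotations("resource.cpu", annotations); !found {
		fmt.Println("no annotation for resource.cpu; a default query would be built instead")
	}
}
```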

EHPA configuration example

apiVersion: autoscaling.crane.io/v1alpha1
kind: EffectiveHorizontalPodAutoscaler
metadata:
  name: project-test
  namespace: test
spec:
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 660
      policies:
      - type: Percent
        value: 100
        periodSeconds: 15
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
      - type: Percent
        value: 100
        periodSeconds: 15
      - type: Pods
        value: 10
        periodSeconds: 15
      selectPolicy: Max
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: project-test
  minReplicas: 3
  maxReplicas: 25
  scaleStrategy: Preview
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60
  - type: External
    external:
      metric:
        name: server_picker
        selector:
          matchLabels:
            node: "test"
            name: "server_picker"
      target:
        averageValue: 10
        type: AverageValue
  crons:
  - name: "cron1"
    description: "scale up"
    start: "0 * * * *"
    end: "5 * * * *"
    targetReplicas: 4
  prediction:
    predictionWindowSeconds: 600
    predictionAlgorithm:
      algorithmType: dsp
      dsp:
        estimators:
          fft:
            - marginFraction: "0.1"
              minNumOfSpectrumItems: 2
              lowAmplitudeThreshold: "0.2"
              highFrequencyThreshold: "6.0"
        sampleInterval: "15s"
        historyLength: "16d"

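This example enables both cron and prediction; the metrics configured under prediction generate the corresponding TSP metrics.
Applying this YAML automatically creates an HPA and a TSP named with the ehpa- prefix (here, ehpa-project-test).

Inspecting the TSP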

kubectl -n test get tsp ehpa-project-test -o yaml

apiVersion: prediction.crane.io/v1alpha1
kind: TimeSeriesPrediction
metadata:
spec:
  predictionMetrics:
  - algorithm:
    expressionQuery:
      expression: |
        sum(irate(container_cpu_usage_seconds_total{container!="",image!="",container!="POD",namespace="test",pod=~"^xxxxxx-.*$"}[3m])) # resourceMetric is changed to ExpressionQuery so that the actual query is exposed
    resourceIdentifier: resource.cpu
    type: ExpressionQuery
  - algorithm:
    expressionQuery:
      expression: |
        sum by(node, route)(rate(server_picker{node="test",name="server_picker"}[1m]))
    resourceIdentifier: server_picker
    type: ExpressionQuery
  predictionWindowSeconds: 600

The resourceIdentifier value is used as a matchLabel on the prediction metric, which is how each prediction metric is matched.
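
A minimal Go sketch of that matching, assuming the four label keys that appear in the matchLabels elsewhere in this PR (targetKind, targetName, targetNamespace, resourceIdentifier). Both function names are hypothetical and use plain maps rather than the Kubernetes selector types.

```go
package main

import "fmt"

// predictionSelectorLabels builds the label set an External metric selector
// would carry for one predicted metric. Hypothetical helper for illustration.
func predictionSelectorLabels(targetKind, targetName, targetNamespace, resourceIdentifier string) map[string]string {
	return map[string]string{
		"targetKind":         targetKind,
		"targetName":         targetName,
		"targetNamespace":    targetNamespace,
		"resourceIdentifier": resourceIdentifier,
	}
}

// matches reports whether every selector label is present in seriesLabels
// with the same value. Extra labels on the series are ignored.
func matches(selector, seriesLabels map[string]string) bool {
	for key, want := range selector {
		got, ok := seriesLabels[key]
		if !ok || got != want {
			return false
		}
	}
	return true
}

func main() {
	selector := predictionSelectorLabels("Deployment", "project-test", "test", "resource.cpu")
	series := map[string]string{
		"algorithm":          "dsp",
		"resourceIdentifier": "resource.cpu",
		"targetKind":         "Deployment",
		"targetName":         "project-test",
		"targetNamespace":    "test",
	}
	fmt.Println(matches(selector, series))
}
```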

Checking the HPA status

# kubectl -n test describe hpa ehpa-project-test
Name:                                                                                     ehpa-project-test
Namespace:                                                                                test
Labels:                                                                                   app.kubernetes.io/managed-by=effective-hpa-controller
                                                                                          autoscaling.crane.io/effective-hpa-uid=c3783ed1-0d0e-45f8-af52-94c0771f766c
Annotations:                                                                              <none>
CreationTimestamp:                                                                        Thu, 11 Aug 2022 15:07:23 +0800
Reference:                                                                                Substitute/ehpa-project-test
Metrics:                                                                                  ( current / target )
  "server_picker" (target average value):                                                 5251m / 10
  "crane_autoscaling_prediction" (target average value):                                  162m / 300m
  "crane_autoscaling_prediction" (target average value):                                  5715m / 10
  "crane_autoscaling_cron" (target average value):                                        200m / 1
  resource cpu on pods  (as a percentage of request):                                     29% (147m) / 60%
Min replicas:                                                                             3
Max replicas:                                                                             25

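The HPA now lists five metrics (crane_autoscaling_prediction appears twice):

  • "server_picker" # base external metric
  • "crane_autoscaling_prediction" # prediction metric from crane's TSP model; resource, pods and external source metrics can all be queried
  • "crane_autoscaling_cron" # crane's cron schedule metric
  • resource cpu on pods # base CPU metric

Checking the HPA configuration

# kubectl get HorizontalPodAutoscaler.v2beta2.autoscaling -ntest ehpa-project-test -o yaml

The corresponding metrics are shown below; they are fetched through prometheus-adapter.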

  - external:
      metric:
        name: crane_autoscaling_prediction
        selector:
          matchLabels:
            resourceIdentifier: resource.cpu
            targetKind: Deployment
            targetName: project-test
            targetNamespace: test
      target:
        averageValue: 300m
        type: AverageValue
    type: External
  - external:
      metric:
        name: crane_autoscaling_prediction
        selector:
          matchLabels:
            resourceIdentifier: external.server_picker
            targetKind: Deployment
            targetName: project-test
            targetNamespace: test
      target:
        averageValue: "10"
        type: AverageValue
    type: External
  - external:
      metric:
        name: crane_autoscaling_cron
        selector:
          matchLabels:
            resourceIdentifier: cron
            targetKind: Deployment
            targetName: project-test
            targetNamespace: test
      target:
        averageValue: "1"
        type: AverageValue
    type: External

End

github-actions bot commented Aug 23, 2022

🎉 Successfully Build Images.
Now Support ARM Platforms.
Comment Post Time: 2022-09-16 18:25
Git Version: b2a3ba4

Docker Registry

Overview: https://hub.docker.com/u/gocrane

Image Pull Command
crane-agent:pr-520-b2a3ba4 docker pull gocrane/crane-agent:pr-520-b2a3ba4
dashboard:pr-520-b2a3ba4 docker pull gocrane/dashboard:pr-520-b2a3ba4
metric-adapter:pr-520-b2a3ba4 docker pull gocrane/metric-adapter:pr-520-b2a3ba4
craned:pr-520-b2a3ba4 docker pull gocrane/craned:pr-520-b2a3ba4

Quick Deploy - Helm

helm repo add crane https://finops-helm.pkg.coding.net/gocrane/gocrane
helm install crane -n crane-system --create-namespace \
                   --set craned.image.repository=gocrane/craned \
                   --set craned.image.tag=pr-520-b2a3ba4 \
                   --set metricAdapter.image.repository=gocrane/metric-adapter \
                   --set metricAdapter.image.tag=pr-520-b2a3ba4 \
                   --set craneAgent.image.repository=gocrane/crane-agent \
                   --set craneAgent.image.tag=pr-520-b2a3ba4 \
                   --set cranedDashboard.image.repository=gocrane/dashboard \
                   --set cranedDashboard.image.tag=pr-520-b2a3ba4 crane/crane

Coding Registry

Overview: https://finops.coding.net/public-artifacts/gocrane/crane/packages

Image Pull Command
crane-agent:pr-520-b2a3ba4 docker pull finops-docker.pkg.coding.net/gocrane/crane/crane-agent:pr-520-b2a3ba4
dashboard:pr-520-b2a3ba4 docker pull finops-docker.pkg.coding.net/gocrane/crane/dashboard:pr-520-b2a3ba4
metric-adapter:pr-520-b2a3ba4 docker pull finops-docker.pkg.coding.net/gocrane/crane/metric-adapter:pr-520-b2a3ba4
craned:pr-520-b2a3ba4 docker pull finops-docker.pkg.coding.net/gocrane/crane/craned:pr-520-b2a3ba4

Quick Deploy - Helm

helm repo add crane https://finops-helm.pkg.coding.net/gocrane/gocrane
helm install crane -n crane-system --create-namespace \
                   --set craned.image.repository=finops-docker.pkg.coding.net/gocrane/crane/craned \
                   --set craned.image.tag=pr-520-b2a3ba4 \
                   --set metricAdapter.image.repository=finops-docker.pkg.coding.net/gocrane/crane/metric-adapter \
                   --set metricAdapter.image.tag=pr-520-b2a3ba4 \
                   --set craneAgent.image.repository=finops-docker.pkg.coding.net/gocrane/crane/crane-agent \
                   --set craneAgent.image.tag=pr-520-b2a3ba4 \
                   --set cranedDashboard.image.repository=finops-docker.pkg.coding.net/gocrane/crane/dashboard \
                   --set cranedDashboard.image.tag=pr-520-b2a3ba4 crane/crane

Ghcr Registry

Overview: https://github.com/orgs/gocrane/packages?repo_name=crane

Image Pull Command
crane-agent:pr-520-b2a3ba4 docker pull ghcr.io/gocrane/crane/crane-agent:pr-520-b2a3ba4
dashboard:pr-520-b2a3ba4 docker pull ghcr.io/gocrane/crane/dashboard:pr-520-b2a3ba4
metric-adapter:pr-520-b2a3ba4 docker pull ghcr.io/gocrane/crane/metric-adapter:pr-520-b2a3ba4
craned:pr-520-b2a3ba4 docker pull ghcr.io/gocrane/crane/craned:pr-520-b2a3ba4

Quick Deploy - Helm

helm repo add crane https://finops-helm.pkg.coding.net/gocrane/gocrane
helm install crane -n crane-system --create-namespace \
                   --set craned.image.repository=ghcr.io/gocrane/crane/craned \
                   --set craned.image.tag=pr-520-b2a3ba4 \
                   --set metricAdapter.image.repository=ghcr.io/gocrane/crane/metric-adapter \
                   --set metricAdapter.image.tag=pr-520-b2a3ba4 \
                   --set craneAgent.image.repository=ghcr.io/gocrane/crane/crane-agent \
                   --set craneAgent.image.tag=pr-520-b2a3ba4 \
                   --set cranedDashboard.image.repository=ghcr.io/gocrane/crane/dashboard \
                   --set cranedDashboard.image.tag=pr-520-b2a3ba4 crane/crane

@saikey0379 saikey0379 changed the title from "Implement EHPA scaling with crand and the native prometheus-adapter" to "Custom prediction queries, and replacing metric-adapter with prometheus-adapter" on Aug 24, 2022
var averageValue *resource.Quantity
switch metric.Type {
case autoscalingv2.ResourceMetricSourceType:
switch metric.Resource.Name {
Member

metricIdentifier = utils.GetMetricIdentifier(metric, metric.Resource.Name.String())

no need for switch case

Contributor Author

done

metricsForPrediction = append(metricsForPrediction, metricSpec)
//first get annocation
expressionQuery = utils.GetExpressionQueryAnnocation(metricIdentifier, ehpa.Annotations)
//if annocation not matched, build expressionQuery by metric and ehpa.TargetName
Member

typo: "annocation" should be "annotation"

Contributor Author

done

case autoscalingv2.ResourceMetricSourceType:
switch metric.Resource.Name {
case "cpu":
metricIdentifier = utils.GetMetricIdentifier(metric, v1.ResourceCPU.String())
Member

no need for switch case

Contributor Author

done

pkg/controller/ehpa/predict.go (outdated, resolved)
pkg/metricprovider/custom_metric_provider.go (resolved)
@@ -229,7 +235,9 @@ func GetPrediction(ctx context.Context, kubeclient client.Client, namespace stri
matchingLabels := client.MatchingLabels(map[string]string{"app.kubernetes.io/managed-by": known.EffectiveHorizontalPodAutoscalerManagedBy})
// merge metric selectors
for key, value := range labelSelector {
matchingLabels[key] = value
if key == "targetName" {
Member

What is this for?

Contributor Author

Because the labelSelector was changed from autoscaling.crane.io/effective-hpa-uid to:
Selector: &metav1.LabelSelector{
MatchLabels: map[string]string{
"targetKind": ehpa.Spec.ScaleTargetRef.Kind,
"targetName": ehpa.Spec.ScaleTargetRef.Name,
"targetNamespace": ehpa.Namespace,
"resourceIdentifier": metricIdentifier,
},
},

Contributor Author

From the parameter we get metricSelector, which needs to match the prediction's labels:
Labels: map[string]string{
"app.kubernetes.io/name": name,
"app.kubernetes.io/part-of": ehpa.Name,
"app.kubernetes.io/managed-by": known.EffectiveHorizontalPodAutoscalerManagedBy,
known.EffectiveHorizontalPodAutoscalerUidLabel: string(ehpa.UID),
},
So we can change effective-hpa-uid to "app.kubernetes.io/part-of": ehpa.Name.
We could also add a label such as ["app.kubernetes.io/target-name": ehpa.OwnerReferences[0].Name],
which would be more accurate.

pkg/metricprovider/custom_metric_provider.go (resolved)
@@ -219,7 +225,7 @@ func IsLocalCustomMetric(metricInfo provider.CustomMetricInfo, client client.Cli
return false
}

-func GetPrediction(ctx context.Context, kubeclient client.Client, namespace string, metricSelector labels.Selector) (*predictionapi.TimeSeriesPrediction, error) {
+func GetPredictions(ctx context.Context, kubeclient client.Client, namespace string, metricSelector labels.Selector) ([]predictionapi.TimeSeriesPrediction, error) {
Member

move to external_metric_provider.go

}

-return &predictionList.Items[0], nil
+return predictionList.Items, nil
Member

Why return a list of TSPs? Is there a case where a single metric query matches more than one TSP?

Contributor Author

When we query the TSP by effective-hpa-uid, the match is exact. When we query by target, however, there may be multiple TSPs if the user has misconfigured several EHPAs for the same target. Matching on both the TSP and the metricIdentifier still works in that case, and any further processing can be left to the HPA.
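
The idea of tolerating several TSPs and narrowing by identifier can be sketched as follows. The prediction and predictionMetric types are simplified stand-ins invented for this example, not the crane API types, and metricsFor is a hypothetical name; the actual handling in the PR may differ.

```go
package main

import "fmt"

// predictionMetric and prediction are simplified stand-ins for the fields of
// a TimeSeriesPrediction that matter here. They are not the crane API types.
type predictionMetric struct {
	ResourceIdentifier string
	Value              float64
}

type prediction struct {
	Name    string
	Metrics []predictionMetric
}

// metricsFor collects, across all given predictions, every metric whose
// ResourceIdentifier equals the requested one. With a single matching TSP it
// returns at most that TSP's metric; with several, it returns one per TSP.
func metricsFor(predictions []prediction, resourceIdentifier string) []predictionMetric {
	var out []predictionMetric
	for _, p := range predictions {
		for _, m := range p.Metrics {
			if m.ResourceIdentifier == resourceIdentifier {
				out = append(out, m)
			}
		}
	}
	return out
}

func main() {
	predictions := []prediction{
		{Name: "ehpa-a", Metrics: []predictionMetric{
			{ResourceIdentifier: "resource.cpu", Value: 4.08},
			{ResourceIdentifier: "external.server_picker", Value: 140.5},
		}},
		{Name: "ehpa-b", Metrics: []predictionMetric{
			{ResourceIdentifier: "resource.cpu", Value: 3.79},
		}},
	}
	for _, m := range metricsFor(predictions, "resource.cpu") {
		fmt.Printf("%s = %v\n", m.ResourceIdentifier, m.Value)
	}
}
```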

if v.Timestamp < timestampStart.Unix() || v.Timestamp > timestampEnd.Unix() {
continue
for _, prediction := range predictions {
resourceIdentifier, bl := metricSelector.RequiresExactMatch("resourceIdentifier")
Member

What does bl mean? Better to rename it to found for every metricSelector.RequiresExactMatch(xxxx) call.

Contributor Author

The call changed from utils.GetReadyPredictionMetric(info.Metric, prediction) to
utils.GetReadyPredictionMetric(info.Metric, resourceIdentifier, &prediction),
so resourceIdentifier has to be fetched here; checking bl is the same as checking resourceIdentifier == "".

-if strings.HasPrefix(info.Metric, "crane") {
-prediction, err := GetPrediction(ctx, p.client, namespace, metricSelector)
+case known.MetricNamePrediction:
+predictions, err := GetPredictions(ctx, p.client, namespace, metricSelector)
Member

Is it possible for this to return more than one TSP?

@@ -149,48 +171,13 @@ func (p *ExternalMetricProvider) GetCronExternalMetrics(ctx context.Context, nam
if len(errs) > 0 {
return nil, fmt.Errorf("%v", errs)
}
replicas := DefaultCronTargetMetricValue
Member

Why was this part removed?

Contributor Author

DefaultCronTargetMetricValue is set to 1.
Changed to minScales.

"strings"
"time"

autoscalingapi "github.com/gocrane/api/autoscaling/v1alpha1"

pmMap[pm.ResourceIdentifier] = pm
}

//* collected metric "crane_prediction_time_series_prediction_metric" { label:<name:"aggregateKey" value:"nodes-mem#instance=192.168.56.166:9100" > label:<name:"algorithm" value:"percentile" > label:<name:"expressionQuery" value:"" > label:<name:"rawQuery" value:"sum(node_memory_MemTotal_bytes{} - node_memory_MemAvailable_bytes{}) by (instance)" > label:<name:"resourceIdentifier" value:"nodes-mem" > label:<name:"resourceQuery" value:"" > label:<name:"targetKind" value:"Node" > label:<name:"targetName" value:"192.168.56.166" > label:<name:"targetNamespace" value:"" > label:<name:"type" value:"RawQuery" > gauge:<value:1.82784510645e+06 > timestamp_ms:1639466220000 } was collected before with the same name and label values
Member

Please update the comments to reflect the newest design.

if err != nil {
klog.Errorf("Failed to list ehpa: %v", err)
}
var uniqPredictionMetrics []string
Member

Use a map to keep the metrics unique.
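
For reference, map-based deduplication of a string slice commonly looks like the sketch below. uniqueStrings is a hypothetical name and the sample keys are made up; this shows the general pattern the reviewer is asking for, not the code that was eventually merged.

```go
package main

import "fmt"

// uniqueStrings returns the input values with duplicates removed, keeping
// the first occurrence of each value and preserving input order.
func uniqueStrings(in []string) []string {
	seen := make(map[string]struct{}, len(in))
	out := make([]string, 0, len(in))
	for _, s := range in {
		if _, ok := seen[s]; ok {
			continue // already emitted
		}
		seen[s] = struct{}{}
		out = append(out, s)
	}
	return out
}

func main() {
	// Made-up metric keys for illustration.
	keys := []string{"resource.cpu", "external.server_picker", "resource.cpu"}
	fmt.Println(uniqueStrings(keys))
}
```

An empty struct is used as the map value because only key membership matters, so the values take no extra space.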

// ContainerMemUsageExprTemplate is used to query container cpu usage by promql, param is namespace,pod,container
ContainerMemUsageExprTemplate = `container_memory_working_set_bytes{container!="POD",namespace="%s",pod=~"^%s.*$",container="%s"}`
"github.com/gocrane/crane/pkg/utils"
"k8s.io/apimachinery/pkg/util/sets"
Member

Please format imports

case "memory":
expressionQuery = GetWorkloadMemUsageExpression(namespace, name)
}
case autoscalingv2.PodsMetricSourceType:
Member

We don't use PodsMetricSourceType now?

Contributor Author

just for ExpressionQuery

Member

@qmhu qmhu left a comment

/LGTM good job!

@qmhu qmhu merged commit de1bbfd into gocrane:main Sep 17, 2022