Refactor analyze and merge for crane agent #171

Merged
merged 1 commit into from
Mar 2, 2022

Conversation

chenkaiyue
Copy link
Contributor

Refactor crane agent

@chenkaiyue chenkaiyue changed the title refactor analyze and merge for crane agent [WIP]Refactor analyze and merge for crane agent Mar 1, 2022
@chenkaiyue chenkaiyue force-pushed the refactorAgent branch 3 times, most recently from 7c152cb to 6abc74c Compare March 1, 2022 08:39
series := s.getTimeSeriesFromMap(state, object.MetricRule.Selector)
func (s *AnormalyAnalyzer) getImpacted(ts common.TimeSeries) []types.NamespacedName {
//TODO: basicThrottleQosPriority be a input para to be a vara in AnormalyAnalyzer
var basicThrottleQosPriority = executor.ClassAndPriority{PodQOSClass: v1.PodQOSBestEffort, PriorityClassValue: 0}
Copy link
Contributor

Is this still a TODO? As discussed offline, I don't think we should have the ClassAndPriority struct here, since more factors will be involved in the future, such as the resource request.

Copy link
Contributor Author

@chenkaiyue chenkaiyue Mar 1, 2022

OK, I have moved the container metrics out, so there is no longer a baseline and this func is gone.

pkg/ensurance/analyzer/analyzer.go Outdated Show resolved Hide resolved
var impacted []types.NamespacedName
var triggered, threshold bool
for _, ts := range series {
triggered = s.evaluator.EvalWithMetric(object.MetricRule.Name, float64(object.MetricRule.Value.Value()), ts.Samples[0].Value)
Copy link
Contributor

Would you also add a follow-up here? Currently the evaluator is fixed as an expression; it should be an interface, with the actual method defined in the CRD.
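
A rough sketch of what such an interface could look like, for discussion only. The method signature mirrors the EvalWithMetric call quoted above; the `Evaluator` interface, `thresholdEvaluator`, `newEvaluator`, and the `"threshold"` kind string are hypothetical names, not part of this PR or of the crane API.

```go
package main

import "fmt"

// Evaluator decides whether a metric sample triggers a rule.
// Hypothetical interface: in the PR the evaluator is a fixed, concrete type.
type Evaluator interface {
	EvalWithMetric(metricName string, threshold float64, current float64) bool
}

// thresholdEvaluator is one possible implementation: it triggers
// when the current value is strictly above the threshold.
type thresholdEvaluator struct{}

func (thresholdEvaluator) EvalWithMetric(_ string, threshold, current float64) bool {
	return current > threshold
}

// newEvaluator picks an implementation by kind. In this sketch the kind
// is assumed to come from a CRD field; unknown kinds are rejected.
func newEvaluator(kind string) (Evaluator, error) {
	switch kind {
	case "", "threshold":
		return thresholdEvaluator{}, nil
	default:
		return nil, fmt.Errorf("unknown evaluator kind %q", kind)
	}
}

func main() {
	e, err := newEvaluator("threshold")
	if err != nil {
		panic(err)
	}
	// Metric name and values are made up for the example.
	fmt.Println(e.EvalWithMetric("cpu_total_usage", 4000, 4200))
}
```

With this shape the analyzer would depend only on the interface, so other evaluation methods could be added later without touching the call site.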

Copy link
Contributor Author

To be accomplished:

  1. After an action is triggered, there is no longer a baseline to filter pods, so every pod is selected as a candidate;
  2. Use metrics of more dimensions to sort the pods;
  3. In the executor, calculate the diff to the water level, select pods from the sorted list according to that diff, and end the process.

pkg/ensurance/analyzer/analyzer.go Outdated Show resolved Hide resolved
pkg/ensurance/analyzer/analyzer.go Outdated Show resolved Hide resolved
pkg/ensurance/analyzer/analyzer.go Outdated Show resolved Hide resolved
}
//step1 filter dry run detections
var dcsFiltered []ecache.DetectionCondition
dcsFiltered = s.filterDryRunDetections(dcs)
Copy link
Contributor

I am wondering whether we need to filter dry run detections, as dry run is an action type IMO.

pkg/ensurance/analyzer/analyzer.go Outdated Show resolved Hide resolved
pkg/ensurance/executor/evict.go Outdated Show resolved Hide resolved
pkg/ensurance/executor/throttle.go Outdated Show resolved Hide resolved
examples/ensurance/waterline3.yaml Show resolved Hide resolved
pkg/ensurance/executor/throttle.go Outdated Show resolved Hide resolved
pkg/ensurance/executor/evict.go Outdated Show resolved Hide resolved
pkg/ensurance/collector/collector.go Outdated Show resolved Hide resolved
pkg/ensurance/analyzer/analyzer.go Outdated Show resolved Hide resolved
pkg/ensurance/analyzer/analyzer.go Outdated Show resolved Hide resolved
pkg/ensurance/analyzer/analyzer.go Outdated Show resolved Hide resolved
pkg/ensurance/analyzer/analyzer.go Outdated Show resolved Hide resolved
pkg/ensurance/analyzer/analyzer.go Show resolved Hide resolved
pkg/ensurance/analyzer/analyzer.go Outdated Show resolved Hide resolved
@chenkaiyue
Copy link
Contributor Author

chenkaiyue commented Mar 1, 2022

To be accomplished:

  1. After an action is triggered, there is no longer a baseline to filter pods, so every pod is selected as a candidate;
  2. Use metrics of more dimensions to sort the pods;
  3. In the executor, calculate the diff to the water level, select pods from the sorted list according to that diff, and end the process.
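
The three steps above could be sketched roughly as follows. This is only an illustration under assumptions: the `podUsage` type, its fields, the sort order (lowest priority first, then highest CPU usage), and `selectPods` are all hypothetical and are not taken from the PR.

```go
package main

import (
	"fmt"
	"sort"
)

// podUsage is a hypothetical record; field names are assumptions for this sketch.
type podUsage struct {
	Name     string
	Priority int32   // lower value is acted on first
	CPUUsage float64 // same unit as the water level
}

// selectPods treats every pod as a candidate, sorts the candidates
// (lowest priority first, then highest usage), and picks pods until their
// combined usage covers the gap between node usage and the water level.
func selectPods(pods []podUsage, nodeUsage, waterLevel float64) []podUsage {
	gap := nodeUsage - waterLevel
	if gap <= 0 {
		return nil // already at or below the water level
	}
	// Copy so the caller's slice order is left untouched.
	sorted := append([]podUsage(nil), pods...)
	sort.SliceStable(sorted, func(i, j int) bool {
		if sorted[i].Priority != sorted[j].Priority {
			return sorted[i].Priority < sorted[j].Priority
		}
		return sorted[i].CPUUsage > sorted[j].CPUUsage
	})
	var picked []podUsage
	for _, p := range sorted {
		if gap <= 0 {
			break // the diff is covered, end the process
		}
		picked = append(picked, p)
		gap -= p.CPUUsage
	}
	return picked
}

func main() {
	// Made-up numbers for illustration.
	pods := []podUsage{
		{Name: "middle", Priority: -1, CPUUsage: 2000},
		{Name: "low", Priority: -100, CPUUsage: 3000},
	}
	for _, p := range selectPods(pods, 6000, 4000) {
		fmt.Println(p.Name)
	}
}
```

In this example the gap is 2000, so only the lowest-priority pod is picked, because its usage alone covers the diff.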

@chenkaiyue chenkaiyue force-pushed the refactorAgent branch 2 times, most recently from 4d35082 to 99910a0 Compare March 2, 2022 02:50
if !triggered {
continue
}
klog.V(4).Infof("key %s, threshold %v", key, threshold)
Copy link
Contributor

Please refine this log; a log message should be a sentence.

Copy link
Contributor Author

Done

ts.Samples[0].Value,
common.GetValueByName(ts.Labels, common.LabelNamePodNamespace),
common.GetValueByName(ts.Labels, common.LabelNamePodName))
//step2: use opa to check if triggered and get impacted pods for container MetricRule
Copy link
Contributor

OPA is not implemented yet; I suggest removing the opa keyword to avoid confusion.

Copy link
Contributor Author

Done

return dc, err
func (s *AnormalyAnalyzer) computeActionContext(threshold bool, key string, object ensuranceapi.ObjectiveEnsurance, ac *ecache.ActionContext) {
if threshold {
s.restored[key] = 0
Copy link
Contributor

Please double-check whether this is thread safe. Analyze is executed in a separate goroutine for each state, so if one Analyze run takes a long time, two goroutines may access the same s.restored map, resulting in unexpected behavior or a panic.
It's OK to have a follow-up PR.

case state := <-s.stateChann:
    go s.Analyze(state)

Copy link
Contributor Author

@chenkaiyue chenkaiyue Mar 2, 2022

Yeah, this is not thread safe. It is not only restored: triggered and actionEventStatus are maps too. I will make another PR to fix this.

cruntime "github.com/gocrane/crane/pkg/ensurance/runtime"
"github.com/gocrane/crane/pkg/utils"
)

const (
MAX_UP_QUOTA = 60 * 1000 // 60CU
Copy link
Contributor

I suggest using camel case for constants as well.

Copy link
Contributor Author

Done.

@chenkaiyue
Copy link
Contributor Author

chenkaiyue commented Mar 2, 2022

For scheduler action:
image
image

@chenkaiyue
Copy link
Contributor Author

chenkaiyue commented Mar 2, 2022

For the throttle action:
we use two pods:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: greedy
description: Priority for low level.
value: -100

---

apiVersion: v1
kind: Pod
metadata:
  name: low
spec:
  containers:
    - image: docker.io/gocrane/stress-ng:v0.12.09
      imagePullPolicy: Always
      name: low
      command:
      - stress-ng
      - -c
      - "3"
      - --cpu-method
      - cpuid
  priorityClassName: greedy

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: middle
description: Priority for middle level.
value: -1

---

apiVersion: v1
kind: Pod
metadata:
  name: middle
spec:
  containers:
    - image: docker.io/gocrane/stress-ng:v0.12.09
      imagePullPolicy: Always
      name: middle
      command:
        - /bin/bash
        - -c
        - "sleep 36000"
      resources:
        requests:
          memory: 2Gi
          cpu: 1
        limits:
          memory: 15Gi
          cpu: 8
  priorityClassName: middle

and start two stress-ng processes in the middle pod;
the CPU usage in the action is 4000
image

After a while, the CPU quota of the middle pod drops from 8 to 4.2:
image

and the CPU usage on the node is:
wecom-temp-5100d8d7a187a2a0165ab5b5413229d5
Calculate the change in cpuacct.usage over a period of time:
wecom-temp-67252523d752101d946d83af71d0675c

@yan234280533
Copy link

/lgtm

1 similar comment
@mfanjie
Copy link
Contributor

mfanjie commented Mar 2, 2022

/lgtm

@mfanjie
Copy link
Contributor

mfanjie commented Mar 2, 2022

conditionType: analyzed-pressure should follow the same style as the existing ones, like PIDPressure.
It would also be better to show what kind of pressure was identified instead of analyzed-pressure, but I am open on this; please consider which way is best.

@mfanjie mfanjie merged commit d3a4847 into gocrane:main Mar 2, 2022
@mfanjie mfanjie changed the title [WIP]Refactor analyze and merge for crane agent Refactor analyze and merge for crane agent Mar 2, 2022