
Withdraw resource when vm creation is failed #289

Closed
kimeunju108 wants to merge 11 commits into from

Conversation

kimeunju108
Collaborator

What type of PR is this?

/kind feature

What this PR does / why we need it:
This PR proposes to withdraw the site (cluster) resource allocated to a pod when the pod's VM creation fails.

Which issue(s) this PR fixes:
Fixes #286

Special notes for your reviewer:

Does this PR introduce a user-facing change?:
NONE

How to test
~/go/src/k8s.io/arktos$ ./hack/globalscheduler/globalscheduler-up.sh
Open a new terminal, then run:
$ cd ~/go/src/k8s.io/arktos/globalscheduler/test/yaml
$ kubectl apply -f sample_6_clusters.yaml
$ kubectl apply -f sample_2_schedulers.yaml
$ kubectl apply -f sample_2_distributors.yaml
$ kubectl apply -f sample_2_dispatchers.yaml
$ kubectl apply -f sample_6_pods.yaml
$ kubectl get pods

@@ -217,12 +215,12 @@ func (p *Process) SendPodToCluster(pod *v1.Pod) {
 	go func() {
 		instanceId, err := openstack.ServerCreate(host, token, &pod.Spec)
 		if err == nil {
-			klog.V(3).Infof("The openstack vm for the pod %v has been created at the host %v", pod.ObjectMeta.Name, host)
+			klog.Infof("The openstack vm for the pod %v has been created at the host %v", pod.ObjectMeta.Name, host)
Collaborator

The dispatcher process log comes with a log level. Why change this to Infof without a log level?

Collaborator Author

That happened by accident; the original line has been restored.
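
For context, a minimal sketch of the failure path this PR is about, built around the diff above. The withdrawSiteResource helper is a hypothetical name used only for illustration, not the actual implementation:

go func() {
	instanceId, err := openstack.ServerCreate(host, token, &pod.Spec)
	if err != nil {
		// VM creation failed: give the site (cluster) resource that was
		// allocated to this pod back, so other pods can use it.
		klog.Errorf("creating the openstack vm for pod %v failed: %v", pod.ObjectMeta.Name, err)
		withdrawSiteResource(pod) // hypothetical helper, not the actual code
		return
	}
	klog.V(3).Infof("The openstack vm for the pod %v has been created at the host %v", pod.ObjectMeta.Name, host)
	_ = instanceId // success path continues here
}()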

@@ -80,6 +80,13 @@ type ScheduleResult struct {
 	FeasibleSites int // Number of feasible site on one stack scheduled
 }
 
+type PodSiteResourceAllocation struct {
Collaborator
@jshaofuturewei May 18, 2021

The name is confusing. There is no allocation object at all

Collaborator Author

The struct name has been changed to "PodSiteResource".
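
As a rough illustration, such a struct needs just enough information to return the allocation later; the field names below are assumptions, not the actual definition:

// PodSiteResource ties a pod to the site (cluster) resource reserved for it,
// so the reservation can be withdrawn if VM creation fails.
// Field names are illustrative only.
type PodSiteResource struct {
	PodName  string // pod the resource was reserved for
	SiteID   string // site (cluster) the resource was taken from
	FlavorID string // flavor that determined the reserved amount
}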

	if pod, ok := t.Obj.(*v1.Pod); ok {
		return failedToSchedule(pod) && responsibleForPod(pod, sched.SchedulerName)
	}
	utilruntime.HandleError(fmt.Errorf("unable to convert object %T to *v1.Pod in %T", obj, sched))
Collaborator

%T prints the type of the object. Is this a typo for %v?

Collaborator Author

%T is the type of the variable.

Collaborator

So it is equivalent to ".(type)", right? We already know it is cache.DeletedFinalStateUnknown. Do we need to know the sched type as well?

Collaborator Author

This part is from HQ, and I would rather not change it if possible.

Collaborator

I am not sure we should copy code like that. We don't need the sched type at all.
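
As a standalone illustration of the verbs under discussion (not project code): %T prints a value's dynamic type, while %v prints the value itself.

package main

import "fmt"

type scheduler struct{ SchedulerName string }

func main() {
	var obj interface{} = 42
	sched := &scheduler{SchedulerName: "gs-1"}

	fmt.Printf("%T\n", obj)   // int
	fmt.Printf("%v\n", obj)   // 42
	fmt.Printf("%T\n", sched) // *main.scheduler
	fmt.Printf("%v\n", sched) // &{gs-1}
}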

		utilruntime.HandleError(fmt.Errorf("unable to convert object %T to *v1.Pod in %T", obj, sched))
		return false
	default:
		utilruntime.HandleError(fmt.Errorf("unable to handle object in %T: %T", sched, obj))
Collaborator

%T prints the type of the object. Is this a typo for %v?

Collaborator Author

%T is the type of the variable.

Collaborator

We don't need to know sched's type.

Collaborator Author

This code is from HQ.

	verified = false
	name := pod.Name
	flavors := pod.Spec.VirtualMachine.Flavors
	if pod.Name == "" || flavors == nil {
Collaborator

A pod can be a container pod without any flavor information

Collaborator Author

The flavor check has been removed.
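
A hedged sketch of the kind of check being discussed, assuming Spec.VirtualMachine is a pointer that is nil for container pods (the error messages and surrounding function are illustrative, not the actual code):

if pod.Name == "" {
	return fmt.Errorf("pod name must not be empty")
}
// A container pod legitimately has no VirtualMachine section, so only require
// flavors when the pod actually describes a VM.
if pod.Spec.VirtualMachine != nil && pod.Spec.VirtualMachine.Flavors == nil {
	return fmt.Errorf("vm pod %s has no flavor information", pod.Name)
}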

@@ -310,7 +310,7 @@ func (f *Flavor) Filter(ctx context.Context, cycleState *interfaces.CycleState,
 	var isCommonMatch, _ = isComFlavorMatch(flavorMap, siteCacheInfo)
 	var isSpotMatch, _ = isSpotFlavorMatch(spotFlavorMap, siteCacheInfo)
 	if isCommonMatch && isSpotMatch {
-		klog.Infof("*** isCommonMatch:%v, isSpotMatch:%v ", isCommonMatch, isSpotMatch)
+		klog.Infof("isCommonMatch:%v, isSpotMatch:%v ", isCommonMatch, isSpotMatch)
Collaborator

There is Infof with a log level elsewhere. I am not sure the scheduler starts with any log level set, though.

Collaborator Author

Not yet. I tried, but it didn't work yet.

Collaborator

In the rest of your code you add a log level, while this part still uses Infof without a level. I am not sure whether the other parts work.

Collaborator Author
@kimeunju108 May 18, 2021

Infof() here is again from the original HQ source code. If we decide to change it, that is not a problem; we can discuss it in today's meeting. But then we may have to change 800 files, and I am not sure that is fine.
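
For reference, a standalone sketch of how klog verbosity interacts with the -v flag (klog v1 import path assumed; not project code):

package main

import (
	"flag"

	"k8s.io/klog"
)

func main() {
	// klog reads its verbosity from the -v flag, e.g. ./scheduler -v=3
	klog.InitFlags(nil)
	flag.Parse()

	klog.Infof("always printed, no level attached")
	klog.V(3).Infof("printed only when started with -v=3 or higher")
	klog.Flush()
}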

@@ -643,20 +653,28 @@ func (n *SiteCacheInfo) updateSiteFlavor(resourceTypes []string, regionFlavors m
 	}
 }
 
-func (n *SiteCacheInfo) deductFlavor() {
+func (n *SiteCacheInfo) updateFlavorCount(deduct bool) {
 	var m int64
Collaborator

My big concern here is that maps are not thread safe in Go. Do we need to add a mutex here? We could add a mutex in the code that calls this function, but people might not be cautious enough and could update the map directly.

Collaborator Author

A mutex is used for this map.

Collaborator
@jshaofuturewei May 18, 2021

A Go map does not come with a mutex. If I call updateFlavorCount twice at the same time, does it work as expected?

Collaborator Author

updateFlavorCount() is a private function; it is only called by updateSiteFlavor(), and the mutex is taken there.
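
A standalone sketch of that locking pattern, where the caller takes the lock and the private helper assumes it is held (field and method names are illustrative, not the real SiteCacheInfo):

package main

import (
	"fmt"
	"sync"
)

type siteCache struct {
	mu      sync.Mutex
	flavors map[string]int64
}

// UpdateSiteFlavor is the entry point; it takes the lock so the helper below
// can touch the map safely.
func (c *siteCache) UpdateSiteFlavor(flavorID string, deduct bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.updateFlavorCount(flavorID, deduct)
}

// updateFlavorCount must only be called with c.mu held.
func (c *siteCache) updateFlavorCount(flavorID string, deduct bool) {
	if deduct {
		c.flavors[flavorID]--
	} else {
		c.flavors[flavorID]++
	}
}

func main() {
	c := &siteCache{flavors: map[string]int64{"m1.small": 10}}
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			c.UpdateSiteFlavor("m1.small", true)
		}()
	}
	wg.Wait()
	fmt.Println(c.flavors["m1.small"]) // 6
}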

	//siteSelectedInfo is type of SiteSelectorInfo at cycle_state.go
	siteSelectedInfo, err := interfaces.GetSiteSelectorState(state, siteID)
	if err != nil {
		klog.Errorf("Gettng site selector state failed! err: %s", err)
Collaborator

Getting?

Collaborator Author

It has been changed to "Getting".

	//siteSelectedInfo is type of SiteSelectorInfo at cycle_state.go
	siteSelectedInfo, err := interfaces.GetSiteSelectorState(state, siteID)
	if err != nil {
		klog.Errorf("Gettng site selector state failed! err: %s", err)
Collaborator
@jshaofuturewei May 18, 2021

err is not a string. Please use %v not %s

Collaborator Author

It has been changed to %v.
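
A quick standalone check of the difference (not project code): for a non-nil error both verbs print the same text, but %v degrades more gracefully when the error is nil.

package main

import (
	"errors"
	"fmt"
)

func main() {
	var err error = errors.New("site selector state not found")
	fmt.Printf("%s\n", err) // site selector state not found
	fmt.Printf("%v\n", err) // site selector state not found

	err = nil
	fmt.Printf("%v\n", err) // <nil>
	fmt.Printf("%s\n", err) // %!s(<nil>)
}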

	stack.Resources[i].FlavorIDSelected = flavorID
	klog.V(4).Infof("GetFlavor - flavorID: %s, region: %s", flavorID, region)
	flv, ok := cache.FlavorCache.GetFlavor(flavorID, region)
	if !ok {
Collaborator

I am not very sure about the logic here. If the pod flavor does not exist in that region, do we need to return scheduling failure?

Collaborator Author

In that case, it assigns the best site based on its score.

@kimeunju108
Collaborator Author

Tried to reopen.


@kimeunju108
Collaborator Author

kimeunju108 commented May 25, 2021

Reopen
Duplicate of #298

@kimeunju108 changed the title from "Withdraw resource when vm creation failed" to "Withdraw resource when vm creation is failed" on May 25, 2021
Development

Successfully merging this pull request may close these issues.

Revoke cluster resources when scheduling failed to create a vm