Fix error resync logic #385

Merged
merged 1 commit into from
Aug 2, 2019

Conversation

hzxuzhonghu
Collaborator

@hzxuzhonghu hzxuzhonghu commented Jul 24, 2019

fixes: #384

There is latency between Controller.syncJob creating a pod and the k8s informer receiving the notification, so in some cases the job controller calls Controller.syncJob again and tries to create the same pod twice.
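To illustrate the latency described above, here is a minimal sketch that uses only the Go standard library. It is not the Volcano code: `apiServer`, `informerCache`, `syncOnce`, and `isAlreadyExists` are hypothetical stand-ins (the last one plays the role that `apierrors.IsAlreadyExists` plays in the diff), and the informer is reduced to an explicit `Deliver` call so the ordering is deterministic.

```go
package main

import (
	"errors"
	"fmt"
)

// errAlreadyExists stands in for the API server's AlreadyExists error (assumption).
var errAlreadyExists = errors.New("already exists")

// isAlreadyExists is a hypothetical helper mirroring the role of apierrors.IsAlreadyExists.
func isAlreadyExists(err error) bool { return errors.Is(err, errAlreadyExists) }

// apiServer is a hypothetical stand-in for the authoritative pod store.
type apiServer struct{ pods map[string]bool }

// Create stores the pod, or fails if a pod with that name is already stored.
func (s *apiServer) Create(name string) error {
	if s.pods[name] {
		return fmt.Errorf("pod %q: %w", name, errAlreadyExists)
	}
	s.pods[name] = true
	return nil
}

// informerCache is a hypothetical local view that only changes when an event is delivered.
type informerCache struct{ pods map[string]bool }

// Deliver simulates the informer receiving the "pod added" notification.
func (c *informerCache) Deliver(name string) { c.pods[name] = true }

// syncOnce creates every desired pod that the cache does not know about yet
// and returns the creation errors it hit.
func syncOnce(server *apiServer, cache *informerCache, desired []string) []error {
	var errs []error
	for _, name := range desired {
		if cache.pods[name] {
			continue // the cache already knows this pod
		}
		if err := server.Create(name); err != nil {
			errs = append(errs, err)
		}
	}
	return errs
}

func main() {
	server := &apiServer{pods: map[string]bool{}}
	cache := &informerCache{pods: map[string]bool{}}
	desired := []string{"job-task-0"}

	// First sync: the pod is created on the server, but the cache is not updated yet.
	fmt.Println("first sync errors:", len(syncOnce(server, cache, desired)))

	// Second sync runs before the notification arrives, so the create is repeated.
	for _, err := range syncOnce(server, cache, desired) {
		fmt.Println("second sync:", err, "| already exists:", isAlreadyExists(err))
	}

	// Once the event is delivered, a later sync has nothing left to create.
	cache.Deliver("job-task-0")
	fmt.Println("third sync errors:", len(syncOnce(server, cache, desired)))
}
```

In this sketch the duplicate create is harmless as long as the caller treats the AlreadyExists result as "the pod is there, the cache is just behind" rather than as a failure.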

@volcano-sh-bot volcano-sh-bot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Jul 24, 2019
@@ -265,10 +265,6 @@ func (cc *Controller) syncJob(jobInfo *apis.JobInfo, updateStatus state.UpdateSt
pod.Name, job.Name, err)
creationErrs = append(creationErrs, fmt.Errorf("failed to create pod %s, err: %#v", pod.Name, err))
} else {
if err != nil && apierrors.IsAlreadyExists(err) {
Contributor

Please provide more detail on why we delete this resync logic here rather than fixing the "can not find pod <namespace/pod> in cache" issue.

Collaborator Author

I think there is a race between the core job controller and the resync controller.

And the same race exists between JobInfo.AddPod and Controller.syncJob.

Member

I think there is a race between the core job controller and the resync controller.

Please share your analysis of this issue first, before opening a PR.

Collaborator Author

I will add the analysis to the PR description.

@TommyLike
Contributor

fixes: #384

There is latency between Controller.syncJob creating a pod and the k8s informer receiving the notification, so in some cases the job controller calls Controller.syncJob again and tries to create the same pod twice.

Do you mean this will always happen, whether or not we delete the pod?

fixes: #384

There is latency between Controller.syncJob creating a pod and the k8s informer receiving the notification, so in some cases the job controller calls Controller.syncJob again and tries to create the same pod twice.

This change can help fix this issue, but I am wondering whether the resync logic was added to cover this case intentionally.

@hzxuzhonghu
Collaborator Author

This change can help fix this issue, but I am wondering whether the resync logic was added to cover this case intentionally.

I think so. In an eventually consistent model, resync is a standard way to catch up.
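As a rough illustration of what "resync to catch up" can mean here, the sketch below requeues a key until the local cache has seen it. It is a generic pattern written against the standard library only, not the resync controller in this repository: `cache`, `resyncQueue`, and `resync` are hypothetical names, and real controllers would normally add rate limiting and a retry cap.

```go
package main

import "fmt"

// cache is a hypothetical, eventually consistent local view keyed by "namespace/name".
type cache struct{ pods map[string]string }

// resyncQueue holds keys whose state could not be confirmed from the cache yet.
type resyncQueue struct{ keys []string }

// Add queues a key for a later re-check.
func (q *resyncQueue) Add(key string) { q.keys = append(q.keys, key) }

// Drain returns the queued keys and empties the queue.
func (q *resyncQueue) Drain() []string {
	keys := q.keys
	q.keys = nil
	return keys
}

// resync re-checks every queued key against the cache. Keys the cache still
// does not know about are queued again; the return value counts resolved keys.
func resync(q *resyncQueue, c *cache) int {
	resolved := 0
	for _, key := range q.Drain() {
		if _, ok := c.pods[key]; ok {
			resolved++
			continue
		}
		q.Add(key) // cache is still behind, try again on the next pass
	}
	return resolved
}

func main() {
	q := &resyncQueue{}
	c := &cache{pods: map[string]string{}}

	// The pod was created, but the cache has not observed it yet.
	q.Add("default/job-task-0")
	fmt.Println("resolved:", resync(q, c), "pending:", len(q.keys))

	// The notification arrives; the next pass catches up.
	c.pods["default/job-task-0"] = "Pending"
	fmt.Println("resolved:", resync(q, c), "pending:", len(q.keys))
}
```

The point of the pattern is that a cache miss is treated as "not yet" rather than as an error, and the next pass converges once the notification has been processed.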

@TommyLike
Contributor

/lgtm
/approve

@volcano-sh-bot volcano-sh-bot added the lgtm Indicates that a PR is ready to be merged. label Aug 2, 2019
@volcano-sh-bot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hzxuzhonghu, TommyLike

@volcano-sh-bot volcano-sh-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 2, 2019
@volcano-sh-bot volcano-sh-bot merged commit 1a856a1 into volcano-sh:master Aug 2, 2019
@hzxuzhonghu hzxuzhonghu deleted the resync branch July 6, 2020 08:08
Development

Successfully merging this pull request may close these issues.

The job task resync logic is not right
4 participants