
Fix minReadySeconds for DC #14954

Conversation

@tnozicka (Contributor) commented Jun 29, 2017

Follow up to: #14936 (needs to be merged first)

Make AvailableReplicas work with MinReadySeconds set.
Removes the obsolete pod counting that overlapped with AvailableReplicas from the RC. That counting could leave the RC in a state where AvailableReplicas=0 while deployment-phase=Complete, with roughly a 50% chance; the state lasts only a very short time.

[Outdated] For now, ignore the first 2 commits, which are part of #14936 (that PR isn't merged yet).
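
To make the intent concrete, here is a minimal sketch (illustrative only, not the actual diff) of the availability semantics involved: with minReadySeconds set, a pod only counts toward AvailableReplicas once it has been Ready for at least minReadySeconds, so the deployer should gate on the RC's status rather than re-counting ready pods itself. The rcStatus and rcFullyAvailable names below are made up for this example.

package main

import "fmt"

// rcStatus is an illustrative stand-in for the RC status fields relevant here.
type rcStatus struct {
    Replicas          int32 // observed replicas
    ReadyReplicas     int32 // pods passing readiness checks
    AvailableReplicas int32 // pods Ready for at least minReadySeconds
}

// rcFullyAvailable reports whether every desired replica is available.
// With minReadySeconds > 0, AvailableReplicas can lag behind ReadyReplicas.
func rcFullyAvailable(desired int32, s rcStatus) bool {
    return s.AvailableReplicas == desired
}

func main() {
    s := rcStatus{Replicas: 3, ReadyReplicas: 3, AvailableReplicas: 2}
    // false: one pod is Ready but has not yet been Ready for minReadySeconds.
    fmt.Println(rcFullyAvailable(3, s))
}

In the window where ReadyReplicas has already caught up but AvailableReplicas has not, counting ready pods directly would report completion too early, which appears to be the race described above.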

@tnozicka added this to the 3.7.0 milestone on Jun 29, 2017
@tnozicka requested review from mfojtik and 0xmichalis on June 29, 2017 10:08
@tnozicka (Contributor Author)

[test]

@tnozicka (Contributor Author)

re-[test]

@mfojtik changed the title from "WIP - Fix minReadySeconds for DC" to "Fix minReadySeconds for DC" on Jul 3, 2017
@mfojtik changed the title from "Fix minReadySeconds for DC" to "WIP: Fix minReadySeconds for DC" on Jul 3, 2017
@wanghaoran1988 (Member)

/cc @zhouying7780

@openshift-bot added the needs-rebase label (merge conflicts with HEAD) on Jul 4, 2017
@tnozicka force-pushed the fix-minreadyseconds-for-dc-final branch from 4c932ab to 05aebc8 on July 20, 2017 09:11
@tnozicka (Contributor Author)

@mfojtik @Kargakis - I've rebased it onto the new master, PTAL

return fmt.Errorf("acceptAvailablePods failed to watch ReplicationController %s/%s: %v", rc.Namespace, rc.Name, err)
}

_, err = watch.Until(c.timeout, watcher, func(event watch.Event) (bool, error) {
@tnozicka (Contributor Author)

@Kargakis I know what you are about to say :). PTO and certification got in the way of fixing watch.Until; I'll get back to working on it.
Well, the previous implementation used WATCH as well.

@tnozicka changed the title from "WIP: Fix minReadySeconds for DC" to "Fix minReadySeconds for DC" on Jul 20, 2017
@tnozicka removed the needs-rebase label on Jul 20, 2017
@tnozicka (Contributor Author)

This also eliminates the need for rebase patch b7e5324 because that code is no longer used.

@tnozicka (Contributor Author)

Fixes #15274

@mfojtik (Contributor) commented Jul 20, 2017

[test]


_, err = watch.Until(c.timeout, watcher, func(event watch.Event) (bool, error) {
if t := event.Type; t != watch.Modified {
return false, fmt.Errorf("acceptAvailablePods failed watching for ReplicationController %s/%s: recieved event %s", rc.Namespace, rc.Name, t)
Contributor

received

Contributor

also use %v for event?

Not sure I understand this code completely... this means we will wait until the RC is updated and then check the acceptCondition. If we receive a deleted event, for example, do we return with an error?

@tnozicka (Contributor Author)

Yes, there shouldn't be any event other than Modified while waiting for the availability change; if we receive Deleted, we fail, which seems like the right thing to do because a deleted RC can't become available.
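
To spell out that behaviour, here is a stand-alone sketch of the decision the condition closure makes; the watchEvent, rcSnapshot and handleEvent names are local stand-ins for this example, not the real watch package or the code in this PR.

package main

import (
    "errors"
    "fmt"
)

type eventType string

const (
    modified eventType = "MODIFIED"
    deleted  eventType = "DELETED"
)

type rcSnapshot struct {
    Desired   int32
    Available int32
}

type watchEvent struct {
    Type eventType
    RC   rcSnapshot
}

// handleEvent mirrors the logic of the condition closure passed to watch.Until:
// anything other than a Modified event is a failure (a deleted RC can never
// become available); otherwise keep waiting until all desired replicas are available.
func handleEvent(e watchEvent) (done bool, err error) {
    if e.Type != modified {
        return false, errors.New("unexpected event type: " + string(e.Type))
    }
    return e.RC.Available == e.RC.Desired, nil
}

func main() {
    fmt.Println(handleEvent(watchEvent{Type: modified, RC: rcSnapshot{Desired: 3, Available: 2}})) // false <nil>: keep waiting
    fmt.Println(handleEvent(watchEvent{Type: modified, RC: rcSnapshot{Desired: 3, Available: 3}})) // true <nil>: accepted
    fmt.Println(handleEvent(watchEvent{Type: deleted, RC: rcSnapshot{Desired: 3}}))                // error: the RC was deleted
}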

Contributor

the default timeout for this watch is 10 minutes? should we have retry logic here in case the watch is dropped?

@tnozicka (Contributor Author)

It was using a watch even before...

But I am working on fixing watch.Until in a separate branch so that it restarts the watch and doesn't end prematurely. We can wait for that, though.
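
As an aside, one possible shape for such a retry, sketched with assumed placeholders (openWatch, waitOnce and errWatchDropped are not real client calls): re-establish the watch when it is dropped and keep waiting under a single overall deadline.

package main

import (
    "errors"
    "fmt"
    "time"
)

// errWatchDropped is an assumed sentinel for a watch the server closed early.
var errWatchDropped = errors.New("watch dropped")

// untilWithRetry re-opens a dropped watch and resumes waiting, so a closed
// connection does not abort the availability check before the deadline.
func untilWithRetry(timeout time.Duration, openWatch func() error, waitOnce func(deadline time.Time) error) error {
    deadline := time.Now().Add(timeout)
    for time.Now().Before(deadline) {
        if err := openWatch(); err != nil {
            return err
        }
        err := waitOnce(deadline)
        if err == errWatchDropped {
            continue // the server closed the watch early; re-open and keep waiting
        }
        return err
    }
    return errors.New("timed out waiting for availability")
}

func main() {
    drops := 2
    err := untilWithRetry(time.Second,
        func() error { return nil }, // pretend the watch always opens cleanly
        func(time.Time) error {
            if drops > 0 {
                drops-- // simulate the watch being dropped twice
                return errWatchDropped
            }
            return nil // condition accepted
        })
    fmt.Println(err) // <nil>: succeeded after two re-opened watches
}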

@tnozicka (Contributor Author)

fixed received and %v

glog.V(4).Infof("Still waiting for %d pods to become ready for rc %s", unready.Len(), rc.Name)
return false, nil
newRc := event.Object.(*kapi.ReplicationController)
return acceptCondition(newRc), nil
Contributor

acceptCondition is vague; can you name this something like "allReplicasAvailable()"?

@tnozicka (Contributor Author)

renamed

}
return fmt.Errorf("pod readiness check failed for rc %q: %v", rc.Name, err)
return err
Contributor

Why remove the context from the error?

@tnozicka (Contributor Author) Jul 20, 2017

Because in the current context it feels misleading: other things might have failed here, not just the "pod readiness check".

I am open to suggestions if we want to reformat the error here rather than leave it to the caller.
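
For what it's worth, a small sketch of the alternative being discussed, with the context added once at the call site; the waitForAvailability helper and the messages below are illustrative, not the code in this PR.

package main

import (
    "errors"
    "fmt"
)

// waitForAvailability stands in for the helper: several different things can
// fail inside it (watch setup, timeout, unexpected event), so it does not pin
// its error message to a single cause such as "pod readiness check".
func waitForAvailability() error {
    return errors.New("watch closed before the replication controller became available")
}

func main() {
    if err := waitForAvailability(); err != nil {
        // The caller knows it is accepting an RC, so it adds that context once
        // (the rc name here is a made-up example).
        fmt.Println(fmt.Errorf("acceptAvailablePods failed for rc example/example-1: %v", err))
    }
}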

@tnozicka force-pushed the fix-minreadyseconds-for-dc-final branch from 39e497f to bee1f5e on July 20, 2017 10:52
@tnozicka (Contributor Author)

Had to rename the upstream commit to hopefully pass the Travis check; no other changes.

@mfojtik (Contributor) commented Jul 20, 2017

flake #14897

[test]

@tnozicka force-pushed the fix-minreadyseconds-for-dc-final branch from bee1f5e to 50c9a66 on July 20, 2017 11:59
@tnozicka (Contributor Author)

flake #14897 again; re-[test]

@tnozicka (Contributor Author)

[test] to find out if that was a flake. Unfortunately we appear not to dump pods in failureTraps (just the deployers); this might have been a sync delay for pod -> rc, or one of the pods didn't become available for some reason (infra).

@tnozicka (Contributor Author)

test succeeded, trying another run [test]

@openshift-merge-robot removed the approved label on Jul 28, 2017
@tnozicka (Contributor Author) commented Aug 1, 2017

(no logs anymore)
/retest

@tnozicka (Contributor Author) commented Aug 1, 2017

yum failed because of network :/
/retest

@0xmichalis (Contributor)

/retest

1 similar comment
@tnozicka (Contributor Author)

/retest

@tnozicka (Contributor Author)

@mfojtik bump

@tnozicka (Contributor Author)

P1 after the associated issue #15274

@mfojtik (Contributor) commented Aug 22, 2017

/retest
/approve

@Kargakis this LGTM, do you have any last comments?

@openshift-merge-robot added the approved label on Aug 22, 2017
@0xmichalis (Contributor)

/lgtm

@openshift-ci-robot added the lgtm label on Aug 23, 2017
@openshift-merge-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: kargakis, mfojtik, tnozicka

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these OWNERS Files:

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@openshift-bot (Contributor)

/retest

Please review the full test history for this PR and help us cut down flakes.

2 similar comments

@tnozicka (Contributor Author)

/test extended_conformance_install_update

@openshift-bot (Contributor)

/retest

Please review the full test history for this PR and help us cut down flakes.

6 similar comments

@openshift-merge-robot (Contributor)

Automatic merge from submit-queue

Labels
approved: Indicates a PR has been approved by an approver from all required OWNERS files.
lgtm: Indicates that a PR is ready to be merged.
priority/P1
size/L: Denotes a PR that changes 100-499 lines, ignoring generated files.
8 participants