Fix minReadySeconds for DC #14954
Conversation
[test]
re-[test]
/cc @zhouying7780
Force-pushed from 4c932ab to 05aebc8
return fmt.Errorf("acceptAvailablePods failed to watch ReplicationController %s/%s: %v", rc.Namespace, rc.Name, err) | ||
} | ||
|
||
_, err = watch.Until(c.timeout, watcher, func(event watch.Event) (bool, error) { |
@Kargakis I know what you are about to say :). PTO and certification got in the way of fixing watch.Until; I'll get back to working on it.
Well, it used WATCH even in the previous implementation.
This also eliminates the need for the rebase patch b7e5324 because that code is not used anymore.
Fixes #15274
[test]
_, err = watch.Until(c.timeout, watcher, func(event watch.Event) (bool, error) {
	if t := event.Type; t != watch.Modified {
		return false, fmt.Errorf("acceptAvailablePods failed watching for ReplicationController %s/%s: recieved event %s", rc.Namespace, rc.Name, t)
received
Also use %v for the event?
Not sure I understand this code completely... this means we will wait till the RC is updated and then check the acceptCondition. If we receive a deleted event, for example, do we return with an error?
Yes, there shouldn't be any event other than modified while waiting for the availability change; if we receive deleted, we fail, which seems like the right thing to do because a deleted RC can't become available.
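A minimal sketch of the pattern under discussion, assuming the apimachinery-era watch.Until signature; the import paths and the allReplicasAvailable predicate (suggested later in this review) are assumptions, not the PR's literal code:

```go
import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/watch"
	kapi "k8s.io/kubernetes/pkg/api"
)

// waitForAvailability blocks until the RC reports all replicas available,
// failing fast on any event other than Modified, since a deleted RC can
// never become available.
func waitForAvailability(timeout time.Duration, watcher watch.Interface, rc *kapi.ReplicationController) error {
	_, err := watch.Until(timeout, watcher, func(event watch.Event) (bool, error) {
		if t := event.Type; t != watch.Modified {
			return false, fmt.Errorf("received unexpected event %v for ReplicationController %s/%s", t, rc.Namespace, rc.Name)
		}
		newRC := event.Object.(*kapi.ReplicationController)
		return allReplicasAvailable(newRC), nil
	})
	return err
}
```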
The default timeout for this watch is 10 minutes? Should we have retry logic here in case the watch is dropped?
It was using watch even before...
But I am working, in a separate branch, on fixing watch.Until so it restarts the watch and doesn't end prematurely. We can wait for that, though.
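For flavor, a hedged sketch of what restarting a dropped watch could look like; watchWithRetry, newWatcher, and condition are hypothetical names, and the real watch.Until fix lives in a separate branch per the comment above:

```go
import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/watch"
)

// watchWithRetry re-establishes a dropped watch until the condition is met
// or the overall timeout expires.
func watchWithRetry(timeout time.Duration, newWatcher func() (watch.Interface, error), condition func(watch.Event) (bool, error)) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		w, err := newWatcher()
		if err != nil {
			return err
		}
		dropped, err := consumeEvents(deadline, w, condition)
		w.Stop()
		if !dropped {
			return err // nil if the condition was met, non-nil on hard failure
		}
		// The server closed the watch before the condition was met: re-watch.
	}
	return fmt.Errorf("timed out waiting for condition")
}

// consumeEvents reads events until the condition is met, the deadline passes,
// or the result channel closes (in which case it reports the watch as dropped).
func consumeEvents(deadline time.Time, w watch.Interface, condition func(watch.Event) (bool, error)) (bool, error) {
	timer := time.NewTimer(time.Until(deadline))
	defer timer.Stop()
	for {
		select {
		case event, ok := <-w.ResultChan():
			if !ok {
				return true, nil
			}
			if done, err := condition(event); done || err != nil {
				return false, err
			}
		case <-timer.C:
			return false, fmt.Errorf("timed out waiting for condition")
		}
	}
}
```

A production version would also track the last observed resourceVersion and resume the new watch from it, so events are neither missed nor replayed.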
Fixed "received" and %v.
glog.V(4).Infof("Still waiting for %d pods to become ready for rc %s", unready.Len(), rc.Name)
return false, nil
newRc := event.Object.(*kapi.ReplicationController)
return acceptCondition(newRc), nil
acceptCondition is vague; can you name this something like allReplicasAvailable() or similar?
Renamed.
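The renamed predicate could plausibly read like this, a sketch assuming the RC status already carries the minReadySeconds-aware AvailableReplicas count from upstream (not necessarily the PR's exact code):

```go
// allReplicasAvailable reports whether every desired replica is counted as
// available (i.e. ready for at least minReadySeconds) in the RC status.
func allReplicasAvailable(rc *kapi.ReplicationController) bool {
	return rc.Status.AvailableReplicas == rc.Spec.Replicas
}
```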
}
return fmt.Errorf("pod readiness check failed for rc %q: %v", rc.Name, err)
return err
Why remove the context from the error?
Because in the current context it feels misleading about what it's saying: other things might have failed here, not just the "pod readiness check".
I am open to suggestions if we want to reformat the error here rather than leave it to the caller.
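One option is to let the caller add the context it actually knows about, for example (a hypothetical caller, sketch only):

```go
// The caller knows this was the availability acceptance step, so it can add
// that context without claiming that a specific check failed.
if err := acceptAvailablePods(rc); err != nil {
	return fmt.Errorf("deployment %s/%s did not become available: %v", rc.Namespace, rc.Name, err)
}
```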
Force-pushed from 39e497f to bee1f5e
Had to rename the upstream commit to hopefully pass the Travis check; no other changes.
flake #14897 [test]
Force-pushed from bee1f5e to 50c9a66
flake #14897 again; re-[test]
[test] to find out if that was a flake. Unfortunately we appear not to dump pods in failureTraps (just the deployers); this might have been a sync delay for pod -> rc, or one of the pods didn't become available for some reason (infra).
Test succeeded, trying another run [test]
(no logs anymore)
yum failed because of the network :/
/retest
1 similar comment
@mfojtik bump
P1 after the associated issue #15274
/retest @Kargakis this LGTM; do you have any last comments?
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: kargakis, mfojtik, tnozicka. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these OWNERS files.
/retest Please review the full test history for this PR and help us cut down flakes.
2 similar comments
/test extended_conformance_install_update
/retest Please review the full test history for this PR and help us cut down flakes.
6 similar comments
Automatic merge from submit-queue
Follow-up to: #14936 (needs to be merged first)
Make AvailableReplicas work with MinReadySeconds set.
Removes the obsolete counting of pods, which overlapped with AvailableReplicas from the RC. This was causing the RC to end up in a state where AvailableReplicas=0 while the deployment phase was Complete, with about a 50% chance; the state lasts only a very short time.
[Outdated] For now, ignore the first 2 commits, which are part of #14936 (because that isn't merged yet).
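For background, "available" with MinReadySeconds set follows the upstream definition: a pod counts as available once it has been Ready for at least minReadySeconds. A simplified sketch (upstream has a canonical helper for this; the names below are assumptions):

```go
import (
	"time"

	kapi "k8s.io/kubernetes/pkg/api"
)

// podAvailable reports whether the pod has been Ready for at least
// minReadySeconds as of "now" (simplified; the real helper handles a few
// more edge cases around the ready-condition transition time).
func podAvailable(pod *kapi.Pod, minReadySeconds int32, now time.Time) bool {
	for _, c := range pod.Status.Conditions {
		if c.Type != kapi.PodReady || c.Status != kapi.ConditionTrue {
			continue
		}
		if minReadySeconds == 0 {
			return true
		}
		return now.Sub(c.LastTransitionTime.Time) >= time.Duration(minReadySeconds)*time.Second
	}
	return false
}
```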