Fix minReadySeconds for DC #14954
Conversation
[test]
re-[test]
/cc @zhouying7780
Force-pushed from 4c932ab to 05aebc8
return fmt.Errorf("acceptAvailablePods failed to watch ReplicationController %s/%s: %v", rc.Namespace, rc.Name, err) | ||
} | ||
|
||
_, err = watch.Until(c.timeout, watcher, func(event watch.Event) (bool, error) { |
@Kargakis I know what you are about to say :). PTO and certification got in the way of fixing watch.Until; I'll get back to working on it.
Well, it used WATCH even in the previous implementation.
This also eliminates the need for the rebase patch b7e5324 because that code is not used anymore.
Fixes #15274
[test]
_, err = watch.Until(c.timeout, watcher, func(event watch.Event) (bool, error) {
	if t := event.Type; t != watch.Modified {
		return false, fmt.Errorf("acceptAvailablePods failed watching for ReplicationController %s/%s: recieved event %s", rc.Namespace, rc.Name, t)
received
Also use %v for the event?
Not sure I understand this code completely... this means we will wait till the RC is updated and then check the acceptCondition. If we receive a deleted event, for example, do we return with an error?
Yes, there shouldn't be any event other than modified while waiting for the availability change; if we receive deleted, we fail, which seems like the right thing to do because a deleted RC can't become available.
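A minimal sketch of the pattern under discussion, assuming the apimachinery-era watch.Until signature; the import paths and the allReplicasAvailable predicate (suggested later in this review) are assumptions, not the PR's literal code:

```go
import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/watch"
	kapi "k8s.io/kubernetes/pkg/api"
)

// waitForAvailability blocks until the RC reports all replicas available,
// failing fast on any event other than Modified, since a deleted RC can
// never become available.
func waitForAvailability(timeout time.Duration, watcher watch.Interface, rc *kapi.ReplicationController) error {
	_, err := watch.Until(timeout, watcher, func(event watch.Event) (bool, error) {
		if t := event.Type; t != watch.Modified {
			return false, fmt.Errorf("received unexpected event %v for ReplicationController %s/%s", t, rc.Namespace, rc.Name)
		}
		newRC := event.Object.(*kapi.ReplicationController)
		return allReplicasAvailable(newRC), nil
	})
	return err
}
```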
The default timeout for this watch is 10 minutes? Should we have retry logic here in case the watch is dropped?
It was using watch even before...
But I am working, in a separate branch, on fixing watch.Until so it restarts the watch and doesn't end prematurely. We can wait for that, though.
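For flavor, a hedged sketch of what restarting a dropped watch could look like; watchWithRetry, newWatcher, and condition are hypothetical names, and the real watch.Until fix lives in a separate branch per the comment above:

```go
import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/watch"
)

// watchWithRetry re-establishes a dropped watch until the condition is met
// or the overall timeout expires.
func watchWithRetry(timeout time.Duration, newWatcher func() (watch.Interface, error), condition func(watch.Event) (bool, error)) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		w, err := newWatcher()
		if err != nil {
			return err
		}
		dropped, err := consumeEvents(deadline, w, condition)
		w.Stop()
		if !dropped {
			return err // nil if the condition was met, non-nil on hard failure
		}
		// The server closed the watch before the condition was met: re-watch.
	}
	return fmt.Errorf("timed out waiting for condition")
}

// consumeEvents reads events until the condition is met, the deadline passes,
// or the result channel closes (in which case it reports the watch as dropped).
func consumeEvents(deadline time.Time, w watch.Interface, condition func(watch.Event) (bool, error)) (bool, error) {
	timer := time.NewTimer(time.Until(deadline))
	defer timer.Stop()
	for {
		select {
		case event, ok := <-w.ResultChan():
			if !ok {
				return true, nil
			}
			if done, err := condition(event); done || err != nil {
				return false, err
			}
		case <-timer.C:
			return false, fmt.Errorf("timed out waiting for condition")
		}
	}
}
```

A production version would also track the last observed resourceVersion and resume the new watch from it, so events are neither missed nor replayed.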
Fixed "received" and %v.
glog.V(4).Infof("Still waiting for %d pods to become ready for rc %s", unready.Len(), rc.Name)
return false, nil
newRc := event.Object.(*kapi.ReplicationController)
return acceptCondition(newRc), nil
acceptCondition is vague; can you name this something like allReplicasAvailable() or similar?
Renamed.
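The renamed predicate could plausibly read like this, a sketch assuming the RC status already carries the minReadySeconds-aware AvailableReplicas count from upstream (not necessarily the PR's exact code):

```go
// allReplicasAvailable reports whether every desired replica is counted as
// available (i.e. ready for at least minReadySeconds) in the RC status.
func allReplicasAvailable(rc *kapi.ReplicationController) bool {
	return rc.Status.AvailableReplicas == rc.Spec.Replicas
}
```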
}
return fmt.Errorf("pod readiness check failed for rc %q: %v", rc.Name, err)
return err
Why remove the context from the error?
Because in the current context it feels misleading about what it's saying: other things might have failed here, not just the "pod readiness check".
I am open to suggestions if we want to reformat the error here rather than leave it to the caller.
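One option is to let the caller add the context it actually knows about, for example (a hypothetical caller, sketch only):

```go
// The caller knows this was the availability acceptance step, so it can add
// that context without claiming that a specific check failed.
if err := acceptAvailablePods(rc); err != nil {
	return fmt.Errorf("deployment %s/%s did not become available: %v", rc.Namespace, rc.Name, err)
}
```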
Force-pushed from 39e497f to bee1f5e
Had to rename the upstream commit to hopefully pass the Travis check; no other changes.
flake #14897 [test]
Force-pushed from bee1f5e to 50c9a66
flake #14897 again; re-[test]
[test] to find out if that was a flake. Unfortunately we appear not to dump pods in failureTraps (just the deployers); this might have been a sync delay for pod -> rc, or one of the pods didn't become available for some reason (infra).
Test succeeded, trying another run [test]
(no logs anymore)
yum failed because of the network :/
/retest
1 similar comment
@mfojtik bump
P1 after the associated issue #15274
/retest @Kargakis this LGTM; do you have any last comments?
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: kargakis, mfojtik, tnozicka. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these OWNERS files.
/retest Please review the full test history for this PR and help us cut down flakes.
2 similar comments
/test extended_conformance_install_update
/retest Please review the full test history for this PR and help us cut down flakes.
6 similar comments
Automatic merge from submit-queue
Follow-up to: #14936 (needs to be merged first)
Make AvailableReplicas work with MinReadySeconds set.
Removes the obsolete counting of pods, which overlapped with AvailableReplicas from the RC. This was causing the RC to end up in a state where AvailableReplicas=0 while the deployment phase was Complete, with about a 50% chance; the state lasts only a very short time.
[Outdated] For now, ignore the first 2 commits, which are part of #14936 (because that isn't merged yet).
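For background, "available" with MinReadySeconds set follows the upstream definition: a pod counts as available once it has been Ready for at least minReadySeconds. A simplified sketch (upstream has a canonical helper for this; the names below are assumptions):

```go
import (
	"time"

	kapi "k8s.io/kubernetes/pkg/api"
)

// podAvailable reports whether the pod has been Ready for at least
// minReadySeconds as of "now" (simplified; the real helper handles a few
// more edge cases around the ready-condition transition time).
func podAvailable(pod *kapi.Pod, minReadySeconds int32, now time.Time) bool {
	for _, c := range pod.Status.Conditions {
		if c.Type != kapi.PodReady || c.Status != kapi.ConditionTrue {
			continue
		}
		if minReadySeconds == 0 {
			return true
		}
		return now.Sub(c.LastTransitionTime.Time) >= time.Duration(minReadySeconds)*time.Second
	}
	return false
}
```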