fix: controlling issues #5756
Conversation
```diff
-if nodeName != prevNodeName || podIP != prevPodIP || prevStatus != status || prevCurrent != current {
+// TODO: the final status should always have the finishedAt too,
+// there should be no need for checking isFinished diff
+if nodeName != prevNodeName || isFinished != prevIsFinished || podIP != prevPodIP || prevStatus != status || prevCurrent != current {
```
Such code will not be simple to maintain.
I fully agree, although this code is rather not meant to be maintained. This PR, along with the actual bugfixes, is meant to "enable" the new orchestration with a similar watching system.
WatchInstrumentedPod and TestWorkflowResult (mainly) have so many edge cases handled, and so much clock calibration and auto-healing implemented, that at this point it's probably better to just rewrite them based on the observations we have from the last few months of running Test Workflows. After that, we will likely not need conditions like these at all.
I'm guessing that half of the healing mechanisms and edge-case handlers are no longer needed, considering the iterations of orchestration improvements.
pkg/testworkflows/testworkflowcontroller/watchinstrumentedpod.go
* fix: continue paused container when the abort is requested
* fix: ensure the lightweight container watcher will get `finishedAt` timestamp
* chore: add minor todos
* fix: configure no preemption policy by default for Test Workflows
* fix: allow Test Workflow status notifier to update "Aborted" status with details
* fix: ensure the parallel workers will not end without result
* fix: properly build timestamps and detect finished result in the TestWorkflowResult model
* fix: use Pod/Job StatusConditions for detecting the status, make watching more resilient to external problems, expose more Kubernetes error details
* chore: do not require job/pod events when fetching logs of parallel workers and services
* fixup unit tests
* fix: delete preemption policy setup
* fixup unit tests
* fix: adjust resume time to avoid negative duration
* fix: calibrate clocks
* chore: use consts
* fixup unit tests
Pull request description
Checklist (choose what's happened)
Breaking changes
Changes
Fixes