
feat: Add version to offload nodes. Fixes #1944 and #1946 (#1974)

Merged 508 commits into argoproj:master from the nos branch on Jan 22, 2020

Conversation

@alexec (Contributor) commented on Jan 14, 2020

Checklist:

  • Either (a) I've created an enhancement proposal and discussed it with the community, (b) this is a bug fix, or (c) this is a chore.
  • The title of the PR is (a) conventional, (b) states what changed, and (c) suffixes the related issue number. E.g. "fix(controller): Updates such and such. Fixes #1234".
  • I have written unit and/or e2e tests for my change. PRs without these are unlikely to be merged.
  • Optional. My organization is added to the README.
  • I've signed the CLA and required builds are green.

The goal of this PR is to use the resource version to make sure that our offloaded workflows are in sync with Kubernetes.
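
For context, a minimal sketch of the data model this implies (the type and field names below are illustrative, not necessarily the exact code in this PR):

```go
// Illustrative sketch only - not the PR's exact types.

// The workflow stored in Kubernetes no longer needs to carry the full
// status.nodes map; it records which offloaded version is current.
type WorkflowStatus struct {
	OffloadedNodeStatusVersion string `json:"offloadedNodeStatusVersion,omitempty"`
}

// Offloaded nodes live in the database, keyed by both UID and version, so a
// new write creates a new row rather than overwriting the previous one.
type nodesRecord struct {
	UID     string `db:"uid"`
	Version string `db:"version"`
	Nodes   string `db:"nodes"` // JSON-encoded wfv1.Nodes
}
```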


Review comments on the following lines of the diff:

```go
err = tx.Collection(wdc.tableName).UpdateReturning(wfDB)

func (wdc *nodeOffloadRepo) Get(uid, version string) (wfv1.Nodes, error) {
	log.WithFields(log.Fields{"uid": uid, "version": version}).Debug("Getting offloaded nodes")
```
@mark9white (Contributor):

If the workflow has been updated in the Kubernetes API but the database hasn't been updated yet, then this could fail - should it retry for this case?

@alexec (Author):

I don't think so (but happy to be proved wrong).

We use the version so we can have historical versions of the record in the database. AFAIK, only the controller can change these, and we keep them for 5m (the same as etcd).
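
To illustrate, a rough sketch of what the version-keyed lookup could look like (assuming the upper/db-style API used in the diff and the illustrative nodesRecord type from the sketch above; wdc.session and the import paths are assumptions, not the PR's exact code):

```go
import (
	"encoding/json"

	log "github.com/sirupsen/logrus"
	"upper.io/db.v3"

	wfv1 "github.com/argoproj/argo/pkg/apis/workflow/v1alpha1"
)

// Sketch only: fetch the offloaded nodes for one specific (uid, version) pair.
// Older rows are untouched by newer writes and can be garbage-collected after
// the 5m grace period mentioned above.
func (wdc *nodeOffloadRepo) Get(uid, version string) (wfv1.Nodes, error) {
	log.WithFields(log.Fields{"uid": uid, "version": version}).Debug("Getting offloaded nodes")
	r := &nodesRecord{}
	err := wdc.session.Collection(wdc.tableName).
		Find(db.Cond{"uid": uid, "version": version}).
		One(r)
	if err != nil {
		return nil, err
	}
	nodes := wfv1.Nodes{}
	if err := json.Unmarshal([]byte(r.Nodes), &nodes); err != nil {
		return nil, err
	}
	return nodes, nil
}
```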

@mark9white (Contributor):

Ah you're right. What if:

  • Goroutine 1 reads workflow status nodes from DB

  • Goroutine 1 updates status nodes object in memory

  • Goroutine 2 reads workflow status nodes from DB (same workflow)

  • Goroutine 2 updates status nodes object in memory

  • Goroutine 1 writes updated status nodes to DB

  • Goroutine 1 updates workflow (exc. status nodes) in Kubernetes API

  • Goroutine 2 writes updated status nodes to DB

  • Goroutine 2 updates workflow (exc. status nodes) in Kubernetes API (note that this will call reapplyUpdate which will merge any Goroutine 1 changes from the Kubernetes API before persisting the workflow)

Won't the status nodes in the DB be missing the changes from Goroutine 1?

@alexec (Author):

@mark9white,

Thank you for this - I really appreciate being challenged on this and it is critical we get it right.

Goroutine 1 reads workflow status nodes from DB
Goroutine 1 updates status nodes object in memory
Goroutine 2 reads workflow status nodes from DB (same workflow)
Goroutine 2 updates status nodes object in memory

At this point both routines have the same values for

  • metadata/resourceVersion
  • status/nodes
  • status/offloadedNodeStatusVersion

Goroutine 1 writes updated status nodes to DB
Goroutine 1 updates workflow (exc. status nodes) in Kubernetes API

At this point routine 2 has different values for resourceVersion/nodes/offloadedNodeStatusVersion (its copies are now stale).

Goroutine 2 writes updated status nodes to DB

Now we have two records in the database, each with a different version.

Goroutine 2 updates workflow (exc. status nodes) in Kubernetes API (note that this will call reapplyUpdate which will merge any Goroutine 1 changes from the Kubernetes API before persisting the workflow)

This is correct. As an aside - I wonder if reapplyUpdate is safe anyway?

The Kubernetes API will have goroutine 2's offloadedNodeStatusVersion - there will be two records in the database with two versions. The older one is effectively orphaned and is only needed so that we can support watches.

At this point goroutine 2 has a problem. Regardless of what happens with the node status, it has out-of-date information. This is the behaviour today.

Please keep this feedback coming!
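
To make the write path in the scenario above concrete, a rough sketch (illustrative only - in particular, how the version string is derived, e.g. from the resource version or a hash of the nodes, is an assumption here):

```go
import (
	"encoding/json"

	wfv1 "github.com/argoproj/argo/pkg/apis/workflow/v1alpha1"
)

// Sketch only: each goroutine writes its own (uid, version) row, so the two
// DB writes above do not overwrite each other. Whichever goroutine wins the
// Kubernetes API update decides which version the workflow's
// offloadedNodeStatusVersion ends up pointing at; the other row is orphaned
// and cleaned up later.
func (wdc *nodeOffloadRepo) Save(uid, version string, nodes wfv1.Nodes) error {
	data, err := json.Marshal(nodes)
	if err != nil {
		return err
	}
	record := &nodesRecord{UID: uid, Version: version, Nodes: string(data)}
	// InsertReturning here is illustrative; an upsert keyed on (uid, version)
	// behaves the same when the same version is saved twice.
	return wdc.session.Collection(wdc.tableName).InsertReturning(record)
}
```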

@mark9white (Contributor):

Agreed that this issue is already happening today when offload nodes are turned on (though not when they're turned off, as the status nodes will be part of the merge in reapplyUpdate).

I would think reapplyUpdate should be safe as long as there are no conflicts between the two updates. The safer way would be for goroutine 2 to abort and restart.
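
For reference, "abort and restart" is essentially the standard client-go conflict-retry pattern; a minimal sketch (retry.RetryOnConflict is the real client-go helper, everything else here is illustrative):

```go
import (
	"k8s.io/client-go/util/retry"

	wfv1 "github.com/argoproj/argo/pkg/apis/workflow/v1alpha1"
)

// Sketch only: instead of merging concurrent changes with reapplyUpdate,
// goroutine 2 could re-read the workflow and re-apply its mutation whenever
// the update is rejected with a resource-version conflict.
func updateWithRetry(
	get func() (*wfv1.Workflow, error), // fresh read from the Kubernetes API
	mutate func(*wfv1.Workflow), // re-apply the in-memory changes to the fresh copy
	update func(*wfv1.Workflow) error, // Update call; a conflict error triggers a retry
) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		wf, err := get()
		if err != nil {
			return err
		}
		mutate(wf)
		return update(wf)
	})
}
```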

Further review threads on test/e2e/argo_server_test.go and test/e2e/offloading_test.go were marked outdated and resolved.
@alexec merged commit 3293c83 into argoproj:master on Jan 22, 2020
@alexec deleted the nos branch on January 22, 2020 at 01:18
A later review thread on the following migration code:

```go
Update(s.tableName).
	Set("clustername", s.clusterName).
	Where(db.Cond{"clustername": nil}).
	And(db.Cond{"uuid": uid}).
```
@markterm (Contributor) commented on Feb 3, 2020:

Should this line read the following (as there is no uuid column)?

```go
And(db.Cond{"uid": uid}).
```

@alexec (Author):

yes - though why exactly any SQL framework would not throw some kind of exception at this clear error is beyond me 😠 😠 😠 😠
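
For what it's worth, a defensive sketch of the corrected backfill (assuming the upper/db sqlbuilder API from the diff; s.session, fmt, and the error/rows-affected checks are illustrative additions), which would have surfaced the bad column name:

```go
// Sketch only: the corrected statement, with the result checked so a mistake
// like "uuid" vs "uid" shows up as an error or as zero rows affected.
res, err := s.session.Update(s.tableName).
	Set("clustername", s.clusterName).
	Where(db.Cond{"clustername": nil}).
	And(db.Cond{"uid": uid}).
	Exec()
if err != nil {
	return fmt.Errorf("failed to back-fill clustername: %w", err)
}
if rows, err := res.RowsAffected(); err == nil {
	log.WithFields(log.Fields{"uid": uid, "rowsAffected": rows}).Info("Back-filled clustername")
}
```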

@alexec (Author):

actually - maybe this code never runs

@markterm (Contributor):

It's running for me when migrating a legacy database ...

[screenshot]

@alexec (Author):

Did it migrate successfully? Given that it's likely that every record should have "default", it may just be luck - though I ask again: who writes a framework that lets such a big error through?

@alexec (Author):

No - reviewing the code, this backfill is implemented really stupidly - I think I'll make a patch fix and then a rewrite.

@alexec mentioned this pull request on Feb 3, 2020
Another review thread, on the back-fill code:

```go
logCtx := log.WithFields(log.Fields{"name": wf.Name, "namespace": wf.Namespace, "version": version})
logCtx.Info("Back-filling node status")
res, err := session.Update(archiveTableName).
	Set("version", wf.ResourceVersion).
```
@markterm (Contributor):

Also this line - the archive table doesn't have a version column
