feat: Add version to offload nodes. Fixes #1944 and #1946 #1974
Conversation
err = tx.Collection(wdc.tableName).UpdateReturning(wfDB)

func (wdc *nodeOffloadRepo) Get(uid, version string) (wfv1.Nodes, error) {
	log.WithFields(log.Fields{"uid": uid, "version": version}).Debug("Getting offloaded nodes")
If the workflow has been updated in the Kubernetes API but the database hasn't been updated yet, then this could fail - should it retry for this case?
I don't think so (but happy to be proved wrong).
We use the version so we can have historical versions of the record in the database. AFAIK, only the controller can change these, and we keep them for 5m (same as Etcd).
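As an aside, the 5-minute retention described above could be pictured as a periodic delete keyed on a timestamp column. This is a minimal sketch only, assuming upper/db v3; the struct, table and "updatedat" column below are illustrative, not the PR's actual schema:

package sqldb

import (
	"time"

	"upper.io/db.v3"
	"upper.io/db.v3/lib/sqlbuilder"
)

// offloadSketchRepo is a hypothetical repo shape; field names are illustrative only.
type offloadSketchRepo struct {
	session   sqlbuilder.Database
	tableName string
}

// deleteExpired removes offloaded-node records that have not been touched for
// ttl (e.g. 5 minutes), mirroring the etcd-like retention described above.
func (r *offloadSketchRepo) deleteExpired(ttl time.Duration) error {
	cutoff := time.Now().Add(-ttl)
	_, err := r.session.DeleteFrom(r.tableName).
		Where(db.Cond{"updatedat <": cutoff}).
		Exec()
	return err
}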
Ah you're right. What if:
- Goroutine 1 reads workflow status nodes from DB
- Goroutine 1 updates status nodes object in memory
- Goroutine 2 reads workflow status nodes from DB (same workflow)
- Goroutine 2 updates status nodes object in memory
- Goroutine 1 writes updated status nodes to DB
- Goroutine 1 updates workflow (exc. status nodes) in Kubernetes API
- Goroutine 2 writes updated status nodes to DB
- Goroutine 2 updates workflow (exc. status nodes) in Kubernetes API (note that this will call reapplyUpdate, which will merge any Goroutine 1 changes from the Kubernetes API before persisting the workflow - see the sketch after this list)
Won't the status nodes in the DB be missing the changes from Goroutine 1?
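For readers unfamiliar with reapplyUpdate, the merging it refers to can be pictured roughly as a JSON merge patch re-applied on top of the latest API copy. This is a conceptual sketch, not Argo's actual implementation; the evanphx/json-patch library and the function below are assumptions for illustration:

package example

import jsonpatch "github.com/evanphx/json-patch"

// reapplySketch illustrates the idea only: compute the change this goroutine
// made relative to the workflow JSON it originally read, then re-apply that
// change on top of the latest JSON fetched from the Kubernetes API.
func reapplySketch(origJSON, oursJSON, latestJSON []byte) ([]byte, error) {
	// patch captures only the changes this goroutine made relative to what it read
	patch, err := jsonpatch.CreateMergePatch(origJSON, oursJSON)
	if err != nil {
		return nil, err
	}
	// re-apply those changes on top of the latest copy from the Kubernetes API
	return jsonpatch.MergePatch(latestJSON, patch)
}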
@mark9white,
Thank you for this - I really appreciate being challenged on this and it is critical we get it right.
Goroutine 1 reads workflow status nodes from DB
Goroutine 1 updates status nodes object in memory
Goroutine 2 reads workflow status nodes from DB (same workflow)
Goroutine 2 updates status nodes object in memory

At this point both routines have the same values for:
- metadata/resourceVersion
- status/nodes
- status/offloadedNodeStatusVersion

Goroutine 1 writes updated status nodes to DB
Goroutine 1 updates workflow (exc. status nodes) in Kubernetes API

At this point routine 2 has different (now stale) values for resourceVersion, nodes and offloadedNodeStatusVersion.

Goroutine 2 writes updated status nodes to DB

Now we have two records in the database, each with a different version.

Goroutine 2 updates workflow (exc. status nodes) in Kubernetes API (note that this will call reapplyUpdate which will merge any Goroutine 1 changes from the Kubernetes API before persisting the workflow)

This is correct. As an aside, I wonder if reapplyUpdate is safe anyway?

The Kubernetes API will have goroutine 2's offloadedNodeStatusVersion, and there will be two records in the database with two versions. The older one is effectively orphaned and only needed so that we can support watches.

At this point Goroutine 2 has a problem: regardless of what happens with the node status, it has out-of-date information. This is the behaviour today.
Please keep this feedback coming!
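To make the two-records point concrete, here is a minimal sketch of a version-keyed write, assuming upper/db v3; the record type, table layout and helper below are illustrative, not the PR's actual code:

package sqldb

import "upper.io/db.v3/lib/sqlbuilder"

// nodesRecordSketch is an assumed row shape: the key is (uid, version), so a
// write from Goroutine 1 and a later write from Goroutine 2 land as two
// separate rows rather than overwriting each other. The workflow's
// status.offloadedNodeStatusVersion then points at whichever row is current.
type nodesRecordSketch struct {
	UID     string `db:"uid"`
	Version string `db:"version"`
	Nodes   string `db:"nodes"` // JSON-encoded wfv1.Nodes
}

// saveVersionSketch is a hypothetical helper illustrating the insert.
func saveVersionSketch(sess sqlbuilder.Database, tableName string, rec nodesRecordSketch) error {
	// each (uid, version) pair is a separate row, so concurrent writers coexist
	_, err := sess.Collection(tableName).Insert(rec)
	return err
}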
Agreed that this issue is already happening today when offload nodes are turned on (though not when they're turned off, as the status nodes will be part of the merge in reapplyUpdate).
I would think reapplyUpdate should be safe as long as there are no conflicts between the two updates. The safer way would be for goroutine 2 to abort and restart.
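The abort-and-restart approach is essentially optimistic concurrency. A minimal sketch using client-go's conflict-retry helper; this is not what the controller currently does, and readLatest/mutateAndUpdate are hypothetical callbacks:

package example

import "k8s.io/client-go/util/retry"

// updateWithRestart illustrates the abort-and-restart alternative: if the
// update hits a resourceVersion conflict, re-read the latest workflow and
// redo the mutation instead of merging, so no concurrent change is lost.
func updateWithRestart(readLatest func() error, mutateAndUpdate func() error) error {
	// retry.RetryOnConflict re-runs the whole closure whenever the API server
	// rejects the update with a resourceVersion conflict
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		if err := readLatest(); err != nil {
			return err
		}
		return mutateAndUpdate()
	})
}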
Update(s.tableName).
	Set("clustername", s.clusterName).
	Where(db.Cond{"clustername": nil}).
	And(db.Cond{"uuid": uid}).
should this line read the following? (as there is no uuid column)
And(db.Cond{"uid": uid}).
yes - though why exactly any SQL framework would not throw some kind of exception at this clear error is beyond me 😠 😠 😠 😠
actually - maybe this code never runs
Did it migrate successfully? Given that it's likely that every record should have "default", it may just be luck - though I ask again, who writes a framework that lets such a big error through?
No - reviewing the code, this backfill is implemented really stupidly - I think I'll make a patch fix and then a re-write.
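For illustration, a re-written back-fill could be a single set-based update rather than a per-record loop. A minimal sketch only, assuming upper/db v3 and the column names from the diff above; this is not the eventual patch:

package sqldb

import (
	"upper.io/db.v3"
	"upper.io/db.v3/lib/sqlbuilder"
)

// backfillClusterName sets clustername on every legacy row in one statement
// instead of looping record by record. The function itself is hypothetical.
func backfillClusterName(sess sqlbuilder.Database, tableName, clusterName string) error {
	_, err := sess.Update(tableName).
		Set("clustername", clusterName).
		Where(db.Cond{"clustername": nil}).
		Exec()
	return err
}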
logCtx := log.WithFields(log.Fields{"name": wf.Name, "namespace": wf.Namespace, "version": version})
logCtx.Info("Back-filling node status")
res, err := session.Update(archiveTableName).
	Set("version", wf.ResourceVersion).
Also this line - the archive table doesn't have a version column
The goal of this PR is to use the resource version to make sure that our offloaded wfs are in sync with k8s.