
feat: Add version to offload nodes. Fixes #1944 and #1946 (#1974)

Merged 508 commits into argoproj:master from the nos branch on Jan 22, 2020

Conversation

@alexec (Contributor) commented on Jan 14, 2020

Checklist:

  • Either (a) I've created an enhancement proposal and discussed it with the community, (b) this is a bug fix, or (c) this is a chore.
  • The title of the PR is (a) conventional, (b) states what changed, and (c) suffixes the related issue number. E.g. "fix(controller): Updates such and such. Fixes #1234".
  • I have written unit and/or e2e tests for my change. PRs without these are unlikely to be merged.
  • Optional. My organization is added to the README.
  • I've signed the CLA and required builds are green.

The goal of this PR is to use the resource version to make sure that our offloaded workflows are in sync with Kubernetes.
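
For context, a minimal sketch of the data model this implies (the type and field names below are illustrative, not necessarily the exact code in this PR):

```go
// Illustrative sketch only - not the PR's exact types.

// The workflow stored in Kubernetes no longer needs to carry the full
// status.nodes map; it records which offloaded version is current.
type WorkflowStatus struct {
	OffloadedNodeStatusVersion string `json:"offloadedNodeStatusVersion,omitempty"`
}

// Offloaded nodes live in the database, keyed by both UID and version, so a
// new write creates a new row rather than overwriting the previous one.
type nodesRecord struct {
	UID     string `db:"uid"`
	Version string `db:"version"`
	Nodes   string `db:"nodes"` // JSON-encoded wfv1.Nodes
}
```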


Review comments on the following lines of the diff:

```go
err = tx.Collection(wdc.tableName).UpdateReturning(wfDB)

func (wdc *nodeOffloadRepo) Get(uid, version string) (wfv1.Nodes, error) {
	log.WithFields(log.Fields{"uid": uid, "version": version}).Debug("Getting offloaded nodes")
```
@mark9white (Contributor):

If the workflow has been updated in the Kubernetes API but the database hasn't been updated yet, then this could fail - should it retry for this case?

@alexec (Author):

I don't think so (but happy to be proved wrong).

We use the version so we can have historical versions of the record in the database. AFAIK, only the controller can change these, and we keep them for 5m (the same as etcd).
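
To illustrate, a rough sketch of what the version-keyed lookup could look like (assuming the upper/db-style API used in the diff and the illustrative nodesRecord type from the sketch above; wdc.session and the import paths are assumptions, not the PR's exact code):

```go
import (
	"encoding/json"

	log "github.com/sirupsen/logrus"
	"upper.io/db.v3"

	wfv1 "github.com/argoproj/argo/pkg/apis/workflow/v1alpha1"
)

// Sketch only: fetch the offloaded nodes for one specific (uid, version) pair.
// Older rows are untouched by newer writes and can be garbage-collected after
// the 5m grace period mentioned above.
func (wdc *nodeOffloadRepo) Get(uid, version string) (wfv1.Nodes, error) {
	log.WithFields(log.Fields{"uid": uid, "version": version}).Debug("Getting offloaded nodes")
	r := &nodesRecord{}
	err := wdc.session.Collection(wdc.tableName).
		Find(db.Cond{"uid": uid, "version": version}).
		One(r)
	if err != nil {
		return nil, err
	}
	nodes := wfv1.Nodes{}
	if err := json.Unmarshal([]byte(r.Nodes), &nodes); err != nil {
		return nil, err
	}
	return nodes, nil
}
```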

@mark9white (Contributor):

Ah you're right. What if:

  • Goroutine 1 reads workflow status nodes from DB

  • Goroutine 1 updates status nodes object in memory

  • Goroutine 2 reads workflow status nodes from DB (same workflow)

  • Goroutine 2 updates status nodes object in memory

  • Goroutine 1 writes updated status nodes to DB

  • Goroutine 1 updates workflow (exc. status nodes) in Kubernetes API

  • Goroutine 2 writes updated status nodes to DB

  • Goroutine 2 updates workflow (exc. status nodes) in Kubernetes API (note that this will call reapplyUpdate which will merge any Goroutine 1 changes from the Kubernetes API before persisting the workflow)

Won't the status nodes in the DB be missing the changes from Goroutine 1?

@alexec (Author):

@mark9white,

Thank you for this - I really appreciate being challenged on this and it is critical we get it right.

Goroutine 1 reads workflow status nodes from DB
Goroutine 1 updates status nodes object in memory
Goroutine 2 reads workflow status nodes from DB (same workflow)
Goroutine 2 updates status nodes object in memory

At this point both routines have the same values for

  • metadata/resourceVersion
  • status/nodes
  • status/offloadedNodeStatusVersion

Goroutine 1 writes updated status nodes to DB
Goroutine 1 updates workflow (exc. status nodes) in Kubernetes API

At this point routine 2 has different values for resourceVersion/nodes/offloadedNodeStatusVersion (its copies are now stale).

Goroutine 2 writes updated status nodes to DB

Now we have two records in the database, each with a different version.

Goroutine 2 updates workflow (exc. status nodes) in Kubernetes API (note that this will call reapplyUpdate which will merge any Goroutine 1 changes from the Kubernetes API before persisting the workflow)

This is correct. As an aside - I wonder if reapplyUpdate is safe anyway?

The Kubernetes API will have goroutine 2's offloadedNodeStatusVersion - there will be two records in the database with two versions. The older one is effectively orphaned and is only needed so that we can support watches.

At this point goroutine 2 has a problem. Regardless of what happens with the node status, it has out-of-date information. This is the behaviour today.

Please keep this feedback coming!
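
To make the write path in the scenario above concrete, a rough sketch (illustrative only - in particular, how the version string is derived, e.g. from the resource version or a hash of the nodes, is an assumption here):

```go
import (
	"encoding/json"

	wfv1 "github.com/argoproj/argo/pkg/apis/workflow/v1alpha1"
)

// Sketch only: each goroutine writes its own (uid, version) row, so the two
// DB writes above do not overwrite each other. Whichever goroutine wins the
// Kubernetes API update decides which version the workflow's
// offloadedNodeStatusVersion ends up pointing at; the other row is orphaned
// and cleaned up later.
func (wdc *nodeOffloadRepo) Save(uid, version string, nodes wfv1.Nodes) error {
	data, err := json.Marshal(nodes)
	if err != nil {
		return err
	}
	record := &nodesRecord{UID: uid, Version: version, Nodes: string(data)}
	// InsertReturning here is illustrative; an upsert keyed on (uid, version)
	// behaves the same when the same version is saved twice.
	return wdc.session.Collection(wdc.tableName).InsertReturning(record)
}
```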

@mark9white (Contributor):

Agreed that this issue is already happening today when offload nodes are turned on (though not when they're turned off, as the status nodes will be part of the merge in reapplyUpdate).

I would think reapplyUpdate should be safe as long as there are no conflicts between the two updates. The safer way would be for goroutine 2 to abort and restart.
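
For reference, "abort and restart" is essentially the standard client-go conflict-retry pattern; a minimal sketch (retry.RetryOnConflict is the real client-go helper, everything else here is illustrative):

```go
import (
	"k8s.io/client-go/util/retry"

	wfv1 "github.com/argoproj/argo/pkg/apis/workflow/v1alpha1"
)

// Sketch only: instead of merging concurrent changes with reapplyUpdate,
// goroutine 2 could re-read the workflow and re-apply its mutation whenever
// the update is rejected with a resource-version conflict.
func updateWithRetry(
	get func() (*wfv1.Workflow, error), // fresh read from the Kubernetes API
	mutate func(*wfv1.Workflow), // re-apply the in-memory changes to the fresh copy
	update func(*wfv1.Workflow) error, // Update call; a conflict error triggers a retry
) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		wf, err := get()
		if err != nil {
			return err
		}
		mutate(wf)
		return update(wf)
	})
}
```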

Further review threads on test/e2e/argo_server_test.go and test/e2e/offloading_test.go were marked outdated and resolved.
@alexec merged commit 3293c83 into argoproj:master on Jan 22, 2020
@alexec deleted the nos branch on January 22, 2020 at 01:18
A later review thread on the following migration code:

```go
Update(s.tableName).
	Set("clustername", s.clusterName).
	Where(db.Cond{"clustername": nil}).
	And(db.Cond{"uuid": uid}).
```
@markterm (Contributor) commented on Feb 3, 2020:

Should this line read the following (as there is no uuid column)?

```go
And(db.Cond{"uid": uid}).
```

@alexec (Author):

yes - though why exactly any SQL framework would not throw some kind of exception at this clear error is beyond me 😠 😠 😠 😠
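
For what it's worth, a defensive sketch of the corrected backfill (assuming the upper/db sqlbuilder API from the diff; s.session, fmt, and the error/rows-affected checks are illustrative additions), which would have surfaced the bad column name:

```go
// Sketch only: the corrected statement, with the result checked so a mistake
// like "uuid" vs "uid" shows up as an error or as zero rows affected.
res, err := s.session.Update(s.tableName).
	Set("clustername", s.clusterName).
	Where(db.Cond{"clustername": nil}).
	And(db.Cond{"uid": uid}).
	Exec()
if err != nil {
	return fmt.Errorf("failed to back-fill clustername: %w", err)
}
if rows, err := res.RowsAffected(); err == nil {
	log.WithFields(log.Fields{"uid": uid, "rowsAffected": rows}).Info("Back-filled clustername")
}
```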

@alexec (Author):

actually - maybe this code never runs

@markterm (Contributor):

It's running for me when migrating a legacy database ...

[screenshot]

@alexec (Author):

Did it migrate successfully? Given that it's likely that every record should have "default", it may just be luck - though I ask again: who writes a framework that lets such a big error through?

@alexec (Author):

No - reviewing the code, this backfill is implemented really stupidly - I think I'll make a patch fix and then a rewrite.

@alexec mentioned this pull request on Feb 3, 2020
Another review thread, on the back-fill code:

```go
logCtx := log.WithFields(log.Fields{"name": wf.Name, "namespace": wf.Namespace, "version": version})
logCtx.Info("Back-filling node status")
res, err := session.Update(archiveTableName).
	Set("version", wf.ResourceVersion).
```
@markterm (Contributor):

Also this line - the archive table doesn't have a version column
