Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Resolve a race condition in FlowDurabilityTest causing test flakes #568

Merged
merged 7 commits into from
Aug 5, 2022

Conversation

Riliane
Copy link
Contributor

@Riliane Riliane commented Jul 20, 2022

A race condition in FlowDurabilityTest.testResumeBlockedAddedAfterRunStart caused the test to flake.

org.junit.ComparisonFailure: expected:<[sleep]> but was:<[End of Pipeline]>
at org.junit.Assert.assertEquals(Assert.java:117)
at org.junit.Assert.assertEquals(Assert.java:146)
at org.jenkinsci.plugins.workflow.cps.FlowDurabilityTest.assertIncludesNodes(FlowDurabilityTest.java:665)
at org.jenkinsci.plugins.workflow.cps.FlowDurabilityTest$23.evaluate(FlowDurabilityTest.java:788)

The last "sleep" node was not loaded on restart.
Even though all individual nodes were persisted properly, a change to the head node ID was being done by another thread and was not completing before the original thread went on to restart Jenkins.
CpsThreadGroup.notifyNewHead starts a thread every time it needs to write a new head to the build.xml file. (in line 499)
Timeline of the failure:

  • the sleep step node is persisted with id 5 in an xml file
  • notifyNewHead is called with node with id 5 which is the sleep step node - thread updating the head starts
  • main thread starts writing snapshot
  • thread created in notifyNewHead reaches the writing stage - too late, snapshot has been written with the old head!

The solution is to wait for CpsFlowExecution to go into SUSPENDED state as it finishes persisting build.xml before going on with the restart.
The same flake seems to be happening in two more tests - likely same solution applies.

  • Make sure you are opening from a topic/feature/bugfix branch (right side) and not your main branch!
  • Ensure that the pull request title represents the desired changelog entry
  • Please describe what you did
  • Link to relevant issues in GitHub or Jira
  • Link to relevant pull requests, esp. upstream and downstream changes
  • [n/a] Ensure you have provided tests - that demonstrates feature works or fixes the issue

avoid race condition causing flake
Copy link

@jmdesprez jmdesprez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link

@Pldi23 Pldi23 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes seems clear to me, let's fix other flaky tests in this class (I assume the fix would be similar).
I also see the PR was not build because of test failure, what are your thoughts is it other flake? If yes probably good idea would be to fix it also to avoid multiple releases?

@Riliane
Copy link
Contributor Author

Riliane commented Jul 21, 2022

@Pldi23 it's a different test class:

org.jenkinsci.plugins.workflow.cps.CpsFlowDefinitionRJRTest.smokes
connect: Address is invalid on local machine, or port is not valid on remote machine

Probably an unstable connection issue. I could take a look into the logs and see if I can hunt that down

@jglick
Copy link
Member

jglick commented Jul 22, 2022

also see #570 (comment)

@jglick jglick added the tests label Jul 22, 2022
@Riliane
Copy link
Contributor Author

Riliane commented Jul 25, 2022

@jglick yes, testFullyDurableSurvivesDirtyRestart is the exact same flake - the writing of the head ID into the build.xml is cut short, the head is not set to the sleep node, the sleep node is not reloaded on restart, and the test then checks in verifySafelyResumed whether the head node is a semaphore or sleep one.

@Riliane Riliane marked this pull request as ready for review August 3, 2022 12:34
Copy link
Member

@jglick jglick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks OK—again, I am not deeply familiar with this. @dwnusbaum?

@jglick jglick requested a review from dwnusbaum August 4, 2022 12:24
Copy link
Member

@dwnusbaum dwnusbaum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Everything looks generally fine to me, but I would make the new method private like Jesse mentioned.

private static void verifyDirtyResumed(JenkinsRule rule, WorkflowRun run, String logStart) throws Exception {
assertHasTimingAction(run.getExecution());
rule.waitForCompletion(run);
Assert.assertEquals(Result.SUCCESS, run.getResult());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW there is JenkinsRule.assertBuildStatus for this purpose, though it hardly matters.

@jglick jglick enabled auto-merge August 5, 2022 12:56
@jglick jglick merged commit 9101f1b into jenkinsci:master Aug 5, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants