
CI: rerun always if any failure #26574

Closed

Conversation


@v1v v1v commented Jun 29, 2021

What does this PR do?

Rerun any failed stages automatically up to 3 times. Those stages are:

  • build
  • lint
  • test (unit, ITs, cloud)

This applies to every Beat.

Retries are excluded for:

  • packaging Linux
  • packaging ARM
  • k8s testing

In addition, a rerun.json file is archived listing the stages that were retried and their parameters; this should help debug how often retries happen.
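For illustration, here is a minimal Groovy sketch of the stage-level retry loop described above. The helper names (runWithRetry, runCommand) are hypothetical, not the actual helpers in this PR's Jenkinsfile:

```groovy
import groovy.transform.Field

// Minimal sketch only: runWithRetry, runCommand and the field below are
// illustrative names; the real implementation lives in this PR's Jenkinsfile.
@Field Map rerunData = [:]

def runWithRetry(Map args) {
  int numberOfRetries = 3
  for (int currentRetry = 1; currentRetry <= numberOfRetries; currentRetry++) {
    try {
      runCommand(args)   // e.g. 'mage build test' on a freshly provisioned worker
      return             // success: stop retrying
    } catch (err) {
      if (currentRetry == numberOfRetries) {
        throw err        // genuine failure: give up after the last attempt
      }
      echo "[WARN] ${args.context} failed - ${currentRetry} out of ${numberOfRetries}, let's try again and discard any kind of flakiness."
      // Track the retried stage and its parameters for the archived rerun.json.
      rerunData[args.context] = args + [numberOfRetries: numberOfRetries, currentRetry: currentRetry + 1]
      writeJSON(file: 'rerun.json', json: rerunData)
    }
  }
}
```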

Why is it important?

To cope with flakiness, we previously built a way to rerun a given commit manually, discarding the stages that had already succeeded.
This proposal adds the retry logic to the pipeline itself, so every stage gets a chance to run again.

IMPORTANT: a genuine failure will also be retried up to 3 times :/ . I assume this is the price we pay to reduce flakiness; we could potentially add a test analyser between retries, but that would add complexity and a maintenance burden.

Issues

Tests

Failed stage

If a stage fails, its test results won't be archived as long as there are retries remaining:


$ curl -s https://beats-ci.elastic.co/blue/rest/organizations/jenkins/pipelines/Beats/pipelines/beats/branches/PR-26574/runs/10/nodes/6077/log/\?start\=0 | grep flaki
[2021-07-06T09:21:32.138Z] [WARN] x-pack/filebeat-build failed - 1 out of 3, let's try again and discard any kind of flakiness.

$ curl -s https://beats-ci.elastic.co/blue/rest/organizations/jenkins/pipelines/Beats/pipelines/beats/branches/PR-26574/runs/10/nodes/6077/log/\?start\=0 | grep disab
script returned exit code 1[2021-07-06T09:21:23.128Z] [WARN] archiveTestOutput is disabled. Reason (archive: 'true', (numberOfRetries == currentRetry): false, failed: true)
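
The Reason in that warning suggests an archiving guard roughly like the following (a hedged sketch: archiveTestOutput's real signature and the junit pattern are assumptions; the field names mirror rerun.json):

```groovy
// Sketch of the guard implied by the log above: results are archived only on
// success, or once the final retry has run.
def archiveTestOutput(Map args) {
  def lastRetry = (args.numberOfRetries == args.currentRetry)
  if (args.archive && (lastRetry || !args.failed)) {
    // Assumed result location; the actual glob may differ per Beat.
    junit(allowEmptyResults: true, testResults: "${args.directory}/build/TEST-*.xml")
  } else {
    echo "[WARN] archiveTestOutput is disabled. Reason (archive: '${args.archive}', " +
         "(numberOfRetries == currentRetry): ${lastRetry}, failed: ${args.failed})"
  }
}
```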

Retried stages

$ curl -s https://beats-ci.elastic.co/job/Beats/job/beats/job/PR-26574/10/consoleText | grep flaki
[2021-07-06T09:21:32.138Z] [WARN] x-pack/filebeat-build failed - 1 out of 3, let's try again and discard any kind of flakiness.
[2021-07-06T10:16:31.369Z] [WARN] x-pack/winlogbeat-windows-10-windows-10 failed - 1 out of 3, let's try again and discard any kind of flakiness.

$ curl -s https://beats-ci.elastic.co/job/Beats/job/beats/job/PR-26574/10/artifact/rerun.json | jq .
{
  "x-pack/filebeat-build": {
    "context": "x-pack/filebeat-build",
    "command": "mage build test",
    "directory": "x-pack/filebeat",
    "label": "immutable && ubuntu-18",
    "withModule": true,
    "isMage": true,
    "id": "x-pack/filebeat-build",
    "numberOfRetries": 3,
    "currentRetry": 2,
    "package": false,
    "dockerArch": "amd64"
  },
  "x-pack/winlogbeat-windows-10-windows-10": {
    "context": "x-pack/winlogbeat-windows-10-windows-10",
    "command": "mage build unitTest",
    "directory": "x-pack/winlogbeat",
    "label": "windows-10",
    "withModule": false,
    "isMage": true,
    "id": "x-pack/winlogbeat-windows-10-windows-10",
    "numberOfRetries": 3,
    "currentRetry": 2,
    "package": false,
    "dockerArch": "amd64"
  }
}

@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Jun 29, 2021
@v1v v1v added automation backport-v7.13.0 Automated backport with mergify backport-v7.14.0 Automated backport with mergify Team:Automation Label for the Observability productivity team labels Jun 29, 2021
@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Jun 29, 2021

elasticmachine commented Jun 29, 2021

💚 Build Succeeded


Build stats

  • Reason: Stage retries: 0

  • Start Time: 2021-07-15T11:51:03.748+0000

  • Duration: 0 min 0 sec

  • Commit: 5e40150

Test stats 🧪

Test Results
  • Failed: 0
  • Passed: 49267
  • Skipped: 5396
  • Total: 54663

Trends 🧪: [charts of build times and test counts omitted]

💚 Flaky test report

Tests succeeded.


v1v added 9 commits July 5, 2021 14:28
…stage-failed-within-same-build

* upstream/master: (36 commits)
  Revert "[CI] fight the flakiness with some retry option in the CI only for the Pull Requests (elastic#26617)" (elastic#26704)
  Packaging: linux/armv7 is not supported (elastic#26706)
  Cyberarkpas: Link to official docs on how to setup TLS (elastic#26614)
  Make network_direction, registered_domain and convert processors compatible with ES older than 7.13.0 (elastic#26676)
  Disable armv7 packaging (elastic#26679)
  [Heartbeat] use --params flag for synthetics (elastic#26674)
  Update dependent package to avoid downloading a suspicious file (elastic#26406)
  [mergify] set title and allow bp in any direction (elastic#26648)
  Fix memory leak in SQL helper when database is not available (elastic#26607)
  [CI] fight the flakiness with some retry option in the CI only for the Pull Requests (elastic#26617)
  [mergify] automate PRs that change the backport rules (elastic#26641)
  [Metricbeat] Add Airflow module in xpack (elastic#26220)
  chore: add-backport-next (elastic#26620)
  [metricbeat] Add state_job metricset (elastic#26479)
  CI: jenkins labels are less time consuming now (elastic#26613)
  [MetricBeat] [AWS] Fix aws metric tags with resourcegroupstaggingapi paginator (elastic#26385) (elastic#26443)
  Move openmetrics module to oss (elastic#26561)
  Skip flaky test TestFilestreamMetadataUpdatedOnRename (elastic#26609)
  [filebeat][fortinet] Use default add_locale for fortinet.firewall (elastic#26524)
  Enroll proxy settings (elastic#26514)
  ...
@v1v v1v marked this pull request as ready for review July 6, 2021 11:22
@v1v v1v requested a review from a team as a code owner July 6, 2021 11:22
@v1v v1v self-assigned this Jul 6, 2021
@kuisathaverat

I vote to retry only the command, directly at https://github.com/elastic/beats/blob/master/Jenkinsfile#L565, and nothing else.


v1v commented Jul 6, 2021

I vote to retry only the command, directly at https://github.com/elastic/beats/blob/master/Jenkinsfile#L565, and nothing else.

If we do so, we cannot guarantee the worker is in good shape, for the reasons below (see the sketch at the end of this comment):

  1. the mage/mage command could fail if a reused workspace contains corrupted files.
  2. reused workers can fail while preparing the build context. This case won't be covered if we add a retry in Jenkinsfile#L565.
  3. preparing the build context might fail while fetching the tools (network glitches). This case won't be covered if we add a retry in Jenkinsfile#L565.

Also, we cannot skip retrying genuine failures in stages such as linting or packaging, which normally don't suffer from flakiness.

#26736 is the one that contains the changes from your suggestion
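
Roughly, the difference between the two placements looks like this (a sketch: withNode comes from apm-pipeline-library, while prepareBuildContext, the label string, and the withNode parameters are illustrative):

```groovy
// Option A (this PR): the retry wraps the whole stage, so every attempt gets a
// fresh worker and a fresh workspace.
retry(3) {
  withNode(labels: 'immutable && ubuntu-18') {  // new node + workspace per attempt
    prepareBuildContext()                       // covered by the retry
    sh(label: 'build/test', script: 'mage build test')
  }
}

// Option B (Jenkinsfile#L565): the retry wraps only the command, so a reused or
// corrupted workspace is carried into every attempt.
withNode(labels: 'immutable && ubuntu-18') {
  prepareBuildContext()                         // NOT covered by the retry
  retry(3) {
    sh(label: 'build/test', script: 'mage build test')
  }
}
```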

@v1v v1v added backport-v7.15.0 Automated backport with mergify and removed backport-v7.13.0 Automated backport with mergify labels Jul 6, 2021
@v1v v1v mentioned this pull request Jul 6, 2021
@kuisathaverat

If we do so, we cannot guarantee the worker is in good shape […]

Most of the failures are test failures, which are covered by the retry.


v1v commented Jul 6, 2021

Most of the failures are test failures, which are covered by the retry.

Still, 10% of the Beats builds get a reused worker within the same build.

[chart: worker-reuse numbers for Beats builds]

We fixed the reused-workspace problem with https://github.com/elastic/apm-pipeline-library/blob/7f03e76e64c3a615a3ccdc8b911fbd236112daa7/vars/withNode.groovy#L38-L39 but, as you can see in the numbers above, worker reuse is still there.

So we can give your suggestion a go, though I'd like to add some further configuration to exclude the retry for linting and packaging.


v1v commented Jul 6, 2021

Superseded by #26736

Let's try the simpler approach first; if needed, we can come back to this one.

@v1v v1v closed this Jul 6, 2021
@v1v v1v reopened this Jul 13, 2021

mergify bot commented Jul 13, 2021

This pull request is now in conflicts. Could you fix it? 🙏
To fix up this pull request, you can check it out locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b feature/retry-if-stage-failed-within-same-build upstream/feature/retry-if-stage-failed-within-same-build
git merge upstream/master
git push upstream feature/retry-if-stage-failed-within-same-build


v1v commented Jul 14, 2021

/test

@v1v v1v closed this Jul 14, 2021
@v1v v1v reopened this Jul 14, 2021

v1v commented Jul 15, 2021

We have agreed to abandon this approach, to avoid adding more complexity to the pipeline, and to wait for the fix in the CI ecosystem.

@v1v v1v closed this Jul 15, 2021