Retry for OpenSearch Container #304

Merged
merged 1 commit into from
Oct 11, 2022

Conversation

nhtruong
Collaborator

@nhtruong nhtruong commented Oct 3, 2022

Description

Tests that run docker-compose up randomly fail with a "no tests found" error. Rerunning the failed jobs resolves the issue. The cause is a bug in the OpenSearch build that makes the container fail intermittently. This PR works around it by restarting the container when the failure occurs.

Issues Resolved

closes #268

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@dblock
Member

dblock commented Oct 3, 2022

Maybe we can either fix the underlying issue or use something like https://www.npmjs.com/package/retry instead of rolling our own?

@nhtruong
Collaborator Author

nhtruong commented Oct 3, 2022

@dblock I'm looking into finding/fixing the source myself. I'm just checking if we can use retry to ease the pain of manually retrying failed jobs. Unfortunately, the retry I implemented in the first commit will still fail all 3 attempts if the first attempt fails. Now I'm using this PR to test out different strategies. My current suspicion is that TAP is somehow looking for TS or JSX files as if TAP_TS and TAP_TXS were set to 1.

@nhtruong nhtruong changed the title Implemented Retries for Integration workflows Implemented Retry for OpenSearch Container Oct 4, 2022
@nhtruong
Collaborator Author

nhtruong commented Oct 4, 2022

Found the cause and a workaround. Please check the updated description for details.

@nhtruong nhtruong marked this pull request as ready for review October 4, 2022 12:50
@nhtruong nhtruong requested a review from a team as a code owner October 4, 2022 12:50
@nhtruong nhtruong changed the title Implemented Retry for OpenSearch Container Retry for OpenSearch Container Oct 4, 2022
@dblock
Member

dblock commented Oct 4, 2022

I hate it but I think it works ;)

@dblock
Member

dblock commented Oct 4, 2022

Before we merge this, is there a way to retry the tests at the GHA level? I'm worried about hacks on top of hacks.

@harshavamsi
Collaborator

Could look into https://github.com/marketplace/actions/retry-action and https://github.com/marketplace/actions/retry-step. Validate container start up docker and execute retry logic.

@nhtruong
Collaborator Author

nhtruong commented Oct 4, 2022

Could look into https://github.com/marketplace/actions/retry-action and https://github.com/marketplace/actions/retry-step. Validate container start up docker and execute retry logic.

I thought of retry-step for the Container Startup step but decided not to use it:

  • It will retry the whole step while we only need to restart the container instead of going through the entire clean/build/boot-up sequence.
  • This step never fails (or fails silently), so we would still have to modify it to grep for the kill signals and call exit 1 explicitly to make retry-step retry.
  • The workaround is in the Makefile, so people running make commands on their local machines benefit from it too, not just the GitHub workflow.

I agree that grepping for a piece of text is not pretty 😞. There are potentially more elegant workarounds if we want to invest more time.
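
The grep-and-restart idea described above could look roughly like the sketch below. It is not the code from this PR: the function name `restart_if_killed`, the `DOCKER_BIN` override, the container name, and the marker text are all hypothetical, and the real log message would have to be taken from the failing container's output. The sketch relies only on `docker logs` and `docker restart`.

```shell
#!/bin/sh
# Sketch only: the marker text and container name are placeholders,
# not the values used in this PR.
DOCKER_BIN="${DOCKER_BIN:-docker}"

# Restart the container if its logs contain the marker string.
# Returns 0 when a restart was triggered, 1 when the marker was
# not found, and 2 when the restart command itself failed.
restart_if_killed() {
  container="$1"
  marker="$2"
  # -F treats the marker as a fixed string, -q suppresses output.
  if "$DOCKER_BIN" logs "$container" 2>&1 | grep -qF -- "$marker"; then
    "$DOCKER_BIN" restart "$container" >/dev/null || return 2
    return 0
  fi
  return 1
}
```

It could be called once after `docker-compose up -d`, for example `restart_if_killed opensearch "Killed"`, where both arguments are placeholders. Only the container is restarted, so the clean/build/boot-up sequence is not repeated.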

@nhtruong
Collaborator Author

nhtruong commented Oct 4, 2022

Maybe we can use this: https://hub.docker.com/r/willfarrell/autoheal/
I'll take a look this afternoon.

@nhtruong
Collaborator Author

nhtruong commented Oct 5, 2022

@dblock @harshavamsi
Sooo the best solution is stupidly simple! Because the docker-compose up step did not fail on the failed tests, I believed that the container was still running and that only the OpenSearch service inside it had crashed. What actually happened is that the container itself crashed soon after docker-compose up. We just need to tell Docker to always restart the container when it crashes (instead of the default, no).

https://hub.docker.com/r/willfarrell/autoheal/ is not appropriate for this situation because it's meant for crashed services, not crashed containers.
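
As a rough illustration of the restart policy mentioned here: in a Compose file the setting is the service-level key `restart: always`. The sketch below shows the equivalent check and change from the command line using `docker inspect` and `docker update`. The function names, the `DOCKER_BIN` override, and the container name are hypothetical and are not taken from this PR.

```shell
#!/bin/sh
# Sketch only: the container name passed in is a placeholder.
DOCKER_BIN="${DOCKER_BIN:-docker}"

# Print the name of the container's current restart policy.
restart_policy() {
  "$DOCKER_BIN" inspect --format '{{.HostConfig.RestartPolicy.Name}}' "$1"
}

# Switch an existing container to the "always" restart policy
# unless it already uses it.
ensure_always_restart() {
  container="$1"
  if [ "$(restart_policy "$container")" != "always" ]; then
    "$DOCKER_BIN" update --restart always "$container" >/dev/null
  fi
}
```

A restart policy only acts when the container itself exits, so it would not help in runs where the process inside is killed but the container keeps running.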

@nhtruong
Collaborator Author

nhtruong commented Oct 5, 2022

Testing a few more times to make sure that the problem is solved.

@nhtruong
Collaborator Author

nhtruong commented Oct 5, 2022

I spoke too soon. This bug doesn't always crash the container when the OpenSearch process is killed. It only does that sometimes.

Signed-off-by: Theo Truong <theotr@amazon.com>
@dblock
Member

dblock commented Oct 11, 2022

@nhtruong This is pretty clean retry-wise. Do you want to merge it, or does it not work?

@nhtruong
Collaborator Author

@dblock I've tested this on over 700 runs and none failed. So, it's good to go!

@dblock dblock merged commit 5efb1aa into opensearch-project:main Oct 11, 2022
@nhtruong nhtruong deleted the flaky_workflow branch October 11, 2022 21:28
AMoo-Miki pushed a commit to AMoo-Miki/opensearch-js that referenced this pull request Jul 12, 2023
Signed-off-by: Theo Truong <theotr@amazon.com>
AMoo-Miki added a commit to AMoo-Miki/opensearch-js that referenced this pull request Jul 12, 2023
* Manually sync CHANGELOG.md, release-drafter.yml, RELEASING.md, and USER_GUIDE.md
* Version Bump: 2.3.0 (opensearch-project#546)
* Upgrade and secure the backport workflow (opensearch-project#547)
* Add serialization and deserialization of numerals larger than `Number.MAX_SAFE_INTEGER` (opensearch-project#544)
* Add upgrading NPM to all workflows running older Node.js versions (opensearch-project#545)
* Bump @types/node from 20.1.4 to 20.2.5 (opensearch-project#528)
* Bump @aws-sdk/types from 3.329.0 to 3.341.0 (opensearch-project#527)
* Bump ora from 6.3.0 to 6.3.1 (opensearch-project#524)
* Bump node-fetch from 3.2.10 to 3.3.1 (opensearch-project#526)
* Bump eslint from 8.39.0 to 8.41.0 (opensearch-project#525)
* Bump @types/node from 18.16.0 to 20.1.4 (opensearch-project#518)
* Bump xmlbuilder2 from 3.1.0 to 3.1.1 (opensearch-project#519)
* Bump semver from 7.3.8 to 7.5.1 (opensearch-project#520)
* Added timursaurus to MAINTAINERS.md (opensearch-project#517)
* Bump @aws-sdk/types from 3.257.0 to 3.329.0 (opensearch-project#516)
* Bump @babel/eslint-parser from 7.21.3 to 7.21.8 (opensearch-project#513)
* Bump simple-git from 3.17.0 to 3.18.0 (opensearch-project#512)
* Bump xmlbuilder2 from 3.0.2 to 3.1.0 (opensearch-project#503)
* Bump eslint from 8.38.0 to 8.39.0 (opensearch-project#504)
* Bump @types/node from 18.15.11 to 18.16.0 (opensearch-project#499)
* Bump ora from 6.1.2 to 6.3.0 (opensearch-project#500)
* Bump prettier from 2.8.7 to 2.8.8 (opensearch-project#501)
* Implemented Docker Image caching for `integration-unreleased` workflow (opensearch-project#498)
* [Bug] Fixed bundler.yml (opensearch-project#497)
* Bumped Version 2.2.1 In preparation for release (opensearch-project#495)
* Bump @types/node from 18.15.10 to 18.15.11 (opensearch-project#493)
* Bump eslint from 8.33.0 to 8.38.0 (opensearch-project#494)
* Bump rimraf from 4.4.0 to 5.0.0 (opensearch-project#492)
* [CCI] feat: add missing createConnection type (opensearch-project#490)
* Add Search guide (opensearch-project#489)
* Create search guide (opensearch-project#488)
* [CCI] Create advanced index actions guide (opensearch-project#483)
* [CCI] Connection `request` method callback (opensearch-project#478)
* feat: NotCompatibleError (opensearch-project#484)
* Bump @babel/eslint-parser from 7.19.1 to 7.21.3 (opensearch-project#486)
* Bump jsdoc from 4.0.0 to 4.0.2 (opensearch-project#485)
* Bump prettier from 2.8.4 to 2.8.7 (opensearch-project#487)
* Create document lifecycle guide (opensearch-project#481)
* Create index lifecycle guide (opensearch-project#482)
* Create bulk guide (opensearch-project#480)
* [CCI] Create index_template guide (opensearch-project#479)
* Bump split2 from 4.1.0 to 4.2.0 (opensearch-project#467)
* Bump deepmerge from 4.3.0 to 4.3.1 (opensearch-project#468)
* Bump @types/node from 18.15.3 to 18.15.10 (opensearch-project#462)
* Bump eslint-config-prettier from 8.6.0 to 8.8.0 (opensearch-project#463)
* [CCI] Update developerGuide regarding yarn troubleshoot steps (opensearch-project#456)
* [CCI] Bump caniuse-lite from 1.0.30001249 to 1.0.30001469  (opensearch-project#459)
* added the solution for the possible error during yarn installation on Windows OS (opensearch-project#453)
* Fixed deprecation warnings (opensearch-project#446)
* [CCI] Don't run tests on PRs with doc changes only  (opensearch-project#441)
* [CCI] Add pull request template (opensearch-project#440)
* Bump @types/node from 17.0.45 to 18.15.3 (opensearch-project#448)
* Bump simple-git from 3.16.0 to 3.17.0 (opensearch-project#447)
* [CCI] Update Developer Guide (opensearch-project#436)
* [CCI] Remove waitCluster in Integration Tests (opensearch-project#423)
* Bump tsd from 0.25.0 to 0.27.0 (opensearch-project#431)
* Bump rimraf from 4.1.1 to 4.4.0 (opensearch-project#432)
* [CCI] doc: fix grammar (opensearch-project#427)
* Fix deprecated folder mapping "./" in the "exports" field (opensearch-project#416)
* [CCI] fix: pass required `data` argument to SerializationError class (opensearch-project#419)
* Bump tap from 16.3.0 to 16.3.4 (opensearch-project#413)
* [CCI] Refactor: Remove unnecessary `data` argument when invoking `OpenSearchClientError` (opensearch-project#421)
* Bump deepmerge from 4.2.2 to 4.3.0 (opensearch-project#414)
* Updated Maintainers list and changelogs (opensearch-project#409)
* Downgraded @types/node to 17 from 18 due to breaking changes. (opensearch-project#405)
* Bump aws4 from 1.11.0 to 1.12.0 (opensearch-project#406)
* Bump prettier from 2.8.3 to 2.8.4 (opensearch-project#407)
* Bump minimist from 1.2.7 to 1.2.8 (opensearch-project#388)
* Bump tsd from 0.24.1 to 0.25.0 (opensearch-project#389)
* Add missing types for AwsSigv4SignerOptions.service (opensearch-project#377)
* Created untriaged issue workflow. (opensearch-project#386) Daniel (dB.) Doubrovkine* 2/14/23, 12:14 PM
* Bump dezalgo from 1.0.3 to 1.0.4 (opensearch-project#383)
* Bump simple-statistics from 7.7.0 to 7.8.3 (opensearch-project#384)
* Bump eslint from 8.32.0 to 8.33.0 (opensearch-project#379)
*  Allow fields in BulkOperation to be optional (opensearch-project#378)
* Bump @types/node from 18.11.18 to 18.11.19 (opensearch-project#380)
* Bump ora from 5.4.1 to 6.1.2 (opensearch-project#376)
* Bump @aws-sdk/types from 3.226.0 to 3.257.0 (opensearch-project#375)
* Bump secure-json-parse from 2.4.0 to 2.7.0 (opensearch-project#369)
* Bump simple-git from 3.15.1 to 3.16.0 (opensearch-project#373)
* Updated user guide to include Amazon OpenSearch Serverless (opensearch-project#372)
* Added Support for AOSS (opensearch-project#366)
* Bump rimraf from 3.0.2 to 4.1.1 (opensearch-project#370)
* Bump eslint-config-prettier from 8.5.0 to 8.6.0 (opensearch-project#368)
* Bump hpagent from 0.1.2 to 1.2.0 (opensearch-project#361)
* Bump eslint from 8.30.0 to 8.32.0 (opensearch-project#362)
* Fixed missing namespace for API-PIT endpoints (opensearch-project#364)
* Bump prettier from 2.7.1 to 2.8.3 (opensearch-project#363)
* Ensure Dependabot PR workflow retriggers on label change (opensearch-project#360)
* Added point in time APIs (opensearch-project#348)
* Bump @types/node from 15.14.7 to 18.11.18 (opensearch-project#353)
* Bump split2 from 3.2.2 to 4.1.0 (opensearch-project#354)
* Bump json5 from 2.2.0 to 2.2.3 (opensearch-project#359)
* Bump @aws-sdk/types from 3.190.0 to 3.226.0 (opensearch-project#355)
* Updated MAINTAINERS.md to match recommended opensearch-project format. (opensearch-project#358) Daniel (dB.) Doubrovkine* 1/6/23, 8:36 AM
* Bump minimist from 1.2.6 to 1.2.7 (opensearch-project#350)
* Bump eslint from 7.32.0 to 8.30.0 (opensearch-project#346)
* Bump eslint-plugin-prettier from 4.0.0 to 4.2.1 (opensearch-project#351)
* Removed test artifacts from gh_pages workflow (opensearch-project#347)
* Bump minimatch from 3.0.4 to 3.1.2 (opensearch-project#345)
* Bump xmlbuilder2 from 2.4.1 to 3.0.2 (opensearch-project#321)
* Implement JSDOC (opensearch-project#337)
* Added skip-changelog label (opensearch-project#339)
* Bumps simple-git from 3.4.0 to 3.15.0 (opensearch-project#341)
* [FEATURE] Allow overriding the aws service identifier in AwsSigv4Signer (opensearch-project#333)
* Add release details and bump the jenkins lib version (opensearch-project#319)
* Bump Version from 2.1.0 to 2.1.1 (opensearch-project#318)
* Bump semver from 7.3.7 to 7.3.8 (opensearch-project#313)
* Add release workflows (opensearch-project#317)
* issue opensearch-project#291 mutability of headers (opensearch-project#311)
* Bump tsd from 0.22.0 to 0.24.1 (opensearch-project#312)
* Add changelog and changelog verifier (opensearch-project#306)
* Bump prettier from 2.6.2 to 2.7.1 (opensearch-project#310)
* Bump @aws-sdk/types from 3.160.0 to 3.190.0 (opensearch-project#309)
* Bump node-fetch from 2.6.7 to 3.2.10 (opensearch-project#281)
* Update Maintainers List (opensearch-project#308)
* Implemented Retry for OpenSearch Container (opensearch-project#304)
* Feature/add default credentials provider (opensearch-project#295)
* Fix new line problem between diffs (opensearch-project#302)
* Removed Unused Variables (opensearch-project#301)
* fix awssigv4signer.test.js tests not running (opensearch-project#294)
* fix: support TS resolution for .mjs (opensearch-project#296)
* Add AwsSigV4 signing functionality (opensearch-project#279)
* Fix opensearch-project#253 Cannot read property 'then' of null in Transport.js issue. (opensearch-project#283)
* Bump simple-git from 3.5.0 to 3.13.0 (opensearch-project#286)
* Bump semver from 7.3.5 to 7.3.7 (opensearch-project#280)
* Adds bulk example to README (opensearch-project#277)
* Adding Dependabot configuration (opensearch-project#278)
* Removing OpenDistro integration tests (opensearch-project#271)
* Upgrade dependencies (opensearch-project#272)
* fix: add missing memory circuit breaker options (opensearch-project#266)
* feat: allow `doc` overwrite in `onDocument` (opensearch-project#263)
* Adding link checker workflow (opensearch-project#262)
* Using standardized templates from .github (opensearch-project#249)
* Adding new OpenSearch versions and updating compatibility matrix (opensearch-project#257)

Signed-off-by: Miki <miki@amazon.com>
@AMoo-Miki AMoo-Miki mentioned this pull request Jul 12, 2023
5 tasks
Development

Successfully merging this pull request may close these issues.

[BUG] Integration tests are failing intermittently
4 participants