
[Security Solutions] Detection engine API integration tests fail and do not fail gracefully, CI aborts #125319

Closed
liza-mae opened this issue Feb 10, 2022 · 0 comments · Fixed by #125432
Labels: blocker; bug (Fixes for quality problems that affect the customer experience); Team:Detection Rule Management (Security Detection Rule Management Team)


@liza-mae

Version: 8.x

The detection engine API integration tests (security and spaces) are failing with the following error:

│ERROR Did not get an expected 200 "ok" when waiting for a rule success or status (waitForRuleSuccessOrStatus). CI issues could happen. Suspect this line if you are seeing CI issues. body: {"message":"id: \"238f2b40-8a1e-11ec-8e7c-bb69d325f0dd\" not found","status_code":404}, status: 404

Getting this error is not the most problematic part. The bigger problem is that once the tests get into this condition they continue to retry, causing CI to time out and abort after 5+ hours while flooding the CI console with 100,000+ lines of errors.

Two parts to address:

  • Limit the retries to an acceptable amount
  • Fix the root cause of this error
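
As a rough illustration of the first point, a polling helper can carry a hard attempt budget and log each miss at debug level, raising a single error only when the budget is spent. This is a minimal sketch, not the actual Kibana test utility: `pollUntil`, `DebugLogger`, `PollOptions`, and the `maxAttempts`/`delayMs` parameters are hypothetical names chosen for the example.

```typescript
// Hypothetical helper for illustration only; not the real Kibana test utility.
interface DebugLogger {
  debug: (msg: string) => void;
}

interface PollOptions {
  maxAttempts: number; // hard cap on how many times the check runs
  delayMs: number; // pause between two consecutive attempts
}

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Runs `check` until it resolves to true or the attempt budget is spent.
// Misses are logged at debug level; one error is thrown at the end
// instead of an error line per retry.
async function pollUntil(
  check: () => Promise<boolean>,
  log: DebugLogger,
  { maxAttempts, delayMs }: PollOptions
): Promise<number> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    if (await check()) {
      return attempt; // number of attempts it took to succeed
    }
    log.debug(`attempt ${attempt}/${maxAttempts}: condition not met yet`);
    if (attempt < maxAttempts) {
      await sleep(delayMs);
    }
  }
  throw new Error(`condition not met after ${maxAttempts} attempts`);
}
```

With a helper of this shape, a rule that never appears (for example, a persistent 404) fails the test after a bounded number of attempts rather than retrying until CI aborts.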

I am labeling it a blocker for testing.

@liza-mae liza-mae added bug Fixes for quality problems that affect the customer experience blocker Team:Detection Rule Management Security Detection Rule Management Team labels Feb 10, 2022
@FrankHassanabad FrankHassanabad self-assigned this Feb 11, 2022
FrankHassanabad added a commit that referenced this issue Feb 11, 2022
…es one log.error to log.debug to avoid spamming cloud servers (#125432)

## Summary

Fixes #125319

Reduces timeouts from 10 minutes per test down to 2 minutes and changes one log.error to log.debug to avoid spamming cloud servers.

I tested this by making a test fail here:
```
x-pack/test/detection_engine_api_integration/security_and_spaces/tests/aliases.ts
```

I made it fail by increasing the `4` in `await waitForSignalsToBePresent(supertest, log, 4, [id]);` to `5`. I then watched the logs and timed the run to confirm it does not take longer than 2 minutes.

I did this by running the server:
```sh
 node scripts/functional_tests_server.js --config test/detection_engine_api_integration/security_and_spaces/config.ts
```

And then running the client:
```sh
 node scripts/functional_test_runner.js --config test/detection_engine_api_integration/security_and_spaces/config.ts --include test/detection_engine_api_integration/security_and_spaces/tests/aliases.ts
```

If we start to see flakiness again on the regular build servers, we might have to increase this number slightly. If it still doesn't allow failing servers to complete fully, we might need to decrease it further or make it more configurable.
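
One way the "more configurable" option could look is a small resolver that takes an optional override and falls back to the 2-minute default. This is only a sketch under assumptions: `resolveTimeoutMs`, `DEFAULT_TIMEOUT_MS`, and the idea of passing the override as a string of seconds are hypothetical and not part of this change.

```typescript
// Hypothetical sketch; the names and the seconds-based override are assumptions.
const DEFAULT_TIMEOUT_MS = 2 * 60 * 1000; // the 2-minute value described above

// Parses an optional override given in seconds. Missing, empty,
// non-numeric, or non-positive values fall back to the default.
function resolveTimeoutMs(
  raw: string | undefined,
  fallbackMs: number = DEFAULT_TIMEOUT_MS
): number {
  if (raw === undefined || raw.trim() === '') {
    return fallbackMs;
  }
  const seconds = Number(raw);
  if (!isFinite(seconds) || seconds <= 0) {
    return fallbackMs;
  }
  return Math.round(seconds * 1000);
}
```

A caller could feed this from an environment variable (the variable name would also be an assumption), so that regular build servers and cloud runs can use different limits without a code change.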

### Checklist

Delete any items that are not applicable to this PR.

- [x] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios
kibanamachine pushed a commit to kibanamachine/kibana that referenced this issue Feb 11, 2022
…es one log.error to log.debug to avoid spamming cloud servers (elastic#125432)


(cherry picked from commit 6d88579)
kibanamachine pushed a commit to kibanamachine/kibana that referenced this issue Feb 11, 2022
…es one log.error to log.debug to avoid spamming cloud servers (elastic#125432)


(cherry picked from commit 6d88579)
kibanamachine added a commit that referenced this issue Feb 11, 2022
…es one log.error to log.debug to avoid spamming cloud servers (#125432) (#125441)


(cherry picked from commit 6d88579)

Co-authored-by: Frank Hassanabad <frank.hassanabad@elastic.co>
kibanamachine added a commit that referenced this issue Feb 11, 2022
…es one log.error to log.debug to avoid spamming cloud servers (#125432) (#125442)


(cherry picked from commit 6d88579)

Co-authored-by: Frank Hassanabad <frank.hassanabad@elastic.co>
FrankHassanabad added a commit that referenced this issue Mar 15, 2022
…126294)

## Summary

Increases the timeouts from 2 minutes to 5 minutes and unskips the detection tests. If any of the tests fail consistently, I will skip just those individual tests instead of the whole suite.

See #125851 and elastic/elasticsearch#84256

This could cause issues with:
#125319

If so, we will have to deal with the cloud-based tests in a different way, but in reality we need the extra time because some test cases take a while to run on CI.

This also:
* Removes skips that have been in the code for a while and adds documentation to the parts that are left over.

### Checklist
- [x] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios
maksimkovalev pushed a commit to maksimkovalev/kibana that referenced this issue Mar 18, 2022
…lastic#126294)
