Switch integration tests to provision stacks in the CFT region in production #3459

Closed
cmacknz opened this issue Sep 21, 2023 · 8 comments · Fixed by #3701

@cmacknz
Member

cmacknz commented Sep 21, 2023

We have recently had stability problems with the QA GCP region, because it isn't granted the same reliability guarantees as the staging or production regions.

Luckily there is a dedicated cloud first testing (CFT) region for internal use, gcp-us-west2 (Los Angeles), with the same stability guarantees as production.

Let's switch our integration tests to use that region. We could also decide to use staging cloud, which would give us more cloud providers and regions to test against, but with slightly less stability because it still runs pre-production cloud code. Generally the CFT region is the best place for internal testing like our integration tests.

@cmacknz cmacknz added the Team:Elastic-Agent Label for the Agent team label Sep 21, 2023
@elasticmachine
Contributor

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

@ycombinator
Contributor

++ to using the more-stable CFT region in production. One of the reasons for using the QA environment was that deployments in that environment are auto-terminated after 24 hours. We will need to build some automation for handling that when we switch over to the CFT region in production.

@blakerouse
Contributor

++ QA cloud stability is definitely a constant issue we hit when running the tests. Does the region provide access to snapshot builds?

@ycombinator
Contributor

> Does the region provide access to snapshot builds?

Yes, it does.

Screenshot 2023-09-22 at 02 16 58

@pchila
Member

pchila commented Sep 22, 2023

This is blocked by #3463 and #3456.
Until we have proper tracking of created deployments and a Buildkite hook that performs the cleanup in all cases, we cannot switch to CFT, where leaked deployments will keep consuming resources until somebody realizes there's a problem.

I would also add a daily job that cleans up all integration test deployments older than 24h (we need to label them or otherwise make them easily discoverable, not just by name).
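A minimal sketch of the selection step such a daily job could use, assuming deployments carry a label and a creation timestamp. The label value `integration-tests`, the record keys `id`, `labels`, and `created_at`, and the function name are all hypothetical; this does not call or mirror any real Elastic Cloud API, and listing and deleting deployments is left out.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical label marking a deployment as created by the integration tests.
TEST_LABEL = "integration-tests"
# Age threshold taken from the comment above (24h).
MAX_AGE = timedelta(hours=24)


def stale_test_deployments(deployments, now=None):
    """Return the IDs of labeled test deployments older than MAX_AGE.

    `deployments` is a list of dicts with the hypothetical keys
    "id" (str), "labels" (list of str) and "created_at"
    (timezone-aware datetime).
    """
    now = now or datetime.now(timezone.utc)
    return [
        d["id"]
        for d in deployments
        # Select by label rather than by name, and only past the age limit.
        if TEST_LABEL in d["labels"] and now - d["created_at"] > MAX_AGE
    ]
```

Selecting by label rather than by name keeps the job from touching deployments that merely look like test deployments, which matters more in a production region than it did in QA.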

@jlind23
Contributor

jlind23 commented Oct 3, 2023

@pmoust in the CFT region, is there any way to tag deployments that need to be removed after a given time period?

@pchila
Member

pchila commented Oct 5, 2023

As a first step to move away from QA, I am preparing a quick PR to move to staging. It's not the long-term solution, but it should help a bit.

@cmacknz
Member Author

cmacknz commented Nov 1, 2023

Pulled into the current sprint as we have too many stability problems in the non-prod testing regions. Let's move to the gcp-us-west2 CFT region. We will pair this change with #3463, which should minimize the number of deployments we leak.

Separately, we will create a scheduled job to detect orphaned deployments, but we will no longer consider this a blocker for this change. The thinking is that most of the deployments we leak are ones that fail to come up, which should be less likely in the CFT region in prod.
