fix: Add retry on locks #4997

Fabianoshz · 2024-10-11T00:42:15Z

what

I'm opening this as a draft to receive feedback early, I don't expect this to break anything but I believe it could be hidden behind a flag and with better default values for timeout and retries (maybe exponential retry?).

This adds a retry logic to the lock mechanism to mitigate the issue described in here and also in this ADR.

The locking issue itself is more complex and requires much more work, this is just a small step so users don't have to see the error anymore effectively making the code wait instead of asking the user to retry.

why

Currently the user has to rerun any operations that fail because a certain workspace path is locked, this tries just to automate the process.

tests

Will add if this approach receive support.

references

Did my best to try to understand which issues this would affect.

Relates to #3345
Relates to #2921
Relates to #2882

Relates to #4489
Relates to #305
Relates to #4829
Relates to #1847
Relates to #4566

Closes #1618
Closes #2200
Closes #3785
Closes #4489
Closes #4368

Fabianoshz · 2024-10-12T11:30:12Z

After some testing, this is what I'm seeing if I run 2 plan at the same time.

This doesn't really fix the issue, but at least reduces the amount of times users have to deal with the error messages.

X-Guardian · 2024-11-02T12:24:36Z

I don't see that this change is adding much value or helping on any of the issues you have linked. The only scenario where it will help is in the example you have given, i.e. running two plan commands consecutively on the same commit, which is not what most people have issues with.

jamengual · 2024-11-02T20:06:29Z

I don't see that this change is adding much value or helping on any of the issues you have linked. The only scenario where it will help is in the example you have given, i.e. running two plan commands consecutively on the same commit, which is not what most people have issues with.

There are a lot of people using terragrunt, atmos and other tools that can run many projects/plans at once, so I see how this help those users.

X-Guardian · 2024-11-02T20:13:29Z

I use Terragrunt extensively with Atlantis running parallel plans/applies in a PR, and don't have any locking issues that this change would make any difference to.

Fabianoshz · 2024-11-04T14:45:10Z

I don't see that this change is adding much value or helping on any of the issues you have linked. The only scenario where it will help is in the example you have given, i.e. running two plan commands consecutively on the same commit, which is not what most people have issues with.

I truly believe we still have much to do to make sure that the locking issue is dealt with. Although I'm not so sure that this code only is applicable for simultaneous commands, I've only used it to showcase the issue since is the same mechanism (TryLock). Maybe I'm overly optimistic about this...

Buy I may argue that the UX will be even better for most cases. The current behavior is:

John tries to plan common-dir and dir-1
While John still has the locks for common-dir and dir-1 Mary tries to plan common-dir and dir-2
Mary receives The default workspace is currently locked by another command that is running for this pull request.

This PR will change this to:

John tries to plan common-dir and dir-1
While John has the locks for common-dir and dir-1 Mary tries to plan common-dir and dir-2
When John finishes using the lock, Mary acquires the lock (given it didn't reached the timeout)

As I stated here most users are suffering from seeing the message that just says "try again later", I'm automating this step so users only see this message if we are more or less sure that there is a real issue with the plan taking too long (which can be configured by the timeout setting).

Fabianoshz · 2024-11-04T14:52:58Z

I've used to work in a company that had a pretty big Atlantis install, unfortunately I don't anymore so I can't really test this in large scale, I invite anyone who can test this PR to give it a try.

Two areas of improvement I already can see:

We could evaluate if a user is trying to trigger the same command while another one is already being executed, this would allow us to avoid this scenario while proving a better UX, as I believe it's not good to let duplicated commands reach this locking mechanism.
We could queue executions instead of just blindly retrying it as I'm doing it here now if we want to have FIFO, queueing is a bit harder though because we might have issues if the queue gets to big because external issues.

lukemassa · 2024-12-18T00:21:57Z

I've seen this once or twice (for example if you're impatient and run atlantis plan when it was already auto-planning); I'm not against the idea of having essentially automating the "try again later" step, I feel like spinning a full 15m by default with no way to disable seems aggressive? Maybe the default behavior should be disabled? I feel like the possibility of starving threads/livelock/etc is high with so many people's different use cases. Maybe I'm being paranoid

Fabianoshz · 2024-12-18T14:09:42Z

I feel like spinning a full 15m by default with no way to disable seems aggressive? Maybe the default behavior should be disabled? I feel like the possibility of starving threads/livelock/etc is high with so many people's different use cases. Maybe I'm being paranoid

I believe a lot of other things would break before this, AFAIK tickers are pretty lightweight so I don't think they would consume a lot of resources and if the system is too slow the tickers might fail and we might skip a few checks but that's not a big issue for our logic.

But to make this more clear and resilient we could add another return value in this line, this value could represent the status of the lock acquire process (acquired, pull running and workspace running). This would leave the decision to the caller, let's say the caller is the manual plan requester, if the return is pull running we could return a more appropriated message, something like: "There is another plan request being executed for this PR, wait until it is finished", or if the return is worspace running: "There is another plan request using this workspace, your plan execution is queued" .

This would mean that each caller would need to implement either a queue system or the retry mechanism like the one I've added in this PR. We could also inject an interface with functions to send the messages, I'm fine with either, I'm also open to other suggestions.

lukemassa · 2024-12-31T20:34:37Z

I don't know; I'm still worried about the complexity. 15m is a long time after I type "atlantis apply" for something to happen.

For example in your scenario, when Mary goes to run, if she immediately gets an error saying that another PR is locked, she could look and see it's John, then reach out to John, who might have completely lost track of the PR and say "yeah go ahead and close that" at which point Mary can apply her PR. This happens a lot in my workflows, and so adding an additional 15m delay before Mary knows even what the problem is would actually slow this process down for me. My preference would be for this to be disabled by default (timeout = 0) and configurable.

As for having the TryLock return a more specific message I think that could be useful, it would especially help with the "it's spinning and I don't know why" problem from above. As you said though that affects the callers as well, so it might be worth doing as a separate follow up.

Fabianoshz mentioned this pull request Oct 11, 2024

docs(adr): Project Locks 0002 #3345

Open

Fabianoshz force-pushed the add-rety-on-locks branch 2 times, most recently from 284e6f9 to d313d5a Compare October 11, 2024 12:14

Fabianoshz changed the title ~~Add retry on locks~~ fix: Add retry on locks Oct 11, 2024

Fabianoshz force-pushed the add-rety-on-locks branch 2 times, most recently from e6d6135 to 6fdaa09 Compare October 12, 2024 11:23

fix: Add retry on locks

f554a9b

Fabianoshz force-pushed the add-rety-on-locks branch from 6fdaa09 to f554a9b Compare October 12, 2024 11:23

Merge branch 'main' into add-rety-on-locks

cea9455

Fabianoshz marked this pull request as ready for review October 12, 2024 11:31

Fabianoshz requested review from a team as code owners October 12, 2024 11:31

Fabianoshz requested review from GenPage, nitrocode and X-Guardian and removed request for a team October 12, 2024 11:31

github-actions bot added the go Pull requests that update Go code label Oct 12, 2024

Merge branch 'main' into add-rety-on-locks

d8daea6

Merge branch 'main' into add-rety-on-locks

13976d6

Fabianoshz and others added 2 commits December 21, 2024 08:41

Merge branch 'main' into add-rety-on-locks

f25a2db

Merge branch 'main' into add-rety-on-locks

75eeb17

jamengual added the waiting-on-review Waiting for a review from a maintainer label Dec 31, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: Add retry on locks #4997

fix: Add retry on locks #4997

Fabianoshz commented Oct 11, 2024

Fabianoshz commented Oct 12, 2024

X-Guardian commented Nov 2, 2024

jamengual commented Nov 2, 2024

X-Guardian commented Nov 2, 2024

Fabianoshz commented Nov 4, 2024 •

edited

Loading

Fabianoshz commented Nov 4, 2024

lukemassa commented Dec 18, 2024

Fabianoshz commented Dec 18, 2024 •

edited

Loading

lukemassa commented Dec 31, 2024

fix: Add retry on locks #4997

Are you sure you want to change the base?

fix: Add retry on locks #4997

Conversation

Fabianoshz commented Oct 11, 2024

what

why

tests

references

Fabianoshz commented Oct 12, 2024

X-Guardian commented Nov 2, 2024

jamengual commented Nov 2, 2024

X-Guardian commented Nov 2, 2024

Fabianoshz commented Nov 4, 2024 • edited Loading

Fabianoshz commented Nov 4, 2024

lukemassa commented Dec 18, 2024

Fabianoshz commented Dec 18, 2024 • edited Loading

lukemassa commented Dec 31, 2024

Fabianoshz commented Nov 4, 2024 •

edited

Loading

Fabianoshz commented Dec 18, 2024 •

edited

Loading