Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Retry fork/exec errors when running hook #2325

Merged
merged 9 commits into from
Aug 29, 2023

Conversation

triarius
Copy link
Contributor

@triarius triarius commented Aug 28, 2023

A customer reported that their jobs were unable to run a hook because the hook wrapper script was locked.

The hooks should have collision avoidant names, and they should have closed by the agent worker goroutine that created them prior to execution, which is in the same goroutine. So it's a bit of a mystery how the script wrapper could remain open.

We could just add more retry to work around this, but it would like to find the cause. So, I've done two things in this PR:

  • Retry executing the hook if we get a fork/exec error
  • If the retry is exhausted, log the error and try to determine what process has the file open

That latter will only work if:

  • the process that has the file open is long-lived enough for the error handling to catch it with the file open, and
  • the agent has permission to read the open file descriptors of the other processes

So we can't be certain what the other process is, but at least we should be able to prove or disprove that it is the agent itself.

Detecting other processes only works on Linux because it uses the procfs. For other Unixes, we can use /dev/fd to check that the agent process does not have the file open. I'm not sure if this will work if the two goroutines are scheduled on different OS threads, however.

@triarius triarius force-pushed the pdp-1510-retry-and-log-hook-forkexec-errors branch from 99df9c1 to 18ffa43 Compare August 28, 2023 03:26
internal/job/executor.go Outdated Show resolved Hide resolved
internal/job/executor.go Outdated Show resolved Hide resolved
@triarius
Copy link
Contributor Author

@moskyb after chatting with @pda, we decided to feed back errors from failing to close files that are opened in NewScriptWrapper. If the files are kept open by the agent, that's the likely cause.

@triarius triarius requested a review from moskyb August 28, 2023 06:05
@triarius triarius merged commit add7e26 into main Aug 29, 2023
@triarius triarius deleted the pdp-1510-retry-and-log-hook-forkexec-errors branch August 29, 2023 01:32
DrJosh9000 added a commit that referenced this pull request Apr 22, 2024
Since #2325, we learned that the "text file busy" error reported by the customer was probably golang/go#22315.

This change makes the retry and log pathway more specific to that error, and retries more times more frequently.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants