Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update Cmd IO handling #1937

Merged
merged 1 commit into from
Feb 20, 2024
Merged

Update Cmd IO handling #1937

merged 1 commit into from
Feb 20, 2024

Conversation

helsaawy
Copy link
Contributor

@helsaawy helsaawy commented Oct 17, 2023

Have internal\cmd.Cmd ignore relay and close errors from the underlying IO channel being closed, since not all
io.Reader/io.Writers return an io.EOF if the are closed during an IO operation.

This standardizes behavior between an hcs/gcs Process (which use a go-winio.win32File for their IO channels, and return io.EOF for read and write operations on a closed handle) and JobProcess (which uses an os.Pipe that instead return an os.ErrClosed).

Additionally, ignore errors from closing an already-closed std IO channel.

Update Cmd.Wait to return a known error value if it times out waiting on IO copy after the command exits (and update TestCmdStuckIo to check for that error).
Prior, the test checked for an io.ErrClosedPipe, which:

  1. is not the best indicator that IO is stuck; and
  2. is now ignored as an error value raised durin IO relay.

@helsaawy helsaawy requested a review from a team as a code owner October 17, 2023 22:29
internal/cmd/cmd.go Outdated Show resolved Hide resolved
@kevpar
Copy link
Member

kevpar commented Oct 19, 2023

This standardizes behavior between an hcs/gcs Process (which use a go-winio.win32File for their IO channels, and return io.EOF for read and write operations on a closed handle) and JobProcess (which uses an io.Pipe that instead return an io.ErrClosedPipe).

I'm having trouble verifying what you're fixing here. Can you give more detail on the bug you're addressing? From what I can see, job containers use an os.Pipe, not an io.Pipe.

@helsaawy
Copy link
Contributor Author

I'm having trouble verifying what you're fixing here. Can you give more detail on the bug you're addressing? From what I can see, job containers use an os.Pipe, not an io.Pipe.

thats a typo on my side, sorry (stuckIOProcess in cmd_test.go uses io.Pipe, which returns io.ErrClosedPipe).
os.Pipe returns os.ErrClosed (the underlying file descriptor returns poll.ErrFileClosing when reading from a pipe, which is then converted to os.ErrClosed; this is valided by them checking for ErrClosed in the pipe tests, and not io.EOF).

Ultimately, that means waiting on a Cmd created in a JobContainer can fail with read |0: file already closed, which is not in line with behavior of other ProcessHosts.

// Terminate the IO if the copy does not complete in the requested time.
timeoutErrCh := make(chan error)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of having a separate channel for timeout error, can we just have a separate error variable that will be set only in the <-t.C: block? That way we don't need to worry about closing the channel and the following code block:

if tErr := <-timeoutErrCh; ioErr == nil {
    ioErr = tErr
}

can just be

if ioErr == nil {
    ioErr = tErr
}

This approach seems easier to read/understand for me at least. What do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we need to block on timeoutErrCh regardless, so we need some mechanism to have Wait actually wait for the timeout

the alternatives (imo) would be to add a function that selects on <-allDoneCh and ctx.Done(), and pass it a context with the CopyAfterExitTimeout or something similar, which i would rather save for when we refactor Cmd wholesale to properly thread contexts everywhere

internal/cmd/io.go Outdated Show resolved Hide resolved
Copy link
Member

@kevpar kevpar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on some testing I'm a little confused about the error handling behavior you're fixing.

I created a test that opens a connected Reader/Writer, then closes the Writer, and attempts a read from the Reader. I tried this in cases where they were backed by os.Pipe, and by a named pipe (created via winio). In both cases, the read returned io.EOF.

Do you recall some more context on what you were seeing?

Copy link
Contributor

@ambarve ambarve left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One comment (about using variables instead of channels) that might help simplify the (c *Cmd) Start() function. But don't want to hold the PR for that. LGTM otherwise.

@helsaawy
Copy link
Contributor Author

Based on some testing I'm a little confused about the error handling behavior you're fixing.

I created a test that opens a connected Reader/Writer, then closes the Writer, and attempts a read from the Reader. I tried this in cases where they were backed by os.Pipe, and by a named pipe (created via winio). In both cases, the read returned io.EOF.

Do you recall some more context on what you were seeing?

If you run the host process IO tests from #1964 (specifically TestContainerIO/WCOW_HostProcess and TestContainerExec/WCOW_HostProcess) with and without these changes, they fail on cmd.Wait, which returns
read |0: file already closed
This which ultimately comes from relayIO failing:
level=error msg="Cmd IO relay complete" bytes=10 error="read |0: file already closed" file=stdout

(test src)

@kevpar
Copy link
Member

kevpar commented Feb 7, 2024

If you run the host process IO tests from #1964 (specifically TestContainerIO/WCOW_HostProcess and TestContainerExec/WCOW_HostProcess) with and without these changes, they fail on cmd.Wait, which returns read |0: file already closed This which ultimately comes from relayIO failing: level=error msg="Cmd IO relay complete" bytes=10 error="read |0: file already closed" file=stdout

(test src)

I dug into this a bit. I learned that file already closed is not returned when the other end of the pipe has closed, but rather returned if the reader end is closed, and then we try to read from it. This indicated to me there was some underlying logic error, because why would we ever want to close something and then read from it?

After investigating, it seems you've actually found a bug in how job containers handle IO pipes :D So these test cases like TestContainerIO are great to get in place! However, the change here unfortunately would have hidden the symptoms of the bug, but not fixed the underlying issue.

The real problem here is that job containers are not following the contract expected by (*Cmd).Wait. What's supposed to happen is:

  1. We wait on the Process until it is done.
  2. We wait until either the IO relaying has completed, or we hit the IO timeout. This is to ensure any remaining data in the IO pipes is allowed to drain.
  3. We close our ends of the IO pipes.

In the case of job containers (or anything using internal/exec, really), when the process wait completes, the parent end of the IO pipes are closed automatically! This is why our pipe closed while we were still trying to read from it.

The proper fix here is to not close the parent end of the pipes when the process exits, and let it be handled by (*Cmd).Wait instead. I'm working on a fix for this. Once that's in, we shouldn't need the isClosedIOErr part of this PR anymore.

@kevpar
Copy link
Member

kevpar commented Feb 8, 2024

Here's the PR for the fix I mentioned: #2021

@kevpar
Copy link
Member

kevpar commented Feb 12, 2024

#2021 has been merged. You should be unblocked now.

Update `Cmd.Wait` to return a known error value if it times out waiting
on IO copy after the command exits (and update `TestCmdStuckIo` to check
for that error).
Prior, the test checked for an `io.ErrClosedPipe`, which:
1. is not the best indicator that IO is stuck; and
2. is now ignored as an error value raised during IO relay.

Update `stuckIOProcess` logic in `cmd_test.go` to mirror logic in
`interal/exec.Exec`, using `os.Pipe` for std io that returns an `io.EOF`
(instead of `io.Pipe`, which does not).

Signed-off-by: Hamza El-Saawy <hamzaelsaawy@microsoft.com>
Copy link
Member

@kevpar kevpar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@helsaawy helsaawy merged commit 7458e58 into microsoft:main Feb 20, 2024
18 of 19 checks passed
@helsaawy helsaawy deleted the cmd-io-err branch February 20, 2024 18:56
princepereira pushed a commit to princepereira/hcsshim that referenced this pull request Aug 29, 2024
Update `Cmd.Wait` to return a known error value if it times out waiting
on IO copy after the command exits (and update `TestCmdStuckIo` to check
for that error).
Prior, the test checked for an `io.ErrClosedPipe`, which:
1. is not the best indicator that IO is stuck; and
2. is now ignored as an error value raised during IO relay.

Update `stuckIOProcess` logic in `cmd_test.go` to mirror logic in
`interal/exec.Exec`, using `os.Pipe` for std io that returns an `io.EOF`
(instead of `io.Pipe`, which does not).

Signed-off-by: Hamza El-Saawy <hamzaelsaawy@microsoft.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants