Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[lldb] Improved lldb-server stability for remote launching #100659

Closed
wants to merge 1 commit into from

Conversation

slydiman
Copy link
Contributor

We faced the issue running cross api tests in 8 threads. The executable is installed to the target by the process lldb-server platform, but launched by the another process lldb-server gdbserver. We got the error ETXTBSY on Linux and ERROR_SHARING_VIOLATION on Windows. It seems the known issue and ProcessLauncherPosixFork.cpp already contains the workaround, but it is not enough. Updated the workaround with the total timeout 5 seconds and added the same workaround to ProcessLauncherWindows.cpp too.

@llvmbot
Copy link
Member

llvmbot commented Jul 25, 2024

@llvm/pr-subscribers-lldb

Author: Dmitry Vasilyev (slydiman)

Changes

We faced the issue running cross api tests in 8 threads. The executable is installed to the target by the process lldb-server platform, but launched by the another process lldb-server gdbserver. We got the error ETXTBSY on Linux and ERROR_SHARING_VIOLATION on Windows. It seems the known issue and ProcessLauncherPosixFork.cpp already contains the workaround, but it is not enough. Updated the workaround with the total timeout 5 seconds and added the same workaround to ProcessLauncherWindows.cpp too.


Full diff: https://github.com/llvm/llvm-project/pull/100659.diff

2 Files Affected:

  • (modified) lldb/source/Host/posix/ProcessLauncherPosixFork.cpp (+4-2)
  • (modified) lldb/source/Host/windows/ProcessLauncherWindows.cpp (+18-2)
diff --git a/lldb/source/Host/posix/ProcessLauncherPosixFork.cpp b/lldb/source/Host/posix/ProcessLauncherPosixFork.cpp
index 0a832ebad13a7..637c2846e6bb2 100644
--- a/lldb/source/Host/posix/ProcessLauncherPosixFork.cpp
+++ b/lldb/source/Host/posix/ProcessLauncherPosixFork.cpp
@@ -201,7 +201,7 @@ struct ForkLaunchInfo {
   execve(info.argv[0], const_cast<char *const *>(info.argv), info.envp);
 
 #if defined(__linux__)
-  if (errno == ETXTBSY) {
+  for (int i = 0; i < 50; ++i) {
     // On android M and earlier we can get this error because the adb daemon
     // can hold a write handle on the executable even after it has finished
     // uploading it. This state lasts only a short time and happens only when
@@ -210,7 +210,9 @@ struct ForkLaunchInfo {
     // shell" command in the fork() child before it has had a chance to exec.)
     // Since this state should clear up quickly, wait a while and then give it
     // one more go.
-    usleep(50000);
+    if (errno != ETXTBSY)
+      break;
+    usleep(100000);
     execve(info.argv[0], const_cast<char *const *>(info.argv), info.envp);
   }
 #endif
diff --git a/lldb/source/Host/windows/ProcessLauncherWindows.cpp b/lldb/source/Host/windows/ProcessLauncherWindows.cpp
index baa422c15cae2..143f20cb3f510 100644
--- a/lldb/source/Host/windows/ProcessLauncherWindows.cpp
+++ b/lldb/source/Host/windows/ProcessLauncherWindows.cpp
@@ -113,14 +113,30 @@ ProcessLauncherWindows::LaunchProcess(const ProcessLaunchInfo &launch_info,
   // command line is not empty, its contents may be modified by CreateProcessW.
   WCHAR *pwcommandLine = wcommandLine.empty() ? nullptr : &wcommandLine[0];
 
-  BOOL result = ::CreateProcessW(
+  BOOL result;
+  DWORD last_error = 0;
+  // This is the workaround for the error "The process cannot access the file
+  // because it is being used by another process". Note the executable file is
+  // installed to the target by the process `lldb-server platform`, but launched
+  // by the process `lldb-server gdbserver`. Sometimes system may block the file
+  // for some time after copying.
+  for (int i = 0; i < 50; ++i) {
+    result = ::CreateProcessW(
       wexecutable.c_str(), pwcommandLine, NULL, NULL, TRUE, flags, env_block,
       wworkingDirectory.size() == 0 ? NULL : wworkingDirectory.c_str(),
       &startupinfo, &pi);
+    if (!result) {
+      last_error = ::GetLastError();
+      if (last_error != ERROR_SHARING_VIOLATION)
+        break;
+      ::Sleep(100);
+    } else
+      break;
+  }
 
   if (!result) {
     // Call GetLastError before we make any other system calls.
-    error.SetError(::GetLastError(), eErrorTypeWin32);
+    error.SetError(last_error, eErrorTypeWin32);
     // Note that error 50 ("The request is not supported") will occur if you
     // try debug a 64-bit inferior from a 32-bit LLDB.
   }

Copy link

github-actions bot commented Jul 25, 2024

✅ With the latest revision this PR passed the C/C++ code formatter.

@@ -201,7 +201,7 @@ struct ForkLaunchInfo {
execve(info.argv[0], const_cast<char *const *>(info.argv), info.envp);

#if defined(__linux__)
if (errno == ETXTBSY) {
for (int i = 0; i < 50; ++i) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it would be a good idea to make the timeout configurable?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is redundant. Currently lldb-server contains a hardcoded timeout in many places. Note the total timeout 10 seconds will cause the error "Sending vRun packet failed". So 5 seconds is good enough for the stable connection and do the trick on slow machines.

We faced the issue running cross api tests in 8 threads. The executable is installed to the target by the process `lldb-server platform`, but launched by the another process `lldb-server gdbserver`. We got the error ETXTBSY on Linux and ERROR_SHARING_VIOLATION on Windows. It seems the known issue and ProcessLauncherPosixFork.cpp already contains the workaround, but it is not enough. Updated the workaround with the total timeout 5 seconds and added the same workaround to ProcessLauncherWindows.cpp too.
@slydiman slydiman force-pushed the lldb-server-ETXTBSY branch from 6e24f75 to 21fd03f Compare July 25, 2024 22:46
slydiman added a commit to slydiman/llvm-project that referenced this pull request Jul 25, 2024
…t mapping

Removed fork(). Used threads and the common thread-safe port map for all platform connections.

Updated lldb::FileSystem to use llvm::vfs::createPhysicalFileSystem() with an own virtual working directory per thread.

This patch depends on llvm#100659, llvm#100666.

This patch fixes llvm#97537, llvm#90923, llvm#56346.

lldb-server has been tested on Windows with 50 connections and 100 processes launched simultaneously. Tested also the cross build with Linux x86_64 host and Linux Aarch64 target.
Copy link
Collaborator

@labath labath left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The workaround specifically mentions Android (M). The problem there was that the ADB daemon (part of the OS, which we use for faster file uploads) was implemented (had a bug) such that the file could remain open for a short while (between fork() and exec()) in another process when the daemon is accepting a new connection.

We fixed that issue, but we since the this comes with the device, we still had to work around the problem in old devices. We're finally reaching the point when we could remove the workaround (android <=M is less that 3% of active devices), so I wouldn't want to extend it without understanding the problem further.

As I understand it, you're using the regular lldb-server platform process to implement file uploads, which I think means the problem should be fully within our control. We basically need to make sure that:

  • lldb waits for the vFile:close response before sending the vRun packet
  • lldb-server actually closes the file handle (all of them -- i.e., makes sure it does not leak it) before sending the vFile:close response

If that's not what's happening right now, then we ought to fix it.

@labath
Copy link
Collaborator

labath commented Jul 26, 2024

If that's not what's happening right now, then we ought to fix it.

Or at least, understand why is that impossible.

@slydiman
Copy link
Contributor Author

slydiman commented Jul 26, 2024

@labath Of course lldb waits for the vFile:close response before sending the vRun packet and lldb-server actually closes the file handle (all of them). No any leaks. Otherwise this workaround wouldn't work.

The behavior is the same on Linux and Windows targets. I launched 100 connections and 200 processes simultaneously on Windows (lldb-server gdbserver + a test app). I got 3..10 fails because of the error ERROR_SHARING_VIOLATION. After this patch I got 0..3 fails for 100 connections and 0 fails for 50 connections. After closing the copied file probably the system may cache it some time per process. The file may be blocked by the built-in antivirus for some time. It is hard to figure out the exact reason.

We have a buildbot to run cross API tests in 8 threads with Linux Aarch64 target. All tests are green with the current (single thread) lldb-server. But we got randomly failed 50..60 tests (of 1190) with the multithreading version of lldb-server. Probably the multithreading version is just little bit faster and system did not unlock the executable in time. We noticed that usually failed tests use simple and tiny executables. But this fact does not help to explain the reason of the problem. We got 100% green tests after this patch.

@labath
Copy link
Collaborator

labath commented Jul 29, 2024

That may be how things works on windows(*), but I'm pretty sure it's not how things work on linux. A much more likely scenario is:

  1. thread 1 opens a file for writing (file descriptor A) and starts writing it
  2. thread 2 starts launching a gdb server. It calls fork(), which creates another process with a copy of fd A (the fd has the CLOEXEC flag set, but not CLOFORK (the flag doesn't exist))
  3. thread 1 finishes writing, closes fd A (in its/parent process)
  4. thread 1launches gdb-server, gdb-server tries to launch the file it has just written, gets ETXTBSY because the fd is still open in the child process

This isn't "operating system keeping the file open longer", this is us doing it (to be fair, the operating system is making it pretty hard to avoid doing that). And while this isn't an absolute thing (the workaround isn't as bad as the select thing, and it may also be possible to implement this in a way that the workaround isn't needed), I think this is a good reason to prefer a multiprocess implementation (where this situation does not occur because the fd cannot leak into another process).

All tests are green with the current (single thread) lldb-server.

Does this refer the the forking implementation you get with the --server flag, or the serial implementation which only handles one connection at a time (without the flag)? Because if it's the former, this is a good proof that the scenario above is the (only) problem here.

(*) I just don't know enough about the system to have an informed opinion. This isn't the first time I've heard about the antivirus hypothesis, but I find that somewhat surprising, as that could mean that something like $CC hello.cc && ./a.out could fail (what we're doing here isn't fundamentally different than that).

@labath
Copy link
Collaborator

labath commented Jul 29, 2024

  • thread 1 opens a file for writing (file descriptor A) and starts writing it

  • thread 2 starts launching a gdb server. It calls fork(), which creates another process with a copy of fd A (the fd has the CLOEXEC flag set, but not CLOFORK (the flag doesn't exist))

  • thread 1 finishes writing, closes fd A (in its/parent process)

  • thread 1launches gdb-server, gdb-server tries to launch the file it has just written, gets ETXTBSY because the fd is still open in the child process

In this scenario, waiting does make the exec succeed, because the FD in the forked process will not stay open very long. It will get closed as soon as the process runs execve (due to CLOEXEC).

@slydiman
Copy link
Contributor Author

slydiman commented Jul 29, 2024

Does this refer the the forking implementation you get with the --server flag, or the serial implementation which only handles one connection at a time (without the flag)?

I mean the --server flag. It is very hard to reproduce this issue with the serial implementation because it is much slower.

  • thread 1 opens a file for writing (file descriptor A) and starts writing it
  • thread 2 starts launching a gdb server. It calls fork(), which creates another process with a copy of fd A (the fd has the CLOEXEC flag set, but not CLOFORK (the flag doesn't exist))
  • thread 1 finishes writing, closes fd A (in its/parent process)
  • thread 1launches gdb-server, gdb-server tries to launch the file it has just written, gets ETXTBSY because the fd is still open in the child process

In this scenario, waiting does make the exec succeed, because the FD in the forked process will not stay open very long. It will get closed as soon as the process runs execve (due to CLOEXEC).

Your scenario is impossible. How FD will be closed by execve() if execve() failed with ETXTBSY?

  • lldb-server platform (a thread 1 or just a process) creates, writes and closes the file. The FD is closed and may be reused by the system. The client lldb received the response OK for vFile:close request.
  • The same thread 1 launched gdb server (fork+execve). The FD of the created and closed file cannot be copied any way.
  • The client lldb connects to gdb server and send the request vRun
  • gdb server did not create or write the executable. It never had the FD of this file and did not inherit it. gdb server just fork a child process and try to call execve(), but it failed with ETXTBSY.
  • gdb server waits little bit and try to call execve() again. It is successful after several attempts. No one closed the mythical FD during this time.

@slydiman
Copy link
Contributor Author

that could mean that something like $CC hello.cc && ./a.out could fail (what we're doing here isn't fundamentally different than that).

The difference is that cc creates a.out and exits itself. But lldb-server platform is still running after creating the executable. Something must be flushed?

@slydiman
Copy link
Contributor Author

See also golang/go#22315

@labath
Copy link
Collaborator

labath commented Jul 29, 2024

Does this refer the the forking implementation you get with the --server flag, or the serial implementation which only handles one connection at a time (without the flag)?

I mean the --server flag. It is very hard to reproduce this issue with the serial implementation because it is much slower.

Ok, so if I'm reading this right you're saying you saw no ETXTBSY errors with the current implementation --server flag. Is that correct ?

  • thread 1 opens a file for writing (file descriptor A) and starts writing it
  • thread 2 starts launching a gdb server. It calls fork(), which creates another process with a copy of fd A (the fd has the CLOEXEC flag set, but not CLOFORK (the flag doesn't exist))
  • thread 1 finishes writing, closes fd A (in its/parent process)
  • thread 1launches gdb-server, gdb-server tries to launch the file it has just written, gets ETXTBSY because the fd is still open in the child process

In this scenario, waiting does make the exec succeed, because the FD in the forked process will not stay open very long. It will get closed as soon as the process runs execve (due to CLOEXEC).

There are two threads and (at least two execve()s) happening here. I'm referring to the one on thread 2 (specifically, the fork child of thread 2). Your description describes what happens on one thread (well, one line of execution, corresponding to one e.g. test). Let me try this again. I'm just going to take your description, copy it twice and interleave it (italic is for one line of execution bold is for the second one, regular text is my commentary):

  1. lldb-server platform (a thread 1 or just a process) creates, writes
  2. note that at this point the file remains open in the lldb-platform process. This is going to be our mythical FD
  3. lldb-server platform (a thread 1 or just a process) creates, writes and closes the file. The FD is closed and may be reused by the system. The client lldb received the response OK for vFile:close request.
  4. The same thread 1 launched gdb server (fork
  5. note the fork creates a new process. The process is going to have a copy of the mythical FD
  6. and closes the file. The FD is closed and may be reused by the system. The client lldb received the response OK for vFile:close request.
  7. note the mythical FD is only closed in the parent process. It still exists in the fork child
  8. The same thread 1 launched gdb server (fork+execve). The FD of the created and closed file cannot be copied any way.
  9. We don't need to make a copy of the fd here. The copy was made earlier.
  10. gdb server did not create or write the executable. It never had the FD of this file and did not inherit it. gdb server just fork a child process and try to call execve(), but it failed with ETXTBSY.
  11. We get ETXTBSY because the FD is still open in the process forked on step 4
  12. +execve). The FD of the created and closed file cannot be copied any way.
  13. The mythical fd is fully closed after the execve() call
  14. gdb server waits little bit and try to call execve() again. It is successful after several attempts. No one closed the mythical FD during this time.
  15. Yes, they did.

@labath
Copy link
Collaborator

labath commented Jul 29, 2024

that could mean that something like $CC hello.cc && ./a.out could fail (what we're doing here isn't fundamentally different than that).

The difference is that cc creates a.out and exits itself. But lldb-server platform is still running after creating the executable. Something must be flushed?

Ok, this wasn't the best analogy. I still stand by my analysis of the problem though.

See also golang/go#22315

That's exactly the problem I'm describing here. And I'm considering something like golang/go#22315 (comment) as the solution (if we really do go through with this). The bug is that the FD gets leaked, and the fix is to make sure it doesn't get leaked. Waiting is a workaround because there's no guarantee that whoever we leak it to will close it.. ever. The only reason the workaround is here is because the bug was in third party code we can't change (everywhere.. we did change it, but only for new androids)

@slydiman
Copy link
Contributor Author

Ok, so if I'm reading this right you're saying you saw no ETXTBSY errors with the current implementation --server flag. Is that correct ?

Right. Initially I have marked #100670 as dependent on this.

Ok, agreed. So, we can try to use O_CLOFORK.
And simple solution is to call execve() as fast as possible and wait some time in case of ETXTBSY (this PR).

@labath
Copy link
Collaborator

labath commented Jul 29, 2024

Ok, so if I'm reading this right you're saying you saw no ETXTBSY errors with the current implementation --server flag. Is that correct ?

Right. Initially I have marked #100670 as dependent on this.

Ok, agreed. So, we can try to use O_CLOFORK. And simple solution is to call execve() as fast as possible and wait some time in case of ETXTBSY (this PR).

Umm.. by "current" I meant the current implementation that's in the llvm repository, so I'm not sure if we're agreeing to anything (yet).

O_CLOFORK doesn't exist (that's the really mythical part). I wish it did though...

I don't think we can call execve appreciably faster than we already do. I'm still not sure if I am ok with the wait workaround, but I think it could wait until we settle some other things first. For one, I'd like to hear your opinion on my port mapping alternative.

@slydiman
Copy link
Contributor Author

I'd like to hear your opinion on my port mapping alternative.

#100670 (comment)

slydiman added a commit to slydiman/llvm-project that referenced this pull request Jul 29, 2024
@slydiman
Copy link
Contributor Author

I have moved this patch to #100670.

@slydiman slydiman closed this Jul 29, 2024
slydiman added a commit to slydiman/llvm-project that referenced this pull request Aug 1, 2024
…t mapping

Removed fork(). Used threads and the common thread-safe port map for all platform connections.

Updated lldb::FileSystem to use llvm::vfs::createPhysicalFileSystem() with an own virtual working directory per thread.

This patch depends on llvm#100659, llvm#100666.

This patch fixes llvm#97537, llvm#90923, llvm#56346.

lldb-server has been tested on Windows with 50 connections and 100 processes launched simultaneously. Tested also the cross build with Linux x86_64 host and Linux Aarch64 target.
slydiman added a commit to slydiman/llvm-project that referenced this pull request Aug 1, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants