[lldb] Improved lldb-server stability for remote launching #100659

slydiman · 2024-07-25T21:53:38Z

We hit this issue while running the cross API tests in 8 threads. The executable is installed on the target by one process (`lldb-server platform`) but launched by another (`lldb-server gdbserver`). We got ETXTBSY on Linux and ERROR_SHARING_VIOLATION on Windows. This appears to be a known issue: ProcessLauncherPosixFork.cpp already contains a workaround, but it is not enough. This patch extends that workaround to a total timeout of 5 seconds and adds the same workaround to ProcessLauncherWindows.cpp.

llvmbot · 2024-07-25T21:54:09Z

@llvm/pr-subscribers-lldb

Author: Dmitry Vasilyev (slydiman)

Changes

We hit this issue while running the cross API tests in 8 threads. The executable is installed on the target by one process (`lldb-server platform`) but launched by another (`lldb-server gdbserver`). We got ETXTBSY on Linux and ERROR_SHARING_VIOLATION on Windows. This appears to be a known issue: ProcessLauncherPosixFork.cpp already contains a workaround, but it is not enough. This patch extends that workaround to a total timeout of 5 seconds and adds the same workaround to ProcessLauncherWindows.cpp.

Full diff: https://github.com/llvm/llvm-project/pull/100659.diff

2 Files Affected:

(modified) lldb/source/Host/posix/ProcessLauncherPosixFork.cpp (+4-2)
(modified) lldb/source/Host/windows/ProcessLauncherWindows.cpp (+18-2)

diff --git a/lldb/source/Host/posix/ProcessLauncherPosixFork.cpp b/lldb/source/Host/posix/ProcessLauncherPosixFork.cpp
index 0a832ebad13a7..637c2846e6bb2 100644
--- a/lldb/source/Host/posix/ProcessLauncherPosixFork.cpp
+++ b/lldb/source/Host/posix/ProcessLauncherPosixFork.cpp
@@ -201,7 +201,7 @@ struct ForkLaunchInfo {
   execve(info.argv[0], const_cast<char *const *>(info.argv), info.envp);
 
 #if defined(__linux__)
-  if (errno == ETXTBSY) {
+  for (int i = 0; i < 50; ++i) {
     // On android M and earlier we can get this error because the adb daemon
     // can hold a write handle on the executable even after it has finished
     // uploading it. This state lasts only a short time and happens only when
@@ -210,7 +210,9 @@ struct ForkLaunchInfo {
     // shell" command in the fork() child before it has had a chance to exec.)
     // Since this state should clear up quickly, wait a while and then give it
     // one more go.
-    usleep(50000);
+    if (errno != ETXTBSY)
+      break;
+    usleep(100000);
     execve(info.argv[0], const_cast<char *const *>(info.argv), info.envp);
   }
 #endif
diff --git a/lldb/source/Host/windows/ProcessLauncherWindows.cpp b/lldb/source/Host/windows/ProcessLauncherWindows.cpp
index baa422c15cae2..143f20cb3f510 100644
--- a/lldb/source/Host/windows/ProcessLauncherWindows.cpp
+++ b/lldb/source/Host/windows/ProcessLauncherWindows.cpp
@@ -113,14 +113,30 @@ ProcessLauncherWindows::LaunchProcess(const ProcessLaunchInfo &launch_info,
   // command line is not empty, its contents may be modified by CreateProcessW.
   WCHAR *pwcommandLine = wcommandLine.empty() ? nullptr : &wcommandLine[0];
 
-  BOOL result = ::CreateProcessW(
+  BOOL result;
+  DWORD last_error = 0;
+  // This is the workaround for the error "The process cannot access the file
+  // because it is being used by another process". Note the executable file is
+  // installed to the target by the process `lldb-server platform`, but launched
+  // by the process `lldb-server gdbserver`. Sometimes system may block the file
+  // for some time after copying.
+  for (int i = 0; i < 50; ++i) {
+    result = ::CreateProcessW(
       wexecutable.c_str(), pwcommandLine, NULL, NULL, TRUE, flags, env_block,
       wworkingDirectory.size() == 0 ? NULL : wworkingDirectory.c_str(),
       &startupinfo, &pi);
+    if (!result) {
+      last_error = ::GetLastError();
+      if (last_error != ERROR_SHARING_VIOLATION)
+        break;
+      ::Sleep(100);
+    } else
+      break;
+  }
 
   if (!result) {
     // Call GetLastError before we make any other system calls.
-    error.SetError(::GetLastError(), eErrorTypeWin32);
+    error.SetError(last_error, eErrorTypeWin32);
     // Note that error 50 ("The request is not supported") will occur if you
     // try debug a 64-bit inferior from a 32-bit LLDB.
   }

github-actions · 2024-07-25T21:57:17Z

✅ With the latest revision this PR passed the C/C++ code formatter.

bulbazord · 2024-07-25T21:57:08Z

lldb/source/Host/posix/ProcessLauncherPosixFork.cpp

@@ -201,7 +201,7 @@ struct ForkLaunchInfo {
  execve(info.argv[0], const_cast<char *const *>(info.argv), info.envp);

 #if defined(__linux__)
-  if (errno == ETXTBSY) {
+  for (int i = 0; i < 50; ++i) {


Maybe it would be a good idea to make the timeout configurable?

slydiman · 2024-07-25

I think that would be redundant. lldb-server already contains hardcoded timeouts in many places. Note that a total timeout of 10 seconds causes the error "Sending vRun packet failed". So 5 seconds is short enough to keep the connection stable and long enough to do the trick on slow machines.
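The retry budget discussed here (50 attempts, 100 ms apart, so roughly 5 seconds in the worst case) can be sketched as a small generic helper. This is only an illustration under stated assumptions: `RetryTransient` and `MaxSleepTime` are hypothetical names that do not exist in LLDB, and unlike the loops in the patch this version does not sleep after the final attempt.

```cpp
#include <cassert>
#include <chrono>
#include <thread>

// Hypothetical helper, not part of LLDB. Calls `attempt` until it succeeds
// (returns 0), fails with an error that `is_transient` rejects, or the
// attempt budget is used up. Returns the last error code seen.
template <typename Attempt, typename IsTransient>
int RetryTransient(int max_attempts, std::chrono::milliseconds delay,
                   Attempt attempt, IsTransient is_transient) {
  assert(max_attempts > 0 && "need at least one attempt");
  int error = 0;
  for (int i = 0; i < max_attempts; ++i) {
    error = attempt();
    if (error == 0 || !is_transient(error))
      break;
    if (i + 1 < max_attempts) // no point in sleeping after the last attempt
      std::this_thread::sleep_for(delay);
  }
  return error;
}

// Upper bound on the time spent sleeping between attempts.
constexpr std::chrono::milliseconds
MaxSleepTime(int max_attempts, std::chrono::milliseconds delay) {
  return delay * (max_attempts - 1);
}
```

With 50 attempts and a 100 ms delay, `MaxSleepTime` comes to 4900 ms, which stays well below the 10 seconds at which the vRun packet is reported to fail. A caller would pass a predicate that accepts only the transient code (ETXTBSY on Linux, ERROR_SHARING_VIOLATION on Windows) so that every other error is reported immediately.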


…t mapping Removed fork(). Used threads and the common thread-safe port map for all platform connections. Updated lldb::FileSystem to use llvm::vfs::createPhysicalFileSystem() with an own virtual working directory per thread. This patch depends on llvm#100659, llvm#100666. This patch fixes llvm#97537, llvm#90923, llvm#56346. lldb-server has been tested on Windows with 50 connections and 100 processes launched simultaneously. Tested also the cross build with Linux x86_64 host and Linux Aarch64 target.

labath

The workaround specifically mentions Android (M). The problem there was that the ADB daemon (part of the OS, which we use for faster file uploads) was implemented (had a bug) such that the file could remain open for a short while (between fork() and exec()) in another process when the daemon is accepting a new connection.

We fixed that issue, but since this comes with the device, we still had to work around the problem on old devices. We're finally reaching the point where we could remove the workaround (Android <= M is less than 3% of active devices), so I wouldn't want to extend it without understanding the problem further.

As I understand it, you're using the regular lldb-server platform process to implement file uploads, which I think means the problem should be fully within our control. We basically need to make sure that:

- lldb waits for the vFile:close response before sending the vRun packet
- lldb-server actually closes the file handle (all of them -- i.e., makes sure it does not leak it) before sending the vFile:close response

If that's not what's happening right now, then we ought to fix it.

labath · 2024-07-26T09:14:47Z

> If that's not what's happening right now, then we ought to fix it.

Or at least, understand why that is impossible.

slydiman · 2024-07-26T15:21:18Z

@labath Of course lldb waits for the vFile:close response before sending the vRun packet, and lldb-server actually closes the file handle (all of them). There are no leaks. Otherwise this workaround wouldn't work.

The behavior is the same on Linux and Windows targets. I launched 100 connections and 200 processes simultaneously on Windows (lldb-server gdbserver + a test app). I got 3 to 10 failures caused by ERROR_SHARING_VIOLATION. After this patch I got 0 to 3 failures for 100 connections and 0 failures for 50 connections. After the copied file is closed, the system probably keeps it cached per process for some time. The file may also be held by the built-in antivirus for a while. It is hard to figure out the exact reason.

We have a buildbot that runs the cross API tests in 8 threads with a Linux AArch64 target. All tests are green with the current (single thread) lldb-server. But 50 to 60 tests (of 1190) failed at random with the multithreaded version of lldb-server. Probably the multithreaded version is just a little bit faster and the system did not unlock the executable in time. We noticed that the failing tests usually use simple, tiny executables, but that does not help to explain the cause of the problem. All tests are green after this patch.

labath · 2024-07-29T08:14:07Z

That may be how things work on Windows (*), but I'm pretty sure it's not how things work on Linux. A much more likely scenario is:

1. thread 1 opens a file for writing (file descriptor A) and starts writing it
2. thread 2 starts launching a gdb server. It calls fork(), which creates another process with a copy of fd A (the fd has the CLOEXEC flag set, but not CLOFORK (the flag doesn't exist))
3. thread 1 finishes writing, closes fd A (in its/parent process)
4. thread 1 launches gdb-server, gdb-server tries to launch the file it has just written, gets ETXTBSY because the fd is still open in the child process

This isn't "operating system keeping the file open longer", this is us doing it (to be fair, the operating system is making it pretty hard to avoid doing that). And while this isn't an absolute thing (the workaround isn't as bad as the select thing, and it may also be possible to implement this in a way that the workaround isn't needed), I think this is a good reason to prefer a multiprocess implementation (where this situation does not occur because the fd cannot leak into another process).
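The core claim in this scenario, that a close-on-exec descriptor is still copied into a fork() child and outlives the parent's close(), can be checked with a short POSIX sketch. The function name and the scratch file path are made up for illustration, and the program does not reproduce ETXTBSY itself; it only shows the inherited copy surviving.

```cpp
#include <cstdlib>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Returns true if a descriptor marked close-on-exec is still open in a
// fork() child after the parent has closed its own copy.
bool FdSurvivesInForkChild() {
  char path[] = "/tmp/fd_inherit_demo_XXXXXX"; // hypothetical scratch file
  int fd = mkstemp(path); // opened read/write, like a file being uploaded
  if (fd < 0)
    return false;
  // Close-on-exec only affects execve(); fork() still copies the descriptor.
  fcntl(fd, F_SETFD, FD_CLOEXEC);

  pid_t pid = fork();
  if (pid < 0) {
    close(fd);
    unlink(path);
    return false;
  }
  if (pid == 0) {
    // Child: give the parent time to close its copy, then check whether the
    // inherited copy is still a valid descriptor.
    usleep(10000);
    _exit(fcntl(fd, F_GETFD) != -1 ? 0 : 1);
  }

  close(fd); // Parent: "finishes writing" and closes its own copy only.
  int status = 0;
  waitpid(pid, &status, 0);
  unlink(path);
  return WIFEXITED(status) && WEXITSTATUS(status) == 0;
}
```

On a POSIX system the function is expected to return true: until the child calls execve() or exits, the file still has an open write descriptor, which is the window in which another process trying to execute the file can see ETXTBSY.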

> All tests are green with the current (single thread) lldb-server.

Does this refer to the forking implementation you get with the --server flag, or the serial implementation which only handles one connection at a time (without the flag)? Because if it's the former, this is good proof that the scenario above is the (only) problem here.

(*) I just don't know enough about the system to have an informed opinion. This isn't the first time I've heard about the antivirus hypothesis, but I find that somewhat surprising, as that could mean that something like $CC hello.cc && ./a.out could fail (what we're doing here isn't fundamentally different than that).

labath · 2024-07-29T08:17:15Z

> 1. thread 1 opens a file for writing (file descriptor A) and starts writing it
> 2. thread 2 starts launching a gdb server. It calls fork(), which creates another process with a copy of fd A (the fd has the CLOEXEC flag set, but not CLOFORK (the flag doesn't exist))
> 3. thread 1 finishes writing, closes fd A (in its/parent process)
> 4. thread 1 launches gdb-server, gdb-server tries to launch the file it has just written, gets ETXTBSY because the fd is still open in the child process

In this scenario, waiting does make the exec succeed, because the FD in the forked process will not stay open very long. It will get closed as soon as the process runs execve (due to CLOEXEC).
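Why waiting helps in this scenario can be replayed with a toy bookkeeping model. It is an illustration only, not kernel code: `WriteRefModel` and its methods are hypothetical names, and the single rule it encodes (executing a file fails with ETXTBSY while any process holds a write descriptor for it) is a simplification of the behavior described above.

```cpp
#include <cassert>
#include <cerrno>
#include <map>

// Toy bookkeeping model (an illustration, not kernel code): counts, per
// process, the open write descriptors for one executable file.
class WriteRefModel {
public:
  // `pid` opens the file for writing.
  void OpenForWrite(int pid) { ++writers_[pid]; }

  // `pid` closes one of its write descriptors.
  void Close(int pid) {
    auto it = writers_.find(pid);
    assert(it != writers_.end() && "closing a descriptor that is not open");
    if (--it->second == 0)
      writers_.erase(it);
  }

  // fork(): the child starts with copies of the parent's descriptors.
  void Fork(int parent, int child) {
    auto it = writers_.find(parent);
    if (it != writers_.end())
      writers_[child] = it->second;
  }

  // `pid` successfully execs some other binary, so its close-on-exec
  // descriptors for the file go away.
  void ExecOtherBinary(int pid) { writers_.erase(pid); }

  // Result of trying to exec the file itself: ETXTBSY while any process
  // still holds a write descriptor, 0 (success) otherwise.
  int TryExecFile() const { return writers_.empty() ? 0 : ETXTBSY; }

private:
  std::map<int, int> writers_; // pid -> number of open write descriptors
};
```

Replaying the four steps: `OpenForWrite(1)`, `Fork(1, 2)`, `Close(1)` leaves process 2 holding a copy, so `TryExecFile()` yields ETXTBSY. Once process 2 reaches its own execve (`ExecOtherBinary(2)`), the copy is gone and a retried `TryExecFile()` returns 0.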

slydiman · 2024-07-29T12:13:46Z

> Does this refer to the forking implementation you get with the --server flag, or the serial implementation which only handles one connection at a time (without the flag)?

I mean the --server flag. It is very hard to reproduce this issue with the serial implementation because it is much slower.

> 1. thread 1 opens a file for writing (file descriptor A) and starts writing it
> 2. thread 2 starts launching a gdb server. It calls fork(), which creates another process with a copy of fd A (the fd has the CLOEXEC flag set, but not CLOFORK (the flag doesn't exist))
> 3. thread 1 finishes writing, closes fd A (in its/parent process)
> 4. thread 1 launches gdb-server, gdb-server tries to launch the file it has just written, gets ETXTBSY because the fd is still open in the child process
>
> In this scenario, waiting does make the exec succeed, because the FD in the forked process will not stay open very long. It will get closed as soon as the process runs execve (due to CLOEXEC).

Your scenario is impossible. How would the FD be closed by execve() if execve() failed with ETXTBSY?

1. lldb-server platform (a thread 1 or just a process) creates, writes and closes the file. The FD is closed and may be reused by the system. The client lldb received the response OK for vFile:close request.
2. The same thread 1 launched gdb server (fork+execve). The FD of the created and closed file cannot be copied in any way.
3. The client lldb connects to gdb server and sends the vRun request.
4. gdb server did not create or write the executable. It never had the FD of this file and did not inherit it. gdb server just forks a child process and tries to call execve(), but it fails with ETXTBSY.
5. gdb server waits a little and tries to call execve() again. It succeeds after several attempts. No one closed the mythical FD during this time.

slydiman · 2024-07-29T12:30:27Z

> that could mean that something like $CC hello.cc && ./a.out could fail (what we're doing here isn't fundamentally different than that).

The difference is that cc creates a.out and exits itself. But lldb-server platform is still running after creating the executable. Something must be flushed?

slydiman · 2024-07-29T12:43:55Z

See also golang/go#22315

labath · 2024-07-29T12:45:23Z

> > Does this refer to the forking implementation you get with the --server flag, or the serial implementation which only handles one connection at a time (without the flag)?
>
> I mean the --server flag. It is very hard to reproduce this issue with the serial implementation because it is much slower.

Ok, so if I'm reading this right, you're saying you saw no ETXTBSY errors with the current implementation's --server flag. Is that correct?

> 1. thread 1 opens a file for writing (file descriptor A) and starts writing it
> 2. thread 2 starts launching a gdb server. It calls fork(), which creates another process with a copy of fd A (the fd has the CLOEXEC flag set, but not CLOFORK (the flag doesn't exist))
> 3. thread 1 finishes writing, closes fd A (in its/parent process)
> 4. thread 1 launches gdb-server, gdb-server tries to launch the file it has just written, gets ETXTBSY because the fd is still open in the child process
>
> In this scenario, waiting does make the exec succeed, because the FD in the forked process will not stay open very long. It will get closed as soon as the process runs execve (due to CLOEXEC).

There are two threads (and at least two execve()s) happening here. I'm referring to the one on thread 2 (specifically, the fork child of thread 2). Your description covers what happens on one thread (well, one line of execution, corresponding to e.g. one test). Let me try this again. I'm just going to take your description, copy it twice and interleave it (italic is for one line of execution, bold is for the second one, regular text is my commentary):

1. *lldb-server platform (a thread 1 or just a process) creates, writes*
2. note that at this point the file remains open in the lldb-platform process. This is going to be our mythical FD
3. **lldb-server platform (a thread 1 or just a process) creates, writes and closes the file. The FD is closed and may be reused by the system. The client lldb received the response OK for vFile:close request.**
4. **The same thread 1 launched gdb server (fork**
5. note the fork creates a new process. The process is going to have a copy of the mythical FD
6. *and closes the file. The FD is closed and may be reused by the system. The client lldb received the response OK for vFile:close request.*
7. note the mythical FD is only closed in the parent process. It still exists in the fork child
8. *The same thread 1 launched gdb server (fork+execve). The FD of the created and closed file cannot be copied in any way.*
9. We don't need to make a copy of the fd here. The copy was made earlier.
10. *gdb server did not create or write the executable. It never had the FD of this file and did not inherit it. gdb server just forks a child process and tries to call execve(), but it fails with ETXTBSY.*
11. We get ETXTBSY because the FD is still open in the process forked on step 4
12. **+execve). The FD of the created and closed file cannot be copied in any way.**
13. The mythical fd is fully closed after the execve() call
14. *gdb server waits a little and tries to call execve() again. It succeeds after several attempts. No one closed the mythical FD during this time.*
15. Yes, they did.

labath · 2024-07-29T12:54:02Z

> > that could mean that something like $CC hello.cc && ./a.out could fail (what we're doing here isn't fundamentally different than that).
>
> The difference is that cc creates a.out and exits itself. But lldb-server platform is still running after creating the executable. Something must be flushed?

Ok, this wasn't the best analogy. I still stand by my analysis of the problem though.

> See also golang/go#22315

That's exactly the problem I'm describing here. And I'm considering something like golang/go#22315 (comment) as the solution (if we really do go through with this). The bug is that the FD gets leaked, and the fix is to make sure it doesn't get leaked. Waiting is a workaround because there's no guarantee that whoever we leak it to will close it... ever. The only reason the workaround is here is because the bug was in third party code we can't change (everywhere, that is: we did change it, but only for new Android versions).

slydiman · 2024-07-29T13:04:03Z

> Ok, so if I'm reading this right, you're saying you saw no ETXTBSY errors with the current implementation's --server flag. Is that correct?

Right. Initially I have marked #100670 as dependent on this.

Ok, agreed. So, we can try to use O_CLOFORK.
And the simple solution is to call execve() as fast as possible and wait some time in case of ETXTBSY (this PR).

labath · 2024-07-29T13:23:25Z

> > Ok, so if I'm reading this right, you're saying you saw no ETXTBSY errors with the current implementation's --server flag. Is that correct?
>
> Right. Initially I have marked #100670 as dependent on this.
>
> Ok, agreed. So, we can try to use O_CLOFORK. And the simple solution is to call execve() as fast as possible and wait some time in case of ETXTBSY (this PR).

Umm.. by "current" I meant the current implementation that's in the llvm repository, so I'm not sure if we're agreeing to anything (yet).

O_CLOFORK doesn't exist (that's the really mythical part). I wish it did though...

I don't think we can call execve appreciably faster than we already do. I'm still not sure if I am ok with the wait workaround, but I think it could wait until we settle some other things first. For one, I'd like to hear your opinion on my port mapping alternative.

slydiman · 2024-07-29T13:55:28Z

> I'd like to hear your opinion on my port mapping alternative.

#100670 (comment)

slydiman · 2024-07-29T16:59:55Z

I have moved this patch to #100670.


slydiman requested review from DavidSpickett, walter-erquinigo and labath July 25, 2024 21:53

slydiman requested a review from JDevlieghere as a code owner July 25, 2024 21:53

llvmbot added the lldb label Jul 25, 2024

bulbazord reviewed Jul 25, 2024


slydiman force-pushed the lldb-server-ETXTBSY branch from 6e24f75 to 21fd03f Compare July 25, 2024 22:46

slydiman mentioned this pull request Jul 25, 2024

[lldb] Multithreading lldb-server works on Windows now; fixed gdb port mapping #100670

Closed

labath requested changes Jul 26, 2024


slydiman added a commit to slydiman/llvm-project that referenced this pull request Jul 29, 2024

Merged with llvm#100659

e69a04e

slydiman closed this Jul 29, 2024

slydiman added a commit to slydiman/llvm-project that referenced this pull request Aug 1, 2024

Merged with llvm#100659

17916b6
