-
Notifications
You must be signed in to change notification settings - Fork 12.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[lldb] Improved lldb-server stability for remote launching #100659
Conversation
@llvm/pr-subscribers-lldb Author: Dmitry Vasilyev (slydiman) ChangesWe faced the issue running cross api tests in 8 threads. The executable is installed to the target by the process Full diff: https://github.com/llvm/llvm-project/pull/100659.diff 2 Files Affected:
diff --git a/lldb/source/Host/posix/ProcessLauncherPosixFork.cpp b/lldb/source/Host/posix/ProcessLauncherPosixFork.cpp
index 0a832ebad13a7..637c2846e6bb2 100644
--- a/lldb/source/Host/posix/ProcessLauncherPosixFork.cpp
+++ b/lldb/source/Host/posix/ProcessLauncherPosixFork.cpp
@@ -201,7 +201,7 @@ struct ForkLaunchInfo {
execve(info.argv[0], const_cast<char *const *>(info.argv), info.envp);
#if defined(__linux__)
- if (errno == ETXTBSY) {
+ for (int i = 0; i < 50; ++i) {
// On android M and earlier we can get this error because the adb daemon
// can hold a write handle on the executable even after it has finished
// uploading it. This state lasts only a short time and happens only when
@@ -210,7 +210,9 @@ struct ForkLaunchInfo {
// shell" command in the fork() child before it has had a chance to exec.)
// Since this state should clear up quickly, wait a while and then give it
// one more go.
- usleep(50000);
+ if (errno != ETXTBSY)
+ break;
+ usleep(100000);
execve(info.argv[0], const_cast<char *const *>(info.argv), info.envp);
}
#endif
diff --git a/lldb/source/Host/windows/ProcessLauncherWindows.cpp b/lldb/source/Host/windows/ProcessLauncherWindows.cpp
index baa422c15cae2..143f20cb3f510 100644
--- a/lldb/source/Host/windows/ProcessLauncherWindows.cpp
+++ b/lldb/source/Host/windows/ProcessLauncherWindows.cpp
@@ -113,14 +113,30 @@ ProcessLauncherWindows::LaunchProcess(const ProcessLaunchInfo &launch_info,
// command line is not empty, its contents may be modified by CreateProcessW.
WCHAR *pwcommandLine = wcommandLine.empty() ? nullptr : &wcommandLine[0];
- BOOL result = ::CreateProcessW(
+ BOOL result;
+ DWORD last_error = 0;
+ // This is the workaround for the error "The process cannot access the file
+ // because it is being used by another process". Note the executable file is
+ // installed to the target by the process `lldb-server platform`, but launched
+ // by the process `lldb-server gdbserver`. Sometimes system may block the file
+ // for some time after copying.
+ for (int i = 0; i < 50; ++i) {
+ result = ::CreateProcessW(
wexecutable.c_str(), pwcommandLine, NULL, NULL, TRUE, flags, env_block,
wworkingDirectory.size() == 0 ? NULL : wworkingDirectory.c_str(),
&startupinfo, &pi);
+ if (!result) {
+ last_error = ::GetLastError();
+ if (last_error != ERROR_SHARING_VIOLATION)
+ break;
+ ::Sleep(100);
+ } else
+ break;
+ }
if (!result) {
// Call GetLastError before we make any other system calls.
- error.SetError(::GetLastError(), eErrorTypeWin32);
+ error.SetError(last_error, eErrorTypeWin32);
// Note that error 50 ("The request is not supported") will occur if you
// try debug a 64-bit inferior from a 32-bit LLDB.
}
|
✅ With the latest revision this PR passed the C/C++ code formatter. |
@@ -201,7 +201,7 @@ struct ForkLaunchInfo { | |||
execve(info.argv[0], const_cast<char *const *>(info.argv), info.envp); | |||
|
|||
#if defined(__linux__) | |||
if (errno == ETXTBSY) { | |||
for (int i = 0; i < 50; ++i) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe it would be a good idea to make the timeout configurable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is redundant. Currently lldb-server contains a hardcoded timeout in many places. Note the total timeout 10 seconds will cause the error "Sending vRun packet failed". So 5 seconds is good enough for the stable connection and do the trick on slow machines.
We faced the issue running cross api tests in 8 threads. The executable is installed to the target by the process `lldb-server platform`, but launched by the another process `lldb-server gdbserver`. We got the error ETXTBSY on Linux and ERROR_SHARING_VIOLATION on Windows. It seems the known issue and ProcessLauncherPosixFork.cpp already contains the workaround, but it is not enough. Updated the workaround with the total timeout 5 seconds and added the same workaround to ProcessLauncherWindows.cpp too.
6e24f75
to
21fd03f
Compare
…t mapping Removed fork(). Used threads and the common thread-safe port map for all platform connections. Updated lldb::FileSystem to use llvm::vfs::createPhysicalFileSystem() with an own virtual working directory per thread. This patch depends on llvm#100659, llvm#100666. This patch fixes llvm#97537, llvm#90923, llvm#56346. lldb-server has been tested on Windows with 50 connections and 100 processes launched simultaneously. Tested also the cross build with Linux x86_64 host and Linux Aarch64 target.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The workaround specifically mentions Android (M). The problem there was that the ADB daemon (part of the OS, which we use for faster file uploads) was implemented (had a bug) such that the file could remain open for a short while (between fork() and exec()) in another process when the daemon is accepting a new connection.
We fixed that issue, but we since the this comes with the device, we still had to work around the problem in old devices. We're finally reaching the point when we could remove the workaround (android <=M is less that 3% of active devices), so I wouldn't want to extend it without understanding the problem further.
As I understand it, you're using the regular lldb-server platform
process to implement file uploads, which I think means the problem should be fully within our control. We basically need to make sure that:
- lldb waits for the
vFile:close
response before sending thevRun
packet - lldb-server actually closes the file handle (all of them -- i.e., makes sure it does not leak it) before sending the
vFile:close
response
If that's not what's happening right now, then we ought to fix it.
Or at least, understand why is that impossible. |
@labath Of course lldb waits for the vFile:close response before sending the vRun packet and lldb-server actually closes the file handle (all of them). No any leaks. Otherwise this workaround wouldn't work. The behavior is the same on Linux and Windows targets. I launched 100 connections and 200 processes simultaneously on Windows (lldb-server gdbserver + a test app). I got 3..10 fails because of the error ERROR_SHARING_VIOLATION. After this patch I got 0..3 fails for 100 connections and 0 fails for 50 connections. After closing the copied file probably the system may cache it some time per process. The file may be blocked by the built-in antivirus for some time. It is hard to figure out the exact reason. We have a buildbot to run cross API tests in 8 threads with Linux Aarch64 target. All tests are green with the current (single thread) lldb-server. But we got randomly failed 50..60 tests (of 1190) with the multithreading version of lldb-server. |
That may be how things works on windows(*), but I'm pretty sure it's not how things work on linux. A much more likely scenario is:
This isn't "operating system keeping the file open longer", this is us doing it (to be fair, the operating system is making it pretty hard to avoid doing that). And while this isn't an absolute thing (the workaround isn't as bad as the
Does this refer the the forking implementation you get with the (*) I just don't know enough about the system to have an informed opinion. This isn't the first time I've heard about the antivirus hypothesis, but I find that somewhat surprising, as that could mean that something like |
In this scenario, waiting does make the exec succeed, because the FD in the forked process will not stay open very long. It will get closed as soon as the process runs |
I mean the
Your scenario is impossible. How FD will be closed by execve() if execve() failed with ETXTBSY?
|
The difference is that cc creates a.out and exits itself. But |
See also golang/go#22315 |
Ok, so if I'm reading this right you're saying you saw no ETXTBSY errors with the current implementation
There are two threads and (at least two execve()s) happening here. I'm referring to the one on thread 2 (specifically, the fork child of thread 2). Your description describes what happens on one thread (well, one line of execution, corresponding to one e.g. test). Let me try this again. I'm just going to take your description, copy it twice and interleave it (italic is for one line of execution bold is for the second one, regular text is my commentary):
|
Ok, this wasn't the best analogy. I still stand by my analysis of the problem though.
That's exactly the problem I'm describing here. And I'm considering something like golang/go#22315 (comment) as the solution (if we really do go through with this). The bug is that the FD gets leaked, and the fix is to make sure it doesn't get leaked. Waiting is a workaround because there's no guarantee that whoever we leak it to will close it.. ever. The only reason the workaround is here is because the bug was in third party code we can't change (everywhere.. we did change it, but only for new androids) |
Right. Initially I have marked #100670 as dependent on this. Ok, agreed. So, we can try to use O_CLOFORK. |
Umm.. by "current" I meant the current implementation that's in the llvm repository, so I'm not sure if we're agreeing to anything (yet). O_CLOFORK doesn't exist (that's the really mythical part). I wish it did though... I don't think we can call execve appreciably faster than we already do. I'm still not sure if I am ok with the wait workaround, but I think it could wait until we settle some other things first. For one, I'd like to hear your opinion on my port mapping alternative. |
|
I have moved this patch to #100670. |
…t mapping Removed fork(). Used threads and the common thread-safe port map for all platform connections. Updated lldb::FileSystem to use llvm::vfs::createPhysicalFileSystem() with an own virtual working directory per thread. This patch depends on llvm#100659, llvm#100666. This patch fixes llvm#97537, llvm#90923, llvm#56346. lldb-server has been tested on Windows with 50 connections and 100 processes launched simultaneously. Tested also the cross build with Linux x86_64 host and Linux Aarch64 target.
We faced the issue running cross api tests in 8 threads. The executable is installed to the target by the process
lldb-server platform
, but launched by the another processlldb-server gdbserver
. We got the error ETXTBSY on Linux and ERROR_SHARING_VIOLATION on Windows. It seems the known issue and ProcessLauncherPosixFork.cpp already contains the workaround, but it is not enough. Updated the workaround with the total timeout 5 seconds and added the same workaround to ProcessLauncherWindows.cpp too.