.NET Core 3.0 RTM Freeze on Ubuntu 16.04 OpenVZ #13475
Comments
Looks like a GC bug: most threads are waiting for the GC to finish (to enter cooperative mode), but the GC thread is blocked at ThreadSuspend::SuspendRuntime. |
Thank you for collecting the dump and stack traces. The problem is Thread 10:
This thread is in cooperative mode and the GC is waiting for it to get to a safe spot. Would you be able to find out why it is not making progress? Is it stuck spinning inside |
I was able to reproduce the problem by copying a custom-built coreclr (tag v3.0.0) to .dotnet/shared/Microsoft.NETCore.App/3.0.0-preview6-27804-01 and running ./build.sh. With a release build, gdb gives the following frames when the freeze occurs (see thread 14; it looks like the same location causes the problem):
With a debug build, an assertion failed instead of the process freezing:
|
CoreCLR is coupled with the rest of the stack. Tag v3.0.0 and preview6 are many months apart. It is quite possible that CoreCLR from v3.0.0 tag is not compatible with preview6 and the crashes you are seeing are caused by mismatched bits. Are you able to reproduce the problems with exactly matching bits?
Any chance you can get this under a debugger and capture the stack trace of the crash? |
After I set tools.dotnet to "3.0.100" and overwrote .dotnet/shared/Microsoft.NETCore.App/3.0.0 with the custom-built coreclr, no assertion failure occurs; both debug and release builds just freeze. Looks like you're right about the mismatched bits. To investigate the freeze, I added logging around the lock calls:
FORCEINLINE LPVOID Remove()
{
LIMITED_METHOD_CONTRACT;
if(root == NULL) return NULL; // No need for acquiring the lock, there's nothing to remove.
printf("remove lock acquiring %p %d\n", GetThread(), lock != 0);
AcquireLock();
printf("remove lock acquired %p %d\n", GetThread(), lock != 0);
Entry* ret = (Entry*)root;
if(ret)
{
root = ret->next;
count -= 1;
}
ReleaseLock();
printf("remove lock released %p %d\n", GetThread(), lock != 0);
return ret;
}
FORCEINLINE void Insert( LPVOID mem )
{
LIMITED_METHOD_CONTRACT;
printf("insert lock acquiring %p %d\n", GetThread(), lock != 0);
AcquireLock();
printf("insert lock acquired %p %d\n", GetThread(), lock != 0);
Entry* entry = (Entry*)mem;
entry->next = root;
root = entry;
count += 1;
ReleaseLock();
printf("insert lock released %p %d\n", GetThread(), lock != 0);
}
Then I saw the following messages when the freeze occurred:
Looks like something set the lock member to 1, or some weird out-of-order execution bug occurred :/ |
After adding the address of
so is some thread using a RecycledListInfo instance at an incorrect memory address? |
The reason is
I wrote a simple C program to verify it:
output:
In .NET Core 2.2, GetRecycleMemoryInfo will use
Link:
The fix should be to replace
Confirmed the fix works in my environment; should I create a pull request? |
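The symptom discussed above (a RecycledListInfo instance apparently living at an incorrect address) is what one would expect if a per-processor array is indexed with a processor number larger than the array. The following is only a minimal sketch of that failure mode and of one possible guard. The names PerCpuList, slot_for_processor and list_for_processor are hypothetical and are not the coreclr declarations, and the modulo mapping is an assumption for illustration, not necessarily what the actual fix does.

```c
#include <stddef.h>

/* Hypothetical stand-in for a per-processor recycle list.
   Illustrative only; not the real coreclr type. */
typedef struct {
    volatile int lock;   /* 0 = free, 1 = held */
    int count;           /* number of recycled entries */
} PerCpuList;

/* Map a raw processor number to a valid slot in an array of
   list_count entries. Without a guard like this, cpu_id >= list_count
   would index past the end of the array, and whatever bytes sit there
   would be interpreted as a lock and a count. */
size_t slot_for_processor(unsigned cpu_id, size_t list_count)
{
    if (list_count == 0)
        return 0;
    return (size_t)cpu_id % list_count;
}

/* Return the list for a processor, or NULL if there are no lists. */
PerCpuList *list_for_processor(PerCpuList *lists, size_t list_count,
                               unsigned cpu_id)
{
    if (lists == NULL || list_count == 0)
        return NULL;
    return &lists[slot_for_processor(cpu_id, list_count)];
}
```

With one list allocated and a reported processor number of 2 or 5, both calls resolve to slot 0 rather than to memory past the array. The trade-off of a modulo mapping is that several physical processors may share one list, which costs some contention but keeps every access in bounds.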
Yes, that would be great! Thank you a lot for tracking this down. cc @janvorli Regression introduced by dotnet/coreclr#23824 |
Until this fix lands in a release, a workaround can be performed by using As a workaround, I defined a
// coreclr-27955-workaround.c
int sched_getcpu(void) {
    return 0;
}
Compiled it:
gcc -shared -fPIC coreclr-27955-workaround.c -o libcoreclr-27955-workaround.so
sudo cp libcoreclr-27955-workaround.so /usr/local/lib
Then ran my app with the
LD_PRELOAD=/usr/local/lib/libcoreclr-27955-workaround.so ASPNETCORE_ENVIRONMENT=Production ASPNETCORE_URLS=http://*:54561 ./Foo
It worked! |
Ports change #26873 to release 3.1 branch.
On OpenVZ virtualized Linux, GetCurrentProcessorNumber, which uses sched_getcpu(), can return a value greater than the number of processors reported by sched_getaffinity with CPU_COUNT or by sysconf(_SC_NPROCESSORS_ONLN). For example, taskset -c 2,3 ./MyApp makes CPU_COUNT equal 2 while sched_getcpu() can return 2 or 3, and an OpenVZ kernel can make sysconf(_SC_NPROCESSORS_ONLN) return a limited CPU count while sched_getcpu() still reports the real processor number.
Example of affinity vs current CPU id on OpenVZ:
nproc: 8
nprocOnline: 1
affinity: 1, 0, 0, 0, 0, 0, 0, 0, cpuid: 2
affinity: 1, 0, 0, 0, 0, 0, 0, 0, cpuid: 2
affinity: 1, 0, 0, 0, 0, 0, 0, 0, cpuid: 2
affinity: 1, 0, 0, 0, 0, 0, 0, 0, cpuid: 2
affinity: 1, 0, 0, 0, 0, 0, 0, 0, cpuid: 2
affinity: 1, 0, 0, 0, 0, 0, 0, 0, cpuid: 5
affinity: 1, 0, 0, 0, 0, 0, 0, 0, cpuid: 5
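The mismatch described above can be checked on a given host with a few libc calls. The sketch below is not the program used in this thread (that program is not shown here). The function names affinity_cpu_count, cpu_id_out_of_range and cpu_report are made up for illustration, and mapping "nproc" to _SC_NPROCESSORS_CONF is an assumption. It is Linux-specific, because sched_getcpu() and CPU_COUNT are glibc extensions that require _GNU_SOURCE.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

/* Number of CPUs in the calling thread's affinity mask, or -1 on error. */
int affinity_cpu_count(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set) != 0)
        return -1;
    return CPU_COUNT(&set);
}

/* Returns 1 when cpu_id cannot be used as an index into an array of
   `count` per-processor entries, 0 when it can. */
int cpu_id_out_of_range(int cpu_id, int count)
{
    return cpu_id < 0 || cpu_id >= count;
}

/* Writes a one-line report into buf and returns 1 if the current CPU id
   is out of range for the affinity-based processor count, else 0. */
int cpu_report(char *buf, size_t len)
{
    assert(buf != NULL && len > 0);
    long configured = sysconf(_SC_NPROCESSORS_CONF); /* assumed "nproc" */
    long online = sysconf(_SC_NPROCESSORS_ONLN);
    int affinity = affinity_cpu_count();
    int cpu = sched_getcpu();
    snprintf(buf, len, "nproc: %ld nprocOnline: %ld affinityCount: %d cpuid: %d",
             configured, online, affinity, cpu);
    return cpu_id_out_of_range(cpu, affinity);
}
```

Calling cpu_report in a loop from a small main and printing the buffer should, on an affected OpenVZ container, occasionally return 1 (for example an affinity count of 1 with a CPU id of 2). A single call can miss the problem because the scheduler may happen to run the thread on CPU 0.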
@Daniel15, thanks, it definitely does! it also helped me run dotnet core 3.1 hwapp on FreeBSD using Linux®️ Binary Compatibility emulation: https://cirrus-ci.com/build/6435715873505280. Note that we can skip the cc @wfurt, the self-contained repro is here, independent of qemu, libvirt and other solutions we were exploring. the "catch" is that all the build-time dependencies |
@am11 I'm glad you found the workaround useful :) For what it's worth, for https://dnstools.ws/ I'm deploying this workaround to all servers as part of an Ansible playbook:
- name: Temporary hack for https://github.com/dotnet/coreclr/issues/27955
  get_url:
    url: https://d.ls/dotnet/bugs/libcoreclr-27955-workaround.so
    dest: /usr/local/lib/libcoreclr-27955-workaround.so
    mode: "0755"
    checksum: sha256:be021161c98f69367745cd7d821b8175b6466eb5fa5921ecee3dcb6f9ff8f150
Then in my systemd file I have:
# Workaround for https://github.com/dotnet/coreclr/issues/27955
Environment=LD_PRELOAD=/usr/local/lib/libcoreclr-27955-workaround.so
|
Fixed in 3.1.2 |
The helloworld app in Linux chroot of FreeBSD still failed when invoking the compiler (at |
FreeBSD hang is likely a different underlying problem. Could you please open a new issue on it? |
This change updates the .NET Core SDK used by the Actions Runner to version 3.1.302 to address the issues that are caused by the following issue: dotnet/runtime#13475 See actions#574 for more information. Fixes actions#574
Today I installed .NET Core 3.0 RTM (3.0.100) on my Ubuntu 16.04.6 x86-64 VPS. When I tried to build a hello world console project, the dotnet process with the following arguments froze:
The previous version (.NET Core 2.2) doesn't have this issue.
I then tried to build coreclr from source to find out why this happens.
I cloned the coreclr repository, checked out the release/3.1 branch, then ran build.sh, and it froze again.
The command line:
I attached gdb to the process and ran thread apply all bt; here are the output messages. It looks like .NET Core 3.0 has a deadlock bug on Ubuntu 16.04, in both RTM and preview 7.