Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Better diagnostic for native .dll loading errors (pass through dlerror) #7737

Closed
alexanderuv opened this issue Mar 28, 2017 · 29 comments
Closed
Assignees
Labels
area-Interop-coreclr enhancement Product code improvement that does NOT require public API changes/additions
Milestone

Comments

@alexanderuv
Copy link

Hi
I am trying to write a .NET Core wrapper for the LLVM C Frontend. However I am having trouble with the runtime finding the dylib.

Failed   LLVMWrap.Tests.BasicTests.TestBasicSumMethod
Error Message:
 System.DllNotFoundException : Unable to load DLL 'libLLVM.dylib': The specified module could not be found.
 (Exception from HRESULT: 0x8007007E)
Stack Trace:
   at LLVMWrap.LLVMAPI.LLVMInt32Type()
   at LLVMWrap.Core.Types.LLVMInt32Type..ctor() in /Users/alexander/github/LLVMWrap/LLVMWrapCore/Core/Types/LLVMInt32Type.cs:line 11
   at LLVMWrap.Core.Types.LLVMInt32Type.<>c.<.cctor>b__4_0() in /Users/alexander/github/LLVMWrap/LLVMWrapCore/Core/Types/LLVMInt32Type.cs:line 7
   at System.Lazy`1.CreateValue()
   at System.Lazy`1.LazyInitValue()
   at LLVMWrap.Core.Types.LLVMInt32Type.get_Instance() in /Users/alexander/github/LLVMWrap/LLVMWrapCore/Core/Types/LLVMInt32Type.cs:line 14
   at LLVMWrap.Tests.BasicTests.TestBasicSumMethod() in /Users/alexander/github/LLVMWrap/LLVMWrap.Tests/BasicTests.cs:line 15

I've tried the project paths, the bin/debug/target/ path and nothing. Any clues about what I might be missing? or is this a bug?

@shahid-pk
Copy link
Contributor

shahid-pk commented Mar 28, 2017

i think dll import looks for .so or dylab in the os default location. for example i did not have to provide anything but just the name of libxcb for xwindows on linux and for cairo on linux.
https://github.com/shahid-pk/NetX/blob/master/NetX/Interop/LibXcb.cs
https://github.com/shahid-pk/NetX/blob/master/NetX/Interop/LibCairo.cs

the .so libs are in the their default location where the package manager installs by default.

@adityamandaleeka
Copy link
Member

I believe the code paths involved in loading your dylib should just be using dlopen. Information about how dlopen probes for libraries on macOS is at this man page.

If that doesn't help, please provide a minimal sample and we can help you figure out what's going wrong.

@alexanderuv
Copy link
Author

@adityamandaleeka I looked at the documentation (thanks for that) but I couldn't make it work. According to the documentation, it will look in the current directory or an absolute path. I have tried both but none work. Please find the sample here: https://github.com/alexanderuv/LLVMWrap

@gkhanna79
Copy link
Member

CC @yizhang82

@adityamandaleeka
Copy link
Member

@alexanderuv I tried out your sample and I believe the reason it's not able to load libLLVM.dylib is that libLLVM.dylib is also referencing libLTO.dylib, which isn't found.

@adityamandaleeka
Copy link
Member

This load failure isn't a CoreCLR bug, but it would be great if we could provide a good way for users to be able to diagnose this type of failure without having to debug or use external tools. The error message you saw (The specified module could not be found.) obviously wasn't very helpful.

@alexanderuv
Copy link
Author

Great! Thanks for the help!
It would definitely be nice to have a more helpful error message. I guess I would have to debug CoreCLR to find that message right?

@adityamandaleeka
Copy link
Member

Today you either have to monitor file system access by the program, or have a debug build of CoreCLR with WARN messages being printed in order to see what's going on (or of course attach a debugger).

@janvorli
Copy link
Member

I think I have suggested a long time ago to change the error message to "The specified module or one of its dependencies could not be found." That way it would hint the user to verify that the reported .dylib / .so has all the dependencies it needs. On Linux, "ldd -r" is the best way. On OSX, the "otool -L" command should work.
@alexanderuv would it have helped you to pinpoint the issue if the message was modified like I've said?

@yizhang82
Copy link
Contributor

yizhang82 commented Mar 31, 2017

I don't think the runtime should add OS specific diagnose support for dll loading. How about adding a fwlink that points to a doc that shows the tips/techniques/tools to diagnose such issues? The tools mentioned here (procmon, windbg+loader snap, ldd, otool) is a good start point. We can easily append that in the error message.

@alexanderuv
Copy link
Author

@janvorli I think that would be much more helpful than what it says right now. Also that is not platform specific so if that's as far as one can go, it would be fine

@maitredede
Copy link

Hi, I think that if you can get the error message, you can add it.
For example, I had a problem loading a library on linux, and it would have saved me lots of time if I had the error message why the library failed. I wrote a small native program to call myself dlopen/dlerror to figure out why the lib was not loading...

@jkotas jkotas changed the title DllImport probing paths? Better diagnostic for native .dll loading errors Nov 9, 2017
@eerhardt
Copy link
Member

eerhardt commented Dec 6, 2017

We encountered

Unhandled Exception: System.DllNotFoundException: Unable to load DLL 'MonoPosixHelper': The specified module or one of its dependencies could not be found.
(Exception from HRESULT: 0x8007007E)
   at Mono.Unix.Native.Syscall.lstat(String file_name, Stat& buf)
   at Mono.Unix.UnixFileSystemInfo.TryGetFileSystemEntry(String path, UnixFileSystemInfo& entry)
   at Mono.Unix.UnixFileSystemInfo.GetFileSystemEntry(String path)
   at TestMonoPosix.Program.Main(String[] args) in d:\Projects\TestMonoPosix\Program.cs:line 13
Aborted (core dumped)

When running a program on a newly mounted disk. The application worked fine on the main disk, but it would fail as above when the application was copied to the mounted disk.

Using strace, gave insight into why the assembly couldn't be loaded:

open("/datadisks/disk1/TestMonoPosix/runtimes/linux-x64/native/libMonoPosixHelper.so", O_RDONLY|O_CLOEXEC) = 24
read(24, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\220\271\0\0\0\0\0\0"..., 832) = 832
fstat(24, {st_mode=S_IFREG|0744, st_size=900791, ...}) = 0
mmap(NULL, 2362920, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 24, 0) = -1 EPERM (Operation not permitted)
close(24)  

From reading the ‘man mmap’ pages:

EPERM The prot argument asks for PROT_EXEC but the mapped area belongs to a file on a filesystem that was mounted no-exec.

Which was the problem - the disk was mounted no-exec. Remounting it with the exec option allowed the application to start running again.

Better diagnostics info in the exception would have helped speed up this investigation and narrow the problem sooner.

@danmoseley
Copy link
Member

@jkotas @janvorli short of literally running tools like strace, ldd etc for the user is there more information available to us that we can provide?

@adityamandaleeka mentions WRN messages in debug builds -- are these the ones he has in mind? If so could we usefully include the dlerror() in the exception ?

C:\git\coreclr\src\pal\src\loader\module.cpp:
 1110:         WARN("dlclose() call failed! error message is \"%s\"\n", dlerror());
 1178:         WARN("Call to DllMain (%p) got an unhandled exception; ignoring.\n", module->pDllMain);
 1222:         WARN("Invalid module handle %p\n", hLibModule);
 1379:         WARN("dlopen() failed; dlerror says '%s'\n", dlerror());

C:\git\coreclr\src\pal\src\loader\modulename.cpp:
   63:         WARN("dladdr() call failed! dlerror says '%s'\n", dlerror());

@danmoseley
Copy link
Member

Perhaps there are other places we could pass dlerror() through eg where we do dlsym like

        TRACE("Symbol %s not found in module %p (named %S), dlerror message is \"%s\"\n",
              lpProcName, module, MODNAME(module), dlerror());

@jkotas
Copy link
Member

jkotas commented Dec 6, 2017

could we usefully include the dlerror() in the exception ?

Yes.

@danmoseley
Copy link
Member

Looks like it would need a little wiring as currently errors are passed back through GetLastError, ie DWORD

        WARN("dlopen() failed; dlerror says '%s'\n", dlerror());
        SetLastError(ERROR_MOD_NOT_FOUND);

It might also be useful to have LoadLibErrorTracker include all errors, not just what it thinks is the highest priority

@danmoseley danmoseley changed the title Better diagnostic for native .dll loading errors Better diagnostic for native .dll loading errors (pass through dlerror) Dec 13, 2017
@danmoseley
Copy link
Member

danmoseley commented Dec 13, 2017

I guess I add something like this that wraps dlerror()

PALIMPORT
BOOL
PALAPI
PAL_GetLoadLibraryError(
     OUT LPWSTR lpBuffer,
     IN DWORD nSize
)

then have this code call it instead of GetLastError()

    HMODULE hmod = PAL_LoadLibraryDirect(name);

    if (hmod == NULL)
    {
        pErrorTracker->TrackErrorCode(GetLastError());
    }

then update LoadLibErrorTracker to accept and throw a string.

@wtgodbe
Copy link
Member

wtgodbe commented Jan 3, 2018

@danmosemsft I'm trying to get a handle on what needs to be done here - were you seeing that it would be sufficient to create the PAL_GetLoadLibraryError function you mentioned above, have it insert the char* returned by dlerror() into the OUT parameter lpBuffer, and replace the call to GetLastError() at https://github.com/dotnet/coreclr/blob/46ab1d132c9ad471d79afa20c188c2f9c85e5f20/src/vm/dllimport.cpp#L5855 with a call to PAL_GetLoadLibraryError? Would we call TrackErrorCode on the invocation of PAL_GetLoadLibraryError, or on the LPWSTR we provided to that function as the OUT parameter lpBuffer? Also, where would be the right place to add PAL_GetLoadLibraryError?

@danmoseley
Copy link
Member

danmoseley commented Jan 4, 2018

PAL_GetLoadLibraryError can just go in module.cpp or somewhere relevant.

My reading of the doc for dlerror(), https://linux.die.net/man/3/dlerror , calling dlerror() clears it out and also subsequent successful use of dlopen() does not clear it out.

So what I suggest is storing a string in the PAL containing the last dlerror. There would be an internal function like ClearLoadLibraryError() to clear it. PAL_GetLoadLibraryError() would return any existing value, or if it was empty, call dlerror() to get and store the value then return that.

Then modifying all places in PAL that call dlopen(), dlsym() or dlclose(), to add ClearLoadLibraryError() before the call.

Then change TrackErrorCode(DWORD); to just TrackErrorCode(); which calls either GetLastError() or GetLoadLibraryError() depending on #if FEATURE_PAL.

If TrackErrorCode() called GetLoadLibraryError() then it should should store the string from that and put that string into the DllNotFoundException when it throws it.

@jkotas does this sound like a reasonable plan?

Some future change could possibly do something similar for EntrypointNotFoundException to include any error from dlsym. But this bug is about DllNotFoundException.

@jkotas
Copy link
Member

jkotas commented Jan 4, 2018

PAL_GetLoadLibraryError() would return any existing value, or if it was empty, call dlerror()

Why wouldn't PAL_GetLoadLibraryError just call dlerror()? We do not need to be in the business of caching it.

Then modifying all places in PAL that call dlopen(), dlsym() or dlclose(), to add ClearLoadLibraryError()

This just needs to be dlopen(), I think.

Otherwise sounds good to me.

@danmoseley
Copy link
Member

Why wouldn't PAL_GetLoadLibraryError just call dlerror()? We do not need to be in the business of caching it.

To be consistent with GetLastError, which you can call repeatedly. The man page says that dlerror() resets when called. I'm happy to not cache it though.

@wtgodbe
Copy link
Member

wtgodbe commented Jan 5, 2018

@danmosemsft at https://github.com/dotnet/coreclr/blob/master/src/vm/dllimport.cpp#L5730, if we replace dwLastError with the result of PAL_GetLoadLibraryError on non-windows, then won't that switch statement not work? I.E. it's only set up to understand an int result. Or should we call GetLastError on every OS, but only (additionally) call PAL_GetLoadLibraryError on non-windows?

@danmoseley
Copy link
Member

@wtgodbe in the Unix path, I do not think you need this "priority" scheme in that class. I would just simply store the string and return. When you are done with the work you could verify that TrackErrorCode() is not getting called more than once.

@wtgodbe
Copy link
Member

wtgodbe commented Jan 11, 2018

@danmosemsft what do you suggest I do about https://github.com/dotnet/coreclr/blob/master/src/vm/dllimport.cpp#L5819-L5823 ? I could add a priority value for ERROR_INVALID_PARAMETER? Or should I assume that case falls under https://github.com/dotnet/coreclr/blob/master/src/vm/dllimport.cpp#L5745-L5748?

@danmoseley
Copy link
Member

Note it's != ERROR_INVALID_PARAMETER, so that will not call TrackErrorCode..?

If your question is about the Linux case, I think we're assuming that priority isn't relevant there and there will be just one call to TRackErrorCode.

@wtgodbe
Copy link
Member

wtgodbe commented Jan 11, 2018

I meant that we need to eliminate calls to GetLastError() outside of TrackErrorCode(), so the call at https://github.com/dotnet/coreclr/blob/master/src/vm/dllimport.cpp#L5819 needs to be removed, meaning we won't have the value of dwLastError to check against ERROR_INVALID_PARAMETER

@danmoseley
Copy link
Member

Ah right. It should be OK to call GetLastError() both there and again inside TrackErrorCode(). It should not reset.

@wtgodbe
Copy link
Member

wtgodbe commented Jan 25, 2018

Resolved by dotnet/coreclr#15831

Unhandled Exception: System.DllNotFoundException: Unable to load shared library 'libhelloworld.so' or one of its dependencies. In order to help diagnose loading problems, consider setting the LD_DEBUG environment variable: liblibhelloworld.so: cannot open shared object file: No such file or directory
at App.add(Int32 x, Int32 y)
at App.Main()
Aborted (core dumped)

@wtgodbe wtgodbe closed this as completed Jan 25, 2018
@msftgits msftgits transferred this issue from dotnet/coreclr Jan 31, 2020
@msftgits msftgits added this to the 2.1.0 milestone Jan 31, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Dec 24, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-Interop-coreclr enhancement Product code improvement that does NOT require public API changes/additions
Projects
None yet
Development

No branches or pull requests