libdlfaker.so interferes with library specific RPATHs #266

Closed
twhitehead opened this issue Oct 28, 2024 · 9 comments

Comments

@twhitehead

twhitehead commented Oct 28, 2024

When libdlfaker.so hooks dlopen it causes the ultimate dlopen call to come from libdlfaker.so instead of the original library. This results in any RPATHs in that library not being used to search for the library being dlopened.

I discovered this on the Digital Research Alliance of Canada's clusters when using the newest ParaView. Here is a screenshot showing how ParaView fails to load because a dependent library cannot find a library it requires:
[Screenshot: dlfaker-error]

Here is a screenshot showing how the libraries resolve properly when you remove libdlfaker.so from between the library and glibc:
[Screenshot: dlfaker-workaround]

I realize this is probably a difficult item to resolve, but I opened this ticket to at least document the issue.

That said, I think it is technically possible, as the final libdlfaker.so call to the actual dlopen is always (except when tracing is enabled) in a tail call position. That is, it looks like

  return dlopen_real(...);

which means you could technically tail call the real dlopen_real (i.e., drop libdlfaker.so's stack frame and jump to dlopen_real instead of calling it), so that the top-level return address would be in the original library's address space and glibc would then presumably apply its RPATH.
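To make the return-address argument concrete, here is a minimal sketch, assuming glibc on Linux, of how a return address can be attributed to a loaded object. The names `object_containing` and `apparent_caller` are hypothetical and are not part of VirtualGL or glibc; the sketch only illustrates the mechanism, not glibc's actual lookup code.

```c
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stddef.h>

/* Path of the loaded object whose address range contains addr, or NULL. */
static const char *object_containing(void *addr)
{
    Dl_info info;
    if (dladdr(addr, &info) != 0 && info.dli_fname != NULL)
        return info.dli_fname;
    return NULL;
}

/* Hypothetical stand-in for an interposed dlopen(): reports which object
 * the call appears to come from, judged only by the return address. */
__attribute__((noinline)) const char *apparent_caller(void)
{
    return object_containing(__builtin_return_address(0));
}
```

If a function like `apparent_caller` lived in a preloaded library and forwarded to the real dlopen with an ordinary call, the return address seen by the real dlopen would fall inside the preloaded library rather than inside the library that made the original request.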

@twhitehead
Author

twhitehead commented Oct 28, 2024

I see gcc has a musttail attribute that can be applied to a function call return

  [[gnu::musttail]] return foo();

Could it be as simple as that (well, a bit more complex actually, as the final call is actually two levels deep with a comment that there must have been a good reason for this...)?
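A minimal sketch of the idea, assuming a compiler that may or may not provide the attribute: the `TAIL_RETURN` macro, `real_open`, and `wrapped_open` are hypothetical names for illustration, not VirtualGL code. Where `musttail` is unavailable, the macro falls back to a plain `return`, which gives no tail-call guarantee.

```c
#include <stddef.h>

/* Use the GNU-style musttail statement attribute when the compiler
 * reports support for it; otherwise fall back to an ordinary return. */
#ifdef __has_attribute
#  if __has_attribute(musttail)
#    define TAIL_RETURN __attribute__((musttail)) return
#  endif
#endif
#ifndef TAIL_RETURN
#  define TAIL_RETURN return
#endif

/* Hypothetical stand-in for the real function being wrapped. */
static int real_open(const char *name, int flags)
{
    return name != NULL ? flags + 1 : -1;
}

/* Wrapper whose final action is the forwarded call. The wrapper and the
 * callee share the same signature, which musttail generally requires. */
int wrapped_open(const char *name, int flags)
{
    TAIL_RETURN real_open(name, flags);
}
```

With the attribute active, the compiler must either emit the forwarded call as a jump or reject the code, so the wrapper's frame is gone by the time the callee runs. A call nested two levels deep would presumably need the same treatment at each level.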

@dcommander
Member

This has been brought up before (see #107, #250), and relevant application recipes have been added to work around it. (I could add one for ParaView as well.) It would be nice to work around it more generally. I'll experiment with the musttail attribute.

@bartoldeman

Having debugged this with @twhitehead I was wondering about the reason for using libdlfaker by default, and as far as I can see it's meant for programs that dlopen libGL.so instead of linking to it directly. From what I understand this used to be really common, but nowadays maybe a little less? Paraview links to it directly, but I think e.g. MATLAB doesn't.

@dcommander
Member

It's perhaps even more common now than it used to be. There are entire frameworks that use dlopen()/dlsym() to load OpenGL functions, so any application built with one of those frameworks would fail if the dlopen() interposer weren't preloaded by default. The goal of VirtualGL is to "just work" for the maximum possible number of applications, i.e. to minimize the number of application recipes required. Thus, it would be really nice to work around the RPATH issue, because that would allow me to remove even more recipes. However, as of this moment, that issue is known to affect only three applications. Furthermore, I'm not sure which version of ParaView started experiencing the issue, but I know that older versions worked fine with VirtualGL.

@bartoldeman

For completeness the sequence here is as follows (why Paraview crashes here):

paraview links to libospray.so.2 via libvtkRenderingRayTracing-pv5.11.so. libvtkRenderingRayTracing-pv5.11.so has an RPATH to the dir with libospray.so.2.

libospray.so.2 successfully dlopens libospray_module_cpu.so.2 in the same directory

libospray_module_cpu.so.2 directly links to libispcrt.so.1; the RPATH to libispcrt.so.1 is in libospray_module_cpu.so.2, but not in the upper levels.

libispcrt.so.1 tries to dlopen libispcrt_device_cpu.so.1 (in the same directory as libispcrt.so.1), which fails with VirtualGL even though libispcrt.so.1 and libospray_module_cpu.so.2 have that directory in their RPATH. Instead (with libdlfaker), the search uses the top-level RPATH from the paraview executable.

Hope that helps for reference.

@dcommander
Member

I tried the same version of ParaView (v5.11.2) that you appear to be using, but I can't reproduce the issue on my machine. Is there something special I need to do to configure it?

@bartoldeman

@dcommander I'll try to get a reproducible build recipe for you, since there are probably some subtleties here. For sure it needs OSpray and ISPC as dependencies, with the ISPC runtime in a non-standard location.

@dcommander
Member

Unfortunately, it appears as if [[gnu::musttail]] is a C++11 attribute only. I'm not sure which version of GCC introduced it, but it is definitely newer than GCC v13, the newest version on my test machine (which is itself newer than the GCC version we use in our CI builds). Thus, it is a non-starter at the moment.

If ParaView doesn't use dlopen() to load libGL, then vglrun -nodl would be the appropriate workaround. But I'll reiterate that the default builds of ParaView work fine.

@dcommander
Member

BTW, the issue can be reproduced by cleaning the VirtualGL build directory, applying the attached VirtualGL build system patch, rebuilding VirtualGL, and running vglrun {build_dir}/dlfakerut/dlfakerut. With that specific example, the issue can be worked around by running vglrun -ld {build_dir}/dlfakerut {build_dir}/dlfakerut/dlfakerut.

I have never been completely clear as to why the issue occurs. My understanding of $ORIGIN is that it causes the dynamic linker to search for shared libraries in the application's directory, and I'm not sure why preloading a shared library would change that. In fact, glibc seems to look at /proc/self/exe first when resolving $ORIGIN, and VirtualGL definitely does not change /proc/self/exe.

Unless more information emerges that might enable a workaround within VirtualGL, unfortunately there is nothing I can do about this at the moment. The proposed workaround is incompatible with our official build system, and since the issue cannot be reproduced with the default builds of ParaView, it does not warrant an application recipe.

dcommander closed this as not planned on Nov 4, 2024