Redner crashing or getting stuck #122
It does not seem to be an out-of-memory issue, since some iterations do go through and the GPU monitor shows enough memory available.
It doesn't need to be minimal, but a reproducible example would be helpful.
COMPUTE_PROFILE=0 cuda-memcheck python debug.py
========= CUDA-MEMCHECK
========= Internal Memcheck Error: Initialization failed
... :( There are several obstacles to sharing the offending example. I was hoping that the mentions might trigger some thought on what the issue could be. Having said that, I understand it's toughest to try and answer without taking a look at the code. Perhaps we could arrange a 1-1, or I could try to build the debug mode and step in.
I've put together a private repo that replicates the problem, to which I've added you. You'll find a commented-out inclusion of normals which, if enabled, crashes.
This is reproducible for me on both Ubuntu and Windows 10, although on Windows 10 it is ~4x slower.
Hmm. Unfortunately I couldn't reproduce it on my Ubuntu machine. I did uncomment the computation of the normal. Is this still reproducible for you in 0.4.14?
I upgraded to the latest version, but I'll check again. Could you share a bit about your setup: specs, `pip freeze`, driver version, etc.?
I've tested it on Colab and on another machine and it runs OK, so it's an issue with my machine. It's disheartening that it is reproducible in both OSes :(
Perhaps a hardware issue?
Still no luck; I still can't run cuda-memcheck. I made a script out of the shared notebook and ran nvprof with it. I'm attaching the output. There's a segfault as early as the camera setup code. Could you perhaps do the same, i.e. run memcheck or nvprof on the code on your machine and see whether it turns out green? As another note, I've come across some GitHub issues mentioning that GPUs differ in how forgiving they are of segfaults within atomics. I guess cuda-memcheck might not be forgiving, but I can't run it successfully.
I managed to run cuda-memcheck. In Ubuntu one needs to set … The error I'm getting, repeatedly, is the following one, which seems irrelevant:
========= Program hit cudaErrorSetOnActiveProcess (error 36) due to "cannot set while device is active in this process" on CUDA API call to cudaGetLastError.
========= Saved host backtrace up to driver entry point at error
========= Host Frame:C:\WINDOWS\system32\nvcuda.dll (cuProfilerStop + 0x119e12) [0x2f1342]
========= Host Frame:g:\frameworks\Anaconda3\lib\site-packages\optix_prime.1.dll [0x758b7]
========= Host Frame:g:\frameworks\Anaconda3\lib\site-packages\optix_prime.1.dll (rtpModelGetInternals + 0x2cf71) [0xbe7b1]
========= Host Frame:g:\frameworks\Anaconda3\lib\site-packages\optix_prime.1.dll (rtpModelGetInternals + 0x2cdf0) [0xbe630]
========= Host Frame:g:\frameworks\Anaconda3\lib\site-packages\optix_prime.1.dll (rtpModelGetInternals + 0x18b60) [0xaa3a0]
========= Host Frame:g:\frameworks\Anaconda3\lib\site-packages\optix_prime.1.dll (rtpModelGetInternals + 0x17303) [0xa8b43]
========= Host Frame:g:\frameworks\Anaconda3\lib\site-packages\optix_prime.1.dll (rtpModelGetInternals + 0xba35) [0x9d275]
========= Host Frame:g:\frameworks\Anaconda3\lib\site-packages\optix_prime.1.dll (rtpGetVersionString + 0x5fa) [0x8f19a]
========= Host Frame:g:\frameworks\Anaconda3\lib\site-packages\optix_prime.1.dll (rtpContextCreate + 0x7f) [0x8946f]
========= Host Frame:g:\frameworks\Anaconda3\lib\site-packages\redner.cp37-win_amd64.pyd (PyInit_redner + 0x1aebcd) [0x2514ad]
========= Host Frame:g:\frameworks\Anaconda3\lib\site-packages\redner.cp37-win_amd64.pyd [0x90d1e]
========= Host Frame:g:\frameworks\Anaconda3\lib\site-packages\redner.cp37-win_amd64.pyd [0x8f2aa]
========= Host Frame:g:\frameworks\Anaconda3\lib\site-packages\redner.cp37-win_amd64.pyd [0x77176]
========= Host Frame:g:\frameworks\Anaconda3\lib\site-packages\redner.cp37-win_amd64.pyd [0x5af2d]
========= Host Frame:g:\frameworks\Anaconda3\python37.dll (PyMethodDef_RawFastCallDict + 0x1e3) [0x2ec13]
========= Host Frame:g:\frameworks\Anaconda3\python37.dll (PyCFunction_FastCallDict + 0x2e) [0x1317b6]
========= Host Frame:g:\frameworks\Anaconda3\python37.dll (PyErr_NoMemory + 0x1c0a0) [0x7a830]
========= Host Frame:g:\frameworks\Anaconda3\python37.dll (PyObject_Call + 0x75) [0x14c51]
========= Host Frame:g:\frameworks\Anaconda3\python37.dll (PyErr_NoMemory + 0x1b90c) [0x7a09c]
========= Host Frame:g:\frameworks\Anaconda3\python37.dll (PyObject_FastCallKeywords + 0x3e7) [0x3e957]
========= Host Frame:g:\frameworks\Anaconda3\python37.dll (PyObject_FastCallKeywords + 0x17a) [0x3e6ea]
========= Host Frame:g:\frameworks\Anaconda3\python37.dll (PyMethodDef_RawFastCallKeywords + 0xd19) [0x46939]
========= Host Frame:g:\frameworks\Anaconda3\python37.dll (PyEval_EvalFrameDefault + 0x8b2) [0x47322]
========= Host Frame:g:\frameworks\Anaconda3\python37.dll (PyEval_EvalCodeWithName + 0x1a2) [0x30192]
========= Host Frame:g:\frameworks\Anaconda3\python37.dll (PyMethodDef_RawFastCallKeywords + 0xce7) [0x46907]
========= Host Frame:g:\frameworks\Anaconda3\python37.dll (PyEval_EvalFrameDefault + 0x8b2) [0x47322]
========= Host Frame:g:\frameworks\Anaconda3\python37.dll (PyEval_EvalCodeWithName + 0x1a2) [0x30192]
========= Host Frame:g:\frameworks\Anaconda3\python37.dll (PyFunction_FastCallDict + 0x1ba) [0x2fe1a]
========= Host Frame:g:\frameworks\Anaconda3\python37.dll (PyObject_Call + 0xd3) [0x14caf]
========= Host Frame:g:\frameworks\Anaconda3\lib\site-packages\torch\lib\torch_python.dll (THPShortStorage_New + 0x2a2df) [0x1eb89f]
========= Host Frame:g:\frameworks\Anaconda3\python37.dll (PyErr_NoMemory + 0x2ff1f) [0x8e6af]
========= Host Frame:g:\frameworks\Anaconda3\python37.dll (PySlice_New + 0x14d) [0x4e46d]
========= Host Frame:g:\frameworks\Anaconda3\python37.dll (PyEval_EvalFrameDefault + 0x1174) [0x47be4]
========= Host Frame:g:\frameworks\Anaconda3\python37.dll (PyMethodDef_RawFastCallKeywords + 0xbbb) [0x467db]
========= Host Frame:g:\frameworks\Anaconda3\python37.dll (PyEval_EvalFrameDefault + 0x4af) [0x46f1f]
========= Host Frame:g:\frameworks\Anaconda3\python37.dll (PyEval_EvalCodeWithName + 0x1a2) [0x30192]
========= Host Frame:g:\frameworks\Anaconda3\python37.dll (PyEval_EvalCodeEx + 0x9b) [0xa3ab]
========= Host Frame:g:\frameworks\Anaconda3\python37.dll (PyEval_EvalCode + 0x2d) [0xa309]
========= Host Frame:g:\frameworks\Anaconda3\python37.dll (PyArena_Free + 0xa7) [0xa2b3]
========= Host Frame:g:\frameworks\Anaconda3\python37.dll (PyRun_FileExFlags + 0xc5) [0x1a5539]
========= Host Frame:g:\frameworks\Anaconda3\python37.dll (PyRun_SimpleFileExFlags + 0x24c) [0x1a5d60]
========= Host Frame:g:\frameworks\Anaconda3\python37.dll (PyRun_AnyFileExFlags + 0x63) [0x1a5407]
========= Host Frame:g:\frameworks\Anaconda3\python37.dll (Py_UnixMain + 0x573) [0xf1393]
========= Host Frame:g:\frameworks\Anaconda3\python37.dll (Py_UnixMain + 0x61b) [0xf143b]
========= Host Frame:g:\frameworks\Anaconda3\python37.dll (PyErr_NoMemory + 0x30a54) [0x8f1e4]
========= Host Frame:g:\frameworks\Anaconda3\python37.dll (Py_Main + 0x3a5) [0x4fedd]
========= Host Frame:g:\frameworks\Anaconda3\python37.dll (Py_Main + 0x52) [0x4fb8a]
========= Host Frame:g:\frameworks\Anaconda3\python.exe [0x1268]
========= Host Frame:C:\WINDOWS\System32\KERNEL32.DLL (BaseThreadInitThunk + 0x14) [0x17974]
========= Host Frame:C:\WINDOWS\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x21) [0x6a261]
Same error in Linux, a bit more specific:
========= Program hit cudaErrorSetOnActiveProcess (error 36) due to "cannot set while device is active in this process" on CUDA API call to cudaGetLastError.
========= Saved host backtrace up to driver entry point at error
========= Host Frame:/lib/x86_64-linux-gnu/libcuda.so.1 [0x3a0403]
========= Host Frame:/home/kyriazis/projects/env_redner_mano/lib/python3.7/site-packages/liboptix_prime.so.1 [0x1e6168]
========= Host Frame:/home/kyriazis/projects/env_redner_mano/lib/python3.7/site-packages/liboptix_prime.so.1 [0x49d1d]
========= Host Frame:/home/kyriazis/projects/env_redner_mano/lib/python3.7/site-packages/liboptix_prime.so.1 [0x3218d]
========= Host Frame:/home/kyriazis/projects/env_redner_mano/lib/python3.7/site-packages/liboptix_prime.so.1 [0x2e183]
========= Host Frame:/home/kyriazis/projects/env_redner_mano/lib/python3.7/site-packages/liboptix_prime.so.1 [0x1e0db]
========= Host Frame:/home/kyriazis/projects/env_redner_mano/lib/python3.7/site-packages/liboptix_prime.so.1 (rtpContextCreate + 0x55) [0x22b35]
========= Host Frame:/home/kyriazis/projects/env_redner_mano/lib/python3.7/site-packages/redner.so (_ZN5SceneC1ERK6CameraRKSt6vectorIPK5ShapeSaIS6_EERKS3_IPK8MaterialSaISD_EERKS3_IPK9AreaLightSaISK_EERKSt10shared_ptrIK14EnvironmentMapEbibb + 0x306) [0x154146]
========= Host Frame:/home/kyriazis/projects/env_redner_mano/lib/python3.7/site-packages/redner.so [0x8dc52]
========= Host Frame:/home/kyriazis/projects/env_redner_mano/lib/python3.7/site-packages/redner.so [0x44d5d]
========= Host Frame:python (_PyMethodDef_RawFastCallDict + 0x1ac) [0x1c85ec]
========= Host Frame:python [0xe0146]
========= Host Frame:python (PyObject_Call + 0x52) [0x1ccb42]
========= Host Frame:python [0x16e461]
========= Host Frame:python (_PyObject_FastCallKeywords + 0x126) [0x1c9a96]
========= Host Frame:python [0x1367d1]
========= Host Frame:python (_PyEval_EvalFrameDefault + 0x4511) [0x13d361]
========= Host Frame:python (_PyEval_EvalCodeWithName + 0x247) [0x1373a7]
========= Host Frame:python (_PyFunction_FastCallKeywords + 0x488) [0x1c9358]
========= Host Frame:python [0x136640]
========= Host Frame:python (_PyEval_EvalFrameDefault + 0x4511) [0x13d361]
========= Host Frame:python (_PyEval_EvalCodeWithName + 0x247) [0x1373a7]
========= Host Frame:python (_PyFunction_FastCallDict + 0x34e) [0x1ca52e]
========= Host Frame:/home/kyriazis/projects/env_redner_mano/lib/python3.7/site-packages/torch/lib/libtorch_python.so (_Z17THPFunction_applyP7_objectS0_ + 0xa7f) [0x6673ef]
========= Host Frame:python (PyCFunction_Call + 0x445) [0x1c8d15]
========= Host Frame:python (_PyEval_EvalFrameDefault + 0x5792) [0x13e5e2]
========= Host Frame:python (_PyFunction_FastCallKeywords + 0x18b) [0x1c905b]
========= Host Frame:python [0x136640]
========= Host Frame:python (_PyEval_EvalFrameDefault + 0x552) [0x1393a2]
========= Host Frame:python (_PyEval_EvalCodeWithName + 0x247) [0x1373a7]
========= Host Frame:python (PyEval_EvalCode + 0x23) [0x24d163]
========= Host Frame:python [0x2407d3]
========= Host Frame:python (PyRun_FileExFlags + 0x97) [0x240887]
========= Host Frame:python (PyRun_SimpleFileExFlags + 0x17a) [0x24163a]
========= Host Frame:python [0x27918f]
========= Host Frame:python (_Py_UnixMain + 0x2e) [0x27947e]
========= Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xf3) [0x271e3]
========= Host Frame:python (_start + 0x2e) [0x1cff9e]
Yeah, these errors are expected (some OptiX Prime internal assumptions). I'll take a look at your script on Wednesday. (I don't think it's the Titan V bug you posted.)
I'm afraid it is a bit more complicated for me. On the RTX 2080 Ti system it just takes far more iterations to crash.
I've confirmed the same behavior as on my system on a different system that also has a Titan V. There's some consistency there.
I've got 3/3 now on different Titan V machines exhibiting the same behavior.
I've had no luck stepping in on Ubuntu; I kept getting CUDA errors for things that would otherwise be error-free :( On Windows, stepping in revealed a point where including the normals leads to a hang at the 4th iteration (not including the normals completes successfully). I'm attaching the call stack. I'm not sure why the GPU might not complete its compute. Any hints on where to look more closely are more than welcome.
Does the same error happen when you remove the distortion parameters? What if you switch the object to the teapot?
Distortion params: the error precedes the addition of distortion params, and it persisted across newer versions. Teapot: I haven't tried the teapot. The problem goes away when the geometry is "detached" (e.g. via `.detach()`). So far it seems that it is the combination of MANO and redner that yields the problem. I've had no issues working extensively with MANO, redner aside, and MANO is pure Python. My best guess so far is that it is a corner case that challenges the assumptions redner makes about its input. There were other candidates for the error, too.
redner doesn't check the indexing for the shape, so it could be that some indices are out of bounds.
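A quick sketch of the kind of bounds check that can be run on the inputs before handing them to redner (tensor names are placeholders):

```python
import torch

# Placeholders: indices is an (F, 3) integer tensor of triangle vertex
# indices, vertices a (V, 3) float tensor, and likewise for the uvs.
assert indices.min() >= 0 and indices.max() < vertices.shape[0], \
    'geometry indices out of bounds'
assert uv_indices.min() >= 0 and uv_indices.max() < uvs.shape[0], \
    'uv indices out of bounds'
```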
Maybe also try to turn off edge sampling by setting `use_primary_edge_sampling=False` and `use_secondary_edge_sampling=False`.
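A minimal sketch of passing those flags, assuming this pyredner version exposes them on `serialize_scene` (`scene` is a placeholder for an existing pyredner.Scene):

```python
import pyredner

scene_args = pyredner.serialize_scene(
    scene=scene,
    num_samples=4,
    max_bounces=1,
    use_primary_edge_sampling=False,    # skip primary edge sampling
    use_secondary_edge_sampling=False)  # skip secondary edge sampling
render = pyredner.RenderFunction.apply
img = render(0, *scene_args)  # first argument is the random seed
```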
I'm not sure what checking the indexing means, but the faces are all within bounds (both for geometry and uvs). Are any other kinds of geometric abnormalities capable of breaking redner? I did try with edge sampling off. It takes both stages disabled to go through! I'm trying again to see whether it's repeatable.
It worked a 2nd time! This is proof in my book :)
It is possible that something went wrong in the silhouette detection inside edge sampling. Have you tried all four combinations (both on, both off, only primary on, only secondary on)?
I brought the spp and bounces down to 1 and 1 to speed it up a bit.
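A sketch of the experiment, under the same assumptions as above (`scene` is a placeholder; spp and bounces at 1, looping over the four edge-sampling combinations):

```python
import itertools
import pyredner

render = pyredner.RenderFunction.apply
for primary, secondary in itertools.product((True, False), repeat=2):
    scene_args = pyredner.serialize_scene(
        scene=scene,    # placeholder for the scene that reproduces the bug
        num_samples=1,  # spp brought down to 1 to speed things up
        max_bounces=1,
        use_primary_edge_sampling=primary,
        use_secondary_edge_sampling=secondary)
    img = render(0, *scene_args)
    # Assumes some scene parameter has requires_grad=True;
    # the hang shows up during backward.
    img.sum().backward()
    print(f'primary={primary} secondary={secondary}: OK')
```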
Cool! Most likely something is wrong with the silhouette tree: either the creation (something in edge_tree.cpp) or the traversal. One more test that would be helpful but slightly more involved: go to this line (Line 334 in 6991ca3), make the condition there true, and set `use_primary_edge_sampling=False` and `use_secondary_edge_sampling=False`. If it still crashes/hangs, most likely something is wrong with the tree construction in edge_tree.cpp. Otherwise it's probably the traversal.
For your use case it's also possible that turning off only the secondary edge sampling would suffice for your inverse problem, so if you are in a hurry that might be a solution. For me it would be helpful if we can find out what's wrong together ; )
Another quick thing to try is to change this line (Line 1123 in 6991ca3) to increase the traversal buffer size (maybe make it 256 or something). I don't think this will fix it, but it's worth trying if it is easy to do.
Making the condition true did pass. Setting secondary back to true hangs again, and increasing the buffer size did not change that. So it seems it's the traversal.
I see there's a while loop inside device code, which might explain the GPU blocking at 100%. Since there's a binary search in there, may I suggest separating the binary-search algorithm from the implementation itself so it can be tested in isolation, as in the sketch below?
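To illustrate the kind of isolated test meant here, a sketch in Python (the real search lives in device code, so this only shows the shape of the test; `binary_search` is a stand-in):

```python
import bisect
import random

def binary_search(xs, target):
    # Stand-in for the device-side search: first index i with xs[i] >= target.
    lo, hi = 0, len(xs)
    while lo < hi:
        mid = (lo + hi) // 2
        if xs[mid] < target:
            lo = mid + 1
        else:
            hi = mid
    return lo

# Property-test against the standard library on random sorted inputs,
# to catch off-by-one errors and non-terminating loop conditions.
for _ in range(10000):
    xs = sorted(random.randint(0, 100) for _ in range(random.randint(0, 50)))
    t = random.randint(-5, 105)
    assert binary_search(xs, t) == bisect.bisect_left(xs, t)
```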
How relevant is the CPU implementation? I ran it in CPU-only mode and it went through.
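For reference, a one-line sketch of forcing CPU-only mode with pyredner's device switch:

```python
import pyredner

pyredner.set_use_gpu(False)  # run on the CPU instead of CUDA/OptiX Prime
```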
Re the CPU implementation: I have no idea. If there is a buffer overrun or an infinite loop, the CPU should also crash.
I went back to the RTX 2080 machine to run the full pipeline with secondary=False, but I'm still getting a crash :( I'm trying with edge sampling completely off (it takes a while).
Wrt the sample and Titan V: if you add the …
I suspect I'm seeing a similar issue. The failure mode was a …
Are there any updates on this? I am getting the same error with 0.4.28. Only turning off both primary and secondary edge sampling lets it pass for me. With only …
I'm afraid I don't have a minimal example that replicates this, as it takes the full complexity of what I'm doing to break it.
I'll try to describe the problems I'm facing and ask for help debugging myself, while @BachiLi takes a peek too.
I'm using redner (CUDA) with an environment map variable (large, 1Kx2K) as input, a Shape that depends on variables, and a texture that depends on variables. "Depends on variables" means there is differentiable PyTorch code that transforms some variables into the redner input. I'm getting crashes with `thrust::system_error` (probably a CUDA segfault) during `backward`. These seem to go away by forcing all redner inputs to be `contiguous`, but that in turn leads to redner getting stuck: the GPU is at 100% but never finishes. It seems that environment map and geometry variables don't go well together; one or the other in isolation seems to go through.
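A sketch of the contiguity workaround (tensor names are placeholders; the real inputs come from the differentiable PyTorch code):

```python
def to_redner(t):
    # Force a contiguous layout before handing the tensor to redner;
    # non-contiguous inputs appear to trigger the thrust::system_error.
    return t.contiguous()

vertices = to_redner(vertices)            # Shape geometry (placeholder)
uvs = to_redner(uvs)                      # texture coordinates (placeholder)
texels = to_redner(texels)                # texture values (placeholder)
envmap_values = to_redner(envmap_values)  # 1Kx2K environment map (placeholder)
```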
Any insights on how to narrow down the error?