Redner crashing or getting stuck #122

Open
nkyriazis opened this issue Apr 8, 2020 · 35 comments

@nkyriazis

I'm afraid I don't have a minimal example that replicates this, as it takes the full complexity of what I'm working on to trigger it.

I'll try to describe the problems I'm facing and ask for help debugging myself, while @BachiLi takes a peek too.

I'm using redner (CUDA) with the following inputs: an environment map variable (large, 1Kx2K), a Shape that depends on variables, and a texture that depends on variables. "Depends on variables" means there is differentiable PyTorch code that transforms some variables into the redner input. I'm getting crashes with thrust::system_error (probably a CUDA segfault) during backward. These seem to go away when I force all redner inputs to be contiguous, but then redner gets stuck instead: the GPU sits at 100% and never finishes.
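
For concreteness, a minimal sketch of what I mean by forcing contiguity, assuming the pyredner.Shape and pyredner.EnvironmentMap constructors as used in the tutorials (the tensor shapes and contents below are placeholders, not my actual data):

import torch
import pyredner

device = pyredner.get_device()

# Placeholders standing in for tensors produced by a differentiable PyTorch
# pipeline; in the real code they may be non-contiguous views (transposes,
# slices, expands, ...).
vertices = torch.randn(1000, 3, device=device, requires_grad=True)
indices = torch.randint(0, 1000, (2000, 3), device=device, dtype=torch.int32)
uvs = torch.rand(1000, 2, device=device, requires_grad=True)
envmap_values = torch.rand(1024, 2048, 3, device=device, requires_grad=True)

# Workaround that makes the thrust::system_error go away for me: force every
# tensor handed to redner to be contiguous.
shape = pyredner.Shape(vertices=vertices.contiguous(),
                       indices=indices.contiguous(),
                       uvs=uvs.contiguous(),
                       material_id=0)
envmap = pyredner.EnvironmentMap(envmap_values.contiguous())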

It seems that environment map and geometry variables don't go well together. One or the other in isolation seems to go through.

Any insights on how to narrow down the error?

@nkyriazis
Author

It does not seem to be an out-of-memory issue, since some iterations do go through, and the GPU monitor shows enough memory available.

@BachiLi
Owner

BachiLi commented Apr 8, 2020

It doesn't need to be minimal, but a reproducible example would be helpful. cuda-memcheck is usually good at detecting this kind of bug.

@nkyriazis
Author

COMPUTE_PROFILE=0 cuda-memcheck python debug.py
========= CUDA-MEMCHECK
========= Internal Memcheck Error: Initialization failed
...

:(

There are several obstacles to sharing the offending example. I was hoping that the observations above might trigger some thought on what the issue could be. One thing that comes to mind, given that making the inputs contiguous changes the behavior, is that somewhere in the code there is an assumption about contiguous storage that is not asserted. Then, what would lead to an infinite loop that keeps the GPU busy at 100%? Uninitialized variables, like #93?

Having said that, I understand it's tough to answer without taking a look at the code. Perhaps we could arrange a 1:1, or I could try to build in debug mode and step in.

@nkyriazis
Author

I've put together a private repo that replicates the problem, to which I've added you. You'll find a commented-out inclusion of normals which, if enabled, crashes.

@nkyriazis
Author

This is reproducible for me on both Ubuntu and Windows 10, although on Windows 10 it is ~4X slower.

@BachiLi
Owner

BachiLi commented Apr 9, 2020

Hmm. Unfortunately I couldn't reproduce it on my Ubuntu machine. I did uncomment the computation of the normal. Is this still reproducible for you in 0.4.14?

@nkyriazis
Author

I upgraded to the latest version, but I'll check again. Could you share a bit about your specs, pip freeze, driver version, etc?

@nkyriazis
Author

I've tested it on Colab and on another machine, and it runs OK. So it's an issue with my machine. It's disheartening that it is reproducible on both OSes :(

@nkyriazis
Author

Perhaps a H/W issue?

@nkyriazis
Author

Still no luck. I still can't run cuda-memcheck.

I made a script out of the shared notebook and ran nvprof with it. I'm attaching the output. There's a segfault as early as the camera setup code.

Could you perhaps do the same? I.e., run cuda-memcheck or nvprof on the code on your machine and see whether it comes out clean?

As another note, I've come across some GitHub issues mentioning that GPUs differ in how forgiving they can be about segfaults within atomics. I guess cuda-memcheck might not be forgiving, but I can't run it successfully.

@nkyriazis
Author

nkyriazis commented Apr 13, 2020

I managed to run cuda-memcheck, but even for the smallest of problems I run out of host memory.

On Ubuntu one needs to set CUDA_MEMCHECK_PATCH_MODULES=1.

The error I'm getting, repeatedly, is the following, which seems irrelevant:

========= Program hit cudaErrorSetOnActiveProcess (error 36) due to "cannot set while device is active in this process" on CUDA API call to cudaGetLastError.
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:C:\WINDOWS\system32\nvcuda.dll (cuProfilerStop + 0x119e12) [0x2f1342]
=========     Host Frame:g:\frameworks\Anaconda3\lib\site-packages\optix_prime.1.dll [0x758b7]
=========     Host Frame:g:\frameworks\Anaconda3\lib\site-packages\optix_prime.1.dll (rtpModelGetInternals + 0x2cf71) [0xbe7b1]
=========     Host Frame:g:\frameworks\Anaconda3\lib\site-packages\optix_prime.1.dll (rtpModelGetInternals + 0x2cdf0) [0xbe630]
=========     Host Frame:g:\frameworks\Anaconda3\lib\site-packages\optix_prime.1.dll (rtpModelGetInternals + 0x18b60) [0xaa3a0]
=========     Host Frame:g:\frameworks\Anaconda3\lib\site-packages\optix_prime.1.dll (rtpModelGetInternals + 0x17303) [0xa8b43]
=========     Host Frame:g:\frameworks\Anaconda3\lib\site-packages\optix_prime.1.dll (rtpModelGetInternals + 0xba35) [0x9d275]
=========     Host Frame:g:\frameworks\Anaconda3\lib\site-packages\optix_prime.1.dll (rtpGetVersionString + 0x5fa) [0x8f19a]
=========     Host Frame:g:\frameworks\Anaconda3\lib\site-packages\optix_prime.1.dll (rtpContextCreate + 0x7f) [0x8946f]
=========     Host Frame:g:\frameworks\Anaconda3\lib\site-packages\redner.cp37-win_amd64.pyd (PyInit_redner + 0x1aebcd) [0x2514ad]
=========     Host Frame:g:\frameworks\Anaconda3\lib\site-packages\redner.cp37-win_amd64.pyd [0x90d1e]
=========     Host Frame:g:\frameworks\Anaconda3\lib\site-packages\redner.cp37-win_amd64.pyd [0x8f2aa]
=========     Host Frame:g:\frameworks\Anaconda3\lib\site-packages\redner.cp37-win_amd64.pyd [0x77176]
=========     Host Frame:g:\frameworks\Anaconda3\lib\site-packages\redner.cp37-win_amd64.pyd [0x5af2d]
=========     Host Frame:g:\frameworks\Anaconda3\python37.dll (PyMethodDef_RawFastCallDict + 0x1e3) [0x2ec13]
=========     Host Frame:g:\frameworks\Anaconda3\python37.dll (PyCFunction_FastCallDict + 0x2e) [0x1317b6]
=========     Host Frame:g:\frameworks\Anaconda3\python37.dll (PyErr_NoMemory + 0x1c0a0) [0x7a830]
=========     Host Frame:g:\frameworks\Anaconda3\python37.dll (PyObject_Call + 0x75) [0x14c51]
=========     Host Frame:g:\frameworks\Anaconda3\python37.dll (PyErr_NoMemory + 0x1b90c) [0x7a09c]
=========     Host Frame:g:\frameworks\Anaconda3\python37.dll (PyObject_FastCallKeywords + 0x3e7) [0x3e957]
=========     Host Frame:g:\frameworks\Anaconda3\python37.dll (PyObject_FastCallKeywords + 0x17a) [0x3e6ea]
=========     Host Frame:g:\frameworks\Anaconda3\python37.dll (PyMethodDef_RawFastCallKeywords + 0xd19) [0x46939]
=========     Host Frame:g:\frameworks\Anaconda3\python37.dll (PyEval_EvalFrameDefault + 0x8b2) [0x47322]
=========     Host Frame:g:\frameworks\Anaconda3\python37.dll (PyEval_EvalCodeWithName + 0x1a2) [0x30192]
=========     Host Frame:g:\frameworks\Anaconda3\python37.dll (PyMethodDef_RawFastCallKeywords + 0xce7) [0x46907]
=========     Host Frame:g:\frameworks\Anaconda3\python37.dll (PyEval_EvalFrameDefault + 0x8b2) [0x47322]
=========     Host Frame:g:\frameworks\Anaconda3\python37.dll (PyEval_EvalCodeWithName + 0x1a2) [0x30192]
=========     Host Frame:g:\frameworks\Anaconda3\python37.dll (PyFunction_FastCallDict + 0x1ba) [0x2fe1a]
=========     Host Frame:g:\frameworks\Anaconda3\python37.dll (PyObject_Call + 0xd3) [0x14caf]
=========     Host Frame:g:\frameworks\Anaconda3\lib\site-packages\torch\lib\torch_python.dll (THPShortStorage_New + 0x2a2df) [0x1eb89f]
=========     Host Frame:g:\frameworks\Anaconda3\python37.dll (PyErr_NoMemory + 0x2ff1f) [0x8e6af]
=========     Host Frame:g:\frameworks\Anaconda3\python37.dll (PySlice_New + 0x14d) [0x4e46d]
=========     Host Frame:g:\frameworks\Anaconda3\python37.dll (PyEval_EvalFrameDefault + 0x1174) [0x47be4]
=========     Host Frame:g:\frameworks\Anaconda3\python37.dll (PyMethodDef_RawFastCallKeywords + 0xbbb) [0x467db]
=========     Host Frame:g:\frameworks\Anaconda3\python37.dll (PyEval_EvalFrameDefault + 0x4af) [0x46f1f]
=========     Host Frame:g:\frameworks\Anaconda3\python37.dll (PyEval_EvalCodeWithName + 0x1a2) [0x30192]
=========     Host Frame:g:\frameworks\Anaconda3\python37.dll (PyEval_EvalCodeEx + 0x9b) [0xa3ab]
=========     Host Frame:g:\frameworks\Anaconda3\python37.dll (PyEval_EvalCode + 0x2d) [0xa309]
=========     Host Frame:g:\frameworks\Anaconda3\python37.dll (PyArena_Free + 0xa7) [0xa2b3]
=========     Host Frame:g:\frameworks\Anaconda3\python37.dll (PyRun_FileExFlags + 0xc5) [0x1a5539]
=========     Host Frame:g:\frameworks\Anaconda3\python37.dll (PyRun_SimpleFileExFlags + 0x24c) [0x1a5d60]
=========     Host Frame:g:\frameworks\Anaconda3\python37.dll (PyRun_AnyFileExFlags + 0x63) [0x1a5407]
=========     Host Frame:g:\frameworks\Anaconda3\python37.dll (Py_UnixMain + 0x573) [0xf1393]
=========     Host Frame:g:\frameworks\Anaconda3\python37.dll (Py_UnixMain + 0x61b) [0xf143b]
=========     Host Frame:g:\frameworks\Anaconda3\python37.dll (PyErr_NoMemory + 0x30a54) [0x8f1e4]
=========     Host Frame:g:\frameworks\Anaconda3\python37.dll (Py_Main + 0x3a5) [0x4fedd]
=========     Host Frame:g:\frameworks\Anaconda3\python37.dll (Py_Main + 0x52) [0x4fb8a]
=========     Host Frame:g:\frameworks\Anaconda3\python.exe [0x1268]
=========     Host Frame:C:\WINDOWS\System32\KERNEL32.DLL (BaseThreadInitThunk + 0x14) [0x17974]
=========     Host Frame:C:\WINDOWS\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x21) [0x6a261]

@nkyriazis
Author

Same error on Linux, a bit more specific:

========= Program hit cudaErrorSetOnActiveProcess (error 36) due to "cannot set while device is active in this process" on CUDA API call to cudaGetLastError. 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/lib/x86_64-linux-gnu/libcuda.so.1 [0x3a0403]
=========     Host Frame:/home/kyriazis/projects/env_redner_mano/lib/python3.7/site-packages/liboptix_prime.so.1 [0x1e6168]
=========     Host Frame:/home/kyriazis/projects/env_redner_mano/lib/python3.7/site-packages/liboptix_prime.so.1 [0x49d1d]
=========     Host Frame:/home/kyriazis/projects/env_redner_mano/lib/python3.7/site-packages/liboptix_prime.so.1 [0x3218d]
=========     Host Frame:/home/kyriazis/projects/env_redner_mano/lib/python3.7/site-packages/liboptix_prime.so.1 [0x2e183]
=========     Host Frame:/home/kyriazis/projects/env_redner_mano/lib/python3.7/site-packages/liboptix_prime.so.1 [0x1e0db]
=========     Host Frame:/home/kyriazis/projects/env_redner_mano/lib/python3.7/site-packages/liboptix_prime.so.1 (rtpContextCreate + 0x55) [0x22b35]
=========     Host Frame:/home/kyriazis/projects/env_redner_mano/lib/python3.7/site-packages/redner.so (_ZN5SceneC1ERK6CameraRKSt6vectorIPK5ShapeSaIS6_EERKS3_IPK8MaterialSaISD_EERKS3_IPK9AreaLightSaISK_EERKSt10shared_ptrIK14EnvironmentMapEbibb + 0x306) [0x154146]
=========     Host Frame:/home/kyriazis/projects/env_redner_mano/lib/python3.7/site-packages/redner.so [0x8dc52]
=========     Host Frame:/home/kyriazis/projects/env_redner_mano/lib/python3.7/site-packages/redner.so [0x44d5d]
=========     Host Frame:python (_PyMethodDef_RawFastCallDict + 0x1ac) [0x1c85ec]
=========     Host Frame:python [0xe0146]
=========     Host Frame:python (PyObject_Call + 0x52) [0x1ccb42]
=========     Host Frame:python [0x16e461]
=========     Host Frame:python (_PyObject_FastCallKeywords + 0x126) [0x1c9a96]
=========     Host Frame:python [0x1367d1]
=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x4511) [0x13d361]
=========     Host Frame:python (_PyEval_EvalCodeWithName + 0x247) [0x1373a7]
=========     Host Frame:python (_PyFunction_FastCallKeywords + 0x488) [0x1c9358]
=========     Host Frame:python [0x136640]
=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x4511) [0x13d361]
=========     Host Frame:python (_PyEval_EvalCodeWithName + 0x247) [0x1373a7]
=========     Host Frame:python (_PyFunction_FastCallDict + 0x34e) [0x1ca52e]
=========     Host Frame:/home/kyriazis/projects/env_redner_mano/lib/python3.7/site-packages/torch/lib/libtorch_python.so (_Z17THPFunction_applyP7_objectS0_ + 0xa7f) [0x6673ef]
=========     Host Frame:python (PyCFunction_Call + 0x445) [0x1c8d15]
=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x5792) [0x13e5e2]
=========     Host Frame:python (_PyFunction_FastCallKeywords + 0x18b) [0x1c905b]
=========     Host Frame:python [0x136640]
=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x552) [0x1393a2]
=========     Host Frame:python (_PyEval_EvalCodeWithName + 0x247) [0x1373a7]
=========     Host Frame:python (PyEval_EvalCode + 0x23) [0x24d163]
=========     Host Frame:python [0x2407d3]
=========     Host Frame:python (PyRun_FileExFlags + 0x97) [0x240887]
=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x17a) [0x24163a]
=========     Host Frame:python [0x27918f]
=========     Host Frame:python (_Py_UnixMain + 0x2e) [0x27947e]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xf3) [0x271e3]
=========     Host Frame:python (_start + 0x2e) [0x1cff9e]

@BachiLi
Owner

BachiLi commented Apr 13, 2020

Yeah, these errors are expected (some OptiX Prime internal assumptions). I'll take a look at your script on Wednesday. (I don't think it's the Titan V bug you posted.)

@nkyriazis
Author

I'm afraid it is a bit more complicated for me. On the RTX 2080 Ti system it just takes far more iterations to crash.

@nkyriazis
Author

nkyriazis commented Apr 13, 2020

I've confirmed the same behavior on a different system that also has a Titan V. There's some consistency there.

@nkyriazis
Author

I'm now at 3/3 different Titan V machines exhibiting the same behavior.

@nkyriazis
Author

nkyriazis commented Apr 21, 2020

I've had no luck stepping in on Ubuntu. I kept getting CUDA errors for things that would otherwise be error-free :(

On Windows, stepping in revealed a point where including the normals leads to a hang at the 4th iteration (not including the normals completes successfully). I'm attaching the call stack. I'm not sure why the GPU might not complete its computation.

Any hints on where to look more closely are more than welcome.

[attached screenshot: call stack at the hang]

@BachiLi
Owner

BachiLi commented Apr 21, 2020

Does the same error happen when you remove the distortion parameters? What if you switch the object to the teapot?

@nkyriazis
Author

Distortion params: the error precedes the addition of the distortion params, and it persisted across newer versions.

Teapot: I haven't tried the teapot. The problem goes away when the geometry is "detached" (with torch.no_grad()), so it seems to be an issue with backprop and Shape. Of course this is one problematic case among many working examples where the geometry is parameterized (e.g. optimizing object pose).
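
For concreteness, this is roughly what I mean by "detached", as a minimal sketch (build_vertices is a toy placeholder standing in for the MANO pipeline, and the pyredner.Shape call follows the tutorials):

import torch
import pyredner

device = pyredner.get_device()

# Toy stand-ins for the MANO pipeline: optimization variables and a
# differentiable function that turns them into mesh vertices.
params = torch.zeros(10, device=device, requires_grad=True)
base_vertices = torch.randn(1000, 3, device=device)
indices = torch.randint(0, 1000, (2000, 3), device=device, dtype=torch.int32)

def build_vertices(p):
    # Placeholder deformation; the real code is the full MANO model.
    return base_vertices * (1.0 + p.mean())

# Parameterized geometry: gradients flow into the Shape, and backward
# crashes/hangs for me.
vertices = build_vertices(params)

# "Detached" geometry: no gradients through the Shape, and everything passes.
# with torch.no_grad():
#     vertices = build_vertices(params)

shape = pyredner.Shape(vertices=vertices.contiguous(),
                       indices=indices.contiguous(),
                       material_id=0)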

So far it seems that it is the combination of MANO and redner that yields the problem. I've had no issues working extensively with MANO, redner aside. Also, MANO is pure Python. My best guess so far is that it is a corner case that challenges the assumptions redner makes about its input.

Other candidates for the error:

  • PyTorch
  • Thrust
  • CUB
  • MANO

@BachiLi
Owner

BachiLi commented Apr 21, 2020

redner doesn't check the indexing for the shape, so it could be that.
Other than that, it's hard for me to imagine what could go wrong.

@BachiLi
Owner

BachiLi commented Apr 21, 2020

maybe also try to turn off edge sampling by setting use_primary_edge_sampling and use_secondary_edge_sampling to False?
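
For reference, a minimal sketch of where those flags go, assuming the usual serialize_scene call from the tutorials (scene here stands for the pyredner.Scene you already build in your script):

import pyredner

# scene is assumed to be the already constructed pyredner.Scene.
scene_args = pyredner.RenderFunction.serialize_scene(
    scene=scene,
    num_samples=4,
    max_bounces=1,
    use_primary_edge_sampling=False,
    use_secondary_edge_sampling=False)

render = pyredner.RenderFunction.apply
img = render(0, *scene_args)   # first argument is the random seed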

@nkyriazis
Author

I'm not sure what checking the indexing means, but the faces are all within bounds (both for geometry and uvs). Are any other kinds of geometric abnormalities capable of breaking redner?

I did try with edge sampling off. It takes both stages disabled for it to go through! I'm trying again to see whether it's repeatable.

@nkyriazis
Author

nkyriazis commented Apr 21, 2020

It worked a 2nd time! This is proof in my book :)

@BachiLi
Owner

BachiLi commented Apr 21, 2020

It is possible that something went wrong in the silhouette detection inside edge sampling. Have you tried all four combinations? (both on, both off, only primary on, only secondary on)

@nkyriazis
Author

I brought down the spp and bounces to 1 and 1, to speed it up a bit:

primary   secondary   result
True      True        hang
False     False       pass
False     True        crash
True      False       pass
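
For reference, I swept the combinations with roughly the following sketch (scene and target stand for the objects from the shared script; the hanging configuration of course has to be killed manually, and the serialize_scene flags are the ones discussed above):

import itertools
import torch
import pyredner

render = pyredner.RenderFunction.apply

# scene and target are assumed to exist already (the ones from the repo).
for primary, secondary in itertools.product([True, False], repeat=2):
    print(f'primary={primary}, secondary={secondary}')
    scene_args = pyredner.RenderFunction.serialize_scene(
        scene=scene,
        num_samples=1,   # spp brought down to 1
        max_bounces=1,   # bounces brought down to 1
        use_primary_edge_sampling=primary,
        use_secondary_edge_sampling=secondary)
    img = render(0, *scene_args)
    loss = (img - target).pow(2).sum()
    loss.backward()                  # the crash/hang shows up during backward
    torch.cuda.synchronize()
    print('  pass')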

@BachiLi
Owner

BachiLi commented Apr 21, 2020

Cool! Most likely something is wrong with the silhouette tree. Either the creation (something in edge_tree.cpp) or the traversal (something in secondary_edge_sampler in edge.cpp) has buffer overrun issues (or both). The inclusion of vertex normals makes the difference because is_silhouette in edge.h now returns different results.

One more test that would be helpful but slightly more involved: go to this line

if (scene.use_secondary_edge_sampling) {
and make the if condition always true. This forces redner to always construct the silhouette tree even if secondary edge sampling is turned off. Test with use_primary_edge_sampling=False and use_secondary_edge_sampling=False. If it still crashes/hangs, most likely something is wrong with the tree construction in edge_tree.cpp. Otherwise, it's probably the traversal.

For your use case it's also possible that turning off secondary edge sampling would suffice for your inverse problem, so if you are in a hurry that might be a solution. For me it would be helpful if we can find out what's wrong together ; )

@BachiLi
Owner

BachiLi commented Apr 21, 2020

Another quick thing to try is to change this line to increase the traversal buffer size:

constexpr auto buffer_size = 128;

(maybe make it 256 or something)
I don't think this will fix it, but it's worth trying since it's an easy change.

@nkyriazis
Author

Making the condition always true did pass. Setting secondary back to True hangs again. Increasing the buffer size did not change that. So it seems it's the traversal.

@nkyriazis
Author

I'm seeing there's a while loop inside device code, which might explain the GPU blocking at 100%. Since there's a binary search, may I suggest separating the binary search algorithm from the implementation itself, so it can be tested separately, and perhaps using thrust::lower_bound for the former?

@nkyriazis
Author

How relevant is the CPU implementation? I ran it in CPU-only mode and it went through.
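
By CPU-only mode I mean switching redner to the CPU before building the scene, roughly:

import pyredner

# Force redner onto the CPU; the scene tensors should then also live on the
# CPU (pyredner.get_device() should return the CPU device after this call).
pyredner.set_use_gpu(False)
device = pyredner.get_device()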

@BachiLi
Owner

BachiLi commented Apr 22, 2020

Re the CPU impl: I have no idea. If there is a buffer overrun/infinite loop, the CPU version should also crash.
Re the binary search: It's a tree traversal and is not easily replaceable by thrust::lower_bound. It is possible that the construction phase built an invalid tree that causes the indexing of the tree nodes to go out of bounds.

@nkyriazis
Author

nkyriazis commented Apr 22, 2020

I went back to the RTX 2080 machine to run the full pipeline with secondary=False, but I'm still getting a crash :(

I'm trying with edge sampling completely off (it takes a while).

@nkyriazis
Author

W.r.t. the sample and the Titan V: if you add the data['mano'] parameters to the optimization you get a very early crash, which might help figure out what the issue is.

@markdjwilliams

I suspect I'm seeing a similar issue. The failure mode was a thrust::system_error but setting use_secondary_edge_sampling=False alone seems to address it. I have a single object (of about 50,000 triangles) whose deformation is being predicted by a NN. Everything was working without error for quite some time until I extended the existing channels (alpha, uv, triangle_id, barycentric) to include radiance while adding a single area light.

@CatoGit

CatoGit commented Mar 25, 2021

Are there any updates on this? I am getting the same error with 0.4.28. Only turning off both primary and secondary edge sampling lets it pass for me. With only use_secondary_edge_sampling set to False it still crashes.
