Redner crashing or getting stuck #122

Open
nkyriazis opened this issue Apr 8, 2020 · 35 comments

@nkyriazis

I'm afraid I don't have a minimal example that replicates this, as it takes the full complexity of what I'm working on to trigger it.

I'll try to describe the problems I'm facing and ask for help debugging myself, while @BachiLi takes a peek too.

I'm using redner (CUDA) with the following inputs: an environment map variable (large, 1Kx2K), a Shape that depends on variables, and a texture that depends on variables. "Depends on variables" means there is differentiable PyTorch code that transforms some variables into the redner input. I'm getting crashes with thrust::system_error (probably a CUDA segfault) during backward. These seem to go away when I force all redner inputs to be contiguous, but then redner gets stuck instead: the GPU sits at 100% and never finishes.
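
For concreteness, a minimal sketch of what I mean by forcing contiguity, assuming the pyredner.Shape and pyredner.EnvironmentMap constructors as used in the tutorials (the tensor shapes and contents below are placeholders, not my actual data):

import torch
import pyredner

device = pyredner.get_device()

# Placeholders standing in for tensors produced by a differentiable PyTorch
# pipeline; in the real code they may be non-contiguous views (transposes,
# slices, expands, ...).
vertices = torch.randn(1000, 3, device=device, requires_grad=True)
indices = torch.randint(0, 1000, (2000, 3), device=device, dtype=torch.int32)
uvs = torch.rand(1000, 2, device=device, requires_grad=True)
envmap_values = torch.rand(1024, 2048, 3, device=device, requires_grad=True)

# Workaround that makes the thrust::system_error go away for me: force every
# tensor handed to redner to be contiguous.
shape = pyredner.Shape(vertices=vertices.contiguous(),
                       indices=indices.contiguous(),
                       uvs=uvs.contiguous(),
                       material_id=0)
envmap = pyredner.EnvironmentMap(envmap_values.contiguous())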

It seems that environment map and geometry variables don't go well together. One or the other in isolation seems to go through.

Any insights on how to narrow down the error?

@nkyriazis
Author

It does not seem to be an out-of-memory issue, since some iterations do go through, and the GPU monitor shows enough memory available.

@BachiLi
Owner

BachiLi commented Apr 8, 2020

It doesn't need to be minimal, but a reproducible example would be helpful. cuda-memcheck is usually good at detecting this kind of bug.

@nkyriazis
Author

COMPUTE_PROFILE=0 cuda-memcheck python debug.py
========= CUDA-MEMCHECK
========= Internal Memcheck Error: Initialization failed
...

:(

There are several obstacles to sharing the offending example. I was hoping that the observations above might trigger some thought on what the issue could be. One thing that comes to mind, given that making the inputs contiguous changes the behavior, is that somewhere in the code there is an assumption about contiguous storage that is not asserted. Then, what would lead to an infinite loop that keeps the GPU busy at 100%? Uninitialized variables, like #93?

Having said that, I understand it's tough to answer without taking a look at the code. Perhaps we could arrange a 1:1, or I could try to build in debug mode and step in.

@nkyriazis
Author

I've put together a private repo that replicates the problem, to which I've added you. You'll find a commented-out inclusion of normals which, if enabled, crashes.

@nkyriazis
Author

This is reproducible for me on both Ubuntu and Windows 10, although on Windows 10 it is ~4X slower.

@BachiLi
Owner

BachiLi commented Apr 9, 2020

Hmm. Unfortunately I couldn't reproduce it on my Ubuntu machine. I did uncomment the computation of the normal. Is this still reproducible for you in 0.4.14?

@nkyriazis
Author

I upgraded to the latest version, but I'll check again. Could you share a bit about your specs, pip freeze, driver version, etc?

@nkyriazis
Author

I've tested it on Colab and on another machine, and it runs OK. So it's an issue with my machine. It's disheartening that it is reproducible on both OSes :(

@nkyriazis
Author

Perhaps a H/W issue?

@nkyriazis
Author

Still no luck. I still can't run cuda-memcheck.

I made a script out of the shared notebook and ran nvprof with it. I'm attaching the output. There's a segfault as early as the camera setup code.

Could you perhaps do the same? I.e., run cuda-memcheck or nvprof on the code on your machine and see whether it comes out clean?

As another note, I've come across some GitHub issues mentioning that GPUs differ in how forgiving they can be about segfaults within atomics. I guess cuda-memcheck might not be forgiving, but I can't run it successfully.

@nkyriazis
Author

nkyriazis commented Apr 13, 2020

I managed to run cuda-memcheck, but even for the smallest of problems I run out of host memory.

On Ubuntu one needs to set CUDA_MEMCHECK_PATCH_MODULES=1.

The error I'm getting, repeatedly, is the following, which seems irrelevant:

========= Program hit cudaErrorSetOnActiveProcess (error 36) due to "cannot set while device is active in this process" on CUDA API call to cudaGetLastError.
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:C:\WINDOWS\system32\nvcuda.dll (cuProfilerStop + 0x119e12) [0x2f1342]
=========     Host Frame:g:\frameworks\Anaconda3\lib\site-packages\optix_prime.1.dll [0x758b7]
=========     Host Frame:g:\frameworks\Anaconda3\lib\site-packages\optix_prime.1.dll (rtpModelGetInternals + 0x2cf71) [0xbe7b1]
=========     Host Frame:g:\frameworks\Anaconda3\lib\site-packages\optix_prime.1.dll (rtpModelGetInternals + 0x2cdf0) [0xbe630]
=========     Host Frame:g:\frameworks\Anaconda3\lib\site-packages\optix_prime.1.dll (rtpModelGetInternals + 0x18b60) [0xaa3a0]
=========     Host Frame:g:\frameworks\Anaconda3\lib\site-packages\optix_prime.1.dll (rtpModelGetInternals + 0x17303) [0xa8b43]
=========     Host Frame:g:\frameworks\Anaconda3\lib\site-packages\optix_prime.1.dll (rtpModelGetInternals + 0xba35) [0x9d275]
=========     Host Frame:g:\frameworks\Anaconda3\lib\site-packages\optix_prime.1.dll (rtpGetVersionString + 0x5fa) [0x8f19a]
=========     Host Frame:g:\frameworks\Anaconda3\lib\site-packages\optix_prime.1.dll (rtpContextCreate + 0x7f) [0x8946f]
=========     Host Frame:g:\frameworks\Anaconda3\lib\site-packages\redner.cp37-win_amd64.pyd (PyInit_redner + 0x1aebcd) [0x2514ad]
=========     Host Frame:g:\frameworks\Anaconda3\lib\site-packages\redner.cp37-win_amd64.pyd [0x90d1e]
=========     Host Frame:g:\frameworks\Anaconda3\lib\site-packages\redner.cp37-win_amd64.pyd [0x8f2aa]
=========     Host Frame:g:\frameworks\Anaconda3\lib\site-packages\redner.cp37-win_amd64.pyd [0x77176]
=========     Host Frame:g:\frameworks\Anaconda3\lib\site-packages\redner.cp37-win_amd64.pyd [0x5af2d]
=========     Host Frame:g:\frameworks\Anaconda3\python37.dll (PyMethodDef_RawFastCallDict + 0x1e3) [0x2ec13]
=========     Host Frame:g:\frameworks\Anaconda3\python37.dll (PyCFunction_FastCallDict + 0x2e) [0x1317b6]
=========     Host Frame:g:\frameworks\Anaconda3\python37.dll (PyErr_NoMemory + 0x1c0a0) [0x7a830]
=========     Host Frame:g:\frameworks\Anaconda3\python37.dll (PyObject_Call + 0x75) [0x14c51]
=========     Host Frame:g:\frameworks\Anaconda3\python37.dll (PyErr_NoMemory + 0x1b90c) [0x7a09c]
=========     Host Frame:g:\frameworks\Anaconda3\python37.dll (PyObject_FastCallKeywords + 0x3e7) [0x3e957]
=========     Host Frame:g:\frameworks\Anaconda3\python37.dll (PyObject_FastCallKeywords + 0x17a) [0x3e6ea]
=========     Host Frame:g:\frameworks\Anaconda3\python37.dll (PyMethodDef_RawFastCallKeywords + 0xd19) [0x46939]
=========     Host Frame:g:\frameworks\Anaconda3\python37.dll (PyEval_EvalFrameDefault + 0x8b2) [0x47322]
=========     Host Frame:g:\frameworks\Anaconda3\python37.dll (PyEval_EvalCodeWithName + 0x1a2) [0x30192]
=========     Host Frame:g:\frameworks\Anaconda3\python37.dll (PyMethodDef_RawFastCallKeywords + 0xce7) [0x46907]
=========     Host Frame:g:\frameworks\Anaconda3\python37.dll (PyEval_EvalFrameDefault + 0x8b2) [0x47322]
=========     Host Frame:g:\frameworks\Anaconda3\python37.dll (PyEval_EvalCodeWithName + 0x1a2) [0x30192]
=========     Host Frame:g:\frameworks\Anaconda3\python37.dll (PyFunction_FastCallDict + 0x1ba) [0x2fe1a]
=========     Host Frame:g:\frameworks\Anaconda3\python37.dll (PyObject_Call + 0xd3) [0x14caf]
=========     Host Frame:g:\frameworks\Anaconda3\lib\site-packages\torch\lib\torch_python.dll (THPShortStorage_New + 0x2a2df) [0x1eb89f]
=========     Host Frame:g:\frameworks\Anaconda3\python37.dll (PyErr_NoMemory + 0x2ff1f) [0x8e6af]
=========     Host Frame:g:\frameworks\Anaconda3\python37.dll (PySlice_New + 0x14d) [0x4e46d]
=========     Host Frame:g:\frameworks\Anaconda3\python37.dll (PyEval_EvalFrameDefault + 0x1174) [0x47be4]
=========     Host Frame:g:\frameworks\Anaconda3\python37.dll (PyMethodDef_RawFastCallKeywords + 0xbbb) [0x467db]
=========     Host Frame:g:\frameworks\Anaconda3\python37.dll (PyEval_EvalFrameDefault + 0x4af) [0x46f1f]
=========     Host Frame:g:\frameworks\Anaconda3\python37.dll (PyEval_EvalCodeWithName + 0x1a2) [0x30192]
=========     Host Frame:g:\frameworks\Anaconda3\python37.dll (PyEval_EvalCodeEx + 0x9b) [0xa3ab]
=========     Host Frame:g:\frameworks\Anaconda3\python37.dll (PyEval_EvalCode + 0x2d) [0xa309]
=========     Host Frame:g:\frameworks\Anaconda3\python37.dll (PyArena_Free + 0xa7) [0xa2b3]
=========     Host Frame:g:\frameworks\Anaconda3\python37.dll (PyRun_FileExFlags + 0xc5) [0x1a5539]
=========     Host Frame:g:\frameworks\Anaconda3\python37.dll (PyRun_SimpleFileExFlags + 0x24c) [0x1a5d60]
=========     Host Frame:g:\frameworks\Anaconda3\python37.dll (PyRun_AnyFileExFlags + 0x63) [0x1a5407]
=========     Host Frame:g:\frameworks\Anaconda3\python37.dll (Py_UnixMain + 0x573) [0xf1393]
=========     Host Frame:g:\frameworks\Anaconda3\python37.dll (Py_UnixMain + 0x61b) [0xf143b]
=========     Host Frame:g:\frameworks\Anaconda3\python37.dll (PyErr_NoMemory + 0x30a54) [0x8f1e4]
=========     Host Frame:g:\frameworks\Anaconda3\python37.dll (Py_Main + 0x3a5) [0x4fedd]
=========     Host Frame:g:\frameworks\Anaconda3\python37.dll (Py_Main + 0x52) [0x4fb8a]
=========     Host Frame:g:\frameworks\Anaconda3\python.exe [0x1268]
=========     Host Frame:C:\WINDOWS\System32\KERNEL32.DLL (BaseThreadInitThunk + 0x14) [0x17974]
=========     Host Frame:C:\WINDOWS\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x21) [0x6a261]

@nkyriazis
Author

Same error on Linux, a bit more specific:

========= Program hit cudaErrorSetOnActiveProcess (error 36) due to "cannot set while device is active in this process" on CUDA API call to cudaGetLastError. 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/lib/x86_64-linux-gnu/libcuda.so.1 [0x3a0403]
=========     Host Frame:/home/kyriazis/projects/env_redner_mano/lib/python3.7/site-packages/liboptix_prime.so.1 [0x1e6168]
=========     Host Frame:/home/kyriazis/projects/env_redner_mano/lib/python3.7/site-packages/liboptix_prime.so.1 [0x49d1d]
=========     Host Frame:/home/kyriazis/projects/env_redner_mano/lib/python3.7/site-packages/liboptix_prime.so.1 [0x3218d]
=========     Host Frame:/home/kyriazis/projects/env_redner_mano/lib/python3.7/site-packages/liboptix_prime.so.1 [0x2e183]
=========     Host Frame:/home/kyriazis/projects/env_redner_mano/lib/python3.7/site-packages/liboptix_prime.so.1 [0x1e0db]
=========     Host Frame:/home/kyriazis/projects/env_redner_mano/lib/python3.7/site-packages/liboptix_prime.so.1 (rtpContextCreate + 0x55) [0x22b35]
=========     Host Frame:/home/kyriazis/projects/env_redner_mano/lib/python3.7/site-packages/redner.so (_ZN5SceneC1ERK6CameraRKSt6vectorIPK5ShapeSaIS6_EERKS3_IPK8MaterialSaISD_EERKS3_IPK9AreaLightSaISK_EERKSt10shared_ptrIK14EnvironmentMapEbibb + 0x306) [0x154146]
=========     Host Frame:/home/kyriazis/projects/env_redner_mano/lib/python3.7/site-packages/redner.so [0x8dc52]
=========     Host Frame:/home/kyriazis/projects/env_redner_mano/lib/python3.7/site-packages/redner.so [0x44d5d]
=========     Host Frame:python (_PyMethodDef_RawFastCallDict + 0x1ac) [0x1c85ec]
=========     Host Frame:python [0xe0146]
=========     Host Frame:python (PyObject_Call + 0x52) [0x1ccb42]
=========     Host Frame:python [0x16e461]
=========     Host Frame:python (_PyObject_FastCallKeywords + 0x126) [0x1c9a96]
=========     Host Frame:python [0x1367d1]
=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x4511) [0x13d361]
=========     Host Frame:python (_PyEval_EvalCodeWithName + 0x247) [0x1373a7]
=========     Host Frame:python (_PyFunction_FastCallKeywords + 0x488) [0x1c9358]
=========     Host Frame:python [0x136640]
=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x4511) [0x13d361]
=========     Host Frame:python (_PyEval_EvalCodeWithName + 0x247) [0x1373a7]
=========     Host Frame:python (_PyFunction_FastCallDict + 0x34e) [0x1ca52e]
=========     Host Frame:/home/kyriazis/projects/env_redner_mano/lib/python3.7/site-packages/torch/lib/libtorch_python.so (_Z17THPFunction_applyP7_objectS0_ + 0xa7f) [0x6673ef]
=========     Host Frame:python (PyCFunction_Call + 0x445) [0x1c8d15]
=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x5792) [0x13e5e2]
=========     Host Frame:python (_PyFunction_FastCallKeywords + 0x18b) [0x1c905b]
=========     Host Frame:python [0x136640]
=========     Host Frame:python (_PyEval_EvalFrameDefault + 0x552) [0x1393a2]
=========     Host Frame:python (_PyEval_EvalCodeWithName + 0x247) [0x1373a7]
=========     Host Frame:python (PyEval_EvalCode + 0x23) [0x24d163]
=========     Host Frame:python [0x2407d3]
=========     Host Frame:python (PyRun_FileExFlags + 0x97) [0x240887]
=========     Host Frame:python (PyRun_SimpleFileExFlags + 0x17a) [0x24163a]
=========     Host Frame:python [0x27918f]
=========     Host Frame:python (_Py_UnixMain + 0x2e) [0x27947e]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xf3) [0x271e3]
=========     Host Frame:python (_start + 0x2e) [0x1cff9e]

@BachiLi
Owner

BachiLi commented Apr 13, 2020

Yeah, these errors are expected (some OptiX Prime internal assumptions). I'll take a look at your script on Wednesday. (I don't think it's the Titan V bug you posted.)

@nkyriazis
Author

I'm afraid it is a bit more complicated for me. On the RTX 2080 Ti system it just takes far more iterations to crash.

@nkyriazis
Author

nkyriazis commented Apr 13, 2020

I've confirmed the same behavior on a different system that also has a Titan V. There's some consistency there.

@nkyriazis
Author

I'm now at 3/3 different Titan V machines exhibiting the same behavior.

@nkyriazis
Author

nkyriazis commented Apr 21, 2020

I've had no luck stepping in on Ubuntu. I kept getting CUDA errors for things that would otherwise be error-free :(

On Windows, stepping in revealed a point where including the normals leads to a hang at the 4th iteration (not including the normals completes successfully). I'm attaching the call stack. I'm not sure why the GPU might not complete its computation.

Any hints on where to look more closely are more than welcome.

[attached screenshot: call stack at the hang]

@BachiLi
Owner

BachiLi commented Apr 21, 2020

Does the same error happen when you remove the distortion parameters? What if you switch the object to the teapot?

@nkyriazis
Author

Distortion params: the error precedes the addition of the distortion params, and it persisted across newer versions.

Teapot: I haven't tried the teapot. The problem goes away when the geometry is "detached" (with torch.no_grad()), so it seems to be an issue with backprop and Shape. Of course this is one problematic case among many working examples where the geometry is parameterized (e.g. optimizing object pose).
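
For concreteness, this is roughly what I mean by "detached", as a minimal sketch (build_vertices is a toy placeholder standing in for the MANO pipeline, and the pyredner.Shape call follows the tutorials):

import torch
import pyredner

device = pyredner.get_device()

# Toy stand-ins for the MANO pipeline: optimization variables and a
# differentiable function that turns them into mesh vertices.
params = torch.zeros(10, device=device, requires_grad=True)
base_vertices = torch.randn(1000, 3, device=device)
indices = torch.randint(0, 1000, (2000, 3), device=device, dtype=torch.int32)

def build_vertices(p):
    # Placeholder deformation; the real code is the full MANO model.
    return base_vertices * (1.0 + p.mean())

# Parameterized geometry: gradients flow into the Shape, and backward
# crashes/hangs for me.
vertices = build_vertices(params)

# "Detached" geometry: no gradients through the Shape, and everything passes.
# with torch.no_grad():
#     vertices = build_vertices(params)

shape = pyredner.Shape(vertices=vertices.contiguous(),
                       indices=indices.contiguous(),
                       material_id=0)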

So far it seems that it is the combination of MANO and redner that yields the problem. I've had no issues working extensively with MANO, redner aside. Also, MANO is pure Python. My best guess so far is that it is a corner case that challenges the assumptions redner makes about its input.

Other candidates for the error:

  • PyTorch
  • Thrust
  • CUB
  • MANO

@BachiLi
Owner

BachiLi commented Apr 21, 2020

redner doesn't check the indexing for the shape, so it could be that.
Other than that, it's hard for me to imagine what could go wrong.

@BachiLi
Owner

BachiLi commented Apr 21, 2020

maybe also try to turn off edge sampling by setting use_primary_edge_sampling and use_secondary_edge_sampling to False?
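
For reference, a minimal sketch of where those flags go, assuming the usual serialize_scene call from the tutorials (scene here stands for the pyredner.Scene you already build in your script):

import pyredner

# scene is assumed to be the already constructed pyredner.Scene.
scene_args = pyredner.RenderFunction.serialize_scene(
    scene=scene,
    num_samples=4,
    max_bounces=1,
    use_primary_edge_sampling=False,
    use_secondary_edge_sampling=False)

render = pyredner.RenderFunction.apply
img = render(0, *scene_args)   # first argument is the random seed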

@nkyriazis
Author

I'm not sure what checking the indexing means, but the faces are all within bounds (both for geometry and uvs). Are any other kinds of geometric abnormalities capable of breaking redner?

I did try with edge sampling off. It takes both stages disabled for it to go through! I'm trying again to see whether it's repeatable.

@nkyriazis
Author

nkyriazis commented Apr 21, 2020

It worked a 2nd time! This is proof in my book :)

@BachiLi
Owner

BachiLi commented Apr 21, 2020

It is possible that something went wrong in the silhouette detection inside edge sampling. Have you tried all four combinations? (both on, both off, only primary on, only secondary on)

@nkyriazis
Author

I brought down the spp and bounces to 1 and 1, to speed it up a bit:

primary   secondary   result
True      True        hang
False     False       pass
False     True        crash
True      False       pass
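
For reference, I swept the combinations with roughly the following sketch (scene and target stand for the objects from the shared script; the hanging configuration of course has to be killed manually, and the serialize_scene flags are the ones discussed above):

import itertools
import torch
import pyredner

render = pyredner.RenderFunction.apply

# scene and target are assumed to exist already (the ones from the repo).
for primary, secondary in itertools.product([True, False], repeat=2):
    print(f'primary={primary}, secondary={secondary}')
    scene_args = pyredner.RenderFunction.serialize_scene(
        scene=scene,
        num_samples=1,   # spp brought down to 1
        max_bounces=1,   # bounces brought down to 1
        use_primary_edge_sampling=primary,
        use_secondary_edge_sampling=secondary)
    img = render(0, *scene_args)
    loss = (img - target).pow(2).sum()
    loss.backward()                  # the crash/hang shows up during backward
    torch.cuda.synchronize()
    print('  pass')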

@BachiLi
Owner

BachiLi commented Apr 21, 2020

Cool! Most likely something is wrong with the silhouette tree. Either the creation (something in edge_tree.cpp) or the traversal (something in secondary_edge_sampler in edge.cpp) has buffer overrun issues (or both). The inclusion of vertex normals makes the difference because is_silhouette in edge.h now returns different results.

One more test that would be helpful but slightly more involved: go to this line

if (scene.use_secondary_edge_sampling) {
and make the if condition always true. This forces redner to always construct the silhouette tree even if secondary edge sampling is turned off. Test with use_primary_edge_sampling=False and use_secondary_edge_sampling=False. If it still crashes/hangs, most likely something is wrong with the tree construction in edge_tree.cpp. Otherwise, it's probably the traversal.

For your use case it's also possible that turning off secondary edge sampling would suffice for your inverse problem, so if you are in a hurry that might be a solution. For me it would be helpful if we can find out what's wrong together ; )

@BachiLi
Owner

BachiLi commented Apr 21, 2020

Another quick thing to try is to change this line to increase the traversal buffer size:

constexpr auto buffer_size = 128;

(maybe make it 256 or something)
I don't think this will fix it, but it's worth trying since it's an easy change.

@nkyriazis
Author

Making the condition always true did pass. Setting secondary back to True hangs again. Increasing the buffer size did not change that. So it seems it's the traversal.

@nkyriazis
Author

I'm seeing there's a while loop inside device code, which might explain the GPU blocking at 100%. Since there's a binary search, may I suggest separating the binary search algorithm from the implementation itself, so it can be tested separately, and perhaps using thrust::lower_bound for the former?

@nkyriazis
Author

How relevant is the CPU implementation? I ran it in CPU-only mode and it went through.
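
By CPU-only mode I mean switching redner to the CPU before building the scene, roughly:

import pyredner

# Force redner onto the CPU; the scene tensors should then also live on the
# CPU (pyredner.get_device() should return the CPU device after this call).
pyredner.set_use_gpu(False)
device = pyredner.get_device()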

@BachiLi
Owner

BachiLi commented Apr 22, 2020

Re the CPU impl: I have no idea. If there is a buffer overrun/infinite loop, the CPU version should also crash.
Re the binary search: It's a tree traversal and is not easily replaceable by thrust::lower_bound. It is possible that the construction phase built an invalid tree that causes the indexing of the tree nodes to go out of bounds.

@nkyriazis
Author

nkyriazis commented Apr 22, 2020

I went back to the RTX 2080 machine to run the full pipeline with secondary=False, but I'm still getting a crash :(

I'm trying with edge sampling completely off (it takes a while).

@nkyriazis
Author

W.r.t. the sample and the Titan V: if you add the data['mano'] parameters to the optimization you get a very early crash, which might help figure out what the issue is.

@markdjwilliams

I suspect I'm seeing a similar issue. The failure mode was a thrust::system_error but setting use_secondary_edge_sampling=False alone seems to address it. I have a single object (of about 50,000 triangles) whose deformation is being predicted by a NN. Everything was working without error for quite some time until I extended the existing channels (alpha, uv, triangle_id, barycentric) to include radiance while adding a single area light.

@CatoGit

CatoGit commented Mar 25, 2021

Are there any updates on this? I am getting the same error with 0.4.28. Only turning off both primary and secondary edge sampling lets it pass for me. With only use_secondary_edge_sampling set to False it still crashes.
