mujoco.FatalError: Offscreen framebuffer is not complete, error 0x8cdd #370
Comments
Can you please try adding
Alternatively, try setting the environment variable
Thanks so much @saran-t. Unfortunately, none of these helped. The only other hint I have is that the error usually appears later in the job, roughly after some 100 million environment steps, and it seems to appear in all actors at roughly the same time in a given job. (It's an RL setup with parallel actors, very similar to the one in section 3.4 here, for example.)
Hm, I'm tempted to blame this on an EGL driver bug at this point...
Hi, are there any updates on this? I'm having the same problem, and some of my processes crash after tens of millions of environment steps. Thanks in advance!
No updates on my end. I'm still having this problem.
@vaxenburg Are you using Docker? What GPUs are you using?
Thanks for responding @kevinzakka! I'm not using Docker, just a conda environment. I think the error occurs on every GPU I've tried so far: A100 SXM4, RTX 2080Ti, T4 PCIe.
OK, good to know. I'm also seeing this on our cluster, and it's killing all my jobs roughly every 1.4M steps.
If someone could try this on an Intel or AMD GPU too, that would be really useful in determining whether this is an Nvidia driver issue...
I think it would be good to first try to reproduce this with a few lines of code. I've only run into this issue in my custom composer env. It would be good to see if it happens in one of the suite envs.
Is there a task in the suite that uses vision (i.e., requires rendering)?
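(For context: suite tasks are state-based by default, but offscreen rendering can be exercised on any of them via the pixels wrapper. A minimal sketch, assuming the standard dm_control.suite and dm_control.suite.wrappers.pixels APIs; the frame size and camera id are arbitrary choices.)

```python
# Sketch: add pixel observations to a standard suite task so that every
# environment step triggers an offscreen render (the code path that fails here).
# Frame size and camera id below are arbitrary.
from dm_control import suite
from dm_control.suite.wrappers import pixels

env = suite.load("cartpole", "swingup")
# pixels_only=False keeps the state observations alongside the rendered frames.
env = pixels.Wrapper(
    env,
    pixels_only=False,
    render_kwargs={"height": 84, "width": 84, "camera_id": 0})

time_step = env.reset()
print(time_step.observation["pixels"].shape)  # e.g. (84, 84, 3)
```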
Are you using a specific Nvidia driver version in all those experiments?
This one, on all machines:
I also ran into this problem consistently on all my machines, with different Nvidia cards, but all of them have driver version 520.56.06.
Update: I downgraded to CUDA 11.7 and the error went away.
I'm pretty sure I was also getting this error with CUDA 11.4 before (but I'm not sure about the driver version).
nvidia-smi -> I basically created a new environment, installed CUDA with conda, and the errors seem to have gone away.
FWIW, if my memory is correct, with the older CUDA 11.4 the error was occurring for me after some ~1.5M steps per actor. With the new CUDA 12.0 it seems to occur about 10x earlier, after ~0.15M steps. So maybe downgrading CUDA just postpones the error. Did you try running a job much longer, beyond the point where it was crashing with your previous CUDA version?
I did indeed witness that last week, but so far I've been running jobs of 5 to 10M steps and nothing has been crashing. This is indeed weird, and I can try changing the driver again to see if that triggers the issue.
It's interesting that the CUDA version seems to affect a rendering-related error. One thing we could try is to run a fake job that only does rendering, without using CUDA at all.
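One possible shape for such a CUDA-free, rendering-only stress job, as a sketch only (the step count and frame size are arbitrary, and it assumes the EGL backend is selected via MUJOCO_GL, as in the failing jobs):

```python
# Sketch of a "fake job" that only renders offscreen frames, with no CUDA,
# learner, or actor logic involved. Step count and frame size are placeholders.
import os
os.environ.setdefault("MUJOCO_GL", "egl")  # EGL backend, as in the failing jobs

import numpy as np
from dm_control import suite

env = suite.load("cartpole", "swingup")
action = np.zeros(env.action_spec().shape)

time_step = env.reset()
for step in range(10_000_000):  # long enough to reach the regime where jobs crash
    # Offscreen render on every step; this is the call that eventually raises
    # "Offscreen framebuffer is not complete, error 0x8cdd" in the failing jobs.
    _ = env.physics.render(height=64, width=64, camera_id=0)
    time_step = env.step(action)
    if time_step.last():
        time_step = env.reset()
```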
For whatever reason, this consistently reproduces the error for me:

```python
from multiprocessing import Process

def foo():
    env = get_environment_with_camera_observable()

p = Process(target=foo)
p.start()
```

where `get_environment_with_camera_observable` builds a composer environment with a camera observable enabled (the full definition is in the self-contained snippet below).
Oooh, consistent reproducibility! @saran-t, @nimrod-gileadi, maybe with this we can finally diagnose this bug??
Here is a complete self-contained snippet that, for me, consistently errors with mujoco.FatalError: Offscreen framebuffer is not complete, error 0x8cdd:

```python
from multiprocessing import Process

from dm_control import composer
from dm_control.locomotion.walkers.cmu_humanoid import CMUHumanoid
from dm_control.locomotion.arenas import floors


class Task(composer.Task):

    def __init__(self):
        self._arena = floors.Floor()
        self._walker = CMUHumanoid(
            observable_options={'egocentric_camera': {'enabled': True}})
        spawn_site = self._arena.mjcf_model.worldbody.add('site')
        self._walker.create_root_joints(
            self._arena.attach(self._walker, spawn_site))
        spawn_site.remove()

    @property
    def root_entity(self):
        return self._arena

    def get_reward(self, physics):
        return 1.


_ = composer.Environment(task=Task())  # Won't error without this line?!?


def foo():
    env = composer.Environment(task=Task())


p = Process(target=foo)
p.start()  # Errors here.
```
Any updates on this issue? I'm running into it when using multiple parallel environments (~1000) in RL training. I tried setting DISABLE_RENDER_THREAD_OFFLOADING=1, but it made no difference. Any suggestions on how to fix this? Thanks!
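One thing worth checking: rendering-related environment variables generally need to be visible before dm_control is imported. A minimal sketch of setting them from Python rather than the shell (assumption: DISABLE_RENDER_THREAD_OFFLOADING is read when dm_control's rendering modules are first imported, so setting it later has no effect):

```python
# Sketch: set the rendering-related environment variables before importing
# dm_control. Assumption: DISABLE_RENDER_THREAD_OFFLOADING is read at import
# time by dm_control's rendering machinery.
import os
os.environ["DISABLE_RENDER_THREAD_OFFLOADING"] = "1"
os.environ.setdefault("MUJOCO_GL", "egl")  # EGL backend, as used in this thread

from dm_control import suite  # imported only after the variables are set

env = suite.load("cartpole", "swingup")
_ = env.physics.render(height=64, width=64, camera_id=0)  # smoke-test a render
```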
Thank you @vaxenburg for sharing the script to reproduce the problem. In this particular case, I suspect that the first call, `_ = composer.Environment(task=Task())`, initializes some EGL resources, while `p = Process(target=foo)` inherits those resources (which it shouldn't) as a subprocess, because by default Python uses the fork start method on Linux. I was able to get rid of the issue in your particular case by switching the start method of multiprocessing to spawn:

```python
import multiprocessing
from multiprocessing import Process

from dm_control import composer
from dm_control.locomotion.walkers.cmu_humanoid import CMUHumanoid
from dm_control.locomotion.arenas import floors


class Task(composer.Task):

    def __init__(self):
        self._arena = floors.Floor()
        self._walker = CMUHumanoid(
            observable_options={'egocentric_camera': {'enabled': True}})
        spawn_site = self._arena.mjcf_model.worldbody.add('site')
        self._walker.create_root_joints(
            self._arena.attach(self._walker, spawn_site))
        spawn_site.remove()

    @property
    def root_entity(self):
        return self._arena

    def get_reward(self, physics):
        return 1.


def foo():
    env = composer.Environment(task=Task())


if __name__ == "__main__":
    multiprocessing.set_start_method("spawn")
    _ = composer.Environment(task=Task())  # Won't error without this line?!?
    p = Process(target=foo)
    p.start()
```
Cool, it works for me!!! I am parallelizing environments, and after passing the parameter the error is gone. I would greatly appreciate it if an expert could explain the mechanism and side effects behind this.
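Regarding the mechanism: with the default fork start method on Linux, the child process inherits a copy of the parent's memory, including handles to EGL/GL driver state that are only valid in the process that created them, whereas spawn starts a fresh interpreter that initializes its own rendering context. If changing the global start method is inconvenient, a per-use spawn context is an alternative; a minimal sketch (the worker name is a hypothetical stand-in for whatever constructs and steps the environment):

```python
# Sketch: use a dedicated "spawn" context instead of changing the global start
# method. multiprocessing.get_context is standard-library API; render_worker is
# a hypothetical stand-in for whatever builds and steps the environment.
import multiprocessing as mp


def render_worker():
    # Build the environment inside the spawned process so that no EGL/GL state
    # is inherited from the parent process.
    from dm_control import suite
    env = suite.load("cartpole", "swingup")
    env.reset()
    _ = env.physics.render(height=64, width=64, camera_id=0)


if __name__ == "__main__":
    ctx = mp.get_context("spawn")  # leaves the global start method untouched
    p = ctx.Process(target=render_worker)
    p.start()
    p.join()
```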
FWIW, I was running into this error in a custom composer environment that did not use any cameras but did use a terrain height map.
Is there any update on this? I get the same error when running DreamerV3 on dm_control. I can provide a stack trace if needed.
I'm getting this error while using the EGL rendering backend:
mujoco.FatalError: Offscreen framebuffer is not complete, error 0x8cdd
The error is rare (but still fatal for a long job), so I'm not sure I can provide a script to reproduce it. I believe this might be a known issue though? Thanks so much!
The full traceback: