mujoco.FatalError: Offscreen framebuffer is not complete, error 0x8cdd #370
Comments
Can you please try adding
Alternatively, try setting the environment variable
Thanks so much @saran-t. Unfortunately, none of these helped. The only other hint I have is that the error usually appears later in the job, roughly after some 100 million environment steps, and it seems to appear in all actors at roughly the same time in a given job. (It's an RL setup with parallel actors, very similar to the one in section 3.4 here, for example.)
Hm, I'm tempted to blame this on an EGL driver bug at this point...
Hi, are there any updates on this? I'm having the same problem, and some of my processes crash after tens of millions of environment steps. Thanks in advance!
No updates on my end. I'm still having this problem.
@vaxenburg Are you using Docker? What GPUs are you using?
Thanks for responding @kevinzakka! I'm not using Docker, just a conda environment. I think the error occurs on every GPU I've tried so far: A100 SXM4, RTX 2080Ti, T4 PCIe.
OK, good to know. I'm also seeing this on our cluster, and it's killing all my jobs roughly every 1.4M steps.
If someone could try this on an Intel or AMD GPU too, that would be really useful in determining whether this is an Nvidia driver issue...
I think it would be good to first try to reproduce this with a few lines of code. I've only run into this issue in my custom composer env. It would be good to see if it happens in one of the suite envs.
Is there a task in the suite that uses vision (i.e., requires rendering)?
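(For context: suite tasks are state-based by default, but offscreen rendering can be exercised on any of them via the pixels wrapper. A minimal sketch, assuming the standard dm_control.suite and dm_control.suite.wrappers.pixels APIs; the frame size and camera id are arbitrary choices.)

```python
# Sketch: add pixel observations to a standard suite task so that every
# environment step triggers an offscreen render (the code path that fails here).
# Frame size and camera id below are arbitrary.
from dm_control import suite
from dm_control.suite.wrappers import pixels

env = suite.load("cartpole", "swingup")
# pixels_only=False keeps the state observations alongside the rendered frames.
env = pixels.Wrapper(
    env,
    pixels_only=False,
    render_kwargs={"height": 84, "width": 84, "camera_id": 0})

time_step = env.reset()
print(time_step.observation["pixels"].shape)  # e.g. (84, 84, 3)
```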
Are you using a specific Nvidia driver version in all those experiments?
This one, on all machines:
I also ran into this problem consistently on all my machines, with different Nvidia cards, but all of them have driver version 520.56.06.
Update: I downgraded to CUDA 11.7 and the error went away.
I'm pretty sure I was also getting this error with CUDA 11.4 before (but I'm not sure about the driver version).
nvidia-smi -> I basically created a new environment, installed CUDA with conda, and the errors seem to have gone away.
FWIW, if my memory is correct, with the older CUDA 11.4 the error was occurring for me after some ~1.5M steps per actor. With the new CUDA 12.0 it seems to occur about 10x earlier, after ~0.15M steps. So maybe downgrading CUDA just postpones the error. Did you try running a job much longer, beyond the point where it was crashing with your previous CUDA version?
I did indeed witness that last week, but so far I've been running jobs of 5 to 10M steps and nothing has been crashing. This is indeed weird, and I can try changing the driver again to see if that triggers the issue.
It's interesting that the CUDA version seems to affect a rendering-related error. One thing we could try is to run a fake job that only does rendering, without using CUDA at all.
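One possible shape for such a CUDA-free, rendering-only stress job, as a sketch only (the step count and frame size are arbitrary, and it assumes the EGL backend is selected via MUJOCO_GL, as in the failing jobs):

```python
# Sketch of a "fake job" that only renders offscreen frames, with no CUDA,
# learner, or actor logic involved. Step count and frame size are placeholders.
import os
os.environ.setdefault("MUJOCO_GL", "egl")  # EGL backend, as in the failing jobs

import numpy as np
from dm_control import suite

env = suite.load("cartpole", "swingup")
action = np.zeros(env.action_spec().shape)

time_step = env.reset()
for step in range(10_000_000):  # long enough to reach the regime where jobs crash
    # Offscreen render on every step; this is the call that eventually raises
    # "Offscreen framebuffer is not complete, error 0x8cdd" in the failing jobs.
    _ = env.physics.render(height=64, width=64, camera_id=0)
    time_step = env.step(action)
    if time_step.last():
        time_step = env.reset()
```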
For whatever reason, this consistently reproduces the error for me:

```python
from multiprocessing import Process

def foo():
    env = get_environment_with_camera_observable()

p = Process(target=foo)
p.start()
```

where `get_environment_with_camera_observable` builds a composer environment with a camera observable enabled (the full definition is in the self-contained snippet below).
Oooh, consistent reproducibility! @saran-t, @nimrod-gileadi, maybe with this we can finally diagnose this bug??
Here is a complete self-contained snippet that, for me, consistently errors with mujoco.FatalError: Offscreen framebuffer is not complete, error 0x8cdd:

```python
from multiprocessing import Process

from dm_control import composer
from dm_control.locomotion.walkers.cmu_humanoid import CMUHumanoid
from dm_control.locomotion.arenas import floors


class Task(composer.Task):

    def __init__(self):
        self._arena = floors.Floor()
        self._walker = CMUHumanoid(
            observable_options={'egocentric_camera': {'enabled': True}})
        spawn_site = self._arena.mjcf_model.worldbody.add('site')
        self._walker.create_root_joints(
            self._arena.attach(self._walker, spawn_site))
        spawn_site.remove()

    @property
    def root_entity(self):
        return self._arena

    def get_reward(self, physics):
        return 1.


_ = composer.Environment(task=Task())  # Won't error without this line?!?


def foo():
    env = composer.Environment(task=Task())


p = Process(target=foo)
p.start()  # Errors here.
```
Any updates on this issue? I'm running into it when using multiple parallel environments (~1000) in RL training. I tried setting DISABLE_RENDER_THREAD_OFFLOADING=1, but it made no difference. Any suggestions on how to fix this? Thanks!
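One thing worth checking: rendering-related environment variables generally need to be visible before dm_control is imported. A minimal sketch of setting them from Python rather than the shell (assumption: DISABLE_RENDER_THREAD_OFFLOADING is read when dm_control's rendering modules are first imported, so setting it later has no effect):

```python
# Sketch: set the rendering-related environment variables before importing
# dm_control. Assumption: DISABLE_RENDER_THREAD_OFFLOADING is read at import
# time by dm_control's rendering machinery.
import os
os.environ["DISABLE_RENDER_THREAD_OFFLOADING"] = "1"
os.environ.setdefault("MUJOCO_GL", "egl")  # EGL backend, as used in this thread

from dm_control import suite  # imported only after the variables are set

env = suite.load("cartpole", "swingup")
_ = env.physics.render(height=64, width=64, camera_id=0)  # smoke-test a render
```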
Thank you @vaxenburg for sharing the script to reproduce the problem. In this particular case, I suspect that the first call, `_ = composer.Environment(task=Task())`, initializes some EGL resources, while `p = Process(target=foo)` inherits those resources (which it shouldn't) as a subprocess, because by default Python uses the fork start method on Linux. I was able to get rid of the issue in your particular case by switching the start method of multiprocessing to spawn:

```python
import multiprocessing
from multiprocessing import Process

from dm_control import composer
from dm_control.locomotion.walkers.cmu_humanoid import CMUHumanoid
from dm_control.locomotion.arenas import floors


class Task(composer.Task):

    def __init__(self):
        self._arena = floors.Floor()
        self._walker = CMUHumanoid(
            observable_options={'egocentric_camera': {'enabled': True}})
        spawn_site = self._arena.mjcf_model.worldbody.add('site')
        self._walker.create_root_joints(
            self._arena.attach(self._walker, spawn_site))
        spawn_site.remove()

    @property
    def root_entity(self):
        return self._arena

    def get_reward(self, physics):
        return 1.


def foo():
    env = composer.Environment(task=Task())


if __name__ == "__main__":
    multiprocessing.set_start_method("spawn")
    _ = composer.Environment(task=Task())  # Won't error without this line?!?
    p = Process(target=foo)
    p.start()
```
Cool, it works for me!!! I am parallelizing environments, and after passing the parameter the error is gone. I would greatly appreciate it if an expert could explain the mechanism and side effects behind this.
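Regarding the mechanism: with the default fork start method on Linux, the child process inherits a copy of the parent's memory, including handles to EGL/GL driver state that are only valid in the process that created them, whereas spawn starts a fresh interpreter that initializes its own rendering context. If changing the global start method is inconvenient, a per-use spawn context is an alternative; a minimal sketch (the worker name is a hypothetical stand-in for whatever constructs and steps the environment):

```python
# Sketch: use a dedicated "spawn" context instead of changing the global start
# method. multiprocessing.get_context is standard-library API; render_worker is
# a hypothetical stand-in for whatever builds and steps the environment.
import multiprocessing as mp


def render_worker():
    # Build the environment inside the spawned process so that no EGL/GL state
    # is inherited from the parent process.
    from dm_control import suite
    env = suite.load("cartpole", "swingup")
    env.reset()
    _ = env.physics.render(height=64, width=64, camera_id=0)


if __name__ == "__main__":
    ctx = mp.get_context("spawn")  # leaves the global start method untouched
    p = ctx.Process(target=render_worker)
    p.start()
    p.join()
```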
FWIW, I was running into this error in a custom composer environment that did not use any cameras but did use a terrain height map.
Is there any update on this? I get the same error when running DreamerV3 on dm_control. I can provide a stack trace if needed.
I'm getting this error while using the EGL rendering backend:
mujoco.FatalError: Offscreen framebuffer is not complete, error 0x8cdd
The error is rare (but still fatal for a long job), so I'm not sure I can provide a script to reproduce it. I believe this might be a known issue though? Thanks so much!
The full traceback: