Mesh/material memory leak #208

AndrejOrsula · 2021-04-03T17:47:32Z

Maybe related to gazebosim/gz-rendering#39

After investigating further, this issue is better suited to https://github.com/ignitionrobotics/ign-rendering, as it occurs for both the GUI and camera sensors. Please move the issue there if possible.

Environment

OS Version: Ubuntu 20.04
Ignition - built from source
- Both Dome and Edifice experience this issue
Ogre Version:
- Both ogre and ogre2 (2.1) experience this issue

Description

Expected behavior: Various models can be added and removed an arbitrary number of times without running out of memory.
Actual behavior: The mesh/material (texture) is not freed after a model is removed. This affects both the GUI and camera sensors. For the GUI, reinserting the same model repeatedly introduces some extra memory usage, but of a much smaller magnitude (negligible in the grand scheme of things).

I should note that this behaviour might be advantageous for headless simulation if an environment repeatedly utilises a limited number of models, as reinsertion of a model is much faster if its resources are already loaded into memory. The bug/feature duality is real with this one. Therefore, having an option to allow both behaviours might be preferable, i.e. option A: unload all resources after removing a model; option B: keep the resources (with some policy that makes sure the system does not run out of memory or exceed some threshold).

Steps to reproduce

1. ign gazebo -s
2. ign gazebo -g
3. Open a RAM usage monitor of your choice. Optionally, monitor VRAM usage as well.
4. Run the following example script, ign_gui_memory_leak_reproducibility_script.bash (gist)
- This script iteratively inserts and removes models with mesh geometry and material texture using service calls
- Note: The script downloads models from Fuel (~650MB in total)
- Note: ~4GB of RAM will be leaked with the 50 models used
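
To put numbers on the RAM monitoring step above, here is a minimal sketch that samples the resident set size (RSS) of a process while the script runs. It assumes Linux with /proc available; rss_kib and log_rss are hypothetical helper names and are not part of the original script.

```bash
#!/usr/bin/env bash
# Print the resident set size (RSS) of a process in KiB.
# Fails (non-zero exit) if the PID does not exist or reports no VmRSS.
# Linux-only: reads the VmRSS field from /proc/<pid>/status.
rss_kib() {
    local pid="$1"
    awk '/^VmRSS:/ { print $2; found = 1 } END { exit !found }' \
        "/proc/$pid/status" 2>/dev/null
}

# Print "<elapsed seconds> <RSS KiB>" for a PID every <interval> seconds,
# until the process exits or <max_samples> samples were taken (0 = no limit).
log_rss() {
    local pid="$1" interval="${2:-1}" max_samples="${3:-0}"
    local start count=0 rss
    start=$(date +%s)
    while rss=$(rss_kib "$pid"); do
        echo "$(( $(date +%s) - start )) $rss"
        count=$((count + 1))
        if [ "$max_samples" -gt 0 ] && [ "$count" -ge "$max_samples" ]; then
            break
        fi
        sleep "$interval"
    done
}
```

For example, `log_rss <pid> 1 > rss.log` in a separate terminal, where `<pid>` is the PID of the `ign gazebo -g` (or `-s`) process as shown by your system monitor, records one sample per second while the models are inserted and removed.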

ign_gui_memory_leak_reproducibility_script.bash

#!/usr/bin/env bash

## Get list of first <n_models> models from <owner> (using Fuel)
owner="googleresearch"
n_models=50 # 50 corresponds to ~650MB disk usage and ~4GB RAM (leaked) usage, use fewer if your system does not allow it
echo "Getting list of '$n_models' models owned by '$owner'. This might take a few minutes during the first execution... Please be patient :)"
model_names=$(ign fuel list -o "$owner" -t model -r | head -n "$n_models" | sed -n -e 's/^.*models\///p')

## Download models if needed (done separately because "/world/default/create" service might time out)
for model_name in $model_names; do
    if [[ ! -d "$HOME/.ignition/fuel/fuel.ignitionrobotics.org/$owner/models/$model_name" ]]; then
        model_uri="https://fuel.ignitionrobotics.org/1.0/$owner/models/$model_name"
        echo "Downloading model '$model_uri'"
        ign fuel download -t model -u "$model_uri" &
    fi
done
for job in $(jobs -p); do
    wait "$job"
    echo "Model downloaded"
done

world_name="default"
# world_name="camera_sensor"
for model_name in $model_names; do
    ## Spawn model
    echo "Spawning model $model_name"
    model_uri="https://fuel.ignitionrobotics.org/1.0/$owner/models/$model_name"
    ign service -s "/world/$world_name/create" --timeout 5000 \
        --reqtype ignition.msgs.EntityFactory \
        --reptype ignition.msgs.Boolean \
        -r "sdf_filename: \"$model_uri\" name: \"$model_name\"" 1>/dev/null
    sleep 0.5
    ## Remove model
    echo "Removing model $model_name"
    ign service -s "/world/$world_name/remove" --timeout 5000 \
        --reqtype ignition.msgs.Entity \
        --reptype ignition.msgs.Boolean \
        -r "type: 2 name: \"$model_name\"" 1>/dev/null
    echo ""
done

Service calls are used here to ease reproducibility. This issue also occurs when using the C++ API directly (I originally experienced the issue while using gym-ignition).

Output

Below is a video of the steps above being performed. Notice also that the aligned bounding boxes of objects remain visible if an object was removed while selected (and they cannot be removed). Speculation: This might be the small, negligible amount of memory that accumulates on model reinsertion for the GUI (or part of it).

simplescreenrecorder-2021-04-03_19.32.30.mp4

I tried to investigate the issue with Heaptrack; however, only a fraction of the leaked memory gets logged (as far as I can see). Peak resident memory (RSS) matches the total RAM usage (4.8GB), but I was not able to figure out what the largest contributor is. The mesh/texture data is not logged. I am not sure whether this is caused by having Ruby in the loop or by the rendering engine being loaded as a plugin.

Log file: heaptrack.ruby.3172402.gz

Summary:

Consumption (each spike is insertion of a new model):

Overlapping collision geometries

While making the reproducibility example, I also noticed that if I resume the simulation after all the model insertions/deletions, the server freezes completely and outputs a bunch of ODE collision-related messages. I have seen these before when two or more models have largely overlapping collision geometry. Therefore, it seems the collision geometry is not immediately removed from the server when the simulation is paused. Is this a design choice or a bug?

ODE Message 2: Trimesh-trimesh contach hash table bucket overflow - close contacts might not be culled in AddContactToNode() [collision_trimesh_trimesh.cpp:224]

AndrejOrsula · 2021-04-04T10:41:15Z

There is actually a memory leak inside the server too, but on a much smaller scale. Using the same procedure, here is a Heaptrack log of ign gazebo -sr.

Log file: heaptrack.ruby.2904453.gz

Summary:

Consumption:

AndrejOrsula · 2021-04-04T10:57:13Z

Investigating further, the full issue also occurs for camera sensors, totalling ~5GB of leaked memory (headless) and adding up to ~10GB leaked for 50 models if the GUI is also open. Therefore, the issue might be inside ign-rendering after all. The Heaptrack log for ign gazebo camera_sensor.sdf -sr is below (sorry for the spam).

Log file:
heaptrack.ruby.2949814.gz

Summary:

I updated the issue description to reflect this discovery.

ahcorde · 2021-05-13T11:07:00Z

I have been digging a little bit more into this problem and I think I found the source.

There are two places where we handle meshes:

ign-common, with the MeshManager, which keeps the meshes, loads them, and lets us fetch them.
ign-rendering, which uses the MeshManager from ign-common to fetch the mesh and convert it for the right render engine (Ogre, OptiX, etc.).

The first thing to take into account is that the MeshManager class has no method to remove meshes, which makes it impossible to free part of this memory.

I made a simple C++ example that loads and destroys some meshes using only the MeshManager (I added a method to remove them; I will create the PR soon). As you can see in the following image, we are able to free the memory properly.

The problem is inside ign-rendering: when we destroy the mesh using this new method, there is no call to destroy the Ogre material, which in this particular case contains a (quite big) texture image. This makes the memory grow, and we are not cleaning it up.

In the following image we can see how we create the meshes and, at the end, I try to destroy them, but we are only able to free the memory in the MeshManager, not in ign-rendering.

TODO

In the MeshManager class there is no method to remove any mesh: "Added method to remove meshes from the MeshManager" (gz-common#222)
Remove the OgreMaterial
- ign-rendering: "Destroy material when a mesh is deleted" (gz-rendering#324)
- ign-gazebo: "Remove meshes from Gazebo" (gz-sim#824)

FYI @iche033

ahcorde · 2021-05-14T21:30:26Z

Some more details:

We have some duplicated materials. When we create the mesh, we load the material into memory, in particular in the method bool OgreMeshFactory::LoadImpl(const MeshDescriptor &_desc). There we create a "general" material, which is passed to the submeshes; I think this is because we can have several submeshes with the same material but different properties.

Then, when we try to remove a mesh that potentially has some submeshes with textures, we remove the texture associated with each submesh, but we do not remove this general material, which stays in memory until we remove the engine with the Destroy method.

ahcorde · 2021-05-18T20:04:05Z

I added this PR in ign-gazebo, gazebosim/gz-sim#824, which allows removing the mesh when an entity is removed or the UserCommand remove is called.

The following image shows how the memory is freed

NOTE: This is using a world without any sensors, which means the materials are not loaded and there is no leak.

ahcorde · 2021-05-18T20:24:11Z

I opened this other draft PR: gazebosim/gz-rendering#324

With this one we should be able to remove the material from memory, but there is still a memory leak that I'm not able to identify.

Server

Client

NOTE As you can see in both cases there is a small memory leak.

ahcorde · 2021-05-20T07:43:04Z

The small memory leak is solved:

Server

Client

According to an offline discussion with @iche033, these changes may affect the performance in some special cases such as:

When repeatedly spawning large models in ignition
A world with hundreds of the same model

diegoferigo · 2021-05-20T07:47:26Z

Awesome catch @AndrejOrsula! I experienced the same behavior in a similar setting in the past, but without debugging it this much. I ended up with an implementation where the simulator is completely destroyed and re-created every time. This of course introduces a non-negligible computational overhead, but in my experience:

It helps with contacts. I never debugged this in detail, but large-scale simulations have different results in the two cases (could it be related to ODE singletons? Maybe switching to Bullet's contact detection would help here?):
1. Unique simulator instance with many model removal-insertion
2. Many simulator instances created / destroyed when the simulation ends, each of them with a single model removal-insertion
Plugins cannot be unloaded (tracked in "Unload model plugins when the model is removed from the world", gz-sim#113). Therefore, with many plugins, this might generate another memory leak.

@ahcorde really great work! Just to know, do the models you tested have plugins?

ahcorde · 2021-05-20T07:50:58Z

@diegoferigo,

I'm using the models from the script that @AndrejOrsula included in the issue (googleresearch). I don't think these models include any kind of plugins.

diegoferigo · 2021-05-20T07:56:34Z

Ok thanks for the clarification, I found the script only now. Again, nicely done 🎉 Very likely meshes have a much greater impact on memory than plugins. This fix made my day :)

AndrejOrsula · 2021-05-20T09:50:39Z

Thank you for investigating and mitigating this issue @ahcorde!

> According to an offline discussion with @iche033, these changes may affect the performance in some special cases such as:
>
> When repeatedly spawning large models in ignition
>
> A world with hundreds of the same model

This is very true. From my experience, it is currently much faster to spawn a model with mesh geometry and an image texture if its assets are already loaded into memory. I don't have specific numbers, but it is especially noticeable for high-res textures (even if stored on an SSD).

This behaviour is definitely beneficial if a limited number of diverse models is used. Currently, the only problem that really occurs is when a system runs out of RAM.

I forgot to mention in the original description that the utilised VRAM is kept at a steady size just a few MB below the available size (at least for CUDA). I assume it is due to some form of smart buffer pool management in OpenGL/OGRE? Having a similar solution for RAM would keep the advantage of faster re-insertion of models while addressing the lack of memory. I have no idea about the feasibility of such an implementation, but having a maximum memory limit for assets (especially textures) would definitely be nice, with some policy that frees old and unused data (and the limit defaulting to system memory).
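
To make the suggested policy concrete, here is a toy simulation of a least-recently-used (LRU) asset cache with a memory budget, written in Bash only to match the reproduction script. It is a sketch under assumptions, not how ign-rendering or OGRE manage memory: cache_touch, asset_size, lru_order and the CACHE_LIMIT_MB variable are all hypothetical names, sizes are whole MB, and a real implementation would live in C++ next to the mesh/material managers.

```bash
#!/usr/bin/env bash
# Toy LRU asset cache with a memory budget (requires Bash 4+ for declare -A).
# It only simulates which assets a budget policy would free.
declare -A asset_size=()                 # asset name -> size in MB
lru_order=()                             # asset names, least recently used first
cache_used=0                             # total size of cached assets in MB
cache_limit=${CACHE_LIMIT_MB:-1024}      # hypothetical budget in MB

# Mark an asset as used: load it if it is not cached yet, then evict the
# least recently used assets until the cache fits into the budget again.
cache_touch() {
    local name="$1" size="$2" kept=() entry victim
    # Move the asset to the most-recently-used end of the order list.
    for entry in "${lru_order[@]}"; do
        if [[ "$entry" != "$name" ]]; then
            kept+=("$entry")
        fi
    done
    lru_order=("${kept[@]}" "$name")
    # Account for the asset only if it was not cached before.
    if [[ -z "${asset_size[$name]+set}" ]]; then
        asset_size[$name]=$size
        cache_used=$((cache_used + size))
    fi
    # Evict old assets, but never the one that was just touched.
    while ((cache_used > cache_limit && ${#lru_order[@]} > 1)); do
        victim="${lru_order[0]}"
        cache_used=$((cache_used - ${asset_size[$victim]}))
        unset "asset_size[$victim]"
        lru_order=("${lru_order[@]:1}")
        echo "evict $victim"
    done
}
```

With `cache_limit=100`, touching assets `a`, `b`, `a` and then `c` (40 MB each) evicts `b`: it is the least recently used asset when the budget is exceeded, while the recently reused `a` stays cached and remains fast to re-insert.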

ntfshard · 2024-06-11T11:33:45Z

Hello everyone
I also faced a memory leak problem, and after several experiments I noticed that memory leaks by a small amount at a fairly consistent rate even when I'm running an empty simulation. (I know that this thread is quite old and some things have already changed.)
Using valgrind and a very raw XML report viewer, I quite frequently saw traces that end inside the QtQml library:

{"obj":"/usr/local/libexec/valgrind/vgpreload_memcheck-amd64-linux.so","fn":"operator new[](unsigned long)","dir":"/root/valgrind-3.23.0/coregrind/m_replacemalloc","file":"vg_replace_malloc.c","line":729}
{"obj":"/root/qtbase/lib/libQt5Qml.so.5.15.14","fn":"QQmlContextData::initFromTypeCompilationUnit(QQmlRefPointer<QV4::ExecutableCompilationUnit> const&, int)"}
{"obj":"/root/qtbase/lib/libQt5Qml.so.5.15.14","fn":"QQmlObjectCreator::create(int, QObject*, QQmlInstantiationInterrupt*, int)"}
{"obj":"/root/qtbase/lib/libQt5Qml.so.5.15.14","fn":"QQmlComponentPrivate::beginCreate(QQmlContextData*)"}
...

(I also saw some other problems, not related to Qt, but they seem to be constant memory leaks, not increasing ones)
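
One way to tell a steadily growing leak from a constant one is to fit a slope to memory samples taken over time. The sketch below assumes samples as "<seconds> <KiB>" lines on stdin (for example, RSS values logged once per second); growth_kib_per_s is a hypothetical helper name, and the fit is an ordinary least-squares line.

```bash
#!/usr/bin/env bash
# Estimate memory growth in KiB/s from "<seconds> <KiB>" samples on stdin,
# using an ordinary least-squares fit of memory usage against time.
# A slope near 0 suggests a constant footprint; a clearly positive slope
# suggests memory that keeps growing.
growth_kib_per_s() {
    awk '
        NF >= 2 { n++; sx += $1; sy += $2; sxx += $1 * $1; sxy += $1 * $2 }
        END {
            denom = n * sxx - sx * sx
            if (n < 2 || denom == 0) {
                print "need at least two samples at distinct times" > "/dev/stderr"
                exit 1
            }
            printf "%.2f\n", (n * sxy - sx * sy) / denom
        }
    '
}
```

For instance, the samples `0 1000`, `10 1100`, `20 1200` give a slope of 10.00 KiB/s, while three identical readings give 0.00.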

On the Qt bug tracker there is a bug, https://bugreports.qt.io/browse/QTBUG-119301, which references initFromTypeCompilationUnit() in a way quite similar to my observation (at least IMHO).

The QQmlObjectCreator class seems quite complex, and I can't find the place where the context data member should be destroyed (AFAIU the context->destroy() method should be called), but maybe it is shared with something else.
(Qt 5.15.14, built from source; Ubuntu 22 has 5.15.3 I guess, with a similar result AFAIR)

I'm not sure how exactly this observation can help, but I decided to share it just in case. Tested on the gz-gui7_7.2.2 tag.
Please take it with a grain of salt C:

AndrejOrsula added the bug Something isn't working label Apr 3, 2021

AndrejOrsula mentioned this issue Apr 3, 2021

Mesh/texture (material) memory leak AndrejOrsula/drl_grasping#20

Closed

2 tasks

AndrejOrsula mentioned this issue Apr 4, 2021

[Ogre1.x] Mesh material memory leak gazebosim/gz-rendering#39

Closed

chapulina added performance Runtime performance bug Something isn't working and removed bug Something isn't working labels Apr 5, 2021

chapulina self-assigned this Apr 5, 2021

AndrejOrsula changed the title ~~GUI mesh/material memory leak~~ Mesh/material memory leak Apr 6, 2021

ahcorde mentioned this issue May 14, 2021

Added method to remove meshes from the MeshManager gazebosim/gz-common#222

Merged

8 tasks

ahcorde mentioned this issue May 18, 2021

Remove meshes from Gazebo gazebosim/gz-sim#824

Draft

8 tasks

ahcorde mentioned this issue May 18, 2021

Destroy material when a mesh is deleted gazebosim/gz-rendering#324

Merged

8 tasks

traversaro mentioned this issue May 20, 2021

Some Collada Materials not correctly loaded when using Ogre >= 1.10 gazebosim/gazebo-classic#3002

Open

AndrejOrsula mentioned this issue Aug 31, 2021

ex_enjoy_pretrained_agent.bash hanging in fresh Docker image AndrejOrsula/drl_grasping#90

Closed

chapulina removed their assignment Nov 15, 2021

chapulina added the help wanted Extra attention is needed label Nov 15, 2021

andrefdre mentioned this issue Mar 26, 2023

Real time factor in Gazebo decreases to low values (0.05) during dataset acquisition DanielCoelho112/synfeal#96

Closed
