UI Freezing Randomly - RHEL 9.4 #5877

Closed
A6i8 opened this issue May 30, 2024 · 9 comments

@A6i8

A6i8 commented May 30, 2024

Version: Gaffer 1.4.3.0-linux-gcc9
Third-party tools: Arnold
Third-party modules: None

Linux version: 5.14.0-427.16.1.el9_4.x86_64 (mockbuild@iad1-prod-build001.bld.equ.rockylinux.org) (gcc (GCC)
ldd (GNU libc) : 2.34

Description

The UI freezes randomly, e.g. while selecting a node, while changing the layout, or while selecting a Catalogue.

Nothing else is running in background.

I also enabled IECORE_LOG_LEVEL: "DEBUG", but the output contains nothing related to the UI and no backtrace.

Can you please help me get logs and debug the problem?

Thanks.

@johnhaddon
Member

johnhaddon commented May 30, 2024

Can you please help me get logs and debug the problem?

First, determine the process ID of the Gaffer process by typing `ps -ef | grep gaffer` in a terminal. Then run `eu-stack -p <PID>`, where `<PID>` is the process ID from the first command. This will print out a stack trace from every Gaffer thread, which is typically very useful for diagnosing hangs. If you could attach that output to this issue, that would be very helpful.

Here's an example running those commands on my system :

[screenshot of the terminal output]
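
For anyone who would rather script those two commands, here is a minimal Python sketch. It assumes `ps -ef` prints its usual eight columns with the command last, and that `eu-stack` (from elfutils) is on the `PATH`. The function names `find_gaffer_pids`, `dump_stacks` and `main` are made up for this illustration and are not part of Gaffer.

```python
import os
import subprocess


def find_gaffer_pids(ps_output, pattern="gaffer", exclude=()):
    """Return PIDs from `ps -ef` output whose command contains `pattern`."""
    pids = []
    for line in ps_output.splitlines()[1:]:  # Skip the header row.
        fields = line.split(None, 7)  # UID PID PPID C STIME TTY TIME CMD
        if len(fields) < 8:
            continue
        pid, command = int(fields[1]), fields[7]
        # Ignore the grep process itself and anything explicitly excluded.
        if pattern in command and "grep" not in command and pid not in exclude:
            pids.append(pid)
    return pids


def dump_stacks(pid):
    """Run `eu-stack -p <pid>` and return whatever it prints."""
    result = subprocess.run(
        ["eu-stack", "-p", str(pid)],
        capture_output=True, text=True, check=False,
    )
    return result.stdout + result.stderr


def main():
    listing = subprocess.run(
        ["ps", "-ef"], capture_output=True, text=True, check=True
    ).stdout
    # Exclude this script's own process in case its path contains "gaffer".
    for pid in find_gaffer_pids(listing, exclude=(os.getpid(),)):
        print("=== PID {} ===".format(pid))
        print(dump_stacks(pid))
```

Calling `main()` while Gaffer is frozen prints one block of stack traces per matching process, which can be redirected to a file and attached to the issue.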

@A6i8
Author

A6i8 commented May 30, 2024

Hi johnhaddon,

Thanks for your quick response.

I also checked with Gaffer 1.4.5, and the same thing happens.

Please find the attached error logs.
Gaffer_1.4.3.error.log
Gaffer_1.4.5.error.log

Thanks

@johnhaddon
Member

Oof, this one is nasty. Thanks for the logs - they make things pretty clear.

What's not clear is why this is happening for you repeatedly but not for anyone else yet. In theory it could definitely happen to anyone, but it seems to require that a Python-derived Node be destroyed on a background thread due to garbage collection, and at a very inconvenient time. Even when deleted, most nodes are still owned by the UI thread's undo queue so are unlikely to be disposed of in this way. I wonder if you have any custom code at all, and if any of that might make this more likely?

@A6i8
Author

A6i8 commented May 30, 2024

Hello John,

Thank you for your quick response. Yes, we have added a few Python expressions for automation. It's difficult to figure out which node might be causing the UI freeze, and we are currently looking into it. I have also attached the file for your reference. Please add a Geo, a Shader and an HDR in the lights for the file to work. If you find anything, I would love to hear your thoughts.

Thank you!

Bear_SHD_v001_t01.zip

@johnhaddon
Member

Thanks for the file - we'll see if we can reproduce the problem here. Quick note though : I'm about to go on holiday for a few days, so won't get a chance until at least next Tuesday.

As a short-term workaround, I'd be curious to know if running this helps reduce the frequency of the problem :

IECore.RefCounted.garbageCollectionThreshold = 10000

You could either do that in the PythonEditor or in a `~/gaffer/startup/gui/foo.py` file.
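
As a rough sketch, a startup file for this could look like the following. The helper name `raise_gc_threshold` is made up for illustration, and the `try`/`except` around the import is only there so the snippet can also be run outside a Gaffer environment; inside Gaffer, the plain assignment above is all that is needed.

```python
# Sketch of a startup file such as ~/gaffer/startup/gui/foo.py.
try:
    import IECore
except ImportError:
    # Not running inside Gaffer (or Cortex is not on the Python path).
    IECore = None


def raise_gc_threshold(threshold=10000):
    """Apply the workaround if IECore is available; return True if applied."""
    if IECore is None:
        return False
    # The setting suggested as a short-term workaround in this thread.
    IECore.RefCounted.garbageCollectionThreshold = threshold
    return True


applied = raise_gc_threshold()
```

This only aims to make the freeze less frequent, as described above; it does not address the underlying cause.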

@A6i8
Author

A6i8 commented Jun 3, 2024

Setting `IECore.RefCounted.garbageCollectionThreshold = 10000` helps.

@johnhaddon
Member

johnhaddon commented Jun 6, 2024

The UI freezes randomly, e.g. while selecting a node, while changing the layout, or while selecting a Catalogue.

Question : has this ever happened without changing the layout at some point beforehand (even if the freeze occurs when doing something else later)? I'm trying to figure out what might account for the stacktrace, and my main suspects at the moment are some internal nodes in some of the UI. But unless you've either changed the layout or removed something from it, I think I might be looking in the wrong place.

@johnhaddon
Member

I've managed to reproduce this quite simply now :

  1. Get a Catalogue with a bunch of images in it.
  2. Remove the ImageInspector from the layout.
  3. Select and deselect the Catalogue a few times.

johnhaddon added a commit to johnhaddon/gaffer that referenced this issue Jun 7, 2024
The problematic sequence of operations was this :

1. Destroy Editor. But Settings node lives on, because it is a wrapped RefCounted object and hence requires garbage collection.
2. Start unrelated BackgroundTask, which inadvertently triggers `IECore.RefCounted.collectGarbage()` on a background thread.
3. Settings node is destroyed on background thread by the garbage collection. All plugs are disconnected before destruction, including the `__scriptNode` plug.
4. Disconnections cause cancellation of background tasks associated with the ScriptNode, via `BackgroundTask::cancelAffectedTasks()`. Although the Settings node has no parent, the ScriptNode is still found due to the (about to be removed) connection to the `__scriptNode` plug.
5. `BackgroundTask::cancelAndWait()` never returns, because it is being called from the task's own thread.
6. The UI thread then waits for the task to finish, and we have complete deadlock.

This is worked around by removing the `__scriptNode` plug connection on the main thread at the time the Editor is destroyed.
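
Steps 5 and 6 of the sequence above can be illustrated with a toy model in plain Python. This is not Gaffer's implementation: `ToyBackgroundTask` and `work` are hypothetical names, and the timeouts exist only so that the demonstration terminates instead of hanging the way the real deadlock does.

```python
import threading


class ToyBackgroundTask:
    """Toy stand-in for a cancellable background task (not Gaffer's class)."""

    def __init__(self, function):
        self._function = function
        self._done = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        try:
            self._function(self)
        finally:
            # Completion is only signalled once the task function has returned.
            self._done.set()

    def cancel_and_wait(self, timeout=None):
        # A real task would also request cancellation here; the toy only
        # waits for completion, which is the part that matters for the hang.
        return self._done.wait(timeout)


outcome = {}


def work(task):
    # Stands in for garbage collection running inside the task and ending up
    # in cancel-and-wait on the very task that is doing the collecting.
    # Without the timeout this call could never return, because completion
    # is only signalled after this function exits.
    outcome["self_wait_finished"] = task.cancel_and_wait(timeout=0.2)


task = ToyBackgroundTask(work)

# The "UI thread" waits for the task too. It only gets an answer here because
# the toy gives up on its self-wait after 0.2 seconds.
outcome["main_wait_finished"] = task.cancel_and_wait(timeout=5.0)
print(outcome)
```

In this model the wait issued from the task's own thread times out, and the main thread's wait succeeds only after that timeout has expired. With no timeout on the self-wait, both threads would block indefinitely, which matches the complete deadlock described in step 6.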

Why is this only happening now? Because we only introduced the Settings node and the `__scriptNode` plug mechanism recently in 830de76.

But we have always had lots of other Python-derived nodes that require garbage collection, so why weren't _they_ causing problems? Because when they are collected, they will have no parent, and the standard way of finding the ScriptNode for cancellation is to look for a ScriptNode ancestor. The special case using the `__scriptNode` plug only applies to the Settings node.
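
The difference between the two lookup routes can be sketched with a simplified model. The classes and the `find_script_node` function below are invented for illustration and do not reflect Gaffer's actual API; `script_plug_input` merely stands in for the connection to the `__scriptNode` plug.

```python
class Node:
    """Toy node with an optional parent (not Gaffer's Node class)."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        # Stands in for the connection made to the `__scriptNode` plug.
        self.script_plug_input = None


class ScriptNode(Node):
    pass


def find_script_node(node):
    # Standard route: walk up the parent chain looking for a ScriptNode.
    ancestor = node.parent
    while ancestor is not None:
        if isinstance(ancestor, ScriptNode):
            return ancestor
        ancestor = ancestor.parent
    # Special case: fall back to whatever the plug is connected to.
    return node.script_plug_input


script = ScriptNode("script")

# An ordinary Python-derived node has no parent by the time it is collected,
# so no ScriptNode is found and no tasks are cancelled on its behalf.
ordinary = Node("ordinary")
found_for_ordinary = find_script_node(ordinary)

# The Settings node also has no parent, but the plug connection still leads
# to the ScriptNode.
settings = Node("settings")
settings.script_plug_input = script
found_before_disconnect = find_script_node(settings)

# The workaround: drop the connection on the main thread when the Editor is
# destroyed, so later garbage collection finds nothing to cancel.
settings.script_plug_input = None
found_after_disconnect = find_script_node(settings)
```

In this model only the connected Settings node resolves to the ScriptNode, and it stops doing so once the connection has been removed.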

Longer term it would be good to come up with a better mechanism than the `__scriptNode` plug, but I think this is a sufficient workaround in the meantime.

Fixes GafferHQ#5877
@johnhaddon
Member

I believe this is fixed by #5893. Test builds for that should be available here shortly : https://github.com/GafferHQ/gaffer/actions/runs/9416506352. It would be great to know if they work for you @A6i8 (without the garbageCollectionThreshold = 10000 workaround in place).

johnhaddon added a commit to johnhaddon/gaffer that referenced this issue Jun 12, 2024