UI Freezing Randomly - RHEL 9.4 #5877
First, determine the process ID of the Gaffer process, by typing Here's an example running those commands on my system : |
Hi johnhaddon, Thanks for your quick response. I also checked with Gaffer 1.4.5 and the same thing happens. Please find the error logs attached. Thanks
Oof, this one is nasty. Thanks for the logs - they make things pretty clear. What's not clear is why this is happening for you repeatedly but not for anyone else yet. In theory it could definitely happen to anyone, but it seems to require that a Python-derived Node be destroyed on a background thread due to garbage collection, and at a very inconvenient time. Even when deleted, most nodes are still owned by the UI thread's undo queue so are unlikely to be disposed of in this way. I wonder if you have any custom code at all, and if any of that might make this more likely?
Hello John, Thank you for your quick response. Yes, we have added a few Python expressions for automation. It's difficult to figure out which node might cause the UI freeze, and we are currently looking into it. I have also attached the file for your reference. Please add a Geo, a Shader and an HDR in the lights for the file to work. If you find anything, I would love to hear your thoughts. Thank you!
Thanks for the file - we'll see if we can reproduce the problem here. Quick note though : I'm about to go on holiday for a few days, so won't get a chance until at least next Tuesday. As a short term workaround, I'd be curious to know if running this helps reduce the frequency of the problem :
You could either do that in the PythonEditor or in a |
Question : has this ever happened without changing the layout at some point beforehand (even if the freeze occurs when doing something else later)? I'm trying to figure out what might account for the stacktrace, and my main suspects at the moment are some internal nodes in some of the UI. But unless you've either changed the layout or removed something from it, I think I might be looking in the wrong place.
I've managed to reproduce this quite simply now :
The problematic sequence of operations was this :

1. Destroy Editor. But Settings node lives on, because it is a wrapped RefCounted object and hence requires garbage collection.
2. Start unrelated BackgroundTask, which inadvertently triggers `IECore.RefCounted.collectGarbage()` on a background thread.
3. Settings node is destroyed on background thread by the garbage collection. All plugs are disconnected before destruction, including the `__scriptNode` plug.
4. Disconnections cause cancellation of background tasks associated with the ScriptNode, via `BackgroundTask::cancelAffectedTasks()`. Although the Settings node has no parent, the ScriptNode is still found due to the (about to be removed) connection to the `__scriptNode` plug.
5. `BackgroundTask::cancelAndWait()` never returns, because it is being called from the task's own thread.
6. The UI thread then waits for the task to finish, and we have complete deadlock.

This is worked around by removing the `__scriptNode` plug connection on the main thread at the time the Editor is destroyed.

Why is this only happening now? Because we only introduced the Settings node and the `__scriptNode` plug mechanism recently in 830de76. But we have always had lots of other Python-derived nodes that require garbage collection, so why weren't _they_ causing problems? Because when they are collected, they will have no parent, and the standard way of finding the ScriptNode for cancellation is to look for a ScriptNode ancestor. The special case using the `__scriptNode` plug only applies to the Settings node.

Longer term it would be good to come up with a better mechanism than the `__scriptNode` plug, but I think this is a sufficient workaround in the meantime.

Fixes GafferHQ#5877
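
For anyone trying to picture step 5, here is a minimal sketch of the same shape of bug in plain Python. It does not use Gaffer at all: the `BackgroundTask` class, `cancel_and_wait()` and the `body` function are hypothetical stand-ins, and the `timeout` argument exists only so the example can terminate. With no timeout, the wait made from inside the task would block forever, since the task cannot finish while it is waiting for itself.

```python
import threading


class BackgroundTask:
    """Toy stand-in for a cancellable task. Not Gaffer's real class."""

    def __init__(self, function):
        self._cancelled = threading.Event()
        self._done = threading.Event()
        self._thread = threading.Thread(target=self._run, args=(function,))
        self._thread.start()

    def _run(self, function):
        try:
            function(self)
        finally:
            # Only reached once the task body has returned.
            self._done.set()

    def cancel_and_wait(self, timeout=None):
        """Request cancellation, then block until the task has finished.

        Returns True if the task finished, False if the wait timed out.
        """
        self._cancelled.set()
        return self._done.wait(timeout)


results = {}


def body(task):
    # Simulates cleanup code (e.g. triggered by garbage collection) that runs
    # inside the task and ends up cancelling that same task from its own thread.
    # The task cannot be "done" while it is still waiting here, so without the
    # timeout this call would never return.
    results["self_wait_finished"] = task.cancel_and_wait(timeout=0.2)


task = BackgroundTask(body)

# The same call from a different thread is fine: the task body eventually
# returns (here, after its own wait times out), so the wait succeeds.
results["main_wait_finished"] = task.cancel_and_wait(timeout=5)

print(results)
```

In the real deadlock there is no timeout on either side, so the task thread waits on itself and the UI thread waits on the task.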
I believe this is fixed by #5893. Test builds for that should be available here shortly : https://github.com/GafferHQ/gaffer/actions/runs/9416506352. It would be great to know if they work for you @A6i8 (without the |
Version: Gaffer 1.4.3.0-linux-gcc9
Third-party tools: Arnold
Third-party modules: None
Linux version: 5.14.0-427.16.1.el9_4.x86_64
(mockbuild@iad1-prod-build001.bld.equ.rockylinux.org) (gcc (GCC)
ldd (GNU libc) : 2.34
Description
The UI freezes randomly, e.g. while selecting a node, changing the layout, or selecting the Catalogue.
Nothing else is running in the background.
I also enabled
IECORE_LOG_LEVEL: "DEBUG"
but it logs nothing related to the UI and no backtrace. Can you please help me get logs and debug the problem?
Thanks.