Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Max physics delta time and scene load kraken #9

Closed
gotmachine opened this issue Sep 5, 2021 · 14 comments
Closed

Max physics delta time and scene load kraken #9

gotmachine opened this issue Sep 5, 2021 · 14 comments
Labels
kspBug Identified KSP issue

Comments

@gotmachine
Copy link
Contributor

gotmachine commented Sep 5, 2021

While it is true that Unity physics stability isn't affected by the max physics delta time, there is a purely KSP related issue that can occasionally result in vessels being torn apart on scene loads when the max delta time is increased.

I never managed to identify the precise cause, but it likely related to KSP using an IEnumerator implementation of the Part.Start() method (could also be caused by another coroutine, there are quite a few used in various subsytems/partmodules initialization methods).

When you increase the max delta time, this allow Unity to put many more FixedUpdates between each Update(). This will be especially true on scene loads, when your CPU is overloaded by all the initialization code.

Since those IEnumerators are tied to the Update() cycle, this mean that many more physics updates (FixedUpdate) will happen before things are fully initialized.

There are checks in place that prevent physics and all the related stuff (FlightIntegrator, floating origin, etc) to really kick in before everything is fully initialized (the "physics easing" thing), but there is likely something they missed, somewhere.

It would have been much more safe for them to implement the IEnumerators/Coroutines in the FixedUpdate() cycle (ie, doing yield return new WaitForFixedUpdate() instead of yield return null)

This is all theoretical, I never really took the time to verify it, but increasing the max delta time definitely increase the likeliness of Kraken events on scene loads, and that is a very likely root cause.

TODO

  • Try to find a reproducible case (large part count vessel ? heavily modded install ? autostruts ?)
  • Create flight scene addon that temporarily force the max physics delta time to 0.02 for the 4-5 first frames on scene load
gotmachine added a commit that referenced this issue Sep 9, 2021
Prevent occasional kraken events on flight scene load (see issue #9 )

The bug manifest itself depending on the CPU load.
It is more likely to happen when :
- A large part count vessel is loaded
- Autostruts and docking ports are used
- The player has a slow CPU
- Additional mods are doing initialization-like code in their first FixedUpdate
- The "max physics delta-time per frame" KSP setting has been increased from the default 0.04 value

This bug is due to KSP mixing `yield return null` coroutines (depending on the `Update` cycle) for initialization code in some components with `FixedUpdate` cycle initialization code in other components. Due to the `Time.maximumDeltaTime` unity setting being set to a higher value than
`Time.fixedDeltaTime`, depending on the CPU load, multiple `FixedUpdate` can happen for a single `Update`. This will wreck havoc on the intended initialization synchronization between the aforementioned components. I didn't identify the exact code execution path that lead to those kraken events, but I suspect the issue is KSP creating and activating autostruts joints before the regular part connection joints are created.

While the issue can potentially happen any time a part is instantiated in flight, I can only reproduce it somewhat consistently on flight scene load (probably because this is the most CPU-heavy case). The implemented fix works by temporarily setting `Time.maximumDeltaTime` to 0.02 for the 50 first frames of a flight scene load, ensuring perfect `FixedUpdate`/`Update` synchronization. This does not cover all the other potential cases (unloaded vessels becoming loaded, EVA construction...), as setting `Time.maximumDeltaTime` must be done a frame in advance to take effect, so when such an event happen, it's already too late. A proper fix would be to identify the exact component using the problematic `yield return null` coroutines, and patch those calls to a `yield return new WaitForFixedUpdate()`. Unfortunately, the likely culprit is the `Part.Start()` method, and altering such a critical component isn't trivial and likely to cause other side effects.
@gotmachine
Copy link
Contributor Author

gotmachine commented Sep 9, 2021

The bug manifest itself depending on the CPU load.
It is more likely to happen when :

  • A large part count vessel is loaded
  • Autostruts and docking ports are used
  • The player has a slow CPU
  • Additional mods are slowing things by doing initialization-like code in their firsts Update/FixedUpdate
  • The "max physics delta-time per frame" KSP setting has been increased from the default 0.04 value

This bug is due to KSP mixing yield return null coroutines (depending on the Update cycle) for initialization code in some components with FixedUpdate cycle initialization code/coroutines in other components. Due to the Time.maximumDeltaTime unity setting being set to a higher value than Time.fixedDeltaTime, depending on the CPU load, multiple FixedUpdate calls can happen back-to-back between every Update. This will wreck havoc on the intended initialization synchronization between the aforementioned components. I didn't identify the exact code execution path that lead to those kraken events, but I suspect the issue is KSP creating and activating autostruts joints before the regular part connection joints are created.

While the issue can potentially happen any time a part is instantiated in flight, I can only reproduce it somewhat consistently on flight scene loads (probably because this is the most CPU-heavy case), and only if the scene load doesn't trigger a "physics easing" phase (landed vessels).

The implemented fix works by temporarily setting Time.maximumDeltaTime to 0.02 for the 50 first frames of a flight scene load, ensuring perfect FixedUpdate/Update synchronization. This does not cover all the other potential cases (unloaded vessels becoming loaded, EVA construction...), as setting Time.maximumDeltaTime must be done a frame in advance to take effect, so when such an event happen, it's already too late.

A proper fix would be to identify the exact component using the problematic yield return null coroutines, and patch those calls to a yield return new WaitForFixedUpdate(). Unfortunately, the likely culprit is the Part.Start() method, and altering such a critical component isn't trivial and likely to cause other side effects.

@Lisias
Copy link

Lisias commented Sep 9, 2021

You found another Race Condition. The threads are being "synchronised" by brute force, using the scheduling behaviour detected by trial and error (i.e., scrubbing until it fits).

But this makes the code extremely brittle - it will work on a very specific environment, and once this scenario changes, things will start to play havoc everywhere: too much CPU cores, or too few, what else the computer is running at the time, you name it.

You will not find a faulty code, because the problem is missing code.

And this is probably one of the reasons they had a bad time porting this thing to Consoles - from the OS to the running environment, everything is radically different.

@gotmachine
Copy link
Contributor Author

gotmachine commented Sep 9, 2021

This has absolutely nothing do to with threading. The Unity game loop is always running on a single thread.
Not a single piece of the KSP codebase is using any form of multithreading.

I suggest you take a look at https://docs.unity3d.com/Manual/ExecutionOrder.html

@Lisias
Copy link

Lisias commented Sep 9, 2021

Unity uses a thingy called "Co-Routines" - a cheap way of multiprocessing using single-threads.

The mechanics are different, but the purpose is the same: get some degree of parallelism on the code, even if simulated using cooperative 'multitasking'.

There're similar problems on both approaches: synchronisation is one of them - and the behaviour you describe (changing the delta-physics inducing to crashes) is a clear symptom of lack of synchronisation on code running in parallel - even when such parallelism is simulated by co-routines.

@Lisias
Copy link

Lisias commented Sep 9, 2021

Something like that is being described here:

https://www.feelouttheform.net/unity3d-coroutine-synchronous/

Whoever is borking due an ongoing initialisation , ideally it should wait for that initialisation to finish first instead of relying on counting running cycles, expecting things get initialised by the time it runs ("scrubbing until it fits").

This is essentially concurrent code trying to work without a critical section (see concurrent processes theory).

On a multi-thread, multi-CPU system, the only way out of the mess is by mutexes and semaphores. Since you are living on a single-thread world, simpler mechanisms are available to you (one of them, on the link above). This is more like a event driven system, but it solves the problem the same.

@gotmachine
Copy link
Contributor Author

Nope. Coroutines are executed on the main thread, there isn't any parallelism involved or intended, and their execution order is perfectly deterministic.

@Lisias
Copy link

Lisias commented Sep 9, 2021

See the link I posted.

The problem here is concurrency.

@Lisias
Copy link

Lisias commented Sep 9, 2021

On a second though, the concurrency itself may be the problem - jobs that could be easily serialised are entering into concurrency by error.

This is what you are talking about?

There's a (correct) way to simply rearrange the section of code that do the autostruts?

@gotmachine
Copy link
Contributor Author

The link you posted is about making an unity coroutine executing "all at once" (which BTW is silly since you can just implement a normal enumerator and a while (enumerator.MoveNext()) loop for that) The term "synchronous" is used abusively to signify that the coroutine will run to completion immediately, instead of having the yield statements pausing until the next update or fixedupdate.

Coroutine are not involving any threading. Coroutines are a tool that basically allow you to run some code over the course of multiple FixedUpdate or Update cycles. They always run on the main thread. Nothing in the Unity game loop is running outside of the main thread. I suggest you read the official Unity documentation before jumping into crazy theories.

But to answer you, yes there is a mistake in the KSP codebase. Specifically, the Part.Start() coroutine should never have been implemented as yielding on Update (the yield return null instruction). But it's likely a mistake they did very early, and fixing it would likely have side effects. KSP suffers from a bunch of very core issues related to early decisions to make some things from Update and others from FixedUpdate (for example, VesselPrecalculate is running from one cycle or another depending on the packed state of the vessel). There isn't much they or we can do about it, unfortunately.

The implemented fix does resolve the issue that I experienced or were reported to me, without any risk of side effects. The other potential issues are theoretical, and I don't think there is actually real world scenarios that trigger them.

@Lisias
Copy link

Lisias commented Sep 9, 2021

We are battling on words - besides I have the feeling that we are on the same page.

Threading is just the most common (nowadays) form of concurrency (and since I live with threads everyday, I end up using the terms as they were the same, my mistake).

Co-routines are another form of concurrency. With same of the problems.

In a way or another, take what I said as criticism to KSP code, not yours.

@gotmachine
Copy link
Contributor Author

Co-routines are another form of concurrency. With same of the problems.

Well, not much more concurrent than any other Unity component like Vessels or PartModule (and to generalize, any program running arbitrary code on a main infinite loop). Yes, there are order of execution potential issues, but both the causes and solutions have nothing in common with anything remotely resembling what happens with threads.

@Lisias
Copy link

Lisias commented Sep 9, 2021

Yes, there are order of execution potential issues, but both the causes and solutions have nothing in common with anything remotely resembling what happens with threads.

That's the part I'm getting some trouble to understand. If by changing the delta-physics (that, for all practical purposes, it's like the quantum on a round-robin scheduler) you have problems, it's my understanding that the core of the problem is not the correct handling of some kind of concurrency.

Even if the concurrency itself is the bug, with some change somewhere else breaking a deterministic chain of events that now needs to be synchronised somehow.

@gotmachine
Copy link
Contributor Author

gotmachine commented Sep 10, 2021

I'm not changing the physics delta time. I'm changing this setting : https://docs.unity3d.com/ScriptReference/Time-maximumDeltaTime.html

I will explain it in detail, since documenting yourself seems too hard a task.

Unity runs a game loop. It's basically an infinite loop where each iteration represent the "time slice" of a visual frame. Each iteration, game logic is called, then lower level stuff is called (user input, graphics rendering, etc). Each iteration, 2 methods are called on all existing components in the scene : FixedUpdate() and Update().

Those two methods are separate because the physics engine needs to run at fixed delta time (0.02s/50 ups in KSP), while the framerate is variable (typically 60 ups). Every iteration is always calling Update() once, it's the method that is intended to be implemented to handle everything that isn't interacting with the physics engine (for example, UI stuff or animations). However, to handle the fact that the "time slice" of the iteration is variable, unity will decide to call the FixedUpdate() once, multiple times, or not at all per iteration. In a typical Unity game, you don't really have to worry about all that.

But KSP isn't a typical Unity game. I wont expand on the details, but due to how it stretches its usage of the physics engine far outside its intended usage and capabilities, the implementations have to use a variety of tricks (not to say hacks) that make things quite messy to handle. Arguably, those tricks could be implemented way better, but it's easier to criticize now that we have 10 years of feedback and experience. In my opinion, they did quite a decent job at figuring all this from scratch. The consequences we are experiencing now are a classic technical debt problem.

Anyway, the end result is that KSP, as an Unity engine implementation, has a very unusual level of cross dependency between components, and consequently order-of-initialization issues. Coroutines are a perfectly fine way to handle those issues. In short, instead of initializating your components in the Start() method, you delay some of that initialization a frame latter. As long as you keep documented what components are delayed by how many frames and why, it's a perfectly fine way to handle that issue, and in fact, it's more or less the only way to handle it. Unity has a system to handle components execution order, but it doesn't offer the granularity that would be required to solve what KSP needs.

Now, what the actual issue is then ? Well, you can choose to have your coroutines called from either Update or FixedUpdate. And we have seen that those two cycles aren't necessarily in sync (ie, there is no garantee that you are getting one FixedUpdate per each Update). So if you are using initialization coroutines, it's critical that you implement all of them in the same cycle.

This what the KSP issue is. The Part.Start() coroutine is implemented in Update, while almost all others are implemented in FixedUpdate(). I can perfectly see what happened here. The Part component is likely the first that was developed, so it was fine to implement it in either cycles, and they choose the Update cycle. At this time, they didn't anticipated that they might have to implement delayed initialization for other components, and that those components would be dependent on the FixedUpdate cycle due to them being dependent on physics stuff.

Thing is, most of the time, this isn't an issue, because you usually get one FixedUpdate per Update, especially on the first few frames. By the time they realized that mistake, there was likely too many things depending on the fact that Part.Start() had an Update initialization. Again, Part.Start() is a huge entry point for tons of other stuff.

Now, what my fix is doing to handle that issue ? Simple : it temporarily changes the Unity setting that determine how many FixedUpdate cycles are allowed to be executed per Update cycles, forcing it to be either 0 or 1. Having the "0" case would maybe be problematic too, but it's almost guaranteed to never happen due to both cycles being quite overloaded in the first few frames, so the Unity "scheduler" isn't likely to play a trick on us.

But in a way, you're right, this isn't a "real" fix, it's a band aid.

The real fix would be to change the initialization coroutines yielding on Update so they yield on FixedUpdate. In theory, that would work and since that would just force the "intended design", it should be seamless for everything else. But I'm reluctant to do it, because the main culprit is Part.Start(), and it's very huge central component, that is also an entry point for tons of other things, including all the weird stuff that some plugins may be relying upon. This being said, I will likely try it at some point, at least to see if I discover any obvious issues.

@gotmachine gotmachine added the kspBug Identified KSP issue label Oct 24, 2021
@gotmachine
Copy link
Contributor Author

Fix implemented in for the flight scene load case.

Other occurrences of that issue might exist, and would be covered by the more "core" fix in the https://github.com/KSPModdingLibs/KSPCommunityFixes/tree/PartStartAsFixedUpdateCoroutine branch.

Possible occurrences might be EVA construction and other cases of parts/vessels being created post-scene load. Unless I find a reproducible case of that issue, I won't implement that more general fix, as there are potential side effects that are hard to evaluate.

gotmachine added a commit that referenced this issue Mar 9, 2022
- New bugfix : AutoStrutDrift, see [issue #21](#21).
- New bugfix : PartStartStability, see [issue #9](#9).
- The FlightSceneLoadKraken patch is superseded by the PartStartStability patch, which is now disabled by default
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kspBug Identified KSP issue
Development

No branches or pull requests

2 participants