Entity Ids currently have a risk of collision #117
I think this discussion extends to the use of entity IDs in persisted, serialized files (scenes). Replacing the current IDs with atomically incremented ints would render persisted IDs meaningless, since the same IDs would be generated on every run, meaning collisions everywhere. Note that entity IDs that are persisted have to be unique across all runs for your entire team, for the duration of your development cycle.
Some additional points, for completeness' sake, from the ecs channel, by @qthree:

For reference:
I know that having a single ID space everywhere is very attractive, but I think it's somewhat inescapable that there will be more than one. Persisting things to disk really wants UUIDs, and network syncing wants substantially less. So I would caution against trying to pick something in the middle (i.e. 32 bits) that isn't ideal for either situation. If you conclude that the same ID used everywhere is likely not going to work out, then I think UUIDs are a pretty clear choice for anything that's persisted. The cost of generating the UUID should be dominated by the cost of whatever other operation is happening that necessitated generating a UUID in the first place (like saving data to disk).
FYI the hecs fork readme currently states that IDs are always UUIDs; that should be updated whenever a choice gets made here.
I'm definitely aware of how other projects do it. I am personally attached to the simplicity of a "globally unique entity id". It reduces the internal machinery involved in the ECS implementation (a relatively small win) and, more importantly, it simplifies the conceptual model for users: "create id anywhere and it is unique".
I agree that this is the situation. Based on my current understanding of guids, this would not be a problem. If someone has proof that random UUIDs have a meaningful collision risk, please share it. Otherwise let's focus on the performance + code complexity discussion. edit: clarified that this is "random uuids"
What is the performance difference with using
If you look at the ahash readme, it can hash a 128-bit int in 0.61 nanoseconds. For comparison, fxhash can hash a 64-bit int in... 0.62 nanoseconds. As I understand it, hashing is the main operation applied to IDs, so if we use ahash there's not much cost there. Larger IDs aren't going to be a problem for overall memory usage; that'll be dominated by game assets. Larger IDs could increase register pressure slightly in hot loops. However, in an ECS that groups components into archetypes, iteration mostly touches component data rather than IDs, so I suspect that for most workloads the difference will be small. (Unfortunately I don't have time to write actual benchmarks rn, but could next week.)
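To make concrete what's being measured above, here is a minimal sketch of the hashing operation applied to a 64-bit vs 128-bit id. ahash and fxhash are external crates, so std's `DefaultHasher` stands in here; the operation is the same one a `HashMap` applies on every id lookup.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Hash an id the same way a HashMap would before a lookup.
// A benchmark harness (e.g. criterion) would time this call
// for u64 vs u128 keys to reproduce the readme numbers.
fn hash_id<T: Hash>(id: &T) -> u64 {
    let mut hasher = DefaultHasher::new();
    id.hash(&mut hasher);
    hasher.finish()
}

fn main() {
    let small: u64 = 42;
    let big: u128 = 42;
    // Both reduce to a 64-bit hash; only the per-hash cost differs.
    println!("u64 hash:  {}", hash_id(&small));
    println!("u128 hash: {}", hash_id(&big));
}
```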
The u32 collision issue is real enough that I'll fix it today if no one else will, probably just by bumping to u64 or u128. We can continue to discuss the "best" solution. I'll do ecs_bench comparisons to see how much perf is affected.
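For reference, the "bump to u64" option can be sketched as an atomic counter, which never collides within a single run. This is an illustrative sketch, not Bevy's actual API:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Process-wide counter; every call hands out a fresh id.
static NEXT_ID: AtomicU64 = AtomicU64::new(0);

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct Entity(u64);

impl Entity {
    fn new() -> Self {
        // Relaxed ordering is enough: we only need uniqueness,
        // not ordering with respect to other memory operations.
        Entity(NEXT_ID.fetch_add(1, Ordering::Relaxed))
    }
}

fn main() {
    let a = Entity::new();
    let b = Entity::new();
    assert_ne!(a, b);
    println!("{:?} {:?}", a, b);
}
```

Note this is exactly the scheme that breaks "unique across runs" for persisted scenes, which is why the serialization discussion below matters.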
I performed a few benchmark tests, possibly wrong, but the code is at https://github.com/OvermindDL1/entity_bench/ if anyone wants to fix it.
These do indeed hold, but you should add that u128 UUIDs will then be larger than the associated incrementing ints. Also, as for 'decentralized': they only are when calculated. To actually be put into the system still requires world synchronization, so that has to happen no matter the case.
From my quick benchmark on my 13-year-old desktop CPU, getting an incrementing ID using my ECS's code takes (for u32, u64, and u128; it supports user types as well, great for making a type-strong id):

create/create-linear-u32 time: [8.1782 ns 8.4789 ns 8.8042 ns]
create/create-linear-u64 time: [7.5045 ns 7.5696 ns 7.6326 ns]
create/create-linear-u128 time: [7.8254 ns 7.8976 ns 7.9624 ns]

Creating a random u128 takes:

[26.515 ns 26.636 ns 26.777 ns]

And that's not even including the time it takes to perform the insertion into the hashmap on the world:

create/create-u128-rand-hashmap time: [67.235 ns 67.585 ns 67.937 ns]
create/create-u128-rand-ahash time: [46.223 ns 46.509 ns 46.785 ns]
Except two things. First, you 'should' still test for collisions or weird errors can suddenly pop up; with u128 it is less likely, but still bound to happen to the rare user. Second, and the bigger thing: you can load entities multiple times (great for templates), so you have to change their IDs anyway. Should just write out entities as-is, but load them while creating new IDs and reconnect things as appropriate.
Only if you don't remap on load, which you should do anyway because you can load things multiple times (instance a house and all its stuff into multiple locations, for example).
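The remap-on-load idea above can be sketched as follows: serialized data keeps its stored ids, and every load allocates fresh runtime ids, so the same file can be instanced many times without collisions. All names here are illustrative, not any engine's real API:

```rust
use std::collections::HashMap;

// Allocate a fresh runtime id for each id found in the serialized data.
// References inside the loaded data would then be rewritten via `mapping`.
fn remap_on_load(
    serialized_ids: &[u64],
    next_id: &mut u64,
) -> HashMap<u64, u64> {
    let mut mapping = HashMap::new();
    for &old in serialized_ids {
        // Each old id gets exactly one fresh id per load.
        mapping.entry(old).or_insert_with(|| {
            let fresh = *next_id;
            *next_id += 1;
            fresh
        });
    }
    mapping
}

fn main() {
    let house = [100, 101, 102]; // ids as stored on disk
    let mut next = 0;
    // Instance the same "house" twice: two disjoint sets of runtime ids.
    let first = remap_on_load(&house, &mut next);
    let second = remap_on_load(&house, &mut next);
    assert_ne!(first[&100], second[&100]);
    println!("first: {}, second: {}", first[&100], second[&100]);
}
```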
Well, with u32 I couldn't get a successful benchmark to run at all due to collisions. With u128, statistically you'd have a 50% chance of collision once 2^64 entities have been allocated. Given enough entities, players, and days, that is bound to happen eventually, and it's a bad user experience for a program to suddenly crash with a lookup error or the like. It may be a small chance, but it is still possible in a reasonable time frame (and that chance can be made so much worse in certain situations, especially with poor-quality / fast PRNGs or certain game setups), and just waiting for a ticking time bomb in the code makes one very wary. The engine should be as stable as possible, and random IDs are not as stable as possible.
With just u64 you have a 50% chance of collision after only about 4 billion entities created, which over even a modest set of players over a few days is bound to happen. You could make a loop run for a few minutes and likely get a collision.
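The figures in these two comments are birthday-bound estimates and can be checked directly: with d equally likely ids, the collision probability among n ids is roughly 1 - exp(-n^2/(2d)), and the 50% point sits at n ≈ 1.1774·sqrt(d). A small sketch:

```rust
// Number of random ids at which the collision probability reaches ~50%,
// for an id space of 2^id_bits values: n ≈ sqrt(2 * d * ln 2).
fn fifty_percent_point(id_bits: i32) -> f64 {
    let d = 2f64.powi(id_bits);
    (2.0 * d * std::f64::consts::LN_2).sqrt()
}

fn main() {
    // u32: ~77,000 ids -- which is why the u32 benchmarks kept failing.
    println!("u32:  {:.3e}", fifty_percent_point(32));
    // u64: ~5 billion ids, i.e. the "about 4 billion" figure above.
    println!("u64:  {:.3e}", fifty_percent_point(64));
    // u128: on the order of 2^64 ids before a 50% chance of any collision.
    println!("u128: {:.3e}", fifty_percent_point(128));
}
```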
No need to hash incremental, reused IDs, so no hashing cost at all.
Yeah, I hit it multiple times today with my tests...
This is definitely a concern from a usability perspective. If that is a problem then I think it outweighs any cognitive advantage of "create id anywhere and it is unique" (though I agree with the beauty of that model, if it could ever be relied upon).
In order to reach that collision likelihood, where we have a shared game world in which each player generates 1,000,000 unique entities per day (which is high), we would need:

That's a lot of entities and a lot of time.
That's still just to reach 50%; even a small fraction of a single percent will still cause it to occasionally happen, given the number of people using it in a variety of ways over time.
128-bit UUIDs are the gold standard for "no one will ever get the same one, ever". They are relied upon in distributed databases for far more important things than games. According to Wikipedia: "the probability to find a duplicate within 103 trillion version-4 UUIDs is one in a billion" (https://en.wikipedia.org/wiki/Universally_unique_identifier). You are about 1000 times more likely than that to be struck by lightning this year (https://www.weather.gov/safety/lightning-odds). I also think micro-benchmarking the cost of generating them ignores that whatever is triggering creating the UUID in the first place will almost certainly dominate the cost of generating the UUID. Even if spawning an entity were nothing more than a 128-byte memcpy, a profiler would show that before it showed generating a UUID. Games don't spawn/despawn thousands of entities per frame during gameplay anyways.
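The Wikipedia figure quoted above can be sanity-checked with the same birthday approximation: a version-4 UUID has 122 random bits, and for small p the collision probability is about n^2/(2d). A quick check:

```rust
// Birthday approximation for small probabilities: p ≈ n^2 / (2d),
// where d is the number of possible ids (2^id_bits).
fn collision_probability(n: f64, id_bits: i32) -> f64 {
    let d = 2f64.powi(id_bits);
    n * n / (2.0 * d)
}

fn main() {
    let n = 103e12; // 103 trillion UUIDs
    // Version-4 UUIDs carry 122 random bits (6 bits are fixed).
    let p = collision_probability(n, 122);
    // Comes out to roughly 1e-9, i.e. "one in a billion".
    println!("p ≈ {:.2e}", p);
}
```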
I think this is a really good point. If I were designing an engine I would do:
I think 128-bit for in-memory IDs can work, but between @OvermindDL1's comment re: spawning multiple copies of the same entity in different places in the world, and the likelihood that there will be more than one ID space anyways, I would accept having a UUID/in-memory ID mapping table and enjoy the smaller IDs.
Hmm... not sure if it's of help, but what about defining a namespace for each project and an atomic 32-bit increasing number?
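The namespace idea could be sketched as packing a 32-bit per-project namespace into the high bits and an atomic 32-bit counter into the low bits, so ids stay 64 bits and can't collide across namespaces. This is an illustrative sketch (and the counter would still need to be persisted per namespace to survive across runs):

```rust
use std::sync::atomic::{AtomicU32, Ordering};

// Per-process counter; in a real system this would be persisted
// so ids remain unique across runs within one namespace.
static COUNTER: AtomicU32 = AtomicU32::new(0);

// High 32 bits: project namespace. Low 32 bits: incrementing counter.
fn make_id(namespace: u32) -> u64 {
    let n = COUNTER.fetch_add(1, Ordering::Relaxed);
    ((namespace as u64) << 32) | n as u64
}

fn main() {
    let a = make_id(7);
    let b = make_id(7);
    assert_eq!(a >> 32, 7); // namespace recoverable from the id
    assert_ne!(a, b);       // counter makes ids distinct
    println!("{:#018x} {:#018x}", a, b);
}
```

The remaining difficulty, raised elsewhere in this thread, is assigning namespaces so that two projects (or two machines on one team) never share one.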
What about particle effects, though? If each particle has to have a transform, don't those need their own entities? I'm not sure if there are different strategies for handling particles or not.
I wouldn't put particles in an ECS. It would be common to leave that data in a GPU buffer, since it generally doesn't affect gameplay. It's also an example of how, if you are doing an extreme amount of something (as in sand games), you probably want a non-general, highly optimized solution rather than a general-purpose one.
Spawning templates of things is common, like say a projectile for an FPS, units in an RTS, etc. In addition, I hit another issue with using random entity IDs: some of my map generation iteration time is dominated by the hashmap lookup from looking up entity data to get its draw information so I can render visible entities and map information. If the IDs were packed near 0, like in my code above, that cost vanishes as it's just an array index. Hashmap lookups in tight loops are a significant performance hit.
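The contrast being drawn here is between a direct array index (possible when ids are dense near zero) and a per-entity hash lookup (forced by random ids). A minimal sketch with hypothetical names:

```rust
use std::collections::HashMap;

// Dense ids near 0: draw data lives at table[id], no hashing at all.
fn lookup_dense<'a>(table: &[&'a str], id: usize) -> &'a str {
    table[id]
}

// Random ids: every lookup hashes the full 128-bit id first.
fn lookup_random<'a>(table: &HashMap<u128, &'a str>, id: u128) -> &'a str {
    table[&id]
}

fn main() {
    let dense = ["tree", "rock", "house", "unit"];
    let mut random = HashMap::new();
    random.insert(0xdead_beef_u128, "tree");

    // In a render loop over thousands of visible entities per frame,
    // the indexed form is O(1) with no hash and good cache behavior.
    println!("{}", lookup_dense(&dense, 2));
    println!("{}", lookup_random(&random, 0xdead_beef));
}
```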
Forgive my newness to the topic, but would a family generator style a la https://skypjack.github.io/2019-02-14-ecs-baf-part-1#unique-identifiers make sense within the Bevy context? |
@RLHerbert no problem. :) Someone correct me if I am wrong, but the identifiers referred to by that post are component identifiers, not entity identifiers as discussed in this thread. Component identifiers are actually being discussed in #32. The technique in that post will also not work very well for component identifiers in our case, because those IDs are only unique in the context of the program while it is running, and we need to make sure that components are as unique as possible even across different dynamically loaded plugins, etc.
Small update here: moving to u128 (#117) does regress world creation performance enough that I'm now more willing to consider alternatives (see that PR for numbers). I still consider that performance "totally fine" for basically any context, but for some people being as fast as possible is important and I want to take that into account. Ideally, if we move back to incrementing indices, we do that in a way that does not affect the UX of:
We still need stable ids in those contexts and they should be "free" if possible. I haven't yet made up my mind about this and I'd love everyone else's thoughts. Short term, I'll be merging the pr to resolve collision issues. |
I'm sure you meant to link to this pull request: #393 😊 |
In favor of 128-bit guids

Several posts here have talked about the possibility of u128 collisions (you only need to create 2^64 items to get a ~50% chance of collision). It's actually not nearly that bad, because it doesn't matter how many items you create; it matters how many items you store in a common pool where collisions will be a problem. (I.e. if you're creating 10,000 new IDs per second for some particle system, they're probably not very long-lived, so the fact that one of those IDs gets re-used a year later in the same game doesn't actually hurt anything.)

Against 128-bit guids (or any other random scheme)

Determinism is good! Any purely random-number-based scheme inherently breaks determinism, which is very bad (for reproducibility, debugging, deterministic multiplayer, lots of stuff). You can get determinism back by seeding the RNG if you need it, but then you can't use a thread-local RNG. (Which is probably fine actually; a mutex around entity creation isn't a big deal since you need to insert into global data structures anyway.)
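The seeded-RNG-behind-a-mutex option mentioned above can be sketched with a tiny deterministic generator. This uses the well-known splitmix64 mixing function and illustrative names, not any engine's real API:

```rust
use std::sync::Mutex;

// Deterministic id source: same seed, same id sequence, every run.
struct SeededIds {
    state: Mutex<u64>,
}

impl SeededIds {
    fn new(seed: u64) -> Self {
        SeededIds { state: Mutex::new(seed) }
    }

    // splitmix64: advance by a fixed odd constant, then mix.
    fn next(&self) -> u64 {
        let mut s = self.state.lock().unwrap();
        *s = s.wrapping_add(0x9e3779b97f4a7c15);
        let mut z = *s;
        z = (z ^ (z >> 30)).wrapping_mul(0xbf58476d1ce4e5b9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94d049bb133111eb);
        z ^ (z >> 31)
    }
}

fn main() {
    let a = SeededIds::new(42);
    let b = SeededIds::new(42);
    // Same seed reproduces the same ids, which is the point:
    // a recorded seed lets you replay a failing run exactly.
    assert_eq!(a.next(), b.next());
    assert_eq!(a.next(), b.next());
    println!("reproducible");
}
```

The mutex makes this a single serialized id source, which, as the comment notes, is acceptable since entity creation already touches shared world structures.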
@jpetkau I agree with your points, but I would argue that determinism isn't always required or desirable. Ex: the amount of time it takes to render a frame isn't deterministic. Requiring it to be deterministic is basically impossible and doesn't buy us much. Game code shouldn't rely on total_time_to_render being the same every frame. I'd argue that this is also true for entity ids. They are "opaque" and game code shouldn't rely on them being the same every time. Doing so creates more problems than it solves. In this case randomizing encourages good behavior, because it makes code that relies on entity id stability fail immediately. Go employs the same approach to hashes / iteration order of hashmaps (or at least, they used to). They explicitly make iteration order nondeterministic to prevent people from relying on it.
legion 0.3 released recently with what's IMO the most solid serialization support of any Rust ECS today: https://docs.rs/legion/0.3.0/legion/serialize/index.html It has many months of effort behind it, with Aclysma and me working to make prefabs work with it, and Tom improving the ergonomics of the APIs.
@kabergstrom yeah, I totally agree that legion's new serialization is nice. We already do something similar on the registry side (although we automatically generate friendly names), but I am now heavily considering an entity id approach similar to legion's. This would likely consist of the following changes:
I'll probably tackle this soon so we can move on to other things. |
Also worth a look: it allows rewriting entity IDs when cloning them, including all references, by hooking Entity's Clone impl to rewrite the references as they are being cloned. It's what allows prefabs to be cloned while maintaining coherent internal references that point to the newly spawned entities.
[Sorry, this turned into a ramble that may not be super relevant to the issue any more.]
Yes, "always required" is a much stronger point than I intended to make. I would say it's always valuable, though, just sometimes the value isn't worth the effort to get there. Basically there are two use-cases I had in mind:
Here you have a large chunk of the game logic (but not 100%, e.g. rendering as you noted) which is meant to be exactly deterministic. For a complex game, this is doable, but really hard. (I've shipped this kind of game a couple of times.) You absolutely need deterministic pieces to build out of. And you need to do things like compute a hash over all the deterministic state, so you can immediately detect bugs and narrow down their causes. It already takes a lot of care and discipline to make this work. Having components actively work against you, by making entity IDs or iteration order nondeterministic, makes it pretty much impossible.
This is more of a general comment on the value of being able to reproduce an exact run of some code.
As you point out, nondeterminism as in hash tables (for fuzzing or to avoid certain kinds of DOS attacks) is also valuable. But I think it's even better on top of a deterministic seed. E.g. tests may be nondeterministic as long as they're passing, but as soon as they find a failure, they record the seed so that failure can then be reproduced 100%. But doing that still requires the ability to run deterministically when desired. Also re hash tables--having non-guaranteed iteration order is very often a PITA. There's a reason for the gradual move to "iteration is in order of insertion" hash tables--they're just more useful. Here's an example from some past work:
The result: code that should have been trivial to test ("input A, B, C should be written out unchanged; input D, E, F should be written out like this") was hard to test because the output order was shuffled. If you used the tool on a file, you couldn't just look at a diff to see the results, because the file was shuffled! Logically it was the same, but practically it was useless. Even in the case of rendering (where e.g. frame times are nondeterministic), there's value in making as much of a game as possible not dependent on that, and in setting up the code so that if you were to record and replay the frame times (and player input and other nondeterministic parts), you could reconstruct the rest of the run exactly. Of course it's not worth the effort if the libraries you're building on are fighting you every step of the way, which is why I argue for more determinism in libraries.
@jpetkau yup, those are all good points! Fortunately in this case we now use atomic incrementing indices, so we are now "more deterministic" here. As soon as we introduce parallelism that goes out the window, though. I'm closing this issue as it is now "fully resolved". But feel free to continue discussing related topics here if you all want to 😄
As some folks have pointed out in the discord, Bevy ECS uses rand::<u32>() to generate entity ids, which gives entities a decent chance of colliding when in the tens of thousands. This needs to be fixed somehow. Per our discord discussion:
I'm not worried at all about random uuid collision, so at this point I think the conversation should be about the performance cost of using them.
Someone should run benchmarks for each implementation.