-
-
Notifications
You must be signed in to change notification settings - Fork 21.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Relaxes node name sanitization in gltf documents. #45545
Relaxes node name sanitization in gltf documents. #45545
Conversation
What test cases did you use to show it works? |
Tested with a few glb's generated via Blender (one of which is attached to #45544). The more extensive one has a cube named This was intended as a demonstration of the workaround we're using for #45544 (on the 3.2 branch) and a discussion point more than a fully tested patch. Unfortunately I don't know enough about actual limitations on node names in Godot other than what I found enforced here: Line 994 in 5496174
# breaks hints but haven't tracked down why) so I'm not sure what cases to look out for.
I'd be happy to dig/test further if I can get a pointer in the right direction. Wrt. non-English names there should be no change in the existing behavior; anything outside of ASCII and the explicitly listed symbolic characters would still be stripped. |
We had some fun discussion about adding the 💩 emoji and Japanese / Korean / Chinese / etc node names. So that's undecided. However https://github.com/KhronosGroup/glTF/tree/master/specification/2.0#json-encoding says
However, they will still need to be unique. Edited: This expands the list of characters and it's something is hard to discover unless you read or try it, I can understand expanding to other languages, but these list of symbols have other meanings in code, not sure about the interaction with Want to have some other opinions, so I'll get back to you. |
Totally agree, I'm not at all sure what ill effects there could be and would definitely appreciate the guidance. Quickly tested the get_node behavior for the symbols in this PR:
and they are resolved correctly. That being said, if there are any other consequences of allowing these symbols to be part of the node names I think it'd be good to capture them in node.cpp's I think adding unicode support would also be great, having this sanitization apply the superset of node.cpp and glTF limitations feels like it'd be the best user experience. |
Some comments I put in the Godot Contributors chat:
|
Totally agree. I'd hope that we can assume any additional forbidden chars in the glTF spec would've already been stripped/replaced/whatever'd by the time the importer is run (unless you mean the Godot node spec, in which case I'd expect I did notice that names including |
Anyone opposing:
Note this should probably match skeleton validation. |
No objection from my side. We'd just have to track down and fix the behavior I observed with '#' breaking import hints. I'm happy to take a pass at that.
I agree in principle that it's a better user experience to be as permissive as possible but haven't looked at the bone code myself. My understanding is that they're not Node's so I'm not sure if they need something more restrictive than the Node enforcement. That said, I think it'd be a big improvement to centralize bone name sanitization across all the import paths as well for the sake of consistency/ease of documentation. Also, is there a mechanism to test the import behavior automatically that I should leverage? |
See https://github.com/godotengine/godot-tests/tree/master/tests/blend_export for some automated testing. You can add new test cases. |
a83d4c1
to
af7c199
Compare
scene/main/node.h
Outdated
@@ -187,7 +187,9 @@ class Node : public Object { | |||
|
|||
#ifdef TOOLS_ENABLED | |||
friend class SceneTreeEditor; | |||
friend class GLTFDocument; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm curious as to whether _validate_node_name should just be made public instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is anything else using it? Do we want other things to use it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think anything else is using it today, but I'm also not sure if there are other cases where sanitization is being performed (maybe in the other import modules?). In my mind it'd be better for consistency's sake if we'd have everything that generates a Node and needs fine control over naming would leverage _validate_node_name
so any future modifications to expand/reduce the restricted characters would be applied uniformly.
Similarly since AnimationPlayer adds a couple additional restrictions, I think we would ideally have a similar public static _validate_animation_name
that can perform animation-specific cleanup.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would try hard to find at least one other use or invent a use because if nothing uses it, we don't need to expose.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would try hard to find at least one other use or invent a use because if nothing uses it, we don't need to expose.
You mean making _validate_node_name
public specifically? On the one hand it feels odd to make GLTFDocument
a full friend class just to get access to one method (I assume SceneTreeEditor
does more interesting Node manipulation), but if there's a general trend towards being conservative about making utility methods public that's fine as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Glancing at the blame I suspect it was just to facilitate the use in STE
which wants to display an error message if the node name had to be sanitized: 6758b6c
If that's the case it seems like it'd be fine to have it return the string and then do a (admittedly expensive) equality check on the original and returned string in that one spot.
I'm happy to make that change if you think it's worthwhile?
[Edit] We could also have a private version that does an in-place update for use in set_name
since that seems to be the main potentially performance critical usecase?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, missed the later comment regarding moving the validation to String
.
My initial reaction is that approach feels like it smears Node
concerns into a potentially unexpected place (naively if I was looking for node validation and saw a Node
class I'd assume it was there). That said, String
does have things like simplify_path
that arguably would be appropriate in _File
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Poking around a bit more I see that there doesn't appear to be a way to expose a static method via ClassDB so if there's a strong desire to make this bindable, moving it to String
with the signature: String String::validate_node_name() const
(where it returns a copy of the string after sanitizing invalid characters) seems like it'd be the most straightforward approach, though it still feels slightly out of place to me.
@akien-mga I'll wait for direction from you/@reduz before going that path, but in the meantime I've pushed a new commit with a script friendly API in case there's a way to bind static
methods that I've missed. If it's not necessary to bind this method then it'd probably make sense to revert this latest commit to avoid the performance penalty of copying the string.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah the scripting bindings don't support static methods, so we have to work it around for now. That's why we tend to (ab)use the String API for some File and Node checks. If we do implement support for static methods, then it would indeed make sense to have this under Node, but for the time being the String approach is the simplest one for a binding.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense, thanks for the guidance! I've made the change (and added documentation and a basic unit test).
Switched over to leverage Node's validation. Still a couple outstanding issues with Animation handling and the Emoji tests. For Emoji specifically, they are not displayed correctly in Blender or in Godot and they're encoded in the reserved range, causing them to trigger the |
I'll ping @bruvzg and the rocket chat development community. |
If it's triggering this error, the source string is probably UTF-16 (Godot strings are UTF-32) and emojis are encoded as surrogate pairs. It should be decoded using |
Since it's JSON, UTF-16 pair might be stored as "\u0000\u0000" UTF-8 string, we should add handling for this case to the Or it might be some sort of pseudo-UTF-8 (CESU-8, WTF-8 or some other garbage), in this case it's better to strip it. Draft of diff --git a/core/io/json.cpp b/core/io/json.cpp
index bc4527869b..0a6279f9c6 100644
--- a/core/io/json.cpp
+++ b/core/io/json.cpp
@@ -233,6 +233,51 @@ Error JSON::_get_token(const char32_t *p_str, int &index, int p_len, Token &r_to
res |= v;
}
index += 4; //will add at the end anyway
+ if ((res & 0xfffffc00) == 0xd800) {
+ if (p_str[index + 1] != '\\' || p_str[index + 2] != 'u') {
+ r_err_str = "Invalid UTF-16 sequence in string, unpaired lead surrogate";
+ return ERR_PARSE_ERROR;
+ }
+ index += 2;
+ char32_t trail = 0;
+ for (int j = 0; j < 4; j++) {
+ char32_t c = p_str[index + j + 1];
+ if (c == 0) {
+ r_err_str = "Unterminated String";
+ return ERR_PARSE_ERROR;
+ }
+ if (!((c >= '0' && c <= '9') || (c >= 'a' && c <= 'f') || (c >= 'A' && c <= 'F'))) {
+ r_err_str = "Malformed hex constant in string";
+ return ERR_PARSE_ERROR;
+ }
+ char32_t v;
+ if (c >= '0' && c <= '9') {
+ v = c - '0';
+ } else if (c >= 'a' && c <= 'f') {
+ v = c - 'a';
+ v += 10;
+ } else if (c >= 'A' && c <= 'F') {
+ v = c - 'A';
+ v += 10;
+ } else {
+ ERR_PRINT("Bug parsing hex constant.");
+ v = 0;
+ }
+
+ trail <<= 4;
+ trail |= v;
+ }
+ if ((trail & 0xfffffc00) == 0xdc00) {
+ res = (res << 10UL) + trail - ((0xd800 << 10UL) + 0xdc00 - 0x10000);
+ index += 4; //will add at the end anyway
+ } else {
+ r_err_str = "Invalid UTF-16 sequence in string, unpaired lead surrogate";
+ return ERR_PARSE_ERROR;
+ }
+ } else if ((res & 0xfffffc00) == 0xdc00) {
+ r_err_str = "Invalid UTF-16 sequence in string, unpaired trail surrogate";
+ return ERR_PARSE_ERROR;
+ }
} break;
default: { |
It is indeed a UTF-16 pair in the glTF JSON: @bruvzg your attached fix for |
Are there any specific places you wanted more review? I hope that the test suite is able to give you a feedback improvement cycle. |
I think the test suite change is the more interesting bit to review, assuming the general approach here seems reasonable to you. I'll switch that from a WIP to an actual review; there's still a bit of work left to be done and I may need to remove the emoji tests since they won't pass until the fix that @bruvzg suggested is incorporated. |
Well you could add that change to this pr for the sake of easier review. |
Sorry, can you point me at how? I didn't realize you can have a PR span multiple repos (and I'm failing to find out how). |
modules/gltf/gltf_document.cpp
Outdated
|
||
// TODO: Consider adding invalid_characters or a _validate_animation_name to animation_player to mirror Node. | ||
String p_name = _sanitize_scene_name(name); | ||
RegEx regex("([\\[,]+)"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd suggest doing a plain .replace()
for the invalid characters instead of using RegEx
. The regex module can be disabled, and here this would trigger compilation errors if you don't guard it with #ifdef MODULE_REGEX_ENABLED
(that's a pre-existing issue in this class though).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, thanks!
I left the regex handling in _sanitize_bone_name
but can do a followup if we decide to relax sanitization there as well.
To make the link bidirectional: tests are in godotengine/godot-tests#21 |
const String String::invalid_node_name_characters = ". : @ / \""; | ||
|
||
String String::validate_node_name() const { | ||
Vector<String> chars = String::invalid_node_name_characters.split(" "); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know this is pre-existing code but I'm not sure if there's much gain having this as a static constant which then gets split()
on each call. Compilers might be able to optimize this I guess? But we could just as well have const String restricted_chars = ".:@/\"";
inside the function and iterate over the characters directly.
CC @bruvzg if you have a suggestion on how best to do that (don't want to get mixed up with const char *
and char32_t
:)).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah I see invalid_node_name_characters
is used in SceneTreeEditor` to show a nice error message. Makes sense then I guess.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought the same (I'm dubious that this would get memoized by the compiler) but my gut feel is that validate_node_name
is probably not called frequently enough for the split
to have a meaningful impact on performance in typical scenarios. If there are scenarios where very large numbers of nodes are getting created it'd be trivial to modify the error message case (which should be even more rare), but I wanted to keep changes limited to the original bug as much as practical.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Vector<String> chars = String::invalid_node_name_characters.split(" "); | |
// Changes made to the set of invalid characters must also be reflected in the String documentation. | |
const String String::invalid_node_name_characters = ".:@/\""; | |
String String::validate_node_name() const { | |
String invalid_char(&String::invalid_node_name_characters[0], 1); | |
String replacement(""); | |
String name = this->replace(invalid_char, replacement); | |
for (int i = 1; i < String::invalid_node_name_characters.length(); ++i) { | |
invalid_char[0] = String::invalid_node_name_characters[i]; | |
name = name.replace(invalid_char, replacement); | |
} | |
return name; | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Core/scene changes look good to me. Could you squash the commits into one?
Would still like confirmation from @fire that the glTF-specific changes are also good.
Test plan: load test suite once and check for irregularities. Then I’ll approve. Getting there later today. |
f526ffc
to
bbe46d1
Compare
Squashed and repushed. |
bbe46d1
to
61cc1c8
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good.
Order of operations is:
- Submit PR / Merge _get_token changes.
- _get_token change allows conversion to unknown font glyph because the current behavior of invalid text isn't the correct behavior and this is a way to avoid creating mojibake. https://en.wikipedia.org/wiki/Mojibake
- Merge this 45545 pr.
- Merge the tests linked.
Submitted and merged. |
Thanks! And congrats for your first merged Godot contribution 🎉 |
This doesn't seem trivial to cherry-pick for the |
Expands the set of allowed characters in
_sanitize_scene_name
.This is what my team uses to patch around #45544 though it'd probably be better to use some global method to sanitize names consistently across Godot.
Bugsquad edit: Fixes #45544.