Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Relaxes node name sanitization in gltf documents. #45545

Merged

Conversation

abaire
Copy link
Contributor

@abaire abaire commented Jan 28, 2021

Expands the set of allowed characters in _sanitize_scene_name.

This is what my team uses to patch around #45544 though it'd probably be better to use some global method to sanitize names consistently across Godot.

Bugsquad edit: Fixes #45544.

@Calinou Calinou added bug cherrypick:3.x Considered for cherry-picking into a future 3.x release topic:import labels Jan 28, 2021
@Calinou Calinou added this to the 4.0 milestone Jan 28, 2021
@akien-mga akien-mga requested a review from a team January 28, 2021 22:09
@fire
Copy link
Member

fire commented Jan 28, 2021

What test cases did you use to show it works?

@abaire
Copy link
Contributor Author

abaire commented Jan 29, 2021

Tested with a few glb's generated via Blender (one of which is attached to #45544). The more extensive one has a cube named Cube=!%^&*(){}\\[\\]|<>,~ and a torus named Torus=!%^&*(){}\\[\\]|<>,~-noimp, just verified that the cube is imported as a MeshInstance w/ the full name and the torus is not imported at all.

This was intended as a demonstration of the workaround we're using for #45544 (on the 3.2 branch) and a discussion point more than a fully tested patch. Unfortunately I don't know enough about actual limitations on node names in Godot other than what I found enforced here:

bool Node::_validate_node_name(String &p_name) {
. I'm not sure what the intent was behind the original sanitization (which looks like it initially broke import hints and was relaxed in a followup, and I also found that preserving # breaks hints but haven't tracked down why) so I'm not sure what cases to look out for.

I'd be happy to dig/test further if I can get a pointer in the right direction.

Wrt. non-English names there should be no change in the existing behavior; anything outside of ASCII and the explicitly listed symbolic characters would still be stripped.

@fire
Copy link
Member

fire commented Jan 30, 2021

We had some fun discussion about adding the 💩 emoji and Japanese / Korean / Chinese / etc node names. So that's undecided.

However https://github.com/KhronosGroup/glTF/tree/master/specification/2.0#json-encoding says

Implementation Note: This allows generic glTF client implementations to not have full Unicode support. Application-specific strings (e.g., values of "name" properties or content of extras fields) may use any symbols.

However, they will still need to be unique.

Edited:

This expands the list of characters and it's something is hard to discover unless you read or try it, I can understand expanding to other languages, but these list of symbols have other meanings in code, not sure about the interaction with $ get node.

Want to have some other opinions, so I'll get back to you.

@abaire
Copy link
Contributor Author

abaire commented Jan 30, 2021

This expands the list of characters and it's something is hard to discover unless you read or try it, I can understand expanding to other languages, but these list of symbols have other meanings in code, not sure about the interaction with $ get node.

Totally agree, I'm not at all sure what ill effects there could be and would definitely appreciate the guidance.

Quickly tested the get_node behavior for the symbols in this PR:

	var t1 = $"normal=!%^&*(){}[]|<>,~"
	var t2 = get_node("normal=!%^&*(){}[]|<>,~")

and they are resolved correctly.

That being said, if there are any other consequences of allowing these symbols to be part of the node names I think it'd be good to capture them in node.cpp's invalid_character list, since those seem to be the only things that can't be entered when renaming nodes in the editor.

I think adding unicode support would also be great, having this sanitization apply the superset of node.cpp and glTF limitations feels like it'd be the best user experience.
Quickly testing in the main and 3.2 branches it looks like attempting to add a unicode emoji into a node name isn't prevented, but it's also not displayed correctly in the editor (whereas adding Japanese is correctly displayed). Similar to the symbols added in this PR, both emoji and Japanese will cause $ to fail unless the name is quoted (which appears to work as expected, even with the incorrect display of emoji), though I noticed that in master having a name "文字" causes the editor to behave abnormally.

@akien-mga
Copy link
Member

Some comments I put in the Godot Contributors chat:

I'm not sure why we have to whitelist specific characters instead of just letting Node::set_name() do the validation? It calls Node::_validate_node_name() which strips:

String Node::invalid_character = ". : @ / \"";

The only problem would be if a node has an empty name once validated (e.g. if the name is .:@).
Or are there additional characters forbidden in the spec?

@abaire
Copy link
Contributor Author

abaire commented Feb 2, 2021

Some comments I put in the Godot Contributors chat:

I'm not sure why we have to whitelist specific characters instead of just letting Node::set_name() do the validation? It calls Node::_validate_node_name() which strips:

String Node::invalid_character = ". : @ / \"";

The only problem would be if a node has an empty name once validated (e.g. if the name is .:@).
Or are there additional characters forbidden in the spec?

Totally agree. I'd hope that we can assume any additional forbidden chars in the glTF spec would've already been stripped/replaced/whatever'd by the time the importer is run (unless you mean the Godot node spec, in which case I'd expect set_name to strip them).

I did notice that names including # caused issues with the importer hint processing but never got around to tracking down why.

@fire
Copy link
Member

fire commented Feb 5, 2021

Anyone opposing:

Letting Node::set_name() do the validation?

Note this should probably match skeleton validation.

@abaire
Copy link
Contributor Author

abaire commented Feb 6, 2021

Anyone opposing:

Letting Node::set_name() do the validation?

No objection from my side. We'd just have to track down and fix the behavior I observed with '#' breaking import hints. I'm happy to take a pass at that.

Note this should probably match skeleton validation.

I agree in principle that it's a better user experience to be as permissive as possible but haven't looked at the bone code myself. My understanding is that they're not Node's so I'm not sure if they need something more restrictive than the Node enforcement. That said, I think it'd be a big improvement to centralize bone name sanitization across all the import paths as well for the sake of consistency/ease of documentation.
Do you recall the reason behind the current sanitization approach for bones?

Also, is there a mechanism to test the import behavior automatically that I should leverage?

@fire
Copy link
Member

fire commented Feb 8, 2021

See https://github.com/godotengine/godot-tests/tree/master/tests/blend_export for some automated testing. You can add new test cases.

@abaire abaire force-pushed the relaxes_gltf_name_sanitization branch 3 times, most recently from a83d4c1 to af7c199 Compare February 13, 2021 16:41
@@ -187,7 +187,9 @@ class Node : public Object {

#ifdef TOOLS_ENABLED
friend class SceneTreeEditor;
friend class GLTFDocument;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious as to whether _validate_node_name should just be made public instead.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is anything else using it? Do we want other things to use it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think anything else is using it today, but I'm also not sure if there are other cases where sanitization is being performed (maybe in the other import modules?). In my mind it'd be better for consistency's sake if we'd have everything that generates a Node and needs fine control over naming would leverage _validate_node_name so any future modifications to expand/reduce the restricted characters would be applied uniformly.

Similarly since AnimationPlayer adds a couple additional restrictions, I think we would ideally have a similar public static _validate_animation_name that can perform animation-specific cleanup.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would try hard to find at least one other use or invent a use because if nothing uses it, we don't need to expose.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would try hard to find at least one other use or invent a use because if nothing uses it, we don't need to expose.

You mean making _validate_node_name public specifically? On the one hand it feels odd to make GLTFDocument a full friend class just to get access to one method (I assume SceneTreeEditor does more interesting Node manipulation), but if there's a general trend towards being conservative about making utility methods public that's fine as well.

Copy link
Contributor Author

@abaire abaire Feb 22, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Glancing at the blame I suspect it was just to facilitate the use in STE which wants to display an error message if the node name had to be sanitized: 6758b6c
If that's the case it seems like it'd be fine to have it return the string and then do a (admittedly expensive) equality check on the original and returned string in that one spot.

I'm happy to make that change if you think it's worthwhile?

[Edit] We could also have a private version that does an in-place update for use in set_name since that seems to be the main potentially performance critical usecase?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, missed the later comment regarding moving the validation to String.
My initial reaction is that approach feels like it smears Node concerns into a potentially unexpected place (naively if I was looking for node validation and saw a Node class I'd assume it was there). That said, String does have things like simplify_path that arguably would be appropriate in _File.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Poking around a bit more I see that there doesn't appear to be a way to expose a static method via ClassDB so if there's a strong desire to make this bindable, moving it to String with the signature: String String::validate_node_name() const (where it returns a copy of the string after sanitizing invalid characters) seems like it'd be the most straightforward approach, though it still feels slightly out of place to me.

@akien-mga I'll wait for direction from you/@reduz before going that path, but in the meantime I've pushed a new commit with a script friendly API in case there's a way to bind static methods that I've missed. If it's not necessary to bind this method then it'd probably make sense to revert this latest commit to avoid the performance penalty of copying the string.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah the scripting bindings don't support static methods, so we have to work it around for now. That's why we tend to (ab)use the String API for some File and Node checks. If we do implement support for static methods, then it would indeed make sense to have this under Node, but for the time being the String approach is the simplest one for a binding.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, thanks for the guidance! I've made the change (and added documentation and a basic unit test).

@abaire
Copy link
Contributor Author

abaire commented Feb 13, 2021

Switched over to leverage Node's validation. Still a couple outstanding issues with Animation handling and the Emoji tests.

For Emoji specifically, they are not displayed correctly in Blender or in Godot and they're encoded in the reserved range, causing them to trigger the Unicode parsing error: Invalid unicode codepoint handlers in ustring.cpp. This results in the creation of nodes that display but can't easily be referenced via getNode since the emoji chars get sanitized to the replacement character.
@fire do you think we should explicitly strip Emoji in the sanitizers or just leave it as semi-supported behavior?

@fire
Copy link
Member

fire commented Feb 13, 2021

I'll ping @bruvzg and the rocket chat development community.

@bruvzg
Copy link
Member

bruvzg commented Feb 13, 2021

Unicode parsing error: Invalid unicode codepoint

If it's triggering this error, the source string is probably UTF-16 (Godot strings are UTF-32) and emojis are encoded as surrogate pairs. It should be decoded using String::utf16(const char16_t *p_utf16, int p_len) method.

@bruvzg
Copy link
Member

bruvzg commented Feb 13, 2021

And there is no emoji font in the editor, but the correct string should not give any errors, emoji will be displayed as it hex code: Screenshot 2021-02-13 at 20 25 44

@bruvzg
Copy link
Member

bruvzg commented Feb 13, 2021

If it's triggering this error, the source string is probably UTF-16

Since it's JSON, UTF-16 pair might be stored as "\u0000\u0000" UTF-8 string, we should add handling for this case to the JSON::_get_token function (right now it's reading it as is, which will cause Invalid unicode codepoint errors).

Or it might be some sort of pseudo-UTF-8 (CESU-8, WTF-8 or some other garbage), in this case it's better to strip it.

Draft of JSON::get token change (untested):

diff --git a/core/io/json.cpp b/core/io/json.cpp
index bc4527869b..0a6279f9c6 100644
--- a/core/io/json.cpp
+++ b/core/io/json.cpp
@@ -233,6 +233,51 @@ Error JSON::_get_token(const char32_t *p_str, int &index, int p_len, Token &r_to
 									res |= v;
 								}
 								index += 4; //will add at the end anyway
+								if ((res & 0xfffffc00) == 0xd800) {
+									if (p_str[index + 1] != '\\' || p_str[index + 2] != 'u') {
+										r_err_str = "Invalid UTF-16 sequence in string, unpaired lead surrogate";
+										return ERR_PARSE_ERROR;
+									}
+									index += 2;
+									char32_t trail = 0;
+									for (int j = 0; j < 4; j++) {
+										char32_t c = p_str[index + j + 1];
+										if (c == 0) {
+											r_err_str = "Unterminated String";
+											return ERR_PARSE_ERROR;
+										}
+										if (!((c >= '0' && c <= '9') || (c >= 'a' && c <= 'f') || (c >= 'A' && c <= 'F'))) {
+											r_err_str = "Malformed hex constant in string";
+											return ERR_PARSE_ERROR;
+										}
+										char32_t v;
+										if (c >= '0' && c <= '9') {
+											v = c - '0';
+										} else if (c >= 'a' && c <= 'f') {
+											v = c - 'a';
+											v += 10;
+										} else if (c >= 'A' && c <= 'F') {
+											v = c - 'A';
+											v += 10;
+										} else {
+											ERR_PRINT("Bug parsing hex constant.");
+											v = 0;
+										}
+
+										trail <<= 4;
+										trail |= v;
+									}
+									if ((trail & 0xfffffc00) == 0xdc00) {
+										res = (res << 10UL) + trail - ((0xd800 << 10UL) + 0xdc00 - 0x10000);
+										index += 4; //will add at the end anyway
+									} else {
+										r_err_str = "Invalid UTF-16 sequence in string, unpaired lead surrogate";
+										return ERR_PARSE_ERROR;
+									}
+								} else if ((res & 0xfffffc00) == 0xdc00) {
+									r_err_str = "Invalid UTF-16 sequence in string, unpaired trail surrogate";
+									return ERR_PARSE_ERROR;
+								}
 
 							} break;
 							default: {

@abaire
Copy link
Contributor Author

abaire commented Feb 13, 2021

If it's triggering this error, the source string is probably UTF-16

Since it's JSON, UTF-16 pair might be stored as "\u0000\u0000" UTF-8 string, we should add handling for this case to the JSON::_get_token function (right now it's reading it as is, which will cause Invalid unicode codepoint errors).

It is indeed a UTF-16 pair in the glTF JSON: "name" : "Basic_Emoji_\ud83d\ude00\ud83e\udd16",

@bruvzg your attached fix for _get_token fixes this case, the string is now correctly displayed with the hex codes for the two emoji characters - thanks! Is that change something you're working on in a separate PR that I should rebase off of? I can also just disable my emoji-specific test cases until the _get_token improvement is available.

@fire
Copy link
Member

fire commented Feb 14, 2021

Are there any specific places you wanted more review? I hope that the test suite is able to give you a feedback improvement cycle.

@abaire
Copy link
Contributor Author

abaire commented Feb 14, 2021

Are there any specific places you wanted more review? I hope that the test suite is able to give you a feedback improvement cycle.

I think the test suite change is the more interesting bit to review, assuming the general approach here seems reasonable to you. I'll switch that from a WIP to an actual review; there's still a bit of work left to be done and I may need to remove the emoji tests since they won't pass until the fix that @bruvzg suggested is incorporated.

@fire
Copy link
Member

fire commented Feb 14, 2021

Well you could add that change to this pr for the sake of easier review.

@abaire
Copy link
Contributor Author

abaire commented Feb 15, 2021

Well you could add that change to this pr for the sake of easier review.

Sorry, can you point me at how? I didn't realize you can have a PR span multiple repos (and I'm failing to find out how).


// TODO: Consider adding invalid_characters or a _validate_animation_name to animation_player to mirror Node.
String p_name = _sanitize_scene_name(name);
RegEx regex("([\\[,]+)");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd suggest doing a plain .replace() for the invalid characters instead of using RegEx. The regex module can be disabled, and here this would trigger compilation errors if you don't guard it with #ifdef MODULE_REGEX_ENABLED (that's a pre-existing issue in this class though).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, thanks!
I left the regex handling in _sanitize_bone_name but can do a followup if we decide to relax sanitization there as well.

@abaire abaire requested a review from a team as a code owner February 22, 2021 16:59
@abaire
Copy link
Contributor Author

abaire commented Feb 22, 2021

To make the link bidirectional: tests are in godotengine/godot-tests#21

@abaire abaire requested review from a team as code owners February 22, 2021 21:38
const String String::invalid_node_name_characters = ". : @ / \"";

String String::validate_node_name() const {
Vector<String> chars = String::invalid_node_name_characters.split(" ");
Copy link
Member

@akien-mga akien-mga Feb 23, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this is pre-existing code but I'm not sure if there's much gain having this as a static constant which then gets split() on each call. Compilers might be able to optimize this I guess? But we could just as well have const String restricted_chars = ".:@/\""; inside the function and iterate over the characters directly.

CC @bruvzg if you have a suggestion on how best to do that (don't want to get mixed up with const char * and char32_t :)).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I see invalid_node_name_characters is used in SceneTreeEditor` to show a nice error message. Makes sense then I guess.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought the same (I'm dubious that this would get memoized by the compiler) but my gut feel is that validate_node_name is probably not called frequently enough for the split to have a meaningful impact on performance in typical scenarios. If there are scenarios where very large numbers of nodes are getting created it'd be trivial to modify the error message case (which should be even more rare), but I wanted to keep changes limited to the original bug as much as practical.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Vector<String> chars = String::invalid_node_name_characters.split(" ");
// Changes made to the set of invalid characters must also be reflected in the String documentation.
const String String::invalid_node_name_characters = ".:@/\"";
String String::validate_node_name() const {
String invalid_char(&String::invalid_node_name_characters[0], 1);
String replacement("");
String name = this->replace(invalid_char, replacement);
for (int i = 1; i < String::invalid_node_name_characters.length(); ++i) {
invalid_char[0] = String::invalid_node_name_characters[i];
name = name.replace(invalid_char, replacement);
}
return name;
}

doc/classes/String.xml Outdated Show resolved Hide resolved
Copy link
Member

@akien-mga akien-mga left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Core/scene changes look good to me. Could you squash the commits into one?

Would still like confirmation from @fire that the glTF-specific changes are also good.

@fire
Copy link
Member

fire commented Feb 24, 2021

Test plan: load test suite once and check for irregularities. Then I’ll approve. Getting there later today.

@abaire abaire force-pushed the relaxes_gltf_name_sanitization branch from f526ffc to bbe46d1 Compare February 24, 2021 15:23
@abaire
Copy link
Contributor Author

abaire commented Feb 24, 2021

Squashed and repushed.
@fire as a heads up the Emoji tests are still dependent on the _get_token patch from @bruvzg which is not included in this PR (so I'd expect those 3 cases to fail). I can either disable those tests, update them to expect the unicode replacement char (matching the current behavior), or let them fail with the expectation that _get_token will be fixed in the near future.

@abaire abaire force-pushed the relaxes_gltf_name_sanitization branch from bbe46d1 to 61cc1c8 Compare February 24, 2021 16:23
Copy link
Member

@fire fire left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good.

Order of operations is:

  1. Submit PR / Merge _get_token changes.
  2. _get_token change allows conversion to unknown font glyph because the current behavior of invalid text isn't the correct behavior and this is a way to avoid creating mojibake. https://en.wikipedia.org/wiki/Mojibake
  3. Merge this 45545 pr.
  4. Merge the tests linked.

@bruvzg
Copy link
Member

bruvzg commented Mar 1, 2021

Submit PR / Merge _get_token changes.

Submitted and merged.

@akien-mga akien-mga merged commit 83b1acd into godotengine:master Mar 9, 2021
@akien-mga
Copy link
Member

akien-mga commented Mar 9, 2021

Thanks! And congrats for your first merged Godot contribution 🎉

@akien-mga
Copy link
Member

This doesn't seem trivial to cherry-pick for the 3.2 branch, so it will likely require a dedicated backport PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

glTF import process strips most non-alphanumeric characters when generating Nodes
5 participants