non-ascii utf-8 string encoding error #1331

takahito-tejima · 2020-09-17T09:07:55Z

Description of Issue

When I store a non-ASCII UTF-8 string (such as '日本語' or 'ピクサー') as a string attribute in USD, it always gets corrupted when I export to usda.

https://github.com/PixarAnimationStudios/USD/blob/release/pxr/usd/sdf/fileIO_Common.cpp#L679
static const char* hexdigit = "0123456789abcedf";
This table looks shuffled: 'e' and 'd' are swapped. Is this intentional? (I hope not...)

Steps to Reproduce

from pxr import Usd, Sdf

stage = Usd.Stage.CreateInMemory()
prim = stage.DefinePrim('/prim')
prim.CreateAttribute('str', Sdf.ValueTypeNames.String).Set('ピクサー')
print(stage.ExportToString())
(snip)

custom string str = "\xd3\x83\x94\xd3\x82\xaf\xd3\x82\xb5\xd3\x83\xbc"
this must be
custom string str = "\xe3\x83\x94\xe3\x82\xaf\xe3\x82\xb5\xe3\x83\xbc"
or ideally
custom string str = 'ピクサー'
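
The corrupted output above is consistent with the swapped table. Here is a minimal sketch, assuming the writer emits each non-ASCII byte as \x followed by two table lookups (high nibble, then low nibble). The helper escape_non_ascii and the two table names are hypothetical and are not the actual code in fileIO_Common.cpp.

```python
# Table as quoted in the issue: 'e' and 'd' are swapped.
BUGGY_HEXDIGITS = "0123456789abcedf"
# Table in the expected order.
FIXED_HEXDIGITS = "0123456789abcdef"


def escape_non_ascii(text: str, hexdigits: str) -> str:
    """Hypothetical helper: write each non-ASCII UTF-8 byte as \\xNN."""
    out = []
    for byte in text.encode("utf-8"):
        if byte < 0x80:
            out.append(chr(byte))
        else:
            # High nibble first, then low nibble, both looked up in the table.
            out.append("\\x" + hexdigits[byte >> 4] + hexdigits[byte & 0xF])
    return "".join(out)


buggy = escape_non_ascii("ピクサー", BUGGY_HEXDIGITS)
fixed = escape_non_ascii("ピクサー", FIXED_HEXDIGITS)
print(buggy)
print(fixed)
```

With the swapped table every 0xe nibble is written as 'd', so the lead byte 0xe3 of each character comes out as \xd3, while bytes without a 0xd or 0xe nibble are unaffected.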

System Information (OS, Hardware)

Package Versions

20.08

Build Flags

takahito-tejima · 2020-09-17T09:35:10Z

Correction: exporting to usdc is fine (sorry!). Only usda is affected.

takahito-tejima · 2020-09-17T11:45:14Z

another example. this may need a different fix.

t1.usda

#usda 1.0
def "prim"
{
    string s = "É00"
}

usdcat t1.usda > t2.usda

#usda 1.0
def "prim"
{
    string s = "\xc3\x8900"     # this makes sense.  É = 0xc389
}

usdcat t2.usda

#usda 1.0
def "prim"
{
    string s = "\xc3\x00"       # I guess char(\x8900) becomes \00 ?
}
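
The guess in the comment above can be checked with a small sketch. It assumes the reader lets \x consume every hex digit that follows and then keeps only the low 8 bits of the value. The function decode_greedy_hex is hypothetical and only illustrates that assumption; it is not the USD parser.

```python
HEX_CHARS = set("0123456789abcdefABCDEF")


def decode_greedy_hex(escaped: str) -> bytes:
    """Hypothetical decoder: \\x swallows all following hex digits,
    then the value is truncated to a single byte."""
    out = bytearray()
    i = 0
    while i < len(escaped):
        if escaped.startswith("\\x", i):
            j = i + 2
            while j < len(escaped) and escaped[j] in HEX_CHARS:
                j += 1
            if j > i + 2:
                # Keep only the low 8 bits, as a narrowing to char would.
                out.append(int(escaped[i + 2:j], 16) & 0xFF)
                i = j
                continue
        # Anything else is copied through as UTF-8 bytes.
        out.extend(escaped[i].encode("utf-8"))
        i += 1
    return bytes(out)


result = decode_greedy_hex(r"\xc3\x8900")
print(result)
```

Under that assumption \x8900 is read as the single value 0x8900, whose low byte is 0x00, which matches the "\xc3\x00" seen in the second round trip.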

jtran56 · 2020-09-17T23:34:57Z

Filed as internal issue #USD-6353

c64kernal · 2020-09-22T20:41:34Z

Hey @takahito-tejima -- thanks so much! Great catch, we'll try to get this fixed up as soon as we can!

takahito-tejima · 2020-10-22T11:52:08Z

thank you for the fix!
But I don't think that change addresses the second case ("É00" becomes "\xc3\x00" instead of "\xc3\x89\x00").
Is there another fix for that one in 20.11?

gitamohr · 2020-10-22T16:27:48Z

Oh! I missed that second case Takahito. Sorry about that. :-/ I will take a look and see what's going on there.

…long hex escape sequences. This was added to match C's treatment of escape sequences in string literals. Unfortunately this means you cannot have a string with a hex code followed by characters that are valid hex digits. For example, the sequence "\x02defaced" would be treated as a single character. In C you can work around this by breaking the literal into two, since they get concatenated after escapes are evaluated. You could write this example as "\x02" "defaced".

But this feature does harm rather than good, and no current code relies on this behavior so we're changing it. Now we limit hex constants to at most two digits, and we encourage encoders always to write two digits to ensure the above confusion cannot occur. Fixes #1331 (Internal change: 2121412)
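
The rule described in this commit message (read at most two hex digits, always write two) can be sketched as follows. Both functions are hypothetical illustrations of that rule, not the Sdf implementation, and the choice of which bytes to escape is an assumption made for the example.

```python
HEX_CHARS = set("0123456789abcdefABCDEF")


def encode_two_digit(data: bytes) -> str:
    """Hypothetical encoder: always write exactly two hex digits."""
    out = []
    for b in data:
        # Assumption: printable ASCII except '"' and '\\' is written as-is.
        if 0x20 <= b < 0x7F and b not in (0x22, 0x5C):
            out.append(chr(b))
        else:
            out.append("\\x%02x" % b)
    return "".join(out)


def decode_two_digit(escaped: str) -> bytes:
    """Hypothetical decoder: \\x consumes at most two hex digits."""
    out = bytearray()
    i = 0
    while i < len(escaped):
        if escaped.startswith("\\x", i):
            j = i + 2
            while j < len(escaped) and j < i + 4 and escaped[j] in HEX_CHARS:
                j += 1
            if j > i + 2:
                out.append(int(escaped[i + 2:j], 16))
                i = j
                continue
        out.extend(escaped[i].encode("utf-8"))
        i += 1
    return bytes(out)


original = "É00".encode("utf-8")
text = encode_two_digit(original)
print(text)
print(decode_two_digit(text) == original)
print(decode_two_digit(r"\x02defaced"))
```

With the two-digit limit, "\x02defaced" decodes to the byte 0x02 followed by the literal text "defaced", and the "É00" string survives a round trip because the trailing "00" is no longer absorbed into the escape.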

sunyab added the pending push label Sep 25, 2020

sunyab added pending release and removed pending push labels Sep 30, 2020

pixar-oss closed this as completed in 0838eca Oct 14, 2020

sunyab removed the pending release label Oct 15, 2020

pixar-oss pushed a commit that referenced this issue Nov 4, 2020

sdf: Add a test to ensure that we round-trip escaped hex-codes follow…

ad73414

…ed by characters that are hex digits but not part of the hex code correctly. Fixes #1331 (Internal change: 2121413)

This was referenced Apr 20, 2023

Replace references to ASCII with Text in USD binaries to account for UTF-8 support #2394

Merged

Replace documentation references to "ASCII" with "text" #2395

Merged

kohakukun pushed a commit to autodesk-forks/USD that referenced this issue Oct 21, 2024

[wasm building on windows (PixarAnimationStudios#1331)

8d6e4c9
