non-ascii utf-8 string encoding error #1331

Closed
takahito-tejima opened this issue Sep 17, 2020 · 6 comments

@takahito-tejima
Contributor

takahito-tejima commented Sep 17, 2020

Description of Issue

When I store a non-ASCII UTF-8 string (such as '日本語' or 'ピクサー') as a string attribute in USD, it always gets corrupted when I export to usda.

https://github.com/PixarAnimationStudios/USD/blob/release/pxr/usd/sdf/fileIO_Common.cpp#L679
    static const char* hexdigit = "0123456789abcedf";
The 'e' and 'd' in this table appear to be swapped. Is this intentional? (I hope not...)

Steps to Reproduce

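from pxr import Usd, Sdf

stage = Usd.Stage.CreateInMemory()  # assumed setup; any stage should do
prim = stage.DefinePrim('/prim')
prim.CreateAttribute('str', Sdf.ValueTypeNames.String).Set('ピクサー')
print(stage.ExportToString())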
(snip)

custom string str = "\xd3\x83\x94\xd3\x82\xaf\xd3\x82\xb5\xd3\x83\xbc"
This should be:
custom string str = "\xe3\x83\x94\xe3\x82\xaf\xe3\x82\xb5\xe3\x83\xbc"
or, ideally:
custom string str = 'ピクサー'
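
To illustrate how the swapped digit table would produce exactly this corruption, here is a minimal Python sketch. It is not the USD implementation: `escape_bytes` is a hypothetical helper that escapes every byte, and only the two table strings come from this report.

```python
def escape_bytes(data: bytes, hexdigits: str) -> str:
    """Write every byte as a two-digit \\x escape using the given digit table."""
    return "".join(
        "\\x" + hexdigits[b >> 4] + hexdigits[b & 0xF] for b in data
    )

SWAPPED_TABLE = "0123456789abcedf"  # table quoted in this report ('e' and 'd' swapped)
CORRECT_TABLE = "0123456789abcdef"  # standard ordering

raw = "ピクサー".encode("utf-8")
corrupted = escape_bytes(raw, SWAPPED_TABLE)
expected = escape_bytes(raw, CORRECT_TABLE)

print(corrupted)  # every high nibble 0xe is written as 'd'
print(expected)
```

With the swapped table, each leading byte 0xe3 is written as `\xd3`, which matches the corrupted output in this report. Any byte containing a 0xd or 0xe nibble would be affected the same way.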

System Information (OS, Hardware)

Package Versions

20.08

Build Flags

@takahito-tejima
Contributor Author

Correction: exporting to usdc is fine (sorry!). Only usda is affected.

@takahito-tejima
Contributor Author

Another example. This one may need a different fix.

t1.usda

#usda 1.0
def "prim"
{
    string s = "É00"
}

usdcat t1.usda > t2.usda

#usda 1.0
def "prim"
{
    string s = "\xc3\x8900"     # this makes sense.  É = 0xc389
}

usdcat t2.usda

#usda 1.0
def "prim"
{
    string s = "\xc3\x00"       # I guess char(\x8900) becomes \00 ?
}
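
The guess in the last comment can be checked with a small Python sketch of a C-style "greedy" hex-escape reader. This is an illustration only, not the usda parser: `decode_greedy` is a hypothetical function, and truncating the value to its low 8 bits is an assumption about how an overlong escape ends up as a single char.

```python
HEX = "0123456789abcdefABCDEF"

def decode_greedy(text: str) -> bytes:
    """Decode \\x escapes, consuming every hex digit that follows (C-style)."""
    out = bytearray()
    i = 0
    while i < len(text):
        j = i + 2
        if text.startswith("\\x", i):
            while j < len(text) and text[j] in HEX:
                j += 1
        if j > i + 2:
            # Assumption: an overlong value is truncated to its low 8 bits.
            out.append(int(text[i + 2:j], 16) & 0xFF)
            i = j
        else:
            # Not an escape (or no digits after \x): copy the character as-is.
            out += text[i].encode("utf-8")
            i += 1
    return bytes(out)

result = decode_greedy(r"\xc3\x8900")
print(result)
```

Under these assumptions the second escape is read as `\x8900`, whose low byte is 0x00, so the string decodes to the two bytes 0xc3 0x00, consistent with the output of the second `usdcat`.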

@jtran56

jtran56 commented Sep 17, 2020

Filed as internal issue #USD-6353

@c64kernal
Contributor

Hey @takahito-tejima -- thanks so much! Great catch, we'll try to get this fixed up as soon as we can!

@takahito-tejima
Contributor Author

Thank you for the fix!
But I don't think that change addresses the second case ("É00" becomes "\xc3\x00" instead of "\xc3\x8900").
Is there another fix for that one in 20.11?

@gitamohr
Contributor

Oh! I missed that second case, Takahito. Sorry about that. :-/ I will take a look and see what's going on there.

pixar-oss pushed a commit that referenced this issue Nov 4, 2020
long hex escape sequences.  This was added to match C's treatment of
escape sequences in string literals.
Unfortunately this means you cannot have a string with a hex code
followed by characters that are valid hex digits.  For example, the
sequence "\x02defaced" would be treated as a single character.  In C you
can work around this by breaking the literal into two, since they get
concatenated after escapes are evaluated.  You could write this example
as "\x02" "defaced".

But this feature does harm rather than good, and no current code relies
on this behavior so we're changing it.  Now we limit hex constants to at
most two digits, and we encourage encoders always to write two digits to
ensure the above confusion cannot occur.
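
The two-digit rule described above can be sketched in Python as follows. This is a rough illustration under stated assumptions, not the actual change: `decode_two_digit` and `encode_byte` are hypothetical names, and the sketch ignores other escapes such as an escaped backslash.

```python
import re

# At most two hex digits per escape, as the commit message describes.
_ESCAPE = re.compile(r"\\x([0-9a-fA-F]{1,2})")

def decode_two_digit(text: str) -> bytes:
    """Decode \\x escapes, reading no more than two hex digits each."""
    out = bytearray()
    pos = 0
    for m in _ESCAPE.finditer(text):
        out += text[pos:m.start()].encode("utf-8")  # literal text before the escape
        out.append(int(m.group(1), 16))             # value always fits in one byte
        pos = m.end()
    out += text[pos:].encode("utf-8")               # trailing literal text
    return bytes(out)

def encode_byte(b: int) -> str:
    """Always write exactly two digits so following hex-like characters are safe."""
    return f"\\x{b:02x}"

print(decode_two_digit(r"\xc3\x8900"))
print(decode_two_digit(r"\x02defaced"))
print(encode_byte(2))
```

With this rule, `"\xc3\x8900"` decodes back to the UTF-8 bytes of "É00", and `"\x02defaced"` is one byte followed by the text "defaced". A one-digit escape such as `\x2` followed by "defaced" would still absorb the "d", which is why the encoder side always writes two digits.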

Fixes #1331

(Internal change: 2121412)
pixar-oss pushed a commit that referenced this issue Nov 4, 2020
…ed by

characters that are hex digits but not part of the hex code correctly.

Fixes #1331

(Internal change: 2121413)
kohakukun pushed a commit to autodesk-forks/USD that referenced this issue Oct 21, 2024