fix(#456): fix default charset problem #457

GodMeowIceSun · 2024-05-31T07:19:19Z

This PR is the fix for issue #456.

Signed-off-by: GodMeowIceSun <icesun@icesun.cn>

src/main/java/org/owasp/validator/html/util/ErrorMessageUtil.java

davewichers · 2024-06-03T15:22:51Z

@spassarop - Have you looked at this issue/proposed fix yet? If there is a real problem here it would be nice to fix this and include it in the release we are close to getting done related to the neko-html fix.

kwwall · 2024-06-06T05:22:49Z

ISO-8859-1 is the default encoding for Windows, but just about every other modern OS uses UTF-8. It can especially bite you in the butt when using String.getBytes(). I recommend changing it to use UTF-8. See https://stackoverflow.com/questions/7048745/what-is-the-difference-between-utf-8-and-iso-8859-1#7048774 for details of how they are different.
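The String.getBytes() pitfall mentioned above can be sketched as follows. This is only an illustration: the class name GetBytesDemo and the two-character Chinese sample string are made up here and do not come from AntiSamy. Passing an explicit charset removes the dependence on the platform default, and the ISO-8859-1 call shows what happens to characters that encoding cannot represent.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class GetBytesDemo {
    // Two Chinese characters, written as escapes so that the source file
    // itself does not depend on any particular encoding.
    static final String SAMPLE = "\u4e2d\u6587";

    // Explicit charset: the result is the same on every platform.
    static byte[] asUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // ISO-8859-1 cannot represent these characters, so each one is
    // replaced with the byte for '?' (63) and the text is lost.
    static byte[] asLatin1(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(asUtf8(SAMPLE)));
        System.out.println(Arrays.toString(asLatin1(SAMPLE)));
        // The UTF-8 round trip preserves the text.
        String back = new String(asUtf8(SAMPLE), StandardCharsets.UTF_8);
        System.out.println(back.equals(SAMPLE));
    }
}
```

Calling getBytes() with no argument uses the platform default charset instead, which is where the cross-platform differences come from.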

-kevin

spassarop · 2024-06-06T11:40:06Z

I agree with Kevin. The default Windows encoding was probably used for the policy XML because that file was unlikely to contain characters outside that encoding. I am not sure why there were error messages in Chinese that were never tested to see whether the text would display correctly; if the display had been tested, it would have looked obviously broken. And if an environment does not support UTF-8, AntiSamy usage is probably not its main issue. So I share Kevin's suggestion to update the encoding to UTF-8 in this code change and test it.

GodMeowIceSun · 2024-06-06T11:54:33Z

@spassarop wrote:

I agree with Kevin. The default Windows encoding was probably used for the policy XML because that file was unlikely to contain characters outside that encoding.

I am not sure why there were error messages in Chinese that were never tested to see whether the text would display correctly; if the display had been tested, it would have looked obviously broken.

And if an environment does not support UTF-8, AntiSamy usage is probably not its main issue.

So I share Kevin's suggestion to update the encoding to UTF-8 in this code change and test it.

Perhaps you can take a look at the document I referenced. It shows by implication that in JDK 8 the default encoding used by ResourceBundle was ISO-8859-1. Our project is compatible with Java 8+ and uses ResourceBundle for internationalization, so this may be the core reason for the problem. I don't think changing the encoding directly to UTF-8 will solve it. Perhaps different JDK versions need to be treated differently, or ISO-8859-1 should be kept as the default encoding on the Java side, although that would also only be a temporary solution.

oracle jdk9 doc
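One way to avoid depending on the JDK's default for .properties files is to decode the file explicitly and hand a Reader to PropertyResourceBundle, a constructor that is available on Java 8 as well. The sketch below only illustrates that idea and is not the change made in this PR; the class name Utf8BundleDemo, the key error.generic, and the sample value are hypothetical, and an in-memory byte array stands in for a real file.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.PropertyResourceBundle;
import java.util.ResourceBundle;

final class Utf8BundleDemo {
    // Hypothetical two-character Chinese message value, written as escapes.
    static final String VALUE = "\u9519\u8bef";

    // Stand-in for a UTF-8 encoded .properties file on disk.
    static byte[] utf8PropertiesFile() {
        return ("error.generic=" + VALUE + "\n").getBytes(StandardCharsets.UTF_8);
    }

    // Decode the bytes as UTF-8 ourselves, so the result does not depend on
    // which charset the running JDK assumes for .properties files.
    static ResourceBundle loadAsUtf8(byte[] fileBytes) {
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(fileBytes), StandardCharsets.UTF_8)) {
            return new PropertyResourceBundle(reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        ResourceBundle bundle = loadAsUtf8(utf8PropertiesFile());
        // Prints true when the value survived decoding intact.
        System.out.println(VALUE.equals(bundle.getString("error.generic")));
    }
}
```

With this approach the bundle files themselves must actually be stored as UTF-8, which is a separate step from the code change.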

kwwall · 2024-06-06T16:26:55Z

@GodMeowIceSun wrote:

Perhaps you can take a look at the document I referenced. It shows by implication that in JDK 8 the default encoding used by ResourceBundle was ISO-8859-1. Our project is compatible with Java 8+ and uses ResourceBundle for internationalization, so this may be the core reason for the problem. I don't think changing the encoding directly to UTF-8 will solve it. Perhaps different JDK versions need to be treated differently, or ISO-8859-1 should be kept as the default encoding on the Java side, although that would also only be a temporary solution.

oracle jdk9 doc

I did read through this. I think the worst-case scenario here is that someone may have to convert the ResourceBundle from ISO-8859-1 to UTF-8. But every Java instance that I've ever worked with, dating back to JDK 1.1, has supported UTF-8, so I don't think this will be a problem, and it's only a one-line change to check it. If it's a problem, we can convert the encoding for the ResourceBundle. But if it's all in ASCII, we shouldn't even need to do that. I certainly don't think the minimum JDK version needs to change to handle this, though.
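The "all in ASCII" case can be checked mechanically, because bytes in the 7-bit ASCII range decode to the same characters under both ISO-8859-1 and UTF-8. A minimal sketch, assuming the file content has already been read into a byte array (for example with Files.readAllBytes); the class name AsciiCheck and the sample line are hypothetical:

```java
import java.nio.charset.StandardCharsets;

final class AsciiCheck {
    // True when every byte is in the 7-bit ASCII range (0x00-0x7F).
    static boolean isPureAscii(byte[] data) {
        for (byte b : data) {
            if (b < 0) { // bytes 0x80-0xFF are negative as signed Java bytes
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] ascii = "error.generic=Invalid input\n".getBytes(StandardCharsets.US_ASCII);
        String asLatin1 = new String(ascii, StandardCharsets.ISO_8859_1);
        String asUtf8 = new String(ascii, StandardCharsets.UTF_8);
        System.out.println(isPureAscii(ascii));
        // For pure-ASCII input both decodings give the same string.
        System.out.println(asLatin1.equals(asUtf8));
    }
}
```

A properties file whose non-ASCII text is written entirely as backslash-u escapes also passes this check, so it would not need re-encoding either.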

kwwall · 2024-06-16T21:26:25Z

I'm not recommending it, but I suppose if you really wanted to, you could tweak the code to use a character set based on a system property and just have it default to UTF-8 if not specified. The one thing I do know is that UTF-8 will always be available, regardless of language preferences. I suspect that ISO-8859-1 usually would be too, but I'm not sure that is 100% the case if you were using Chinese or Kanji, etc. UTF-8 is also used for other internal encodings. E.g., it's often used in cryptography-related code, since otherwise going between Windows and *nix often causes problems. For example, it shows up in these classes:

$ cd $JAVA_HOME
$ grep -r '"UTF-8"' src | grep /crypto/
com/sun/crypto/provider/PBKDF2KeyImpl.java: *    to bytes using UTF-8 character encoding.
com/sun/crypto/provider/PBKDF2KeyImpl.java:        Charset utf8 = Charset.forName("UTF-8");
sun/security/krb5/internal/crypto/dk/AesDkCrypto.java:            saltUtf8 = salt.getBytes("UTF-8");
sun/security/krb5/internal/crypto/dk/DkCrypto.java:// String.getBytes("UTF-8");
sun/security/krb5/internal/crypto/dk/DkCrypto.java:        Charset utf8 = Charset.forName("UTF-8");
javax/crypto/CryptoPermissions.java:        parser.read(new BufferedReader(new InputStreamReader(in, "UTF-8")));

I think it is used when stuff needs to be serialized / deserialized so it will work across operating systems, but I can't swear to that as it's been 15-20 years since I looked at that code.

That said, ISO-8859-1 would almost certainly work, but it's just not the natural choice that UTF-8 is. Just my $.02 though. In ESAPI, we use UTF-8 for everything, but this is your choice. I just think UTF-8 is more neutral whereas ISO-8859-1 has a Western European slant.
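The system-property idea from the start of this comment might look roughly like the sketch below. The property name antisamy.messages.charset and the class name MessageCharset are invented for illustration and are not existing AntiSamy settings; the only firm parts are the standard Charset.forName lookup and the UTF-8 fallback.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class MessageCharset {
    // Hypothetical property name, not an existing AntiSamy setting.
    static final String PROPERTY = "antisamy.messages.charset";

    // Returns the charset named by the system property, or UTF-8 when the
    // property is unset, blank, or names a charset this JVM does not know.
    static Charset resolve() {
        String name = System.getProperty(PROPERTY);
        if (name == null || name.trim().isEmpty()) {
            return StandardCharsets.UTF_8;
        }
        try {
            return Charset.forName(name.trim());
        } catch (IllegalArgumentException e) {
            // Covers both IllegalCharsetNameException and
            // UnsupportedCharsetException.
            return StandardCharsets.UTF_8;
        }
    }

    public static void main(String[] args) {
        System.out.println(resolve());
    }
}
```

It would be run with something like -Dantisamy.messages.charset=ISO-8859-1 on the command line; with no property set, the result is UTF-8.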

davewichers · 2024-11-18T18:32:23Z

@kwwall - Do you suggest I close this as won't fix?

kwwall · 2024-11-19T04:37:17Z

@davewichers wrote:

@kwwall - Do you suggest I close this as won't fix?

I suppose the correct thing to do is for someone to correct all the AntiSamy properties files and XML files as per the Oracle JDK 9 document that @GodMeowIceSun referenced, and use UTF-8 instead of ISO-8859-1. (And if that doesn't work, use a system property like I suggested earlier.)
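For the properties files, one possible conversion route is to load each file the legacy way and write it back out through a Writer. This is a sketch under assumptions, not a tested migration script: the class name PropertiesToUtf8 and the key error.tag are hypothetical, and the input is an in-memory stand-in for a real file. It relies on Properties.load(InputStream) reading ISO-8859-1 with backslash-u escapes, and on Properties.store(Writer, ...) writing non-ASCII characters as-is.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

final class PropertiesToUtf8 {
    // Takes the bytes of a legacy (ISO-8859-1, escaped) properties file and
    // returns the same entries as text with the escapes expanded.
    static String convert(byte[] legacyFileBytes) {
        try {
            Properties props = new Properties();
            // The InputStream overload decodes ISO-8859-1 plus escapes.
            props.load(new ByteArrayInputStream(legacyFileBytes));
            StringWriter out = new StringWriter();
            // The Writer overload emits non-ASCII characters unescaped.
            props.store(out, null);
            return out.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical entry whose value is two escaped Chinese characters.
        byte[] legacy = "error.tag=\\u6807\\u7b7e\n".getBytes(StandardCharsets.ISO_8859_1);
        String converted = convert(legacy);
        // The caller would save this text using an explicit UTF-8 encoding.
        byte[] utf8 = converted.getBytes(StandardCharsets.UTF_8);
        System.out.println(utf8.length > 0);
    }
}
```

Note that Properties.store drops the original comments and entry order and adds a timestamp comment, so a real conversion may prefer a tool or approach that preserves the file layout.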

Ideally, the person who makes this change would be the one who submitted the issue, but that doesn't seem to be happening here. However, it seems likely that others are impacted. If I were you, I'd give the PR submitter another chance to make these changes. Set a deadline, and if it's not met, close this PR. But I don't think I would close issue #456, as the issue seems legitimate. If you end up closing this PR because the deadline has passed, then I would probably slap a 'Help Wanted' label and/or 'Good First Issue' label on the issue, set the priority to Low, and just leave it open.

davewichers · 2024-11-19T16:26:19Z

@GodMeowIceSun - If you can implement this PR per @kwwall's description, we can potentially merge those changes. But as is, we cannot. Do you think you can make the changes he describes?

fix(nahsra#456): fix default charset problem

28e4885

Signed-off-by: GodMeowIceSun <icesun@icesun.cn>

GodMeowIceSun force-pushed the fix-issue-456 branch from fcc3270 to 28e4885 on May 31, 2024 08:10

GodMeowIceSun closed this May 31, 2024

GodMeowIceSun reopened this May 31, 2024

kwwall reviewed May 31, 2024

GodMeowIceSun requested a review from kwwall June 6, 2024 03:51

davewichers mentioned this pull request Nov 19, 2024

When my runtime environment is a Chinese encoding environment, the list of error messages returned is garbled #456

Open
