Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(#456): fix default charset problem #457

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

GodMeowIceSun
Copy link

the fix of this issue

#456

Signed-off-by: GodMeowIceSun <icesun@icesun.cn>
@davewichers
Copy link
Collaborator

@spassarop - Have you looked at this issue/proposed fix yet? If there is a real problem here it would be nice to fix this and include it in the release we are close to getting done related to the neko-html fix.

@GodMeowIceSun GodMeowIceSun requested a review from kwwall June 6, 2024 03:51
@kwwall
Copy link
Contributor

kwwall commented Jun 6, 2024 via email

@spassarop
Copy link
Collaborator

spassarop commented Jun 6, 2024 via email

@GodMeowIceSun
Copy link
Author

I agree with Kevin. Probably the default Windows encoding was used on
policy XML because the content of that file was unlikely to have characters
beyond that encoding.

I am not sure why there were error messages in Chinese but were not tested
too see if the text would show correctly. I mean, if the display was tested
it would look obviously broken.

Probably if an environment does not support UTF-8, AntiSamy usage should
not be the main issue.

Then, I share Kevin suggestion to update the encoding to UTF-8 on this code
change and test it.

On Thu, 6 Jun 2024 at 02:23 Kevin W. Wall @.***> wrote:

ISO-8859-1 is the default encoding for Windows, but just about every other
modern OS uses UTF-8. It can especially bite you in the butt when using
String.getBytes(). I recommend changing it to use UTF-8.

See

https://stackoverflow.com/questions/7048745/what-is-the-difference-between-utf-8-and-iso-8859-1#7048774
for details of how they are different.

-kevin

On Wed, Jun 5, 2024, 11:52 PM GodMeowIceSun @.***>
wrote:

@GodMeowIceSun https://github.com/GodMeowIceSun requested your review
on: #457 #457 fix(#456
#456): fix default charset
problem.


Reply to this email directly, view it on GitHub
#457 (comment), or
unsubscribe
<
https://github.com/notifications/unsubscribe-auth/AAO6PG4LLTCSVCS2OOT3FNTZF7MG3AVCNFSM6AAAAABISIQENCVHI2DSMVQWIX3LMV45UABCJFZXG5LFIV3GK3TUJZXXI2LGNFRWC5DJN5XDWMJTGA3DANJQGU4TAMI>

.
You are receiving this because your review was requested.Message ID:
@.***>


Reply to this email directly, view it on GitHub
#457 (comment), or
unsubscribe
https://github.com/notifications/unsubscribe-auth/AHL3BMPHIOCU4ZAWHBAEX73ZF7W35AVCNFSM6AAAAABISIQENCVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCNJRGQZTKMRSHA
.
You are receiving this because you were mentioned.Message ID:
@.***>

Perhaps you can take a look at the document I referenced, which proves in reverse that the default encoding of ResourceBundle in the original implementation of Jdk8 was ISO-8859-1, while our project is compatible with Java 8+, and the internationalization method used is ResourceBundle. Perhaps this is the core reason for the problem. I don't think directly modifying it to UTF-8 can solve the problem. Perhaps different versions of JDK need to be treated differently or the Java version should use ISO-8859-1 as the default encoding, which is also a temporary solution

oracle jdk9 doc

@kwwall
Copy link
Contributor

kwwall commented Jun 6, 2024

@GodMeowIceSun wrote:

Perhaps you can take a look at the document I referenced, which proves in reverse that the default encoding of ResourceBundle in the original implementation of Jdk8 was ISO-8859-1, while our project is compatible with Java 8+, and the internationalization method used is ResourceBundle. Perhaps this is the core reason for the problem. I don't think directly modifying it to UTF-8 can solve the problem. Perhaps different versions of JDK need to be treated differently or the Java version should use ISO-8859-1 as the default encoding, which is also a temporary solution

oracle jdk9 doc

I did read through this. I think the worst case scenario here is that someone may have to convert the ResourceBundle from ISO-8859-1 to UTF-8. But every Java instance that I've ever worked with, dating back to JDK 1.1, has supported UTF-8, so I don't think this will be a problem, and it's only a 1 line change to check it. If it's a problem, we can convert the encoding for the ResourceBundle. But if it's all in ASCII, we shouldn't even need to do that. I certainly don't think a different minimal JDK version needs to be updated to handle this though.

@kwwall
Copy link
Contributor

kwwall commented Jun 16, 2024

I'm not recommending it, but I suppose if you really wanted to, you could tweak the code to use a character set based on a system property and just have it default to UTF-8 if not specified. The one thing I do know if UTF-8 will always be available, regardless of whatever language preferences. I suspect that ISO-8859-1 usually would be, but I'm not sure that is 100% the case if you were using Chinese or Kanji, etc. But UTF-8 is used for other internal encodings. E.g., it's often used with cryptography related code, otherwise going between Windows and *nix often causes problems. For example if shows up in these classes:

$ cd $JAVA_HOME
$ grep -r '"UTF-8"' src | grep /crypto/
com/sun/crypto/provider/PBKDF2KeyImpl.java: *    to bytes using UTF-8 character encoding.
com/sun/crypto/provider/PBKDF2KeyImpl.java:        Charset utf8 = Charset.forName("UTF-8");
sun/security/krb5/internal/crypto/dk/AesDkCrypto.java:            saltUtf8 = salt.getBytes("UTF-8");
sun/security/krb5/internal/crypto/dk/DkCrypto.java:// String.getBytes("UTF-8");
sun/security/krb5/internal/crypto/dk/DkCrypto.java:        Charset utf8 = Charset.forName("UTF-8");
javax/crypto/CryptoPermissions.java:        parser.read(new BufferedReader(new InputStreamReader(in, "UTF-8")));

I think it is used when stuff needs to be serialized / deserialized so it will work across operating systems, but I can't swear to that as it's been 15-20 years since I looked at that code.

That said, ISO-8859-1 would almost certainly work, but it's just the natural choice that UTF-8 is. Just my $.02 though. In ESAPI, we use UTF-8 for everything, but this is your choice. I just think UTF-8 is more neutral whereas ISO-8859-1 has a Western European slant.

@davewichers
Copy link
Collaborator

@kwwall - Do you suggest I close this as won't fix?

@kwwall
Copy link
Contributor

kwwall commented Nov 19, 2024

@kwwall - Do you suggest I close this as won't fix?

I suppose the correct thing to do is for someone to correct all the AntiSamy properties files and XML files as per the Oracle JDK 9 document that @GodMeowIceSun referenced, and use UTF-8 instead of ISO-8859-1. (And if that doesn't work, use a system property like I suggested earlier.)

Ideally, that someone who would change this would be the one who has submitted the issue, but that doesn't seem to be happening here. However, it seems likely that others are impacted. If I were you, I'd give the PR submitter another chance to make these changes. Set a deadline and if it's not met close this PR. But I don't think I would close issue #456 as the issue seems legitimate. If you end up closing this PR because some date is past, then I probably would slap a 'Help Wanted' label and/or 'Good First Issue' label on the issue and set the priority to Low and just leave it open.

@davewichers
Copy link
Collaborator

@GodMeowIceSun - If you can implement this PR per @kwwall description, we can potentially merge those changes in. But as is, we cannot. Do you think you can make the changes as he describes?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants