Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Specify an explicit locale for all to{Lower,Upper}Case calls #17687

Closed
wants to merge 1 commit into from

Conversation

fmeum
Copy link
Collaborator

@fmeum fmeum commented Mar 8, 2023

Java's String#to{Lower,Upper}Case() is locale-dependent, which can
lead to unexpected results in locales with special case mappings in the
ASCII range (e.g. in a Turkish locale, a capital ASCII I lowercases to
a non-ASCII variant of i).

This is prevented by specifying a local without such case mappings. This
commit uses Locale.ROOT as the canonical choice with the same case
mapping behavior as other common locales such as Locale.ENGLISH or
Locale.US.

Follow-up changes could use Guava's Ascii.to{Lower,Upper}Case instead,
but whether this is safe may depend on the context, which makes this
replacement unsuitable to perform across the repo.

Fixes #17541

Java's `String#to{Lower,Upper}Case()` is locale-dependent, which can
lead to unexpected results in locales with special case mappings in the
ASCII range (e.g. in a Turkish locale, a capital ASCII `I` lowercases to
a non-ASCII variant of `i`).

This is prevented by specifying a local without such case mappings. This
commit uses `Locale.ROOT` as the canonical choice with the same case
mapping behavior as other common locales such as `Locale.ENGLISH` or
`Locale.US`.

Follow-up changes could use Guava's `Ascii.to{Lower,Upper}Case` instead,
but whether this is safe may depend on the context, which makes this
replacement unsuitable to perform across the repo.

Fixes bazelbuild#17541
@fmeum
Copy link
Collaborator Author

fmeum commented Mar 8, 2023

@meteorcloudy Could you review this? It's a global change that mostly affects OSS users. Turns out Bazel was very broken with a Turkish locale.

@fmeum
Copy link
Collaborator Author

fmeum commented Mar 8, 2023

@bazel-io flag

@bazel-io bazel-io added the potential release blocker Flagged by community members using "@bazel-io flag". Should be added to a release blocker milestone label Mar 8, 2023
@meteorcloudy
Copy link
Member

Looks a nice fix to me.
@lberki Can you confirm if this is a safe change to make?

Follow-up changes could use Guava's Ascii.to{Lower,Upper}Case instead,
but whether this is safe may depend on the context, which makes this
replacement unsuitable to perform across the repo.

What context specifically? I can see Ascii.to{Lower,Upper}Case is already used in many places.

@fmeum
Copy link
Collaborator Author

fmeum commented Mar 8, 2023

@meteorcloudy As far as I understand, Ascii#to{Lower,Upper}Case only changes the case of ASCII characters, whereas String#to{Lower,Upper}Case also applies Unicode case mapping rules (but now a fixed set of those).

I would be surprised if Ascii#to{Lower,Upper}Case didn't end up being the correct choice in essentially all locations I touched, but it wouldn't be as obvious that the change is correct and I certainly wouldn't feel like cherry-picking it.

@meteorcloudy
Copy link
Member

@fmeum Understood, thanks!

@lberki
Copy link
Contributor

lberki commented Mar 8, 2023

I'm somewhat worried that even though this change fixes the issue at hand any many others that stem from the meaning of to{Lower,Upper}Case() being different depending on the locale, it's easy to introduce such a mistake again.

WDYT about sanitizing the environment of the Bazel server instead so that as far as the JVM is concerned, the locale is always the same, regardless of the locale the user sets? (I thought that we already do that, but we apparently don't)

@fmeum
Copy link
Collaborator Author

fmeum commented Mar 8, 2023

@lberki Char locales are fixed for Java actions and also by the Java stub template, but I don't think Bazel itself is subject to any locale cleaning.

While forcing a locale makes sense from the point of view of hermeticity, there are certain places in which having locale-dependent output may actually be desirable (e.g. String#format rendering of floating point numbers and decimal separators). Not saying that it wouldn't be worth it, but I can see a cost in terms of usability.

An alternative I have thought about was to add an ErrorProne check for locale-less calls to to{Lower,Upper}Case and enable it for Bazel. But maybe it doesn't clear the "obviously wrong" bar?

@lberki
Copy link
Contributor

lberki commented Mar 8, 2023

That also works (@meteorcloudy do we have ErrorProne set up on our public CI?)

I think consistent formatting of floats and the like (to the extent that Bazel emits floats, I don't think it happens a lot) is a positive development, isn't it? And it would remove a whole class of possible bugs; output is one thing, but I'm worried that we are also parsing things in a locale-dependent way right now which isn't great.

Ultimately, I'm fine with either approach, it's just that sanitizing the server environment seems to bring the most back for the buck in terms of bugs fixed and possible bugs eliminated.

@fmeum
Copy link
Collaborator Author

fmeum commented Mar 8, 2023

@cushon Has an ErrorProne check for locale-less String#to{Lower,Upper}Case() been considered in the past?

@cushon
Copy link
Contributor

cushon commented Mar 8, 2023

@cushon Has an ErrorProne check for locale-less String#to{Lower,Upper}Case() been considered in the past?

I filed google/error-prone#3809

@keertk
Copy link
Member

keertk commented Mar 8, 2023

@bazel-io fork 6.2.0

@bazel-io bazel-io removed the potential release blocker Flagged by community members using "@bazel-io flag". Should be added to a release blocker milestone label Mar 8, 2023
@fmeum
Copy link
Collaborator Author

fmeum commented Mar 9, 2023

Superseded by #17702

@fmeum fmeum closed this Mar 9, 2023
@fmeum fmeum deleted the fix-locale-dependency branch March 9, 2023 11:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
awaiting-review PR is awaiting review from an assigned reviewer team-Rules-ObjC Issues for Objective-C maintainers
Projects
None yet
6 participants