Skip to content
This repository has been archived by the owner on Aug 8, 2023. It is now read-only.

Remove nunicode from android binding #12497

Merged
merged 1 commit into from
Sep 11, 2018
Merged

Remove nunicode from android binding #12497

merged 1 commit into from
Sep 11, 2018

Conversation

tobrun
Copy link
Member

@tobrun tobrun commented Jul 27, 2018

This PR is the first step to remove nunicode from the Mapbox Maps SDK for Android. Removing this c++ library would result in decreasing the overal binary size of the SDK.

In this PR, we are replacing the nunicode uppercase and lowercase invocations by wiring through jni to leverage the android platform specific java/lang/String#toUpperCase() and jave/lang/String#toLowerCase() instead.

Note though, we can't remove nunicode yet as it's used for collator expressions (#12268).
@ChrisLoer would it be possible to move that implementaton to Collator.java in the same way as this PR does for upperCasing/lowerCasing?

@tobrun tobrun added the Android Mapbox Maps SDK for Android label Jul 27, 2018
@tobrun tobrun added this to the android-v6.4.0 milestone Jul 27, 2018
@tobrun tobrun self-assigned this Jul 27, 2018
@tobrun tobrun requested a review from ivovandongen July 27, 2018 12:30
@tobrun tobrun mentioned this pull request Jul 27, 2018
@ChrisLoer
Copy link
Contributor

@ChrisLoer would it be possible to move that implementaton to Collator.java in the same way as this PR does for upperCasing/lowerCasing?

Our Android Collator implementation currently uses nunicode for "diacritic folding", not for case shifting, so I don't know a straightforward way to apply the same change there. I'm definitely not saying it's not possible -- just that I haven't figured it out.

@kkaefer
Copy link
Member

kkaefer commented Jul 31, 2018

@LukasPaczos LukasPaczos removed this from the android-v6.4.0 milestone Aug 15, 2018
@@ -0,0 +1,21 @@
#include <mbgl/util/platform.hpp>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's rename this file, since it's no longer "stdlib"

jni::Object<String> String::New(JNIEnv& env, jni::String value) {
static auto constructor = String::javaClass.GetConstructor<jni::String>(env);
return String::javaClass.New(env, constructor, value);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need this function?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

jni::String is a Object, to be able to hook into toUpperCase/toLowerCase I'm creating mbgl::android::java::lang::String. Would prefer not having to create a new String for each uppercase/lowercase. Any pointers on how to get this working?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you remove static constexpr auto Name() { return "java/lang/String"; }; and change jni::Class<String> to jni::Class<jni::StringTag>, you will be able to use value.Call(env, method) directly.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jfirebaugh! this makes total sense

@tobrun
Copy link
Member Author

tobrun commented Aug 20, 2018

Did you run the tests in https://github.com/mapbox/mapbox-gl-native/blob/master/test/util/text_conversions.test.cpp?

Was having issues running them, I'm making running our gtest suite on android on CI a priority.

@kkaefer
Copy link
Member

kkaefer commented Aug 28, 2018

Is there anything that's blocking this PR?

@lilykaiser
Copy link

iirc it may be something related to #12688 but I will let @tobrun weigh in.

@tobrun tobrun force-pushed the tvn-remove-nunicode branch from afc8838 to cef6472 Compare August 29, 2018 05:08
@tobrun
Copy link
Member Author

tobrun commented Aug 29, 2018

@kkaefer :

I addressed #12497 (comment) and would love some input on #12497 (comment) if possible.

@tobrun tobrun force-pushed the tvn-remove-nunicode branch from cef6472 to 98a3c0a Compare August 29, 2018 18:34
@tobrun
Copy link
Member Author

tobrun commented Sep 3, 2018

@ChrisLoer been looking what it would take to remove nunicode completely, I was able to get it compile with providing a platform specific implementation of unaccent.cpp. We can leverage Normalizer.java to provide that implementation on Android. Branch withr removed nunicde + a no-op unaccent.cpp here.

Thoughts?

@ChrisLoer
Copy link
Contributor

We can leverage Normalizer.java to provide that implementation on Android

The idea sounds attractive to me, but don't you need something that can denormalize instead of normalize? My understanding of the nunicode implementation is that it essentially denormalizes the text and then strips out all the combining characters that are in one of the "accent" blocks:

		{ 0x0300, 0x036F },  /* Combining Diacritical Marks */
		{ 0x1AB0, 0x1AFF },  /* Combining Diacritical Marks Extended */
		{ 0x20D0, 0x20FF },  /* Combining Diacritical Marks for Symbols */
		{ 0x1DC0, 0x1DFF },  /* Combining Diacritical Marks Supplement */

If you're going to work on normalized text then you'd have to maintain a much larger map of character transformations.

@kkaefer
Copy link
Member

kkaefer commented Sep 5, 2018

Looks like it supports passing Normalizer.Form.NFD to get the canonical decomposition, then strip diacritics with a regular expression.

@tobrun
Copy link
Member Author

tobrun commented Sep 5, 2018

Thanks for the input! was able to get it working and the render tests confirm this:

Actual

actual

Expected

expected

Diff

output

I took the approach to remove diatrics with a regex as shown in this blogpost.

Will clean up the code and merge to this PR.

jni::String unaccented = android::java::lang::String::replaceAll(*env, normalized,
jni::Make<jni::String>(*env, "\\p{InCombiningDiacriticalMarks}+"),
jni::Make<jni::String>(*env, ""));
return jni::Make<std::string>(*env, unaccented);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we implement an unaccent(String str) function in com.mapbox.mapboxsdk.utils.StringUtils that does all of that instead of going back and forth through the JNI? That could potentially lower the binary footprint a bit.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, this is way simpler solution, addressed with 5b326ef

@tobrun tobrun force-pushed the tvn-remove-nunicode branch from d52559c to 5b326ef Compare September 5, 2018 14:12

std::string unaccent(const std::string& str) {
android::UniqueEnv env = android::AttachEnv();
jni::String unaccented = android::StringUtils::unaccent(*env,jni::Make<jni::String>(*env, str));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to delete refs here to avoid table overflows in tight loops?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, looking forward using jni::SeizeLocal to avoid these issues!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's going to be even better than that -- mapbox/jni.hpp#40.

@tobrun tobrun force-pushed the tvn-remove-nunicode branch from 5b326ef to f13eee1 Compare September 5, 2018 15:27
@NonNull
static String unaccent(@NonNull String value) {
return Normalizer.normalize(value, Normalizer.Form.NFD)
.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome! From https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html, it looks like you can use match any Unicode block, in which case we can get a little closer to the nunicode implementation by also matching against Combining Diacritical Marks Extended/for Symbols/Supplement.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ChrisLoer having a hard time figuring out the correct configuration for above 😅Typography is clearly not my cup of tea. Based on above, I should allow any unicode character? Thus something as"[^\\p{ASCII}]" or do I need to work on allowing multiple \\p{} items (For example: InCombiningDiacriticalMarks in combination with configurations on symbols/supplement from https://www.regular-expressions.info/unicode.html).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Err, I think it'll be something like:

(\\p{InCombiningDiacriticalMarks}|\\p{InCombiningDiacriticalMarksExtended}|\\p{InCombiningDiacriticalMarksForSymbols}|\\p{InCombiningDiacriticalMarksSupplement})+

?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, I wouldn't have figured that out on my own. Thank you, I now went with:

(\p{InCombiningDiacriticalMarks}|\p{InCombiningDiacriticalMarksForSymbols}|\p{InCombiningDiacriticalMarksSupplement})+

as InCombiningDiacriticalMarksExtended didn't compile for me:

screen shot 2018-09-11 at 10 02 23

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Huh, I wonder why "Extended" wasn't included (although the documentation is pretty vague on the naming -- it sounds like you just strip the spaces from the Unicode block name). That will make for a small difference from the nunicode implementation, but we already tolerate differences in collator behavior from device to device and locale to locale so I think it should be OK.

@tobrun tobrun force-pushed the tvn-remove-nunicode branch from f13eee1 to 64bd64c Compare September 5, 2018 15:58
@tobrun tobrun changed the title Replace nunicode upperCase/lowerCase with Android String Remove nunicode from android binding Sep 10, 2018
@tobrun tobrun force-pushed the tvn-remove-nunicode branch 2 times, most recently from a11979e to 21fc53f Compare September 10, 2018 21:20
@tobrun
Copy link
Member Author

tobrun commented Sep 10, 2018

21fc53f updates this PR to be jni 0.4.0 compatible.

@tobrun tobrun force-pushed the tvn-remove-nunicode branch from 21fc53f to ff68253 Compare September 11, 2018 08:13
std::string uppercase(const std::string& str) {
auto env{ android::AttachEnv() };
jni::Local<jni::String> value = jni::Make<jni::String>(*env, str.c_str());
static auto& javaClass = jni::Class<jni::StringTag>::Singleton(*env);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's fold this line into the next to avoid the need for another static variable. Same for lowercase

@tobrun tobrun force-pushed the tvn-remove-nunicode branch from ff68253 to 67f605f Compare September 11, 2018 10:18
…r uppercasing an lowercasing with an Android specific String.java equivalent
@tobrun tobrun force-pushed the tvn-remove-nunicode branch from 67f605f to 4e49c7b Compare September 11, 2018 10:18
@tobrun tobrun merged commit 6e171a6 into master Sep 11, 2018
@tobrun tobrun deleted the tvn-remove-nunicode branch September 11, 2018 10:59
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Android Mapbox Maps SDK for Android high priority
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants