fix(text/unstable): handle non-Latin-script text in slugify
#5880
Fixes #5830
Changed tests:
- `déjà-vu`, not `deja-vu` (and the same for the other diacritic tests). This was the most common handling of Latin-text diacritics in the sites I checked in "text/slugify gives empty results for non-Latin alphabets" #5830 (comment).
- `33-000`, not `33000`. The Unicode category of `,` is "Punctuation (Other)", which is very broad and IMO should generally be replaced with `"-"` rather than `""`. I don't think it's desirable to special-case individual characters beyond their Unicode category, as that just ends up with a load of hard-coded and largely arbitrary chars (with the exception of straight quote marks, which are sort-of in the "wrong" category, as quote marks typically fall in "Punctuation (initial)" or "Punctuation (final)"). Also, while there isn't much to choose between `33-000` and `33000` (`33000` is more natural, but `33-000` is more readable because it retains the thousands-place delimitation), `,` is used as a decimal point in many languages, and `la-valeur-de-pi-est-3-14` is clearly preferable to `la-valeur-de-pi-est-314`.
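
For illustration, the category-based approach described above could be sketched roughly as follows. This is not the code in this PR: `slugifySketch` is a hypothetical name, and the sketch assumes that straight quote marks are removed outright (the description implies they are special-cased but does not say how) and that every other run of non-letter, non-mark, non-number characters becomes a single `"-"`.

```typescript
// Hypothetical sketch of category-based slugification; not the PR's implementation.
function slugifySketch(input: string): string {
  return input
    .trim()
    .toLowerCase()
    // Assumption: straight quote marks are dropped rather than turned into separators.
    .replace(/['"]/g, "")
    // Any run of characters that are not letters, combining marks, or numbers
    // (including "Punctuation (Other)" such as ",") becomes a single "-".
    .replace(/[^\p{L}\p{M}\p{N}]+/gu, "-")
    // Remove separators left over at either end.
    .replace(/^-+|-+$/g, "");
}

console.log(slugifySketch("33,000"));
console.log(slugifySketch("La valeur de pi est 3,14"));
console.log(slugifySketch("déjà vu"));
```

Under these assumptions no per-character table is needed: the comma is handled by its category alone, and letters with diacritics pass through unchanged because they match `\p{L}`.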