added info about an arabic script fix, fixed typo #490

semaeostomea · 2022-08-16T16:29:49Z

related to #487

As requested I added the info and code snippet for the arabic script fix to docs/Text.md and docs/Unicode.md

I also noticed that Hangul was mentioned as a script that's not supported, which isn't the case as you can see here:

which is why I removed it, I added Kannada and Tamil as additional examples instead because they were mentioned in #474 and #365

I also changed "Hindic" to "Indic" (note: Hindi is one of many Indian languages and uses Devanagari, "Hindic scripts" do not exist, I assume this was a typo)

A mention of the change is present in CHANGELOG.md
^ This is my first pull request, I'm not sure if something like this warrants a changelog entry

By submitting this pull request, I confirm that my contribution is made under the terms of the GNU LGPL 3.0 license.

codecov · 2022-08-16T17:24:14Z

Codecov Report

Merging #490 (dbff3a7) into master (658fcb3) will decrease coverage by 0.15%.
The diff coverage is n/a.

❗ Current head dbff3a7 differs from pull request most recent head 6bf59ef. Consider uploading reports for the commit 6bf59ef to get more accurate results

@@            Coverage Diff             @@
##           master     #490      +/-   ##
==========================================
- Coverage   92.11%   91.96%   -0.16%     
==========================================
  Files          23       23              
  Lines        6684     6555     -129     
  Branches     1366     1333      -33     
==========================================
- Hits         6157     6028     -129     
  Misses        299      299              
  Partials      228      228

Impacted Files	Coverage Δ
fpdf/prefs.py	`90.90% <0.00%> (-1.95%)`	⬇️
fpdf/drawing.py	`92.73% <0.00%> (-0.28%)`	⬇️
fpdf/syntax.py	`95.37% <0.00%> (-0.05%)`	⬇️
fpdf/fpdf.py	`89.87% <0.00%> (-0.01%)`	⬇️
fpdf/sign.py	`100.00% <0.00%> (ø)`
fpdf/enums.py	`100.00% <0.00%> (ø)`
fpdf/__init__.py	`100.00% <0.00%> (ø)`

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

docs/Unicode.md

Lucas-C

❤️ Thank you for this PR!

Would you mind please also adding a short mention of this workaround in CHANGELOG.md?

Something like this:

## [2.5.7] - not released yet
### Added
- workaround by @semaeostomea to support the arabic script: [link to documentation](https://pyfpdf.github.io/fpdf2/Unicode.html#arabic-script-workaround)

Lucas-C · 2022-08-16T18:35:32Z

@all-contributors please add @semaeostomea for documentation

allcontributors · 2022-08-16T18:35:41Z

@Lucas-C

I've put up a pull request to add @semaeostomea! 🎉

…into doc-arabic-fix

gmischler · 2022-08-17T07:18:26Z

Welcome @semaeostomea
Good to have someone with practical experience in these matters chime in!
My own theoretical understanding is evidently incorrect at times...

The fact that arabic_reshaper can prepare text sufficiently for us to process directly indicates that there are seperate Unicode code points for each positional variant. Given that, your solution should really be integrated in fpdf2 directly. For the time being it is of course great to have it documented as a workaround.
Maybe you could add an indication of which names those packages have on PyPi? It looks to me like eg. bidi is listed as python-bidi there, which may be confusing to a newcomer.

Do you happen to know of any other scripts where all positional variants have seperate code points? If there are any, a final solution should treat them all this way.

I happen to be somewhat familiar with Mongolian, and I haven't found any positional code points for that (or Uighur, Manchu, etc.). In cases like this, we'd have to rely on substitutions offered by the font file. Can you confirm this?

I'm not very familiar with Hangul. Since those caracters are technically composites and require special input methods to type, I expected them to be handled like ligatures in Unicode as well. Your remark indicates that I was mistaken about this, and each combination is indeed a seperate code point. So much the better!

Sorry about the hindic/indic confusion, I tend to fall for that once in a while... 😉

Lucas-C · 2022-08-17T07:42:48Z

CHANGELOG.md

@@ -394,3 +394,7 @@ prevented strings passed first to the text-rendering methods to be displayed.
 * turned `accept_page_break` into a property
 * unit tests now use the standard `unittest` lib
 * massive code cleanup using `flake8`
+
+## [2.5.7] - not released yet


Thanks for the addition.
This should be at the top of the CHANGELOG.md file though :)
You may have to rebase your branch because I just merged PR #492 that added an entry to this file too

This GitHub guide may help you to rebase the doc-arabic-fix branch of your fork repo:
https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

semaeostomea · 2022-08-17T08:31:03Z

Do you happen to know of any other scripts where all positional variants have seperate code points? If there are any, a final solution should treat them all this way.

Not that I know of. Most of them lack a proper Unicode in the first place, which is also a problem with Mongolian as you probably know

I happen to be somewhat familiar with Mongolian, and I haven't found any positional code points for that (or Uighur, Manchu, etc.). In cases like this, we'd have to rely on substitutions offered by the font file. Can you confirm this?

If you're referring to the old Uyghur script, yes, and afaik the Unicode doesn't even cover it completely but I'm not really familiar with it.
As for the Uyghur adapted Arabic script that is in use nowadays, this is also covered by my solution (as are all other languages that use the Arabic script. Some languages that use it have rare special characters that I noticed aren't covered, but those are exceptions and the majority of the script is still displayed correctly)

I also forgot to mention that Hebrew doesn't have contextual forms. It's displayed wrong because it's RTL like Arabic. But putting it through the same code fixes that too. The only problem is that the diacritics/marks are displaced (not everyone uses them though). Depending on which font is used they're more or less displaced

כַּף סוֹפִית :correct
This is a common problem with lots of software and a number of scripts with diacritics that have separate code points though. I mostly notice this with Diné bizaad and when someone uses a different input method for Czech that separates the letter and diacritics, they end up all over the place.
I just tested it and it's the same problem with Diné bizaad:

łį́į́ʼ in the middle

semaeostomea · 2022-08-17T09:04:17Z

I just changed "Arabic script workaround" to "Right to Left & Arabic script workaround" because it generally also fixes the out-of-order problem of RTL scripts with bidi.
I also added two points to mention the RTL and diacritic issue in Text.md because it's not covered by the other ones and put the header in there after all because it got a bit long too.
I hope it's alright like that.

(I forgot to add a proper commit message because I was on auto-mode and just quickly pushed like I do with my private repo, it's the "u" commit)

Lucas-C · 2022-08-17T09:08:48Z

This PR looks great to me!

@gmischler: would you add anything or are your OK to merge it?

gmischler · 2022-08-17T09:27:38Z

Not that I know of. Most of them lack a proper Unicode in the first place, which is also a problem with Mongolian as you probably know

As far as I understand, Mongolian (et. al) Unicode is complete for the isolated forms.
There should be a table in the font listing the other shapes by context.

The only problem is that the diacritics/marks are displaced (not everyone uses them though).
Depending on which font is used they're more or less displaced

כַּף סוֹפִית :correct
This is a common problem with lots of software and a number of scripts with diacritics that have separate code points though.

I mostly notice this with Diné bizaad and when someone uses a different input method for Czech that separates the letter and diacritics, they end up all over the place. I just tested it and it's the same problem with Diné bizaad:
łį́į́ʼ in the middle

Shouldn't some type of Unicode normalization fix those diacritics?
unicodedata.normalize('NFC', 'łį́į́ʼ')
(or maybe 'NFKC', didn't do any tests right now).

Come to think of it, I'll have to test later if this would actually solve #459.

gmischler · 2022-08-17T09:41:34Z

@gmischler: would you add anything or are your OK to merge it?

No obstacles that I can see.

semaeostomea · 2022-08-17T09:49:56Z

As far as I understand, Mongolian (et. al) Unicode is complete for the isolated forms. There should be a table in the font listing the other shapes by context.

The Mongolian Unicode is complete, but there are some problems with it. I don't understand the underlying mechanism enough, I just know that it can result in the wrong form being displayed sometimes.
If you don't know it yet, this is a good read, it mentions some of the problems with the Unicode

Shouldn't some type of Unicode normalization fix those diacritics? unicodedata.normalize('NFC', 'łį́į́ʼ') (or maybe 'NFKC', didn't do any tests right now).

This can fix come diacritics but unfortunately not all. I can imagine it would fix Czech diacritics because those exist as a separate code point when combined with the letter (if that's how it works). That's not the case for some diacritics and tone marks in Diné bizaad because they're not used in any other language combined like that

~~On a side note: Pillow has fixed this problem. I haven't had any issues with scripts with pillow so far, so it's definitely possible~~
I just tested it again and at least for Diné Bizaad the diacritics are displaced after all. I must've remembered it wrong but I'll look into it, I had to re-do my ven at some point, I might've forgotten a dependency

added info about an arabic script fix, fixed typo

d88881e

semaeostomea requested a review from Lucas-C as a code owner August 16, 2022 16:29

Lucas-C reviewed Aug 16, 2022

View reviewed changes

docs/Unicode.md Outdated Show resolved Hide resolved

Lucas-C approved these changes Aug 16, 2022

View reviewed changes

allcontributors bot mentioned this pull request Aug 16, 2022

docs: add semaeostomea as a contributor for doc #491

Merged

semaeostomea and others added 4 commits August 17, 2022 08:32

added header

5d0405b

Merge branch 'PyFPDF:master' into doc-arabic-fix

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature. The key has expired.

GPG key ID: 4AEE18F83AFDEB23
Expired
Learn about vigilant mode

e5621b1

added to changelog, fixed typo

a6e9891

Merge branch 'doc-arabic-fix' of https://github.com/semaeostomea/fpdf2 …

37fb091

…into doc-arabic-fix

Lucas-C reviewed Aug 17, 2022

View reviewed changes

semaeostomea added 2 commits August 17, 2022 10:45

u

9777c4a

Merge remote-tracking branch 'upstream/master' into doc-arabic-fix

6bf59ef

Lucas-C merged commit c81d720 into py-pdf:master Aug 17, 2022

semaeostomea deleted the doc-arabic-fix branch August 18, 2022 07:18

marcstober mentioned this pull request Sep 19, 2022

Hebrew combining diacritics aren't positioned correctly #549

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

added info about an arabic script fix, fixed typo #490

added info about an arabic script fix, fixed typo #490

semaeostomea commented Aug 16, 2022 •

edited by Lucas-C

Loading

codecov bot commented Aug 16, 2022 •

edited

Loading

Lucas-C left a comment

Lucas-C commented Aug 16, 2022

allcontributors bot commented Aug 16, 2022

gmischler commented Aug 17, 2022 •

edited

Loading

Lucas-C Aug 17, 2022 •

edited

Loading

semaeostomea commented Aug 17, 2022 •

edited

Loading

semaeostomea commented Aug 17, 2022 •

edited

Loading

Lucas-C commented Aug 17, 2022

gmischler commented Aug 17, 2022

gmischler commented Aug 17, 2022

semaeostomea commented Aug 17, 2022 •

edited

Loading

added info about an arabic script fix, fixed typo #490

added info about an arabic script fix, fixed typo #490

Conversation

semaeostomea commented Aug 16, 2022 • edited by Lucas-C Loading

codecov bot commented Aug 16, 2022 • edited Loading

Codecov Report

Lucas-C left a comment

Choose a reason for hiding this comment

Lucas-C commented Aug 16, 2022

allcontributors bot commented Aug 16, 2022

gmischler commented Aug 17, 2022 • edited Loading

Lucas-C Aug 17, 2022 • edited Loading

Choose a reason for hiding this comment

semaeostomea commented Aug 17, 2022 • edited Loading

semaeostomea commented Aug 17, 2022 • edited Loading

Lucas-C commented Aug 17, 2022

gmischler commented Aug 17, 2022

gmischler commented Aug 17, 2022

semaeostomea commented Aug 17, 2022 • edited Loading

semaeostomea commented Aug 16, 2022 •

edited by Lucas-C

Loading

codecov bot commented Aug 16, 2022 •

edited

Loading

gmischler commented Aug 17, 2022 •

edited

Loading

Lucas-C Aug 17, 2022 •

edited

Loading

semaeostomea commented Aug 17, 2022 •

edited

Loading

semaeostomea commented Aug 17, 2022 •

edited

Loading

semaeostomea commented Aug 17, 2022 •

edited

Loading