Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

added info about an arabic script fix, fixed typo #490

Merged
merged 7 commits into from
Aug 17, 2022
Merged

added info about an arabic script fix, fixed typo #490

merged 7 commits into from
Aug 17, 2022

Conversation

semaeostomea
Copy link

@semaeostomea semaeostomea commented Aug 16, 2022

related to #487

As requested I added the info and code snippet for the arabic script fix to docs/Text.md and docs/Unicode.md

I also noticed that Hangul was mentioned as a script that's not supported, which isn't the case as you can see here:
Screenshot 2022-08-16 175021
which is why I removed it, I added Kannada and Tamil as additional examples instead because they were mentioned in #474 and #365

I also changed "Hindic" to "Indic" (note: Hindi is one of many Indian languages and uses Devanagari, "Hindic scripts" do not exist, I assume this was a typo)

  • A mention of the change is present in CHANGELOG.md
    ^ This is my first pull request, I'm not sure if something like this warrants a changelog entry

By submitting this pull request, I confirm that my contribution is made under the terms of the GNU LGPL 3.0 license.

Sorry, something went wrong.

@semaeostomea semaeostomea requested a review from Lucas-C as a code owner August 16, 2022 16:29
@codecov
Copy link

codecov bot commented Aug 16, 2022

Codecov Report

Merging #490 (dbff3a7) into master (658fcb3) will decrease coverage by 0.15%.
The diff coverage is n/a.

❗ Current head dbff3a7 differs from pull request most recent head 6bf59ef. Consider uploading reports for the commit 6bf59ef to get more accurate results

@@            Coverage Diff             @@
##           master     #490      +/-   ##
==========================================
- Coverage   92.11%   91.96%   -0.16%     
==========================================
  Files          23       23              
  Lines        6684     6555     -129     
  Branches     1366     1333      -33     
==========================================
- Hits         6157     6028     -129     
  Misses        299      299              
  Partials      228      228              
Impacted Files Coverage Δ
fpdf/prefs.py 90.90% <0.00%> (-1.95%) ⬇️
fpdf/drawing.py 92.73% <0.00%> (-0.28%) ⬇️
fpdf/syntax.py 95.37% <0.00%> (-0.05%) ⬇️
fpdf/fpdf.py 89.87% <0.00%> (-0.01%) ⬇️
fpdf/sign.py 100.00% <0.00%> (ø)
fpdf/enums.py 100.00% <0.00%> (ø)
fpdf/__init__.py 100.00% <0.00%> (ø)

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

docs/Unicode.md Outdated Show resolved Hide resolved
Copy link
Member

@Lucas-C Lucas-C left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️ Thank you for this PR!

Would you mind please also adding a short mention of this workaround in CHANGELOG.md?

Something like this:

## [2.5.7] - not released yet
### Added
- workaround by @semaeostomea to support the arabic script: [link to documentation](https://pyfpdf.github.io/fpdf2/Unicode.html#arabic-script-workaround)

@Lucas-C
Copy link
Member

Lucas-C commented Aug 16, 2022

@all-contributors please add @semaeostomea for documentation

@allcontributors
Copy link

@Lucas-C

I've put up a pull request to add @semaeostomea! 🎉

semaeostomea and others added 4 commits August 17, 2022 08:32

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature. The key has expired.
@gmischler
Copy link
Collaborator

gmischler commented Aug 17, 2022

Welcome @semaeostomea
Good to have someone with practical experience in these matters chime in!
My own theoretical understanding is evidently incorrect at times...

The fact that arabic_reshaper can prepare text sufficiently for us to process directly indicates that there are seperate Unicode code points for each positional variant. Given that, your solution should really be integrated in fpdf2 directly. For the time being it is of course great to have it documented as a workaround.
Maybe you could add an indication of which names those packages have on PyPi? It looks to me like eg. bidi is listed as python-bidi there, which may be confusing to a newcomer.

Do you happen to know of any other scripts where all positional variants have seperate code points? If there are any, a final solution should treat them all this way.

I happen to be somewhat familiar with Mongolian, and I haven't found any positional code points for that (or Uighur, Manchu, etc.). In cases like this, we'd have to rely on substitutions offered by the font file. Can you confirm this?

I'm not very familiar with Hangul. Since those caracters are technically composites and require special input methods to type, I expected them to be handled like ligatures in Unicode as well. Your remark indicates that I was mistaken about this, and each combination is indeed a seperate code point. So much the better!

Sorry about the hindic/indic confusion, I tend to fall for that once in a while... 😉

CHANGELOG.md Outdated
@@ -394,3 +394,7 @@ prevented strings passed first to the text-rendering methods to be displayed.
* turned `accept_page_break` into a property
* unit tests now use the standard `unittest` lib
* massive code cleanup using `flake8`

## [2.5.7] - not released yet
Copy link
Member

@Lucas-C Lucas-C Aug 17, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the addition.
This should be at the top of the CHANGELOG.md file though :)
You may have to rebase your branch because I just merged PR #492 that added an entry to this file too

This GitHub guide may help you to rebase the doc-arabic-fix branch of your fork repo:
https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@semaeostomea
Copy link
Author

semaeostomea commented Aug 17, 2022

Do you happen to know of any other scripts where all positional variants have seperate code points? If there are any, a final solution should treat them all this way.

Not that I know of. Most of them lack a proper Unicode in the first place, which is also a problem with Mongolian as you probably know

I happen to be somewhat familiar with Mongolian, and I haven't found any positional code points for that (or Uighur, Manchu, etc.). In cases like this, we'd have to rely on substitutions offered by the font file. Can you confirm this?

If you're referring to the old Uyghur script, yes, and afaik the Unicode doesn't even cover it completely but I'm not really familiar with it.
As for the Uyghur adapted Arabic script that is in use nowadays, this is also covered by my solution (as are all other languages that use the Arabic script. Some languages that use it have rare special characters that I noticed aren't covered, but those are exceptions and the majority of the script is still displayed correctly)

I also forgot to mention that Hebrew doesn't have contextual forms. It's displayed wrong because it's RTL like Arabic. But putting it through the same code fixes that too. The only problem is that the diacritics/marks are displaced (not everyone uses them though). Depending on which font is used they're more or less displaced
Screenshot 2022-08-17 093451
כַּף סוֹפִית :correct
This is a common problem with lots of software and a number of scripts with diacritics that have separate code points though. I mostly notice this with Diné bizaad and when someone uses a different input method for Czech that separates the letter and diacritics, they end up all over the place.
I just tested it and it's the same problem with Diné bizaad:
Screenshot 2022-08-17 102804
łį́į́ʼ in the middle

@semaeostomea
Copy link
Author

semaeostomea commented Aug 17, 2022

I just changed "Arabic script workaround" to "Right to Left & Arabic script workaround" because it generally also fixes the out-of-order problem of RTL scripts with bidi.
I also added two points to mention the RTL and diacritic issue in Text.md because it's not covered by the other ones and put the header in there after all because it got a bit long too.
I hope it's alright like that.

(I forgot to add a proper commit message because I was on auto-mode and just quickly pushed like I do with my private repo, it's the "u" commit)

@Lucas-C
Copy link
Member

Lucas-C commented Aug 17, 2022

This PR looks great to me!

@gmischler: would you add anything or are your OK to merge it?

@gmischler
Copy link
Collaborator

Not that I know of. Most of them lack a proper Unicode in the first place, which is also a problem with Mongolian as you probably know

As far as I understand, Mongolian (et. al) Unicode is complete for the isolated forms.
There should be a table in the font listing the other shapes by context.

The only problem is that the diacritics/marks are displaced (not everyone uses them though).
Depending on which font is used they're more or less displaced
Screenshot 2022-08-17 093451
כַּף סוֹפִית :correct
This is a common problem with lots of software and a number of scripts with diacritics that have separate code points though.

I mostly notice this with Diné bizaad and when someone uses a different input method for Czech that separates the letter and diacritics, they end up all over the place. I just tested it and it's the same problem with Diné bizaad: Screenshot 2022-08-17 102804
łį́į́ʼ in the middle

Shouldn't some type of Unicode normalization fix those diacritics?
unicodedata.normalize('NFC', 'łį́į́ʼ')
(or maybe 'NFKC', didn't do any tests right now).

Come to think of it, I'll have to test later if this would actually solve #459.

@gmischler
Copy link
Collaborator

@gmischler: would you add anything or are your OK to merge it?

No obstacles that I can see.

@Lucas-C Lucas-C merged commit c81d720 into py-pdf:master Aug 17, 2022
@semaeostomea
Copy link
Author

semaeostomea commented Aug 17, 2022

As far as I understand, Mongolian (et. al) Unicode is complete for the isolated forms. There should be a table in the font listing the other shapes by context.

The Mongolian Unicode is complete, but there are some problems with it. I don't understand the underlying mechanism enough, I just know that it can result in the wrong form being displayed sometimes.
If you don't know it yet, this is a good read, it mentions some of the problems with the Unicode

Shouldn't some type of Unicode normalization fix those diacritics? unicodedata.normalize('NFC', 'łį́į́ʼ') (or maybe 'NFKC', didn't do any tests right now).

This can fix come diacritics but unfortunately not all. I can imagine it would fix Czech diacritics because those exist as a separate code point when combined with the letter (if that's how it works). That's not the case for some diacritics and tone marks in Diné bizaad because they're not used in any other language combined like that
Screenshot 2022-08-17 114453

On a side note: Pillow has fixed this problem. I haven't had any issues with scripts with pillow so far, so it's definitely possible
I just tested it again and at least for Diné Bizaad the diacritics are displaced after all. I must've remembered it wrong but I'll look into it, I had to re-do my ven at some point, I might've forgotten a dependency

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants