New Feature: Support ligature Glyphs from TTF Fonts #540

Closed
gmischler opened this issue Sep 15, 2022 · 3 comments

@gmischler
Collaborator

We've had several issues raised because fpdf2 currently doesn't correctly support writing systems that require merging successive characters into combined glyphs (ligatures). This appears to be mainly the case for Indic scripts.

Compare #365, #381, #459, #474, and downstream global_scorecards #7.

Those ligature glyphs are usually stored in the font file under glyph indices that have no Unicode code point of their own, so they cannot be represented in Python strings. This means that we need a custom data structure to represent them, which can also include some other helpful information. Note that such a ligature may actually consist of several partial glyphs, so there is an n:m relationship between Unicode code points and ligature glyphs.

We could represent our text elements e.g. similar to this:

from typing import NamedTuple

class GlyphGroup(NamedTuple):
    font_glyphs: tuple   # glyph indices in the font file
    glyph_widths: tuple  # width of each subglyph, for placement
    width: float         # width of the whole thing, for text width calculation
    chars: tuple         # the original Unicode characters
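To make the n:m relationship concrete, here is how one such group might be filled in for a Latin "fi" ligature. This is only an illustration: the glyph index and widths are made-up values, not taken from any real font.

```python
from typing import NamedTuple


class GlyphGroup(NamedTuple):
    font_glyphs: tuple   # glyph indices in the font file
    glyph_widths: tuple  # width of each subglyph, for placement
    width: float         # width of the whole thing
    chars: tuple         # the original Unicode characters


# Made-up numbers: glyph 1234 stands for an "fi" ligature in some font.
fi = GlyphGroup(
    font_glyphs=(1234,),
    glyph_widths=(556.0,),
    width=556.0,
    chars=("f", "i"),
)

# Two input characters map to one glyph here; a group could also hold
# several partial glyphs for a single visual unit.
original_text = "".join(fi.chars)
```

Keeping `chars` alongside the glyph indices is what would later allow the original text to be recovered for copy and paste.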

The processing sequence might look something like this:

  • For each remaining part of the input string, search the "GSUB" table in the font file for the longest matching sequence.
  • If there is a match, build the GlyphGroup from it. Otherwise build a single-character GlyphGroup.
  • Replace Fragment.characters with a list of GlyphGroups.
  • Adapt the methods of Fragment to this change.
  • In _render_styled_text_line(), place the sequence of glyphs on the page.
  • For bonus accessibility points, follow section 14.9.4 ("Replacement Text") of the PDF specification, so that copying text in the PDF viewer will return the original Unicode character sequence again.
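The first two steps could be sketched roughly as below. This is a simplification under stated assumptions: `LIGATURES`, `CMAP` and `WIDTHS` are hypothetical stand-ins for data that would be read from the font file, and a real "GSUB" table works on glyph indices with contextual rules rather than on a plain lookup of character sequences.

```python
from typing import NamedTuple


class GlyphGroup(NamedTuple):
    font_glyphs: tuple
    glyph_widths: tuple
    width: float
    chars: tuple


# Hypothetical stand-ins for data that would come from the font file.
LIGATURES = {("f", "f", "i"): (901,), ("f", "i"): (902,)}  # chars -> glyph ids
CMAP = {"f": 10, "i": 11, "n": 12, "e": 13}                  # char -> glyph id
WIDTHS = {10: 300.0, 11: 250.0, 12: 550.0, 13: 500.0, 901: 800.0, 902: 520.0}
MAX_LEN = max(len(seq) for seq in LIGATURES)


def to_glyph_groups(text):
    """Split text into GlyphGroups, preferring the longest ligature match."""
    groups = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first, down to a single character.
        for length in range(min(MAX_LEN, len(text) - pos), 0, -1):
            chars = tuple(text[pos:pos + length])
            glyphs = LIGATURES.get(chars)
            if glyphs is None and length == 1:
                # Fallback: single-character group (KeyError if unmapped).
                glyphs = (CMAP[chars[0]],)
            if glyphs is not None:
                widths = tuple(WIDTHS[g] for g in glyphs)
                groups.append(GlyphGroup(glyphs, widths, sum(widths), chars))
                pos += length
                break
    return groups


groups = to_glyph_groups("fine")
```

With these made-up tables, "fine" becomes three groups: the "fi" ligature followed by single-character groups for "n" and "e".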

There are probably quite a few pitfalls that aren't obvious at the moment. We'll also need support and advice from native speakers of the respective languages, who are the only ones able to spot any errors in the resulting files. Some fonts may contain tables other than "GSUB" that we might also want to take into account.

Upside of the change:

  • A large part of the world will be able to create PDFs in their own language with fpdf2 (I'm currently not aware of any Open Source solution that allows this).
  • Even documents in languages using Latin script may look nicer when using a high quality font, since this will also cover typographic ligatures like "fi", "ft", "fs", and any others the font designers might have included.

Downside:

  • Both the processing load and memory use of the library will rise, even for users who just create documents in English.

Anyone up for the task?

@gmischler
Collaborator Author

In #549, @marcstober pointed out that some combined characters are not technically ligatures (looked up in "GSUB"), but rather diacritics. Those don't need to be substituted, but they follow special placement rules found in "GPOS". This is particularly important (and tricky) when several of them need to be combined with a single base character, which may require them to be stacked on top of each other. Example scripts that require this are Hebrew and Thai.
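As a rough first step, the input could be split into clusters of one base character plus its following marks, using the Unicode general category from the standard library. This sketch only decides which characters belong together; the actual offsets would still have to come from the font's "GPOS" table, which is not shown. The function name `split_clusters` is hypothetical.

```python
import unicodedata


def split_clusters(text):
    """Group each base character with the combining marks that follow it."""
    clusters = []
    for ch in text:
        # General categories Mn, Mc and Me are the Unicode mark categories.
        is_mark = unicodedata.category(ch).startswith("M")
        if is_mark and clusters:
            clusters[-1] += ch   # attach the mark to the preceding base
        else:
            clusters.append(ch)  # start a new cluster
    return clusters


# Hebrew shin with shin dot (U+05C1) and qamats (U+05B8), then a plain lamed.
hebrew = split_clusters("\u05e9\u05c1\u05b8\u05dc")

# Thai ko kai with a vowel sign above (U+0E34) and a tone mark (U+0E48).
thai = split_clusters("\u0e01\u0e34\u0e48")
```

Both examples carry two marks on a single base character, which is the stacking case described above.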

I don't think it is realistic to handle all those special cases directly in _render_styled_text_line(). A more practical approach might be to turn the GlyphGroup outlined above into a more elaborate class than just a NamedTuple. It should have a .render() method that returns a completed string containing all the glyph and placement instructions to be inserted into the PDF stream. There can be several subclasses dealing with either standard Unicode glyphs, ligatures, or glyph/diacritic combinations. This assumes that ligature substitution and diacritic stacking will not apply to the same input characters, or things may get even more involved...

As a basis for such functionality, an initial refactoring might introduce a generic GlyphGroup class, which just passes through all characters, resulting in the same output as the current code. After that, subclasses that combine and substitute glyphs as needed can be implemented one by one.
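A minimal sketch of that layering might look like the following. Everything here is an assumption for illustration: the constructor arguments, the `glyph_ids()` helper, the `LigatureGroup` name, and the output format, which assumes a font embedded with 2-byte glyph identifiers written as a PDF hex string. It is not how fpdf2 actually emits text.

```python
class GlyphGroup:
    """Pass-through group: one glyph per input character (hypothetical design)."""

    def __init__(self, chars, cmap):
        self.chars = tuple(chars)
        self.cmap = cmap  # assumed mapping: character -> glyph index

    def glyph_ids(self):
        return [self.cmap[ch] for ch in self.chars]

    def render(self):
        # Hex string of 2-byte glyph ids, then the PDF "Tj" show-text operator.
        hex_ids = "".join(f"{gid:04X}" for gid in self.glyph_ids())
        return f"<{hex_ids}> Tj"


class LigatureGroup(GlyphGroup):
    """Replaces the whole character sequence with substitute glyphs."""

    def __init__(self, chars, cmap, substitute_ids):
        super().__init__(chars, cmap)
        self.substitute_ids = list(substitute_ids)

    def glyph_ids(self):
        return self.substitute_ids


cmap = {"f": 10, "i": 11}  # made-up glyph indices
plain = GlyphGroup("fi", cmap).render()
ligature = LigatureGroup("fi", cmap, [902]).render()
```

The base class reproduces the one-glyph-per-character behaviour, so it could be introduced first without changing output; a subclass then only overrides which glyphs are emitted. A diacritic subclass would additionally need to emit positioning operators, which this sketch leaves out.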

@Lucas-C
Member

Lucas-C commented Aug 2, 2023

@andersonhc PR #820 has been merged today.

Could you test if that solved your issue @gmischler?

You can install fpdf2 directly from the master branch of this repo with this command:

pip install git+https://github.com/PyFPDF/fpdf2.git@master

The documentation is there: https://pyfpdf.github.io/fpdf2/TextShaping.html

@gmischler
Collaborator Author

As far as I can determine, this is now fixed.
