New Feature: Support ligature Glyphs from TTF Fonts #540

gmischler · 2022-09-15T07:05:46Z

We've had several issues raised because fpdf2 currently doesn't correctly support writing systems that require merging successive characters into combined glyphs (ligatures). This appears to mainly affect Indic scripts.

Compare #365, #381, #459, #474, and downstream global_scorecards #7.

Those ligature glyphs are usually stored in the font file under glyph indices that have no Unicode code point of their own, so they cannot be represented as characters in a Python string. This means that we need a custom data structure to represent them, which can also include some other helpful information. Note that such a ligature may actually consist of several partial glyphs, so there is an n:m relationship between Unicode code points and ligature glyphs.

We could represent our text elements with something like this:

from typing import NamedTuple


class GlyphGroup(NamedTuple):
    font_glyphs: tuple   # indices in the font file
    glyph_widths: tuple  # width of each subglyph, for placement
    width: float         # width of the whole thing, for text width calculation
    chars: tuple         # the original Unicode characters
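
To make the n:m relationship concrete, here is a small sketch with two hand-built groups. The definition is repeated so the snippet runs on its own. All glyph indices and widths are made-up values for illustration, not taken from any real font.

```python
from typing import NamedTuple


class GlyphGroup(NamedTuple):
    font_glyphs: tuple   # indices in the font file
    glyph_widths: tuple  # width of each subglyph, for placement
    width: float         # width of the whole group
    chars: tuple         # the original Unicode characters


# Two characters merged into a single ligature glyph (index 1234 is invented).
fi_ligature = GlyphGroup(
    font_glyphs=(1234,),
    glyph_widths=(540.0,),
    width=540.0,
    chars=("f", "i"),
)

# One character drawn from two partial glyphs (indices invented as well).
# U+0BCA is the Tamil vowel sign O, which has a two-part shape.
split_vowel = GlyphGroup(
    font_glyphs=(310, 311),
    glyph_widths=(250.0, 400.0),
    width=650.0,
    chars=("\u0BCA",),
)
```

The first group maps two code points to one glyph, the second maps one code point to two glyphs, which is why neither side can serve as a simple index into the other.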

The processing sequence might look something like this:

For each remaining part of the input string, search the "gsub" table in the font file for the longest matching sequence.
If there is a match, build the GlyphGroup from it. Otherwise build a single-character GlyphGroup.
Replace Fragment.characters with a list of GlyphGroups.
Adapt the methods of Fragment to this change.
In _render_styled_text_line(), place the sequence of glyphs on the page.
For bonus accessibility points, follow the PDF specs section 14.9.4, so that copying text in the PDF viewer returns the original Unicode character sequence.
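
The first two steps could be sketched roughly as below. This is only an illustration of greedy longest-match grouping: `LIGATURES`, `CMAP` and `WIDTHS` are hypothetical stand-ins for data that would have to be read from the font, and a real "gsub" table works on glyph indices with several lookup types rather than on a plain dictionary of characters.

```python
from typing import NamedTuple


class GlyphGroup(NamedTuple):
    font_glyphs: tuple
    glyph_widths: tuple
    width: float
    chars: tuple


# Hypothetical data; a real implementation would read this from the font file.
LIGATURES = {("f", "i"): (900,), ("f", "f", "i"): (901,)}
CMAP = {"f": 10, "i": 11, "n": 12}
WIDTHS = {10: 300.0, 11: 250.0, 12: 550.0, 900: 520.0, 901: 800.0}
MAX_LEN = max(len(seq) for seq in LIGATURES)


def group_glyphs(text):
    """Split text into GlyphGroups, preferring the longest known sequence."""
    groups = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first, down to sequences of two characters.
        for length in range(min(MAX_LEN, len(text) - pos), 1, -1):
            chars = tuple(text[pos:pos + length])
            if chars in LIGATURES:
                glyphs = LIGATURES[chars]
                break
        else:
            # No sequence matched: fall back to a single-character group.
            chars = (text[pos],)
            glyphs = (CMAP[text[pos]],)
        widths = tuple(WIDTHS[g] for g in glyphs)
        groups.append(GlyphGroup(glyphs, widths, sum(widths), chars))
        pos += len(chars)
    return groups
```

With this toy data, `group_glyphs("ffin")` yields one group for the three-character ligature followed by a single-character group for "n".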

There are probably quite a few pitfalls that aren't obvious at the moment. We'll also need support and advice from native speakers of the respective languages, who are the only ones able to spot errors in the resulting files. Some fonts may also contain tables other than "gsub" that we might want to take into account.

Upside of the change:

A large part of the world will be able to create PDFs in their own language with fpdf2 (I'm currently not aware of any Open Source solution that allows this).
Even documents in languages using Latin script may look nicer when using a high quality font, since this will also cover typographic ligatures like "fi", "ft", "fs", and any others the font designers might have included.

Downside:

Both the processing load and memory use of the library will rise, even for users who just create documents in English.

Anyone up for the task?

gmischler · 2022-09-30T06:28:25Z

In #549, @marcstober pointed out that some combined characters are not technically ligatures (looked up in "gsub"), but rather diacritics. Those don't need to be substituted, but they follow special placement rules found in "gpos". This is particularly important (and tricky) when several of them need to be combined with a single base character, which may require them to be stacked on top of each other. Example scripts that require this are Hebrew and Thai.
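
As a rough illustration of the grouping problem only, the sketch below attaches each combining mark to the base character before it, using the Unicode general category from the standard library. The function name `split_clusters` is made up for this example. It says nothing about where the marks end up on the page; those offsets would still have to come from "gpos".

```python
import unicodedata


def split_clusters(text):
    """Group each base character with the combining marks that follow it."""
    clusters = []
    for ch in text:
        # General categories Mn, Mc and Me are the Unicode mark categories.
        is_mark = unicodedata.category(ch).startswith("M")
        if is_mark and clusters:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


# Hebrew: shin + dagesh + shin dot + qamats, followed by lamed.
hebrew = "\u05E9\u05BC\u05C1\u05B8\u05DC"
# Thai: ko kai + sara i + mai ek, followed by no nu.
thai = "\u0E01\u0E34\u0E48\u0E19"

hebrew_clusters = split_clusters(hebrew)
thai_clusters = split_clusters(thai)
```

In both samples the first cluster carries more than one mark on a single base character, which is the stacking case described above.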

I don't think it is realistic to handle all those special cases directly in _render_styled_text_line(). A more practical approach might be to turn the GlyphGroup outlined above into a more elaborate class than just a NamedTuple. It should have a .render() method that returns a completed string containing all the glyph and placement instructions to be inserted into the PDF stream. There can be several subclasses dealing with either standard Unicode glyphs, ligatures, or glyph/diacritic combinations. This assumes that ligature substitution and diacritic stacking will not apply to the same input characters, or things may get even more involved...

As a basis for such functionality, an initial refactoring might introduce a generic GlyphGroup class, which just passes through all characters, resulting in the same output as the current code. After that, subclasses that combine and substitute glyphs as needed can be implemented one by one.
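
A minimal sketch of that class layout might look as follows. It assumes the font is embedded so that text is written as two-byte glyph codes in a hex string, which is an assumption of this example rather than a statement about the existing code. The subclass name `LigatureGroup` and all glyph indices are hypothetical. The ligature variant wraps its output in a marked-content sequence with an ActualText entry, the mechanism that PDF specs section 14.9.4 describes for replacement text.

```python
class GlyphGroup:
    """Generic group: passes its characters through as glyphs unchanged."""

    def __init__(self, chars, glyph_ids):
        self.chars = chars          # the original Unicode characters
        self.glyph_ids = glyph_ids  # glyph codes to write (assumed two bytes each)

    def render(self):
        # Hex string operand followed by the Tj (show text) operator.
        hex_codes = "".join(f"{gid:04X}" for gid in self.glyph_ids)
        return f"<{hex_codes}> Tj"


class LigatureGroup(GlyphGroup):
    """Several characters replaced by substituted glyphs."""

    def render(self):
        # ActualText carries the original characters as UTF-16BE with a BOM,
        # so a viewer can return them when the text is copied.
        actual = "".join(self.chars).encode("utf-16-be").hex().upper()
        return f"/Span <</ActualText <FEFF{actual}>>> BDC {super().render()} EMC"


plain = GlyphGroup(("n",), (12,))
ligature = LigatureGroup(("f", "i"), (900,))
stream_part = " ".join(group.render() for group in (ligature, plain))
```

A diacritic-stacking subclass would follow the same pattern, emitting positioning operators between the glyphs instead of a single Tj.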

Lucas-C · 2023-08-02T10:46:04Z

@andersonhc's PR #820 was merged today.

Could you test if that solved your issue @gmischler?

You can install fpdf2 directly from the master branch of this repo with this command:

pip install git+https://github.com/PyFPDF/fpdf2.git@master

The documentation is there: https://pyfpdf.github.io/fpdf2/TextShaping.html
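
For a quick test after installing from master, something along these lines should work. It is a sketch based on the `set_text_shaping()` method described on the documentation page linked above. The function name `write_shaped_pdf`, the font family name and the paths are placeholders, and text shaping additionally requires the uharfbuzz package to be installed.

```python
def write_shaped_pdf(font_path, text, out_path):
    """Write `text` to a one-page PDF with text shaping enabled."""
    # Imported here so the sketch can be loaded even without fpdf2 installed.
    from fpdf import FPDF

    pdf = FPDF()
    pdf.add_page()
    # "shaped" is an arbitrary family name; font_path must point to a TTF
    # file that covers the script being tested.
    pdf.add_font(family="shaped", fname=font_path)
    pdf.set_font("shaped", size=24)
    pdf.set_text_shaping(True)  # needs the uharfbuzz package
    pdf.multi_cell(0, 12, text)
    pdf.output(out_path)
```

Calling it with a font for the script in question and a sample string that previously rendered jumbled gives a direct before/after comparison.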

gmischler · 2023-08-07T19:13:52Z

As far as I can determine, this is now fixed.

gmischler added the enhancement label Sep 15, 2022

gmischler mentioned this issue Sep 16, 2022

I am trying to use Tamil font (ttf). But the letters are jumbled up. #541

Closed

Lucas-C added unicode font up-for-grabs hacktoberfest labels Sep 19, 2022

Lucas-C mentioned this issue Sep 19, 2022

Hebrew combining diacritics aren't positioned correctly #549

Closed

Lucas-C mentioned this issue Feb 6, 2023

Shaping Thai text support #679

Closed

eroux mentioned this issue Feb 22, 2023

I cannot render Khmer Unicode Properly in PDF file. #700

Closed

andersonhc mentioned this issue Jun 14, 2023

Text shaping #820

Merged

Lucas-C added the pending-answer label Aug 2, 2023

gmischler closed this as completed Aug 7, 2023

Lucas-C added text-shaping and removed pending-answer labels Aug 13, 2023
