Support for Khmer contains errors #35

Closed
kreier opened this issue May 20, 2024 · 26 comments

@kreier
Owner

kreier commented May 20, 2024

In many places I observed a dotted circle in the Khmer words, and it does not look like a Khmer character. Investigating further, it looks like the rendering of Khmer with the NotoSans font does not match the correct writing. To start the investigation I looked at the summary list in the bottom left corner. It should read:

54មនុស្ស
12ចៅក្រម
19ហោរា
53ស្តេច
82រយៈពេល
37ព្រឹត្តិការណ៍
18វត្ថុឬវត្ថុ
80សមាជិកនៃគ្រួសាររបស់ Terah

But instead this is rendered:
image

@kreier kreier added the bug Something isn't working label May 20, 2024
@kreier kreier self-assigned this May 20, 2024
@kreier
Owner Author

kreier commented May 20, 2024

Imported the text above into Word and rendered it with Noto Sans Khmer. It shows the rendering errors.
image

@kreier
Owner Author

kreier commented May 20, 2024

@kreier kreier added help wanted Extra attention is needed question Further information is requested labels May 20, 2024
@kreier
Owner Author

kreier commented May 20, 2024

The problem was added to the official Google Groups Support forum:
https://groups.google.com/g/reportlab-users/c/WHuatWlUUpE

@kreier
Owner Author

kreier commented May 20, 2024

The documentation states: "No special handling at all is needed to work with Asian TrueType fonts."
https://docs.reportlab.com/reportlab/userguide/ch3_fonts/

I think this is true for the CJK glyphs in Chinese, Japanese and Korean and simpler diacritics in combined glyphs like in Vietnamese. But the combination of some other Asian language glyphs as needed in Lao, Khmer and Thai seems not to work correctly yet.

@kreier
Owner Author

kreier commented May 21, 2024

I got a nice response from ChatGPT that goes into some detail. The provided example, three Unicode characters combining a base consonant, a dependent vowel and a diacritic, does work in reportlab. Here is the answer:

In the Khmer script, combined characters are created using a base consonant followed by one or more dependent vowel signs, subscript consonants, and other diacritical marks. This combination process is similar to other scripts that use complex text rendering. When represented in Unicode and encoded in UTF-8, these combined characters follow a specific order to ensure proper display.

Here's how it works:

  1. Base Consonant: Every syllable starts with a base consonant character. Khmer consonants range from U+1780 to U+17A2 in Unicode.
  2. Dependent Vowels: Dependent vowel signs are attached to the base consonant. These vowels range from U+17B6 to U+17C5. They may appear before, above, below, or after the consonant, and their position is managed by the rendering engine based on their Unicode value and the base consonant they follow.
  3. Subscript Consonants: Subscript forms of consonants, used to write consonant clusters, are encoded using a subscript modifier (U+17D2). The subscript consonant itself is in the range U+1780 to U+17A2, same as the regular consonants.
  4. Diacritical Marks: Various diacritics, such as U+17C6 (KHMER SIGN NIKAHIT) and others from U+17C9 to U+17D3, are used to modify the sounds or meanings of the base consonants and vowels.
  5. Ordering: In Unicode, the base consonant comes first, followed by the subscript consonants (if any, marked by the subscript modifier), then any dependent vowels, and finally any additional diacritics.

Example of Unicode Sequence

Let's consider the Khmer syllable "កាំ" (kâm):

  • Base Consonant: ក (KA) - U+1780
  • Dependent Vowel: ាំ (AM) - U+17B6 (dependent vowel ា) + U+17C6 (diacritical mark ំ)

In Unicode, the sequence would be:

U+1780 (ក) 
U+17B6 (ា) 
U+17C6 (ំ)

When encoded in UTF-8, these characters would be represented as follows:

  • U+1780 (ក) in UTF-8 is E1 9E 80
  • U+17B6 (ា) in UTF-8 is E1 9E B6
  • U+17C6 (ំ) in UTF-8 is E1 9F 86

So, the full UTF-8 sequence for the syllable "កាំ" would be:

E1 9E 80 E1 9E B6 E1 9F 86

Rendering

Rendering engines (like those in web browsers or text editors) take this sequence and properly combine and position the characters based on the rules of the Khmer script. The process involves:

  • Identifying the base consonant.
  • Attaching and positioning the dependent vowel signs and diacritics relative to the base consonant.
  • Adjusting the subscript consonants as necessary.

In summary, combined characters in the Khmer script are rendered in Unicode using UTF-8 by encoding the base consonant followed by any necessary subscript consonants, dependent vowels, and diacritics in a specific order, which is then interpreted by the rendering engine to display the proper combined form.
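
The byte values quoted above can be checked with a few lines of Python. This is only a small sketch to verify the encoding of the example syllable; the helper name `utf8_hex` is made up for illustration, and `bytes.hex(" ")` with a separator needs Python 3.8 or newer.

```python
# Encode the example syllable and compare with the byte sequence quoted above.
syllable = "\u1780\u17B6\u17C6"  # KA + vowel sign AA + sign NIKAHIT


def utf8_hex(text: str) -> str:
    """Return the UTF-8 bytes of `text` as space-separated uppercase hex."""
    return text.encode("utf-8").hex(" ").upper()


# One line per code point, then the full sequence.
for char in syllable:
    print(f"U+{ord(char):04X} -> {utf8_hex(char)}")

print(utf8_hex(syllable))
```

The encoding step itself is never the problem here: the bytes are correct whether or not the PDF library later shapes the glyphs properly.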

@kreier
Owner Author

kreier commented May 21, 2024

However, the Khmer syllable "ស្ស" (ssa) is not rendered correctly. Here is the explanation of what it consists of and how it should be rendered:

The Khmer syllable "ស្ស" (ssa) consists of a base consonant followed by a subscript consonant. Here’s a detailed breakdown of the Unicode sequence:

  • Base Consonant: ស (SA) - U+179F
  • Subscript Consonant: ្ស (subscript SA) - U+17D2 (KHMER SIGN COENG) + U+179F (subscript form of SA)

Unicode Sequence

  1. Base Consonant:
     • U+179F (ស)
  2. Subscript Consonant:
     • U+17D2 (KHMER SIGN COENG)
     • U+179F (subscript form of SA)

Full Unicode Sequence

Putting these together, the full Unicode sequence for "ស្ស" is:

U+179F (ស) 
U+17D2 (្) 
U+179F (្ស)

UTF-8 Encoding

To represent this sequence in UTF-8, each Unicode code point is converted to its corresponding UTF-8 byte sequence:

  • U+179F (ស) in UTF-8: E1 9E 9F
  • U+17D2 (្) in UTF-8: E1 9F 92
  • U+179F (subscript SA) in UTF-8: E1 9E 9F

Full UTF-8 Sequence

Combining these, the UTF-8 encoding for the sequence "ស្ស" is:

E1 9E 9F E1 9F 92 E1 9E 9F

Rendering Process

  • Base Consonant: The rendering engine identifies the base consonant ស (U+179F).
  • Subscript Consonant: It recognizes the subscript sign (KHMER SIGN COENG, U+17D2) and attaches the following consonant to the base consonant in its subscript form.
  • Combination: The engine renders the subscript consonant properly positioned under the base consonant.

In summary, the Unicode sequence for "ស្ស" involves a base consonant followed by a subscript sign and another consonant, encoded and rendered according to the rules of the Khmer script. The UTF-8 encoding ensures each character is correctly represented in byte form, which the rendering engine interprets to display the correct combined character.
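
To see which role each code point plays in this sequence, here is a rough sketch using only the standard `unicodedata` module. The function name `describe` and the role labels are my own; the rule "a Khmer letter directly after COENG is a subscript" is a simplification and not a full Khmer syllable parser.

```python
import unicodedata

COENG = "\u17D2"  # KHMER SIGN COENG


def describe(text):
    """Return (code point, Unicode name, role) for each character.

    Simplified rule: a Khmer letter directly after COENG is treated as a
    subscript consonant, any other Khmer letter as a base consonant.
    """
    rows = []
    previous = ""
    for char in text:
        name = unicodedata.name(char, "UNKNOWN")
        if char == COENG:
            role = "coeng"
        elif previous == COENG and name.startswith("KHMER LETTER"):
            role = "subscript consonant"
        elif name.startswith("KHMER LETTER"):
            role = "base consonant"
        else:
            role = "other"
        rows.append((f"U+{ord(char):04X}", name, role))
        previous = char
    return rows


for row in describe("\u179F\u17D2\u179F"):
    print(row)
```

The first and last rows carry the same code point U+179F; only the COENG in between tells a shaping engine to draw the second one in its subscript form.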

@kreier
Owner Author

kreier commented May 21, 2024

This might be related to only a subset of the font being embedded in the PDF; such challenges have been reported. Here is an observation from 2021:

https://groups.google.com/g/reportlab-users/c/mxVz1vxeZCk

Update 30.05.2024: No, it is not related to embedding a subset. That is standard practice (and in a way sometimes necessary, since a single font in a PDF seems to have only up to 256 characters?), and other examples below show that it works just fine.

@kreier
Owner Author

kreier commented May 23, 2024

My example ស្ស interestingly consists of the same consonant twice, with the same Unicode codepoint U+179F, but because we have the indicator U+17D2 (KHMER SIGN COENG) the second one is to be rendered differently. The current reportlab version does not do that.

I tried a different Python package to create a PDF file, PyMuPDF, but to render new pages with non-Latin fonts it uses the fonttools package, and I got the same result. There is actually an open issue from 2021 regarding a similar problem with the character ឃើ: fonttools/fonttools#2387. It actually started in 2020 with Google fonts (all are still open):

It might actually be that this whole problem is related to an old implementation of HarfBuzz for OpenType fonts. My TrueType fonts might be an older subset of these.

@kreier
Owner Author

kreier commented May 23, 2024

The current repository for reportlab (4.2.1) can be found on their website as Mercurial bitbucket: https://hg.reportlab.com/hg-public/reportlab . It is mirrored to GitHub at https://github.com/MrBitBucket/reportlab-mirror

The part that is responsible to render the TrueType fonts (I think) is https://github.com/MrBitBucket/reportlab-mirror/blob/master/src/reportlab/pdfbase/ttfonts.py - last edited 4 months ago by robin. The strings we want to use and embed are 16-bit Unicode characters mentioned in the introduction.

Our example character "ស្ស" will be embedded into a generated PDF file. For both Word and reportlab this results in a 10 kByte file, but the object streams inside the PDF are different, and the reportlab rendering is not correct. LibreOffice creates yet another PDF file with other streams containing the subset of the font file, but it manages to correctly embed this glyph in only a 5 kByte PDF. The character streams are the main source of the size difference. Yet how to get reportlab to render these characters correctly, I have no clue yet.

@kreier
Owner Author

kreier commented May 23, 2024

Another option would be iText - in the iText Core version 8 community edition. But I would have to move from Python to Java or .NET (C#). The Community edition should be open source and it supports Khmer since version 7.0.4 in 2017.

@kreier
Owner Author

kreier commented May 24, 2024 via email

@kreier
Owner Author

kreier commented May 24, 2024

This is the rendered output of both python programs above (not shown in an email reply, not even when edited):

image in Khmer and Sinhala

Correct is:

ស្តេច ហោរា and සමුළුව

@kreier
Owner Author

kreier commented May 29, 2024

The 13 lines of example code to test Khmer and Sinhala are:

# example rendering in some languages
from reportlab.pdfgen import canvas
from reportlab.pdfbase import pdfmetrics
from reportlab.pdfbase.ttfonts import TTFont
matrix = [["Khmer", "King Prophet", "ស្តេច ហោរា"],
          ["Sinhala", "Conference", "සමුළුව"]]
my_canvas = canvas.Canvas("example_reportlab.pdf")
for i in range(len(matrix)):
    pdfmetrics.registerFont(TTFont(matrix[i][0], '../../fonts/Noto' + matrix[i][0] + '.ttf'))
    my_canvas.setFont(matrix[i][0], 32)
    my_canvas.drawString(72, 749-90*i, f"Language {matrix[i][0]}:")
    my_canvas.drawString(72, 713-90*i, f"Word '{matrix[i][1]}' - {matrix[i][2]}") 
my_canvas.save()

image in Khmer and Sinhala

I'll try iText to see if it is an option (with Java or C#). Out of the box these Unicode characters are not rendered, but let's see.

@kreier
Owner Author

kreier commented May 29, 2024

iText renders the same result. In the iText Demo Lab you can create your own pdf document with an embedded Java code editor. I entered the following:

import com.itextpdf.kernel.pdf.*;
import com.itextpdf.layout.Document;
import com.itextpdf.layout.element.Paragraph;
import java.io.*;
import com.itextpdf.kernel.font.PdfFont;
import com.itextpdf.kernel.font.PdfFontFactory;
import com.itextpdf.io.font.PdfEncodings;

public class HelloWorld {
  public static final String DEST = "/myfiles/example_iText.pdf";
  public static final String FONT_KHMER = "/uploads/NotoKhmer.ttf";
  public static final String FONT_SINHALA = "/uploads/NotoSinhala.ttf";
  public static final String KHMER = "ស្តេច ហោរា";
  public static final String SINHALA = "සමුළුව";
  
  public static void main(String args[]) throws IOException {
    PdfDocument pdf = new PdfDocument(new PdfWriter(DEST));
    Document document = new Document(pdf);
    document.setFontSize(30).add(new Paragraph("Language Khmer:"));
    PdfFont fontKhmer = PdfFontFactory.createFont(FONT_KHMER, PdfEncodings.IDENTITY_H);
    document.add(new Paragraph().setFont(fontKhmer).setFontSize(30).add("Word 'King Prophet'  ").add(KHMER));
    
    document.setFontSize(30).add(new Paragraph("\nLanguage Sinhala"));
    PdfFont fontSinhala = PdfFontFactory.createFont(FONT_SINHALA, PdfEncodings.IDENTITY_H);
    document.add(new Paragraph().setFont(fontSinhala).setFontSize(30).add("Word 'Conference'  ").add(SINHALA));
    document.close();
  }
}

The render problems are the same as mentioned above. #35 (comment)

image

@kreier
Owner Author

kreier commented May 29, 2024

And we're getting closer to an answer: many scripts and glyphs are supported in the core of iText (including Russian, Armenian, Greek, Chinese, Japanese, Korean), but my two problem languages Khmer and Sinhala require the pdfCalligraph module. It actually covers 14 scripts for more than 51 languages.

@kreier
Owner Author

kreier commented May 29, 2024

Back to reportlab: An older conversation on the reportlab Google group from 2015 talks about the composite glyph positioning with responses from Glenn Lindermann, Robin Becker and Andy Robinson. And some Unicode and ttf history of reportlab.

@kreier
Owner Author

kreier commented May 30, 2024

Using a shaping engine like harfbuzz does the job. I got it working in fpdf2 after pip install uharfbuzz:

# example rendering Khmer
from fpdf import FPDF
pdf = FPDF()
pdf.add_page()
pdf.add_font("noto", style="", fname="../../fonts/NotoKhmer.ttf")
pdf.set_font('noto', size=32)
pdf.cell(text="King        - ស្តេច")
pdf.ln()
pdf.cell(text="Prophet - ហោរា")
pdf.ln()
pdf.set_text_shaping(use_shaping_engine=True, script="khmr", language="khm")
pdf.cell(text="King        - ស្តេច")
pdf.ln()
pdf.cell(text="Prophet - ហោរា")
pdf.output("example_fpdf.pdf")

Result:

image

Let's see if I can bring this into reportlab.

@timeline24
Collaborator

And with the embedded subset of the font as Type TrueType (CID) with Identity-H encoding, the file is only 7 kByte in size, half the size of the Word and Google Docs solutions.

@kreier
Owner Author

kreier commented May 30, 2024

It looks like fpdf2 had a similar problem in 2022. This issue py-pdf/fpdf2#365 mentions Khmer (py-pdf/fpdf2#700) among Arabic, Hindi and other languages. And with the switch to the Fonttools library and harfbuzz in pull request 447 py-pdf/fpdf2#477 it seems many other issues were resolved. gmischler describes the changed approach in issue 418 py-pdf/fpdf2#418. By 2023 the implementation appears to be stable. @andy-robinson @replabrobin Is it possible to do something similar in reportlab?

In a forum post from 2015 https://groups.google.com/g/reportlab-users/c/scxAhaReanI/m/IYSaDfoH9ZkJ Andy Robinson mentions: "We are trying to work out the right font descriptors and sequences of bytes to put in the PDF file so that the right stuff magically happens on screen." In the same post he describes his work on Japanese in 2002-2003 (that's why my CJK versions have no problem) and that around 2009 an Arabic-speaking employee worked on the project. I could not find a specific reference in the source code, but on Stack Overflow a working solution includes https://pypi.org/project/arabic-reshaper/ and bidi.algorithm

The function to create the embedded subset of the TTF font is part of the https://github.com/MrBitBucket/reportlab-mirror/blob/master/src/reportlab/pdfbase/ttfonts.py file. Is this where the ligature substitutions needed for Khmer, Sinhala and many other languages should be integrated?

@kreier
Owner Author

kreier commented Jun 1, 2024

Here are a few more details of the integration of harfbuzz with uharfbuzz from a proof-of-concept to finalization in early 2023: py-pdf/fpdf2#696

There are also some testfiles linked for Thai. Might be worth checking out, since the Thai script was developed from the Khmer one a few centuries ago.

@kreier
Owner Author

kreier commented Jun 6, 2024

Back to basics. Let's take the simple ឆ្នាំ which translates to 'years'. It consists of 5 codepoints:

  • U+1786 Khmer Letter Cha
  • U+17D2 Khmer Sign Coeng
  • U+1793 Khmer Letter No
  • U+17B6 Khmer Vowel Sign Aa
  • U+17C6 Khmer Sign Nikahit

When I copy and paste the rendered glyphs from the PDF, I get ឆ្នា􀀖 ំ as the result. Codepoints.net finds 1431 codepoints in here. With the help of a little Python program:

text = "ឆ្នាំ"
for char in text:
    print(f"Character '{char}' has codepoint {ord(char):X}")

We get 7 codepoints. It seems like this is the sequence the shaping engine produced:

  • Character 'ឆ' has codepoint 1786
  • Character '្' has codepoint 17D2
  • Character 'ន' has codepoint 1793
  • Character 'ា' has codepoint 17B6
  • Character '􀀖' has codepoint 100016
  • Character ' ' has codepoint 20
  • Character 'ំ' has codepoint 17C6

The first 4 codepoints are unchanged, but then '100016' and '20' are inserted before the final one.
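
The extra codepoint U+100016 lies in the Supplementary Private Use Area-B, so it has no standard meaning. A small sketch with the standard `unicodedata` module can flag such codepoints in text copied out of a PDF. The function name `flag_private_use` is hypothetical, and reading the private-use codepoint as a leftover of the PDF's glyph-to-text mapping is my assumption, not something confirmed here.

```python
import unicodedata


def flag_private_use(text):
    """Return (code point, general category, is_private_use) per character.

    Unicode general category "Co" marks private-use code points.
    """
    rows = []
    for char in text:
        category = unicodedata.category(char)
        rows.append((f"U+{ord(char):04X}", category, category == "Co"))
    return rows


# The 7 code points obtained by copying the rendered text out of the PDF.
extracted = "\u1786\u17D2\u1793\u17B6\U00100016 \u17C6"

for code_point, category, is_private in flag_private_use(extracted):
    marker = "  <- private use" if is_private else ""
    print(f"{code_point} {category}{marker}")
```

Filtering on category "Co" is a quick way to tell characters that came from the original string apart from ones introduced somewhere in the PDF pipeline.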

@replabrobin

Hi Matthias, I'm having difficulty emailing you directly. It seems you post in a Google 'reportlab-users' group. Our official mailing list is not run by me, but has the address https://two.pairlist.net/pipermail/reportlab-users/. I imagine you would like us to support proper harfbuzz shaping etc.

I would like to integrate uharfbuzz into the reportlab paragraph code, but there are a number of issues which I don't yet have solutions for.

I have no experience with the Khmer codes, but when I tried your example above I didn't get the same outcome. After shaping I get only three outputs, so the code below produces

uni178617B6 gid248=0@923,0+923
uni17D21793 gid209=0@0,-26+0
uni17C6 gid137=0@0,-29+0

#!/bin/env python
import uharfbuzz as hb

if False:
	import sys
	fontfile = sys.argv[1]
	text = sys.argv[2]
else:
	fontfile = '/home/robin/devel/reportlab/REPOS/reportlab/tmp/NotoSansKhmer/NotoSansKhmer-Regular.ttf'
	#1786 Khmer Letter Cha
	#17D2 Khmer Sign Coeng
	#1793 Khmer Letter No
	#17B6 Khmer Vowel Sign Aa
	#17C6 Khmer Sign Nikahit
	text = '\u1786\u17D2\u1793\u17B6\u17C6'

blob = hb.Blob.from_file_path(fontfile)
face = hb.Face(blob)
font = hb.Font(face)

buf = hb.Buffer()
buf.add_str(text)
buf.guess_segment_properties()

features = {"kern": True, "liga": True}
hb.shape(font, buf, features)

infos = buf.glyph_infos
positions = buf.glyph_positions

for info, pos in zip(infos, positions):
	gid = info.codepoint
	glyph_name = font.glyph_to_string(gid)
	cluster = info.cluster
	x_advance = pos.x_advance
	x_offset = pos.x_offset
	y_offset = pos.y_offset
	print(f"{glyph_name} gid{gid}={cluster}@{x_advance},{y_offset}+{x_advance}")

@kreier
Owner Author

kreier commented Jun 10, 2024

Hi Robin @replabrobin,

Thanks for answering here. Yes, it would be great if harfbuzz could be integrated into reportlab!! I tried to sign up for the email list but got no response. And I posted some questions in the Google group, but this group probably needs some cleanup. Anyway, back to the question of the shaping engine.

I think the Khmer glyph "ឆ្នាំ" is a good example (it means years); in Unicode it is represented by the 5 codepoints '\u1786\u17D2\u1793\u17B6\u17C6'. Without font shaping, placing the glyphs for the 5 codepoints one after another using their individual widths does not give the correct final glyph. I tried to combine the result you got from uharfbuzz with NotoSansKhmer (I got the same results) and https://fontdrop.info/. Now there are only three glyphs, plus some additional information about how to shift them within the combined glyph:

image

Above are the 3 glyph names uni178617B6, uni17D21793 and uni17C6. Only the last one corresponds to a single Unicode codepoint; the others exist only inside the font as glyphs. Since the individual glyphs have to be positioned correctly, it's not possible to just pass the string of updated glyphs to be included in the PDF; each glyph has to be put in the correct position by the Python script that writes the glyphs into the PDF.

I guess some glyph positioning is already integrated in reportlab; now it additionally needs to process the position output from harfbuzz, not just the information included in the font for each glyph.

I'm sure this will be a considerable effort to integrate - I've seen a little of the work done at fpdf2 in the last 2 years - but maybe I can at least help a little with beta-testing. Just recently a small bug was fixed py-pdf/fpdf2#1187

@kreier
Owner Author

kreier commented Jun 11, 2024

In the post to fpdf2 mentioned above gmischler explains the steps fpdf2 takes to integrate a Unicode string into the correct sequence of glyphs. He wrote:

  1. fpdf2 accepts a sequence of characters, and passes it to pyharfbuzz.
  2. pyharfbuzz converts the python string to a C structure and passes it to harfbuzz.
  3. harfbuzz consults the font file, combines character sequences into glyph clusters, and adds the width information given in the font file to each cluster.
  4. pyharfbuzz converts the result back into python data
  5. fpdf2 uses the returned width information for line wrapping, and adds the resulting line data into the PDF stream.
  6. A PDF viewer reads that stream, and needs to figure out where to place the glyphs on the page.

I think it should be a similar sequence for reportlab.

And I found the value of advance width of mark attached glyphs. The first returned glyph from harfbuzz uni178617B6 actually has a width of 923, as indicated with the response "uni178617B6 gid248=0@923,0+923". It can be seen with https://www.glyphrstudio.com/app/

image

As requested by your code print(f"{glyph_name} gid{gid}={cluster}@{x_advance},{y_offset}+{x_advance}"), the value of x_advance for the glyph is returned (zero for the next two, uni17D21793 and uni17C6).

image

The {y_offset} values indicate that their location should be slightly adjusted in the final glyph. Not sure how this caused a problem in the fpdf2 string_width calculation, since the advanceWidth values are 923, 0 and 0 and it looks like 923 is the correct value.
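
To make the width arithmetic concrete, here is a minimal sketch that sums the advance widths reported above and scales them to a font size. The function name `string_width` is made up, and `units_per_em=1000` is an assumption about the font; the real value should be read from the font's head table.

```python
def string_width(advances, font_size, units_per_em=1000):
    """Sum glyph advances (in font units) and scale them to points.

    units_per_em=1000 is an assumed default; check the actual font.
    """
    return sum(advances) * font_size / units_per_em


# Advance widths of the three shaped glyphs reported by harfbuzz above.
advances = [923, 0, 0]

width = string_width(advances, font_size=32)
print(f"{sum(advances)} font units, {width:.3f} pt at size 32")
```

Only the base glyph contributes to the width; the two attached glyphs have zero advance, so they do not push the following text further to the right.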

@kreier
Owner Author

kreier commented Jun 11, 2024

I changed the last line of your code to print the offset for x and y as returned by harfbuzz

print(f"{glyph_name} \t gid{gid}={cluster} \t advanceWidth: {x_advance} \t offset x:{x_offset} y:{y_offset}")

The output is

uni178617B6      gid248=0        advanceWidth: 923       offset x:0 y:0
uni17D21793      gid209=0        advanceWidth: 0         offset x:-296 y:-26
uni17C6          gid137=0        advanceWidth: 0         offset x:47 y:-29

Which verifies the shifted location of the two additional glyphs seen in the combined glyph when rendered as ឆ្នាំ in two posts above.
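
As a rough sketch of how these numbers could be turned into drawing positions: each glyph is drawn at the current pen position plus its offset, and the pen then moves on by the glyph's advance. The function name `place_glyphs` is hypothetical, values are in font units, and vertical advances are assumed to be zero (horizontal text only).

```python
# Shaped output from above: (glyph name, x_advance, x_offset, y_offset).
shaped = [
    ("uni178617B6", 923, 0, 0),
    ("uni17D21793", 0, -296, -26),
    ("uni17C6", 0, 47, -29),
]


def place_glyphs(glyphs, origin_x=0, origin_y=0):
    """Return absolute glyph positions and the total advance.

    Each glyph is drawn at pen + offset; afterwards the pen advances by
    x_advance. Assumes horizontal text, so y_advance is ignored.
    """
    pen_x = origin_x
    placed = []
    for name, x_advance, x_offset, y_offset in glyphs:
        placed.append((name, pen_x + x_offset, origin_y + y_offset))
        pen_x += x_advance
    return placed, pen_x - origin_x


positions, total_advance = place_glyphs(shaped)
for name, x, y in positions:
    print(f"{name:12s} x={x:5d} y={y:4d}")
print("total advance:", total_advance)
```

A PDF writer would still have to scale these font units to the chosen font size and emit one positioning operation per glyph; that part is left out of the sketch.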

@kreier
Owner Author

kreier commented Jul 25, 2024

This problem is addressed in the official reportlab forum: https://groups.google.com/g/reportlab-users/c/WHuatWlUUpE

For me this is solved now after switching to fpdf2. I might return to reportlab in the future when the font shape engine is implemented.

@kreier kreier closed this as completed Jul 25, 2024