Support for Khmer contains errors #35
The problem was recorded on Stackoverflow: |
The problem was added to the official Google Groups Support forum: |
The documentation states: "No special handling at all is needed to work with Asian TrueType fonts." I think this is true for the CJK glyphs in Chinese, Japanese and Korean and simpler diacritics in combined glyphs like in Vietnamese. But the combination of some other Asian language glyphs as needed in Lao, Khmer and Thai seems not to work correctly yet. |
I got a nice response from ChatGPT that goes into some detail. The provided example, three Unicode characters combining a base consonant, a dependent vowel and subscript consonants, does work in reportlab. Here is the answer:
In the Khmer script, combined characters are created using a base consonant followed by one or more dependent vowel signs, subscript consonants, and other diacritical marks. This combination process is similar to other scripts that use complex text rendering. When represented in Unicode and encoded in UTF-8, these combined characters follow a specific order to ensure proper display. Here's how it works:
Example of Unicode Sequence
Let's consider the Khmer syllable "កាំ" (kâm):
In Unicode, the sequence would be:
- U+1780 (ក)
- U+17B6 (ា)
- U+17C6 (ំ)
When encoded in UTF-8, these characters would be represented as follows:
So, the full UTF-8 sequence for the syllable "កាំ" would be: E1 9E 80 E1 9E B6 E1 9F 86
Rendering
Rendering engines (like those in web browsers or text editors) take this sequence and properly combine and position the characters based on the rules of the Khmer script. The process involves:
In summary, combined characters in the Khmer script are rendered in Unicode using UTF-8 by encoding the base consonant followed by any necessary subscript consonants, dependent vowels, and diacritics in a specific order, which is then interpreted by the rendering engine to display the proper combined form. |
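The code points and UTF-8 bytes quoted above can be checked with a few lines of Python. This is only a minimal sketch using the standard library; the helper name `describe` is an illustrative choice and not part of reportlab or any other package mentioned here.

```python
import unicodedata


def describe(text):
    """Return (code point, Unicode name, UTF-8 bytes) for each character."""
    rows = []
    for ch in text:
        utf8 = ch.encode("utf-8").hex(" ").upper()  # bytes.hex(sep) needs Python 3.8+
        rows.append((f"U+{ord(ch):04X}", unicodedata.name(ch), utf8))
    return rows


syllable = "\u1780\u17B6\u17C6"  # the syllable "កាំ" from the example above

for code_point, name, utf8 in describe(syllable):
    print(code_point, name, utf8)

# The full byte sequence of the whole syllable
print(syllable.encode("utf-8").hex(" ").upper())
```

Each Khmer code point takes three bytes in UTF-8. The encoding itself says nothing about how the glyphs are combined on the page, which is the part that goes wrong in the generated pdf.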
However, the Khmer syllable "ស្ស" (ssa) is not rendered correctly. Here is the explanation of what it consists of and how it should be rendered:
The Khmer syllable "ស្ស" (ssa) consists of a base consonant followed by a subscript consonant. Here’s a detailed breakdown of the Unicode sequence:
Unicode Sequence
Full Unicode Sequence
Putting these together, the full Unicode sequence for "ស្ស" is:
- U+179F (ស)
- U+17D2 (្)
- U+179F (្ស)
UTF-8 Encoding
To represent this sequence in UTF-8, each Unicode code point is converted to its corresponding UTF-8 byte sequence:
Full UTF-8 Sequence
Combining these, the UTF-8 encoding for the sequence "ស្ស" is: E1 9E 9F E1 9F 92 E1 9E 9F
Rendering Process
In summary, the Unicode sequence for "ស្ស" involves a base consonant followed by a subscript sign and another consonant, encoded and rendered according to the rules of the Khmer script. The UTF-8 encoding ensures each character is correctly represented in byte form, which the rendering engine interprets to display the correct combined character. |
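To see which characters in a string are affected, one can look for consonants that directly follow U+17D2 (KHMER SIGN COENG). The sketch below is not a shaper; it only lists the consonants that a shaping engine would have to draw in subscript form. The function names are made up for illustration, and the consonant range U+1780 to U+17A2 is taken from the Unicode Khmer block.

```python
COENG = "\u17D2"  # KHMER SIGN COENG, marks the following consonant as subscript


def is_khmer_consonant(ch):
    """True for the Khmer consonants KA (U+1780) through QA (U+17A2)."""
    return "\u1780" <= ch <= "\u17A2"


def subscript_consonants(text):
    """Return (index, consonant) for every consonant that follows a COENG."""
    found = []
    for i in range(len(text) - 1):
        if text[i] == COENG and is_khmer_consonant(text[i + 1]):
            found.append((i + 1, text[i + 1]))
    return found


ssa = "\u179F\u17D2\u179F"  # "ស្ស": SA, COENG, SA
for index, consonant in subscript_consonants(ssa):
    print(f"index {index}: U+{ord(consonant):04X} must be drawn as a subscript")
```

For "ស្ស" this reports the second U+179F only, although both consonants share the same code point. A library that maps each code point to one fixed glyph cannot make that distinction.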
One reported suspicion was that the problem is related to only a subset of the font being embedded in the pdf. Here is an observation from 2021: https://groups.google.com/g/reportlab-users/c/mxVz1vxeZCk Update 30.05.2024: no, it is not related to embedding a subset. That is standard practice (and in a way sometimes necessary, since a single font in pdf seems to have only up to 256 characters?), and other examples below show that it works just fine. |
My example ស្ស interestingly consists of two of the same consonants, with the same Unicode U+179F, but because we have the indicator U+17D2 (KHMER SIGN COENG) the second one is to be rendered differently. The current reportlab version does not do that. I tried a different Python package to create a PDF file, PyMuPDF, but in order to render new pages with non-Latin fonts it uses the package fonttools. And I got the same result. There is actually an open issue from 2021 regarding a similar problem with the character ឃើ : fonttools/fonttools#2387 and it actually started in 2020 with Google fonts (all are still open):
It might actually be that this whole problem is related to an old implementation of HarfBuzz for OpenType fonts. My TrueType fonts might be an older subset of these. |
The current repository for reportlab (4.2.1) can be found on their website as a Mercurial repository: https://hg.reportlab.com/hg-public/reportlab . It is mirrored to GitHub at https://github.com/MrBitBucket/reportlab-mirror The part that is responsible for rendering the TrueType fonts (I think) is https://github.com/MrBitBucket/reportlab-mirror/blob/master/src/reportlab/pdfbase/ttfonts.py - last edited 4 months ago by robin. The strings we want to use and embed are the 16-bit Unicode characters mentioned in the introduction. Our example character "ស្ស" will be embedded into a generated pdf file. For both Word and reportlab this results in a 10 kByte file, but the object streams inside the pdf are different, and the reportlab rendering is not correct. LibreOffice creates yet another pdf file with other streams containing the subset of the font file, but it manages to correctly embed this glyph in only a 5 kByte pdf. The character streams are the main source of the size difference. Yet, how to get reportlab to correctly render these characters - I have no clue yet. |
Another option would be iText - in the iText Core version 8 community edition. But I would have to move from Python to Java or .NET (C#). The Community edition should be open source and it supports Khmer since version 7.0.4 in 2017. |
Hi Khaled,
Thanks for looking into this. I know too little about Unicode code points, glyph indices and font subsetting. With this timeline project I try to learn on the fly. I noticed some glyphs in Khmer and Sinhala were rendered differently in the generated pdf than in the browser, editor or even Word. To investigate further I created two test scripts, one with reportlab and one with pymupdf & fonttools:
- https://github.com/kreier/timeline/blob/4.6/python/test/problem_km_si_ar_th/example_reportlab.py
- https://github.com/kreier/timeline/blob/4.6/python/test/problem_km_si_ar_th/example_pymupdf_fonttools.py
Both create the same flawed glyph combination (instead of ស្តេច ហោរា and සමුළුව):
![image in Khmer and Sinhala](https://github.com/kreier/timeline/assets/43933271/f8143ba9-d5ee-4b37-ac0e-7b04986721eb)
In [an earlier attempt](https://github.com/kreier/timeline/blob/4.6/python/test/PyMuPDF/hello_world2.py) I tried to set the option and create a subset with the pymupdf and fonttools version. I got some error messages when trying to set layoutFeatures or text (as "not supported"). Probably a syntax error on my side with this library. And while the `khmer_unicode_range = range(0x1780, 0x1800)` and `subsetter` option is in the program, the created pdf states the font as _Embedded_, not _Embedded subset_.
In the earlier mentioned example I did not use the subset generation in the fonttools example and got a larger file (125 kByte) compared to the reportlab version (22 kByte). Acrobat Reader states that the reportlab version contains an Embedded Subset of NotoSans Khmer and Sinhala, while the larger fonttools version only states that these two fonts are embedded, not a Subset.
To me that's an indication that the problem is more than a flawed subset generation, since the fonttools version has the complete font embedded, but the rendering is still flawed in the same way. So there are no missing glyphs in the embedded font. When I highlight the rendered text and copy/paste it into an editor/web browser/Word I get the correct content. So I think the Unicode code points are unchanged, even though the glyph indices are not correct, right? I think this is part of the philosophy to have them separate in TrueType? Again, I know close to nothing about it. Maybe you can help me with this one.
And thanks for having a look at the "proof of concept" Arabic version of my project. The utf-8 strings are not yet passed through the `arabic_reshaper` and `bidi.algorithm.get_display` packages (just something I found on stackoverflow). It's just replacing the english string and sending it to the renderer of reportlab. Surprisingly when highlighting some parts of the text it has some RTL behaviour as on websites or programs that are in RTL. So including reshaper and other converters will be one of the further steps to actually have an Arabic version. And definitely a native speaker to check the translation, Azure Translator and Google made enough mistakes in the languages I do speak a little or have friends knowing them.
Again, thanks for taking time to look at this - and maybe you can help me with the current Khmer and Sinhala rendering problems.
Matthias
On Thu, 23 May 2024 at 13:00, خالد حسني (Khaled Hosny) <
***@***.***> wrote:
… The HarfBuzz and FontTools issues are related to subsetting using Unicode
code points as input. This kind of subsetting is typically to make fully
functional fonts with smaller character set (e.g. used for web fonts to
serve smaller files that cover only the page content). PDF subsetting is a
lot simpler since for PDF only glyphs used are needed and subsetting uses
glyph indices as input not Unicode code points.
The problem seems to be that the tool/library used to generate the PDFs do
not do proper text layout. The Arabic PDF
<https://timeline24.github.io/timeline_ar.pdf> linked from the README is
completely unreadable, the text is set left-to-right (it should be
right-to-left) and letter that should join/change shape are not joined.
There are even characters missing from the font and are rendered as empty
boxes (which suggests no font fallback is performed, which is another
totally different issue).
|
The 13 lines of example code to test Khmer and Sinhala are:
```python
# example rendering in some languages
from reportlab.pdfgen import canvas
from reportlab.pdfbase import pdfmetrics
from reportlab.pdfbase.ttfonts import TTFont
matrix = [["Khmer", "King Prophet", "ស្តេច ហោរា"],
          ["Sinhala", "Conference", "සමුළුව"]]
my_canvas = canvas.Canvas("example_reportlab.pdf")
for i in range(len(matrix)):
    pdfmetrics.registerFont(TTFont(matrix[i][0], '../../fonts/Noto' + matrix[i][0] + '.ttf'))
    my_canvas.setFont(matrix[i][0], 32)
    my_canvas.drawString(72, 749-90*i, f"Language {matrix[i][0]}:")
    my_canvas.drawString(72, 713-90*i, f"Word '{matrix[i][1]}' - {matrix[i][2]}")
my_canvas.save()
```
I'll try iText to see if it is an option (with Java or C#). Out of the box these Unicode characters are not rendered, but let's see. |
iText renders the same result. In the iText Demo Lab you can create your own pdf document with an embedded Java code editor. I entered the following:
```java
import com.itextpdf.kernel.pdf.*;
import com.itextpdf.layout.Document;
import com.itextpdf.layout.element.Paragraph;
import java.io.*;
import com.itextpdf.kernel.font.PdfFont;
import com.itextpdf.kernel.font.PdfFontFactory;
import com.itextpdf.io.font.PdfEncodings;

public class HelloWorld {
    public static final String DEST = "/myfiles/example_iText.pdf";
    public static final String FONT_KHMER = "/uploads/NotoKhmer.ttf";
    public static final String FONT_SINHALA = "/uploads/NotoSinhala.ttf";
    public static final String KHMER = "ស្តេច ហោរា";
    public static final String SINHALA = "සමුළුව";

    public static void main(String args[]) throws IOException {
        PdfDocument pdf = new PdfDocument(new PdfWriter(DEST));
        Document document = new Document(pdf);
        document.setFontSize(30).add(new Paragraph("Language Khmer:"));
        PdfFont fontKhmer = PdfFontFactory.createFont(FONT_KHMER, PdfEncodings.IDENTITY_H);
        document.add(new Paragraph().setFont(fontKhmer).setFontSize(30).add("Word 'King Prophet' ").add(KHMER));
        document.setFontSize(30).add(new Paragraph("\nLanguage Sinhala"));
        PdfFont fontSinhala = PdfFontFactory.createFont(FONT_SINHALA, PdfEncodings.IDENTITY_H);
        document.add(new Paragraph().setFont(fontSinhala).setFontSize(30).add("Word 'Conference' ").add(SINHALA));
        document.close();
    }
}
```
The render problems are the same as mentioned above. #35 (comment) |
And we're getting closer to an answer: Many scripts and Glyphs are supported in core of iText (including Russian, Armenian, Greek, Chinese, Japanese, Korean) but my two problem languages Khmer and Sinhala require the module pdfCalligraph. It's actually 14 scripts for more than 51 languages. |
Back to reportlab: An older conversation on the reportlab Google group from 2015 talks about the composite glyph positioning with responses from Glenn Lindermann, Robin Becker and Andy Robinson. And some Unicode and ttf history of reportlab. |
Using a shaping engine like harfbuzz does the job. I got it working in fpdf2:
```python
# example rendering Khmer
from fpdf import FPDF
pdf = FPDF()
pdf.add_page()
pdf.add_font("noto", style="", fname="../../fonts/NotoKhmer.ttf")
pdf.set_font('noto', size=32)
# first two cells: rendered without text shaping
pdf.cell(text="King - ស្តេច")
pdf.ln()
pdf.cell(text="Prophet - ហោរា")
pdf.ln()
# same text again, this time with the shaping engine switched on
pdf.set_text_shaping(use_shaping_engine=True, script="khmr", language="khm")
pdf.cell(text="King - ស្តេច")
pdf.ln()
pdf.cell(text="Prophet - ហោរា")
pdf.output("example_fpdf.pdf")
```
Result: Let's see if I can import this into reportlab. |
And with the embedded subset of the font in Type TrueType (CID) and encoding: Identity-H it has only 7 kByte size, half the size of the Word and Google Docs solution. |
It looks like fpdf2 had a similar problem in 2022. The issue py-pdf/fpdf2#365 mentions Khmer (py-pdf/fpdf2#700) among Arabic, Hindi and other languages. And with the switch to the fonttools library and harfbuzz in pull request 447 py-pdf/fpdf2#477 it seems many other issues were resolved. gmischler describes the changed approach in issue 418 py-pdf/fpdf2#418. By 2023 the implementation appears to be stable. @andy-robinson @replabrobin Is it possible to do something similar in reportlab? In a forum post from 2015 https://groups.google.com/g/reportlab-users/c/scxAhaReanI/m/IYSaDfoH9ZkJ Andy Robinson mentions: "We are trying to work out the right font descriptors and sequences of bytes to put in the PDF file so that the right stuff magically happens on screen." In the same post he describes his work on Japanese in 2002-2003 (that's why my CJK versions have no problem) and that around 2009 an Arabic speaking employee worked on the project. I could not find a specific reference in the source code, but on Stack Overflow a working solution includes https://pypi.org/project/arabic-reshaper/ and bidi.algorithm. The function to create the embedded subset of the TTF font is part of the https://github.com/MrBitBucket/reportlab-mirror/blob/master/src/reportlab/pdfbase/ttfonts.py file. Is this where the ligature substitutions needed for Khmer, Sinhala and many other languages should be integrated? |
Here are a few more details of the integration of harfbuzz with uharfbuzz from a proof-of-concept to finalization in early 2023: py-pdf/fpdf2#696 There are also some testfiles linked for Thai. Might be worth checking out, since the Thai script was developed from the Khmer one a few centuries ago. |
Back to basics. Let's take the simple ឆ្នាំ which translates to 'years'. It consists of 5 codepoints:
When I copy/paste the rendered glyphs from the pdf I get ឆ្នា ំ as result. Codepoints.net finds 1431 codepoints in here. With the help of a little python program:
```python
text = "ឆ្នាំ"
for char in text:
    print(f"Character '{char}' has codepoint {ord(char):X}")
```
We get 7 codepoints. Seems like this is the sequence the shape engine produced:
The first 4 codepoints are unchanged, but then '100016' and '20' are integrated. |
Hi Matthias, I am having difficulty emailing you directly. It seems you post in a Google 'reportlab-users' group. Our official mailing list is not run by me, but has the address https://two.pairlist.net/pipermail/reportlab-users/. I imagine you would like us to support proper harfbuzz shaping etc. I would like to integrate uharfbuzz into the reportlab paragraph code, but there are a number of issues which I don't yet have solutions for. I have no experience of the Khmer codes, but when I tried your example above I didn't get the same outcome: after shaping I get only three outputs, so the code below produces uni178617B6 gid248=0@923,0+923
|
Hi Robin @replabrobin, Thanks for answering here. Yes, it would be great if harfbuzz could be integrated into reportlab!! I tried to sign up for the email list but got no response. And I posted some questions at the Google group, but this group probably needs some cleanup. Anyway, back to the question of the shaping engine. I think the Khmer glyph "ឆ្នាំ" is a good example (it means years), in Unicode represented with 5 codepoints '\u1786\u17D2\u1793\u17B6\u17C6'. Without font shaping, the 5 codepoints combined with the font's glyphs and their individual widths do not give the correct final glyph. I tried to combine the result you got from uharfbuzz with NotoSansKhmer (I got the same results) and https://fontdrop.info/. Now it is only three codepoints, and some additional information about how to shift the glyphs in the combined glyph: Above are the 3 glyph points uni178617B6, uni17D21793 and uni17C6. Only the last one is a Unicode codepoint, the others only exist inside the font as glyph points. Since the individual glyphs have to be correctly positioned, it's not possible just to pass the string of updated glyph points to be included in the pdf; each glyph has to be put in the correct position by the python script that puts the glyphs in the pdf. I guess there is already some glyph positioning integrated in reportlab; now it needs to additionally process the position output from harfbuzz for each glyph, not just the information included in the font for each glyph. I'm sure this will be a considerable effort to integrate - I've seen a little of the work done at fpdf2 in the last 2 years - but maybe I can at least help a little with beta-testing. Just recently a small bug was fixed py-pdf/fpdf2#1187 |
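The three glyph names returned by the shaper can be mapped back to the five input codepoints. The sketch below assumes the font follows the common "uni" + groups of four hex digits naming convention, which these names suggest; glyph names are internal to a font and other fonts may name their glyphs differently. The function name is made up for illustration.

```python
import re


def codepoints_from_glyph_name(name):
    """Decode a 'uniXXXXYYYY...' style glyph name into a list of code points."""
    match = re.fullmatch(r"uni((?:[0-9A-F]{4})+)", name)
    if match is None:
        raise ValueError(f"not a 'uni' glyph name: {name!r}")
    digits = match.group(1)
    return [int(digits[i:i + 4], 16) for i in range(0, len(digits), 4)]


text = "\u1786\u17D2\u1793\u17B6\u17C6"  # the 5 codepoints of "ឆ្នាំ"
shaped = ["uni178617B6", "uni17D21793", "uni17C6"]  # glyph names reported above

covered = []
for glyph_name in shaped:
    codepoints = codepoints_from_glyph_name(glyph_name)
    covered.extend(codepoints)
    print(glyph_name, "->", " ".join(f"U+{cp:04X}" for cp in codepoints))

# Every input codepoint is accounted for, but in a different order
print(sorted(covered) == sorted(ord(ch) for ch in text))
```

So the first glyph is a ligature of the base consonant U+1786 and the vowel sign U+17B6, and the second one combines the COENG with U+1793. The shaper both merges and reorders the input, which a one-codepoint-to-one-glyph mapping cannot reproduce.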
In the post to fpdf2 mentioned above gmischler explains the steps fpdf2 takes to integrate a Unicode string into the correct sequence of glyphs. He wrote:
I think it should be a similar sequence for reportlab. And I found the value of As requested by your code The |
I changed the last line of your code to print the offset for x and y as returned by harfbuzz:
```python
print(f"{glyph_name} \t gid{gid}={cluster} \t advanceWidth: {x_advance} \t offset x:{x_offset} y:{y_offset}")
```
The output is:
```
uni178617B6   gid248=0   advanceWidth: 923   offset x:0 y:0
uni17D21793   gid209=0   advanceWidth: 0     offset x:-296 y:-26
uni17C6       gid137=0   advanceWidth: 0     offset x:47 y:-29
```
This verifies the shifted location of the two additional glyphs seen in the combined glyph when rendered as ឆ្នាំ in two posts above. |
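To turn these values into drawing positions, a renderer keeps a pen position, draws each glyph at pen plus offset, and then moves the pen by the advance. The sketch below uses the three values from the output above and the usual HarfBuzz convention for advances and offsets. `ShapedGlyph` and `place_glyphs` are hypothetical names, not reportlab or uharfbuzz API, and `scale` stands for font size divided by the font's units per em, which would have to be read from the font.

```python
from dataclasses import dataclass


@dataclass
class ShapedGlyph:
    name: str
    x_advance: int
    x_offset: int
    y_offset: int
    y_advance: int = 0  # horizontal text normally has no vertical advance


def place_glyphs(glyphs, scale=1.0):
    """Return (name, x, y) drawing origins by accumulating the advances."""
    placed = []
    pen_x = pen_y = 0
    for glyph in glyphs:
        # The offset shifts only this glyph; it does not move the pen
        x = (pen_x + glyph.x_offset) * scale
        y = (pen_y + glyph.y_offset) * scale
        placed.append((glyph.name, x, y))
        pen_x += glyph.x_advance
        pen_y += glyph.y_advance
    return placed


glyphs = [
    ShapedGlyph("uni178617B6", x_advance=923, x_offset=0, y_offset=0),
    ShapedGlyph("uni17D21793", x_advance=0, x_offset=-296, y_offset=-26),
    ShapedGlyph("uni17C6", x_advance=0, x_offset=47, y_offset=-29),
]

for name, x, y in place_glyphs(glyphs):
    print(f"{name}: x={x} y={y}")
```

With `scale=1.0` the positions stay in font units. Only the first glyph moves the pen, so the other two are placed relative to the end of the base glyph and add no width to the line.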
This problem is addressed in the official reportlab forum: https://groups.google.com/g/reportlab-users/c/WHuatWlUUpE For me this is solved now after switching to fpdf2. I might return to reportlab in the future when the font shape engine is implemented. |
In many places I observed a dotted circle in the Khmer words, and it does not look like a Khmer character. Investigating further, it looks like the rendering of Khmer with the NotoSans font does not match the exact writing. To start the investigation I looked at the summary list in the bottom left corner. It should read:
54មនុស្ស
12ចៅក្រម
19ហោរា
53ស្តេច
82រយៈពេល
37ព្រឹត្តិការណ៍
18វត្ថុឬវត្ថុ
80សមាជិកនៃគ្រួសាររបស់ Terah
But instead this is rendered: