added in wrapping on zero width space to better support word wrapping in languages like Thai #1191

carlhiggs · 2024-06-03T12:34:41Z

Checklist:

The GitHub pipeline is OK (green),
meaning that both pylint (static code analyzer) and black (code formatter) are happy with the changes of this PR.
A unit test is covering the code added / modified by this PR
This PR is ready to be merged
In case of a new feature, docstrings have been added, with also some documentation in the docs/ folder
A mention of the change is present in CHANGELOG.md

By submitting this pull request, I confirm that my contribution is made under the terms of the GNU LGPL 3.0 license.

gmischler

Even if you only implement this simple change, there's no real point in limiting it to just two characters. Here's the set of the most likely candidates:

2000 | EN QUAD
2001 | EM QUAD
2002 | EN SPACE
2003 | EM SPACE
2004 | THREE-PER-EM SPACE
2005 | FOUR-PER-EM SPACE
2006 | SIX-PER-EM SPACE
2008 | PUNCTUATION SPACE
2009 | THIN SPACE
200A | HAIR SPACE
205F | MEDIUM MATHEMATICAL SPACE
3000 | IDEOGRAPHIC SPACE
0009 | TAB

Near the top of html.py, there's a list containing a few more, which include various side effects and may or may not be relevant here.

The point right now is to add those to your two characters for triggering a break. That won't change anything in how your change works, but brings us yet another (small) step closer to a full wrapping algorithm.

fpdf/line_break.py

CHANGELOG.md

…, and updated CHANGELOG.md to refer to 'all common space symbols' as per py-pdf#1191

gmischler · 2024-06-06T11:42:18Z

Two things left to do:

Rebase your branch and resolve any conflicts with changes that have been merged in the mean time.
Update the PDF files that fail the comparison test (after thorougly checking that they are visually identical).

Then I think this is ready to merge!

gmischler · 2024-06-17T06:06:13Z

Are you still around @carlhiggs ?
We're this close to merging this, after just a little bit of cleanup...

carlhiggs · 2024-06-17T07:19:55Z

Are you still around @carlhiggs ?
We're this close to merging this, after just a little bit of cleanup...

Hi @gmischler , yup still here! Sorry been busy on a another project. I started working my way through the pdf updates today -- I think about 47 pdf checks failed, but I'm working my way through and visually checking/updating them. The ones I've checked so far (~10) have looked visually identical, so have replaced pdfs and confirmed them to pass. But just need to work through list. Tricky with other work finding time, but I'll try to get there by Friday.

Hope that's okay and thanks for checking in; will be great to have this included in an update

gmischler · 2024-06-18T06:35:06Z

Hope that's okay and thanks for checking in; will be great to have this included in an update

That's perfectly fine. We absolutely do want to have this improvement, but we can wait a bit longer.

…lly identical and had same file size, as per py-pdf#1191

carlhiggs · 2024-06-21T08:38:22Z

Hope that's okay and thanks for checking in; will be great to have this included in an update

That's perfectly fine. We absolutely do want to have this improvement, but we can wait a bit longer.

Hi @gmischler just confirming that I've now been through and manually checked that the comparison test failing PDFs matched both visually and in file size (KB), and updating the reference PDFs. PyTest checks now all pass on my end (other than 2 skipped, 2 xfailed, and 7 warnings; I don't think those are things I can do anything about).

I've also ensured my fork is sync'd and not behind the upstream.

Fingers crossed this is now good to go!

gmischler · 2024-06-21T09:19:32Z

Something weird is going on here with GitHub still seeing conflicts.
Can you try another rebase? Maybe that makes it go away.

…erowidthspace in addition to regular space for wrapping words

…ace characters for word wrapping purposes

…er branch as per py-pdf#1191

…manually split across several lines

…rted change in documentation that specifically referred to unicode zero-width space as now the existing documentation that refered generically to wrapping on 'space characters' is accurate as multiple space characters are accounted for

…, and updated CHANGELOG.md to refer to 'all common space symbols' as per py-pdf#1191

…dates to pdf tests

…branch to resolve git rebase conflicts

carlhiggs · 2024-06-21T09:52:07Z

Something weird is going on here with GitHub still seeing conflicts. Can you try another rebase? Maybe that makes it go away.

sorry I see what I did --- forgot I was working on a branch; i synced the master in Github, but not my branch...

working through it now --- a few more failing tests to resolve; I'll let you know once commited fixes, should be soon

carlhiggs · 2024-06-21T10:02:10Z

Sorry for the confusion @gmischler and thanks for prompting me to rebase again --- should be all good now; fingers crossed again!

gmischler · 2024-06-21T10:33:24Z

Failing with an encryption test.

carlhiggs · 2024-06-21T11:05:27Z

Failing with an encryption test.

I guess they are ones that were skipped on my local pytest run... i just ran again and looks like this on my side:

But I see what you mean in the logs... i'll try to look again...

carlhiggs · 2024-06-21T11:10:04Z

While it says test_encrypt_font failed, it didn't on my end; presumably similar with other tests

Might relate to qpdf not being installed on my side.. I'll try installing and see if i have more luck. Or otherwise, I'll manually compare the tested and reference pdfs and if equivalent, replace the reference.

carlhiggs · 2024-06-21T11:23:03Z

@gmischler I'm not sure what to do --- a lot of the tests failing are ones that I had already manually been through compared and replaced pdfs for; its pretty painstaking and I'm not sure that me doing it again will help. The tests are passing on my side. I don't know why they fail in the automated build, but also not sure what to do about it. I think I'll have to put this down here -- pretty late on a Friday sorry.

gmischler · 2024-06-23T18:12:19Z

@gmischler I'm not sure what to do

You could try to update your dependencies.
I think there was a recent change that caused the encrypted parts of a file to be stored in a slightly different format.

…df#1191

carlhiggs · 2024-06-24T04:49:13Z

You could try to update your dependencies.
Thank you, I think this solved the previous issue, @gmischler . I set up a fresh Conda environment, installed the fpdf requirements again, re-ran tests identying failures and following manual checks of appearance and file size (kb) and replacing as required, all the pdf checks have now been resolved. Pytest passes on my system.

During the github actions build there was an unrelated failure on 'test_cell_speed_with_long_text' on 3.9 Ubuntu-latest, the other 3.9 Windows build was cancelled at that point, but the builds for 3.8 and 3.10 succeeded. So I think this is progress.

The cell speed test passes on my side, although maybe its because I don't have qdf (and running Windows with Python 3.12).

I see the cell speed test has been an issue before/recently (#923)... maybe its random bad luck? in #913 it was suggested that performance on this test may relate largely to demand at a particular time on Github Actions...

Or maybe adding in the other kinds of space character options for wrapping impacted performance --- we could pare back to essentials if it were a problem.

Again, not sure how to resolve this one; if you have advice, happy to look further into this.

gmischler · 2024-06-24T09:09:53Z

The timing error is persistent. I'd be surprised if your change really had such a significant performance impact (given that in most cases it will match the first character in the pattern).

But then, the PR also shows a lot of changes that were actually made in other PRs. It seems to compare against an earlier version of the repository. Can you please do a clean rebase, without changing anything else?

…est) following git rebase

…ace characters for word wrapping purposes

…manually split across several lines as per py-pdf#1191 (and resolved conflict following rebase)

…rted change in documentation that specifically referred to unicode zero-width space as now the existing documentation that refered generically to wrapping on 'space characters' is accurate as multiple space characters are accounted for

…df#1191

carlhiggs · 2024-06-25T02:12:39Z

Can you please do a clean rebase, without changing anything else?

Hi @gmischler I've given it a go! I followed https://johneverettcase.com/git-how-to-rebase-a-fork , and noticed this was a bit different to when I attempted it last time (there were a series of repeated conflicts, integrating minor changes made over time and confirming which to retain; last time there seemed to be fewer conflicts, so maybe I did something wrong previously). This seemed to go fine. There were a few PDFs that had conflicts (only 6, happily); I replaced these with the most recent ones I generated for the previous build (after resolving encoding errors).

In case you want to check that I did a 'clean rebase' the right way (I may have misunderstood). I've pasted my output below.

While I was writing this post, I see the 3.9 Ubuntu build failed on the timing test again,

=========================== short test summary info ============================
FAILED test/text/test_cell.py::test_cell_speed_with_long_text - assert 31.197102565700003 < 24
= 1 failed, 1297 passed, 1 skipped, 4 xfailed, 14 warnings in 687.00s (0:11:26) =
Error: Process completed with exit code 1.

Other details of the rebase wrapped in details brackets below:

Expand to view text copy of terminal log attempting clean rebase as suggested

>git clone https://github.com/carlhiggs/fpdf2.git
Cloning into 'fpdf2'...
remote: Enumerating objects: 15315, done.
remote: Counting objects: 100% (973/973), done.
remote: Compressing objects: 100% (536/536), done.
remote: Total 15315 (delta 617), reused 744 (delta 437), pack-reused 14342Receiving objects: 100% (15315/15315), 48.63 MiB | 13.38 MiB/s
Receiving objects: 100% (15315/15315), 52.85 MiB | 13.41 MiB/s, done.
Resolving deltas: 100% (10597/10597), done.

[computername] Tue 25/06/2024 11:28:59.16
D:\projects\repos

>cd fpdf2

[computername] Tue 25/06/2024 11:29:02.25
D:\projects\repos\fpdf2

>git remote add upstream https://github.com/py-pdf/fpdf2.git

[computername] Tue 25/06/2024 11:29:03.47
D:\projects\repos\fpdf2

>git remote -v
origin  https://github.com/carlhiggs/fpdf2.git (fetch)
origin  https://github.com/carlhiggs/fpdf2.git (push)
upstream        https://github.com/py-pdf/fpdf2.git (fetch)
upstream        https://github.com/py-pdf/fpdf2.git (push)

[computername] Tue 25/06/2024 11:29:07.03
D:\projects\repos\fpdf2

>git fetch upstream
remote: Enumerating objects: 23886, done.
remote: Counting objects: 100% (23772/23772), done.
remote: Compressing objects: 100% (2802/2802), done.
remote: Total 23618 (delta 20947), reused 23114 (delta 20468), pack-reused 0Receiving objects: 100% (23618/23618), 249.28 MiB | 13.84 MiB/s
Receiving objects: 100% (23618/23618), 251.05 MiB | 13.51 MiB/s, done.
Resolving deltas: 100% (20947/20947), completed with 42 local objects.
From https://github.com/py-pdf/fpdf2
 * [new branch]        2.7.9-release     -> upstream/2.7.9-release
 * [new branch]        add-script-compare-changed-pdfs -> upstream/add-script-compare-changed-pdfs
 * [new branch]        all-contributors/add-DarekRepos -> upstream/all-contributors/add-DarekRepos
 * [new branch]        all-contributors/add-carlhiggs -> upstream/all-contributors/add-carlhiggs
 * [new branch]        all-contributors/add-smitelli -> upstream/all-contributors/add-smitelli
 * [new branch]        gh-pages          -> upstream/gh-pages
 * [new branch]        html-split-style-and-attrs-handling -> upstream/html-split-style-and-attrs-handling
 * [new branch]        keep_aspect_ratio_and_x_align_center -> upstream/keep_aspect_ratio_and_x_align_center
 * [new branch]        master            -> upstream/master
 * [new branch]        rm-trailing-zeros -> upstream/rm-trailing-zeros
 * [new branch]        svg-text          -> upstream/svg-text
 * [new tag]           1.7               -> 1.7
 * [new tag]           1.7.1             -> 1.7.1
 * [new tag]           1.7.2             -> 1.7.2
 * [new tag]           2.0.0             -> 2.0.0
 * [new tag]           2.0.5             -> 2.0.5
 * [new tag]           2.1.0             -> 2.1.0
 * [new tag]           2.2.0             -> 2.2.0
 * [new tag]           2.3.0             -> 2.3.0
 * [new tag]           2.3.1             -> 2.3.1
 * [new tag]           2.3.2             -> 2.3.2
 * [new tag]           2.3.3             -> 2.3.3
 * [new tag]           2.3.4             -> 2.3.4
 * [new tag]           2.3.5             -> 2.3.5
 * [new tag]           2.4.0             -> 2.4.0
 * [new tag]           2.4.1             -> 2.4.1
 * [new tag]           2.4.2             -> 2.4.2
 * [new tag]           2.4.3             -> 2.4.3
 * [new tag]           2.4.4             -> 2.4.4
 * [new tag]           2.4.5             -> 2.4.5
 * [new tag]           2.4.6             -> 2.4.6
 * [new tag]           2.5.0             -> 2.5.0
 * [new tag]           2.5.1             -> 2.5.1
 * [new tag]           2.5.2             -> 2.5.2
 * [new tag]           2.5.3             -> 2.5.3
 * [new tag]           2.5.4             -> 2.5.4
 * [new tag]           2.5.5             -> 2.5.5
 * [new tag]           2.5.6             -> 2.5.6
 * [new tag]           2.5.7             -> 2.5.7
 * [new tag]           2.6.0             -> 2.6.0
 * [new tag]           2.6.1             -> 2.6.1
 * [new tag]           2.7.0             -> 2.7.0
 * [new tag]           2.7.1             -> 2.7.1
 * [new tag]           2.7.2             -> 2.7.2
 * [new tag]           2.7.3             -> 2.7.3
 * [new tag]           2.7.4             -> 2.7.4
 * [new tag]           2.7.5             -> 2.7.5
 * [new tag]           2.7.6             -> 2.7.6
 * [new tag]           2.7.7             -> 2.7.7
 * [new tag]           2.7.8             -> 2.7.8
 * [new tag]           2.7.9             -> 2.7.9
 * [new tag]           binary            -> binary
 * [new tag]           gif               -> gif
 * [new tag]           png_alpha         -> png_alpha
 * [new tag]           unicode           -> unicode

[computername] Tue 25/06/2024 11:29:42.18
D:\projects\repos\fpdf2

>git status
On branch master
Your branch is up to date with 'origin/master'.

nothing to commit, working tree clean

[computername] Tue 25/06/2024 11:29:54.46
D:\projects\repos\fpdf2

>git checkout wrap-zero-width-space
Switched to a new branch 'wrap-zero-width-space'
branch 'wrap-zero-width-space' set up to track 'origin/wrap-zero-width-space'.

[computername] Tue 25/06/2024 11:30:10.51
D:\projects\repos\fpdf2

>git branch wrap-zero-width-space-before-rebaes-2024-06-05

[computername] Tue 25/06/2024 11:30:45.17
D:\projects\repos\fpdf2

>git checkout wrap-zero-width-space
Already on 'wrap-zero-width-space'
Your branch is up to date with 'origin/wrap-zero-width-space'.

[computername] Tue 25/06/2024 11:30:47.69
D:\projects\repos\fpdf2

>git status
On branch wrap-zero-width-space
Your branch is up to date with 'origin/wrap-zero-width-space'.

nothing to commit, working tree clean

[computername] Tue 25/06/2024 11:31:01.14
D:\projects\repos\fpdf2

>git rebase upstream/master
Auto-merging fpdf/line_break.py
CONFLICT (content): Merge conflict in fpdf/line_break.py
Auto-merging test/fonts/test_wraps_zerowidthspace.py
CONFLICT (add/add): Merge conflict in test/fonts/test_wraps_zerowidthspace.py
error: could not apply d5808fe9... added in modification to linebreak.py and accompanying test to wrap zerowidthspace in addition to regular space for wrapping words
hint: Resolve all conflicts manually, mark them as resolved with
hint: "git add/rm <conflicted_files>", then run "git rebase --continue".
hint: You can instead skip this commit: run "git rebase --skip".
hint: To abort and get back to the state before "git rebase", run "git rebase --abort".
Could not apply d5808fe9... added in modification to linebreak.py and accompanying test to wrap zerowidthspace in addition to regular space for wrapping words

[computername] Tue 25/06/2024 11:31:13.19
D:\projects\repos\fpdf2

>cd fpdf

[computername] Tue 25/06/2024 11:35:29.32
D:\projects\repos\fpdf2\fpdf

>git add line_break.py

[computername] Tue 25/06/2024 11:35:39.14
D:\projects\repos\fpdf2\fpdf

>cd test
The system cannot find the path specified.

[computername] Tue 25/06/2024 11:35:45.94
D:\projects\repos\fpdf2\fpdf

>cd test
The system cannot find the path specified.

[computername] Tue 25/06/2024 11:38:36.94
D:\projects\repos\fpdf2\fpdf

>cd ..

[computername] Tue 25/06/2024 11:38:43.08
D:\projects\repos\fpdf2

>cd test

[computername] Tue 25/06/2024 11:38:44.12
D:\projects\repos\fpdf2\test

>cd fonts

[computername] Tue 25/06/2024 11:38:46.18
D:\projects\repos\fpdf2\test\fonts

>git add test_wraps_zerowidthspace.py

[computername] Tue 25/06/2024 11:38:54.19
D:\projects\repos\fpdf2\test\fonts

>git rebase --continue
Auto-merging fpdf/line_break.py
CONFLICT (content): Merge conflict in fpdf/line_break.py
Auto-merging test/fonts/test_wraps_zerowidthspace.py
CONFLICT (content): Merge conflict in test/fonts/test_wraps_zerowidthspace.py
error: could not apply 1123c5d2... ran black
hint: Resolve all conflicts manually, mark them as resolved with
hint: "git add/rm <conflicted_files>", then run "git rebase --continue".
hint: You can instead skip this commit: run "git rebase --skip".
hint: To abort and get back to the state before "git rebase", run "git rebase --abort".
Could not apply 1123c5d2... ran black

[computername] Tue 25/06/2024 11:39:12.57
D:\projects\repos\fpdf2\test\fonts

>git add test_wraps_zerowidthspace.py

[computername] Tue 25/06/2024 11:41:01.55
D:\projects\repos\fpdf2\test\fonts

>cd ..

[computername] Tue 25/06/2024 11:41:04.23
D:\projects\repos\fpdf2\test

>cd ..

[computername] Tue 25/06/2024 11:41:05.25
D:\projects\repos\fpdf2

>cd fpdf

[computername] Tue 25/06/2024 11:41:06.68
D:\projects\repos\fpdf2\fpdf

>git add line_break.py

[computername] Tue 25/06/2024 11:41:09.26
D:\projects\repos\fpdf2\fpdf

>git rebase --continue
[detached HEAD 8412b5be] addressed conflicts (updates to linespace break code and associated test) following git rebase
 Author: Carl Higgs <carlhiggs@gmail.com>
 1 file changed, 4 insertions(+)
Auto-merging CHANGELOG.md
CONFLICT (content): Merge conflict in CHANGELOG.md
error: could not apply 35969ce1... updated CHANGELOG.md
hint: Resolve all conflicts manually, mark them as resolved with
hint: "git add/rm <conflicted_files>", then run "git rebase --continue".
hint: You can instead skip this commit: run "git rebase --skip".
hint: To abort and get back to the state before "git rebase", run "git rebase --abort".
Could not apply 35969ce1... updated CHANGELOG.md

[computername] Tue 25/06/2024 11:42:00.89
D:\projects\repos\fpdf2\fpdf

>cd ..

[computername] Tue 25/06/2024 11:44:05.45
D:\projects\repos\fpdf2

>git add CHANGELOG.md

[computername] Tue 25/06/2024 11:44:09.40
D:\projects\repos\fpdf2

>git rebase --continue
[detached HEAD 5ae48472] updated CHANGELOG.md
 Author: Carl Higgs <carlhiggs@gmail.com>
 1 file changed, 1 insertion(+), 5 deletions(-)
Auto-merging test/fonts/test_wraps_zerowidthspace.py
CONFLICT (content): Merge conflict in test/fonts/test_wraps_zerowidthspace.py
error: could not apply eb7e800b... made sure that the zero-width-space test string was not too long and manually split across several lines
hint: Resolve all conflicts manually, mark them as resolved with
hint: "git add/rm <conflicted_files>", then run "git rebase --continue".
hint: You can instead skip this commit: run "git rebase --skip".
hint: To abort and get back to the state before "git rebase", run "git rebase --abort".
Could not apply eb7e800b... made sure that the zero-width-space test string was not too long and manually split across several lines

[computername] Tue 25/06/2024 11:44:32.52
D:\projects\repos\fpdf2

>cd test

[computername] Tue 25/06/2024 11:45:52.12
D:\projects\repos\fpdf2\test

>cd font
The system cannot find the path specified.

[computername] Tue 25/06/2024 11:45:53.75
D:\projects\repos\fpdf2\test

>cd fonts

[computername] Tue 25/06/2024 11:45:56.40
D:\projects\repos\fpdf2\test\fonts

>git add test_wraps_zerowidthspace.py

[computername] Tue 25/06/2024 11:46:00.62
D:\projects\repos\fpdf2\test\fonts

>git rebase --continue
[detached HEAD 7eb4048e] made sure that the zero-width-space test string was not too long and manually split across several lines as per #1191 (and resolved conflict following rebase)
 Author: Carl Higgs <carlhiggs@gmail.com>
 1 file changed, 9 insertions(+), 13 deletions(-)
Auto-merging docs/Text.md
Auto-merging fpdf/line_break.py
CONFLICT (content): Merge conflict in fpdf/line_break.py
error: could not apply 14bd41ff... updated wrapping to account for additional space characters, and reverted change in documentation that specifically referred to unicode zero-width space as now the existing documentation that refered generically to wrapping on 'space characters' is accurate as multiple space characters are accounted for
hint: Resolve all conflicts manually, mark them as resolved with
hint: "git add/rm <conflicted_files>", then run "git rebase --continue".
hint: You can instead skip this commit: run "git rebase --skip".
hint: To abort and get back to the state before "git rebase", run "git rebase --abort".
Could not apply 14bd41ff... updated wrapping to account for additional space characters, and reverted change in documentation that specifically referred to unicode zero-width space as now the existing documentation that refered generically to wrapping on 'space characters' is accurate as multiple space characters are accounted for

[computername] Tue 25/06/2024 11:46:57.07
D:\projects\repos\fpdf2\test\fonts

>cd ..

[computername] Tue 25/06/2024 11:48:30.83
D:\projects\repos\fpdf2\test

>cd ..

[computername] Tue 25/06/2024 11:48:32.23
D:\projects\repos\fpdf2

>cd fpdf

[computername] Tue 25/06/2024 11:48:34.51
D:\projects\repos\fpdf2\fpdf

>git add line_break.py

[computername] Tue 25/06/2024 11:48:37.34
D:\projects\repos\fpdf2\fpdf

>git rebase --continue
[detached HEAD e82090c8] updated wrapping to account for additional space characters, and reverted change in documentation that specifically referred to unicode zero-width space as now the existing documentation that refered generically to wrapping on 'space characters' is accurate as multiple space characters are accounted for
 Author: Carl Higgs <carlhiggs@gmail.com>
 2 files changed, 11 insertions(+), 11 deletions(-)
Auto-merging CHANGELOG.md
CONFLICT (content): Merge conflict in CHANGELOG.md
error: could not apply 0a5de1c6... updated line_break.py to list BREAKING_SPACE_SYMBOLS as a single list, and updated CHANGELOG.md to refer to 'all common space symbols' as per #1191
hint: Resolve all conflicts manually, mark them as resolved with
hint: "git add/rm <conflicted_files>", then run "git rebase --continue".
hint: You can instead skip this commit: run "git rebase --skip".
hint: To abort and get back to the state before "git rebase", run "git rebase --abort".
Could not apply 0a5de1c6... updated line_break.py to list BREAKING_SPACE_SYMBOLS as a single list, and updated CHANGELOG.md to refer to 'all common space symbols' as per #1191

[computername] Tue 25/06/2024 11:48:50.09
D:\projects\repos\fpdf2\fpdf

>cd ..

[computername] Tue 25/06/2024 11:49:29.82
D:\projects\repos\fpdf2

>git add CHANGELOG.md

[computername] Tue 25/06/2024 11:49:33.03
D:\projects\repos\fpdf2

>git rebase --continue
dropping bf013bba1fd05b25f320ec49264ab691694fa7d1 partial update rebasing main and addressing some (but not yet all) updates to pdf tests -- patch contents already upstream
warning: Cannot merge binary files: test/html/html_custom_pre_code_font.pdf (HEAD vs. e8824083 (updated test reference PDFs after manually confirming they were visually identical and had same file size, as per #1191))
Auto-merging test/html/html_custom_pre_code_font.pdf
CONFLICT (content): Merge conflict in test/html/html_custom_pre_code_font.pdf
warning: Cannot merge binary files: test/html/html_heading_hebrew.pdf (HEAD vs. e8824083 (updated test reference PDFs after manually confirming they were visually identical and had same file size, as per #1191))
Auto-merging test/html/html_heading_hebrew.pdf
CONFLICT (content): Merge conflict in test/html/html_heading_hebrew.pdf
warning: Cannot merge binary files: test/html/html_ul_type.pdf (HEAD vs. e8824083 (updated test reference PDFs after manually confirming they were visually identical and had same file size, as per #1191))
Auto-merging test/html/html_ul_type.pdf
CONFLICT (content): Merge conflict in test/html/html_ul_type.pdf
warning: Cannot merge binary files: test/html/issue_156.pdf (HEAD vs. e8824083 (updated test reference PDFs after manually confirming they were visually identical and had same file size, as per #1191))
Auto-merging test/html/issue_156.pdf
CONFLICT (content): Merge conflict in test/html/issue_156.pdf
warning: Cannot merge binary files: test/text_shaping/bidi_arabic_lorem_ipsum.pdf (HEAD vs. e8824083 (updated test reference PDFs after manually confirming they were visually identical and had same file size, as per #1191))
Auto-merging test/text_shaping/bidi_arabic_lorem_ipsum.pdf
CONFLICT (content): Merge conflict in test/text_shaping/bidi_arabic_lorem_ipsum.pdf
warning: Cannot merge binary files: test/text_shaping/bidi_paragraph_direction.pdf (HEAD vs. e8824083 (updated test reference PDFs after manually confirming they were visually identical and had same file size, as per #1191))
Auto-merging test/text_shaping/bidi_paragraph_direction.pdf
CONFLICT (content): Merge conflict in test/text_shaping/bidi_paragraph_direction.pdf
error: could not apply e8824083... updated test reference PDFs after manually confirming they were visually identical and had same file size, as per #1191
hint: Resolve all conflicts manually, mark them as resolved with
hint: "git add/rm <conflicted_files>", then run "git rebase --continue".
hint: You can instead skip this commit: run "git rebase --skip".
hint: To abort and get back to the state before "git rebase", run "git rebase --abort".
Could not apply e8824083... updated test reference PDFs after manually confirming they were visually identical and had same file size, as per #1191

[computername] Tue 25/06/2024 11:49:38.57
D:\projects\repos\fpdf2

>cd test

[computername] Tue 25/06/2024 11:53:53.79
D:\projects\repos\fpdf2\test

>cd html

[computername] Tue 25/06/2024 11:53:57.66
D:\projects\repos\fpdf2\test\html

>git add html_custom_pre_code_font.pdf

[computername] Tue 25/06/2024 11:54:04.16
D:\projects\repos\fpdf2\test\html

>git add html_ul_type.pdf

[computername] Tue 25/06/2024 11:54:09.19
D:\projects\repos\fpdf2\test\html

>git add html_heading_hebrew.pdf

[computername] Tue 25/06/2024 11:54:22.11
D:\projects\repos\fpdf2\test\html

>git add issue_156.pdf

[computername] Tue 25/06/2024 11:54:28.11
D:\projects\repos\fpdf2\test\html

>cd ..

[computername] Tue 25/06/2024 11:54:29.34
D:\projects\repos\fpdf2\test

>cd text_shaping

[computername] Tue 25/06/2024 11:54:32.07
D:\projects\repos\fpdf2\test\text_shaping

>git add bidi_arabic_lorem_ipsum.pdf

[computername] Tue 25/06/2024 11:54:39.32
D:\projects\repos\fpdf2\test\text_shaping

>git add bidi_paragraph_direction.pdf

[computername] Tue 25/06/2024 11:54:49.87
D:\projects\repos\fpdf2\test\text_shaping

>git rebase --continue
Successfully rebased and updated refs/heads/wrap-zero-width-space.

[computername] Tue 25/06/2024 11:54:57.93
D:\projects\repos\fpdf2\test\text_shaping

>git status
On branch wrap-zero-width-space
Your branch and 'origin/wrap-zero-width-space' have diverged,
and have 8 and 13 different commits each, respectively.
  (use "git pull" to merge the remote branch into yours)

nothing to commit, working tree clean

[computername] Tue 25/06/2024 11:55:23.81
D:\projects\repos\fpdf2\test\text_shaping

>git push origin wrap-zero-width-space --force
Enumerating objects: 148, done.
Counting objects: 100% (148/148), done.
Delta compression using up to 16 threads
Compressing objects: 100% (79/79), done.
Writing objects: 100% (88/88), 846.73 KiB | 20.65 MiB/s, done.
Total 88 (delta 39), reused 47 (delta 9), pack-reused 0
remote: Resolving deltas: 100% (39/39), completed with 20 local objects.
To https://github.com/carlhiggs/fpdf2.git
 + 4f105dcb...c2c7b7e5 wrap-zero-width-space -> wrap-zero-width-space (forced update)

If there's anything more I can do, just let me know.

gmischler · 2024-06-25T07:08:06Z

Yes, now the diffs look a lot more sane.

There's only this changelog entry about the "/Resources" directory, which is a duplicate of an entry in the *fixed* category that somehow got stuck. You'll need to remove your errant copy.

I also just did a quick check, and found that searching individual characters in a string is several times faster than searching them in a list (not really surprising if you think about it). So if you convert your collection into a string, then the "long_text" timeout might actually go away again. This search is done for every single character in the user input, after all, so it is worth avoiding unnecessary delays.

import timeit

BREAKING_SPACE_SYMBOLS_STR = ''.join(BREAKING_SPACE_SYMBOLS)

res = timeit.repeat(stmt="'x' in BREAKING_SPACE_SYMBOLS", repeat=1, number=1000000, globals=globals())
print('CLIST:', res)
res = timeit.repeat(stmt="'x' in BREAKING_SPACE_SYMBOLS_STR", repeat=1, number=1000000, globals=globals())
print('CSTR:', res)

output (varies a lot depending on system load):
CLIST: [0.9791883510001753]
CSTR: [0.12988920500038148]

…ces using string instead of list as per py-pdf#1191

carlhiggs · 2024-06-26T00:57:07Z

Great idea replacing the list with a string for the check @gmischler ; I just did that and corrected the changelog.

Sadly still fails the time test...

=========================== short test summary info ============================
FAILED test/text/test_cell.py::test_cell_speed_with_long_text - assert 31.066836576600004 < 24
= 1 failed, 1297 passed, 1 skipped, 4 xfailed, 14 warnings in 679.50s (0:11:19) =
Error: Process completed with exit code 1.

It would be great if I could replicate this error on my local machine to try workarounds, but on mine, it passes.

Here I ran it using pytest -k test_cell_speed_with_long_text and including loading pytset, it took 46 seconds; I couldn't figure out how to get the test time logging output displayed on my system. But it passed the test. I also tried this approach using the older code (using the list, and space alone) and each took about 46 seconds.

I was curious how long the test_cell_speed_with_long_text() function would take on its own using the three approaches for breaking (str, list and space), so I run the following as a very crude check (I didn't do the timeit repetitions, which would have been better; I was just wanting to do a quick indicative check of run time with the change and without on my system):

import gzip
from pathlib import Path

import pytest

from fpdf import FPDF, FPDFException
from test.conftest import (
    assert_pdf_equal,
    ensure_exec_time_below,
    ensure_rss_memory_below,
    LOREM_IPSUM,
)

HERE = Path.cwd().resolve() / 'test/text'
FONTS_DIR = HERE.parent / "fonts"

with gzip.open(HERE / "long_text.txt.gz") as text_file:
    LONG_TEXT_LINES = text_file.read().decode(encoding="utf-8").splitlines()

@ensure_exec_time_below(seconds=24)
@ensure_rss_memory_below(mib=1)
def test_cell_speed_with_long_text():  # issue #907
    pdf = FPDF()
    pdf.add_font(fname=FONTS_DIR / "DejaVuSans.ttf")
    pdf.set_font("DejaVuSans", size=5)
    pdf.add_page()
    assert len(LONG_TEXT_LINES) == 26862
    for line in LONG_TEXT_LINES:
        pdf.cell(0, 3, line, new_x="LMARGIN", new_y="NEXT")

import timeit
timeit.timeit(test_cell_speed_with_long_text, number=1)

with character in BREAKING_SPACE_SYMBOLS_STR (current version of this PR, using reference string):

>>> timeit.timeit(test_cell_speed_with_long_text, number=1)
5.54722400000901

with character IN BREAKING_SPACE_SYMBOLS (previous version of this PR, using reference list):

>>> timeit.timeit(test_cell_speed_with_long_text, number=1)
5.622404900001129

with character==SPACE (the code before this pull request):

>>> timeit.timeit(stmt=test_cell_speed_with_long_text,number=1)
5.52745490000234

I think this is good news, in that the proposed change to line_break.py doesn't make much of a meaningful difference to the time it takes to run the test_cell_speed_with_long_text. As you demonstrated (in a more robust way) time to compare against characters in a string is quicker than when using a list. The difference to comparison against a single character was a fraction of a second, so hardly meaningful. I don't see why the change should fail the timing test, when it didn't previously --- in my test it took 0.02 of a second longer to run using the new linebreak.py code compared to the older, and neither exceeded the 24 second limit. Its true, my computer isn't a slouch (32GB ram etc), but the point is, if it passed before, it should in theory pass now because the extra code doesn't seem to cause a meaningful difference in running time.

Happy to make further amendments, but as before, I'm not sure what else to do!

gmischler · 2024-06-26T06:05:59Z

This comparison is only a very small part of the overall text processing, so it's no surprise that the difference hides in the statistical noise in your more complete example. In the strictest sense this turns that change into a "premature optimization", but if we can have even an insignificant piece of code run 5x as fast with no downside whatsoever, we should do so just out of habit.

Don't worry about the long text test failing. I saw that it failed in #1211 as well, so it must have another reason. Either the environment of the respective VM has changed, or we have recently merged another commit that made things slower.

Thanks a lot for this helpful addition, @carlhiggs !
Merging now...

carlhiggs · 2024-06-26T06:13:21Z

Excellent, thanks a lot @gmischler for the patient guidance in getting this feature developed; much appreciated!

carlhiggs requested a review from gmischler as a code owner June 3, 2024 12:34

gmischler approved these changes Jun 3, 2024

View reviewed changes

gmischler reviewed Jun 4, 2024

View reviewed changes

fpdf/line_break.py Outdated Show resolved Hide resolved

gmischler reviewed Jun 4, 2024

View reviewed changes

CHANGELOG.md Outdated Show resolved Hide resolved

carlhiggs added a commit to carlhiggs/fpdf2 that referenced this pull request Jun 5, 2024

updated line_break.py to list BREAKING_SPACE_SYMBOLS as a single list…

0a5de1c

…, and updated CHANGELOG.md to refer to 'all common space symbols' as per py-pdf#1191

carlhiggs added a commit to carlhiggs/fpdf2 that referenced this pull request Jun 21, 2024

updated test reference PDFs after manually confirming they were visua…

e882408

…lly identical and had same file size, as per py-pdf#1191

carlhiggs added 10 commits June 21, 2024 19:28

added in modification to linebreak.py and accompanying test to wrap z…

8e0ac53

…erowidthspace in addition to regular space for wrapping words

ran black

b0b0183

updated documentation to clarify that both ' ' and U+200B count as sp…

c9209bc

…ace characters for word wrapping purposes

updated CHANGELOG.md to resolve conflict following rebasing with mast…

4f95856

…er branch as per py-pdf#1191

made sure that the zero-width-space test string was not too long and …

da0326f

…manually split across several lines

made string concatenation explicit to make pylint happier

096f8c6

updated line_break.py to list BREAKING_SPACE_SYMBOLS as a single list…

e22148b

…, and updated CHANGELOG.md to refer to 'all common space symbols' as per py-pdf#1191

partial update rebasing main and addressing some (but not yet all) up…

9c3cc31

…dates to pdf tests

updated test reference PDFs using direct downloads from FPDF2 master …

a75c41e

…branch to resolve git rebase conflicts

carlhiggs added a commit to carlhiggs/fpdf2 that referenced this pull request Jun 24, 2024

updated pdfs following fresh installation of dependencies as per py-p…

4f105dc

…df#1191

carlhiggs added 8 commits June 25, 2024 11:41

addressed conflicts (updates to linespace break code and associated t…

8412b5b

…est) following git rebase

updated documentation to clarify that both ' ' and U+200B count as sp…

4ded107

…ace characters for word wrapping purposes

updated CHANGELOG.md

5ae4847

made sure that the zero-width-space test string was not too long and …

7eb4048

…manually split across several lines as per py-pdf#1191 (and resolved conflict following rebase)

made string concatenation explicit to make pylint happier

73e0e4a

corrected pdfs

29a786f

updated pdfs following fresh installation of dependencies as per py-p…

c2c7b7e

…df#1191

carlhiggs force-pushed the wrap-zero-width-space branch from 4f105dc to c2c7b7e Compare June 25, 2024 01:55

corrected changelog and updated line_break.py to wrap on breaking spa…

a9a5c7f

…ces using string instead of list as per py-pdf#1191

gmischler merged commit f399dda into py-pdf:master Jun 26, 2024
8 of 11 checks passed

carlhiggs mentioned this pull request Jun 26, 2024

Thai text does not correctly wrap (words should wrap on zero width space) healthysustainablecities/global-indicators#441

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

added in wrapping on zero width space to better support word wrapping in languages like Thai #1191

added in wrapping on zero width space to better support word wrapping in languages like Thai #1191

carlhiggs commented Jun 3, 2024

gmischler left a comment

gmischler commented Jun 6, 2024

gmischler commented Jun 17, 2024

carlhiggs commented Jun 17, 2024

gmischler commented Jun 18, 2024

carlhiggs commented Jun 21, 2024 •

edited

Loading

gmischler commented Jun 21, 2024

carlhiggs commented Jun 21, 2024 •

edited

Loading

carlhiggs commented Jun 21, 2024

gmischler commented Jun 21, 2024

carlhiggs commented Jun 21, 2024

carlhiggs commented Jun 21, 2024

carlhiggs commented Jun 21, 2024

gmischler commented Jun 23, 2024

carlhiggs commented Jun 24, 2024

gmischler commented Jun 24, 2024

carlhiggs commented Jun 25, 2024

gmischler commented Jun 25, 2024

carlhiggs commented Jun 26, 2024 •

edited

Loading

gmischler commented Jun 26, 2024

carlhiggs commented Jun 26, 2024

added in wrapping on zero width space to better support word wrapping in languages like Thai #1191

added in wrapping on zero width space to better support word wrapping in languages like Thai #1191

Conversation

carlhiggs commented Jun 3, 2024

gmischler left a comment

Choose a reason for hiding this comment

gmischler commented Jun 6, 2024

gmischler commented Jun 17, 2024

carlhiggs commented Jun 17, 2024

gmischler commented Jun 18, 2024

carlhiggs commented Jun 21, 2024 • edited Loading

gmischler commented Jun 21, 2024

carlhiggs commented Jun 21, 2024 • edited Loading

carlhiggs commented Jun 21, 2024

gmischler commented Jun 21, 2024

carlhiggs commented Jun 21, 2024

carlhiggs commented Jun 21, 2024

carlhiggs commented Jun 21, 2024

gmischler commented Jun 23, 2024

carlhiggs commented Jun 24, 2024

gmischler commented Jun 24, 2024

carlhiggs commented Jun 25, 2024

gmischler commented Jun 25, 2024

carlhiggs commented Jun 26, 2024 • edited Loading

gmischler commented Jun 26, 2024

carlhiggs commented Jun 26, 2024

carlhiggs commented Jun 21, 2024 •

edited

Loading

carlhiggs commented Jun 21, 2024 •

edited

Loading

carlhiggs commented Jun 26, 2024 •

edited

Loading