Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

added in wrapping on zero width space to better support word wrapping in languages like Thai #1191

Merged
merged 19 commits into from
Jun 26, 2024

Conversation

carlhiggs
Copy link

Fixes #1190

Checklist:

  • The GitHub pipeline is OK (green),
    meaning that both pylint (static code analyzer) and black (code formatter) are happy with the changes of this PR.

  • A unit test is covering the code added / modified by this PR

  • This PR is ready to be merged

  • In case of a new feature, docstrings have been added, with also some documentation in the docs/ folder

  • A mention of the change is present in CHANGELOG.md

By submitting this pull request, I confirm that my contribution is made under the terms of the GNU LGPL 3.0 license.

@carlhiggs carlhiggs requested a review from gmischler as a code owner June 3, 2024 12:34
Copy link
Collaborator

@gmischler gmischler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even if you only implement this simple change, there's no real point in limiting it to just two characters. Here's the set of the most likely candidates:

  • 2000 | EN QUAD
  • 2001 | EM QUAD
  • 2002 | EN SPACE
  • 2003 | EM SPACE
  • 2004 | THREE-PER-EM SPACE
  • 2005 | FOUR-PER-EM SPACE
  • 2006 | SIX-PER-EM SPACE
  • 2008 | PUNCTUATION SPACE
  • 2009 | THIN SPACE
  • 200A | HAIR SPACE
  • 205F | MEDIUM MATHEMATICAL SPACE
  • 3000 | IDEOGRAPHIC SPACE
  • 0009 | TAB

Near the top of html.py, there's a list containing a few more, which include various side effects and may or may not be relevant here.

The point right now is to add those to your two characters for triggering a break. That won't change anything in how your change works, but brings us yet another (small) step closer to a full wrapping algorithm.

fpdf/line_break.py Outdated Show resolved Hide resolved
CHANGELOG.md Outdated Show resolved Hide resolved
carlhiggs added a commit to carlhiggs/fpdf2 that referenced this pull request Jun 5, 2024
…, and updated CHANGELOG.md to refer to 'all common space symbols' as per py-pdf#1191
@gmischler
Copy link
Collaborator

Two things left to do:

  • Rebase your branch and resolve any conflicts with changes that have been merged in the mean time.
  • Update the PDF files that fail the comparison test (after thorougly checking that they are visually identical).

Then I think this is ready to merge!

@gmischler
Copy link
Collaborator

Are you still around @carlhiggs ?
We're this close to merging this, after just a little bit of cleanup...

@carlhiggs
Copy link
Author

Are you still around @carlhiggs ?
We're this close to merging this, after just a little bit of cleanup...

Hi @gmischler , yup still here! Sorry been busy on a another project. I started working my way through the pdf updates today -- I think about 47 pdf checks failed, but I'm working my way through and visually checking/updating them. The ones I've checked so far (~10) have looked visually identical, so have replaced pdfs and confirmed them to pass. But just need to work through list. Tricky with other work finding time, but I'll try to get there by Friday.

Hope that's okay and thanks for checking in; will be great to have this included in an update

@gmischler
Copy link
Collaborator

Hope that's okay and thanks for checking in; will be great to have this included in an update

That's perfectly fine. We absolutely do want to have this improvement, but we can wait a bit longer.

carlhiggs added a commit to carlhiggs/fpdf2 that referenced this pull request Jun 21, 2024
@carlhiggs
Copy link
Author

carlhiggs commented Jun 21, 2024

Hope that's okay and thanks for checking in; will be great to have this included in an update

That's perfectly fine. We absolutely do want to have this improvement, but we can wait a bit longer.

Hi @gmischler just confirming that I've now been through and manually checked that the comparison test failing PDFs matched both visually and in file size (KB), and updating the reference PDFs. PyTest checks now all pass on my end (other than 2 skipped, 2 xfailed, and 7 warnings; I don't think those are things I can do anything about).

I've also ensured my fork is sync'd and not behind the upstream.

Fingers crossed this is now good to go!

@gmischler
Copy link
Collaborator

Something weird is going on here with GitHub still seeing conflicts.
Can you try another rebase? Maybe that makes it go away.

carlhiggs added 10 commits June 21, 2024 19:28
…erowidthspace in addition to regular space for wrapping words
…rted change in documentation that specifically referred to unicode zero-width space as now the existing documentation that refered generically to wrapping on 'space characters' is accurate as multiple space characters are accounted for
…, and updated CHANGELOG.md to refer to 'all common space symbols' as per py-pdf#1191
@carlhiggs
Copy link
Author

carlhiggs commented Jun 21, 2024

Something weird is going on here with GitHub still seeing conflicts. Can you try another rebase? Maybe that makes it go away.

sorry I see what I did --- forgot I was working on a branch; i synced the master in Github, but not my branch...

working through it now --- a few more failing tests to resolve; I'll let you know once commited fixes, should be soon

@carlhiggs
Copy link
Author

Sorry for the confusion @gmischler and thanks for prompting me to rebase again --- should be all good now; fingers crossed again!

@gmischler
Copy link
Collaborator

Failing with an encryption test.

@carlhiggs
Copy link
Author

Failing with an encryption test.

I guess they are ones that were skipped on my local pytest run... i just ran again and looks like this on my side:
image

But I see what you mean in the logs... i'll try to look again...

@carlhiggs
Copy link
Author

While it says test_encrypt_font failed, it didn't on my end; presumably similar with other tests
image

Might relate to qpdf not being installed on my side.. I'll try installing and see if i have more luck. Or otherwise, I'll manually compare the tested and reference pdfs and if equivalent, replace the reference.

@carlhiggs
Copy link
Author

@gmischler I'm not sure what to do --- a lot of the tests failing are ones that I had already manually been through compared and replaced pdfs for; its pretty painstaking and I'm not sure that me doing it again will help. The tests are passing on my side. I don't know why they fail in the automated build, but also not sure what to do about it. I think I'll have to put this down here -- pretty late on a Friday sorry.

@gmischler
Copy link
Collaborator

@gmischler I'm not sure what to do

You could try to update your dependencies.
I think there was a recent change that caused the encrypted parts of a file to be stored in a slightly different format.

carlhiggs added a commit to carlhiggs/fpdf2 that referenced this pull request Jun 24, 2024
@carlhiggs
Copy link
Author

You could try to update your dependencies.
Thank you, I think this solved the previous issue, @gmischler . I set up a fresh Conda environment, installed the fpdf requirements again, re-ran tests identying failures and following manual checks of appearance and file size (kb) and replacing as required, all the pdf checks have now been resolved. Pytest passes on my system.

During the github actions build there was an unrelated failure on 'test_cell_speed_with_long_text' on 3.9 Ubuntu-latest, the other 3.9 Windows build was cancelled at that point, but the builds for 3.8 and 3.10 succeeded. So I think this is progress.

The cell speed test passes on my side, although maybe its because I don't have qdf (and running Windows with Python 3.12).

I see the cell speed test has been an issue before/recently (#923)... maybe its random bad luck? in #913 it was suggested that performance on this test may relate largely to demand at a particular time on Github Actions...

Or maybe adding in the other kinds of space character options for wrapping impacted performance --- we could pare back to essentials if it were a problem.

Again, not sure how to resolve this one; if you have advice, happy to look further into this.

@gmischler
Copy link
Collaborator

The timing error is persistent. I'd be surprised if your change really had such a significant performance impact (given that in most cases it will match the first character in the pattern).

But then, the PR also shows a lot of changes that were actually made in other PRs. It seems to compare against an earlier version of the repository. Can you please do a clean rebase, without changing anything else?

…manually split across several lines as per py-pdf#1191 (and resolved conflict following rebase)
…rted change in documentation that specifically referred to unicode zero-width space as now the existing documentation that refered generically to wrapping on 'space characters' is accurate as multiple space characters are accounted for
@carlhiggs carlhiggs force-pushed the wrap-zero-width-space branch from 4f105dc to c2c7b7e Compare June 25, 2024 01:55
@carlhiggs
Copy link
Author

Can you please do a clean rebase, without changing anything else?

Hi @gmischler I've given it a go! I followed https://johneverettcase.com/git-how-to-rebase-a-fork , and noticed this was a bit different to when I attempted it last time (there were a series of repeated conflicts, integrating minor changes made over time and confirming which to retain; last time there seemed to be fewer conflicts, so maybe I did something wrong previously). This seemed to go fine. There were a few PDFs that had conflicts (only 6, happily); I replaced these with the most recent ones I generated for the previous build (after resolving encoding errors).

In case you want to check that I did a 'clean rebase' the right way (I may have misunderstood). I've pasted my output below.

While I was writing this post, I see the 3.9 Ubuntu build failed on the timing test again,

=========================== short test summary info ============================
FAILED test/text/test_cell.py::test_cell_speed_with_long_text - assert 31.197102565700003 < 24
= 1 failed, 1297 passed, 1 skipped, 4 xfailed, 14 warnings in 687.00s (0:11:26) =
Error: Process completed with exit code 1.

Other details of the rebase wrapped in details brackets below:

Expand to view text copy of terminal log attempting clean rebase as suggested
>git clone https://github.com/carlhiggs/fpdf2.git
Cloning into 'fpdf2'...
remote: Enumerating objects: 15315, done.
remote: Counting objects: 100% (973/973), done.
remote: Compressing objects: 100% (536/536), done.
remote: Total 15315 (delta 617), reused 744 (delta 437), pack-reused 14342Receiving objects: 100% (15315/15315), 48.63 MiB | 13.38 MiB/s
Receiving objects: 100% (15315/15315), 52.85 MiB | 13.41 MiB/s, done.
Resolving deltas: 100% (10597/10597), done.

[computername] Tue 25/06/2024 11:28:59.16
D:\projects\repos

>cd fpdf2

[computername] Tue 25/06/2024 11:29:02.25
D:\projects\repos\fpdf2

>git remote add upstream https://github.com/py-pdf/fpdf2.git

[computername] Tue 25/06/2024 11:29:03.47
D:\projects\repos\fpdf2

>git remote -v
origin  https://github.com/carlhiggs/fpdf2.git (fetch)
origin  https://github.com/carlhiggs/fpdf2.git (push)
upstream        https://github.com/py-pdf/fpdf2.git (fetch)
upstream        https://github.com/py-pdf/fpdf2.git (push)

[computername] Tue 25/06/2024 11:29:07.03
D:\projects\repos\fpdf2

>git fetch upstream
remote: Enumerating objects: 23886, done.
remote: Counting objects: 100% (23772/23772), done.
remote: Compressing objects: 100% (2802/2802), done.
remote: Total 23618 (delta 20947), reused 23114 (delta 20468), pack-reused 0Receiving objects: 100% (23618/23618), 249.28 MiB | 13.84 MiB/s
Receiving objects: 100% (23618/23618), 251.05 MiB | 13.51 MiB/s, done.
Resolving deltas: 100% (20947/20947), completed with 42 local objects.
From https://github.com/py-pdf/fpdf2
 * [new branch]        2.7.9-release     -> upstream/2.7.9-release
 * [new branch]        add-script-compare-changed-pdfs -> upstream/add-script-compare-changed-pdfs
 * [new branch]        all-contributors/add-DarekRepos -> upstream/all-contributors/add-DarekRepos
 * [new branch]        all-contributors/add-carlhiggs -> upstream/all-contributors/add-carlhiggs
 * [new branch]        all-contributors/add-smitelli -> upstream/all-contributors/add-smitelli
 * [new branch]        gh-pages          -> upstream/gh-pages
 * [new branch]        html-split-style-and-attrs-handling -> upstream/html-split-style-and-attrs-handling
 * [new branch]        keep_aspect_ratio_and_x_align_center -> upstream/keep_aspect_ratio_and_x_align_center
 * [new branch]        master            -> upstream/master
 * [new branch]        rm-trailing-zeros -> upstream/rm-trailing-zeros
 * [new branch]        svg-text          -> upstream/svg-text
 * [new tag]           1.7               -> 1.7
 * [new tag]           1.7.1             -> 1.7.1
 * [new tag]           1.7.2             -> 1.7.2
 * [new tag]           2.0.0             -> 2.0.0
 * [new tag]           2.0.5             -> 2.0.5
 * [new tag]           2.1.0             -> 2.1.0
 * [new tag]           2.2.0             -> 2.2.0
 * [new tag]           2.3.0             -> 2.3.0
 * [new tag]           2.3.1             -> 2.3.1
 * [new tag]           2.3.2             -> 2.3.2
 * [new tag]           2.3.3             -> 2.3.3
 * [new tag]           2.3.4             -> 2.3.4
 * [new tag]           2.3.5             -> 2.3.5
 * [new tag]           2.4.0             -> 2.4.0
 * [new tag]           2.4.1             -> 2.4.1
 * [new tag]           2.4.2             -> 2.4.2
 * [new tag]           2.4.3             -> 2.4.3
 * [new tag]           2.4.4             -> 2.4.4
 * [new tag]           2.4.5             -> 2.4.5
 * [new tag]           2.4.6             -> 2.4.6
 * [new tag]           2.5.0             -> 2.5.0
 * [new tag]           2.5.1             -> 2.5.1
 * [new tag]           2.5.2             -> 2.5.2
 * [new tag]           2.5.3             -> 2.5.3
 * [new tag]           2.5.4             -> 2.5.4
 * [new tag]           2.5.5             -> 2.5.5
 * [new tag]           2.5.6             -> 2.5.6
 * [new tag]           2.5.7             -> 2.5.7
 * [new tag]           2.6.0             -> 2.6.0
 * [new tag]           2.6.1             -> 2.6.1
 * [new tag]           2.7.0             -> 2.7.0
 * [new tag]           2.7.1             -> 2.7.1
 * [new tag]           2.7.2             -> 2.7.2
 * [new tag]           2.7.3             -> 2.7.3
 * [new tag]           2.7.4             -> 2.7.4
 * [new tag]           2.7.5             -> 2.7.5
 * [new tag]           2.7.6             -> 2.7.6
 * [new tag]           2.7.7             -> 2.7.7
 * [new tag]           2.7.8             -> 2.7.8
 * [new tag]           2.7.9             -> 2.7.9
 * [new tag]           binary            -> binary
 * [new tag]           gif               -> gif
 * [new tag]           png_alpha         -> png_alpha
 * [new tag]           unicode           -> unicode

[computername] Tue 25/06/2024 11:29:42.18
D:\projects\repos\fpdf2

>git status
On branch master
Your branch is up to date with 'origin/master'.

nothing to commit, working tree clean

[computername] Tue 25/06/2024 11:29:54.46
D:\projects\repos\fpdf2

>git checkout wrap-zero-width-space
Switched to a new branch 'wrap-zero-width-space'
branch 'wrap-zero-width-space' set up to track 'origin/wrap-zero-width-space'.

[computername] Tue 25/06/2024 11:30:10.51
D:\projects\repos\fpdf2

>git branch wrap-zero-width-space-before-rebaes-2024-06-05

[computername] Tue 25/06/2024 11:30:45.17
D:\projects\repos\fpdf2

>git checkout wrap-zero-width-space
Already on 'wrap-zero-width-space'
Your branch is up to date with 'origin/wrap-zero-width-space'.

[computername] Tue 25/06/2024 11:30:47.69
D:\projects\repos\fpdf2

>git status
On branch wrap-zero-width-space
Your branch is up to date with 'origin/wrap-zero-width-space'.

nothing to commit, working tree clean

[computername] Tue 25/06/2024 11:31:01.14
D:\projects\repos\fpdf2

>git rebase upstream/master
Auto-merging fpdf/line_break.py
CONFLICT (content): Merge conflict in fpdf/line_break.py
Auto-merging test/fonts/test_wraps_zerowidthspace.py
CONFLICT (add/add): Merge conflict in test/fonts/test_wraps_zerowidthspace.py
error: could not apply d5808fe9... added in modification to linebreak.py and accompanying test to wrap zerowidthspace in addition to regular space for wrapping words
hint: Resolve all conflicts manually, mark them as resolved with
hint: "git add/rm <conflicted_files>", then run "git rebase --continue".
hint: You can instead skip this commit: run "git rebase --skip".
hint: To abort and get back to the state before "git rebase", run "git rebase --abort".
Could not apply d5808fe9... added in modification to linebreak.py and accompanying test to wrap zerowidthspace in addition to regular space for wrapping words

[computername] Tue 25/06/2024 11:31:13.19
D:\projects\repos\fpdf2

>cd fpdf

[computername] Tue 25/06/2024 11:35:29.32
D:\projects\repos\fpdf2\fpdf

>git add line_break.py

[computername] Tue 25/06/2024 11:35:39.14
D:\projects\repos\fpdf2\fpdf

>cd test
The system cannot find the path specified.

[computername] Tue 25/06/2024 11:35:45.94
D:\projects\repos\fpdf2\fpdf

>cd test
The system cannot find the path specified.

[computername] Tue 25/06/2024 11:38:36.94
D:\projects\repos\fpdf2\fpdf

>cd ..

[computername] Tue 25/06/2024 11:38:43.08
D:\projects\repos\fpdf2

>cd test

[computername] Tue 25/06/2024 11:38:44.12
D:\projects\repos\fpdf2\test

>cd fonts

[computername] Tue 25/06/2024 11:38:46.18
D:\projects\repos\fpdf2\test\fonts

>git add test_wraps_zerowidthspace.py

[computername] Tue 25/06/2024 11:38:54.19
D:\projects\repos\fpdf2\test\fonts

>git rebase --continue
Auto-merging fpdf/line_break.py
CONFLICT (content): Merge conflict in fpdf/line_break.py
Auto-merging test/fonts/test_wraps_zerowidthspace.py
CONFLICT (content): Merge conflict in test/fonts/test_wraps_zerowidthspace.py
error: could not apply 1123c5d2... ran black
hint: Resolve all conflicts manually, mark them as resolved with
hint: "git add/rm <conflicted_files>", then run "git rebase --continue".
hint: You can instead skip this commit: run "git rebase --skip".
hint: To abort and get back to the state before "git rebase", run "git rebase --abort".
Could not apply 1123c5d2... ran black

[computername] Tue 25/06/2024 11:39:12.57
D:\projects\repos\fpdf2\test\fonts

>git add test_wraps_zerowidthspace.py

[computername] Tue 25/06/2024 11:41:01.55
D:\projects\repos\fpdf2\test\fonts

>cd ..

[computername] Tue 25/06/2024 11:41:04.23
D:\projects\repos\fpdf2\test

>cd ..

[computername] Tue 25/06/2024 11:41:05.25
D:\projects\repos\fpdf2

>cd fpdf

[computername] Tue 25/06/2024 11:41:06.68
D:\projects\repos\fpdf2\fpdf

>git add line_break.py

[computername] Tue 25/06/2024 11:41:09.26
D:\projects\repos\fpdf2\fpdf

>git rebase --continue
[detached HEAD 8412b5be] addressed conflicts (updates to linespace break code and associated test) following git rebase
 Author: Carl Higgs <carlhiggs@gmail.com>
 1 file changed, 4 insertions(+)
Auto-merging CHANGELOG.md
CONFLICT (content): Merge conflict in CHANGELOG.md
error: could not apply 35969ce1... updated CHANGELOG.md
hint: Resolve all conflicts manually, mark them as resolved with
hint: "git add/rm <conflicted_files>", then run "git rebase --continue".
hint: You can instead skip this commit: run "git rebase --skip".
hint: To abort and get back to the state before "git rebase", run "git rebase --abort".
Could not apply 35969ce1... updated CHANGELOG.md

[computername] Tue 25/06/2024 11:42:00.89
D:\projects\repos\fpdf2\fpdf

>cd ..

[computername] Tue 25/06/2024 11:44:05.45
D:\projects\repos\fpdf2

>git add CHANGELOG.md

[computername] Tue 25/06/2024 11:44:09.40
D:\projects\repos\fpdf2

>git rebase --continue
[detached HEAD 5ae48472] updated CHANGELOG.md
 Author: Carl Higgs <carlhiggs@gmail.com>
 1 file changed, 1 insertion(+), 5 deletions(-)
Auto-merging test/fonts/test_wraps_zerowidthspace.py
CONFLICT (content): Merge conflict in test/fonts/test_wraps_zerowidthspace.py
error: could not apply eb7e800b... made sure that the zero-width-space test string was not too long and manually split across several lines
hint: Resolve all conflicts manually, mark them as resolved with
hint: "git add/rm <conflicted_files>", then run "git rebase --continue".
hint: You can instead skip this commit: run "git rebase --skip".
hint: To abort and get back to the state before "git rebase", run "git rebase --abort".
Could not apply eb7e800b... made sure that the zero-width-space test string was not too long and manually split across several lines

[computername] Tue 25/06/2024 11:44:32.52
D:\projects\repos\fpdf2

>cd test

[computername] Tue 25/06/2024 11:45:52.12
D:\projects\repos\fpdf2\test

>cd font
The system cannot find the path specified.

[computername] Tue 25/06/2024 11:45:53.75
D:\projects\repos\fpdf2\test

>cd fonts

[computername] Tue 25/06/2024 11:45:56.40
D:\projects\repos\fpdf2\test\fonts

>git add test_wraps_zerowidthspace.py

[computername] Tue 25/06/2024 11:46:00.62
D:\projects\repos\fpdf2\test\fonts

>git rebase --continue
[detached HEAD 7eb4048e] made sure that the zero-width-space test string was not too long and manually split across several lines as per #1191 (and resolved conflict following rebase)
 Author: Carl Higgs <carlhiggs@gmail.com>
 1 file changed, 9 insertions(+), 13 deletions(-)
Auto-merging docs/Text.md
Auto-merging fpdf/line_break.py
CONFLICT (content): Merge conflict in fpdf/line_break.py
error: could not apply 14bd41ff... updated wrapping to account for additional space characters, and reverted change in documentation that specifically referred to unicode zero-width space as now the existing documentation that refered generically to wrapping on 'space characters' is accurate as multiple space characters are accounted for
hint: Resolve all conflicts manually, mark them as resolved with
hint: "git add/rm <conflicted_files>", then run "git rebase --continue".
hint: You can instead skip this commit: run "git rebase --skip".
hint: To abort and get back to the state before "git rebase", run "git rebase --abort".
Could not apply 14bd41ff... updated wrapping to account for additional space characters, and reverted change in documentation that specifically referred to unicode zero-width space as now the existing documentation that refered generically to wrapping on 'space characters' is accurate as multiple space characters are accounted for

[computername] Tue 25/06/2024 11:46:57.07
D:\projects\repos\fpdf2\test\fonts

>cd ..

[computername] Tue 25/06/2024 11:48:30.83
D:\projects\repos\fpdf2\test

>cd ..

[computername] Tue 25/06/2024 11:48:32.23
D:\projects\repos\fpdf2

>cd fpdf

[computername] Tue 25/06/2024 11:48:34.51
D:\projects\repos\fpdf2\fpdf

>git add line_break.py

[computername] Tue 25/06/2024 11:48:37.34
D:\projects\repos\fpdf2\fpdf

>git rebase --continue
[detached HEAD e82090c8] updated wrapping to account for additional space characters, and reverted change in documentation that specifically referred to unicode zero-width space as now the existing documentation that refered generically to wrapping on 'space characters' is accurate as multiple space characters are accounted for
 Author: Carl Higgs <carlhiggs@gmail.com>
 2 files changed, 11 insertions(+), 11 deletions(-)
Auto-merging CHANGELOG.md
CONFLICT (content): Merge conflict in CHANGELOG.md
error: could not apply 0a5de1c6... updated line_break.py to list BREAKING_SPACE_SYMBOLS as a single list, and updated CHANGELOG.md to refer to 'all common space symbols' as per #1191
hint: Resolve all conflicts manually, mark them as resolved with
hint: "git add/rm <conflicted_files>", then run "git rebase --continue".
hint: You can instead skip this commit: run "git rebase --skip".
hint: To abort and get back to the state before "git rebase", run "git rebase --abort".
Could not apply 0a5de1c6... updated line_break.py to list BREAKING_SPACE_SYMBOLS as a single list, and updated CHANGELOG.md to refer to 'all common space symbols' as per #1191

[computername] Tue 25/06/2024 11:48:50.09
D:\projects\repos\fpdf2\fpdf

>cd ..

[computername] Tue 25/06/2024 11:49:29.82
D:\projects\repos\fpdf2

>git add CHANGELOG.md

[computername] Tue 25/06/2024 11:49:33.03
D:\projects\repos\fpdf2

>git rebase --continue
dropping bf013bba1fd05b25f320ec49264ab691694fa7d1 partial update rebasing main and addressing some (but not yet all) updates to pdf tests -- patch contents already upstream
warning: Cannot merge binary files: test/html/html_custom_pre_code_font.pdf (HEAD vs. e8824083 (updated test reference PDFs after manually confirming they were visually identical and had same file size, as per #1191))
Auto-merging test/html/html_custom_pre_code_font.pdf
CONFLICT (content): Merge conflict in test/html/html_custom_pre_code_font.pdf
warning: Cannot merge binary files: test/html/html_heading_hebrew.pdf (HEAD vs. e8824083 (updated test reference PDFs after manually confirming they were visually identical and had same file size, as per #1191))
Auto-merging test/html/html_heading_hebrew.pdf
CONFLICT (content): Merge conflict in test/html/html_heading_hebrew.pdf
warning: Cannot merge binary files: test/html/html_ul_type.pdf (HEAD vs. e8824083 (updated test reference PDFs after manually confirming they were visually identical and had same file size, as per #1191))
Auto-merging test/html/html_ul_type.pdf
CONFLICT (content): Merge conflict in test/html/html_ul_type.pdf
warning: Cannot merge binary files: test/html/issue_156.pdf (HEAD vs. e8824083 (updated test reference PDFs after manually confirming they were visually identical and had same file size, as per #1191))
Auto-merging test/html/issue_156.pdf
CONFLICT (content): Merge conflict in test/html/issue_156.pdf
warning: Cannot merge binary files: test/text_shaping/bidi_arabic_lorem_ipsum.pdf (HEAD vs. e8824083 (updated test reference PDFs after manually confirming they were visually identical and had same file size, as per #1191))
Auto-merging test/text_shaping/bidi_arabic_lorem_ipsum.pdf
CONFLICT (content): Merge conflict in test/text_shaping/bidi_arabic_lorem_ipsum.pdf
warning: Cannot merge binary files: test/text_shaping/bidi_paragraph_direction.pdf (HEAD vs. e8824083 (updated test reference PDFs after manually confirming they were visually identical and had same file size, as per #1191))
Auto-merging test/text_shaping/bidi_paragraph_direction.pdf
CONFLICT (content): Merge conflict in test/text_shaping/bidi_paragraph_direction.pdf
error: could not apply e8824083... updated test reference PDFs after manually confirming they were visually identical and had same file size, as per #1191
hint: Resolve all conflicts manually, mark them as resolved with
hint: "git add/rm <conflicted_files>", then run "git rebase --continue".
hint: You can instead skip this commit: run "git rebase --skip".
hint: To abort and get back to the state before "git rebase", run "git rebase --abort".
Could not apply e8824083... updated test reference PDFs after manually confirming they were visually identical and had same file size, as per #1191

[computername] Tue 25/06/2024 11:49:38.57
D:\projects\repos\fpdf2

>cd test

[computername] Tue 25/06/2024 11:53:53.79
D:\projects\repos\fpdf2\test

>cd html

[computername] Tue 25/06/2024 11:53:57.66
D:\projects\repos\fpdf2\test\html

>git add html_custom_pre_code_font.pdf

[computername] Tue 25/06/2024 11:54:04.16
D:\projects\repos\fpdf2\test\html

>git add html_ul_type.pdf

[computername] Tue 25/06/2024 11:54:09.19
D:\projects\repos\fpdf2\test\html

>git add html_heading_hebrew.pdf

[computername] Tue 25/06/2024 11:54:22.11
D:\projects\repos\fpdf2\test\html

>git add issue_156.pdf

[computername] Tue 25/06/2024 11:54:28.11
D:\projects\repos\fpdf2\test\html

>cd ..

[computername] Tue 25/06/2024 11:54:29.34
D:\projects\repos\fpdf2\test

>cd text_shaping

[computername] Tue 25/06/2024 11:54:32.07
D:\projects\repos\fpdf2\test\text_shaping

>git add bidi_arabic_lorem_ipsum.pdf

[computername] Tue 25/06/2024 11:54:39.32
D:\projects\repos\fpdf2\test\text_shaping

>git add bidi_paragraph_direction.pdf

[computername] Tue 25/06/2024 11:54:49.87
D:\projects\repos\fpdf2\test\text_shaping

>git rebase --continue
Successfully rebased and updated refs/heads/wrap-zero-width-space.

[computername] Tue 25/06/2024 11:54:57.93
D:\projects\repos\fpdf2\test\text_shaping

>git status
On branch wrap-zero-width-space
Your branch and 'origin/wrap-zero-width-space' have diverged,
and have 8 and 13 different commits each, respectively.
  (use "git pull" to merge the remote branch into yours)

nothing to commit, working tree clean

[computername] Tue 25/06/2024 11:55:23.81
D:\projects\repos\fpdf2\test\text_shaping

>git push origin wrap-zero-width-space --force
Enumerating objects: 148, done.
Counting objects: 100% (148/148), done.
Delta compression using up to 16 threads
Compressing objects: 100% (79/79), done.
Writing objects: 100% (88/88), 846.73 KiB | 20.65 MiB/s, done.
Total 88 (delta 39), reused 47 (delta 9), pack-reused 0
remote: Resolving deltas: 100% (39/39), completed with 20 local objects.
To https://github.com/carlhiggs/fpdf2.git
 + 4f105dcb...c2c7b7e5 wrap-zero-width-space -> wrap-zero-width-space (forced update)

If there's anything more I can do, just let me know.

@gmischler
Copy link
Collaborator

Yes, now the diffs look a lot more sane.

There's only this changelog entry about the "/Resources" directory, which is a duplicate of an entry in the *fixed* category that somehow got stuck. You'll need to remove your errant copy.

I also just did a quick check, and found that searching individual characters in a string is several times faster than searching them in a list (not really surprising if you think about it). So if you convert your collection into a string, then the "long_text" timeout might actually go away again. This search is done for every single character in the user input, after all, so it is worth avoiding unnecessary delays.

import timeit

BREAKING_SPACE_SYMBOLS_STR = ''.join(BREAKING_SPACE_SYMBOLS)

res = timeit.repeat(stmt="'x' in BREAKING_SPACE_SYMBOLS", repeat=1, number=1000000, globals=globals())
print('CLIST:', res)
res = timeit.repeat(stmt="'x' in BREAKING_SPACE_SYMBOLS_STR", repeat=1, number=1000000, globals=globals())
print('CSTR:', res)

output (varies a lot depending on system load):
CLIST: [0.9791883510001753]
CSTR: [0.12988920500038148]

@carlhiggs
Copy link
Author

carlhiggs commented Jun 26, 2024

Great idea replacing the list with a string for the check @gmischler ; I just did that and corrected the changelog.

Sadly still fails the time test...

=========================== short test summary info ============================
FAILED test/text/test_cell.py::test_cell_speed_with_long_text - assert 31.066836576600004 < 24
= 1 failed, 1297 passed, 1 skipped, 4 xfailed, 14 warnings in 679.50s (0:11:19) =
Error: Process completed with exit code 1.

It would be great if I could replicate this error on my local machine to try workarounds, but on mine, it passes.

Here I ran it using pytest -k test_cell_speed_with_long_text and including loading pytset, it took 46 seconds; I couldn't figure out how to get the test time logging output displayed on my system. But it passed the test. I also tried this approach using the older code (using the list, and space alone) and each took about 46 seconds.

I was curious how long the test_cell_speed_with_long_text() function would take on its own using the three approaches for breaking (str, list and space), so I run the following as a very crude check (I didn't do the timeit repetitions, which would have been better; I was just wanting to do a quick indicative check of run time with the change and without on my system):

import gzip
from pathlib import Path

import pytest

from fpdf import FPDF, FPDFException
from test.conftest import (
    assert_pdf_equal,
    ensure_exec_time_below,
    ensure_rss_memory_below,
    LOREM_IPSUM,
)

HERE = Path.cwd().resolve() / 'test/text'
FONTS_DIR = HERE.parent / "fonts"

with gzip.open(HERE / "long_text.txt.gz") as text_file:
    LONG_TEXT_LINES = text_file.read().decode(encoding="utf-8").splitlines()

@ensure_exec_time_below(seconds=24)
@ensure_rss_memory_below(mib=1)
def test_cell_speed_with_long_text():  # issue #907
    pdf = FPDF()
    pdf.add_font(fname=FONTS_DIR / "DejaVuSans.ttf")
    pdf.set_font("DejaVuSans", size=5)
    pdf.add_page()
    assert len(LONG_TEXT_LINES) == 26862
    for line in LONG_TEXT_LINES:
        pdf.cell(0, 3, line, new_x="LMARGIN", new_y="NEXT")

import timeit
timeit.timeit(test_cell_speed_with_long_text, number=1)

with character in BREAKING_SPACE_SYMBOLS_STR (current version of this PR, using reference string):

>>> timeit.timeit(test_cell_speed_with_long_text, number=1)
5.54722400000901

with character IN BREAKING_SPACE_SYMBOLS (previous version of this PR, using reference list):

>>> timeit.timeit(test_cell_speed_with_long_text, number=1)
5.622404900001129

with character==SPACE (the code before this pull request):

>>> timeit.timeit(stmt=test_cell_speed_with_long_text,number=1)
5.52745490000234

I think this is good news, in that the proposed change to line_break.py doesn't make much of a meaningful difference to the time it takes to run the test_cell_speed_with_long_text. As you demonstrated (in a more robust way) time to compare against characters in a string is quicker than when using a list. The difference to comparison against a single character was a fraction of a second, so hardly meaningful. I don't see why the change should fail the timing test, when it didn't previously --- in my test it took 0.02 of a second longer to run using the new linebreak.py code compared to the older, and neither exceeded the 24 second limit. Its true, my computer isn't a slouch (32GB ram etc), but the point is, if it passed before, it should in theory pass now because the extra code doesn't seem to cause a meaningful difference in running time.

Happy to make further amendments, but as before, I'm not sure what else to do!

@gmischler
Copy link
Collaborator

This comparison is only a very small part of the overall text processing, so it's no surprise that the difference hides in the statistical noise in your more complete example. In the strictest sense this turns that change into a "premature optimization", but if we can have even an insignificant piece of code run 5x as fast with no downside whatsoever, we should do so just out of habit.

Don't worry about the long text test failing. I saw that it failed in #1211 as well, so it must have another reason. Either the environment of the respective VM has changed, or we have recently merged another commit that made things slower.

Thanks a lot for this helpful addition, @carlhiggs !
Merging now...

@gmischler gmischler merged commit f399dda into py-pdf:master Jun 26, 2024
8 of 11 checks passed
@carlhiggs
Copy link
Author

Excellent, thanks a lot @gmischler for the patient guidance in getting this feature developed; much appreciated!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants