-
Notifications
You must be signed in to change notification settings - Fork 259
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
added in wrapping on zero width space to better support word wrapping in languages like Thai #1191
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even if you only implement this simple change, there's no real point in limiting it to just two characters. Here's the set of the most likely candidates:
- 2000 | EN QUAD
- 2001 | EM QUAD
- 2002 | EN SPACE
- 2003 | EM SPACE
- 2004 | THREE-PER-EM SPACE
- 2005 | FOUR-PER-EM SPACE
- 2006 | SIX-PER-EM SPACE
- 2008 | PUNCTUATION SPACE
- 2009 | THIN SPACE
- 200A | HAIR SPACE
- 205F | MEDIUM MATHEMATICAL SPACE
- 3000 | IDEOGRAPHIC SPACE
- 0009 | TAB
Near the top of html.py, there's a list containing a few more, which include various side effects and may or may not be relevant here.
The point right now is to add those to your two characters for triggering a break. That won't change anything in how your change works, but brings us yet another (small) step closer to a full wrapping algorithm.
…, and updated CHANGELOG.md to refer to 'all common space symbols' as per py-pdf#1191
Two things left to do:
Then I think this is ready to merge! |
Are you still around @carlhiggs ? |
Hi @gmischler , yup still here! Sorry been busy on a another project. I started working my way through the pdf updates today -- I think about 47 pdf checks failed, but I'm working my way through and visually checking/updating them. The ones I've checked so far (~10) have looked visually identical, so have replaced pdfs and confirmed them to pass. But just need to work through list. Tricky with other work finding time, but I'll try to get there by Friday. Hope that's okay and thanks for checking in; will be great to have this included in an update |
That's perfectly fine. We absolutely do want to have this improvement, but we can wait a bit longer. |
…lly identical and had same file size, as per py-pdf#1191
Hi @gmischler just confirming that I've now been through and manually checked that the comparison test failing PDFs matched both visually and in file size (KB), and updating the reference PDFs. PyTest checks now all pass on my end (other than 2 skipped, 2 xfailed, and 7 warnings; I don't think those are things I can do anything about). I've also ensured my fork is sync'd and not behind the upstream. Fingers crossed this is now good to go! |
Something weird is going on here with GitHub still seeing conflicts. |
…erowidthspace in addition to regular space for wrapping words
…ace characters for word wrapping purposes
…manually split across several lines
…rted change in documentation that specifically referred to unicode zero-width space as now the existing documentation that refered generically to wrapping on 'space characters' is accurate as multiple space characters are accounted for
…, and updated CHANGELOG.md to refer to 'all common space symbols' as per py-pdf#1191
…dates to pdf tests
…branch to resolve git rebase conflicts
sorry I see what I did --- forgot I was working on a branch; i synced the master in Github, but not my branch... working through it now --- a few more failing tests to resolve; I'll let you know once commited fixes, should be soon |
Sorry for the confusion @gmischler and thanks for prompting me to rebase again --- should be all good now; fingers crossed again! |
Failing with an encryption test. |
@gmischler I'm not sure what to do --- a lot of the tests failing are ones that I had already manually been through compared and replaced pdfs for; its pretty painstaking and I'm not sure that me doing it again will help. The tests are passing on my side. I don't know why they fail in the automated build, but also not sure what to do about it. I think I'll have to put this down here -- pretty late on a Friday sorry. |
You could try to update your dependencies. |
During the github actions build there was an unrelated failure on 'test_cell_speed_with_long_text' on 3.9 Ubuntu-latest, the other 3.9 Windows build was cancelled at that point, but the builds for 3.8 and 3.10 succeeded. So I think this is progress. The cell speed test passes on my side, although maybe its because I don't have qdf (and running Windows with Python 3.12). I see the cell speed test has been an issue before/recently (#923)... maybe its random bad luck? in #913 it was suggested that performance on this test may relate largely to demand at a particular time on Github Actions... Or maybe adding in the other kinds of space character options for wrapping impacted performance --- we could pare back to essentials if it were a problem. Again, not sure how to resolve this one; if you have advice, happy to look further into this. |
The timing error is persistent. I'd be surprised if your change really had such a significant performance impact (given that in most cases it will match the first character in the pattern). But then, the PR also shows a lot of changes that were actually made in other PRs. It seems to compare against an earlier version of the repository. Can you please do a clean rebase, without changing anything else? |
…est) following git rebase
…ace characters for word wrapping purposes
…manually split across several lines as per py-pdf#1191 (and resolved conflict following rebase)
…rted change in documentation that specifically referred to unicode zero-width space as now the existing documentation that refered generically to wrapping on 'space characters' is accurate as multiple space characters are accounted for
4f105dc
to
c2c7b7e
Compare
Hi @gmischler I've given it a go! I followed https://johneverettcase.com/git-how-to-rebase-a-fork , and noticed this was a bit different to when I attempted it last time (there were a series of repeated conflicts, integrating minor changes made over time and confirming which to retain; last time there seemed to be fewer conflicts, so maybe I did something wrong previously). This seemed to go fine. There were a few PDFs that had conflicts (only 6, happily); I replaced these with the most recent ones I generated for the previous build (after resolving encoding errors). In case you want to check that I did a 'clean rebase' the right way (I may have misunderstood). I've pasted my output below. While I was writing this post, I see the 3.9 Ubuntu build failed on the timing test again,
Other details of the rebase wrapped in details brackets below: Expand to view text copy of terminal log attempting clean rebase as suggested
If there's anything more I can do, just let me know. |
Yes, now the diffs look a lot more sane. There's only this changelog entry about the "/Resources" directory, which is a duplicate of an entry in the I also just did a quick check, and found that searching individual characters in a string is several times faster than searching them in a list (not really surprising if you think about it). So if you convert your collection into a string, then the "long_text" timeout might actually go away again. This search is done for every single character in the user input, after all, so it is worth avoiding unnecessary delays. import timeit
BREAKING_SPACE_SYMBOLS_STR = ''.join(BREAKING_SPACE_SYMBOLS)
res = timeit.repeat(stmt="'x' in BREAKING_SPACE_SYMBOLS", repeat=1, number=1000000, globals=globals())
print('CLIST:', res)
res = timeit.repeat(stmt="'x' in BREAKING_SPACE_SYMBOLS_STR", repeat=1, number=1000000, globals=globals())
print('CSTR:', res)
output (varies a lot depending on system load):
CLIST: [0.9791883510001753]
CSTR: [0.12988920500038148] |
…ces using string instead of list as per py-pdf#1191
Great idea replacing the list with a string for the check @gmischler ; I just did that and corrected the changelog. Sadly still fails the time test...
It would be great if I could replicate this error on my local machine to try workarounds, but on mine, it passes. Here I ran it using I was curious how long the
with
with
with
I think this is good news, in that the proposed change to line_break.py doesn't make much of a meaningful difference to the time it takes to run the Happy to make further amendments, but as before, I'm not sure what else to do! |
This comparison is only a very small part of the overall text processing, so it's no surprise that the difference hides in the statistical noise in your more complete example. In the strictest sense this turns that change into a "premature optimization", but if we can have even an insignificant piece of code run 5x as fast with no downside whatsoever, we should do so just out of habit. Don't worry about the long text test failing. I saw that it failed in #1211 as well, so it must have another reason. Either the environment of the respective VM has changed, or we have recently merged another commit that made things slower. Thanks a lot for this helpful addition, @carlhiggs ! |
Excellent, thanks a lot @gmischler for the patient guidance in getting this feature developed; much appreciated! |
Fixes #1190
Checklist:
The GitHub pipeline is OK (green),
meaning that both
pylint
(static code analyzer) andblack
(code formatter) are happy with the changes of this PR.A unit test is covering the code added / modified by this PR
This PR is ready to be merged
In case of a new feature, docstrings have been added, with also some documentation in the
docs/
folderA mention of the change is present in
CHANGELOG.md
By submitting this pull request, I confirm that my contribution is made under the terms of the GNU LGPL 3.0 license.