Coords in extract text #1389
When the text visitor is called at each change of the collecting output variable, the coordinates may be outdated. This commit calls the visitor in Tj and TJ instead. This gives better coordinates but may result in sending single letters instead of whole words.
The coordinates of the texts after Td are computed correctly but were wrong when visited. The visit change in _page.py (Tj and TJ only) fixed this. This commit contains an update of the corresponding tests.
Have you been able to check that :
|
@pubpub-zz Yesterday I executed the tests in tests/test_page.py. Looking at your questions, I started with the behavior in Arabic using sample-files/015-arabic/habibi.pdf. But there seems to be a doubled interpretation of the first text, both in the PR version and in the non-PR version. I commented out the Arabic text:
But I got Arabic letters: the code 004b seems to map to the letter 'h' plus Arabic characters when extracting via PyPDF2:
gives
In <004b> the 0068 is the 'h' of 'habibi'. But it is accompanied by Arabic letters that are not shown in viewers like xpdf or gs. |
I'm not worried about changes to the non-visitor output; I just want to be sure that your change properly generates visitor function calls in the cases I imagined 😊. |
@pubpub-zz <004b> <062d064e0628064a0628064a00200068>
<0044> <0061>
<0045> <0062>
<004c> <0069>
endbfchar
In the Arabic text, most characters' text should be empty:
<0003> <>
<03f2> <>
<0392> <>
<03f4> <>
<02f4> <>
<03a3> <062d064e0628064a0628064a0020>
endbfchar
OK, at least I understand what's going on there ... Do you recommend some PDFs to check the answers to your questions? |
Here is a PDF I've built: |
Topic 1:
Dump of BDC-sections in a tagged PDF:
Output (line-breaks added manually):
The line-breaks mark the positions where text and coordinates had been sent without this PR. |
There are difficulties indeed! I added an evaluation of rtl_dir in the TJ text visitor, so the extraction of the first Arabic sentence is fixed.
But mixed content still gets scrambled if one ignores the x-coordinates:
Code used to extract the texts of the samples in this comment (ignoring x-coordinates):
|
"bidi" is not simple. I had a look at https://www.unicode.org/reports/tr9/ and https://www.unicode.org/Public/UCD/latest/ucd/extracted/DerivedBidiClass.txt. The following visitor-sample uses a regular expression to detect RTL-ranges (like the expressions used in _page.py). To use custom ranges one may use a changed regular expression. Result: Digit 1 in the globalization-line is at the left (as in the PDF).
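As a rough illustration of the regular-expression approach described above, a direction check might look like the following. The character ranges and the names `RTL_CHARS`, `LTR_CHARS` and `direction` are assumptions made for this sketch; they are not necessarily the expressions used in _page.py, and the sketch does not implement the Unicode bidi algorithm.

```python
import re

# Assumed RTL ranges: Hebrew, Arabic and neighbouring blocks plus the
# Hebrew/Arabic presentation forms. Adjust the expression for custom ranges.
RTL_CHARS = re.compile("[\u0590-\u08ff\ufb1d-\ufdff\ufe70-\ufeff]")
# Assumed LTR ranges: basic Latin letters and Latin extensions.
LTR_CHARS = re.compile("[A-Za-z\u00c0-\u024f]")


def direction(text: str) -> str:
    """Classify a text fragment as 'rtl', 'ltr' or 'neutral'."""
    if RTL_CHARS.search(text):
        return "rtl"
    if LTR_CHARS.search(text):
        return "ltr"
    # Digits, punctuation and whitespace are treated as neutral here,
    # which is a simplification of the bidi classes in UAX #9.
    return "neutral"
```

A visitor could call `direction(text)` on each fragment and decide from the result whether to prepend or append the fragment to the current line.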
Code of used visitor: The texts are mapped to y and x.
Output of extract_text() (2.11.0, visitor not used): The digit 1 is not on the left side.
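The mapping of texts to y and x mentioned above could be sketched roughly as follows. This is not the visitor used in this comment; it is a minimal version that assumes the visitor receives the arguments `(text, cm, tm, font_dict, font_size)` and that `tm[4]` and `tm[5]` hold the x and y position. The helper name `make_yx_collector` is hypothetical, and the transformation matrix `cm` is ignored for simplicity.

```python
from collections import defaultdict


def make_yx_collector(ndigits: int = 0):
    """Return (visitor, render). The visitor files fragments under y, then x."""
    rows = defaultdict(list)  # rounded y -> list of (x, text)

    def visitor_text(text, cm, tm, font_dict, font_size):
        text = text.strip("\n")  # drop line breaks added by the extraction
        if text:
            x, y = tm[4], tm[5]
            rows[round(y, ndigits)].append((x, text))

    def render() -> str:
        lines = []
        # PDF y-coordinates grow upwards, so the largest y is the top line.
        for y in sorted(rows, reverse=True):
            fragments = sorted(rows[y], key=lambda item: item[0])
            lines.append("".join(text for _, text in fragments))
        return "\n".join(lines)

    return visitor_text, render
```

With a page object at hand, the visitor would be passed as `page.extract_text(visitor_text=visitor)` and `render()` called afterwards. Sorting by x places fragments by position rather than by the order in which the content stream sent them.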
|
Some thoughts: The visitor sample which sorts all text events by y- and x-coordinate can tackle complicated documents which send the text fragments not line by line, or in random order inside a line. In most cases the text extraction via page.extract_text() is very efficient and gives the proper result. At the current state (I haven't answered all the questions of @pubpub-zz yet) this PR looks good. The second question was helpful to improve RTL handling. The previous sample uses a regular expression to determine the RTL status. But that status is already computed in _page.py. So another solution might be to add flags "is_rtl" and "is_ltr" to the visitor function's arguments to determine if we have "treasure", "يبيبَح" or neutral " : ". But then some kind of status object in the visitor arguments would be better than a lot of further arguments (@MartinThoma we talked about that), to preserve compatibility. |
I tested the map-y-x sample above with #1395. The TJ operations extend over several columns. In this case it would be fine if one could disable overriding the visitor in the TJ handling. But before that, it would be necessary that the Tj implementation updates the text matrix. Have a look at 5.3.3 "Text Space Details". It describes the update of the text matrix when writing glyphs. @pubpub-zz This would be very useful! Sample output:
Events sent to the visitor (line 7):
Source of the visitor trying to retain the layout:
By the way: I checked the map on page 2. The layout can be seen (you may have to rotate by 90°).
|
This flag enables one to get a visitor_text call at each text operand of a TJ operation. The default is group_TJ = False, which gives only one visitor_text call per TJ operation.
I added an optional flag group_TJ. This enables one to choose between one visitor_text-event at TJ and a visitor_text-event for each Tj called by the TJ-implementation. group_TJ = True: You can see why I would like an update of the text_matrix at each glyph. This glyph-update might be optional to preserve performance in linear documents.
group_TJ = False (default):
|
Third question:
Below is a sample which prints the font names and explains some Chinese words.
Source code of the implementation:
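As a minimal sketch (not the implementation referred to above), a visitor that records the font name of each text fragment might look like this. It assumes `font_dict` is either None or a dictionary-like object with an optional "/BaseFont" entry; the helper name `make_font_logger` is hypothetical.

```python
def make_font_logger():
    """Return (visitor, seen). `seen` collects (font name, text) pairs."""
    seen = []

    def visitor_text(text, cm, tm, font_dict, font_size):
        stripped = text.strip()
        if not stripped:
            return  # ignore whitespace-only fragments
        # "?" marks fragments for which no font name is available.
        name = "?" if font_dict is None else str(font_dict.get("/BaseFont", "?"))
        seen.append((name, stripped))

    return visitor_text, seen
```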
The fourth question:
This one was already answered above in the Arabic sample. A mix of Arabic and numbers is possible but one has to be careful. |
@pubpub-zz Thanks for your questions and the sample! They helped me to improve this PR. |
Excursus: I played a bit with the glyph widths, but all of those offsets (Tj, Tc, Tw), factors (Tfs, Th) and signs (+ / -) in 5.3.3 are necessary. Otherwise one gets a rather floating experience:
Patch used to give the glyphs' widths some influence:
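For reference, and separate from the patch itself, the horizontal displacement from section 5.3.3 of the PDF Reference is tx = ((w0 - Tj/1000) × Tfs + Tc + Tw) × Th. A small sketch of it follows; the function and parameter names are chosen for this illustration only, `w0` is assumed to be the glyph width already divided by 1000, and `horizontal_scaling` is assumed to be the Tz value divided by 100.

```python
def glyph_displacement(w0, tj=0.0, font_size=1.0, char_spacing=0.0,
                       word_spacing=0.0, horizontal_scaling=1.0,
                       is_space=False):
    """Return tx = ((w0 - Tj/1000) * Tfs + Tc + Tw) * Th."""
    # Word spacing applies only to the single-byte space character (code 32).
    tw = word_spacing if is_space else 0.0
    return ((w0 - tj / 1000.0) * font_size + char_spacing + tw) * horizontal_scaling


def advance_text_matrix(tm, tx):
    """Translate the text matrix [a, b, c, d, e, f] by tx along its x-axis."""
    a, b, c, d, e, f = tm
    # Equivalent to multiplying the translation [1 0 0 1 tx 0] with tm.
    return [a, b, c, d, e + tx * a, f + tx * b]
```

Leaving out any of the terms shifts every following glyph, which matches the "floating" output described above.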
|
@srogmann Is there anything missing in this PR or is it ready to be merged? I see that Flake8 complains:
|
@MartinThoma In my opinion the PR is ready to be merged. More exactly: I don't know the current main branch, it was ready to be merged on October 16th. The function visitor_text in line 1740 modifies the variable text_TJ. The function visitor_text is used in the TJ-elif-section to replace the original visitor temporarily. I'm not a python-master. Would nonlocal be a correct way to handle the B023-complaint of Flake8?
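For illustration, `nonlocal` is the Python statement that lets a nested function rebind a variable of the enclosing function. The sketch below uses hypothetical names and is not the code from _page.py. Whether it silences B023 in this case is not confirmed here: B023 (from flake8-bugbear) is about functions defined inside a loop that use variables redefined in that loop, so moving the function definition out of the loop may also be needed.

```python
def collect_tj_fragments(fragments):
    """Join fragments via a nested callback that rebinds an enclosing variable."""
    text_tj = ""

    def visitor_text(text):
        # Without this declaration, `text_tj += text` would make text_tj a
        # local variable of visitor_text and raise UnboundLocalError.
        nonlocal text_tj
        text_tj += text

    for fragment in fragments:
        visitor_text(fragment)
    return text_tj
```

An alternative that avoids rebinding altogether is to append to a list in the enclosing scope and join it afterwards.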
|
I tested this against a PDF that was having issues w/ the |
Before, the text-visitor function was called at each change of the output. But this can lead to wrong coordinates because the output may be sent after the text matrix has been changed for the next text.
As an example, have a look at resources/Sample_Td-matrix.pdf: The text_matrix is computed correctly at the Td operations, but the text was sent after applying the next transformation.
In this pull request the texts are sent inside the TJ and Tj operations. This may lead to sending letters instead of words:
```
x=264.53, y=403.13, text='M'
x=264.53, y=403.13, text='etad'
x=264.53, y=403.13, text='ata'
x=307.85, y=403.13, text=' '
```
Therefore there is a second commit which introduces a temporary visitor inside the processing of TJ. The temporary visitor is used to collect the letters of TJ, which are sent after TJ has been processed.
When setting the temporary visitor, the original parameter is manipulated. I don't know if this is bad style in Python. If it is, a local variable current_text_visitor may be introduced.
See also issue #1377. I haven't checked if #1377 had the Td-matrix problem or the one to be solved by this PR.
--
This PR is a copy of #1389.
The PR #1389 was made a long time ago (before we renamed to pypdf), but it still seems valuable. This PR migrated the changes to the new codebase.
Full credit to rogmann for all of the changes.
Co-authored-by: rogmann <github@rogmann.org>
I'm closing this PR now in favor of #2364 (that one resolved the merge conflicts). I'm sorry that it's now over a year and the PR still didn't get merged. |
This pull request changes the positions of the calls of the text-visitor function.
Before, the text-visitor function was called at each change of the output.
But this can lead to wrong coordinates because the output may be sent after the text matrix has been changed for the next text.
As an example, have a look at resources/Sample_Td-matrix.pdf: The text_matrix is computed correctly at the Td operations, but the text was sent after applying the next transformation.
In this pull request the texts are sent inside the TJ and Tj operations.
This may lead to sending letters instead of words:
Therefore there is a second commit which introduces a temporary visitor inside the processing of TJ.
The temporary visitor is used to collect the letters of TJ, which are sent after TJ has been processed.
When setting the temporary visitor, the original parameter is manipulated. I don't know if this is bad style in Python.
If it is, a local variable current_text_visitor may be introduced.
See also issue #1377. I haven't checked if #1377 had the Td-matrix-problem or the one to be solved by this PR.
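The temporary-visitor idea with a local variable current_text_visitor can be sketched on a toy interpreter. Everything in this sketch is a simplification made for illustration: the operation tuples, the function name `process_operations` and the three-argument visitor are hypothetical, and Td is treated as a plain offset rather than a matrix update.

```python
def process_operations(operations, visitor_text):
    """Toy loop over ('Td', dx, dy), ('Tj', text) and ('TJ', [texts])."""
    x = y = 0.0
    collected = []

    def collect(text, _x, _y):
        collected.append(text)  # temporary visitor: gather the TJ fragments

    # Local name, so the visitor_text parameter itself is never reassigned.
    current_text_visitor = visitor_text

    def show_text(text):
        # Text is reported while the coordinates still belong to it.
        current_text_visitor(text, x, y)

    for op, *args in operations:
        if op == "Td":
            x, y = x + args[0], y + args[1]
        elif op == "Tj":
            show_text(args[0])
        elif op == "TJ":
            collected.clear()
            current_text_visitor = collect
            for fragment in args[0]:
                show_text(fragment)
            current_text_visitor = visitor_text
            # One event for the whole TJ instead of one per fragment.
            visitor_text("".join(collected), x, y)
```

Fed with a TJ of 'M', 'etad', 'ata' followed by a Td and a Tj of ' ', the visitor receives 'Metadata' as one event and then the space at the shifted position.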