Coords in extract text #1389

srogmann · 2022-10-10T21:36:23Z

This pull request changes the positions of the calls of the text-visitor function.

Previously, the text-visitor function was called at each change of the output.
But this can lead to wrong coordinates because the output may be sent after the text matrix has already been changed for the next text.
As an example, have a look at resources/Sample_Td-matrix.pdf: the text_matrix is computed correctly at the Td operations, but the text was sent after the next transformation had been applied.
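
The effect can be reproduced with a toy interpreter. This is only a simplified model, not the PyPDF2 code: `run`, the operator tuples and the two-argument visitor are hypothetical names chosen for illustration. In the deferred mode the buffered text is reported after the next Td has already moved the matrix.

```python
# Toy content-stream interpreter; all names here are hypothetical.
def run(ops, visitor, visit_in_tj):
    tm = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]  # text matrix as [a, b, c, d, e, f]
    pending = None  # text waiting to be reported (deferred mode only)
    for operator, operand in ops:
        if operator == "Td":
            tx, ty = operand
            # Td translates the text origin by (tx, ty).
            tm[4] += tx * tm[0] + ty * tm[2]
            tm[5] += tx * tm[1] + ty * tm[3]
            if pending is not None:
                # Deferred mode: the text is reported only now, when
                # the matrix already belongs to the next text.
                visitor(pending, tuple(tm))
                pending = None
        elif operator == "Tj":
            if visit_in_tj:
                visitor(operand, tuple(tm))  # report with the current matrix
            else:
                pending = operand
    if pending is not None:
        visitor(pending, tuple(tm))


ops = [("Td", (10, 700)), ("Tj", "Hello"), ("Td", (0, -20)), ("Tj", "World")]

deferred, immediate = [], []
run(ops, lambda text, tm: deferred.append((text, tm[4], tm[5])), visit_in_tj=False)
run(ops, lambda text, tm: immediate.append((text, tm[4], tm[5])), visit_in_tj=True)

print(deferred)   # [('Hello', 10.0, 680.0), ('World', 10.0, 680.0)]
print(immediate)  # [('Hello', 10.0, 700.0), ('World', 10.0, 680.0)]
```

In the deferred run "Hello" is reported at y=680 although it was drawn at y=700; visiting inside Tj reports the coordinates that were valid when the text was shown.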

In this pull request the texts are sent inside the TJ and Tj operations.
This may lead to sending letters instead of words:

    x=264.53, y=403.13, text='M'
    x=264.53, y=403.13, text='etad'
    x=264.53, y=403.13, text='ata'
    x=307.85, y=403.13, text=' '

Therefore there is a second commit which introduces a temporary visitor inside the processing of TJ.
The temporary visitor is used to collect the letters of TJ, which are sent after TJ has been processed.
When the temporary visitor is set, the original parameter is overwritten. I don't know if this is bad style in Python.
If it is, a local variable current_text_visitor could be introduced instead.
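
A minimal sketch of the collecting idea, using a local variable instead of overwriting the parameter. It is not the code of this PR: `process_tj_array`, the one-argument visitor and the sample operands are assumptions made for illustration.

```python
def process_tj_array(operands, visitor_text):
    """Toy TJ handler: report the collected text once (hypothetical helper)."""
    fragments = []

    def collecting_visitor(text):
        # Temporary visitor: only remembers the fragment.
        fragments.append(text)

    # The visitor in use lives in a local variable, so the caller's
    # visitor_text parameter is never reassigned.
    current_text_visitor = collecting_visitor
    for operand in operands:
        if isinstance(operand, str):
            current_text_visitor(operand)  # stands in for a Tj call
        # Numeric operands (kerning offsets) are ignored in this toy.

    # After the whole array is processed, send one event to the real visitor.
    visitor_text("".join(fragments))


events = []
process_tj_array(["M", -5, "etad", 3, "ata"], events.append)
print(events)  # ['Metadata']
```

The real visitor receives one event per TJ array instead of one per string operand.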

See also issue #1377. I haven't checked if #1377 had the Td-matrix-problem or the one to be solved by this PR.

When the text changes are visited at changes of the collecting output variable, the coordinates may be outdated. This commit visits texts in Tj and TJ. This gives better coordinates but may result in sending letters instead of whole words.

The coordinates of the texts after Td are correct but were wrong when visited. The visit change in _page.py (Tj and TJ only) fixed this. This commit contains an update of the corresponding tests.

pubpub-zz · 2022-10-11T17:27:49Z

Have you been able to check the following:

the final text before the EM is correctly processed
behavior with arabic
behavior with mixed text font (changing to bold / chinese)
behavior with mixed orientation (eg arabic + numbers)

srogmann · 2022-10-11T21:59:43Z

@pubpub-zz Yesterday I executed the tests in tests/test_page.py.
My changes in this PR shouldn't change the non-visitor text extraction -- at least that is my intention. In the rtl-aware section I introduced a variable tj_text to collect the text produced in that section. Depending on Python's internal string concatenation this might improve performance a bit (concatenating short strings instead of long ones).

Looking at your questions I started with behavior in arabic using sample-files/015-arabic/habibi.pdf. But there seems to be a doubled interpretation of the first text -- in the PR-version and the non-PR-version.

I commented out the arabic text:

    /XGBNKK 1 Tf
    [<004b00440045004c0045004c>] TJ
    /DWGDXP 1 Tf
    %[<000303f2039203f40392>-150.75<02f4>351<03a3>]TJ

But I got arabic letters:
extract: يبيبَحhabibi

The codepoint 004b seems to include the letter 'h' and arabic characters when extracting via PyPDF2:

    [<004b00450045004500450045>] TJ
    /DWGDXP 1 Tf
    %[<00000000000000000000>-150.75<02f4>351<03a3>]TJ

gives
extract: يبيبَحhbbbbb
The source is the mapping in the corresponding cmap:

    /CIDInit /ProcSet findresource begin
    12 dict begin
    begincmap
    /CIDSystemInfo
    << /Registry (Adobe)
    /Ordering (UCS)
    /Supplement 0
    >> def
    /CMapName /Adobe-Identity-UCS def
    /CMapType 2 def
    1 begincodespacerange
    <0000> <ffff>
    endcodespacerange
    4 beginbfchar
    <004b> <062d064e0628064a0628064a00200068>
    <0044> <0061>
    <0045> <0062>
    <004c> <0069>
    endbfchar
    endcmap
    CMapName currentdict /CMap defineresource pop
    end
    end
    endstream

In <004b> the 0068 is the 'h' of 'habibi'. But it is accompanied by arabic letters not shown in viewers like xpdf or gs.
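
The mapping can be checked by hand: the target of a bfchar entry is a UTF-16BE string. The following sketch decodes the four entries quoted above; the helper name and the dictionary are only illustrative, and this is not how _cmap.py is structured.

```python
def decode_bfchar_target(hex_string):
    """Decode the right-hand side of a bfchar entry (UTF-16BE hex digits)."""
    return bytes.fromhex(hex_string).decode("utf-16-be")


# The four bfchar entries from the cmap quoted above.
bfchar = {
    "004b": "062d064e0628064a0628064a00200068",
    "0044": "0061",
    "0045": "0062",
    "004c": "0069",
}

# Codes of the TJ operand <004b00440045004c0045004c>.
codes = ["004b", "0044", "0045", "004c", "0045", "004c"]
text = "".join(decode_bfchar_target(bfchar[code]) for code in codes)

first = decode_bfchar_target(bfchar["004b"])
print(len(first))             # 8 code points for a single glyph code
print(text.endswith("habibi"))  # True
```

The single code 004b expands to six Arabic code points, a space and the letter 'h', which is why the Arabic text shows up even when the Arabic TJ line is commented out.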

pubpub-zz · 2022-10-11T22:05:05Z

I'm not worried about changes to the non-visitor output; I just want to be sure that your change properly generates the visitor function calls in the cases I imagined 😊.

srogmann · 2022-10-11T22:24:11Z

@pubpub-zz
I was looking for examples in the PyPDF2-resources to answer your questions. So I came across the habibi.pdf.
After looking at https://blog.idrsolutions.com/how-are-embedded-cmap-tables-in-pdf-file/ and issue #1111 I see what happened: The sample habibi.pdf uses very unusual ligatures. The display of '004b' is the "ligature" h but its text should be extracted as "يبيبَحh".

    <004b> <062d064e0628064a0628064a00200068>
    <0044> <0061>
    <0045> <0062>
    <004c> <0069>
    endbfchar

In the arabic text most characters' text should be empty:

    <0003> <>
    <03f2> <>
    <0392> <>
    <03f4> <>
    <02f4> <>
    <03a3> <062d064e0628064a0628064a0020>
    endbfchar

Ok, at least I understand what's going on there ...

Do you recommend some PDFs to check the answers to your questions?

pubpub-zz · 2022-10-12T17:44:09Z

Here is a PDF I've built:
test for TextVisitor.pdf

srogmann · 2022-10-12T20:22:41Z

Topic 1:

the final text before the EM is correctly processed

Dump of BDC-sections in a tagged PDF:

    reader = PdfReader(EXTERNAL_ROOT / "test.for.TextVisitor.pdf")
    page_t4tv = reader.pages[0]
    # We store the current marked-content sequence
    curr_BDC = [None]
    map_BDC = {}
    def visitor_BDC(op, args, cm, tm):
        if op == b"BDC":
            # Example of args: ['/P', {'/MCID': 0, '/Lang': b'en-US'}]
            curr_BDC[0] = str(args)
            print("BDC: {0}".format(args))
            if curr_BDC[0] not in map_BDC:
                map_BDC[curr_BDC[0]] = []
        if op == b"EMC":  # EMC ends a marked-content sequence (there is no EDC operator)
            curr_BDC[0] = None

    def visitor_BDC_text(text, cm_matrix, tm_matrix, font_dict, font_size):
        if text != "" and curr_BDC[0] is not None:
            map_BDC[curr_BDC[0]].append(text)

    page_t4tv.extract_text(visitor_operand_after=visitor_BDC, visitor_text=visitor_BDC_text)
    print("BDC-summary: {0}".format(map_BDC))

Output (line-breaks added manually):

BDC: ['/P', {'/MCID': 1, '/Lang': b'en-US'}]
BDC: ['/P', {'/MCID': 2, '/Lang': b'en-US'}]
BDC: ['/P', {'/MCID': 3, '/Lang': b'en-US'}]
BDC-summary: {"['/P', {'/MCID': 0, '/Lang': b'en-US'}]": ['this is a test : can you indicate what is text reporting in those cases:', ' ', ' '],
 "['/P', {'/MCID': 1, '/Lang': b'en-US'}]": ['\n', ' '],
 "['/P', {'/MCID': 2, '/Lang': b'en-US'}]": ['\n', 'text chang', ' ', 'ing page so with EM is reported', ' ', ' '],
 "['/P', {'/MCID': 3, '/Lang': b'en-US'}]": ['\n', ' ', ' ']}

The line-breaks mark the positions where text and coordinates had been sent without this PR.

srogmann · 2022-10-14T21:41:48Z

behavior with arabic

There are difficulties indeed! I added an evaluation of rtl_dir in the TJ-text-visitor. So the extraction of the first arabic sentence is fixed.

Text (70.824, 710.74), font /ABCDEE+Calibri: page 2 with some Arabic:
Text (70.824, 685.3), font /ABCDEE+Calibri: extracted from https://github.com/py-pdf/PyPDF2/issues/1296
Text (70.824, 659.86), font /ABCDEE+Calibri: if we say that we have a line that have this text :
Text (70.824, 644.5), font /ArialMT: هذا مثال على المشكل الذي يواجهني
Text (70.824, 619.78), font /ABCDEE+Calibri: (…)

But mixed content still gets scrambled -- if one ignores the x-coordinates:

Text (76.824, 507.31), font /ABCDEE+CourierNewPSMT: :مرﻗ ة راﻟمحﺎﻀ1: اﻟﻌوﻟمﺔglobalization :

Code used to extract the sample texts in this comment (ignoring x-coordinates):

    reader = PdfReader(EXTERNAL_ROOT / "test.for.TextVisitor.pdf")
    page_arabic = reader.pages[1]
    texts_ar = []
    def print_arabic(text, cm_matrix, tm_matrix, font_dict, font_size):
        if text.strip() != "" and tm_matrix[5] > 450:
            font_name = font_dict["/BaseFont"]
            (ax, ay) = (tm_matrix[4], tm_matrix[5])
            texts_ar.append((ax, ay, font_name, text))
    page_arabic.extract_text(visitor_text=print_arabic)
    cur_ar_x = None
    cur_ar_y = None
    cur_ar_fnt = None
    cur_head_len = 0
    cur_ar_text = ""
    for t in texts_ar:
        rep_text = t[3].replace("\ufffd", "\ufffd (replacement character)")
        if cur_ar_y != t[1] or cur_ar_fnt != t[2]:
            if cur_ar_text != "":
                print(cur_ar_text)
            cur_ar_x = t[0]
            cur_ar_y = t[1]
            cur_ar_fnt = t[2]
            cur_ar_text = "Text ({0}, {1}), font {2}: ".format(t[0], t[1], t[2])
            cur_head_len = len(cur_ar_text)
        cur_ar_text += rep_text
    print(cur_ar_text)

srogmann · 2022-10-14T23:21:30Z

"bidi" is not simple. I had a look at https://www.unicode.org/reports/tr9/ and https://www.unicode.org/Public/UCD/latest/ucd/extracted/DerivedBidiClass.txt.

The following visitor-sample uses a regular expression to detect RTL-ranges (like the expressions used in _page.py). To use custom ranges one may use a changed regular expression.

Result: Digit 1 in the globalization-line is at the left (as in the PDF).

text (70.824, 685.3): extracted from https://github.com/py-pdf/PyPDF2/issues/1296
text (70.824, 659.86): if we say that we have a line that have this text :
text (70.824, 644.5): هذا مثال على المشكل الذي يواجهني
text (70.824, 619.78): (…)
text (70.824, 594.34): for example if i have as pdf text
text (70.824, 578.98): 21محمد
text (70.824, 529.99): Ang-L1+sociology-globalisation: 
text (76.824, 507.31): :1اﻟﻌوﻟمﺔ: مرﻗ ة راﻟمحﺎﻀglobalization : 
text (76.824, 482.95): ﻟﻠﻌوﻟمﺔ ﺔ�خ�راﻟتﺎﺔ�فﺎﻟخﻠ:

Code of the visitor used: the texts are mapped to y and x.

    import re  # needed for the RTL regular expressions below

    page_arabic = reader.pages[1]
    map_ar_lines = {}
    def print_arabic(text, cm_matrix, tm_matrix, font_dict, font_size):
        if text.strip() != "" and tm_matrix[5] > 450:
            font_name = font_dict["/BaseFont"]
            (ax, ay) = (tm_matrix[4], tm_matrix[5])
            if ay not in map_ar_lines:
                map_ar_lines[ay] = {}
            cur_line = map_ar_lines[ay]
            if ax not in cur_line:
                cur_line[ax] = []
            cur_point = cur_line[ax]
            cur_point.append(text)
    page_arabic.extract_text(visitor_text=print_arabic)
    reg_ar_must = re.compile(r"[\u0590-\u08ff\ufb1d-\ufdff\ufe70-\ufeff]*")
    reg_ar_may = re.compile(r"[ -/:-@\u0590-\u08ff\ufb1d-\ufdff\ufe70-\ufeff\ufffd]*")
    for ay in reversed(sorted(map_ar_lines)):
        text = ""
        list_ar = []
        off_last_non_ar = -1
        is_rtl = False
        for ax in sorted(map_ar_lines[ay]):
            if text == "":
                text = "text ({0}, {1}): ".format(ax, ay)
            for cell in map_ar_lines[ay][ax]:
                if reg_ar_must.fullmatch(cell):  # strong RTL characters only
                    is_rtl = True
                    list_ar.insert(off_last_non_ar + 1, cell)
                elif is_rtl and reg_ar_may.fullmatch(cell):
                    list_ar.insert(off_last_non_ar + 1, cell)
                else:
                    is_rtl = False
                    list_ar.append(cell)
                    off_last_non_ar = len(list_ar) - 1
        text += "".join(list_ar)
        print(text)
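
The reordering step of the visitor can be isolated in a small, self-contained sketch. The character ranges are the ones used in the sample; the function names and the three-way split into "rtl", "neutral" and "ltr" are assumptions made for illustration. This is far from a full implementation of the Unicode bidi algorithm.

```python
import re

# Strong RTL characters (same ranges as in the sample).
RTL_STRONG = re.compile(r"[\u0590-\u08ff\ufb1d-\ufdff\ufe70-\ufeff]")
# Cells made only of RTL characters, punctuation, spaces or U+FFFD.
RTL_MAY = re.compile(r"[ -/:-@\u0590-\u08ff\ufb1d-\ufdff\ufe70-\ufeff\ufffd]+")


def classify(cell):
    """Return 'rtl', 'neutral' or 'ltr' for one text fragment."""
    if not RTL_MAY.fullmatch(cell):
        return "ltr"
    return "rtl" if RTL_STRONG.search(cell) else "neutral"


def order_cells(cells):
    """Insert RTL cells (and neutrals following them) after the last LTR cell."""
    ordered = []
    last_ltr = -1
    in_rtl = False
    for cell in cells:
        kind = classify(cell)
        if kind == "rtl" or (kind == "neutral" and in_rtl):
            in_rtl = True
            ordered.insert(last_ltr + 1, cell)  # prepend within the RTL run
        else:
            in_rtl = False
            ordered.append(cell)
            last_ltr = len(ordered) - 1
    return ordered


print(classify(" : "))            # neutral
print(classify("globalization"))  # ltr
print(classify("1"))              # ltr (digits are not in the RTL ranges)
```

Because digits fall outside both character classes, a fragment such as "1" ends an RTL run, which matches the digit staying on the left in the globalization line.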

Output of extract_text() (2.11.0, visitor not used): The digit 1 is not on the left side.

Ang-L1+sociology -globalisation:  
   
 : مرﻗ ة راﻟمحﺎﻀ1               : اﻟﻌوﻟمﺔglobalization :  
  
 : ﻟﻠﻌوﻟمﺔ ﺔ�خ�راﻟتﺎ ﺔ�فﺎﻟخﻠ

srogmann · 2022-10-15T10:22:55Z

Some thoughts: The visitor sample which sorts all text events by y- and x-coordinate can tackle complicated documents which send the text fragments not line by line, or in random order inside a line.
This sample may be of help in cases like those discussed in #1395 ("layout preserving text extraction").

In most cases the text-extraction via page.extract_text() is very efficient and gives the proper result.

At the current state -- I haven't answered all the questions of @pubpub-zz yet -- this PR looks good. The second question was helpful to improve RTL-handling.

The previous sample uses a regular expression to determine rtl-status. But that status is computed in _page.py already. So another solution might be to add flags "is_rtl" and "is_ltr" in the visitor-function's arguments to determine if we have "treasure", "يبيبَح" or neutral " : ". But then some kind of status-object in the visitor-arguments would be better than a lot of further arguments (@MartinThoma we talked about that) -- to preserve compatibility.

srogmann · 2022-10-15T12:05:06Z

I tested the map-y-x-sample above with #1395.

The TJ operations extend over several columns. In this case it would be good if one could disable the overriding of the visitor in the TJ handling. But first it would be necessary for the Tj implementation to update the text matrix. Have a look at 5.3.3 "Text Space Details". It describes the update of the text matrix when writing glyphs. @pubpub-zz This would be very useful!
Because then we would be able to process issue-914-xmp-data.pdf correctly :-).
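
For reference, section 5.3.3 gives the horizontal displacement of a glyph as tx = ((w0 - Tj/1000) * Tfs + Tc + Tw) * Th, and the text matrix is then translated by (tx, ty). A minimal sketch of that rule for horizontal writing follows; the function names and the 6-element list layout [a, b, c, d, e, f] are assumptions for illustration, not PyPDF2 API.

```python
def glyph_displacement(w0, tj, font_size, char_space, word_space, h_scale=1.0):
    """tx = ((w0 - Tj/1000) * Tfs + Tc + Tw) * Th for horizontal writing.

    w0 is the glyph width in text space (font units / 1000), tj is the
    numeric adjustment from a TJ array (0 for a plain glyph). Tw applies
    only to the single-byte space character, so pass 0.0 otherwise.
    """
    return ((w0 - tj / 1000.0) * font_size + char_space + word_space) * h_scale


def advance_text_matrix(tm, tx, ty=0.0):
    """Translate the matrix [a, b, c, d, e, f] by (tx, ty) in text space."""
    a, b, c, d, e, f = tm
    return [a, b, c, d, e + tx * a + ty * c, f + tx * b + ty * d]


# A glyph of width 500/1000 at font size 10 moves the origin by 5 units.
tx = glyph_displacement(w0=0.5, tj=0.0, font_size=10.0, char_space=0.0, word_space=0.0)
tm = advance_text_matrix([1.0, 0.0, 0.0, 1.0, 70.0, 597.0], tx)
print(tm)  # [1.0, 0.0, 0.0, 1.0, 75.0, 597.0]

# A TJ adjustment of -200 is subtracted, so it moves the origin to the right.
kern = glyph_displacement(w0=0.0, tj=-200.0, font_size=10.0, char_space=0.0, word_space=0.0)
```

Note the sign: a negative number in a TJ array increases the x position, a positive one decreases it.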

Sample output:

     2022 Intelligent Money British GT Championship
     TEST SESSION 1 - SECTOR ANALYSIS
       SECTOR 1 = FL to I1,    SECTOR 2 = I1 to I2,    SECTOR 3 = I2 to FL,    DIFF = Difference To Personal Best Lap,   P = Crossed Finish Line in Pit Lane,   D = Time Disallowed
        P1       77 Enduro MotorsportGT3PA                                              McLaren 720S GT3
        IDEAL LAP TIME :  1:26.866 BEST LAP TIME :  1:26.942 DIFFERENCE :                       0.076
        D1: Morgan TILLBROOK D2: Marcus CLUTTON
        LAP             SECTOR 1 SECTOR 2 SECTOR 3                                              LAP TIME DIFF TIME OF DAYMPH
         1 - D1    OUTLAP 116.7 37.750 139.2 35.381 101.0                                                                                11:03:12.126
         2 - D1      19.626 144.6 34.707 140.9 34.380 100.4 100.93                               1:28.713 1.771 11:04:40.839
         3 - D1      19.636 134.4 34.731 140.9 34.156 101.8                                      1:28.523 1.581 11:06:09.362101.15
         4 - D1      19.362 145.8 34.269 140.9 33.930 101.2 102.26                               1:27.561 0.619 11:07:36.923
         5 - D1      19.200 146.5 34.323 140.9 33.992 100.7 102.31                               1:27.515 0.573 11:09:04.438
         6 - D1      19.302 146.2 34.527 141.8 IN PIT                                            1:29.908 2.966 11:10:34.34699.59
                                                                                                           P
         7 - D1    OUTLAP 125.9 35.711 140.6 34.663 101.3 28.88                                  5:09.987 3:43.045 11:15:44.333
[...]

Events sent to the visitor (line 7):

(35.0, 597.0): 7 - D1
(70.0, 597.0): OUTLAP 125.9 35.711 140.6 34.663 101.3 28.88
(360.0, 597.0): 5:09.987 3:43.045 11:15:44.333

Source of the visitor trying to retain the layout:

    # Test 5 (sample text-layout)
    reader = PdfReader(EXTERNAL_ROOT / "missing_newlines.pdf")
    n_cols = 160
    f_x = n_cols / 595.0
    page_tl = reader.pages[6]
    map_tl_lines = {}
    def print_tl(text, cm_matrix, tm_matrix, font_dict, font_size):
        if text.strip() != "":
            (ax, ay) = (round(tm_matrix[4], 0), round(tm_matrix[5], 0))
            if ay not in map_tl_lines:
                map_tl_lines[ay] = {}
            cur_line = map_tl_lines[ay]
            if ax not in cur_line:
                cur_line[ax] = []
            cur_point = cur_line[ax]
            cur_point.append(text)
    page_tl.extract_text(visitor_text=print_tl)

    for ay in reversed(sorted(map_tl_lines)):
        text = ""
        for ax in sorted(map_tl_lines[ay]):
            col_x = round(f_x * ax, 0)
            if len(text) < col_x:
                text = ("{:<" + str(int(col_x)) + "}").format(text)
            for cell in map_tl_lines[ay][ax]:
                # print("({0}, {1}): {2}".format(ax, ay, cell))
                text += cell
        print(text)
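
The column placement can be tried without a PDF. The sketch below applies the same scaling (160 columns over a page width of 595) to the three events of line 7 shown above; `render_line` is a hypothetical helper, not part of PyPDF2.

```python
def render_line(cells, n_cols=160, page_width=595.0):
    """Place (x, text) cells of one baseline into a fixed-width text line."""
    f_x = n_cols / page_width  # columns per PDF unit
    line = ""
    for x, text in sorted(cells):
        col = round(f_x * x)
        line = line.ljust(col)  # pad only if the line is still shorter
        line += text
    return line


cells = [
    (35.0, "7 - D1"),
    (70.0, "OUTLAP 125.9 35.711 140.6 34.663 101.3 28.88"),
    (360.0, "5:09.987 3:43.045 11:15:44.333"),
]
line = render_line(cells)
print(line.index("7 - D1"), line.index("OUTLAP"), line.index("5:09.987"))  # 9 19 97
```

If a cell is longer than the gap to the next column, the next cell is simply appended, which is the cause of fused values such as "11:06:09.362101.15" in the sample output.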

By the way: I checked the map at page 2. The layout can be seen (you may have to rotate by 90°).

                                                                            
                                                                                                                           Length 2.4873 miles 4003.0 m  FL  52.82971 N 1.37867 W I1 941m 52.83226 N 1.37893 W I2 2641m 52.82866 N 1.37129 W Pit Entry 3966m 52.82951 N 1.37832 W Pit Exit 229m after FL 52.83002 N 1.38218 W Pit Entry–Pit Exit 256m, 18.4s @50kph, 15.3s @60kph 
                                                                           Coppice 
                                                      
                                                      
                                             McLeans 
                                                   
                                                   
                                                            Schwantz Curve 
                                                           
                                                           
                                              Starkeys Bridge                           
                                                                                        
                                                                                            The Esses 
                                                                                
                                                                                             
                                                                                
                                                                                             
                                              
                                                                           Goddards 
                                              
                                      Old Hairpin 
                                             
                                             
                                                     Craner Curves 
                                     www.tsl-timing.com
                                                                                                 
                                                                                                 
                                                        
                                                        
                                                                                            Melbourne Hairpin 
                                                                   Redgate 
                                                                         
             Donington Park GP 
                                                                         
                                                Hollywood 
                                     All results available at

This flag enables one to get a visitor_text call at each text operand of a TJ operation. The default is group_TJ = False: only one visitor_text call per TJ operation.

srogmann · 2022-10-16T20:32:41Z

I added an optional flag group_TJ. This enables one to choose between one visitor_text-event at TJ and a visitor_text-event for each Tj called by the TJ-implementation.
This gives the possibility to examine the text-fragments of a TJ-execution.

group_TJ = True: You can see why I would like an update of the text_matrix at each glyph. This glyph-update might be optional to preserve performance in linear documents.

(35.0, 597.0): 7 - D1
(70.0, 597.0): OUTLAP
(70.0, 597.0): 125.9
(70.0, 597.0): 35.711
(70.0, 597.0): 140.6
(70.0, 597.0): 34.663
(70.0, 597.0): 101.3
(70.0, 597.0): 28.88
(360.0, 597.0): 5:09.987
(360.0, 597.0): 3:43.045
(360.0, 597.0): 11:15:44.333

group_TJ = False (default):

(35.0, 597.0): 7 - D1
(70.0, 597.0): OUTLAP 125.9 35.711 140.6 34.663 101.3 28.88
(360.0, 597.0): 5:09.987 3:43.045 11:15:44.333
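
The two modes can be mimicked with a toy dispatcher. This is only an illustration of the behavior described above: `emit_tj`, its signature and the explicit " " fragments are assumptions, not the implementation in _page.py.

```python
def emit_tj(fragments, x, y, visitor, group_TJ=False):
    """Toy model of the flag as described in this comment (hypothetical helper).

    group_TJ=True:  one visitor call per text fragment of the TJ array.
    group_TJ=False: one visitor call with the joined text (default).
    The coordinates are identical for all fragments because the text
    matrix is not advanced per glyph.
    """
    if group_TJ:
        for fragment in fragments:
            visitor(fragment, x, y)
    else:
        visitor("".join(fragments), x, y)


fragments = ["OUTLAP", " ", "125.9", " ", "35.711"]

single, grouped = [], []
emit_tj(fragments, 70.0, 597.0, lambda t, x, y: single.append((x, y, t)))
emit_tj(fragments, 70.0, 597.0, lambda t, x, y: grouped.append((x, y, t)), group_TJ=True)

print(single)        # [(70.0, 597.0, 'OUTLAP 125.9 35.711')]
print(len(grouped))  # 5
```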

srogmann · 2022-10-16T21:07:07Z

Third question:

behavior with mixed text font (changing to bold / chinese)

Below is a sample which prints the font-names and explains some chinese words.

text (70.824, 444.31): [/ABCDEE+Calibri]An other file to be tested:
text (70.824, 418.87): https://github.com/py-pdf/PyPDF2/files/9454967/pdf_test.pdf
text (70.824, 367.97): And not changingfonts:
text (70.824, 340.85): Text in Calibri;[/ABCDEE+ComicSansMS]Now in Comic Sans Ms;[/ArialMT]And Arial to finish
text (70.824, 290.33): Test with some Chinese(extracted from page n°8of [/ABCDEE+Calibri]https://github.com/py-
text (70.824, 274.85): pdf/PyPDF2/files/9150656/ST.2019.PDF[/ArialMT])
text (70.824, 246.65): [/ABCDEE+Calibri]1.[/ABCDEE+MicrosoftYaHei]公司(公司=company)生产经营主体主要有控股子公司河南辅仁堂制药有限公司、全资子公司开封制药（集团）有限公
text (70.824, 228.65): 司，全资孙公司(公司=company)主要有河南同源制药有限公司、河南辅仁怀庆堂制药有限公司、开封豫港制药有限公
text (70.824, 210.65): 司、辅仁药业集团医药有限公司(公司=company)、郑州豫港制药有限公司、郑州远策生物制药有限公司、开药集团（
text (70.824, 192.62): 开鲁）制药有限公司(公司=company)、北京(北京=beijing)辅仁瑞辉生物医药研究院有限公司等。
text (70.824, 174.62): 公司(公司=company)主要产品为化学药、中成药、原料药、生物制药的研发、生产和销售。公司拥有药品批准文号[/ABCDEE+Calibri]547
text (70.824, 156.62): [/ABCDEE+MicrosoftYaHei]个，其中入选《医保目录（[/ABCDEE+Calibri]2019[/ABCDEE+MicrosoftYaHei]年版）》的品种[/ABCDEE+Calibri]313[/ABCDEE+MicrosoftYaHei]个，进入国家基本药物目录的品种[/ABCDEE+Calibri]150[/ABCDEE+MicrosoftYaHei]个，[/ABCDEE+Calibri]100[/ABCDEE+MicrosoftYaHei]个
text (70.824, 138.5): 药品品种进入地方医保目录。公司(公司=company)共拥有专利[/ABCDEE+Calibri]45[/ABCDEE+MicrosoftYaHei]项，其中发明专利[/ABCDEE+Calibri]22[/ABCDEE+MicrosoftYaHei]项[/ABCDEE+Calibri],[/ABCDEE+MicrosoftYaHei]实用新型专利[/ABCDEE+Calibri]23[/ABCDEE+MicrosoftYaHei]项。主要产品
text (70.824, 120.5): 覆盖包括粉针剂、片剂、原料药、水针剂、口服液、胶剂、胶囊剂、颗粒剂、中间体等多种剂型的化
text (70.824, 102.5): 学药、中成药、原料药和生物制药。产品质量符合中国药典标准，部分产品符合欧盟等国家和地区药
text (70.824, 84.504): 物进口标准并出口欧洲多个国家。

Source code of the implementation:

    reader = PdfReader(EXTERNAL_ROOT / "test.for.TextVisitor.pdf")
    page_chinese = reader.pages[1]
    map_ar_lines = {}
    state = {'font':None}
    cur_font = None
    mini_dict = {'公司':'公司(公司=company)', '北京':'北京(北京=beijing)'}
    def print_chinese(text, cm_matrix, tm_matrix, font_dict, font_size):
        if text.strip() != "" and tm_matrix[5] < 450:
            font_name = font_dict["/BaseFont"]
            (ax, ay) = (tm_matrix[4], tm_matrix[5])
            if ay not in map_ar_lines:
                map_ar_lines[ay] = {}
            cur_line = map_ar_lines[ay]
            if ax not in cur_line:
                cur_line[ax] = []
            cur_point = cur_line[ax]
            if font_name != state['font']:
               cur_point.append("[" + font_name + "]")
               state['font'] = font_name
            text_rep = text
            for k, v in mini_dict.items():
                text_rep = text_rep.replace(k, v, 1)
            cur_point.append(text_rep)
    page_chinese.extract_text(visitor_text=print_chinese)
    cur_font = None
    for ay in reversed(sorted(map_ar_lines)):
        text = ""
        list_ar = []
        for ax in sorted(map_ar_lines[ay]):
            if text == "":
                text = "text ({0}, {1}): ".format(ax, ay)
            for cell in map_ar_lines[ay][ax]:
                list_ar.append(cell)
        text += "".join(list_ar)
        print(text)

The fourth question:

behavior with mixed orientation (eg arabic + numbers)

This one was answered already above in the arabic sample. A mix of arabic and numbers is possible but one has to be careful.

srogmann · 2022-10-16T21:14:25Z

@pubpub-zz Thanks for your questions and the sample! They helped me to improve this PR.

srogmann · 2022-10-17T21:34:29Z

Excursus: I played a bit with the glyph widths --- but all of those offsets (Tj, Tc, Tw) and factors (Tfs, th) and signs (+ / -) in 5.3.3 are necessary.

Otherwise one gets a rather floating experience:

     2022 Intelligent Money             British GT Championship
     TEST SESSION 1 - SECTOR ANALYSIS
       SECTOR 1 = FL to I1, SECTOR 2 = I1to I2, SECTOR 3 = I2to FL,DIFF = Difference To Personal Best Lap,   P = Crossed F                                                                                            inish Line in Pit Lane,D = Time Disallowed
      Enduro Motorsport77P1GT3PA                                                        McLaren 720S GT     3
            IDEAL LAP TIME :  1:26.866BEST LAP TIME :  1:26.942DIFFERENCE :                                                                              0.076
        D1:Morgan TILLBD2RM:OOKarcus CLUTTON
        LAP            SECTOSECTOR 3SECTOR 2R 1                                         DIFFTIME OF DAYLAP TIME  MPH
         1 - D1                    35.38137.750101.0OUTLAP139.2116.7                                                                           11:03:12.126
                               100.932 - D1   34.38034.707100.419.626140.9144.6                                 1.77111:04:40.8391:28.713
                          3 - D1        34.156101.834.73119.636140.9134.4                                1.58111:06:09.3621:28.523        101.15
                                        102.264 - D1   33.93034.269101.219.362140.9145.8                                 0.61911:07:36.9231:27.561
                                 102.315 - D1   33.992100.734.32319.200140.9146.5                                 0.57311:09:04.4381:27.515
                             6 - D1       IN PIT34.52719.302141.8146.2                                     2.96611:10:34.3461:29.908         99.59
                                                                                                                                    P
         7 - D128.88         34.663OUTLAP35.711101.3140.6125.9                             3:43.04511:15:44.3335:09.987
      100.808 - D1    34.14435.110101.919.579140.3118.9                                1.89111:17:13.1661:28.833
102.549 - D1   33.97734.132101.619.215141.2145.8                                0.38211:18:40.4901:27.324
10 - D1102.5533.93034.215101.319.167141.2145.8                             0.37011:20:07.8021:27.312(3)
11 - D1     19.194           147.1             IN PIT34.112141.5        1.96211:21:36.7061:28.904P                    100.72
        12 - D144.70          41.436OUTLAP34.842102.7141.8142.4                              1:53.33711:24:56.9853:20.279
        13 - D1                19.093146.8                      33.945142.1       102.56  0.36111:26:24.28834.265101.61:27.303(2)
                        14 - D1   33.99019.116145.8                       33.836142.1            11:27:51.230101.5     (1)1:26.942         102.99
                                          15 - D1100.23         19.085     IN PIT34.645141.5146.2                           2.39211:29:20.5641:29.334P
        16 - D121.97          35.60099.5OUTLAP36.242139.5133.9                               5:20.50811:36:08.0146:47.450

Patch used to give the glyphs' widths some influence:

diff --git a/PyPDF2/_page.py b/PyPDF2/_page.py
index 3a581df..b05f90d 100644
--- a/PyPDF2/_page.py
+++ b/PyPDF2/_page.py
@@ -1355,6 +1355,8 @@ class PageObject(DictionaryObject):
             0.0,
             0.0,
         ]  # will store cm_matrix * tm_matrix
+        char_space = 0.0
+        word_space = 0.0
         char_scale = 1.0
         space_scale = 1.0
         _space_width: float = 500.0  # will be set correctly at first Tf
@@ -1386,7 +1388,7 @@ class PageObject(DictionaryObject):
             return _space_width / 1000.0
 
         def process_operation(operator: bytes, operands: List) -> None:
-            nonlocal cm_matrix, cm_stack, tm_matrix, tm_prev, output, text, char_scale, space_scale, _space_width, TL, font_size, cmap, orientations, rtl_dir, visitor_text
+            nonlocal cm_matrix, cm_stack, tm_matrix, tm_prev, output, text, char_space, word_space, char_scale, space_scale, _space_width, TL, font_size, cmap, orientations, rtl_dir, visitor_text
             global CUSTOM_RTL_MIN, CUSTOM_RTL_MAX, CUSTOM_RTL_SPECIAL_CHARS
 
             check_crlf_space: bool = False
@@ -1449,7 +1451,10 @@ class PageObject(DictionaryObject):
             # Table 5.2 page 398
             elif operator == b"Tz":
                 char_scale = float(operands[0]) / 100.0
+            elif operator == b"Tc":
+                char_space = float(operands[0])
             elif operator == b"Tw":
+                word_space = float(operands[0])
                 space_scale = 1.0 + float(operands[0])
             elif operator == b"TL":
                 TL = float(operands[0])
@@ -1583,6 +1588,49 @@ class PageObject(DictionaryObject):
                             visitor_text(
                                 tj_text, cm_matrix, tm_matrix, cmap[3], font_size
                             )
+                            if "/Widths" in cmap[3]:
+                                first_char = 0
+                                if "/FirstChar" in cmap[3]:
+                                    first_char = cmap[3]["/FirstChar"]
+                                widths = cmap[3]["/Widths"]
+                                sum_widths = 0
+                                for x in t:
+                                    sum_widths += widths[ord(x) - first_char]
+                                tx = sum_widths / 1000.0 * font_size + char_space + word_space # TODO correct factor
+                                ty = 0
+                                tm_matrix[4] += tx * tm_matrix[0] + ty * tm_matrix[2]
+                                tm_matrix[5] += tx * tm_matrix[1] + ty * tm_matrix[3]
+                            elif "/DescendantFonts" in cmap[3]:
+                                desc_font = cmap[3]["/DescendantFonts"][0].get_object()
+                                widths = desc_font["/W"]
+                                width = 0
+                                if "/DW" in desc_font:
+                                    width = desc_font["/DW"]
+                                map_widths = {} # move me to _cmap.py
+                                i = 0
+                                while i < len(widths):
+                                    cid_first = widths[i]
+                                    if isinstance(widths[i + 1], ArrayObject):
+                                       for j in range(len(widths[i + 1])):
+                                           map_widths[cid_first + j] = widths[i + 1][j]
+                                       i += 2
+                                    else:
+                                       cid_last = widths[i + 1]
+                                       w = widths[i + 2]
+                                       for j in range(cid_last - cid_first + 1):
+                                           map_widths[cid_first + j] = w
+                                       i += 3
+                                sum_widths = 0
+                                for x in t:
+                                    if ord(x) in map_widths:
+                                        sum_widths += map_widths[ord(x)]
+                                    else:
+                                        sum_widths += width
+                                tx = sum_widths / 1000.0 * font_size + char_space + word_space  # TODO correct factor
+                                ty = 0
+
+                                tm_matrix[4] += tx * tm_matrix[0] + ty * tm_matrix[2]
+                                tm_matrix[5] += tx * tm_matrix[1] + ty * tm_matrix[3]
             else:
                 return None
             if check_crlf_space:
@@ -1749,6 +1797,10 @@ class PageObject(DictionaryObject):
                     if isinstance(op, (str, bytes)):
                         process_operation(b"Tj", [op])
                     if isinstance(op, (int, float, NumberObject, FloatObject)):
+                        tx = float(op) / 1000.0 * font_size
+                        ty = 0 # TODO direction
+                        tm_matrix[4] += tx * tm_matrix[0] + ty * tm_matrix[2]
+                        tm_matrix[5] += tx * tm_matrix[1] + ty * tm_matrix[3]
                         if (
                             (abs(float(op)) >= _space_width)
                             and (len(text) > 0)
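
The expansion of the /W array in the patch above can be written as a standalone function. This is a sketch under the assumption that the array has already been converted to plain Python lists and numbers; the function names are hypothetical. The fallback of 1000 is the default value of /DW in the PDF reference, whereas the patch falls back to 0.

```python
def expand_cid_widths(w_array):
    """Expand a /W array into a {cid: width} mapping.

    Two entry forms exist:
      c [w1 w2 ... wn]   widths for CIDs c, c+1, ..., c+n-1
      c_first c_last w   the same width w for every CID in the range
    """
    widths = {}
    i = 0
    while i < len(w_array):
        first = w_array[i]
        if isinstance(w_array[i + 1], list):
            for offset, width in enumerate(w_array[i + 1]):
                widths[first + offset] = width
            i += 2
        else:
            last, width = w_array[i + 1], w_array[i + 2]
            for cid in range(first, last + 1):
                widths[cid] = width
            i += 3
    return widths


def string_width(cids, widths, default_width=1000):
    """Sum the glyph widths, using default_width (/DW) for unlisted CIDs."""
    return sum(widths.get(cid, default_width) for cid in cids)


widths = expand_cid_widths([1, [500, 600], 10, 12, 250])
print(widths)  # {1: 500, 2: 600, 10: 250, 11: 250, 12: 250}
print(string_width([1, 2, 99], widths))  # 2100
```

Building the mapping once per font, rather than once per Tj as in the patch, would avoid repeated work.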

MartinThoma · 2022-11-20T08:36:05Z

@srogmann Is there anything missing in this PR or is it ready to be merged?

I see that Flake8 complains:

./PyPDF2/_page.py:1746:29: B023 Function definition does not bind loop variable 'text_TJ'.
./PyPDF2/_page.py:1748:29: B023 Function definition does not bind loop variable 'text_TJ'.

srogmann · 2022-11-20T19:40:43Z

@MartinThoma In my opinion the PR is ready to be merged. More precisely: I don't know the current state of the main branch, but it was ready to be merged on October 16th.

The function visitor_text in line 1740 modifies the variable text_TJ. It is used in the TJ elif-section to temporarily replace the original visitor. I'm not a Python expert. Would nonlocal be a correct way to handle the B023 complaint from Flake8?

--- a/PyPDF2/_page.py
+++ b/PyPDF2/_page.py
@@ -1739,6 +1739,7 @@ class PageObject(DictionaryObject):
 
                     def visitor_text(text, cm_matrix, tm_matrix, font_dict, font_size):
                         # TODO cases where the current inserting order is kept
+                        nonlocal text_TJ
                         if rtl_dir:
                             # right-to-left
                             text_TJ.insert(0, text)
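
For context on the B023 warning, here is a minimal sketch of the late-binding behaviour it points at. The function names are hypothetical and none of this is taken from `_page.py`. A closure defined in a loop looks up the loop-local name when it is called, not when it is defined. `nonlocal` does not change that; it is only needed when the inner function assigns to the name, and `text_TJ.insert(...)` mutates the list without assigning. A common way to bind the current value is a default argument.

```python
from typing import Callable, List


def late_bound() -> List[List[str]]:
    """Closures share the loop-local name, so a late call hits the last list."""
    buckets: List[List[str]] = []
    visitors: List[Callable[[str], None]] = []
    for _ in range(2):
        bucket: List[str] = []
        buckets.append(bucket)

        def visitor(text: str) -> None:
            bucket.append(text)  # `bucket` is looked up when visitor runs

        visitors.append(visitor)
    visitors[0]("a")  # called after the loop: appends to the second bucket
    return buckets


def early_bound() -> List[List[str]]:
    """Binding the list as a default argument freezes it per iteration."""
    buckets: List[List[str]] = []
    visitors: List[Callable[[str], None]] = []
    for _ in range(2):
        bucket: List[str] = []
        buckets.append(bucket)

        def visitor(text: str, bucket: List[str] = bucket) -> None:
            bucket.append(text)  # uses the list captured at definition time

        visitors.append(visitor)
    visitors[0]("a")  # appends to the first bucket, as intended
    return buckets


print(late_bound())
print(early_bound())
```

If the temporary visitor is only ever called within the iteration that defines it, as the description above suggests, the late lookup should not produce wrong results; the default-argument form simply makes that explicit.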


torial · 2023-03-28T01:00:00Z

I tested this against a PDF that was having issues w/ the visitor_text results and the number of parsing issues greatly improved.

Before, the text-visitor function had been called at each change of the output. But this can lead to wrong coordinates because the output may be sent after changing the text matrix for the next text. As an example, have a look at resources/Sample_Td-matrix.pdf: The text_matrix is computed correctly at the Td operations, but the text had been sent after applying the next transformation.

In this pull request the texts are sent inside the TJ and Tj operations. This may lead to sending letters instead of words:

```
x=264.53, y=403.13, text='M'
x=264.53, y=403.13, text='etad'
x=264.53, y=403.13, text='ata'
x=307.85, y=403.13, text=' '
```

Therefore there is a second commit which introduces a temporary visitor inside the processing of TJ. The temp visitor is used to collect the letters of TJ, which will be sent after processing of TJ. When setting the temp visitor, the original parameter is manipulated. I don't know if this is bad style in Python. In case of bad style, a local variable current_text_visitor may be introduced.

See also issue #1377. I haven't checked if #1377 had the Td-matrix problem or the one to be solved by this PR.

--

This PR is a copy of #1389. PR #1389 was made a long time ago (before we renamed to pypdf), but it still seems valuable. This PR migrated the changes to the new codebase. Full credit to rogmann for all of the changes.

Co-authored-by: rogmann <github@rogmann.org>

MartinThoma · 2023-12-24T10:04:38Z

I'm closing this PR now in favor of #2364 (that one resolved the merge conflicts).

I'm sorry that more than a year has passed and the PR still hasn't been merged.

srogmann added 4 commits October 10, 2022 22:34

TST: Adding test of extract_table of Sample_Td-matrix.pdf.

1494812

The coordinates of the texts after Td are correct but were wrong when visited. The visit change in _page.py (Tj and TJ only) fixed this. This commit contains an update of the corresponding tests.

ENH: Send TJ-operands in one block to text-visitor.

36f158d

MAINT: Executed black.

40b0f98

BUG: visitor_text in TJ in case of right-to-left characters (non mixed!)

d79f150

ENH: Added optional argument group_TJ in extract_text(...).

ffce52a

This flag enables one to get a visitor_text call at each text operand of a TJ operation. The default is group_TJ = False: only one visitor_text call per TJ operation.
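
The two behaviours described in these commits (one call per TJ operation versus one call per text operand, plus the right-to-left insertion order) can be sketched without any PDF parsing. This is an illustration under assumptions, not the PR's implementation: `emit_tj` and its `per_operand` and `rtl` parameters are hypothetical names, and the visitor here takes only the text, without the matrices and font arguments of the real visitor_text.

```python
from typing import Callable, List, Sequence, Union

TJOperand = Union[str, int, float]


def emit_tj(
    operands: Sequence[TJOperand],
    visitor: Callable[[str], None],
    per_operand: bool = False,
    rtl: bool = False,
) -> None:
    """Send the text operands of one TJ array to visitor.

    per_operand=False collects all fragments and makes a single call;
    per_operand=True calls the visitor once per text fragment.
    """
    parts: List[str] = []
    for op in operands:
        if not isinstance(op, str):
            continue  # numeric kerning adjustments carry no text
        if per_operand:
            visitor(op)
        elif rtl:
            parts.insert(0, op)  # right-to-left: newest fragment goes first
        else:
            parts.append(op)
    if not per_operand and parts:
        visitor("".join(parts))


grouped: List[str] = []
emit_tj(["M", -20, "etad", 10, "ata"], grouped.append)

split: List[str] = []
emit_tj(["M", -20, "etad", 10, "ata"], split.append, per_operand=True)

print(grouped)
print(split)
```

Grouping trades positional detail for readability: the single call reports whole words, while the per-operand form keeps a separate call, and so potentially a separate coordinate, for every fragment.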

MartinThoma added the is-feature (A feature request) label Dec 12, 2022

Merge branch 'main' into coords_in_extract_text

3620963

MartinThoma reviewed Dec 22, 2022


Update PyPDF2/_page.py

ad101a0

MartinThoma added the workflow-advanced-text-extraction (Getting coordinates, font weight, font type, ...) label Aug 14, 2023

srogmann mentioned this pull request Sep 23, 2023

BUG: invalid cm/tm in visitor functions #2206

Merged

MartinThoma mentioned this pull request Dec 24, 2023

MAINT: Change the positions of the calls of the visitor-function #2364

Open

MartinThoma closed this Dec 24, 2023
