Text and position extraction in a single call #325

yui-mhcp · 2024-07-30T14:40:13Z

yui-mhcp
Jul 30, 2024

Hello ! 😋

First of all, thank you for your nice work! It is a really powerful library compared to other PDF processing tools (I have tested pypdf and pdfminer.six).

I am wondering whether it is possible to access both text and positions in a single call. My objective is to extract the text together with its position in order to post-process the result. The expected output format would be something like [{'text' : text1, 'box' : box1}, ...]

To get this result, I have combined the get_objects and get_text_bounded functions to extract the text at each object position (the code is available below). However, compared with a single call to get_text_bounded, which extracts all the text without any position information, this is considerably slower, as shown in the benchmark below:

Timers for logger timer :
- fast_extraction : 77 ms
Timers for logger timer :
- parse_pypdfium2 : 445 ms
- pdf processing : 526 μs
- page processing executed 46 times : 444 ms (9.664 ms / exec)
  - object extraction executed 4907 times : 297 ms (60 μs / exec)

Both extraction methods have been applied to the same 46-page document, using my loggers module to track performance. It is available in this repository if you want to reproduce the code below.

The major bottleneck is the 4907 separate calls to get_text_bounded, one per text object.
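
For readers who do not have the loggers module at hand, the kind of report shown above (total time, call count, average per call) can be approximated with the standard library. This is only a minimal sketch: the `Timers` class and its method names are hypothetical and are not the loggers API.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class Timers:
    """Hypothetical stand-in for a timing logger: accumulates time per name."""

    def __init__(self):
        self.total = defaultdict(float)  # accumulated seconds per timer name
        self.count = defaultdict(int)    # number of executions per timer name

    @contextmanager
    def timer(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.total[name] += time.perf_counter() - start
            self.count[name] += 1

    def report(self):
        lines = []
        for name, total in self.total.items():
            n = self.count[name]
            lines.append(
                f"- {name} executed {n} times : {total * 1e3:.0f} ms "
                f"({total / n * 1e6:.0f} μs / exec)"
            )
        return "\n".join(lines)
```

Wrapping the body of the object loop in `with timers.timer('object extraction'):` and printing `timers.report()` afterwards gives the per-call average, which is what makes the cost of the repeated get_text_bounded calls visible.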

import pypdfium2
import pypdfium2.raw as pypdfium_c

from loggers import timer, time_logger, set_level

@timer
def fast_extraction(filename):
    file = pypdfium2.PdfDocument(filename)
    
    return {
        idx : file[idx].get_textpage().get_text_bounded() for idx in range(len(file))
    }

@timer
def parse_pypdfium2(filename, image_folder = None, pagenos = None, ** kwargs):
    with time_logger.timer('pdf processing'):
        pdf = pypdfium2.PdfDocument(filename)
    
    if pagenos is None: pagenos = range(len(pdf))
    
    filters = (pypdfium_c.FPDF_PAGEOBJ_TEXT, ) if not image_folder else ()
    
    document = {}
    for page_index in pagenos:
        with time_logger.timer('page processing'):
            page = pdf.get_page(page_index)
            text = page.get_textpage()

            img_num = 0
            paragraphs = []
            for obj in page.get_objects(filters):
                with time_logger.timer('object extraction'):
                    page_w, page_h = int(page.get_width()), int(page.get_height())
                    box = obj.get_pos()
                    scaled_box = [int(c) for c in box]
                    scaled_box[1], scaled_box[3] = page_h - scaled_box[3], page_h - scaled_box[1]
                    if obj.type == pypdfium_c.FPDF_PAGEOBJ_TEXT:
                        paragraphs.append({
                            # Feature request : replace this by `obj.get_text()`
                            'text': text.get_text_bounded(* box),
                            'box' : scaled_box,
                            'page_w'    : page_w,
                            'page_h'    : page_h
                        })
                    elif obj.type == pypdfium_c.FPDF_PAGEOBJ_IMAGE and image_folder:
                        pass
        
        document[page_index] = paragraphs  
    
    return document

set_level('time')

res1 = fast_extraction(filename)
res2 = parse_pypdfium2(filename)

Would you have a suggestion to solve this optimisation bottleneck ? Is there any raw API that may achieve this feature in a single call ?
Otherwise, would it be possible for get_objects to return a PdfText-like object instead of a plain PdfObject for text objects? Similarly to the PdfImage returned for images, which offers an extract method, a PdfText could expose a get_text method.

Thank you in advance,

Yui 😋

mara004 · 2024-07-30T15:55:18Z

mara004
Jul 30, 2024
Maintainer

Thanks for the elaborate report! I only gave this a cursory glance for now, but here are my first thoughts anyway:

Otherwise, would it be possible for get_objects to return a PdfText-like object instead of a plain PdfObject for text objects? Similarly to the PdfImage returned for images, which offers an extract method, a PdfText could expose a get_text method.

There is FPDFTextObj_GetText(), which could be used to implement the PdfTextObj.extract() you are proposing; however, I'm not sure if that would truly be more performant than calling get_text_bounded() if you need the position anyways.
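
As a rough illustration of what such a helper might look like, here is a minimal sketch built on the raw FPDFTextObj_GetText() function. It assumes the behaviour documented in pdfium's public header: the function returns the required size in bytes (including the trailing NUL) and writes UTF-16LE data into the buffer. The function name `textobj_get_text` is hypothetical, and the `raw` argument stands for the `pypdfium2.raw` module, passed in so that the sketch does not import pypdfium2 itself.

```python
import ctypes


def textobj_get_text(obj, textpage, raw):
    """Return the text of one text page object (hypothetical helper).

    `obj` and `textpage` are the handles of the text object and of the
    text page it belongs to; `raw` is assumed to be `pypdfium2.raw`.
    """
    # First call with no buffer: pdfium reports the size it needs, in bytes.
    n_bytes = raw.FPDFTextObj_GetText(obj, textpage, None, 0)
    if n_bytes <= 2:
        return ""  # nothing but the NUL terminator, or an error (0)

    # Second call fills a buffer of exactly that size.
    buffer = ctypes.create_string_buffer(n_bytes)
    buffer_ptr = ctypes.cast(buffer, ctypes.POINTER(raw.FPDF_WCHAR))
    raw.FPDFTextObj_GetText(obj, textpage, buffer_ptr, n_bytes)

    # Drop the 2-byte NUL terminator before decoding.
    return buffer.raw[:n_bytes - 2].decode("utf-16-le")
```

A call would then look like `textobj_get_text(obj.raw, textpage.raw, pypdfium2.raw)`, assuming the helper objects expose their handles as `.raw`. Whether this is actually faster than get_text_bounded() would have to be measured.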

That said, I spotted in the above code that you are calling PdfPage.get_{width,height}() in the objects loop, which adds quite a few unnecessary FFI calls. It should be done only once at page level. (None of pypdfium2's getters are cached, since they should adapt to modifications.)
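
A minimal sketch of that change, restricted to text objects: the page size is read once per page and reused for every object. The function name `extract_text_objects` is hypothetical; `page` is expected to behave like a PdfPage and `text_type` to be `pypdfium2.raw.FPDF_PAGEOBJ_TEXT`, both passed in so that the sketch stays independent of the import.

```python
def extract_text_objects(page, text_type):
    """Collect text and flipped boxes for all text objects of one page."""
    # Page-level getters: two FFI calls per page instead of two per object.
    page_w, page_h = int(page.get_width()), int(page.get_height())
    textpage = page.get_textpage()

    paragraphs = []
    for obj in page.get_objects(filter=[text_type]):
        left, bottom, right, top = obj.get_pos()
        paragraphs.append({
            'text': textpage.get_text_bounded(left, bottom, right, top),
            # Flip the y axis: PDF coordinates start at the bottom-left,
            # the output box uses a top-left origin as in the original code.
            'box': [int(left), page_h - int(top), int(right), page_h - int(bottom)],
            'page_w': page_w,
            'page_h': page_h,
        })
    return paragraphs
```

This removes two getter calls per object but still issues one get_text_bounded call per object, so it reduces the overhead without addressing the main bottleneck.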

FWIW, I'm not sure if the text objects provide a decent grouping, or how this compares to the PdfTextPage.{count_rects,get_rect}() APIs?
Also, are you aware of VikParuchuri's pdftext? It uses pypdfium2 to get the char info, and then does layout analysis to group the chars in blocks, lines and spans. This might give better results than pdfium's rectangles.
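
For comparison, the PdfTextPage rectangle APIs mentioned above could be used roughly as follows. This is a sketch rather than a recommendation: the function name `extract_rect_texts` is hypothetical, `textpage` is expected to behave like a PdfTextPage, and get_rect() is only called after count_rects(), as pdfium requires.

```python
def extract_rect_texts(textpage):
    """Return one {'text', 'box'} dict per text rectangle found by pdfium."""
    n_rects = textpage.count_rects()  # must be called before get_rect()
    results = []
    for index in range(n_rects):
        left, bottom, right, top = textpage.get_rect(index)
        results.append({
            'text': textpage.get_text_bounded(left, bottom, right, top),
            'box': (left, bottom, right, top),  # PDF canvas units, bottom-left origin
        })
    return results
```

Like the page object approach, this still needs one get_text_bounded call per rectangle, so any difference would come from how pdfium groups the text, not from fewer calls.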

yui-mhcp · 2024-07-30T17:17:22Z

yui-mhcp
Jul 30, 2024
Author

Thank you for your fast reply !

That said, I spotted in the above code that you are calling PdfPage.get_{width,height}() in the objects loop, which adds quite a few unnecessary FFI calls. It should be done only once at page level. (None of pypdfium2's getters are cached, since they should adapt to modifications.)

Indeed, I have moved this line up into the outer (page) loop so that it runs only once per page.

Also, are you aware of VikParuchuri's pdftext? It uses pypdfium2 to get the char info, and then does layout analysis to group the chars in blocks, lines and spans. This might give better results than pdfium's rectangles.

I was not aware of this library. Based on a simple test, it seems to be much slower than my code (over 1 s for the same document), closer to the performance of pdfminer.six.

FWIW, I'm not sure if the text objects provide a decent grouping, or how this compares to the PdfTextPage.{count_rects,get_rect}() APIs?

I have not tested this API yet (I will try it in the next few days), but the grouping order is quite accurate based on the few documents I tried as it correctly returns lines with their position. I have a post-processing code that handles the grouping in paragraphs / blocks (not applied on the provided code above). This is the reason why I need both the text and the box position.

Yui 😋

1 reply

mara004 Jul 31, 2024
Maintainer

I was not aware of this library. Based on a simple test, it seems to be much slower than my code (over 1 s for the same document), closer to the performance of pdfminer.six.

Ah, yes, supposedly this will be due to pdftext operating on char level, which means more pdfium calls, and more grouping to do.

the grouping order is quite accurate based on the few documents I tried as it correctly returns lines with their position.

If the pageobjects provide the right info for you, great!
The thing is, pdfium is a large library, and I don't have embedder experience with all the APIs myself, so I wasn't sure how the different text layout strategies compare.

I have a post-processing code that handles the grouping in paragraphs / blocks (not applied on the provided code above)

Nice. Out of interest, will that become open-source?

yui-mhcp · 2024-08-01T15:18:21Z

yui-mhcp
Aug 1, 2024
Author

Ah, yes, supposedly this will be due to pdftext operating on char level, which means more pdfium calls, and more grouping to do.

Yes, they also seem to use a DecisionTreeClassifier somewhere, but I do not know why/where.

Nice. Out of interest, will that become open-source?

I just made an update of all my projects ! You can check this directory that contains the processing methods :

The combination file comes from my OCR code to combine bounding boxes
The post_processing file contains utilities to detect bi-columns documents, combine words in lines, then lines in paragraphs
The pypdfium2_parser basically contains the above code
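
To give an idea of what grouping lines into paragraphs can involve, here is a deliberately simple sketch based on the vertical gap between consecutive lines. It is not the code from the post_processing file: the function name, the `max_gap_ratio` threshold and the single-column assumption are all hypothetical. Boxes are expected as [left, top, right, bottom] with a top-left origin, as produced by the parsing code above.

```python
def group_lines_into_paragraphs(lines, max_gap_ratio=0.5):
    """Merge vertically adjacent lines into paragraphs (single column only)."""
    paragraphs = []
    # Process lines from top to bottom.
    for line in sorted(lines, key=lambda item: item['box'][1]):
        left, top, right, bottom = line['box']
        height = bottom - top
        if paragraphs:
            prev = paragraphs[-1]
            gap = top - prev['box'][3]  # distance to the previous bottom edge
            if gap <= max_gap_ratio * height:
                # Close enough: extend the current paragraph and its box.
                prev['text'] += '\n' + line['text']
                p_left, p_top, p_right, p_bottom = prev['box']
                prev['box'] = [min(p_left, left), p_top,
                               max(p_right, right), max(p_bottom, bottom)]
                continue
        paragraphs.append({'text': line['text'], 'box': [left, top, right, bottom]})
    return paragraphs
```

A real implementation also has to deal with two-column layouts, which is presumably why the column detection mentioned above comes first.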

Actually, the output of the code is really similar to the output of page.get_textpage().get_text_bounded(), except that paragraphs are properly grouped together instead of having all the text in one block. Nonetheless, this code is still experimental and has only been tested on a small set of documents.

Yui 😋
