-
Thanks for the elaborate report! I only gave this a cursory glance for now, but here are my first thoughts anyway:
There is That said, I spotted in the above code that you are calling FWIW, I'm not sure if the text objects provide a decent grouping, or how this compares to the
-
Thank you for your fast reply!
Indeed, I moved this line into the loop above so that it executes only once.
I was not aware of this library. Based on a simple test, its performance seems really slow compared to my code (above 1 s for the same document), closer to that of pdfminer.six.
I have not tested this API yet; I will try it in the next few days. The grouping order is quite accurate on the few documents I tried, as it correctly returns lines with their positions. I have post-processing code that groups them into paragraphs / blocks (not applied in the code provided above). This is why I need both the text and the box position.
Yui 😋
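To make the paragraph / block grouping mentioned above more concrete, here is a minimal sketch of one possible heuristic. It is not the post-processing code referred to in this reply (that code is not shown here): the function name `group_lines_into_blocks` and the `max_gap` threshold are made up for illustration, and boxes are assumed to be `(left, bottom, right, top)` tuples in PDF coordinates, with the origin at the bottom-left.

```python
from typing import Dict, List


def group_lines_into_blocks(lines: List[Dict], max_gap: float = 4.0) -> List[Dict]:
    """Merge vertically adjacent lines into blocks.

    `lines` uses the [{'text': ..., 'box': ...}, ...] format from the thread.
    `max_gap` (in page units) is a hypothetical threshold to tune per document.
    """
    # PDF origin is bottom-left, so a larger `top` means higher on the page.
    ordered = sorted(lines, key=lambda ln: -ln["box"][3])
    blocks: List[Dict] = []
    for line in ordered:
        left, bottom, right, top = line["box"]
        if blocks:
            b_left, b_bottom, b_right, b_top = blocks[-1]["box"]
            # Small vertical gap between the block and this line: same block.
            if b_bottom - top <= max_gap:
                blocks[-1]["text"] += "\n" + line["text"]
                blocks[-1]["box"] = (
                    min(b_left, left),
                    min(b_bottom, bottom),
                    max(b_right, right),
                    max(b_top, top),
                )
                continue
        blocks.append({"text": line["text"], "box": (left, bottom, right, top)})
    return blocks
```

A real document would likely also need horizontal checks (columns, indentation), which this sketch deliberately ignores.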
-
Yes, they also seem to use a
I just updated all my projects! You can check this directory, which contains the processing methods:
Actually, the output of the code is really similar to the output of
Yui 😋
-
Hello! 😋
First of all, thank you for your nice work: it is a really powerful library compared to other PDF processing tools! (I have tested `pypdf` and `pdfminer.six`.)
I am wondering whether it is possible to access both text and positions in a single call. My objective is to extract the text together with its position in order to post-process the result. The expected output format would be something like:
`[{'text': text1, 'box': box1}, ...]`
To get this result, I combined the `get_objects` and `get_text_bounded` functions to extract the text at each object position (the code is available below). However, compared with a single call to `get_text_bounded`, which extracts all the text without any position information, performance is much worse, as shown in the benchmark below:
Both extraction methods were applied to the same 46-page document, with my `loggers` module to track performance. It is available in this repository if you want to reproduce the code below.
The major bottleneck is the large number of calls (4907) to `get_text_bounded` needed to extract each text individually.
Would you have a suggestion to solve this optimisation bottleneck? Is there any raw API that could achieve this in a single call?
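For readers following along, the per-object approach described above could be sketched roughly as follows. This is not the original code from this post: the helper name `extract_text_with_boxes` is made up for illustration, and it assumes each text object exposes its bounding box through a `get_pos()` method returning `(left, bottom, right, top)`, which may differ between library versions. It mainly shows why the cost scales with the number of objects: one `get_text_bounded` call per box.

```python
from typing import Any, Dict, Iterable, List


def extract_text_with_boxes(objects: Iterable[Any], textpage: Any) -> List[Dict[str, Any]]:
    """Return [{'text': ..., 'box': ...}, ...] for the given page objects.

    `objects` is assumed to be the text objects of one page (e.g. what
    `get_objects` yields, filtered to text), and `textpage` the matching
    text page. `get_pos()` is an assumed accessor for the bounding box.
    """
    results = []
    for obj in objects:
        left, bottom, right, top = obj.get_pos()  # assumption: (l, b, r, t)
        # One bounded extraction per object: with thousands of objects,
        # this repeated call is what dominates the runtime.
        text = textpage.get_text_bounded(left=left, bottom=bottom, right=right, top=top)
        if text.strip():  # skip objects that yield only whitespace
            results.append({"text": text, "box": (left, bottom, right, top)})
    return results
```

The function is written against duck-typed arguments so that it can be exercised with simple stand-in objects, without needing a PDF at hand.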
Otherwise, would it be possible to modify the `PdfObject` returned by `get_objects` to return a `PdfText`-like object instead (in the case of text)? Similarly to the `PdfImage` that is returned in the case of an image and offers the `extract` feature, `PdfText` could expose a `get_text` method.
Thank you in advance,
Yui 😋
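As a rough illustration of the `PdfText`-like object proposed above, a user-side wrapper might look like the sketch below. Everything here is hypothetical: `PdfTextSketch` is not part of the library, and the `get_pos()` accessor on the wrapped object is an assumption about how the bounding box is obtained.

```python
from typing import Any, Tuple


class PdfTextSketch:
    """Hypothetical wrapper pairing a text object with its text page."""

    def __init__(self, obj: Any, textpage: Any):
        self.obj = obj
        self.textpage = textpage

    def get_box(self) -> Tuple[float, float, float, float]:
        # Assumption: the wrapped object exposes its bounds via get_pos().
        return tuple(self.obj.get_pos())

    def get_text(self) -> str:
        # Delegates to a bounded extraction restricted to this object's box.
        left, bottom, right, top = self.get_box()
        return self.textpage.get_text_bounded(left=left, bottom=bottom, right=right, top=top)
```

Note that a wrapper like this only tidies up the calling code: it still performs one bounded extraction per object, so on its own it would not remove the bottleneck described in the post.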