Large files are parsed incorrectly #357

ePZuz · 2020-10-19T07:43:36Z

I have a problem. The parser handles large files incorrectly. What could be the cause? Small files work fine.

k00ni · 2020-10-19T07:53:45Z

There is no way we can answer that if you don't provide more information, like:

- the PDF you used (please upload it here; we must be able to add it to our test suite, so it has to be free of charge and with no obligations)
- size of the PDF
- content of the PDF (many images?)
- what do you mean by "incorrectly"?

etc.

As far as I know, the parser still has some problems with big PDFs, but I don't know at which size the performance declines.

EwertonDutra · 2021-04-07T18:21:58Z

Hi,
I ran into this as well.
Really big files give a read error.

PDF used: https://mega.nz/file/i9kiRAZb#tR07OYUvFTirD-9PDUToYXHqSY6EAmZmTRNaGQ_OYf8
Size: 152 MB (159.862.426 bytes)
content of the PDF (many images?): no
what do you mean with "incorrectly"? Read error

Error:
Fatal error: Allowed memory size of 536870912 bytes exhausted (tried to allocate 145134936 bytes) in C:\laragon\www\Leitor\vendor\smalot\pdfparser\src\Smalot\PdfParser\RawData\RawDataParser.php on line 714

Note that the server is a local machine (dev).
It is not weak :)
But I believe something goes wrong partway through, because the reading takes longer and longer ... and eventually it fails with this error.

k00ni · 2021-04-08T06:18:35Z

The problem might be the same as in #104: the implementation doesn't free resources (properly) during/after reading a file. I have had no time to dive into this problem lately. Any hints are welcome.
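
One way to gather such hints is to record how much memory a parse call holds, and how much is still in use after its result has been dropped, for files of different sizes. The following is a minimal sketch using only PHP built-ins; the helper name `measureMemory` is hypothetical and not part of PDFParser, and it only helps for files that finish parsing, since a fatal memory-limit error ends the process before anything can be reported.

```php
<?php

/**
 * Hypothetical helper (not part of PDFParser): runs $work and reports
 * memory figures that hint at whether resources are freed afterwards.
 */
function measureMemory(callable $work): array
{
    gc_collect_cycles();
    $before = memory_get_usage();
    $start = microtime(true);

    $result = $work();

    $seconds = microtime(true) - $start;
    // Memory in use while the result is still referenced.
    $held = memory_get_usage() - $before;

    // Drop the result and collect reference cycles, then measure again.
    unset($result);
    gc_collect_cycles();
    $notReleased = memory_get_usage() - $before;

    return [
        'heldBytes' => $held,
        'notReleasedBytes' => $notReleased,
        // Peak for the whole process so far, not only for this call.
        'peakBytes' => memory_get_peak_usage(),
        'seconds' => $seconds,
        'memoryLimit' => ini_get('memory_limit'),
    ];
}
```

With the library installed, the callable could wrap a parse call such as `(new \Smalot\PdfParser\Parser())->parseFile($path)`, where `$path` is a placeholder for the PDF under test. A `notReleasedBytes` value that stays large after the result is dropped would be consistent with resources not being freed.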

derqas · 2023-05-26T13:46:30Z

I have a more or less similar problem.
Big file (70 pages, a newspaper, images only, 1 PNG per page). Each page has 0-10 links. From page 20 onward, $page->get('Annot') returns nothing ([], an empty array), but there are 244 annotations (= count($pdfDocument->getObjectsByType( 'Annot' ))) in the whole document (the first 35 pages), and only 82 in total when summed over the pages (array_sum( array_map(function($e){return count($e->get('Annot'));}, $pdfDocument->getPages() )) ).

k00ni · 2023-05-29T07:54:01Z

> $page->get('Annot') - nothing ([] - empty array), but there are 244 (=count($pdfDocument->getObjectsByType( 'Annot' ))) of them in the whole document (the first 35 pages), and there are 82 (array_sum( array_map(function($e){return count($e->get('Annot'));}, $pdfDocument->getPages() )) ) in total

Without code highlighting it's hard to read, but I would like to see if there is a bug in the routines concerning annotations.

@derqas can you please check the following: find a page which has annotations, call $page->get('Annot') for it and compare the number of results. Does it match? Be aware that you compared the result for a single page ($page->get('Annot')) with the result for the whole document ($pdfDocument->getObjectsByType( 'Annot' )).
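
The per-page versus whole-document comparison described above can be sketched as follows. This is a minimal sketch, not a routine from the library: the helper name `compareAnnotationCounts` is hypothetical, and it only assumes that the document object offers `getPages()` and `getObjectsByType('Annot')` and that each page offers `get('Annot')`, as used earlier in this thread. What `get('Annot')` returns for a page without annotations is an assumption here, so any non-countable value is counted as zero.

```php
<?php

/**
 * Hypothetical helper: collects annotation counts per page and for the
 * whole document so that the two figures can be compared side by side.
 */
function compareAnnotationCounts(object $document): array
{
    $perPage = [];
    $pageNumber = 0;

    foreach ($document->getPages() as $page) {
        $pageNumber++;
        $annotations = $page->get('Annot');
        // Assumption: a page without annotations may yield a non-countable value.
        $perPage[$pageNumber] = is_countable($annotations) ? count($annotations) : 0;
    }

    return [
        'perPage' => $perPage,
        'sumOverPages' => array_sum($perPage),
        'documentTotal' => count($document->getObjectsByType('Annot')),
        // 1-based numbers of pages for which no annotations were returned.
        'pagesWithoutAnnotations' => array_keys($perPage, 0, true),
    ];
}
```

Passing in the object returned by the parser and printing `perPage` next to `documentTotal` would show on which pages the counts start to diverge.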

All in all, this issue should be closed and the posts of @EwertonDutra and @derqas should be handled separately. To save everyone's time, I will keep it open and see if we can solve these. But no evidence was presented which shows that PDFParser parsing fails when a PDF is "too large" (whatever large even means). @EwertonDutra's case seems more like a reading error than a size problem.

k00ni added the needs more info and missing or incomplete functionality labels Oct 19, 2020

k00ni changed the title ~~Does not parse large files~~ Large files are parsed incorrectly Oct 19, 2020

b3n-l mentioned this issue Oct 29, 2021

Crashing PHP process through memory exhaustion when decompressing images. #475

Closed

k00ni added the stale needs decision label May 29, 2023

k00ni removed the stale needs decision label Feb 9, 2024

k00ni closed this as completed Feb 9, 2024
