Large files are parsed incorrectly #357

Closed
ePZuz opened this issue Oct 19, 2020 · 5 comments
Labels
missing or incomplete functionality (for something which is not a bug, but more like an incomplete feature), needs more info

Comments

ePZuz commented Oct 19, 2020

I have a problem: the parser handles large files incorrectly. What could be the cause? Small files are parsed fine.

Collaborator

k00ni commented Oct 19, 2020

There is no way we can answer that unless you provide more information, such as:

  • the PDF you used (please upload it here; we must be able to add it to our test suite, so it has to be free of charge and with no obligations)
  • the size of the PDF
  • the content of the PDF (many images?)
  • what you mean by "incorrectly"

etc.

As far as I know, the parser still has some problems with big PDFs, but I don't know at which size performance starts to decline.

k00ni added the "needs more info" and "missing or incomplete functionality" labels Oct 19, 2020
k00ni changed the title from "Does not parse large files" to "Large files are parsed incorrectly" Oct 19, 2020
@EwertonDutra

Hi,
I ran into this as well.
Really big files produce a read error.

PDF used: https://mega.nz/file/i9kiRAZb#tR07OYUvFTirD-9PDUToYXHqSY6EAmZmTRNaGQ_OYf8
Size: 152 MB (159,862,426 bytes)
Content of the PDF (many images?): no
What do you mean by "incorrectly"? A read error

Error:
Fatal error: Allowed memory size of 536870912 bytes exhausted (tried to allocate 145134936 bytes) in C:\laragon\www\Leitor\vendor\smalot\pdfparser\src\Smalot\PdfParser\RawData\RawDataParser.php on line 714

Keep in mind that the server is a local (dev) machine, and it is not a weak one :)
Still, I believe something goes wrong partway through, because the reading takes longer and longer until it eventually fails with this error.
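To put the numbers in the error message into perspective, here is a minimal sketch that converts them to MiB. It uses only figures quoted above; the helper name `bytesToMiB` is made up for this illustration and is not part of PDFParser.

```php
<?php
// Hypothetical helper: convert a byte count to mebibytes, rounded to two decimals.
function bytesToMiB(int $bytes): float
{
    return round($bytes / (1024 * 1024), 2);
}

// Figures copied from the error message and the reported file size.
$memoryLimit      = 536870912; // "Allowed memory size"
$failedAllocation = 145134936; // "tried to allocate"
$fileSize         = 159862426; // size of the PDF on disk

printf("memory limit:      %.2f MiB\n", bytesToMiB($memoryLimit));      // 512.00 MiB
printf("failed allocation: %.2f MiB\n", bytesToMiB($failedAllocation)); // 138.41 MiB
printf("file size:         %.2f MiB\n", bytesToMiB($fileSize));         // 152.46 MiB
```

The single allocation that failed is almost as large as the file itself, which would be consistent with a large part of the document being held in memory at once, though the error message alone does not prove that.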

Collaborator

k00ni commented Apr 8, 2021

The problem might be the same as in #104: the implementation doesn't free resources (properly) during or after reading a file. I have had no time to dive into this problem lately. Any hints are welcome.


derqas commented May 26, 2023

I have a more or less similar problem.
It is a big file (70 pages, a newspaper, images only, one PNG per page). Each page has 0-10 links. From page 20 onward, $page->get('Annot') returns nothing ([], an empty array), but there are 244 annotations (=count($pdfDocument->getObjectsByType( 'Annot' ))) in the whole document (the first 35 pages), while the per-page sum (array_sum( array_map(function($e){return count($e->get('Annot'));}, $pdfDocument->getPages() )) ) is only 82 in total.

Collaborator

k00ni commented May 29, 2023

$page->get('Annot') - nothing ([] - empty array), but there are 244 (=count($pdfDocument->getObjectsByType( 'Annot' ))) of them in the whole document (the first 35 pages), and there are 82 (array_sum( array_map(function($e){return count($e->get('Annot'));}, $pdfDocument->getPages() )) ) in total

Without code highlighting it's hard to read, but I would like to see if there is a bug in the routines concerning annotations.

@derqas can you please check the following: find a page which has annotations, call $page->get('Annot') for it and compare the number of results. Does it match? Be aware that you compared the result for a single page ($page->get('Annot')) with the result for the whole document ($pdfDocument->getObjectsByType( 'Annot' )).
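One way to make that comparison easier to read is sketched below. It is library-free: `compareAnnotationCounts` is a hypothetical helper, not part of PDFParser, and the sample counts are invented. In practice the per-page counts and the document-wide total would come from the calls quoted above.

```php
<?php
/**
 * Hypothetical helper: compare per-page annotation counts with a
 * document-wide count.
 *
 * @param int[] $perPageCounts page number => annotations found via the page
 * @param int   $documentTotal annotations found in the whole document
 */
function compareAnnotationCounts(array $perPageCounts, int $documentTotal): array
{
    $perPageSum = array_sum($perPageCounts);

    // Page numbers for which the per-page lookup returned nothing.
    $emptyPages = array_keys(
        array_filter($perPageCounts, fn (int $n): bool => $n === 0)
    );

    return [
        'perPageSum'    => $perPageSum,
        'documentTotal' => $documentTotal,
        'unaccounted'   => $documentTotal - $perPageSum,
        'emptyPages'    => $emptyPages,
    ];
}

// Invented sample data, for illustration only.
$report = compareAnnotationCounts([1 => 3, 2 => 0, 3 => 5], 10);
print_r($report);
```

With the figures reported in this thread (244 in the whole document, 82 summed over the pages), `unaccounted` would be 162. The `emptyPages` list would show which pages to inspect by hand.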

All in all, this issue should be closed and the posts of @EwertonDutra and @derqas handled separately. To save everyone's time, I will keep it open and see if we can solve these. But no evidence has been presented which shows that PDFParser fails to parse a PDF because it is "too large" (whatever large even means). @EwertonDutra's case looks more like a reading error than a size problem.

k00ni added the stale needs decision label May 29, 2023
k00ni removed the stale needs decision label Feb 9, 2024
k00ni closed this as completed Feb 9, 2024
4 participants