Memory Leak #104
It's really hard to fix such an issue.
Where do I use `__destruct` in this library? Just in Parser.php, or do I have to read the full library and make changes?
I am having the same problem. Trying to parse multiple PDF files within the same script ended up with a huge memory leak. With PHP 7 (or PHP >= 5.3), you can use gc_collect_cycles() to force collection of cyclic references between parses.
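A minimal sketch of that workaround (the directory, loop, and memory printout are my assumptions, not from the original comment):

```php
<?php

require 'vendor/autoload.php';

$files = glob('/path/to/pdfs/*.pdf'); // hypothetical directory

foreach ($files as $file) {
    $parser = new \Smalot\PdfParser\Parser();
    $pdf = $parser->parseFile($file);
    $text = $pdf->getText();

    // Drop our references, then force collection of cyclic garbage;
    // the cycles are what keep memory growing between parses.
    unset($pdf, $parser);
    gc_collect_cycles();

    printf("%s: %.1f MB in use\n", basename($file), memory_get_usage(true) / 1048576);
}
```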
Having the same issue. Any solution for this?
Thank you @citionzeno, your suggestion works for me.
Glad it helped. It works if you are trying to parse multiple small files. For one single large file, however, like @vishaldevrepublic's, I don't know what to do. Some deep work inside the library might be needed.
Where do I have to do that garbage collection in the code, please?
This suggestion only works for parsing multiple small files. In this case, you can call gc_collect_cycles() after each file is parsed.
Thanks for the reply. Can you tell me how I can parse a big file (60 MB) that has 90 pages?
This package does not appear to be a good solution for large files. See #169 also.
Can you tell me other solutions (in PHP) to use, please?
Hi, we have a memory leak with specific PDFs. If I open the affected PDF in macOS's Preview.app and export it there with "Export as PDF", the error is gone on the next try.
Can anyone provide a PDF file which causes this problem, please? It must be free of charge (and other obligations) and will become part of our test environment. An alternative would be the content of the
Is it related to #372?
@k00ni I have the same issue as @Jurek-Raben: there is a memory leak (memory exhausted error in Font.php, line 189) if I parse a file with Khmer (the official language of Cambodia) characters inside, or a certain file with a big map image inside (12 MB in size). Looking at the metadata, both were created by Microsoft Word 2016; if I re-save the PDFs with Preview, the parser works as expected. I tried the solution in #372, but that did not resolve the issue. Sadly, I also can't share the PDFs publicly.
Cannot share the PDF either, if it will be used publicly...
If you can spare the time, you could try to find a minimal example (a string) that triggers the problem. Please take the following code:

```php
$parser = new \Smalot\PdfParser\Parser();

// load PDF file content
$data = file_get_contents('/path/to/pdf');

// give PDF content to function and parse it
$pdf = $parser->parseContent($data); // <= should trigger the leak
```

If your PDF triggers the leak, try to reduce the content of `$data` until the leak disappears. Here is a good example of how that could look: #372 (comment)
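One rough way to do that reduction (a sketch; the halving strategy and output format are my own, not from the thread) is to parse ever-shorter prefixes of the raw data and watch where the memory jump disappears:

```php
<?php

require 'vendor/autoload.php';

$data = file_get_contents('/path/to/pdf');

// Parse progressively shorter prefixes of the raw PDF data; the step
// where the memory jump vanishes brackets the offending object.
// Truncated PDFs will often throw - that's fine here, since we only
// care about memory behaviour.
for ($len = strlen($data); $len > 0; $len = intdiv($len, 2)) {
    $before = memory_get_usage(true);
    try {
        (new \Smalot\PdfParser\Parser())->parseContent(substr($data, 0, $len));
    } catch (\Exception $e) {
        // ignore parse errors caused by truncation
    }
    printf("%8d bytes => +%.1f MB\n", $len, (memory_get_usage(true) - $before) / 1048576);
    gc_collect_cycles();
}
```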
In my scripts I parse a lot of PDFs, and after a while an out-of-memory error occurs.
You are right, I remember the post from @ghost (#104 (comment)). He mentioned something about memory which is never freed. I just want to make sure there is no infinite loop or something similar causing this problem. @Jurek-Raben said:

@yapsr Do you read your PDFs in one script run or in multiple runs?
I try to read multiple small PDFs (200 kB in size) with a background image (a 4961x7016 px PNG) and some text in a single script, but the script always crashes after a seemingly random number of PDF reads in vendor/tecnickcom/tcpdf/include/tcpdf_filters.php:357 with this message:

```
exception 'Symfony\Component\Debug\Exception\FatalErrorException' with message 'Allowed memory size of 536870912 bytes exhausted (tried to allocate 116907502 bytes)' in /home/user/project/vendor/tecnickcom/tcpdf/include/tcpdf_filters.php:357
```

There I find this code:
So it might have something to do with gzuncompress(). The server runs PHP 5.6.27-1+deb.sury.org~trusty+1 (cli). Here is the relevant part of my log files:
The script seems to add about 111 MB of memory usage per PDF file. When converting the PNG background image to BMP format manually, it turns out to be about 104 MB in size, so that does look related to the memory leak. Hope this helps locate the problem.
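That size match makes sense: a FlateDecode image stream is tiny on disk, but gzuncompress() must allocate the full decoded size at once. A self-contained demonstration of the effect (the sizes are illustrative, not taken from the PDFs above):

```php
<?php

// A highly repetitive buffer (like flat image data) compresses extremely well...
$raw = str_repeat("\0", 32 * 1048576); // 32 MB of zero bytes
$compressed = gzcompress($raw, 9);
printf("compressed size: %.1f kB\n", strlen($compressed) / 1024);
unset($raw);

// ...but decompressing it allocates the full 32 MB again in one go.
$expanded = gzuncompress($compressed);
printf("expanded size:   %.1f MB\n", strlen($expanded) / 1048576);
```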
Which version of PDFParser do you use? You mentioned tcpdf.
Oops. We were using pdfparser v0.9.25. However, after removing tcpdf and updating to v0.18.1, I still get the memory usage error. Adding the gc_collect_cycles() workaround prevents the memory limit exception.
I managed to create a PDF file to reproduce the issue: document_with_text_and_png_image.pdf. Memory usage increases by more than 100 MB per loop iteration.
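A minimal sketch of the kind of reproduction loop described (the iteration count and output are assumptions):

```php
<?php

require 'vendor/autoload.php';

// Parse the same PDF repeatedly; on affected files, memory grows by
// roughly 100 MB per iteration instead of staying flat.
for ($i = 1; $i <= 10; $i++) {
    $parser = new \Smalot\PdfParser\Parser();
    $parser->parseFile('document_with_text_and_png_image.pdf');
    printf("iteration %2d: %.1f MB in use\n", $i, memory_get_usage(true) / 1048576);
}
```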
I've run the script 10 times using Blackfire, but I don't really know how to see where the leak is: https://blackfire.io/profiles/806c9126-a571-4472-8af6-664a8e34a5b7/graph (might only be available for 24h, I think). But yeah, it leaks a lot.
If it is indeed the image, any idea how to work around it? As there's no OCR functionality built into the library, it could simply discard images completely, but I'm not deep enough into the core of the library to see how and where that could be managed, or if it is a feasible approach at all.
I am having a memory leak problem analyzing a single PDF. Of the hundreds of PDFs I have, some caused a memory leak that explodes past the memory limit (even small PDFs of less than 1 MB). In my case this happens because, internally, some code enters an infinite loop. The infinite loop happens when the method So I am assuming we are talking about two problems.
Unfortunately, I can't provide any of my problematic PDFs because they contain private data, but I thought this info could help the devs in some way.
@llagerlof Are you willing to dive a little bit into the code to get us some debug information? Please paste the parameter values of the first Thanks in advance.
I can provide a file: https://baogaocos.seedsufe.com/2018/08/12/doc_1534042504420.pdf
I found some time to look into the parsing of this file, and it does indeed seem to simply be the raw image data taking up memory. If I change the Image instantiation to

```php
return new Image($document, $header, '' /* instead of $content */, $config);
```

the memory increases only by 4-5 MB on each iteration. So the question is: do we need the image data at all? We could implement an option to turn image parsing off or on, and in that case I'd tend to suggest turning it off by default, which may on the other hand break existing implementations that do whatever you'd want to do with that image data (which doesn't even seem to be actual raw image data that you can dump into a file?!). Regarding the other issue with
Just a quick follow-up regarding
Using an option and setting it to
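If the option discussed here is the one exposed on the Config class as Config::setRetainImageContent() (an assumption on my part; check your version's Config class), disabling image content retention would look like this:

```php
<?php

require 'vendor/autoload.php';

use Smalot\PdfParser\Config;
use Smalot\PdfParser\Parser;

$config = new Config();
// Assumed API: don't keep raw image bytes in memory while parsing,
// i.e. the '' instead of $content change from the comment above.
$config->setRetainImageContent(false);

$parser = new Parser([], $config);
$pdf = $parser->parseFile('document_with_text_and_png_image.pdf');
echo $pdf->getText();
```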
Just an update from using v1.1.0: I can replicate this running PHP 8.0.11 (latest stable), even with image parsing disabled. It's vulnerable to compression bombs; in the example posted previously, the image expands from 0.1 MB to approx. 105 MB. The problem is that the whole stream is decompressed before any size check. xpdf, for example, guards against compression bombs in C++; the check is lifted directly from the source code in Stream.cc:
I've tested this in FilterHelper by adding a max limit to the number of bytes. I've attached a PDF that's 400 kB and will cause the demo to fail. This is the same PDF posted by @yapsr, but with duplicate pages.
FYI, this is still a problem; I'm not sure what it is in the PDF that makes the memory usage skyrocket. You can set that limit yourself when creating the parser (`$parser = new Parser();`); the limit is arg 2 here: https://www.php.net/manual/en/function.gzuncompress.php
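For reference, a sketch of what that second argument does in isolation, plus the assumed parser-side equivalent (Config::setDecodeMemoryLimit() is my assumption about the API; verify against your version's Config class; the byte limits are arbitrary examples):

```php
<?php

require 'vendor/autoload.php';

use Smalot\PdfParser\Config;
use Smalot\PdfParser\Parser;

// Plain PHP: gzuncompress()'s second argument caps the decoded size,
// so a compression bomb makes the call fail instead of allocating GBs.
$bomb = gzcompress(str_repeat("\0", 50 * 1048576)); // ~50 MB payload, tiny compressed
var_dump(@gzuncompress($bomb, 10 * 1048576));       // bool(false): refused at 10 MB

// Assumed parser-side equivalent: route the same cap through Config.
$config = new Config();
$config->setDecodeMemoryLimit(100 * 1048576); // assumption: caps FlateDecode output
$parser = new Parser([], $config);
```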
Hi, I am using this library to get PDF text. It works well, but there is a memory leak. I am using it as below:
Memory = 63226 KB
How do I fix this? How can I release the memory which is used?
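The original snippet was not preserved; here is a minimal sketch of the usage pattern described, with the file path and memory printout as assumptions:

```php
<?php

require 'vendor/autoload.php';

$parser = new \Smalot\PdfParser\Parser();
$pdf = $parser->parseFile('document.pdf'); // hypothetical path
$text = $pdf->getText();

// The kind of figure reported above ("Memory = 63226 KB"):
printf("Memory = %d KB\n", memory_get_usage(true) / 1024);
```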