Memory Leak #104

Open
harinderbachhal opened this issue May 18, 2016 · 36 comments

harinderbachhal commented May 18, 2016

Hi, I am using this library to extract PDF text. It works well, but there is a memory leak. I am using it as follows:

$parser = new \Smalot\PdfParser\Parser();
$pdf = $parser->parseContent($getdata[0]);
$pages = $pdf->getPages();
foreach ($pages as $page) {
    echo $page->getText();
}

unset($parser);
unset($pdf);
unset($page);
unset($pages);

echo '<h1>Memory =', round(memory_get_usage() / 1000), ' KB</h1><br>';

Memory =63226 KB

How do I fix this? How can I release the memory that is used?

smalot (Owner) commented May 18, 2016

It's really hard to fix such an issue.
There are many circular references between objects which prevent the memory from being garbage-collected.
Usually we use "__destruct" to break such behavior by setting properties to "null" or unsetting them, to help the garbage collector.
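
For illustration, a minimal sketch of that idea (not the library's actual code; all class and property names are made up): an outer object that is not part of the cycle clears the back-references in its __destruct(), so plain reference counting can reclaim the memory.

class Document
{
    /** @var Page[] */
    public $pages = [];
}

class Page
{
    /** @var Document|null back-reference that creates the cycle */
    public $document;

    public function __construct(Document $document)
    {
        $this->document = $document;
        $document->pages[] = $this;
    }
}

class DocumentHandle
{
    /** @var Document */
    public $document;

    public function __construct(Document $document)
    {
        $this->document = $document;
    }

    public function __destruct()
    {
        // Null the back-references so the objects can be freed by
        // reference counting alone, without waiting for gc_collect_cycles().
        foreach ($this->document->pages as $page) {
            $page->document = null;
        }
        $this->document->pages = [];
    }
}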

@harinderbachhal (Author)

Where do I use __destruct in this library? Only in Parser.php, or do I have to read the full library and make changes throughout?

ghost commented Aug 23, 2016

I am having the same problem. Trying to parse multiple PDF files within the same script ended up with a huge memory leak. With PHP >= 5.3 (including PHP 7), you can call gc_collect_cycles() to run the garbage collector and free objects that are only kept alive by circular references. Memory usage goes back to normal for me after this call.
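
For reference, a minimal sketch of this workaround when parsing several files in one script (file names are placeholders):

require 'vendor/autoload.php';

foreach (['a.pdf', 'b.pdf', 'c.pdf'] as $file) {
    $parser = new \Smalot\PdfParser\Parser();
    $pdf = $parser->parseFile($file);

    foreach ($pdf->getPages() as $page) {
        echo $page->getText();
    }

    // Drop our references, then collect the objects that are only kept
    // alive by circular references between them.
    unset($pdf, $parser);
    gc_collect_cycles();

    echo 'Memory: ', round(memory_get_usage() / 1024), " KB\n";
}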

@vishaldevrepublic

Having the same issue. Is there any solution for this?

$parser = new \Smalot\PdfParser\Parser();
$pdf = $parser->parseFile(asset($pdf_file)); // Stuck with "Allowed memory size of 1073741824 bytes exhausted (tried to allocate 26811 bytes)" here.
$pages = $pdf->getPages();

yapsr commented Sep 19, 2016

Thank you @citionzeno, your suggestion works for me.

ghost commented Sep 19, 2016

Glad it helped. It works if you are trying to parse multiple small files. For a single large file, however, like in @vishaldevrepublic's case, I don't know what to do. Some deep work on the library might be needed.

nedvajz mentioned this issue May 21, 2018
@amineharoun

I am having the same problem. Trying to parse multiple PDF files within the same script ended up with a huge memory leak. With PHP >= 5.3 (including PHP 7), you can call gc_collect_cycles() to run the garbage collector and free objects that are only kept alive by circular references. Memory usage goes back to normal for me after this call.

Where do I have to do that garbage collection in the code, please?

fm89 commented Dec 20, 2019

This suggestion only works for parsing multiple small files. In that case, you can call gc_collect_cycles() after parsing each file and before parsing the next one. This trick, however, does not solve the case where you want to parse a single large file.

@amineharoun

Thanks for the reply. Can you tell me how I can parse a big file (60 MB) that has 90 pages?
My script crashes with a 503 error and CPU usage reaches 100%.

fm89 commented Dec 20, 2019

This package does not appear to be a good solution for large files. See #169 also.

amineharoun commented Dec 20, 2019

Can you suggest other (PHP) solutions to use, please?

k00ni added the bug, stale, and needs decision labels Oct 14, 2020
@Jurek-Raben

Hi, we have a memory leak with specific PDFs, for example:
Allowed memory size of 134217728 bytes exhausted (tried to allocate 2097160 bytes) in /var/www/vendor/smalot/pdfparser/src/Smalot/PdfParser/Font.php on line 197

If I open the affected PDF in macOS Preview.app and export it there with "Export as PDF", the error is gone on the next try.
I could not find the difference between the two PDFs so far. The exported one is even bigger, but it seems to lack any Adobe Acrobat metadata (in a hex editor view).

k00ni (Collaborator) commented Dec 30, 2020

Can anyone provide a PDF file which causes this problem, please? It must be free of charge (and other obligations) and will become part of our test environment.

An alternative would be the content of the $param you passed to $parser->parseContent($param).

k00ni (Collaborator) commented Dec 30, 2020

Is it related to #372?

nkoporec commented Jan 4, 2021

@k00ni I have the same issue as @Jurek-Raben: there is a memory leak (memory exhausted error in Font.php, line 189) if I parse a file that contains Khmer (the official language of Cambodia) characters, or a certain file with a big map image inside (12 MB). Judging by the metadata, both were created by Microsoft Word 2016; if I re-save the PDFs with Preview, the parser works as expected.

I tried the solution in #372, but that did not resolve the issue.

Sadly, I also can't share the PDFs publicly.

@Jurek-Raben

I cannot share the PDF either if it will be used publicly...

k00ni (Collaborator) commented Jan 7, 2021

If you can spare the time, you could try to find a minimal example (= string) that triggers the problem. Please use the following code:

$parser = new \Smalot\PdfParser\Parser();

// load PDF file content
$data = file_get_contents('/path/to/pdf');

// give PDF content to function and parse it
$pdf = $parser->parseContent($data); // <= should trigger the leak

If your PDF triggers the leak, try to reduce the content of $data as much as possible. Once you get it down to a reasonable length (that's up to you), post it here. We will use it in our tests to reproduce the problem.

Here is a good example how that could look: #372 (comment)

k00ni added the needs more info label and removed the stale and needs decision labels Jan 7, 2021
yapsr commented Jan 7, 2021

In my scripts, I parse a lot of PDFs, and after a while the out-of-memory error occurs.
If I continue the script starting with the latest PDF from the previous batch, the error occurs some PDFs later. So in my view, the error is not reproducible with a single PDF.

k00ni (Collaborator) commented Jan 7, 2021

You are right, I remember the post from @ghost (#104 (comment)). He mentioned memory which is never freed.

So in my view, the error is not reproducible with a single PDF.

I just want to make sure there is no infinite loop or something similar causing this problem.

@Jurek-Raben said:

Hi, we have a memory leak with specific PDFs, for example:
Allowed memory size of 134217728 bytes exhausted (tried to allocate 2097160 bytes) in /var/www/vendor/smalot/pdfparser/src/Smalot/PdfParser/Font.php on line 197

Font.php around line 197 seems fine. In my opinion, it could be caused either by an infinite loop (or recursion) or by memory which is used but never freed.

@yapsr Do you read your PDFs in one or multiple script runs?

yapsr commented Jan 7, 2021

@yapsr Do you read your PDFs in one or multiple script runs?

I try to read multiple small PDFs (200 kB in size) with a background image (4961x7016 px PNG) and some text in a single script, but the script always crashes after a seemingly random number of PDF reads in "vendor/tecnickcom/tcpdf/include/tcpdf_filters.php:357" with this message:

"exception 'Symfony\Component\Debug\Exception\FatalErrorException' with message 'Allowed memory size of 536870912 bytes exhausted (tried to allocate 116907502 bytes)' in /home/user/project/vendor/tecnickcom/tcpdf/include/tcpdf_filters.php:357"

There I find this code:


/**
 * FlateDecode
 * Decompresses data encoded using the zlib/deflate compression method, reproducing the original text or binary data.
 * @param $data (string) Data to decode.
 * @return Decoded data string.
 * @since 1.0.000 (2011-05-23)
 * @public static
 */
public static function decodeFilterFlateDecode($data) {
    // initialize string to return
    $decoded = @gzuncompress($data);
    if ($decoded === false) {
        self::Error('decodeFilterFlateDecode: invalid code');
    }
    return $decoded;
}

So it might have something to do with gzuncompress().

The server runs PHP 5.6.27-1+deb.sury.org~trusty+1 (cli)

Here is some relevant part of my log files:

[2020-12-11 11:11:01] production.DEBUG: ParsePDF::select() : Exporting 21 files... [] []
[2020-12-11 11:11:01] production.DEBUG: ParsePDF::handle() : handling file 12345664, using 12MB [] []
[2020-12-11 11:11:01] production.DEBUG: ParsePDF::handle() : handling file 12345674, using 126MB [] []
[2020-12-11 11:11:02] production.DEBUG: ParsePDF::handle() : handling file 12345669, using 237MB [] []
[2020-12-11 11:11:02] production.DEBUG: ParsePDF::handle() : handling file 12345684, using 349MB [] []
[2020-12-11 11:11:03] production.DEBUG: ParsePDF::handle() : handling file 12345696, using 349MB [] []
[2020-12-11 11:11:04] production.DEBUG: ParsePDF::handle() : handling file 12345665, using 460MB [] []
[2020-12-11 11:11:04] production.ERROR: exception 'Symfony\Component\Debug\Exception\FatalErrorException' with message 'Allowed memory size of 536870912 bytes exhausted (tried to allocate 116907502 bytes)'...

The script seems to add about 111 MB of memory usage per PDF file.

When converting the PNG background image to BMP format manually, it turns out to be about 104 MB in size, so that does look related to the memory leak.

I hope this helps to locate the problem.

k00ni (Collaborator) commented Jan 8, 2021

Which version of PDFParser do you use?

You mentioned vendor/tecnickcom/tcpdf/include/tcpdf_filters.php:357, but we removed TCPDF a few versions ago. Can you try again with our latest version, 0.18.0, please? I remember that I removed the @ in $decoded = @gzuncompress($data); to allow error reporting.

yapsr commented Jan 12, 2021

Oops. We were using pdfparser v0.9.25. However, after removing TCPDF and updating to v0.18.1, I still get the memory usage error. Adding the gc_collect_cycles() workaround prevents the memory-limit exception.

k00ni (Collaborator) commented Jan 13, 2021

Does anything speak against calling gc_collect_cycles() after each parseFile call? Besides finding the root cause of this problem, of course.

CC @Connum @j0k3r

j0k3r (Collaborator) commented Jan 13, 2021

Using gc_collect_cycles() is just a workaround.
Without a proper script + PDF to reproduce the leak, I don't think we will be able to fix it properly...

yapsr commented Jan 13, 2021

I managed to create a PDF file to reproduce the issue:

document_with_text_and_png_image.pdf

$file = 'document_with_text_and_png_image.pdf';
$loops = 10;
for ($i = 0; $i < $loops; $i++) {
    $parser = new \Smalot\PdfParser\Parser(); // v0.18.1
    $pdf = $parser->parseFile($file);
    echo memory_get_usage() . PHP_EOL;
}

Memory usage increases by more than 100 MB per loop.
The PDF file and its included image are relatively small (111 kB).
The only cause I can think of is the PNG image, which is really large in byte size when uncompressed.

j0k3r (Collaborator) commented Jan 13, 2021

I've run the script 10 times using Blackfire, but I don't really know how to see where the leak is: https://blackfire.io/profiles/806c9126-a571-4472-8af6-664a8e34a5b7/graph (it might only be available for 24h, I think)

But yeah, it leaks a lot:

$ php try.php
105778808
210245032
314711224
419177416
523643608
628109800
732575992
837050376
941516568
1045982760
$

Connum (Contributor) commented Jan 13, 2021

If it is indeed the image, any idea how to work around it? As there's no OCR functionality built into the library, it could simply discard images completely, but I'm not deep enough into the core of the library to see how and where that could be managed, or whether it is a feasible approach at all.

k00ni (Collaborator) commented Mar 9, 2021

@smalot do you have an idea how to fix this? It "soft" blocks #383.

@llagerlof

I am having a memory leak problem analyzing a single PDF. Out of the hundreds of PDFs I have, some of them cause a memory leak that explodes at the memory limit (even small PDFs of less than 1 MB). In my case this happens because some code internally enters an infinite loop. The infinite loop happens when the method getXrefData calls the method decodeXref, and the method decodeXref calls getXrefData again (getXrefData is called first) in /Smalot/PdfParser/RawData/RawDataParser.php.

So I am assuming we are talking about two problems:

  1. Parsing a bunch of files causes a memory leak that may or may not reach the memory limit.
  2. Some individual PDFs can themselves cause a memory leak that grows until memory is exhausted (this is my problem).

Unfortunately I can't provide any of my problematic PDFs because they contain private data, but I thought this info could help the devs in some way.
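
For what it's worth, one generic way to break that kind of mutual recursion is to remember which xref offsets have already been visited and stop when an offset comes around a second time. A minimal sketch (not the library's actual code; all names are illustrative):

class XrefWalker
{
    /** @var array<int, bool> byte offsets that were already processed */
    private $visited = [];

    public function getXrefData(string $pdfData, int $offset): array
    {
        if (isset($this->visited[$offset])) {
            // Offset seen before: stop here instead of recursing forever.
            return [];
        }
        $this->visited[$offset] = true;

        return $this->decodeXref($pdfData, $offset);
    }

    private function decodeXref(string $pdfData, int $offset): array
    {
        // ... parse the xref section at $offset ...
        // A /Prev entry points at an earlier xref section; following it
        // re-enters getXrefData(), which is where the loop can come from.
        $prev = $this->findPrevOffset($pdfData, $offset);

        return null !== $prev ? $this->getXrefData($pdfData, $prev) : [];
    }

    private function findPrevOffset(string $pdfData, int $offset): ?int
    {
        return null; // placeholder: a real parser reads the /Prev value here
    }
}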

k00ni (Collaborator) commented Jun 21, 2021

@llagerlof Are you willing to dive a little bit into the code to get us some debug information?

Please paste the parameter values of the first getXrefData call which triggers the infinite loop. The parameters should be as small as possible, just enough to trigger the memory leak.

Thanks in advance.

@huihuangjiuai

I can provide a file: https://baogaocos.seedsufe.com/2018/08/12/doc_1534042504420.pdf

Connum (Contributor) commented Jul 20, 2021

I found some time to look into the parsing of this file:

document_with_text_and_png_image.pdf

It does indeed seem to simply be the raw image data taking up the memory. If I change PDFObject.php to hand over an empty string instead of $content to the Image() constructor,

return new Image($document, $header, '' /* instead of $content */, $config);

the memory increases by only 4-5 MB on each iteration. So the question is: do we need the image data at all? We could implement an option to turn image parsing off or on, and in that case I'd tend to suggest it be turned off by default, which may on the other hand break existing implementations that do whatever you'd want to do with that image data (which doesn't even seem to be actual raw image data that you can dump into a file?!).

Regarding the other issue with decodeXref() and getXrefData() calling each other under the right circumstances, I haven't been able to wrap my mind around why that is happening and how it could be prevented!

Connum (Contributor) commented Jul 20, 2021

Just a quick follow-up regarding decodeXref() and getXrefData(): the code for these functions was apparently taken from the TCPDF library, and I couldn't find any substantial changes. Maybe someone could check whether the same problems arise when processing the affected files with TCPDF?

k00ni (Collaborator) commented Jul 21, 2021

We could implement an option to turn image parsing off or on, and in that case I'd tend to suggest it be turned off by default, which may on the other hand break existing implementations that do whatever you'd want to do with that image data (which doesn't even seem to be actual raw image data that you can dump into a file?!)

Using an option and setting it to on by default to stay compatible makes sense to me. Maybe our Config class instance can be used. Would you mind preparing a PR?
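
From the caller's side, such a switch could look roughly like this. This is only a sketch of the proposal: the setter name setRetainImageContent() and passing the Config instance to the Parser constructor are assumptions here, not an API documented in this thread.

use Smalot\PdfParser\Config;
use Smalot\PdfParser\Parser;

require 'vendor/autoload.php';

$config = new Config();
// Assumed setter name: tell the parser not to keep raw image data in memory.
$config->setRetainImageContent(false);

$parser = new Parser([], $config);
$pdf = $parser->parseFile('document_with_text_and_png_image.pdf');
echo $pdf->getText();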

b3n-l (Contributor) commented Oct 21, 2021

Just an update from using v1.1.0: I can replicate this running PHP 8.0.11 (the latest stable release), even with image parsing disabled. The parser is vulnerable to compression bombs; in the example posted previously, the image expands from 0.1 MB to approximately 105 MB.

The problem is that before the Config class is checked for disabled image parsing, gzuncompress has already been called in FilterHelper.php. If this pushes memory usage above the PHP memory limit, the script simply dies.

xpdf, for example, checks for compression bombs by doing the following in C++. This is lifted directly from the source code in Stream.cc:

// check for a 'decompression bomb'
  if (totalOut > 50000000 && totalIn < totalOut / 250) {
    error(errSyntaxError, getPos(), "Decompression bomb in flate stream");
    endOfBlock = eof = gTrue;
    remain = 0;
  }

I've tested this in FilterHelper by adding a maximum limit on the number of bytes gzuncompress can produce, on line 254. It then throws an exception rather than running into a fatal error. I don't know what the optimum values would be, but this could be useful.

I've attached a PDF that is 400 kB and will cause the demo to fail. It is the same PDF posted by yapsr, but with duplicate pages.
document_with_text_and_png_image_4pages.pdf
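
As a sketch of that idea (not FilterHelper's actual code), bounding the output of gzuncompress() turns a compression bomb into a catchable exception instead of a fatal memory-limit error. The 50 MB cap mirrors the xpdf value above and is illustrative only:

function safeFlateDecode(string $data, int $maxBytes = 50000000): string
{
    // With a max_length argument, gzuncompress() gives up (returns false)
    // once the decoded data would exceed $maxBytes.
    $decoded = @gzuncompress($data, $maxBytes);

    if (false === $decoded) {
        throw new \RuntimeException(
            'decodeFilterFlateDecode: invalid flate stream or possible decompression bomb'
        );
    }

    return $decoded;
}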

mikedodd commented May 12, 2022

FYI, this is still a problem. I am not sure what it is in the PDF that makes the memory usage skyrocket.

You can set that limit yourself:

$parser = new Parser();
$parser->getConfig()->setDecodeMemoryLimit(200000);

The limit is passed as the second argument here: https://www.php.net/manual/en/function.gzuncompress.php
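
A fuller usage sketch based on the snippet above (the 200000-byte limit is the value from this comment; tune it to your documents):

use Smalot\PdfParser\Parser;

require 'vendor/autoload.php';

$parser = new Parser();
// Cap how many bytes a compressed stream may expand to while decoding.
$parser->getConfig()->setDecodeMemoryLimit(200000);

$pdf = $parser->parseFile('document_with_text_and_png_image.pdf');

foreach ($pdf->getPages() as $page) {
    echo $page->getText();
}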
