Memory Leak #104

Open
harinderbachhal opened this issue May 18, 2016 · 36 comments

harinderbachhal commented May 18, 2016

Hi, I am using this library to extract PDF text. It works well, but there is a memory leak. I am using it as follows:

$parser = new \Smalot\PdfParser\Parser();
$pdf = $parser->parseContent($getdata[0]);
$pages = $pdf->getPages();
foreach ($pages as $page) {
    echo $page->getText();
}

unset($parser);
unset($pdf);
unset($page);
unset($pages);

echo '<h1>Memory =', round(memory_get_usage() / 1000), ' KB</h1><br>';

Memory =63226 KB

How do I fix this? How can I release the memory that is used?

smalot (Owner) commented May 18, 2016

It's really hard to fix such an issue.
There are many circular references between objects which prevent the memory from being garbage-collected.
Usually we use "__destruct" to break such behavior by setting properties to "null" or unsetting them, to help the garbage collector.
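
For illustration, a minimal sketch of that idea (not the library's actual code; all class and property names are made up): an outer object that is not part of the cycle clears the back-references in its __destruct(), so plain reference counting can reclaim the memory.

class Document
{
    /** @var Page[] */
    public $pages = [];
}

class Page
{
    /** @var Document|null back-reference that creates the cycle */
    public $document;

    public function __construct(Document $document)
    {
        $this->document = $document;
        $document->pages[] = $this;
    }
}

class DocumentHandle
{
    /** @var Document */
    public $document;

    public function __construct(Document $document)
    {
        $this->document = $document;
    }

    public function __destruct()
    {
        // Null the back-references so the objects can be freed by
        // reference counting alone, without waiting for gc_collect_cycles().
        foreach ($this->document->pages as $page) {
            $page->document = null;
        }
        $this->document->pages = [];
    }
}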

@harinderbachhal (Author)

Where do I use __destruct in this library? Only in Parser.php, or do I have to read the full library and make changes throughout?

ghost commented Aug 23, 2016

I am having the same problem. Trying to parse multiple PDF files within the same script ended up with a huge memory leak. With PHP >= 5.3 (including PHP 7), you can call gc_collect_cycles() to run the garbage collector and free objects that are only kept alive by circular references. Memory usage goes back to normal for me after this call.
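
For reference, a minimal sketch of this workaround when parsing several files in one script (file names are placeholders):

require 'vendor/autoload.php';

foreach (['a.pdf', 'b.pdf', 'c.pdf'] as $file) {
    $parser = new \Smalot\PdfParser\Parser();
    $pdf = $parser->parseFile($file);

    foreach ($pdf->getPages() as $page) {
        echo $page->getText();
    }

    // Drop our references, then collect the objects that are only kept
    // alive by circular references between them.
    unset($pdf, $parser);
    gc_collect_cycles();

    echo 'Memory: ', round(memory_get_usage() / 1024), " KB\n";
}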

@vishaldevrepublic

Having the same issue. Is there any solution for this?

$parser = new \Smalot\PdfParser\Parser();
$pdf = $parser->parseFile(asset($pdf_file)); // Stuck with "Allowed memory size of 1073741824 bytes exhausted (tried to allocate 26811 bytes)" here.
$pages = $pdf->getPages();

yapsr commented Sep 19, 2016

Thank you @citionzeno, your suggestion works for me.

ghost commented Sep 19, 2016

Glad it helped. It works if you are trying to parse multiple small files. For a single large file, however, like in @vishaldevrepublic's case, I don't know what to do. Some deep work on the library might be needed.

nedvajz mentioned this issue May 21, 2018
@amineharoun

I am having the same problem. Trying to parse multiple PDF files within the same script ended up with a huge memory leak. With PHP >= 5.3 (including PHP 7), you can call gc_collect_cycles() to run the garbage collector and free objects that are only kept alive by circular references. Memory usage goes back to normal for me after this call.

Where do I have to do that garbage collection in the code, please?

fm89 commented Dec 20, 2019

This suggestion only works for parsing multiple small files. In that case, you can call gc_collect_cycles() after parsing each file and before parsing the next one. This trick, however, does not solve the case where you want to parse a single large file.

@amineharoun

Thanks for the reply. Can you tell me how I can parse a big file (60 MB) that has 90 pages?
My script crashes with a 503 error and CPU usage reaches 100%.

fm89 commented Dec 20, 2019

This package does not appear to be a good solution for large files. See #169 also.

amineharoun commented Dec 20, 2019

Can you suggest other (PHP) solutions to use, please?

k00ni added the bug, stale, and needs decision labels Oct 14, 2020
@Jurek-Raben

Hi, we have a memory leak with specific PDFs, for example:
Allowed memory size of 134217728 bytes exhausted (tried to allocate 2097160 bytes) in /var/www/vendor/smalot/pdfparser/src/Smalot/PdfParser/Font.php on line 197

If I open the affected PDF in macOS Preview.app and export it there with "Export as PDF", the error is gone on the next try.
I could not find the difference between the two PDFs so far. The exported one is even bigger, but it seems to lack any Adobe Acrobat metadata (in a hex editor view).

k00ni (Collaborator) commented Dec 30, 2020

Can anyone provide a PDF file which causes this problem, please? It must be free of charge (and other obligations) and will become part of our test environment.

An alternative would be the content of the $param you passed to $parser->parseContent($param).

k00ni (Collaborator) commented Dec 30, 2020

Is it related to #372?

nkoporec commented Jan 4, 2021

@k00ni I have the same issue as @Jurek-Raben: there is a memory leak (memory exhausted error in Font.php, line 189) if I parse a file that contains Khmer (the official language of Cambodia) characters, or a certain file with a big map image inside (12 MB). Judging by the metadata, both were created by Microsoft Word 2016; if I re-save the PDFs with Preview, the parser works as expected.

I tried the solution in #372, but that did not resolve the issue.

Sadly, I also can't share the PDFs publicly.

@Jurek-Raben

I cannot share the PDF either if it will be used publicly...

k00ni (Collaborator) commented Jan 7, 2021

If you can spare the time, you could try to find a minimal example (= string) that triggers the problem. Please use the following code:

$parser = new \Smalot\PdfParser\Parser();

// load PDF file content
$data = file_get_contents('/path/to/pdf');

// give PDF content to function and parse it
$pdf = $parser->parseContent($data); // <= should trigger the leak

If your PDF triggers the leak, try to reduce the content of $data as much as possible. Once you get it down to a reasonable length (that's up to you), post it here. We will use it in our tests to reproduce the problem.

Here is a good example how that could look: #372 (comment)

k00ni added the needs more info label and removed the stale and needs decision labels Jan 7, 2021
yapsr commented Jan 7, 2021

In my scripts, I parse a lot of PDFs, and after a while the out-of-memory error occurs.
If I continue the script starting with the latest PDF from the previous batch, the error occurs some PDFs later. So in my view, the error is not reproducible with a single PDF.

k00ni (Collaborator) commented Jan 7, 2021

You are right, I remember the post from @ghost (#104 (comment)). He mentioned memory which is never freed.

So in my view, the error is not reproducible with a single PDF.

I just want to make sure there is no infinite loop or something similar causing this problem.

@Jurek-Raben said:

Hi, we have a memory leak with specific PDFs, for example:
Allowed memory size of 134217728 bytes exhausted (tried to allocate 2097160 bytes) in /var/www/vendor/smalot/pdfparser/src/Smalot/PdfParser/Font.php on line 197

Font.php around line 197 seems fine. In my opinion, it could be caused either by an infinite loop (or recursion) or by memory which is used but never freed.

@yapsr Do you read your PDFs in one or multiple script runs?

yapsr commented Jan 7, 2021

@yapsr Do you read your PDFs in one or multiple script runs?

I try to read multiple small PDFs (200 kB in size) with a background image (4961x7016 px PNG) and some text in a single script, but the script always crashes after a seemingly random number of PDF reads in "vendor/tecnickcom/tcpdf/include/tcpdf_filters.php:357" with this message:

"exception 'Symfony\Component\Debug\Exception\FatalErrorException' with message 'Allowed memory size of 536870912 bytes exhausted (tried to allocate 116907502 bytes)' in /home/user/project/vendor/tecnickcom/tcpdf/include/tcpdf_filters.php:357"

There I find this code:


/**
 * FlateDecode
 * Decompresses data encoded using the zlib/deflate compression method, reproducing the original text or binary data.
 * @param $data (string) Data to decode.
 * @return Decoded data string.
 * @since 1.0.000 (2011-05-23)
 * @public static
 */
public static function decodeFilterFlateDecode($data) {
    // initialize string to return
    $decoded = @gzuncompress($data);
    if ($decoded === false) {
        self::Error('decodeFilterFlateDecode: invalid code');
    }
    return $decoded;
}

So it might have something to do with gzuncompress().

The server runs PHP 5.6.27-1+deb.sury.org~trusty+1 (cli)

Here is some relevant part of my log files:

[2020-12-11 11:11:01] production.DEBUG: ParsePDF::select() : Exporting 21 files... [] []
[2020-12-11 11:11:01] production.DEBUG: ParsePDF::handle() : handling file 12345664, using 12MB [] []
[2020-12-11 11:11:01] production.DEBUG: ParsePDF::handle() : handling file 12345674, using 126MB [] []
[2020-12-11 11:11:02] production.DEBUG: ParsePDF::handle() : handling file 12345669, using 237MB [] []
[2020-12-11 11:11:02] production.DEBUG: ParsePDF::handle() : handling file 12345684, using 349MB [] []
[2020-12-11 11:11:03] production.DEBUG: ParsePDF::handle() : handling file 12345696, using 349MB [] []
[2020-12-11 11:11:04] production.DEBUG: ParsePDF::handle() : handling file 12345665, using 460MB [] []
[2020-12-11 11:11:04] production.ERROR: exception 'Symfony\Component\Debug\Exception\FatalErrorException' with message 'Allowed memory size of 536870912 bytes exhausted (tried to allocate 116907502 bytes)'...

The script seems to add about 111 MB of memory usage per PDF file.

When converting the PNG background image to BMP format manually, it turns out to be about 104 MB in size, so that does look related to the memory leak.

I hope this helps to locate the problem.

k00ni (Collaborator) commented Jan 8, 2021

Which version of PDFParser do you use?

You mentioned vendor/tecnickcom/tcpdf/include/tcpdf_filters.php:357, but we removed TCPDF a few versions ago. Can you try again with our latest version, 0.18.0, please? I remember that I removed the @ in $decoded = @gzuncompress($data); to allow error reporting.

yapsr commented Jan 12, 2021

Oops. We were using pdfparser v0.9.25. However, after removing TCPDF and updating to v0.18.1, I still get the memory usage error. Adding the gc_collect_cycles() workaround prevents the memory-limit exception.

k00ni (Collaborator) commented Jan 13, 2021

Does anything speak against calling gc_collect_cycles() after each parseFile call? Besides finding the root cause of this problem, of course.

CC @Connum @j0k3r

j0k3r (Collaborator) commented Jan 13, 2021

Using gc_collect_cycles() is just a workaround.
Without a proper script + PDF to reproduce the leak, I don't think we will be able to fix it properly...

yapsr commented Jan 13, 2021

I managed to create a PDF file to reproduce the issue:

document_with_text_and_png_image.pdf

$file = 'document_with_text_and_png_image.pdf';
$loops = 10;
for ($i = 0; $i < $loops; $i++) {
    $parser = new \Smalot\PdfParser\Parser(); // v0.18.1
    $pdf = $parser->parseFile($file);
    echo memory_get_usage() . PHP_EOL;
}

Memory usage increases by more than 100 MB per loop.
The PDF file and its included image are relatively small (111 kB).
The only cause I can think of is the PNG image, which is really large in byte size when uncompressed.

j0k3r (Collaborator) commented Jan 13, 2021

I've run the script 10 times using Blackfire, but I don't really know how to see where the leak is: https://blackfire.io/profiles/806c9126-a571-4472-8af6-664a8e34a5b7/graph (it might only be available for 24h, I think)

But yeah, it leaks a lot:

$ php try.php
105778808
210245032
314711224
419177416
523643608
628109800
732575992
837050376
941516568
1045982760
$

Connum (Contributor) commented Jan 13, 2021

If it is indeed the image, any idea how to work around it? As there's no OCR functionality built into the library, it could simply discard images completely, but I'm not deep enough into the core of the library to see how and where that could be managed, or whether it is a feasible approach at all.

k00ni (Collaborator) commented Mar 9, 2021

@smalot do you have an idea how to fix this? It "soft" blocks #383.

@llagerlof

I am having a memory leak problem analyzing a single PDF. Out of the hundreds of PDFs I have, some of them cause a memory leak that explodes at the memory limit (even small PDFs of less than 1 MB). In my case this happens because some code internally enters an infinite loop. The infinite loop happens when the method getXrefData calls the method decodeXref, and the method decodeXref calls getXrefData again (getXrefData is called first) in /Smalot/PdfParser/RawData/RawDataParser.php.

So I am assuming we are talking about two problems:

  1. Parsing a bunch of files causes a memory leak that may or may not reach the memory limit.
  2. Some individual PDFs can themselves cause a memory leak that grows until memory is exhausted (this is my problem).

Unfortunately I can't provide any of my problematic PDFs because they contain private data, but I thought this info could help the devs in some way.
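
For what it's worth, one generic way to break that kind of mutual recursion is to remember which xref offsets have already been visited and stop when an offset comes around a second time. A minimal sketch (not the library's actual code; all names are illustrative):

class XrefWalker
{
    /** @var array<int, bool> byte offsets that were already processed */
    private $visited = [];

    public function getXrefData(string $pdfData, int $offset): array
    {
        if (isset($this->visited[$offset])) {
            // Offset seen before: stop here instead of recursing forever.
            return [];
        }
        $this->visited[$offset] = true;

        return $this->decodeXref($pdfData, $offset);
    }

    private function decodeXref(string $pdfData, int $offset): array
    {
        // ... parse the xref section at $offset ...
        // A /Prev entry points at an earlier xref section; following it
        // re-enters getXrefData(), which is where the loop can come from.
        $prev = $this->findPrevOffset($pdfData, $offset);

        return null !== $prev ? $this->getXrefData($pdfData, $prev) : [];
    }

    private function findPrevOffset(string $pdfData, int $offset): ?int
    {
        return null; // placeholder: a real parser reads the /Prev value here
    }
}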

k00ni (Collaborator) commented Jun 21, 2021

@llagerlof Are you willing to dive a little bit into the code to get us some debug information?

Please paste the parameter values of the first getXrefData call which triggers the infinite loop. The parameters should be as small as possible, just enough to trigger the memory leak.

Thanks in advance.

@huihuangjiuai

I can provide a file: https://baogaocos.seedsufe.com/2018/08/12/doc_1534042504420.pdf

Connum (Contributor) commented Jul 20, 2021

I found some time to look into the parsing of this file:

document_with_text_and_png_image.pdf

It does indeed seem to simply be the raw image data taking up the memory. If I change PDFObject.php to hand over an empty string instead of $content to the Image() constructor,

return new Image($document, $header, '' /* instead of $content */, $config);

the memory increases by only 4-5 MB on each iteration. So the question is: do we need the image data at all? We could implement an option to turn image parsing off or on, and in that case I'd tend to suggest it be turned off by default, which may on the other hand break existing implementations that do whatever you'd want to do with that image data (which doesn't even seem to be actual raw image data that you can dump into a file?!).

Regarding the other issue with decodeXref() and getXrefData() calling each other under the right circumstances, I haven't been able to wrap my mind around why that is happening and how it could be prevented!

Connum (Contributor) commented Jul 20, 2021

Just a quick follow-up regarding decodeXref() and getXrefData(): the code for these functions was apparently taken from the TCPDF library, and I couldn't find any substantial changes. Maybe someone could check whether the same problems arise when processing the affected files with TCPDF?

k00ni (Collaborator) commented Jul 21, 2021

We could implement an option to turn image parsing off or on, and in that case I'd tend to suggest it be turned off by default, which may on the other hand break existing implementations that do whatever you'd want to do with that image data (which doesn't even seem to be actual raw image data that you can dump into a file?!)

Using an option and setting it to on by default to stay compatible makes sense to me. Maybe our Config class instance can be used. Would you mind preparing a PR?
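
From the caller's side, such a switch could look roughly like this. This is only a sketch of the proposal: the setter name setRetainImageContent() and passing the Config instance to the Parser constructor are assumptions here, not an API documented in this thread.

use Smalot\PdfParser\Config;
use Smalot\PdfParser\Parser;

require 'vendor/autoload.php';

$config = new Config();
// Assumed setter name: tell the parser not to keep raw image data in memory.
$config->setRetainImageContent(false);

$parser = new Parser([], $config);
$pdf = $parser->parseFile('document_with_text_and_png_image.pdf');
echo $pdf->getText();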

b3n-l (Contributor) commented Oct 21, 2021

Just an update from using v1.1.0: I can replicate this running PHP 8.0.11 (the latest stable release), even with image parsing disabled. The parser is vulnerable to compression bombs; in the example posted previously, the image expands from 0.1 MB to approximately 105 MB.

The problem is that before the Config class is checked for disabled image parsing, gzuncompress has already been called in FilterHelper.php. If this pushes memory usage above the PHP memory limit, the script simply dies.

xpdf, for example, checks for compression bombs by doing the following in C++. This is lifted directly from the source code in Stream.cc:

// check for a 'decompression bomb'
  if (totalOut > 50000000 && totalIn < totalOut / 250) {
    error(errSyntaxError, getPos(), "Decompression bomb in flate stream");
    endOfBlock = eof = gTrue;
    remain = 0;
  }

I've tested this in FilterHelper by adding a maximum limit on the number of bytes gzuncompress can produce, on line 254. It then throws an exception rather than running into a fatal error. I don't know what the optimum values would be, but this could be useful.

I've attached a PDF that is 400 kB and will cause the demo to fail. It is the same PDF posted by yapsr, but with duplicate pages.
document_with_text_and_png_image_4pages.pdf
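
As a sketch of that idea (not FilterHelper's actual code), bounding the output of gzuncompress() turns a compression bomb into a catchable exception instead of a fatal memory-limit error. The 50 MB cap mirrors the xpdf value above and is illustrative only:

function safeFlateDecode(string $data, int $maxBytes = 50000000): string
{
    // With a max_length argument, gzuncompress() gives up (returns false)
    // once the decoded data would exceed $maxBytes.
    $decoded = @gzuncompress($data, $maxBytes);

    if (false === $decoded) {
        throw new \RuntimeException(
            'decodeFilterFlateDecode: invalid flate stream or possible decompression bomb'
        );
    }

    return $decoded;
}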

mikedodd commented May 12, 2022

FYI, this is still a problem. I am not sure what it is in the PDF that makes the memory usage skyrocket.

You can set that limit yourself:

$parser = new Parser();
$parser->getConfig()->setDecodeMemoryLimit(200000);

The limit is passed as the second argument here: https://www.php.net/manual/en/function.gzuncompress.php
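
A fuller usage sketch based on the snippet above (the 200000-byte limit is the value from this comment; tune it to your documents):

use Smalot\PdfParser\Parser;

require 'vendor/autoload.php';

$parser = new Parser();
// Cap how many bytes a compressed stream may expand to while decoding.
$parser->getConfig()->setDecodeMemoryLimit(200000);

$pdf = $parser->parseFile('document_with_text_and_png_image.pdf');

foreach ($pdf->getPages() as $page) {
    echo $page->getText();
}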
