Reduce excessive substring allocation, using preg_match's offset parameter #595

se-ti · 2023-04-14T19:53:21Z

This PR proposes to use preg_match's offset parameter instead of preparing special string for it.
This change requires minor changes of regexps: "match start of string" meta-character ^ should be replaced with \G escape sequence https://www.php.net/manual/en/regexp.reference.escape.php

This PR slightly improves memory allocation, achieved with pulls #582, #590, but would be useful for all PDF's and not only image-rich ones.

This PR shows
42M for the first launch, 18M for the best subsequent vs
43M for the first launch, 18M for the best subsequent for pull 590

https://westra.ru/reports/kavkaz/danilina-1el2-arhyz-2021.pdf

$lim = '128M';
ini_set("memory_limit", $lim);
include 'alt_autoload.php-dist';

$path = 'danilina-1el2-arhyz-2021.pdf';

$config = new \Smalot\PdfParser\Config();
$config->setRetainImageContent(false);
$parser = new \Smalot\PdfParser\Parser([], $config);

$content = file_get_contents($path);
$pdf = $parser->parseContent($content);

$text = $pdf->getText();

echo ini_get("memory_limit")."\n";
echo $text;

k00ni · 2023-04-17T07:22:51Z

Wow another PR already. Thanks for investing the time. I will try to get back to you asap!

k00ni

@se-ti In the following example you replaced the part begin of the string (^) with first appearance (\G).

/^\s*(?P<name>\/[A-Z0-9\._]+)(?P<value>.*)/si
/\G\s*(?P<name>\/[A-Z0-9\._]+)(?P<value>.*)/si

You altered the scope of the regex, because they can produce different results. As of now, it always checks directly at the beginning of the string and if nothing is to be found, the if-clause is dismissed.

My question is, is this kind of information you check with the regex always at the beginning of the string anyway? Can there multiple instances of the this kind of information in the string at question?

se-ti · 2023-04-25T16:03:18Z

Sorry, i don't understand You. :(
As of now, it always checks directly at the beginning of the string...
It was a /^\s*(?P<name>\/[A-Z0-9\._]+)(?P<value>.*)/si
It DID check (see ^) just very beginning of the string.

I have replaced it with \G, which is "first matching position in subject". And matching position is changed because of the $offset parameter.

The \G assertion is true only when the current matching position is at the start point of the match, as specified by the offset argument of preg_match. It differs from \A when the value of offset is non-zero.
https://www.php.net/manual/en/regexp.reference.escape.php

Multiple instances are not the issue, because of regexp's greedyness -- see .* in the end.
And this code definitly works not with the whole document, but with it's small part, substr-ed at higher level.

se-ti · 2023-04-28T14:10:19Z

Have i answered your question, @k00ni?
Do i miss something?

k00ni

I just wanna make sure we don't change parser behavior by accident.

My question about \G was due to the fact that it behaves differently if $offset is set. Quote:

The \G assertion is true only when the current matching position is at the start point of the match, as specified by the offset argument of preg_match(). It differs from \A when the value of offset is non-zero.

And about \A it says:

start of subject (independent of multiline mode)

After reading the docs again I am for merging this.

Offtopic: We have a few "performance" tests in https://github.com/smalot/pdfparser/tree/master/tests/Performance. Their purpose is to check time or RAM usage when processing some more-or-less heavy data. You wrote that you did some performance tests yourself. Do you think we should integrate them?

se-ti · 2023-05-01T08:39:46Z

My performance tests were pretty naive: just launching code from above with different memory limits, testing whether it would fail or not.

se-ti · 2023-05-01T08:54:31Z

Looked at existing performance tests. They are about time performance.
I'd think, but right now i do not see reliable way to test memory performance.

One of reasons are these results "X Mb for the first launch, X/2.5 for the best subsequent one".
I hope, that was OpCache warm up, but won't bet it. :(

se-ti added 4 commits April 13, 2023 12:46

use preg_match offset

9335b1a

clean code

d5ce6da

fix codestyle

b6f133c

preg_match returns false just on error

4378fe6

k00ni added the enhancement label Apr 17, 2023

k00ni reviewed Apr 24, 2023

View reviewed changes

k00ni added the needs more info label Apr 24, 2023

k00ni approved these changes May 1, 2023

View reviewed changes

k00ni merged commit fafe2f2 into smalot:master May 10, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Reduce excessive substring allocation, using preg_match's offset parameter #595

Reduce excessive substring allocation, using preg_match's offset parameter #595

se-ti commented Apr 14, 2023

k00ni commented Apr 17, 2023

k00ni left a comment •

edited

Loading

se-ti commented Apr 25, 2023 •

edited

Loading

se-ti commented Apr 28, 2023

k00ni left a comment •

edited

Loading

se-ti commented May 1, 2023

se-ti commented May 1, 2023

Reduce excessive substring allocation, using preg_match's offset parameter #595

Reduce excessive substring allocation, using preg_match's offset parameter #595

Conversation

se-ti commented Apr 14, 2023

k00ni commented Apr 17, 2023

k00ni left a comment • edited Loading

Choose a reason for hiding this comment

se-ti commented Apr 25, 2023 • edited Loading

se-ti commented Apr 28, 2023

k00ni left a comment • edited Loading

Choose a reason for hiding this comment

se-ti commented May 1, 2023

se-ti commented May 1, 2023

k00ni left a comment •

edited

Loading

se-ti commented Apr 25, 2023 •

edited

Loading

k00ni left a comment •

edited

Loading