Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] bpo-23689: re module, fix memory leak when a match is terminated by a signal #32188

Closed
wants to merge 4 commits into from
Closed

Conversation

ghost
Copy link

@ghost ghost commented Mar 30, 2022

A draft

Count the number of REPEAT when compiling a pattern, and allocate an array in SRE_STATE.
At any time, a REPEAT will have at most one in active, so a SRE_REPEAT array is fine.

https://bugs.python.org/issue23689

Count the number of REPEAT when compiling a pattern, and allocate an array in `SRE_STATE`.
At any time, a REPEAT will have at most one in active, so a `SRE_REPEAT` array is fine.
@ghost
Copy link
Author

ghost commented Mar 30, 2022

@serhiy-storchaka
This is a draft. If use 5~7 days, it may achieve better quality.
I can wait until your PR "bpo-47152: Convert the re module into a package" is merged, then I redo this PR.

Copy link
Member

@serhiy-storchaka serhiy-storchaka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems repeat_index is only needed for REPEAT.

Modules/sre_lib.h Outdated Show resolved Hide resolved
Modules/sre_lib.h Outdated Show resolved Hide resolved
@ghost
Copy link
Author

ghost commented Mar 30, 2022

This PR has not been carefully checked yet, it will take a few days to do this.

Just want to ask your arrangement, is this PR merged first, or your PR first?
I tend to merge your PR first, then I can carefully check when you are preparing your PR.

@serhiy-storchaka
Copy link
Member

I prefer to wait until this PR is merged. It is easy to redo my changes.

@ghost
Copy link
Author

ghost commented Mar 30, 2022

Ok, I will finish this PR in a few days.

Copy link
Member

@serhiy-storchaka serhiy-storchaka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I propose:

  1. Remove repeat_index in POSSESSIVE_REPEAT.
  2. Remove REPEAT count in INFO and pass it as a separate argument to _sre.compile().
  3. Validate repeat_index for REPEAT in _validate_inner().

@@ -1200,10 +1200,10 @@ SRE(match)(SRE_STATE* state, const SRE_CODE* pattern, int toplevel)

case SRE_OP_POSSESSIVE_REPEAT:
/* create possessive repeat contexts. */
/* <POSSESSIVE_REPEAT> <skip> <1=min> <2=max> pattern
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about POSSESSIVE_REPEAT? repeat_index is not used here. MAX_UNTIL and MIN_UNTIL are not used with POSSESSIVE_REPEAT, so there is no problem with keeping POSSESSIVE_REPEAT intact.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I add more conditional checks, I'm afraid to slow down the pattern compilation.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not much. Checking if op is POSSESSIVE_REPEAT is cheap in comparison with parsing or optimization.

Comment on lines +594 to +596
# REPEAT count
assert len(code) == _REPEAT_COUNT_OFFSET
emit(0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems to me that INFO is only used for optimization. If you ignore it, all will work, but maybe slower.

Essential information like the number of groups etc is passed to _sre.compile() as separate arguments. The REPEAT count is of such kind. If it is not known we cannot start matching.

I propose to pass it as a separate argument. And in _validate_inner() check that all repeat_index are less than that count (like we validate group indices and the MARK argument).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds reasonable, I'll do these.
Please merge your PR first.

@ghost
Copy link
Author

ghost commented Mar 30, 2022

In _compile() recursive function, it seems hard to return the number of REPEATs.

   def _compile(code, pattern, flags):
       # `code` is a list object
       # `pattern` and `flags` are input data
       ...

For counting the number of REPEATs, it's better at compile-time than parse-time.
Is it worth adding a parameter to _compile() function for this purpose?

@serhiy-storchaka
Copy link
Member

You can save it as an attribute of pattern.state.

@ghost
Copy link
Author

ghost commented Mar 30, 2022

You can save it as an attribute of pattern.state.

In _compile() function, pattern.state is not accessable. pattern is SubPattern.data, not SubPattern.
I have an idea, I will try tommorw, time to sleep.

@ghost
Copy link
Author

ghost commented Mar 31, 2022

CompileData.diff changes code to a CompileData object, which may save intermediate states during compilation. Only apply this patch to main branch, MSVC2022 non-PGO release build runs test_re.py is 2.1% slower (0.277 sec -> 0.283 sec).

Another way is to temporarily store it in a field in SRE_OP_INFO, but this would make SRE_OP_INFO take up an unnecessary field.

To be honest, the whole patch is intricacy, so reviewing is harder than making it from zero. If you think so too, you may make a PR instead of this PR.

CompileData.diff.txt (.diff attachment is not supportted by GitHub, so I added .txt)

edit: below patch is faster a bit.

@ghost
Copy link
Author

ghost commented Mar 31, 2022

This simpler patch is faster (2.80 sec)
simple_patch.diff.txt

@ghost
Copy link
Author

ghost commented Mar 31, 2022

I can divide the patch into some commits, each commit only modify one simple step, which makes the review much easier. Then this PR will be closed.
I suggest merge your PR first.

@ghost ghost closed this Apr 1, 2022
@ghost ghost deleted the repeat_count branch April 4, 2022 06:11
This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants