[WIP] bpo-23689: re module, fix memory leak when a match is terminated by a signal #32188
Count the number of REPEATs when compiling a pattern, and allocate an array of that size in `SRE_STATE`. At any time, each REPEAT has at most one active instance, so a flat `SRE_REPEAT` array is sufficient.
@serhiy-storchaka |
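To illustrate the idea in the description, here is a minimal Python sketch of a state object that allocates one slot per REPEAT up front. `ToyState`, `RepeatSlot`, and their fields are hypothetical stand-ins, not the actual `SRE_STATE` / `SRE_REPEAT` layout, and the sketch relies on the PR's assumption that each REPEAT has at most one active instance.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RepeatSlot:
    """Toy stand-in for one SRE_REPEAT entry (hypothetical fields)."""
    count: int = 0
    active: bool = False


@dataclass
class ToyState:
    """Toy stand-in for SRE_STATE: every slot is allocated up front."""
    repeat_count: int
    repeats: List[RepeatSlot] = field(init=False)

    def __post_init__(self) -> None:
        # One slot per REPEAT in the pattern; nothing is allocated later.
        self.repeats = [RepeatSlot() for _ in range(self.repeat_count)]

    def enter_repeat(self, repeat_index: int) -> RepeatSlot:
        # Reuse the preallocated slot instead of allocating a new context.
        slot = self.repeats[repeat_index]
        slot.count = 0
        slot.active = True
        return slot


state = ToyState(repeat_count=2)
first = state.enter_repeat(1)
second = state.enter_repeat(1)  # re-entering reuses the same slot object
```

Because the slots belong to the state, an interrupted match leaves no separately allocated repeat context behind; releasing the state releases everything.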
Seems `repeat_index` is only needed for `REPEAT`.
Same as the code above here, maybe a bit more cache friendly.
This PR has not been carefully checked yet; that will take a few days. I just want to ask about ordering: should this PR be merged first, or yours?
I prefer to wait until this PR is merged. It is easy to redo my changes.
OK, I will finish this PR in a few days.
What I propose:
- Remove `repeat_index` in POSSESSIVE_REPEAT.
- Remove REPEAT count in INFO and pass it as a separate argument to `_sre.compile()`.
- Validate `repeat_index` for REPEAT in `_validate_inner()`.
@@ -1200,10 +1200,10 @@ SRE(match)(SRE_STATE* state, const SRE_CODE* pattern, int toplevel)

        case SRE_OP_POSSESSIVE_REPEAT:
            /* create possessive repeat contexts. */
            /* <POSSESSIVE_REPEAT> <skip> <1=min> <2=max> pattern
What about POSSESSIVE_REPEAT? `repeat_index` is not used here. MAX_UNTIL and MIN_UNTIL are not used with POSSESSIVE_REPEAT, so there is no problem with keeping POSSESSIVE_REPEAT intact.
I'm afraid that adding more conditional checks will slow down pattern compilation.
Not much. Checking if op is `POSSESSIVE_REPEAT` is cheap in comparison with parsing or optimization.
    # REPEAT count
    assert len(code) == _REPEAT_COUNT_OFFSET
    emit(0)
It seems to me that INFO is only used for optimization. If you ignore it, all will work, but maybe slower.
Essential information, such as the number of groups, is passed to `_sre.compile()` as separate arguments. The REPEAT count is of this kind: if it is not known, we cannot start matching.
I propose to pass it as a separate argument, and to check in `_validate_inner()` that every `repeat_index` is less than that count (like we validate group indices and the MARK argument).
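The proposed check can be sketched in Python on a toy instruction list. Everything here is an assumption made for illustration: the tuple-based code format, the `validate_toy` name, and the exact bounds are not the real `_validate_inner()` logic, which is written in C and works on the compiled code array.

```python
def validate_toy(code, groups, repeat_count):
    """Raise ValueError if any index in the toy code is out of range.

    `code` is a list of tuples such as ("REPEAT", repeat_index).
    `groups` and `repeat_count` arrive as separate arguments, not via INFO.
    """
    for op, *args in code:
        if op == "GROUPREF":
            # Toy bound: group indices must be below the group count.
            if not 0 <= args[0] < groups:
                raise ValueError(f"group index {args[0]} out of range")
        elif op == "REPEAT":
            # args[0] is the hypothetical repeat_index operand.
            if not 0 <= args[0] < repeat_count:
                raise ValueError(f"repeat_index {args[0]} out of range")
        # POSSESSIVE_REPEAT carries no repeat_index under the proposal,
        # so there is nothing to check for it.
    return True


good = [("REPEAT", 0), ("POSSESSIVE_REPEAT",), ("GROUPREF", 0)]
bad = [("REPEAT", 1)]  # index 1 is invalid when repeat_count == 1
```

With `repeat_count=1`, `good` passes and `bad` is rejected before matching could index past the end of the preallocated array.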
Sounds reasonable, I'll do these.
Please merge your PR first.
In def _compile(code, pattern, flags):
    # `code` is a list object
    # `pattern` and `flags` are input data
    ...
For counting the number of REPEATs, compile time is a better place than parse time.
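A minimal sketch of counting at compile time, assuming a toy nested-tuple parse tree and a toy flat code list (neither matches the real `sre_parse` / `sre_compile` data structures; `compile_toy` and the opcode tuples are hypothetical). Each REPEAT gets the next free index as it is emitted, and the final counter value is the REPEAT count.

```python
def compile_toy(tree):
    """Compile a toy parse tree; return (code, repeat_count)."""
    code = []          # `code` is a list object, as in _compile()
    repeat_count = 0   # number of REPEATs emitted so far

    def _compile(nodes):
        nonlocal repeat_count
        for node in nodes:
            kind = node[0]
            if kind == "literal":
                code.append(("LITERAL", node[1]))
            elif kind == "repeat":
                _, lo, hi, body = node
                index = repeat_count  # assign the next free repeat_index
                repeat_count += 1
                code.append(("REPEAT", index, lo, hi))
                _compile(body)
                code.append(("UNTIL",))
            elif kind == "possessive_repeat":
                _, lo, hi, body = node
                # No repeat_index here: possessive repeats are not counted.
                code.append(("POSSESSIVE_REPEAT", lo, hi))
                _compile(body)
                code.append(("SUCCESS",))
            else:
                raise ValueError(f"unknown node kind: {kind!r}")

    _compile(tree)
    return code, repeat_count


tree = [
    ("repeat", 0, 5, [("literal", "a"),
                      ("repeat", 1, 2, [("literal", "b")])]),
    ("possessive_repeat", 0, 1, [("literal", "c")]),
]
code, repeat_count = compile_toy(tree)
```

Here the two ordinary repeats receive indices 0 and 1 in emission order, and the possessive repeat does not consume an index.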
You can save it as an attribute of |
In |
Another way is to temporarily store it in a field in
To be honest, the whole patch is intricate, so reviewing it is harder than writing it from scratch. If you think so too, you may open your own PR in place of this one.
edit: the patch below is a bit faster.
This simpler patch is faster (2.80 sec)
I can divide the patch into several commits, each modifying only one simple step, which would make the review much easier. This PR would then be closed.
A draft
https://bugs.python.org/issue23689