allow lookarounds in conditionals #163

Closed
mrabarnett opened this issue Oct 25, 2015 · 9 comments
Labels
enhancement New feature or request minor

Comments

@mrabarnett
Owner

Original report by Anonymous.


It would be really helpful to allow lookarounds, in addition to a group name/id, in conditional expressions, as PCRE does, to allow a regex like this:
regex.findall(r'(?(?<=love\s)you|(?<=hate\s)her)', "I love you but I don't hate her either. You and her are so different")

@mrabarnett
Owner Author

Original comment by Matthew Barnett (Bitbucket: mrabarnett, GitHub: mrabarnett).


I don't see the point; as far as I can see, it doesn't add anything.

The purpose of a conditional expression is to test whether a capture group has matched anything.

What would it do that a bare lookaround doesn't already do?

Wouldn't r'(?(?<=love\s)you|(?<=hate\s)her)' just give the same results as r'((?<=love\s)you|(?<=hate\s)her)'?

@mrabarnett
Owner Author

Original comment by Anonymous.


Hello, I posted the proposal as anonymous by accident.
Yes, it is indeed the same. I posted it as an example of behavior, not as a motivation. Sorry, poor explanation, I reckon.

The real reason it interests me is to make the whole expression more general for metaprogramming or dynamic generation of regexes.
Sometimes I want a conditional in a template to respond either to the existence of a previously captured group or to content around the current position in the string being matched. The specific behavior, and so the regex actually crafted, depends on what is going on in the main program calling the regex facility. To do this dynamically at runtime I have to treat the two cases separately. If lookarounds were recognized, that would not be necessary, and the template would be cleaner.
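
A minimal sketch of the kind of template helper described above, to make the "two cases" concrete. The names `build_conditional` and `negate_lookaround` are hypothetical and not part of any library. The sketch assumes the standard `re` module, which accepts only group references as conditional tests, so a lookaround test has to be emitted as a different construct: two alternatives, the second guarded by the negated test.

```python
import re


def negate_lookaround(test: str) -> str:
    """Turn a positive lookaround into its negative form (hypothetical helper)."""
    for positive, negative in (("(?<=", "(?<!"), ("(?=", "(?!")):
        if test.startswith(positive):
            return negative + test[len(positive):]
    raise ValueError(f"not a positive lookaround: {test!r}")


def build_conditional(test, then: str, else_: str) -> str:
    """Build a 'conditional' from a group reference or a lookaround (hypothetical)."""
    if isinstance(test, int) or test.isidentifier():
        # Case 1: group number or name, supported natively as (?(id)then|else).
        return f"(?({test}){then}|{else_})"
    # Case 2: lookaround test, emulated as (test)then|(negated test)else.
    return f"(?:{test}{then}|{negate_lookaround(test)}{else_})"


# Group-reference case: require ">" only if "<" was captured.
tag = r"(<)?\w+" + build_conditional(1, ">", "")
print(bool(re.fullmatch(tag, "<tag>")), bool(re.fullmatch(tag, "<tag")))  # True False

# Lookaround case: a different construct has to be generated.
print(build_conditional(r"(?<=love\s)", "you", "her"))
```

With lookarounds accepted directly as the test, both cases could share the single `(?(test)then|else)` template.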

In short, the PCRE behavior doesn't add or detract anything from a manually crafted pattern, but it would simplify some interesting dynamic techniques, especially in a language like Python that has great metaprogramming capabilities.
Maybe there is a way to create a neat general template that does the same thing without resorting to lookarounds in conditionals, but I tried and did not succeed.

Regards

@mrabarnett
Owner Author

Original comment by Matthew Barnett (Bitbucket: mrabarnett, GitHub: mrabarnett).


I've realised that they're not the same.

With a bare lookaround, if it chooses the first branch and subsequently fails, it'll backtrack and try the second branch.

With a lookaround in a conditional expression, if it chooses the first branch and subsequently fails, it'll backtrack but won't try the second branch.

For example, on the string "123abc", ^(?:(?=\d)\d+\b|\w+) will match but ^(?(?=\d)\d+\b|\w+) won't match.
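
The difference can be reproduced with the standard `re` module alone. This is only a sketch: `match_conditional` is a hypothetical helper that emulates the conditional by evaluating the test once and then committing to a single branch, which is the behavior described above. It is not how the `regex` module implements the feature.

```python
import re


def match_conditional(test: str, then: str, else_: str, text: str):
    """Emulate ^(?(test)then|else): choose one branch from the test and commit to it."""
    branch = then if re.match(test, text) else else_
    return re.match(branch, text)


text = "123abc"

# Alternation: \d+\b fails (no word boundary between "3" and "a"),
# so the engine backtracks and tries the second alternative, \w+.
alternation = re.match(r"^(?:(?=\d)\d+\b|\w+)", text)

# Conditional: the lookahead succeeds, so only \d+\b is tried, and it fails.
conditional = match_conditional(r"(?=\d)", r"\d+\b", r"\w+", text)

print(alternation.group())  # 123abc
print(conditional)          # None
```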

@mrabarnett
Owner Author

Original comment by Anonymous.


Oh, I feel dumb now. It makes sense that the whole conditional has to be skipped, whereas an alternation backtracks into the other alternative. A conditional implements mutual exclusion, not alternation. It turned out that, in my code, the 'then' and 'else' subexpressions were simple and mutually exclusive, so I got lucky despite my ignorance and it worked (close call!).
It no longer works as the general tool I thought it would be if the 'then' subexpression triggers a backtrack, as in your example. Otherwise it still works, but it's risky business and definitely not a good reason to ask you to implement that behavior.
Unfortunate, but that behavior is still really interesting, and I can see how it could be useful.

Any expression of the type (? (test) then | else )
should be refactored like (test) then | (complement-of test) else

This (unsightly) example:

#!python

regex.sub(r'(?(?<=(?:[^3](?=..a)))(\d\D)|.)', r'\1-', '23dac83a6bc93ad')

should be refactored to:

#!python

regex.sub(r'(?<=(?:[^3](?=..a)))(\d\D)|(?<!(?:[^3](?=..a))).', r'\1-', '23dac83a6bc93ad')

both returning '-3d---8-----9---'
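
The refactored form can be checked with the standard `re` module, which accepts this lookbehind because its width is fixed at one character (the nested lookahead adds no width). A small sketch, assuming Python 3.5 or later, where an unmatched `\1` in the replacement expands to an empty string. The variable names are illustrative only.

```python
import re

text = "23dac83a6bc93ad"

test = r"(?<=(?:[^3](?=..a)))"        # the condition
complement = r"(?<!(?:[^3](?=..a)))"  # its negation, guarding the 'else' part

# (test) then | (complement-of test) else
pattern = test + r"(\d\D)" + "|" + complement + r"."

# Where the test holds but \d\D fails (at "8" and "9"), neither alternative
# can match, so those characters are left untouched.
result = re.sub(pattern, r"\1-", text)
print(result)  # -3d---8-----9---
```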

Besides making the pattern ugly, this can make matching significantly slower, since the engine has to check (complement-of test) every time it backtracks, especially with variable-length or complex lookbehinds. For these two reasons I think it is still a valuable enhancement to implement, actually more so than for my original motivation, since it would apply more often than some dynamic generation/metaprogramming scenario.

@mrabarnett
Owner Author

Original comment by Matthew Barnett (Bitbucket: mrabarnett, GitHub: mrabarnett).


It's not only the 'then' part that could trigger backtracking. It could match the 'then' part, progress into the remainder of the pattern, fail, backtrack through the 'then' part, then try the 'else' part.

Anyway, it's now on my todo list.

@mrabarnett
Owner Author

Original comment by Anonymous.


Great!
But I don't understand the previous remark.
How, in:

#!python

regex.findall(r'(?(?=\w).{3}|.+)b', 'a123bc')

could the 'else' part be checked after backtracking through the 'then' subexpression? Isn't the whole conditional skipped, leaving the pattern position pointer after ...\w+) ?
A backtrack could be triggered in the 'else', sure, but how can it be reached after the 'then' has been traversed, regardless of its success?

I get 123b not a123b in https://regex101.com/#pcre
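
For comparison, a short sketch of what the bare-lookaround alternation does on the same input, using the standard `re` module. The conditional form needs the third-party `regex` module (2015.10.29 or later, per this thread), so that part is guarded and its output is not asserted here.

```python
import re

text = "a123bc"

# Alternation: at position 0 the first alternative consumes "a12", the
# trailing b fails against "3", and the engine then falls back to the second
# alternative (.+), which gives characters back until b can match.
alternation_result = re.findall(r"(?:(?=\w).{3}|.+)b", text)
print(alternation_result)  # ['a123b']

# Conditional: once the lookahead succeeds, only .{3} may be used at that
# position, so a match starting at "a" is impossible.
try:
    import regex
except ImportError:
    regex = None  # third-party module not installed

if regex is not None:
    print(regex.findall(r"(?(?=\w).{3}|.+)b", text))
```

The a123b result is the one that only the alternation can produce, since it requires trying the second branch after the first one was taken and the rest of the pattern failed.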

By the way, thank you for your replies, and congratulations on the rest of the work you've done on this very good regex package.

@mrabarnett
Owner Author

Original comment by Matthew Barnett (Bitbucket: mrabarnett, GitHub: mrabarnett).


Added in regex 2015.10.29.

@mrabarnett
Owner Author

Original comment by 王珺 (Bitbucket: sulk, GitHub: sulk).


In my tests, the original example fails:

#!python

regex.search(r'(?(?<=love\s)you|(?<=hate\s)her)', "I love you")

yields None while you is expected, and Python crashes when executing

#!python

regex.findall(r'(?(?<=love\s)you|(?<=hate\s)her)', "I love you but I don't hate her either")

Regards

@mrabarnett
Owner Author

Original comment by Matthew Barnett (Bitbucket: mrabarnett, GitHub: mrabarnett).


It's fixed now.
