set qualifiers - feature idea #11

mrabarnett · 2011-06-02T12:45:37Z

Some background: I've been working with very large REs in CPython and IronPython. We generate the RE pattern from lists, like lists of cities or lists of names, somewhat like this:

import re

namelist = open("names.txt").read().split()
pattern = re.compile("|".join(namelist))

The one I'm working with now is just a pattern for finding substrings that look like the name of a person. It's overflowing the System::Text::RegularExpressions buffers on IronPython, but works OK with CPython 2.6 on 64-bit Ubuntu.

One of the things I've been thinking is that this kind of pattern should be handled differently. Suppose there was some syntax like

pattern = re.compile("(?S<names>)", names=ImmutableSet(namelist))

where (?S indicates a named ImmutableSet whose members are drawn from the keyword argument of that name. The compiler would generate a reasonably fast pattern from that set, say a character class built from the union of all characters in all the strings in the set, with minimum and maximum repeat counts taken from the shortest and longest elements of the set. When the engine runs, it would match that fast pattern and, on success, check whether the matched text is a member of the named set. If so, the match would be confirmed; if not, it would fail.
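The two-stage idea described above can be sketched with the standard `re` module. This is only an illustration of the proposal, not how regex implements it; the helper names `build_prefilter` and `find_first` are hypothetical, and the sketch assumes the collection contains at least one non-empty string.

```python
import re

def build_prefilter(strings):
    """Return (prefilter, members, min_len) for a collection of literal strings."""
    members = frozenset(strings)
    # Union of every character used by any member of the set.
    alphabet = "".join(sorted(set("".join(members))))
    min_len = min(len(s) for s in members)
    max_len = max(len(s) for s in members)
    # Cheap pattern: a run of allowed characters, bounded by the member lengths.
    prefilter = re.compile("[%s]{%d,%d}" % (re.escape(alphabet), min_len, max_len))
    return prefilter, members, min_len

def find_first(text, strings):
    """Return (start, matched_string) for the first confirmed member, or None."""
    prefilter, members, min_len = build_prefilter(strings)
    pos = 0
    while True:
        m = prefilter.search(text, pos)
        if m is None:
            return None
        candidate = m.group()
        # Confirm by set membership, shrinking from the right much as a
        # backtracking engine would give characters back.
        for end in range(len(candidate), min_len - 1, -1):
            if candidate[:end] in members:
                return (m.start(), candidate[:end])
        # Not a member at this position: retry one character further on.
        pos = m.start() + 1

hit = find_first("call Alice Smith now", ["Alice", "Bob", "Alice Smith"])
miss = find_first("xyz", ["Bob"])
```

The membership test is a hash lookup, so the cost of confirming a candidate does not grow with the number of names the way a long alternation does.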

Seems like this might be a useful feature for regex to have, given the popularity of this kind of machine-generated RE.


mrabarnett · 2011-06-02T12:50:30Z

Original comment by Anonymous.

Thinking about this a bit more, it would be more appropriate to use something like "\L<name>" instead of "(?S<name>)".

mrabarnett · 2011-06-02T13:35:44Z

Original comment by Anonymous.

Could you provide me with some test data so that I can see what's needed and how it would be used, try some experiments, and see whether it 'feels' right and whether it's the right approach?

mrabarnett · 2011-06-03T08:54:11Z

Original comment by Anonymous.

Sure. Here's one I've been trying on CPython 2.6 on 64-bit Ubuntu (works), CPython 2.7 on 64-bit Windows (OverflowError), and IronPython 2.7 on 64-bit .NET (StackOverflowError).

mrabarnett · 2011-06-07T20:46:35Z

Original comment by Anonymous.

Named lists have been added (provisionally).

mrabarnett · 2011-06-09T09:47:57Z

Original comment by Anonymous.

I downloaded the PyPI version, built and installed it on Python 2.5.1, and tried it:

>>> import regex
>>> p = regex.compile(r"333\L<bar>444", bar=set(["one", "two", "three"]))
>>> p.match("333four444")
>>> p.match("333four444")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
SystemError: bad format char passed to Py_BuildValue

Does that seem right to you?

>>> p.match("333one444")
>>>

And that should have matched, right?

mrabarnett · 2011-06-09T12:18:24Z

Original comment by Anonymous.

It was passing "y#" for bytestrings, which is Python 3. Fixed.

mrabarnett · 2011-06-09T19:03:58Z

Original comment by Anonymous.

Ah, OK. I re-downloaded from PyPI, now it's working. But here's another issue:

>>> p = regex.compile(r"3\L<bar>4\L<bar>+5", bar=sets.ImmutableSet(["one", "two", "three"]))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Python/2.5/site-packages/regex.py", line 266, in compile
    return _compile(pattern, flags, kwargs)
  File "/Library/Python/2.5/site-packages/regex.py", line 371, in _compile
    parsed = parse_pattern(source, info)
  File "/Library/Python/2.5/site-packages/_regex_core.py", line 296, in parse_pattern
    branches = [parse_sequence(source, info)]
  File "/Library/Python/2.5/site-packages/_regex_core.py", line 313, in parse_sequence
    item = parse_item(source, info)
  File "/Library/Python/2.5/site-packages/_regex_core.py", line 323, in parse_item
    element = parse_element(source, info)
  File "/Library/Python/2.5/site-packages/_regex_core.py", line 424, in parse_element
    return parse_escape(source, info, False)
  File "/Library/Python/2.5/site-packages/_regex_core.py", line 833, in parse_escape
    return parse_string_set(source, info)
  File "/Library/Python/2.5/site-packages/_regex_core.py", line 950, in parse_string_set
    return string_set(info, name)
  File "/Library/Python/2.5/site-packages/_regex_core.py", line 289, in string_set
    return StringSet(info, name)
  File "/Library/Python/2.5/site-packages/_regex_core.py", line 2637, in __init__
    index, min_len, max_len = info.string_sets[self.set_key]
ValueError: too many values to unpack
>>>

mrabarnett · 2011-06-10T03:25:48Z

Original comment by Anonymous.

Fixed.

mrabarnett · 2011-06-15T21:40:17Z

Original comment by Anonymous.

I've updated my test case to add some larger regular expressions.

mrabarnett · 2011-06-16T06:52:53Z

Original comment by Anonymous.

I just tested this enhancement (cf.: http://mail.python.org/pipermail/python-list/2011-June/1274529.html ) and would like to ask about the treatment of metacharacters in the items of the options set. I had inferred from the overview text that they would be escaped, but they appear to be discarded completely, cf.:

>>> regex.findall(r"^\L<options>", "solid QWERT", options=set(['good', 'brilliant', '+s\\ol[i}d']))
['solid']
>>> regex.findall(r"^\L<options>", "+solid QWERT", options=set(['good', 'brilliant', '+solid']))
[]
>>>

I believed the first pattern shouldn't match if the items are escaped (and should cause an error if they are taken unchanged), while the second one would match with escaping. Or am I missing something?

regards,
vbr
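For comparison, the behaviour expected above (every item matched literally) can be reproduced with the standard `re` module by escaping each item before joining. This is a minimal sketch of the expectation, not of how regex handles `\L<...>` internally; the helper name `alternation` is hypothetical.

```python
import re

def alternation(options):
    """Build an alternation in which every option is matched literally."""
    return "|".join(re.escape(option) for option in options)

weird = re.compile("^(?:%s)" % alternation(['good', 'brilliant', '+s\\ol[i}d']))
plain = re.compile("^(?:%s)" % alternation(['good', 'brilliant', '+solid']))

# With escaping, the odd item only matches its own literal text.
weird_hits = weird.findall("solid QWERT")
# With escaping, the leading "+" is an ordinary character.
plain_hits = plain.findall("+solid QWERT")
```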

mrabarnett · 2011-06-16T08:57:51Z

Original comment by Anonymous.

You're not missing anything. They should match as you say. But I'm seeing a different result (Ubuntu 10 with Python 2.6):

>>> regex.findall(r"^\L<options>", "solid QWERT", options=set(['good', 'brilliant', '+s\\ol[i}d']))
[]
>>> regex.findall(r"^\L<options>", "solid QWERT", options=['good', 'brilliant', '+s\\ol[i}d'])
[]
>>> regex.findall(r"^\L<options>", "+solid QWERT", options=set(['good', 'brilliant', '+solid']))
[]
>>> regex.search(r"^\L<options>", "+solid QWERT", options=set(['good', 'brilliant', '+solid']))
>>> regex.search(r"^\L<options>", "+solid QWERT", options=set(['good', 'brilliant', 'solid']))
>>> regex.search(r"^\L<options>", "solid QWERT", options=['good', 'brilliant', '+s\\ol[i}d'])
>>>

mrabarnett · 2011-06-16T10:34:17Z

Original comment by Anonymous.

This is an interesting one.

If the pattern is known, it fetches from the cache of already-compiled regexes, but the set of strings is different.

Should it treat the set as part of the pattern and recompile, much as it does with flags?

mrabarnett · 2011-06-16T13:19:34Z

Original comment by Anonymous.

Fixed. The regex will be recompiled.
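The caching problem and its fix can be illustrated with a small stand-in built on the standard `re` module. This is a sketch under stated assumptions, not regex's actual cache code: `compile_named` and `_cache` are hypothetical names, and `\L<name>` is emulated by substituting an escaped alternation into the pattern text.

```python
import re

_cache = {}

def compile_named(pattern, flags=0, **named_lists):
    """Compile `pattern`, expanding each \\L<name> from the keyword arguments."""
    # The key covers the contents of every named list, not just the pattern
    # text and flags, so a different set forces a fresh compile.
    key = (pattern, flags,
           frozenset((name, frozenset(items)) for name, items in named_lists.items()))
    compiled = _cache.get(key)
    if compiled is None:
        expanded = pattern
        for name, items in named_lists.items():
            # Longest items first so the alternation prefers the longest match.
            ordered = sorted(items, key=len, reverse=True)
            group = "(?:%s)" % "|".join(re.escape(s) for s in ordered)
            expanded = expanded.replace(r"\L<%s>" % name, group)
        compiled = _cache[key] = re.compile(expanded, flags)
    return compiled

a = compile_named(r"^\L<options>", options=set(['good', 'brilliant', '+solid']))
b = compile_named(r"^\L<options>", options=set(['good', 'brilliant', 'solid']))
```

Keying on the pattern text alone would hand back `a` for the second call, which is the stale-cache behaviour seen in the transcripts above.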

mrabarnett · 2011-06-16T18:55:52Z

Original comment by Anonymous.

Yes, I think that's the right call. The named keyword argument is local to the particular compile() or search() or findall() call. Different calls may use the same keyword name for different values.

mrabarnett · 2011-06-17T15:47:31Z

Original comment by Anonymous.

Sorry for the delayed reaction (I had assumed I would be notified of further comments after my post).
I'd like to confirm the fix in regex-0.1.20110616; I agree with the current solution.
thanks;
vbr

mrabarnett closed this as completed Jun 17, 2011
