matchall is very slow #3719

johnmyleswhite · 2013-07-15T18:55:54Z

I recently tried translating Norvig's spellchecker into Julia. The following example shows that Julia's string performance needs a lot of work.

To get started, download the file http://norvig.com/ipython/big.txt for tokenization.

We'll tokenize it in Julia first:

function tokenize()
    BIG = readall("big.txt");
    tokens(text::String) =
      [m.match for m in matchall(r"[a-z]+", lowercase(text))]
    @elapsed t = tokens(BIG)
end
tokenize()

This takes 10 seconds on my machine.

In contrast, the following Python code is simpler (because there's no notion that matchall won't return strings directly) and 20x faster.

import re
import time
BIG = file('big.txt').read()
def tokens(text):
    return re.findall('[a-z]+', text.lower())

s = time.time()
t = tokens(BIG)
e = time.time()
e - s

This takes 0.4 seconds on my machine.

The text was updated successfully, but these errors were encountered:

dcjones · 2013-07-15T22:26:49Z

I investigated this a little. From the profiler output it looks like a lot of time is spent creating and destroying the million or so match objects. Pythons findall avoids that be returning an array of strings.

Here's a julia equivalent.

function matchingstrings(re::Regex, str::ByteString)
    matches = String[]
    offset = 0
    opts = re.options & PCRE.EXECUTE_MASK
    ovec = Array(Int32, 3)
    while true
        result = ccall((:pcre_exec, :libpcre), Int32,
                       (Ptr{Void}, Ptr{Void}, Ptr{Uint8}, Int32,
                       Int32, Int32, Ptr{Int32}, Int32),
                       re.regex, C_NULL, str, length(str.data),
                       offset, opts, ovec, 3)
        if result >= 0
            push!(matches, str[ovec[1]+1:ovec[2]])
            offset = ovec[2] + 1
        else
            break
        end
    end
    matches
end

Using matchingstrings(r"[a-z]+", lowercase(text)) in place of [m.match for m in matchall(r"[a-z]+", lowercase(text))] gives the same results but 0.36 seconds instead of 6.1 seconds for me.

The python is 0.33 for me, so it's a pretty close match for speed.

Given the performance difference, maybe we should include something like this?

p.s.
pypy is 0.49 so it's faster than some pythons at least!

dcjones · 2013-07-15T22:37:51Z

Here's 0.22 seconds by returning SubStrings rather than Strings.

function matchingstrings(re::Regex, str::ByteString)
    matches = SubString[]
    offset = 0
    opts = re.options & PCRE.EXECUTE_MASK
    ovec = Array(Int32, 3)
    while true
        result = ccall((:pcre_exec, :libpcre), Int32,
                       (Ptr{Void}, Ptr{Void}, Ptr{Uint8}, Int32,
                       Int32, Int32, Ptr{Int32}, Int32),
                       re.regex, C_NULL, str, length(str.data),
                       offset, opts, ovec, 3)
        if result >= 0
            push!(matches, SubString(str, ovec[1]+1, ovec[2]))
            offset = ovec[2] + 1
        else
            break
        end
    end
    matches
end

StefanKarpinski · 2013-07-15T22:43:07Z

Nice work, @dcjones. Great that we're within striking distance for speed here. Maybe we should just change it so that matchall returns an array of strings and if you want to actually get match objects you have to use eachmatch – getting an array of match objects is just a matter of doing collect(eachmatch(...)) anyway. Thanks to @nutsiepully's recent work, those are equivalent. We should be very careful to make sure that a function that returns all matching substrings is always equivalent to doing [m.match for m=matchall(r,s)].

StefanKarpinski · 2013-07-15T22:43:49Z

Ah, we really need to make more of our string stuff non-copying.

johnmyleswhite · 2013-07-15T23:03:06Z

Awesome job, @dcjones. I personally think matchall returning strings is more useful, but I'd be open to giving this another name. I also agree that most of the string functions should avoid making copies.

StefanKarpinski · 2013-07-15T23:04:56Z

The idea that we've kicked around is making all UTF8Strings (and other string types) have an offset as well as a length so that you can take a substring and get something of the normal UTF8String type.

johnmyleswhite · 2013-07-15T23:05:43Z

Making UTF8Strings more like C pointers sounds like a great idea to me.

StefanKarpinski · 2013-07-15T23:10:46Z

Well, they'd still have a bit more baggage since they'd have to contain a reference to an array object, an offset into that array to start at, and a length. That's significantly more stuff than just a pointer.

quinnj · 2013-07-15T23:13:56Z

We also talked about having a searchall function at one point that would
be akin to matchall, but in line with the search functions. searchall
could return an array of substrings implementing one of the solutions above.

-Jacob

On Mon, Jul 15, 2013 at 6:10 PM, Stefan Karpinski
notifications@github.comwrote:

Well, they'd still have a bit more baggage since they'd have to contain a
reference to an array object, an offset into that array to start at, and a
length. That's significantly more stuff than just a pointer.

—
Reply to this email directly or view it on GitHubhttps://github.com//issues/3719#issuecomment-21010586
.

StefanKarpinski · 2013-07-15T23:50:38Z

The search function returns the index or range of indices that the match occurs at, however, not the matching substring itself.

nutsiepully · 2013-07-16T12:54:08Z

The matchall function internally uses the match function which returns a RegexMatch object after a call to the PCRE library. So in case we want a version of matchall that returns strings, we would have to bypass that and call the PCRE library directly.
However, I'm not sure if the slowness is only due to the creation of RegexMatch as compared to a String. Both do string extraction, except that RegexMatch maintians the regex groups as well. The matchall function changes some regex behavior occasionally to address the edge cases #3525. Comparing the old iterator without these fixes (which follows the same matching strategy above) would suggest whether it's only the object creation or also the matching change that's causing the performance to drop.
I don't have access to a dev machine for a week and would not be able to check it out.

johnmyleswhite · 2013-08-09T04:02:01Z

I'd love to see @dcjones change get into 0.2.

dcjones · 2013-08-09T04:17:26Z

I nearly forgot about this. I'll make a PR later tonight.

randyzwitch mentioned this issue Jul 16, 2013

String comparison/loop performance slow #3733

Closed

dcjones mentioned this issue Aug 9, 2013

RFC: Regex improvements #4002

Merged

JeffBezanson closed this as completed Aug 10, 2013

swadey mentioned this issue Jun 1, 2014

match() is slow #7066

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

matchall is very slow #3719

matchall is very slow #3719

johnmyleswhite commented Jul 15, 2013

dcjones commented Jul 15, 2013

dcjones commented Jul 15, 2013

StefanKarpinski commented Jul 15, 2013

StefanKarpinski commented Jul 15, 2013

johnmyleswhite commented Jul 15, 2013

StefanKarpinski commented Jul 15, 2013

johnmyleswhite commented Jul 15, 2013

StefanKarpinski commented Jul 15, 2013

quinnj commented Jul 15, 2013

StefanKarpinski commented Jul 15, 2013

nutsiepully commented Jul 16, 2013

johnmyleswhite commented Aug 9, 2013

dcjones commented Aug 9, 2013

matchall is very slow #3719

matchall is very slow #3719

Comments

johnmyleswhite commented Jul 15, 2013

dcjones commented Jul 15, 2013

dcjones commented Jul 15, 2013

StefanKarpinski commented Jul 15, 2013

StefanKarpinski commented Jul 15, 2013

johnmyleswhite commented Jul 15, 2013

StefanKarpinski commented Jul 15, 2013

johnmyleswhite commented Jul 15, 2013

StefanKarpinski commented Jul 15, 2013

quinnj commented Jul 15, 2013

StefanKarpinski commented Jul 15, 2013

nutsiepully commented Jul 16, 2013

johnmyleswhite commented Aug 9, 2013

dcjones commented Aug 9, 2013