Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

matchall is very slow #3719

Closed
johnmyleswhite opened this issue Jul 15, 2013 · 13 comments
Closed

matchall is very slow #3719

johnmyleswhite opened this issue Jul 15, 2013 · 13 comments
Labels
performance Must go faster

Comments

@johnmyleswhite
Copy link
Member

I recently tried translating Norvig's spellchecker into Julia. The following example shows that Julia's string performance needs a lot of work.

To get started, download the file http://norvig.com/ipython/big.txt for tokenization.

We'll tokenize it in Julia first:

function tokenize()
    BIG = readall("big.txt");
    tokens(text::String) =
      [m.match for m in matchall(r"[a-z]+", lowercase(text))]
    @elapsed t = tokens(BIG)
end
tokenize()

This takes 10 seconds on my machine.

In contrast, the following Python code is simpler (because there's no notion that matchall won't return strings directly) and 20x faster.

import re
import time
BIG = file('big.txt').read()
def tokens(text):
    return re.findall('[a-z]+', text.lower())

s = time.time()
t = tokens(BIG)
e = time.time()
e - s

This takes 0.4 seconds on my machine.

@dcjones
Copy link
Contributor

dcjones commented Jul 15, 2013

I investigated this a little. From the profiler output it looks like a lot of time is spent creating and destroying the million or so match objects. Pythons findall avoids that be returning an array of strings.

Here's a julia equivalent.

function matchingstrings(re::Regex, str::ByteString)
    matches = String[]
    offset = 0
    opts = re.options & PCRE.EXECUTE_MASK
    ovec = Array(Int32, 3)
    while true
        result = ccall((:pcre_exec, :libpcre), Int32,
                       (Ptr{Void}, Ptr{Void}, Ptr{Uint8}, Int32,
                       Int32, Int32, Ptr{Int32}, Int32),
                       re.regex, C_NULL, str, length(str.data),
                       offset, opts, ovec, 3)
        if result >= 0
            push!(matches, str[ovec[1]+1:ovec[2]])
            offset = ovec[2] + 1
        else
            break
        end
    end
    matches
end

Using matchingstrings(r"[a-z]+", lowercase(text)) in place of [m.match for m in matchall(r"[a-z]+", lowercase(text))] gives the same results but 0.36 seconds instead of 6.1 seconds for me.

The python is 0.33 for me, so it's a pretty close match for speed.

Given the performance difference, maybe we should include something like this?

p.s.
pypy is 0.49 so it's faster than some pythons at least!

@dcjones
Copy link
Contributor

dcjones commented Jul 15, 2013

Here's 0.22 seconds by returning SubStrings rather than Strings.

function matchingstrings(re::Regex, str::ByteString)
    matches = SubString[]
    offset = 0
    opts = re.options & PCRE.EXECUTE_MASK
    ovec = Array(Int32, 3)
    while true
        result = ccall((:pcre_exec, :libpcre), Int32,
                       (Ptr{Void}, Ptr{Void}, Ptr{Uint8}, Int32,
                       Int32, Int32, Ptr{Int32}, Int32),
                       re.regex, C_NULL, str, length(str.data),
                       offset, opts, ovec, 3)
        if result >= 0
            push!(matches, SubString(str, ovec[1]+1, ovec[2]))
            offset = ovec[2] + 1
        else
            break
        end
    end
    matches
end

@StefanKarpinski
Copy link
Sponsor Member

Nice work, @dcjones. Great that we're within striking distance for speed here. Maybe we should just change it so that matchall returns an array of strings and if you want to actually get match objects you have to use eachmatch – getting an array of match objects is just a matter of doing collect(eachmatch(...)) anyway. Thanks to @nutsiepully's recent work, those are equivalent. We should be very careful to make sure that a function that returns all matching substrings is always equivalent to doing [m.match for m=matchall(r,s)].

@StefanKarpinski
Copy link
Sponsor Member

Ah, we really need to make more of our string stuff non-copying.

@johnmyleswhite
Copy link
Member Author

Awesome job, @dcjones. I personally think matchall returning strings is more useful, but I'd be open to giving this another name. I also agree that most of the string functions should avoid making copies.

@StefanKarpinski
Copy link
Sponsor Member

The idea that we've kicked around is making all UTF8Strings (and other string types) have an offset as well as a length so that you can take a substring and get something of the normal UTF8String type.

@johnmyleswhite
Copy link
Member Author

Making UTF8Strings more like C pointers sounds like a great idea to me.

@StefanKarpinski
Copy link
Sponsor Member

Well, they'd still have a bit more baggage since they'd have to contain a reference to an array object, an offset into that array to start at, and a length. That's significantly more stuff than just a pointer.

@quinnj
Copy link
Member

quinnj commented Jul 15, 2013

We also talked about having a searchall function at one point that would
be akin to matchall, but in line with the search functions. searchall
could return an array of substrings implementing one of the solutions above.

-Jacob

On Mon, Jul 15, 2013 at 6:10 PM, Stefan Karpinski
notifications@github.comwrote:

Well, they'd still have a bit more baggage since they'd have to contain a
reference to an array object, an offset into that array to start at, and a
length. That's significantly more stuff than just a pointer.


Reply to this email directly or view it on GitHubhttps://github.com//issues/3719#issuecomment-21010586
.

@StefanKarpinski
Copy link
Sponsor Member

The search function returns the index or range of indices that the match occurs at, however, not the matching substring itself.

@nutsiepully
Copy link
Contributor

The matchall function internally uses the match function which returns a RegexMatch object after a call to the PCRE library. So in case we want a version of matchall that returns strings, we would have to bypass that and call the PCRE library directly.
However, I'm not sure if the slowness is only due to the creation of RegexMatch as compared to a String. Both do string extraction, except that RegexMatch maintians the regex groups as well. The matchall function changes some regex behavior occasionally to address the edge cases #3525. Comparing the old iterator without these fixes (which follows the same matching strategy above) would suggest whether it's only the object creation or also the matching change that's causing the performance to drop.
I don't have access to a dev machine for a week and would not be able to check it out.

@johnmyleswhite
Copy link
Member Author

I'd love to see @dcjones change get into 0.2.

@dcjones
Copy link
Contributor

dcjones commented Aug 9, 2013

I nearly forgot about this. I'll make a PR later tonight.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Must go faster
Projects
None yet
Development

No branches or pull requests

6 participants