Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gsub/2 currently too slow for split/1 when first arg is not a regex #576

Closed
pkoppstein opened this issue Sep 14, 2014 · 4 comments
Closed

Comments

@pkoppstein
Copy link
Contributor

gsub/2 is very slow, even when the first argument is not a regex.

For example, to split unixdict.txt (*) (read in as a single string) into separate lines takes about 23 minutes using gsub/2, compared to about 0.4 secs using jq 1.4 split/1.

That is, the ratio of running times is roughly 3,400:1!

If this can't be fixed, then split/1 should not be implemented directly using gsub.

Using ruby's gsub in a similar way to read the same file yields times very close to jq 1.4 split/1. So I'm wondering if the choice of PCRE-mode for Oniguruma is the problem. Just wondering.

@wtlangford - could you please look into this? Thanks.

@nicowilliams
Copy link
Contributor

I'll take a look at this this weekend.

@wtlangford
Copy link
Contributor

This is definitely a function of using the regex engine.

Also! Since split was originally implemented using jv_string_split, we have a problem.
Originally, splitting "ABCD.EFGH" on . would return ["ABCD","EFGH"]. Now it returns ["","","","","","","","","",""], due to the regex engine.

I think the default behavior here should be non-regex splitting with the option to use regex to split.

@wtlangford
Copy link
Contributor

Also worth noting- PCRE mode isn't the problem, as much as the problem is using regexes to split is necessarily slower than simple equality comparisons.

@pkoppstein
Copy link
Contributor Author

@nicowilliams wrote:

I think the default behavior here should be non-regex splitting with the option to use regex to split.

Yes. Given that it's unlikely Oniguruma-based splitting is going to be acceptably fast anytime soon, I'd suggest taking the easy path: reverting split/1, and using splits/{1,2} for the regex-based splitting. This doesn't preclude something bolder in the future, and adequately resolves all the issues (performance of split/1; backward-compatibility; stream-vs-array output) for now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants