gsub/2 currently too slow for split/1 when first arg is not a regex #576

pkoppstein · 2014-09-14T19:14:23Z

gsub/2 is very slow, even when the first argument is not a regex.

For example, splitting unixdict.txt (*), read in as a single string, into separate lines takes about 23 minutes using gsub/2, compared with about 0.4 seconds using jq 1.4's split/1.

That is, the ratio of running times is roughly 3,400:1!

If this can't be fixed, then split/1 should not be implemented directly using gsub.

Using ruby's gsub in a similar way to read the same file yields times very close to jq 1.4 split/1. So I'm wondering if the choice of PCRE-mode for Oniguruma is the problem. Just wondering.

@wtlangford - could you please look into this? Thanks.

(*) http://www.puzzlers.org/pub/wordlists/unixdict.txt

nicowilliams · 2014-10-03T15:27:39Z

I'll take a look at this this weekend.

wtlangford · 2014-10-03T15:43:04Z

This is definitely a function of using the regex engine.

Also! Since split was originally implemented using jv_string_split, we have a problem.
Originally, splitting "ABCD.EFGH" on "." would return ["ABCD","EFGH"]. Now it returns ["","","","","","","","","",""], because the regex engine treats an unescaped "." as matching any character, so all nine characters act as separators.

I think the default behavior here should be non-regex splitting with the option to use regex to split.
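The difference between literal and regex splitting described above can be reproduced outside jq. The following is a minimal sketch in Python, used only as an analogy: it relies on Python's `str.split` and `re` module, not on jq's `jv_string_split` or Oniguruma, and the variable names are illustrative.

```python
import re

text = "ABCD.EFGH"

# Literal split: "." is compared character-for-character with the input.
literal = text.split(".")

# Regex split: an unescaped "." matches any character, so every
# character is consumed as a separator and only empty strings remain.
as_regex = re.split(r".", text)

# Escaping the separator restores the literal meaning under a regex engine.
escaped = re.split(re.escape("."), text)

print(literal)
print(as_regex)
print(escaped)
```

Keeping the literal split as the default and making regex splitting opt-in, as suggested, avoids surprising callers whose separators happen to contain regex metacharacters.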

wtlangford · 2014-10-03T15:43:58Z

Also worth noting: PCRE mode isn't the problem so much as the fact that splitting with a regex is necessarily slower than simple equality comparisons.
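One way to check this kind of claim is to time a literal split against a regex split on the same input and confirm that both return the same result. The sketch below does this in Python with a synthetic word list; it is an assumption-laden stand-in (Python's `re` rather than Oniguruma, generated words rather than unixdict.txt), so the timings it prints say nothing about jq's actual ratio.

```python
import re
import timeit

# Synthetic stand-in for a word list read in as one string (not unixdict.txt).
text = "\n".join(f"word{i}" for i in range(20000))

# Compile once so the measurement covers matching, not pattern compilation.
NEWLINE = re.compile("\n")


def split_literal(s):
    """Split on a literal newline using plain substring search."""
    return s.split("\n")


def split_regex(s):
    """Split on the same separator, but through the regex engine."""
    return NEWLINE.split(s)


literal_secs = timeit.timeit(lambda: split_literal(text), number=20)
regex_secs = timeit.timeit(lambda: split_regex(text), number=20)

print(f"literal: {literal_secs:.4f}s  regex: {regex_secs:.4f}s")
```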

pkoppstein · 2014-10-03T21:35:45Z

@wtlangford wrote:

> I think the default behavior here should be non-regex splitting with the option to use regex to split.

Yes. Given that it's unlikely Oniguruma-based splitting is going to be acceptably fast anytime soon, I'd suggest taking the easy path: reverting split/1, and using splits/{1,2} for the regex-based splitting. This doesn't preclude something bolder in the future, and adequately resolves all the issues (performance of split/1; backward-compatibility; stream-vs-array output) for now.

pkoppstein mentioned this issue Oct 1, 2014

bug fix for sub/2 #586

Closed

nicowilliams closed this as completed in 1796a71 Oct 3, 2014

joelpurra mentioned this issue Jan 10, 2015

Split output array length inconsistency #552

Closed

dtolnay added the performance label Jul 27, 2015

dtolnay added this to the 1.5 release milestone Jul 27, 2015
