Appropriately fast `LIKE` pattern compilation #286

dlurton · 2020-09-21T02:41:02Z

Implements #284

This compiles the LIKE pattern `%<n>%`, where `<n>` is 8000 `!` characters, in ~60ms on my local machine. The previous implementation took too long to measure on my local machine, even after #279, because it used an unoptimized state machine. I do not know if there is a name for the algorithm in this PR, but it does not use a state machine.

I haven't yet analysed the performance of evaluating the compiled pattern. I will need to do that before this can be merged.
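For orientation, here is a minimal sketch of what compiling a LIKE pattern into a flat list of parts in a single pass, with no state machine, could look like. The names `Part` and `compile` are hypothetical and this is not the code in `PatternPart.kt`; it also assumes no escape character is in use.

```kotlin
// Hypothetical sketch, not the PR's implementation. Escape characters are not handled.
sealed class Part {
    object AnyOne : Part()                                // `_`: exactly one code point
    object ZeroOrMore : Part()                            // `%`: any run of code points
    data class Exact(val codePoints: List<Int>) : Part()  // a literal run
}

// One left-to-right pass over the pattern, so the work is linear in its length.
fun compile(pattern: String): List<Part> {
    val parts = ArrayList<Part>()
    val literal = ArrayList<Int>()

    // Emit any pending literal run as a single Exact part.
    fun flush() {
        if (literal.isNotEmpty()) {
            parts.add(Part.Exact(literal.toList()))
            literal.clear()
        }
    }

    for (cp in pattern.codePoints().toArray()) {
        when (cp) {
            '%'.code -> { flush(); parts.add(Part.ZeroOrMore) }
            '_'.code -> { flush(); parts.add(Part.AnyOne) }
            else -> literal.add(cp)
        }
    }
    flush()
    return parts
}
```

Under these assumptions, a pattern such as `%` followed by 8000 `!` characters and a trailing `%` compiles to just three parts, which is why pattern length alone should not make compilation expensive.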

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

codecov-commenter · 2020-09-21T03:01:29Z

Codecov Report

Merging #286 into master will increase coverage by 0.64%.
The diff coverage is 92.07%.

@@             Coverage Diff              @@
##             master     #286      +/-   ##
============================================
+ Coverage     82.44%   83.08%   +0.64%     
- Complexity     1202     1287      +85     
============================================
  Files           155      157       +2     
  Lines          9283     9745     +462     
  Branches       1522     1647     +125     
============================================
+ Hits           7653     8097     +444     
- Misses         1175     1190      +15     
- Partials        455      458       +3

| Flag | Coverage Δ | Complexity Δ |
|---|---|---|
| #CLI | `18.11% <ø> (ø)` | `19.00 <ø> (ø)` |
| #EXAMPLES | `76.01% <ø> (ø)` | `27.00 <ø> (ø)` |
| #LANG | `85.72% <92.07%> (+0.59%)` | `1084.00 <21.00> (+85.00)` |
| #PTS | `100.00% <ø> (ø)` | `0.00 <ø> (ø)` |
| #TEST_SCRIPT | `79.68% <ø> (ø)` | `157.00 <ø> (ø)` |

Flags with carried forward coverage won't be shown.

| Impacted Files | Coverage Δ | Complexity Δ |
|---|---|---|
| ...g/partiql/lang/eval/like/CheckpointIteratorImpl.kt | `88.88% <88.88%> (ø)` | `7.00 <7.00> (?)` |
| ...tiql/lang/eval/like/CodepointCheckpointIterator.kt | `90.90% <90.90%> (ø)` | `8.00 <8.00> (?)` |
| lang/src/org/partiql/lang/eval/like/PatternPart.kt | `92.06% <92.06%> (ø)` | `0.00 <0.00> (?)` |
| ...ng/src/org/partiql/lang/eval/EvaluatingCompiler.kt | `83.59% <94.44%> (-0.05%)` | `150.00 <6.00> (ø)` |
| lang/src/org/partiql/lang/Exceptions.kt | `83.33% <0.00%> (+1.51%)` | `0.00% <0.00%> (ø%)` |
| lang/src/org/partiql/lang/syntax/SqlParser.kt | `84.84% <0.00%> (+3.55%)` | `287.00% <0.00%> (+70.00%)` |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 67c0e1c...85d494b.

dlurton · 2020-09-21T18:24:28Z

Added some of my own comments which I will address along with the other reviewer's comments.

dlurton · 2020-09-23T00:33:18Z

I've done a fair bit of checking on the evaluation-time performance and memory consumption. Both seem on par with the original LIKE implementation. This is ready to merge, pending review.

therapon

LGTM.

Not really an issue for me, more of a limitation: the recursive call pattern for dealing with `%` can cause a stack overflow. The pathological example would be a pattern with a long series of leading `%`.

    @Test
    fun stressTest() {
        executePattern(parsePattern("%".repeat(4000) + "a", null), "a")
    }

We could "compile" this into a pattern with a single leading `%`.
If users end up writing such a pattern (I hope they do not), they can do the rewrite themselves to get around it. :)
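To illustrate the limitation, here is a minimal sketch of a matcher that loops over literal and `_` parts but recurses once per `%`. The function `likeMatches` is hypothetical and is not the PR's `executePattern`; it works on UTF-16 chars rather than code points and ignores escape characters.

```kotlin
// Hypothetical sketch, not the PR's executePattern. No escape handling.
fun likeMatches(pattern: String, input: String, pStart: Int = 0, iStart: Int = 0): Boolean {
    var p = pStart
    var i = iStart
    while (p < pattern.length) {
        when (val c = pattern[p]) {
            '%' -> {
                // Try every possible amount of input for this `%`.
                // Each attempt is a recursive call, so the stack depth
                // grows with the number of `%` in the pattern.
                for (next in i..input.length) {
                    if (likeMatches(pattern, input, p + 1, next)) return true
                }
                return false
            }
            '_' -> if (i >= input.length) return false
            else -> if (i >= input.length || input[i] != c) return false
        }
        p++
        i++
    }
    // The pattern is exhausted; the match succeeds only if the input is too.
    return i == input.length
}
```

In this sketch, a pattern that starts with thousands of `%` nests one stack frame per `%` before it reaches the first literal, which is the shape of the stress test above.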

dlurton · 2020-09-24T19:23:32Z

I've added a change that treats multiple consecutive `%` the same as a single `%`, which will mitigate this somewhat.
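A minimal sketch of the idea, working on the raw pattern string: `collapseWildcards` is a hypothetical helper, not the change that was actually committed, and it ignores any escape character.

```kotlin
// Hypothetical helper, not the committed change. No escape handling.
fun collapseWildcards(pattern: String): String {
    val out = StringBuilder()
    for (ch in pattern) {
        // Skip a `%` that directly follows another `%`: a run of `%`
        // matches exactly the same inputs as a single `%`.
        if (ch == '%' && out.isNotEmpty() && out.last() == '%') continue
        out.append(ch)
    }
    return out.toString()
}
```

With this rewrite the stress-test pattern `"%".repeat(4000) + "a"` becomes `%a`. Patterns that alternate `%` with other characters are unchanged, so the recursion depth can still grow with the number of separate `%` runs.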

Implements #284. Replaces the previous LIKE implementation, which is slow when compiling large patterns containing the wildcard character `%`, with another implementation that compiles patterns in linear time and has similar performance characteristics at evaluation time.

#32 appears to have been inadvertently fixed by #286. This commit adds a regression test for #32.

dlurton added 6 commits September 14, 2020 16:46

Optimize LIKE pattern compilation

9fc3f8a

In two ways:
- Change fold/union operations to accumulate to a single list.
- Replace *ordered* sets and maps with hash sets and maps.

This results in a > 10x improvement in compiling large LIKE patterns (i.e. 1000 characters and up).

Make LIKE pattern compiling interruptible.

d53ef3d

Seems mostly functional, needs optimization

50d4e16

Fully functional, needs cleanup

db18528

Cleanup a little, remove LikeMatchingAutomata

046a167

Final cleanups

1f36271

dlurton requested a review from therapon September 21, 2020 02:41

dlurton requested review from abhikuhikar and alancai98 September 21, 2020 17:11