
Build out rule crawling mechanisms #3717

Merged (53 commits, Aug 15, 2022)
Conversation

@alanmcruickshank (Member) commented Aug 7, 2022

This builds on the work started with CrawlBehavior in the base rules file. I've extended that into a new module with a few variants on the idea and some more flexibility for specific rules. The newest logic is in SegmentSeekerCrawler, which aggressively prunes the tree if segments of particular types do not exist anywhere within a segment or its children. On the less clever side, I've also introduced RootOnlyCrawler, which will be relevant for rules that want to handle their own crawling entirely (which is what I'm thinking for reflow application, see #3673). To facilitate this, RuleContext has also been moved (mostly untouched) to its own file.

To achieve this I've done a few other things:

  • SegmentMetaclass to allow caching of _class_types.
  • Building out some more mechanics on BaseSegment for accessing types.
  • A repr() method for LintResult, and additional logging in several places to aid debugging.
  • Stripped out the old CrawlBehavior class.
  • Removed .is_final_segment(), which is now unused (I considered just putting a pragma: no cover on it, but we can always recreate the method from the git history if needed later).
  • context.raw_segment_pre has been removed, partly because it's hard to reimplement in this model, but mostly because the new crawling approach is already a lot more efficient.
  • Refactoring of all rules to work better with the new structure. Significant refactorings on:
    • L016
    • L039
    • L052
    • L053
    • L063
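The _class_types caching mentioned above can be sketched with a metaclass along these lines. This is a minimal illustration only: the class hierarchy and the exact mechanics are assumptions for the sketch, not the actual sqlfluff code.

```python
# Sketch: compute a class's full set of types once, at class-creation
# time, instead of walking the hierarchy on every lookup.
# Illustrative names; not the real sqlfluff implementation.
class SegmentMetaclass(type):
    def __new__(mcs, name, bases, namespace):
        cls = super().__new__(mcs, name, bases, namespace)
        types = set()
        seg_type = getattr(cls, "type", None)
        if seg_type:
            types.add(seg_type)
        # Inherit the cached types of all bases.
        for base in bases:
            types |= getattr(base, "_class_types", frozenset())
        cls._class_types = frozenset(types)
        return cls


class BaseSegment(metaclass=SegmentMetaclass):
    type = "base"


class StatementSegment(BaseSegment):
    type = "statement"


class SelectStatementSegment(StatementSegment):
    type = "select_statement"


print(sorted(SelectStatementSegment._class_types))
# ['base', 'select_statement', 'statement']
```

Because the set is frozen at class creation, type-membership checks during crawling become a cheap set lookup rather than repeated traversal.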

This is a BIG pull request, but I'm not sure there's a simple way of doing it in less than one big push.
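As a rough illustration of the pruning idea behind SegmentSeekerCrawler: if each segment caches the set of types present anywhere in its subtree, a crawler can skip whole branches that cannot contain a match. The Segment class, its descendant_types cache, and the crawl signature below are all illustrative assumptions, not the actual sqlfluff API.

```python
# Minimal sketch of type-based tree pruning, loosely modelled on the
# SegmentSeekerCrawler idea described above. Not the real sqlfluff API.
from typing import Iterator, Set


class Segment:
    def __init__(self, seg_type: str, children=()):
        self.type = seg_type
        self.children = list(children)
        # Cache the set of all types present in this subtree so a
        # crawler can skip whole branches cheaply.
        self.descendant_types: Set[str] = {seg_type}
        for child in self.children:
            self.descendant_types |= child.descendant_types


class SegmentSeekerCrawler:
    """Yield only segments of the requested types, pruning branches
    whose cached type sets show no possible match below."""

    def __init__(self, target_types: Set[str]):
        self.target_types = target_types

    def crawl(self, segment: Segment) -> Iterator[Segment]:
        # Prune: nothing of interest anywhere in this subtree.
        if not self.target_types & segment.descendant_types:
            return
        if segment.type in self.target_types:
            yield segment
        for child in segment.children:
            yield from self.crawl(child)


tree = Segment("file", [
    Segment("statement", [Segment("comma"), Segment("keyword")]),
    Segment("statement", [Segment("keyword")]),  # pruned: no comma below
])
found = list(SegmentSeekerCrawler({"comma"}).crawl(tree))
print([s.type for s in found])  # ['comma']
```

A RootOnlyCrawler in this model would simply yield the root segment once and leave all traversal to the rule itself.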

@barrywhart (Member)

I'll do a proper review later. One quick thought about raw_segment_pre: it would be pretty easy to implement if each segment knew its parent segment (happy to expand on this idea if you like).

@alanmcruickshank (Member, Author)

> I'll do a proper review later. One quick thought about raw_segment_pre: it would be pretty easy to implement if each segment knew its parent segment (happy to expand on this idea if you like).

Go on?

The problem I see is that at the moment a segment isn't aware of its parent; all the references go down the tree, not up. While crawling we do have good access to the parent, but the previous raw segment might not be in the parent; it might be several layers up. Combine that with more aggressive skipping, where we don't process every raw segment anyway, and the easiest way to get the previous raw would be to call .raw_segments and take the last one (i.e. .raw_segments[-1] every time we skip). If we're calling .raw_segments anyway, then I feel like we're doing almost as much processing as just tracking the raw stack - unless the only issue is the continuous tuple arithmetic, in which case maybe we just use a list rather than a tuple and make the raw stack a more efficient thing to provide.
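The trade-off described above can be shown with a toy example: recomputing a flattened raw-segment sequence on demand repeats work proportional to the subtree size, while an incrementally maintained raw stack appends each leaf exactly once. The classes below are illustrative stand-ins, not sqlfluff's real segment types.

```python
# Toy illustration of the trade-off discussed above: flattening on
# demand vs. incrementally tracking a "raw stack" during the crawl.
from typing import List


class Raw:
    def __init__(self, raw: str):
        self.raw = raw

    @property
    def raw_segments(self) -> List["Raw"]:
        return [self]  # a raw segment is its own only leaf


class Node:
    def __init__(self, children):
        self.children = children

    @property
    def raw_segments(self) -> List[Raw]:
        # Recomputed from scratch on every access: O(subtree size).
        out: List[Raw] = []
        for child in self.children:
            out.extend(child.raw_segments)
        return out


tree = Node([Raw("SELECT"), Node([Raw(" "), Raw("1")])])

# Option 1: get the previous raw via flattening (repeated work).
previous_raw = tree.raw_segments[-1]

# Option 2: maintain a raw stack incrementally while crawling
# (each raw segment appended exactly once).
raw_stack: List[Raw] = []


def crawl(seg) -> None:
    if isinstance(seg, Raw):
        raw_stack.append(seg)
    else:
        for child in seg.children:
            crawl(child)


crawl(tree)
print(previous_raw.raw, raw_stack[-1].raw)  # 1 1
```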

@codecov bot commented Aug 7, 2022

Codecov Report

Merging #3717 (edb1251) into main (40f3d81) will not change coverage.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##              main     #3717    +/-   ##
==========================================
  Coverage   100.00%   100.00%            
==========================================
  Files          176       178     +2     
  Lines        13466     13603   +137     
==========================================
+ Hits         13466     13603   +137     
Impacted Files Coverage Δ
src/sqlfluff/core/rules/functional/segments.py 100.00% <ø> (ø)
src/sqlfluff/rules/L027.py 100.00% <ø> (ø)
src/sqlfluff/rules/L049.py 100.00% <ø> (ø)
src/sqlfluff/core/parser/segments/base.py 100.00% <100.00%> (ø)
src/sqlfluff/core/parser/segments/raw.py 100.00% <100.00%> (ø)
src/sqlfluff/core/rules/__init__.py 100.00% <100.00%> (ø)
src/sqlfluff/core/rules/base.py 100.00% <100.00%> (ø)
src/sqlfluff/core/rules/context.py 100.00% <100.00%> (ø)
src/sqlfluff/core/rules/crawlers.py 100.00% <100.00%> (ø)
src/sqlfluff/rules/L001.py 100.00% <100.00%> (ø)
... and 64 more


@barrywhart (Member)

Definitely not a necessity. It was a good optimization for the current design. We can get rid of it for now. 👍

@barrywhart (Member) left a comment

Looks great so far! A few small tactical suggestions, but overall direction is 💯.

@WittierDinosaur, are there any specific rules you've noticed use a lot of runtime that could benefit from using the new crawl behavior? Would you be interested in testing this at some point on some of your large SQL files?

Review threads (resolved) on:
  • src/sqlfluff/core/parser/segments/base.py (×3)
  • src/sqlfluff/core/rules/crawlers.py (×3)
  • src/sqlfluff/rules/L060.py
@alanmcruickshank marked this pull request as ready for review on August 11, 2022 13:39
@barrywhart (Member)

> If we're calling .raw_segments anyway then I feel like we're doing almost as much processing as just tracking the raw_stack - unless the only issue is the continuous tuple arithmetic, in which case maybe we just use a list rather than a tuple and just make a raw stack a more efficient thing to provide.

I think the performance issue comes from continuously building and appending to long sequences (every raw segment in the file). That's pretty expensive whether a list or a tuple is used. We don't need to address it in this PR, which significantly improves performance in other ways that probably drown out the impact of this. Just sharing the thought as a possibility for future performance work.
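The cost pattern being described can be made concrete: repeatedly extending a tuple copies the whole sequence on every step (quadratic overall), while list.append is amortised constant time. Either way, building the full sequence for every raw segment in a file has a real cost. A minimal sketch:

```python
# Rough illustration of the cost pattern mentioned above: tuple
# concatenation copies all existing elements each time, whereas
# list.append is amortised O(1). Both build the same sequence.
def build_with_tuples(n: int) -> tuple:
    stack: tuple = ()
    for i in range(n):
        stack = stack + (i,)  # copies the whole tuple every iteration
    return stack


def build_with_list(n: int) -> list:
    stack: list = []
    for i in range(n):
        stack.append(i)  # amortised constant time
    return stack


assert build_with_tuples(1000) == tuple(build_with_list(1000))
```

Switching the container helps the per-step cost, but the total work of tracking every raw segment remains, which is why skipping segments entirely is the bigger win.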

@barrywhart (Member) left a comment

Looks good overall! Lots of code to review, but I tried to identify any red flags. The existing test cases plus the "rules critical error" checks tend to be very helpful, but I did find a few potential issues where an IndexError may be triggered.

Review threads on:
  • src/sqlfluff/core/parser/segments/base.py (×2)
  • src/sqlfluff/core/rules/context.py (×2)
  • src/sqlfluff/core/rules/crawlers.py
  • src/sqlfluff/rules/L039.py (×2)
```diff
@@ -79,7 +80,7 @@ def _handle_preceding_inline_comments(before_segment, anchor_segment):
         if s.is_comment
         and s.name != "block_comment"
         and s.pos_marker.working_line_no
-        == anchor_segment.pos_marker.working_line_no
+        == anchor_segment.raw_segments[-1].pos_marker.working_line_no
```
@barrywhart (Member):

Is it possible raw_segments is empty and will fail with IndexError?

@alanmcruickshank (Member, Author):

Good question 🤔 . I'm going to have to look into that one.

@alanmcruickshank (Member, Author):

I looked into this one. We can be sure that anchor_segment is a RawSegment itself; calling .raw_segments on a RawSegment just returns [self], so this can be reduced to anchor_segment.pos_marker.working_line_no. I've updated the mypy type hints accordingly and simplified the function.

@alanmcruickshank (Member, Author):

Nope - I'm wrong. anchor_segment isn't always a RawSegment (I found that out via the now-failing tests). However, raw_segments will always be non-empty: either a segment has children, or it returns [self] from .raw_segments if it's a RawSegment.
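That invariant is why indexing with [-1] is safe, and it can be sketched in a few lines. These are simplified stand-in classes, not the actual sqlfluff implementation.

```python
# Sketch of the invariant described above: .raw_segments can never be
# empty, because a raw segment returns [self] and a non-raw segment
# always has children to recurse into. Simplified classes only.
from typing import List


class RawSegment:
    @property
    def raw_segments(self) -> List["RawSegment"]:
        return [self]


class BaseSegment:
    def __init__(self, segments):
        # A non-raw segment always wraps at least one child.
        assert segments, "non-raw segments always have children"
        self.segments = segments

    @property
    def raw_segments(self) -> List[RawSegment]:
        out: List[RawSegment] = []
        for seg in self.segments:
            out.extend(seg.raw_segments)
        return out


leaf = RawSegment()
parent = BaseSegment([BaseSegment([leaf]), RawSegment()])
# Safe to index [-1] in both cases: the list can never be empty.
assert leaf.raw_segments == [leaf]
assert len(parent.raw_segments) == 2
```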

Review threads (resolved) on:
  • src/sqlfluff/rules/L053.py
  • test/core/parser/segments_base_test.py
Follow-up commits applying review suggestions (each co-authored by Barry Hart <barrywhart@yahoo.com>):
  • Update src/sqlfluff/core/parser/segments/base.py
  • Update test/core/parser/segments_base_test.py
  • Update src/sqlfluff/rules/L053.py
  • Update src/sqlfluff/core/rules/context.py (×2)
  • Update src/sqlfluff/core/rules/crawlers.py
  • Update src/sqlfluff/rules/L010.py
@barrywhart (Member) left a comment

Looks good! I fixed one small typo in a docstring. Good to merge once the build passes.

Excellent work!! 🎉🥳


Successfully merging this pull request may close these issues:
  • Enhancement: More selective rule crawling