
Build out rule crawling mechanisms #3717

Merged (53 commits, Aug 15, 2022)
Conversation

@alanmcruickshank (Member) commented Aug 7, 2022

This builds on the work started with CrawlBehavior in the base rules file. I've extended that into a new module with a few variants on the idea and some more flexibility for specific rules. The newest logic is in SegmentSeekerCrawler, which aggressively prunes the tree if segments of particular types do not exist anywhere within a segment or its children. On the less clever side, I've also introduced RootOnlyCrawler, which will be relevant for rules that want to handle their own crawling entirely (which is what I'm thinking for reflow application, see #3673). To facilitate this, RuleContext has also been moved (mostly untouched) to its own file.

To achieve this I've done a few other things:

  • SegmentMetaclass to allow caching of _class_types.
  • Building out some more mechanics on BaseSegment for accessing types.
  • A repr() method for LintResult, and additional logging in several places to aid debugging.
  • Stripped out the old CrawlBehavior class.
  • Removed .is_final_segment(), which is now unused (I considered just putting a pragma: no cover on it, but we can always recreate the method from the git history if needed later).
  • context.raw_segment_pre has been removed, partly because it's hard to reimplement in this model, but mostly because the new crawling approach is already a lot more efficient.
  • Refactoring of all rules to work better with the new structure. Significant refactorings on:
    • L016
    • L039
    • L052
    • L053
    • L063
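The _class_types caching mentioned above can be sketched with a metaclass along these lines. This is a minimal illustration only: the class hierarchy and the exact mechanics are assumptions for the sketch, not the actual sqlfluff code.

```python
# Sketch: compute a class's full set of types once, at class-creation
# time, instead of walking the hierarchy on every lookup.
# Illustrative names; not the real sqlfluff implementation.
class SegmentMetaclass(type):
    def __new__(mcs, name, bases, namespace):
        cls = super().__new__(mcs, name, bases, namespace)
        types = set()
        seg_type = getattr(cls, "type", None)
        if seg_type:
            types.add(seg_type)
        # Inherit the cached types of all bases.
        for base in bases:
            types |= getattr(base, "_class_types", frozenset())
        cls._class_types = frozenset(types)
        return cls


class BaseSegment(metaclass=SegmentMetaclass):
    type = "base"


class StatementSegment(BaseSegment):
    type = "statement"


class SelectStatementSegment(StatementSegment):
    type = "select_statement"


print(sorted(SelectStatementSegment._class_types))
# ['base', 'select_statement', 'statement']
```

Because the set is frozen at class creation, type-membership checks during crawling become a cheap set lookup rather than repeated traversal.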

This is a BIG pull request, but I'm not sure there's a simple way of doing it in less than one big push.
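As a rough illustration of the pruning idea behind SegmentSeekerCrawler: if each segment caches the set of types present anywhere in its subtree, a crawler can skip whole branches that cannot contain a match. The Segment class, its descendant_types cache, and the crawl signature below are all illustrative assumptions, not the actual sqlfluff API.

```python
# Minimal sketch of type-based tree pruning, loosely modelled on the
# SegmentSeekerCrawler idea described above. Not the real sqlfluff API.
from typing import Iterator, Set


class Segment:
    def __init__(self, seg_type: str, children=()):
        self.type = seg_type
        self.children = list(children)
        # Cache the set of all types present in this subtree so a
        # crawler can skip whole branches cheaply.
        self.descendant_types: Set[str] = {seg_type}
        for child in self.children:
            self.descendant_types |= child.descendant_types


class SegmentSeekerCrawler:
    """Yield only segments of the requested types, pruning branches
    whose cached type sets show no possible match below."""

    def __init__(self, target_types: Set[str]):
        self.target_types = target_types

    def crawl(self, segment: Segment) -> Iterator[Segment]:
        # Prune: nothing of interest anywhere in this subtree.
        if not self.target_types & segment.descendant_types:
            return
        if segment.type in self.target_types:
            yield segment
        for child in segment.children:
            yield from self.crawl(child)


tree = Segment("file", [
    Segment("statement", [Segment("comma"), Segment("keyword")]),
    Segment("statement", [Segment("keyword")]),  # pruned: no comma below
])
found = list(SegmentSeekerCrawler({"comma"}).crawl(tree))
print([s.type for s in found])  # ['comma']
```

A RootOnlyCrawler in this model would simply yield the root segment once and leave all traversal to the rule itself.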

@barrywhart (Member)

I'll do a proper review later. One quick thought about raw_segment_pre: it would be pretty easy to implement if each segment knew its parent segment (happy to expand on this idea if you like).

@alanmcruickshank (Member, Author)

> I'll do a proper review later. One quick thought about raw_segment_pre: it would be pretty easy to implement if each segment knew its parent segment (happy to expand on this idea if you like).

Go on?

The problem I see is that at the moment a segment isn't aware of its parent; all the references go down the tree, not up. While crawling we do have good access to the parent, but the previous raw segment might not be in the parent; it might be several layers up. Combine that with more aggressive skipping, where we don't process every raw segment anyway, and the easiest way to get the previous raw would be to call .raw_segments and take the last one (i.e. .raw_segments[-1] every time we skip). If we're calling .raw_segments anyway, then I feel like we're doing almost as much processing as just tracking the raw stack - unless the only issue is the continuous tuple arithmetic, in which case maybe we just use a list rather than a tuple and make the raw stack a more efficient thing to provide.
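The trade-off described above can be shown with a toy example: recomputing a flattened raw-segment sequence on demand repeats work proportional to the subtree size, while an incrementally maintained raw stack appends each leaf exactly once. The classes below are illustrative stand-ins, not sqlfluff's real segment types.

```python
# Toy illustration of the trade-off discussed above: flattening on
# demand vs. incrementally tracking a "raw stack" during the crawl.
from typing import List


class Raw:
    def __init__(self, raw: str):
        self.raw = raw

    @property
    def raw_segments(self) -> List["Raw"]:
        return [self]  # a raw segment is its own only leaf


class Node:
    def __init__(self, children):
        self.children = children

    @property
    def raw_segments(self) -> List[Raw]:
        # Recomputed from scratch on every access: O(subtree size).
        out: List[Raw] = []
        for child in self.children:
            out.extend(child.raw_segments)
        return out


tree = Node([Raw("SELECT"), Node([Raw(" "), Raw("1")])])

# Option 1: get the previous raw via flattening (repeated work).
previous_raw = tree.raw_segments[-1]

# Option 2: maintain a raw stack incrementally while crawling
# (each raw segment appended exactly once).
raw_stack: List[Raw] = []


def crawl(seg) -> None:
    if isinstance(seg, Raw):
        raw_stack.append(seg)
    else:
        for child in seg.children:
            crawl(child)


crawl(tree)
print(previous_raw.raw, raw_stack[-1].raw)  # 1 1
```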

@codecov bot commented Aug 7, 2022

Codecov Report

Merging #3717 (edb1251) into main (40f3d81) will not change coverage.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##              main     #3717    +/-   ##
==========================================
  Coverage   100.00%   100.00%            
==========================================
  Files          176       178     +2     
  Lines        13466     13603   +137     
==========================================
+ Hits         13466     13603   +137     
Impacted Files Coverage Δ
src/sqlfluff/core/rules/functional/segments.py 100.00% <ø> (ø)
src/sqlfluff/rules/L027.py 100.00% <ø> (ø)
src/sqlfluff/rules/L049.py 100.00% <ø> (ø)
src/sqlfluff/core/parser/segments/base.py 100.00% <100.00%> (ø)
src/sqlfluff/core/parser/segments/raw.py 100.00% <100.00%> (ø)
src/sqlfluff/core/rules/__init__.py 100.00% <100.00%> (ø)
src/sqlfluff/core/rules/base.py 100.00% <100.00%> (ø)
src/sqlfluff/core/rules/context.py 100.00% <100.00%> (ø)
src/sqlfluff/core/rules/crawlers.py 100.00% <100.00%> (ø)
src/sqlfluff/rules/L001.py 100.00% <100.00%> (ø)
... and 64 more


@barrywhart (Member)

Definitely not a necessity. It was a good optimization for the current design. We can get rid of it for now. 👍

@barrywhart (Member) left a comment

Looks great so far! A few small tactical suggestions, but overall direction is 💯.

@WittierDinosaur, are there any specific rules you've noticed use a lot of runtime that could benefit from using the new crawl behavior? Would you be interested in testing this at some point on some of your large SQL files?

Review threads (resolved) on:
  • src/sqlfluff/core/parser/segments/base.py (×3)
  • src/sqlfluff/core/rules/crawlers.py (×3)
  • src/sqlfluff/rules/L060.py
@alanmcruickshank marked this pull request as ready for review on August 11, 2022 13:39
@barrywhart (Member)

> If we're calling .raw_segments anyway then I feel like we're doing almost as much processing as just tracking the raw_stack - unless the only issue is the continuous tuple arithmetic, in which case maybe we just use a list rather than a tuple and just make a raw stack a more efficient thing to provide.

I think the performance issue comes from continuously building and appending to long sequences (every raw segment in the file). That's pretty expensive whether a list or a tuple is used. We don't need to address it in this PR, which significantly improves performance in other ways that probably drown out the impact of this. Just sharing the thought as a possibility for future performance work.
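The cost pattern being described can be made concrete: repeatedly extending a tuple copies the whole sequence on every step (quadratic overall), while list.append is amortised constant time. Either way, building the full sequence for every raw segment in a file has a real cost. A minimal sketch:

```python
# Rough illustration of the cost pattern mentioned above: tuple
# concatenation copies all existing elements each time, whereas
# list.append is amortised O(1). Both build the same sequence.
def build_with_tuples(n: int) -> tuple:
    stack: tuple = ()
    for i in range(n):
        stack = stack + (i,)  # copies the whole tuple every iteration
    return stack


def build_with_list(n: int) -> list:
    stack: list = []
    for i in range(n):
        stack.append(i)  # amortised constant time
    return stack


assert build_with_tuples(1000) == tuple(build_with_list(1000))
```

Switching the container helps the per-step cost, but the total work of tracking every raw segment remains, which is why skipping segments entirely is the bigger win.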

@barrywhart (Member) left a comment

Looks good overall! Lots of code to review, but I tried to identify any red flags. The existing test cases plus the "rules critical error" checks tend to be very helpful, but I did find a few potential issues where an IndexError may be triggered.

Review threads on:
  • src/sqlfluff/core/parser/segments/base.py (×2)
  • src/sqlfluff/core/rules/context.py (×2)
  • src/sqlfluff/core/rules/crawlers.py
  • src/sqlfluff/rules/L039.py (×2)
```diff
@@ -79,7 +80,7 @@ def _handle_preceding_inline_comments(before_segment, anchor_segment):
         if s.is_comment
         and s.name != "block_comment"
         and s.pos_marker.working_line_no
-        == anchor_segment.pos_marker.working_line_no
+        == anchor_segment.raw_segments[-1].pos_marker.working_line_no
```
@barrywhart (Member):

Is it possible raw_segments is empty and will fail with IndexError?

@alanmcruickshank (Member, Author):

Good question 🤔 . I'm going to have to look into that one.

@alanmcruickshank (Member, Author):

I looked into this one. We can be sure that anchor_segment is a RawSegment itself; calling .raw_segments on a RawSegment just returns [self], so this can be reduced to anchor_segment.pos_marker.working_line_no. I've updated the mypy type hints accordingly and simplified the function.

@alanmcruickshank (Member, Author):

Nope - I'm wrong. anchor_segment isn't always a RawSegment (I found that out via the now-failing tests). However, raw_segments will always be non-empty: either a segment has children, or it returns [self] from .raw_segments if it's a RawSegment.
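That invariant is why indexing with [-1] is safe, and it can be sketched in a few lines. These are simplified stand-in classes, not the actual sqlfluff implementation.

```python
# Sketch of the invariant described above: .raw_segments can never be
# empty, because a raw segment returns [self] and a non-raw segment
# always has children to recurse into. Simplified classes only.
from typing import List


class RawSegment:
    @property
    def raw_segments(self) -> List["RawSegment"]:
        return [self]


class BaseSegment:
    def __init__(self, segments):
        # A non-raw segment always wraps at least one child.
        assert segments, "non-raw segments always have children"
        self.segments = segments

    @property
    def raw_segments(self) -> List[RawSegment]:
        out: List[RawSegment] = []
        for seg in self.segments:
            out.extend(seg.raw_segments)
        return out


leaf = RawSegment()
parent = BaseSegment([BaseSegment([leaf]), RawSegment()])
# Safe to index [-1] in both cases: the list can never be empty.
assert leaf.raw_segments == [leaf]
assert len(parent.raw_segments) == 2
```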

Review threads (resolved) on:
  • src/sqlfluff/rules/L053.py
  • test/core/parser/segments_base_test.py
Follow-up commits applying review suggestions (each co-authored by Barry Hart <barrywhart@yahoo.com>):
  • Update src/sqlfluff/core/parser/segments/base.py
  • Update test/core/parser/segments_base_test.py
  • Update src/sqlfluff/rules/L053.py
  • Update src/sqlfluff/core/rules/context.py (×2)
  • Update src/sqlfluff/core/rules/crawlers.py
  • Update src/sqlfluff/rules/L010.py
@barrywhart (Member) left a comment

Looks good! I fixed one small typo in a docstring. Good to merge once the build passes.

Excellent work!! 🎉🥳


Successfully merging this pull request may close these issues:
  • Enhancement: More selective rule crawling