Add tsql nested block comment support and add regex package dependency #2027

jpy-git · 2021-12-02T19:23:41Z

Brief summary of the change made

This PR fixes #1716. T-SQL allows for nested block comments i.e. /* I /* am /* a */ block */ comment */, however, our current regex implementation will only capture up to the first block close i.e. /* I /* am /* a */.

This is a suprisingly complicated problem as it requires being able to capture repeated groups in the regex pattern, which is not possible in the standard re library. However, there is a newer regex library available called regex which is completely backwards compatible with re but extends the regex language capabilities.

This PR replaces re. uses with regex. and adds the regex pattern needed to parse T-SQL block comments (its in dialect_tsql.py).

Are there any other side effects of this change that we should be aware of?

Not that I'm aware of.

Pull Request checklist

Please confirm you have completed any of the necessary steps below.
Included test cases to demonstrate any code changes, which may be one or more of the following:
- .yml rule test cases in test/fixtures/rules/std_rule_cases.
- .sql/.yml parser test cases in test/fixtures/dialects (note YML files can be auto generated with python test/generate_parse_fixture_yml.py or by running tox locally).
- Full autofix test cases in test/fixtures/linter/autofix.
- Other.
Added appropriate documentation for the change.
Created GitHub issues for any relevant followup/future enhancements if appropriate.

codecov · 2021-12-02T19:32:23Z

Codecov Report

Merging #2027 (b7d9022) into main (54effa0) will not change coverage.
The diff coverage is 100.00%.

@@            Coverage Diff            @@
##              main     #2027   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files          148       148           
  Lines        10458     10459    +1     
=========================================
+ Hits         10458     10459    +1

Impacted Files	Coverage Δ
src/sqlfluff/core/parser/lexer.py	`100.00% <100.00%> (ø)`
src/sqlfluff/core/parser/parsers.py	`100.00% <100.00%> (ø)`
src/sqlfluff/core/rules/base.py	`100.00% <100.00%> (ø)`
src/sqlfluff/core/templaters/placeholder.py	`100.00% <100.00%> (ø)`
src/sqlfluff/core/templaters/slicers/heuristic.py	`100.00% <100.00%> (ø)`
src/sqlfluff/core/templaters/slicers/tracer.py	`100.00% <100.00%> (ø)`
src/sqlfluff/dialects/dialect_tsql.py	`100.00% <100.00%> (ø)`
src/sqlfluff/rules/L010.py	`100.00% <100.00%> (ø)`

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 54effa0...b7d9022. Read the comment docs.

jpy-git · 2021-12-03T00:38:56Z

@barrywhart let me know what you think on this one, we replace re with the regex module to take advantage of more advanced regex in the block comments (existing uses are unimpacted). The block comment regex is in dialect_tsql.py.

barrywhart

Looks good overall!

I notice the regex package has a compatibility note here (https://pypi.org/project/regex/):

This module is targeted at CPython. It expects that all codepoints are the same width, so it won’t behave properly with PyPy outside U+0000..U+007F because PyPy stores strings as UTF-8.

SQLFluff currently advertises itself as PyPy compatible, although TBH I don't know if anyone cares about this -- it was an early performance experiment that didn't show any meaningful benefits. But it sounds like we should remove that. That basically means removing mention of it in two of the setup.cfg files:

plugins/sqlfluff-templater-dbt/setup.cfg
setup.cfg

There's also a harmless mention of PyPy in a comment here. Probably no need to change this, but listing it for completeness.

src/sqlfluff/cli/helpers.py

barrywhart · 2021-12-03T15:21:07Z

I took a closer look at the regex repo. Do we know the history of this package? It only has 17 stars and 2 contributors, which might suggest it's a brand-new package, but the git history shows it was created in 2010.

Should we be worried that the package has unknown bugs or may not be maintained in the future? It seems like a low risk, but OTOH, if we start using its additional features extensively and something happens, we could be left in a difficult situation.

If we could learn more about the history of the package, I would feel more comfortable. Any thoughts, @tunetheweb or @alanmcruickshank?

barrywhart · 2021-12-03T15:26:18Z

I did some searching about the history of this.
This is an old Python issue titled "Adding a new regex module (compatible with re)" from 2008. It has a very long discussion thread. It sounds like the reason for creating this issue was due to a bug in re at that time.

One message in the thread says:

Even with in principle approval from Guido, this idea still depends on
volunteers to actually write up a concrete proposal as a PEP (which
shouldn't be too controversial, given Guido already OK'ed the idea) and
then do the integration work to incorporate the code, tests and docs into
CPython (not technically hard, but not trivial either). "pip install
regex" starts looking fairly attractive at that point :)

The last message in the thread says:

It's now a third party project: https://pypi.org/project/regex/
If someone wants to move it into the Python stdlib, I suggest to start on the python-ideas list first.
I close the issue as REJECTED.'

Overall, I feel much better about using this package now that I can confirm the long history and the fact that it was a somewhat serious contender for replacing re at one point.

A bit more info here: https://harjit.moe/pythonregex.html

barrywhart · 2021-12-03T15:33:31Z

Perhaps we could summarize some of the info above and include it in a comment or docs page somewhere, in case this question comes up again?

jpy-git · 2021-12-03T15:44:45Z

@barrywhart it's a fair point, I thought the same and came to a similar conclusion in the end 👍

Let me read through these articles later and make a summary comment in this PR 😄

I've already got a comment above the package in setup.cfg

but I'll tweak that to refer back to this PR for more info on the motivation behind introducing the package.

jpy-git · 2021-12-03T15:46:03Z

looking at download stats it seems pretty widely used as well:

jpy-git · 2021-12-03T15:47:52Z

for comparison

SQLFluff:

Pandas:

jpy-git · 2021-12-03T17:32:34Z

@barrywhart @tunetheweb @alanmcruickshank

regex package summary:

This PR aims to introduce the regex package into SQLFluff.

regex is a third party package that is backwards-compatible with the standard ‘re’ module, but offers additional advanced RegEx functionality, e.g. Nested Sets, Named Capture Groups, Repeated Groups, etc. This means it will not affect the existing regex used within SQLFluff grammar but enable us to handle more complex matching cases (e.g. T-SQL block comments introduced in this PR).
The package has been in development since 2008, and has received "in principle" approval for future inclusion in the standard library (https://bugs.python.org/issue2636). Due to the size the changes it is currently distributed as a separate package.

Even with in principle approval from Guido, this idea still depends on
volunteers to actually write up a concrete proposal as a PEP (which
shouldn't be too controversial, given Guido already OK'ed the idea) and
then do the integration work to incorporate the code, tests and docs into
CPython (not technically hard, but not trivial either). "pip install
regex" starts looking fairly attractive at that point :)

Looking at PyPI the package is receiving regular releases (https://pypi.org/project/regex/#history) and the PyPI download stats show very healthy usage (~17.7M downloads in the past month).
This PR removes official support for PyPy due to regex compatibility, but we feel this is not going to impact many people.
As with everything, there is always the risk that the package could stop being maintained, but for the reasons listed above I feel we can be pretty comfortable with including this package. Given it's backwards compatibility with re, I expect 99% of regex patterns in SQLFluff are still going to be exactly the same as before and therefore we can fairly easily revert back to re if necessary (pretty much just find and replace regex. with re.). Basically people should only be using the special regex additions if absolutely necessary like in this PR.

(Please edit this comment if you feel I've missed anything so we have this source for future reference. I have added a comment in setup.cfg above regex to refer to this PR for more information)

jpy-git · 2021-12-03T20:40:12Z

I'm still of the opinion that we should just do the straight regex for re swap proposed in this PR and detailed above, but just putting this here for consideration.

If a pattern can't be matched by re it raises re.error therefore we could use this to match with re where possible and only use regex where absolutely necessary. We could also then include a catch that skips the pattern for PyPy, the idea being that we mostly support PyPy but there would be a couple segments (e.g. nested comment blocks) that the user can't parse with it and can be included as a known caveat in the docs.

Pseudocode:

import re
from typing import List
import regex

from sqlfluff.cli.helpers import get_python_implementation

string = "dfjaslfdhdalshfd /* block comment */ sdkjfhdskfdsfds"
pattern = r"/\*(?>[^*/]+|\*[^/]|/[^*])*(?>(?R)(?>[^*/]+|\*[^/]|/[^*])*)*\*/"

def regex_findall(pattern, string) -> List[str]:
    """Abstracted regex matcher idea."""
    try:
        # Use re where possible for simple patterns.
        print("Attempting match using re")
        matches = re.findall(pattern, string)
        print("Matched using re")
    except re.error:
        # Switch to regex for more complicated
        # patterns not covered by re.
        if get_python_implementation() == "pypy":
            # regex doesn't cover pypy therefore
            # we skip this pattern for matching.
            # N.B. not ideal but we can't match
            # complicated patterns such as nested
            # block statements with re therefore
            # these are unsupported for pypy.
            print(f"{pattern} pattern not matchable in PyPy.")
            return []
        print("Attempting match using regex")
        matches = regex.findall(pattern, string)
        print("Matched using regex")
    
    return matches

print(regex_findall(pattern, string))

Again I'm of the opinion that we should NOT go this way as it increases complexity for what is likely a negligible part of the user base (I imagine a lot more people use tsql block comments than pypy) and I'm confident that regex alone should handle our needs. But good to have options 👍

barrywhart · 2021-12-03T21:48:08Z

I agree, let's go ahead and use the regex package everywhere.

Thanks for helping think this through. 👍

jpy-git · 2021-12-03T21:51:35Z

awesome thanks @barrywhart 😄

Add tsql nested block comment support and add regex package

715f893

Merge branch 'main' into tsql_nested_block_comments

6cf9648

barrywhart requested changes Dec 3, 2021

View reviewed changes

Implement review feedback

b7d9022

barrywhart approved these changes Dec 3, 2021

View reviewed changes

barrywhart merged commit 4f7fb57 into sqlfluff:main Dec 3, 2021

jpy-git deleted the tsql_nested_block_comments branch December 3, 2021 21:51

jpers36 mentioned this pull request Dec 10, 2021

TSQL: Comment parsing issues #2086

Closed

jpy-git mentioned this pull request Dec 11, 2021

Fix tsql block comment close #2095

Merged

1 task

jpy-git changed the title ~~Add tsql nested block comment support and add regex package~~ Add tsql nested block comment support and add regex package dependency Dec 11, 2021

tunetheweb mentioned this pull request Jun 29, 2022

Core: Compile regexes at init time to avoid overhead #3511

Merged

1 task

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add tsql nested block comment support and add regex package dependency #2027

Add tsql nested block comment support and add regex package dependency #2027

jpy-git commented Dec 2, 2021

codecov bot commented Dec 2, 2021 •

edited

Loading

jpy-git commented Dec 3, 2021

barrywhart left a comment

barrywhart commented Dec 3, 2021

barrywhart commented Dec 3, 2021 •

edited

Loading

barrywhart commented Dec 3, 2021

jpy-git commented Dec 3, 2021

jpy-git commented Dec 3, 2021

jpy-git commented Dec 3, 2021

jpy-git commented Dec 3, 2021 •

edited

Loading

jpy-git commented Dec 3, 2021

barrywhart commented Dec 3, 2021

jpy-git commented Dec 3, 2021

Add tsql nested block comment support and add regex package dependency #2027

Add tsql nested block comment support and add regex package dependency #2027

Conversation

jpy-git commented Dec 2, 2021

Brief summary of the change made

Are there any other side effects of this change that we should be aware of?

Pull Request checklist

codecov bot commented Dec 2, 2021 • edited Loading

Codecov Report

jpy-git commented Dec 3, 2021

barrywhart left a comment

Choose a reason for hiding this comment

barrywhart commented Dec 3, 2021

barrywhart commented Dec 3, 2021 • edited Loading

barrywhart commented Dec 3, 2021

jpy-git commented Dec 3, 2021

jpy-git commented Dec 3, 2021

jpy-git commented Dec 3, 2021

jpy-git commented Dec 3, 2021 • edited Loading

jpy-git commented Dec 3, 2021

barrywhart commented Dec 3, 2021

jpy-git commented Dec 3, 2021

codecov bot commented Dec 2, 2021 •

edited

Loading

barrywhart commented Dec 3, 2021 •

edited

Loading

jpy-git commented Dec 3, 2021 •

edited

Loading