Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add tsql nested block comment support and add regex package dependency #2027

Merged
merged 3 commits into from
Dec 3, 2021
Merged

Add tsql nested block comment support and add regex package dependency #2027

merged 3 commits into from
Dec 3, 2021

Conversation

jpy-git
Copy link
Contributor

@jpy-git jpy-git commented Dec 2, 2021

Brief summary of the change made

This PR fixes #1716. T-SQL allows for nested block comments i.e. /* I /* am /* a */ block */ comment */, however, our current regex implementation will only capture up to the first block close i.e. /* I /* am /* a */.

This is a suprisingly complicated problem as it requires being able to capture repeated groups in the regex pattern, which is not possible in the standard re library. However, there is a newer regex library available called regex which is completely backwards compatible with re but extends the regex language capabilities.

This PR replaces re. uses with regex. and adds the regex pattern needed to parse T-SQL block comments (its in dialect_tsql.py).

image

Are there any other side effects of this change that we should be aware of?

Not that I'm aware of.

Pull Request checklist

  • Please confirm you have completed any of the necessary steps below.

  • Included test cases to demonstrate any code changes, which may be one or more of the following:

    • .yml rule test cases in test/fixtures/rules/std_rule_cases.
    • .sql/.yml parser test cases in test/fixtures/dialects (note YML files can be auto generated with python test/generate_parse_fixture_yml.py or by running tox locally).
    • Full autofix test cases in test/fixtures/linter/autofix.
    • Other.
  • Added appropriate documentation for the change.

  • Created GitHub issues for any relevant followup/future enhancements if appropriate.

@codecov
Copy link

codecov bot commented Dec 2, 2021

Codecov Report

Merging #2027 (b7d9022) into main (54effa0) will not change coverage.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff            @@
##              main     #2027   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files          148       148           
  Lines        10458     10459    +1     
=========================================
+ Hits         10458     10459    +1     
Impacted Files Coverage Δ
src/sqlfluff/core/parser/lexer.py 100.00% <100.00%> (ø)
src/sqlfluff/core/parser/parsers.py 100.00% <100.00%> (ø)
src/sqlfluff/core/rules/base.py 100.00% <100.00%> (ø)
src/sqlfluff/core/templaters/placeholder.py 100.00% <100.00%> (ø)
src/sqlfluff/core/templaters/slicers/heuristic.py 100.00% <100.00%> (ø)
src/sqlfluff/core/templaters/slicers/tracer.py 100.00% <100.00%> (ø)
src/sqlfluff/dialects/dialect_tsql.py 100.00% <100.00%> (ø)
src/sqlfluff/rules/L010.py 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 54effa0...b7d9022. Read the comment docs.

@jpy-git
Copy link
Contributor Author

jpy-git commented Dec 3, 2021

@barrywhart let me know what you think on this one, we replace re with the regex module to take advantage of more advanced regex in the block comments (existing uses are unimpacted). The block comment regex is in dialect_tsql.py.

Copy link
Member

@barrywhart barrywhart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good overall!

I notice the regex package has a compatibility note here (https://pypi.org/project/regex/):

This module is targeted at CPython. It expects that all codepoints are the same width, so it won’t behave properly with PyPy outside U+0000..U+007F because PyPy stores strings as UTF-8.

SQLFluff currently advertises itself as PyPy compatible, although TBH I don't know if anyone cares about this -- it was an early performance experiment that didn't show any meaningful benefits. But it sounds like we should remove that. That basically means removing mention of it in two of the setup.cfg files:

  • plugins/sqlfluff-templater-dbt/setup.cfg
  • setup.cfg

There's also a harmless mention of PyPy in a comment here. Probably no need to change this, but listing it for completeness.

src/sqlfluff/cli/helpers.py

@barrywhart
Copy link
Member

I took a closer look at the regex repo. Do we know the history of this package? It only has 17 stars and 2 contributors, which might suggest it's a brand-new package, but the git history shows it was created in 2010.

Should we be worried that the package has unknown bugs or may not be maintained in the future? It seems like a low risk, but OTOH, if we start using its additional features extensively and something happens, we could be left in a difficult situation.

If we could learn more about the history of the package, I would feel more comfortable. Any thoughts, @tunetheweb or @alanmcruickshank?

@barrywhart
Copy link
Member

barrywhart commented Dec 3, 2021

I did some searching about the history of this.
This is an old Python issue titled "Adding a new regex module (compatible with re)" from 2008. It has a very long discussion thread. It sounds like the reason for creating this issue was due to a bug in re at that time.

One message in the thread says:

Even with in principle approval from Guido, this idea still depends on
volunteers to actually write up a concrete proposal as a PEP (which
shouldn't be too controversial, given Guido already OK'ed the idea) and
then do the integration work to incorporate the code, tests and docs into
CPython (not technically hard, but not trivial either). "pip install
regex" starts looking fairly attractive at that point :)

The last message in the thread says:

It's now a third party project: https://pypi.org/project/regex/
If someone wants to move it into the Python stdlib, I suggest to start on the python-ideas list first.
I close the issue as REJECTED.'

Overall, I feel much better about using this package now that I can confirm the long history and the fact that it was a somewhat serious contender for replacing re at one point.

A bit more info here: https://harjit.moe/pythonregex.html

@barrywhart
Copy link
Member

Perhaps we could summarize some of the info above and include it in a comment or docs page somewhere, in case this question comes up again?

@jpy-git
Copy link
Contributor Author

jpy-git commented Dec 3, 2021

@barrywhart it's a fair point, I thought the same and came to a similar conclusion in the end 👍

Let me read through these articles later and make a summary comment in this PR 😄

I've already got a comment above the package in setup.cfg
image
but I'll tweak that to refer back to this PR for more info on the motivation behind introducing the package.

@jpy-git
Copy link
Contributor Author

jpy-git commented Dec 3, 2021

looking at download stats it seems pretty widely used as well:
image

@jpy-git
Copy link
Contributor Author

jpy-git commented Dec 3, 2021

for comparison

SQLFluff:
image

Pandas:
image

@jpy-git
Copy link
Contributor Author

jpy-git commented Dec 3, 2021

@barrywhart @tunetheweb @alanmcruickshank

regex package summary:

This PR aims to introduce the regex package into SQLFluff.

  • regex is a third party package that is backwards-compatible with the standard ‘re’ module, but offers additional advanced RegEx functionality, e.g. Nested Sets, Named Capture Groups, Repeated Groups, etc. This means it will not affect the existing regex used within SQLFluff grammar but enable us to handle more complex matching cases (e.g. T-SQL block comments introduced in this PR).

  • The package has been in development since 2008, and has received "in principle" approval for future inclusion in the standard library (https://bugs.python.org/issue2636). Due to the size the changes it is currently distributed as a separate package.

Even with in principle approval from Guido, this idea still depends on
volunteers to actually write up a concrete proposal as a PEP (which
shouldn't be too controversial, given Guido already OK'ed the idea) and
then do the integration work to incorporate the code, tests and docs into
CPython (not technically hard, but not trivial either). "pip install
regex" starts looking fairly attractive at that point :)

  • Looking at PyPI the package is receiving regular releases (https://pypi.org/project/regex/#history) and the PyPI download stats show very healthy usage (~17.7M downloads in the past month).

  • This PR removes official support for PyPy due to regex compatibility, but we feel this is not going to impact many people.

  • As with everything, there is always the risk that the package could stop being maintained, but for the reasons listed above I feel we can be pretty comfortable with including this package. Given it's backwards compatibility with re, I expect 99% of regex patterns in SQLFluff are still going to be exactly the same as before and therefore we can fairly easily revert back to re if necessary (pretty much just find and replace regex. with re.). Basically people should only be using the special regex additions if absolutely necessary like in this PR.

(Please edit this comment if you feel I've missed anything so we have this source for future reference. I have added a comment in setup.cfg above regex to refer to this PR for more information)

@jpy-git
Copy link
Contributor Author

jpy-git commented Dec 3, 2021

I'm still of the opinion that we should just do the straight regex for re swap proposed in this PR and detailed above, but just putting this here for consideration.

If a pattern can't be matched by re it raises re.error therefore we could use this to match with re where possible and only use regex where absolutely necessary. We could also then include a catch that skips the pattern for PyPy, the idea being that we mostly support PyPy but there would be a couple segments (e.g. nested comment blocks) that the user can't parse with it and can be included as a known caveat in the docs.

Pseudocode:

import re
from typing import List
import regex

from sqlfluff.cli.helpers import get_python_implementation

string = "dfjaslfdhdalshfd /* block comment */ sdkjfhdskfdsfds"
pattern = r"/\*(?>[^*/]+|\*[^/]|/[^*])*(?>(?R)(?>[^*/]+|\*[^/]|/[^*])*)*\*/"

def regex_findall(pattern, string) -> List[str]:
    """Abstracted regex matcher idea."""
    try:
        # Use re where possible for simple patterns.
        print("Attempting match using re")
        matches = re.findall(pattern, string)
        print("Matched using re")
    except re.error:
        # Switch to regex for more complicated
        # patterns not covered by re.
        if get_python_implementation() == "pypy":
            # regex doesn't cover pypy therefore
            # we skip this pattern for matching.
            # N.B. not ideal but we can't match
            # complicated patterns such as nested
            # block statements with re therefore
            # these are unsupported for pypy.
            print(f"{pattern} pattern not matchable in PyPy.")
            return []
        print("Attempting match using regex")
        matches = regex.findall(pattern, string)
        print("Matched using regex")
    
    return matches

print(regex_findall(pattern, string))

Again I'm of the opinion that we should NOT go this way as it increases complexity for what is likely a negligible part of the user base (I imagine a lot more people use tsql block comments than pypy) and I'm confident that regex alone should handle our needs. But good to have options 👍

@barrywhart
Copy link
Member

I agree, let's go ahead and use the regex package everywhere.

Thanks for helping think this through. 👍

@barrywhart barrywhart merged commit 4f7fb57 into sqlfluff:main Dec 3, 2021
@jpy-git
Copy link
Contributor Author

jpy-git commented Dec 3, 2021

awesome thanks @barrywhart 😄

@jpy-git jpy-git deleted the tsql_nested_block_comments branch December 3, 2021 21:51
@jpy-git jpy-git mentioned this pull request Dec 11, 2021
1 task
@jpy-git jpy-git changed the title Add tsql nested block comment support and add regex package Add tsql nested block comment support and add regex package dependency Dec 11, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

TSQL: allow for nested comment blocks
2 participants