Add tsql nested block comment support and add regex package dependency #2027
Codecov Report
@@            Coverage Diff            @@
##              main    #2027   +/-   ##
=========================================
  Coverage   100.00%  100.00%
=========================================
  Files          148      148
  Lines        10458    10459    +1
=========================================
+ Hits         10458    10459    +1
Continue to review full report at Codecov.
@barrywhart let me know what you think on this one, we replace
Looks good overall!
I notice the regex package has a compatibility note here (https://pypi.org/project/regex/):
This module is targeted at CPython. It expects that all codepoints are the same width, so it won’t behave properly with PyPy outside U+0000..U+007F because PyPy stores strings as UTF-8.
SQLFluff currently advertises itself as PyPy compatible, although TBH I don't know if anyone cares about this -- it was an early performance experiment that didn't show any meaningful benefits. But it sounds like we should remove that. That basically means removing mention of it in two of the setup.cfg files:
plugins/sqlfluff-templater-dbt/setup.cfg
setup.cfg
There's also a harmless mention of PyPy in a comment here. Probably no need to change this, but listing it for completeness.
src/sqlfluff/cli/helpers.py
I took a closer look at the Should we be worried that the package has unknown bugs or may not be maintained in the future? It seems like a low risk, but OTOH, if we start using its additional features extensively and something happens, we could be left in a difficult situation. If we could learn more about the history of the package, I would feel more comfortable. Any thoughts, @tunetheweb or @alanmcruickshank?
I did some searching about the history of this. One message in the thread says:
The last message in the thread says:
Overall, I feel much better about using this package now that I can confirm the long history and the fact that it was a somewhat serious contender for replacing A bit more info here: https://harjit.moe/pythonregex.html
Perhaps we could summarize some of the info above and include it in a comment or docs page somewhere, in case this question comes up again?
@barrywhart it's a fair point, I thought the same and came to a similar conclusion in the end 👍 Let me read through these articles later and make a summary comment in this PR 😄 I've already got a comment above the package in
@barrywhart @tunetheweb @alanmcruickshank
This PR aims to introduce the regex package into SQLFluff.
(Please edit this comment if you feel I've missed anything so we have this source for future reference. I have added a comment in
I'm still of the opinion that we should just do the straight If a pattern can't be matched by
Pseudocode:

import re
from typing import List

import regex

from sqlfluff.cli.helpers import get_python_implementation

string = "dfjaslfdhdalshfd /* block comment */ sdkjfhdskfdsfds"
pattern = r"/\*(?>[^*/]+|\*[^/]|/[^*])*(?>(?R)(?>[^*/]+|\*[^/]|/[^*])*)*\*/"


def regex_findall(pattern: str, string: str) -> List[str]:
    """Abstracted regex matcher idea."""
    try:
        # Use re where possible for simple patterns.
        print("Attempting match using re")
        matches = re.findall(pattern, string)
        print("Matched using re")
    except re.error:
        # Switch to regex for more complicated
        # patterns not covered by re.
        # Compare case-insensitively: platform.python_implementation()
        # reports "PyPy", not "pypy".
        if get_python_implementation().lower() == "pypy":
            # regex doesn't cover pypy therefore
            # we skip this pattern for matching.
            # N.B. not ideal but we can't match
            # complicated patterns such as nested
            # block statements with re therefore
            # these are unsupported for pypy.
            print(f"{pattern} pattern not matchable in PyPy.")
            return []
        print("Attempting match using regex")
        matches = regex.findall(pattern, string)
        print("Matched using regex")
    return matches


print(regex_findall(pattern, string))

Again I'm of the opinion that we should NOT go this way as it increases complexity for what is likely a negligible part of the user base (I imagine a lot more people use tsql block comments than pypy) and I'm confident that
I agree, let's go ahead and use the Thanks for helping think this through. 👍
awesome thanks @barrywhart 😄
Brief summary of the change made
This PR fixes #1716. T-SQL allows for nested block comments, i.e. /* I /* am /* a */ block */ comment */, however, our current regex implementation will only capture up to the first block close, i.e. /* I /* am /* a */.
This is a surprisingly complicated problem as it requires being able to capture repeated groups in the regex pattern, which is not possible in the standard re library. However, there is a newer regex library available called regex which is completely backwards compatible with re but extends the regex language capabilities.
This PR replaces re. uses with regex. and adds the regex pattern needed to parse T-SQL block comments (it's in dialect_tsql.py).
Are there any other side effects of this change that we should be aware of?
Not that I'm aware of.
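To make the nested block comment problem from the summary concrete, here is a minimal stdlib-only sketch. It is not the approach taken in this PR (which uses a recursive pattern from the regex package); the NAIVE_BLOCK_COMMENT pattern and the match_nested_comment helper are hypothetical names introduced purely for illustration, and the naive pattern is an assumed stand-in rather than SQLFluff's actual previous pattern. It shows a non-recursive pattern stopping at the first block close, and a simple depth counter finding the balanced comment that the recursive pattern is meant to match.

```python
import re
from typing import Optional

SAMPLE = "SELECT 1 /* I /* am /* a */ block */ comment */ FROM t"

# Assumed stand-in for a typical non-recursive block comment pattern.
NAIVE_BLOCK_COMMENT = re.compile(r"/\*.*?\*/", re.DOTALL)


def match_nested_comment(text: str, start: int = 0) -> Optional[str]:
    """Return the first balanced /* ... */ comment at or after start, else None."""
    begin = text.find("/*", start)
    if begin == -1:
        return None
    depth = 0
    i = begin
    # Walk the text two characters at a time, tracking nesting depth.
    while i < len(text) - 1:
        pair = text[i : i + 2]
        if pair == "/*":
            depth += 1
            i += 2
        elif pair == "*/":
            depth -= 1
            i += 2
            if depth == 0:
                return text[begin:i]
        else:
            i += 1
    # Ran out of text before the comment was closed.
    return None


print(NAIVE_BLOCK_COMMENT.search(SAMPLE).group(0))
print(match_nested_comment(SAMPLE))
```

The first print shows the truncated match described above; the second shows the full nested comment. A hand-written scanner like this would not slot into a regex-based lexer, which is one reason a recursive pattern is the more natural fit here.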
Pull Request checklist
Please confirm you have completed any of the necessary steps below.
Included test cases to demonstrate any code changes, which may be one or more of the following:
  .yml rule test cases in test/fixtures/rules/std_rule_cases.
  .sql/.yml parser test cases in test/fixtures/dialects (note YML files can be auto generated with python test/generate_parse_fixture_yml.py or by running tox locally).
  test/fixtures/linter/autofix.
Added appropriate documentation for the change.
Created GitHub issues for any relevant followup/future enhancements if appropriate.