Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH copyright-notice: check in the first 4096 bytes instead of 1024 #11927

Merged
merged 3 commits into from
Jun 18, 2024

Conversation

adrinjalali
Copy link
Contributor

@adrinjalali adrinjalali commented Jun 18, 2024

Summary

related to #5306

The check right now only checks in the first 1024 bytes, and that's really not enough when there's a docstring at the beginning of a file.

A more proper fix might be needed, which might be more complex (and I don't have the rust skills to implement that). But this temporary "fix" might enable more users to use this.

Context: We want to use this rule in https://github.com/scikit-learn/scikit-learn/ and we got blocked because of this hardcoded rule (which TBH took us quite a while to figure out why it was failing since it's not documented).

Test Plan

This is already kinda tested, modified the test for the new byte number.

@adrinjalali adrinjalali changed the title 4096 ENH copyright-notice: check in the first 4096 bytes instead of 1024 Jun 18, 2024
Copy link
Member

@zanieb zanieb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems okay to me. Thanks for contributing! cc @MichaReiser

@zanieb zanieb added the rule Implementing or modifying a lint rule label Jun 18, 2024
Copy link
Member

@MichaReiser MichaReiser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks okay, although I'm not sure why the limit even exists. The only performance issue that I can see is if the regex uses a wildcard at the start. Any regex without a wildcard at the start should end by just testing a few characters.

@zanieb zanieb merged commit 2e7c345 into astral-sh:main Jun 18, 2024
20 checks passed
@MichaReiser
Copy link
Member

Okay it seems that the somewhat arbitrary limit is coming from the original lint rule. I think the limit there makes sense because the rule actually reads the file. The situation for ruff is different because we already have the file in memory.

Anyway, thanks for contributing!

@zanieb
Copy link
Member

zanieb commented Jun 18, 2024

I'm not sure either cc @BurntSushi

Not a lot of context in the original implementation discussion at #4701

@BurntSushi
Copy link
Member

Here's the regex:

Lazy::new(|| Regex::new(r"(?i)Copyright\s+((?:\(C\)|©)\s+)?\d{4}((-|,\s)\d{4})*").unwrap());

Specifically:

(?i)Copyright\s+((?:\(C\)|©)\s+)?\d{4}((-|,\s)\d{4})*

And the search is using Regex::find, which is an unanchored search. So the regex behaves as if there is a (?s:.)*? at the beginning. So putting a limit on how much is searched seems sensible if you only expect it to be near the beginning of file and it's plausible that this could be used to search very large files. I doubt 4096 and 1024 will make much of a difference, so this change seems okay.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
rule Implementing or modifying a lint rule
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants