Flag potentially duplicate rules in CI #12481
Comments
An automated way of detecting duplicated rules would be very helpful, but I don't have a good feeling for how well the described approach would work regarding false positives and false negatives. I think a good first step would be to identify duplicate rules that were later merged/removed and play it through manually.
Glad to hear there is some interest! I did a quick experiment:

Methodology

Results

There appear to be no exact matches between existing rules, but we can see if there are any interesting "near matches". The top 10 most similar pairs (which happens to coincide with those pairs having similarity of over 50%) are as follows:
The number of false positives for this test is either a feature or a bug depending on whether or not it's helpful to spot places where additional test cases should be added to the test fixture. One can play with the threshold a bit, depending on how many candidate pairs you want to consider. There are
Next?

So perhaps an automated check that flags rule pairs above a similarity threshold would be helpful? If CI is too unwieldy for this, one could also run a script manually before releases or periodically. The json and notebook used are at this repo if you'd like to play around yourself. Let me know if you'd like me to go further with this (e.g. continue experimenting, add something to the

An example of an additional experiment might be to augment the set of files used to generate each rule's "profile" by including one or all of the codebases in the ecosystem checks. This has the advantage of comparing rule check behavior in a more natural setting, but the disadvantage that rarely tripped rules may get buried.
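To make the threshold idea concrete, here is a minimal sketch, not the notebook's actual code. It assumes each rule's "profile" is the set of (file, line) locations it flags, and it assumes Jaccard similarity as the metric; the metric is not named in this comment, so that choice is a stand-in. The rule codes and function names are hypothetical.

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |a & b| / |a | b|, defined as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similar_pairs(profiles: dict, threshold: float = 0.5) -> list:
    """Return (similarity, rule_a, rule_b) for every pair strictly above
    `threshold`, most similar first.

    `profiles` maps a rule code to the set of (file, line) locations it
    flags. This shape is an assumption made for illustration.
    """
    pairs = []
    for rule_a, rule_b in combinations(sorted(profiles), 2):
        score = jaccard(profiles[rule_a], profiles[rule_b])
        if score > threshold:
            pairs.append((score, rule_a, rule_b))
    # Highest similarity first; ties broken by rule code for stable output.
    return sorted(pairs, key=lambda p: (-p[0], p[1], p[2]))


# Hypothetical profiles: X001 and X002 share three of four flagged locations.
example_profiles = {
    "X001": {("a.py", 1), ("a.py", 5), ("b.py", 2)},
    "X002": {("a.py", 1), ("a.py", 5), ("b.py", 2), ("b.py", 9)},
    "X003": {("c.py", 3)},
}

for score, rule_a, rule_b in similar_pairs(example_profiles):
    print(f"{rule_a} ~ {rule_b}: {score:.0%}")
```

Raising or lowering `threshold` trades the number of candidate pairs to review against the chance of missing a near-duplicate.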
Sorry for the late reply. Overall, this does seem useful, but it seems too noisy for a CI job. It requires a manual review of the flagged rules. The time I would find such a check the most useful is when there's a PR for a new rule, similar to the ecosystem check. But I must admit it's unclear to me how to design the check so that it is approachable, understandable, and has a good false-positive/false-negative ratio.
I wonder if it would be helpful to add a test or CI step to check for potentially duplicate rules.
The most naive version of this I can think of is as follows:
In this case I would guess that either Rule A and Rule B are identical, or else it should be possible to disambiguate them by adding more test cases. (But maybe there is an error in that logic). If this isn't true, perhaps a list of exceptional pairs can be maintained.
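As a rough sketch of that check, the snippet below flags rule pairs whose profiles are exactly identical and skips pairs on a maintained exception list. It assumes that "identical" means two rules flag exactly the same (file, line) locations in the test fixtures; the profile shape, rule codes, and function name are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def identical_pairs(profiles: dict, known_exceptions=frozenset()) -> list:
    """Return sorted (rule_a, rule_b) pairs whose flagged locations match exactly.

    `profiles` maps a rule code to the set of (file, line) locations it flags.
    `known_exceptions` is a collection of frozenset({rule_a, rule_b}) pairs
    that have been reviewed and accepted as distinct rules.
    """
    groups = defaultdict(list)
    for rule, hits in profiles.items():
        # Skip rules that never fire: empty profiles would all match each other.
        if hits:
            groups[frozenset(hits)].append(rule)

    flagged = []
    for rules in groups.values():
        for pair in combinations(sorted(rules), 2):
            if frozenset(pair) not in known_exceptions:
                flagged.append(pair)
    return sorted(flagged)


# Hypothetical data: three rules flag the same single location.
example_profiles = {
    "X001": {("a.py", 1)},
    "X002": {("a.py", 1)},
    "X003": {("a.py", 1)},
    "X004": set(),
}
reviewed = {frozenset({"X001", "X003"})}
print(identical_pairs(example_profiles, reviewed))
```

Grouping by the frozen profile avoids comparing every pair of rules directly, so only rules that already share a profile are paired up.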
Presumably there is also a clever way to implement this such that the first time the check is run it might be slow but is fast on subsequent runs.
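One possible way to get that behaviour, sketched under assumptions: cache the computed profiles on disk, keyed by a digest of the fixture files, and recompute only when the digest changes. The `compute` callback, the cache file layout, and the `salt` parameter (for example a linter version string, so that rule changes also invalidate the cache) are hypothetical and not part of any existing tooling.

```python
import hashlib
import json
from pathlib import Path


def fixture_digest(paths, salt: str = "") -> str:
    """Hash the names and contents of all fixture files, plus an optional salt."""
    h = hashlib.sha256(salt.encode())
    for path in sorted(str(p) for p in paths):
        h.update(path.encode())
        h.update(b"\0")
        h.update(Path(path).read_bytes())
    return h.hexdigest()


def cached_profiles(paths, compute, cache_file, salt: str = ""):
    """Return rule profiles, calling the slow `compute(paths)` only on a cache miss.

    `compute` must return JSON-serializable data, e.g. a dict mapping a rule
    code to a list of [file, line] pairs.
    """
    digest = fixture_digest(paths, salt)
    cache_file = Path(cache_file)
    if cache_file.exists():
        cached = json.loads(cache_file.read_text())
        if cached.get("digest") == digest:
            return cached["profiles"]  # fast path: fixtures unchanged
    profiles = compute(paths)  # slow path: run every rule over the fixtures
    cache_file.write_text(json.dumps({"digest": digest, "profiles": profiles}))
    return profiles
```

A finer-grained variant could key the cache per fixture file, so that editing one file only recomputes the profiles that depend on it.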
(Bonus points: Apply the same principle to speed up/automate the detection of rules from other packages which have already been implemented in Ruff.)
This is sort of tangentially related to the rule re-categorization effort #1774 and of course to the question of aliasing #2186.
Would love to know if there's interest or suggestions. If so, I can try to implement this.