Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix license discovery: split at dashes for word-based matching #543

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

ytausch
Copy link

@ytausch ytausch commented Jul 19, 2024

Description

I noticed the following test is currently failing on main: license/test_discovery/test_discovery.test_short_license_id for ("3-Clause BSD License", "BSD-3-Clause").

Reason: There is a license called DEC-3-Clause which is as similar as BSD-3-Clause to 3-Clause BSD License regarding the similarity measures we currently use. Splitting at dashes is a good way of handling cases correctly in which the license name is given without dashes.

It should be noted that the number of tests is very sparse so there is a possibility of breaking something else with this change. I still think it is safe to proceed since we are only modifying the behavior in situations we are unsure anyway.

Additional Note

The other tests failing on main are not related to this PR.

@ytausch ytausch requested a review from a team as a code owner July 19, 2024 16:02
@dholth
Copy link
Contributor

dholth commented Jul 20, 2024

I fixed the problem in a similar way 9ba91d8

Is the tokenizer wrong (remove dash from token characters?)

@ytausch
Copy link
Author

ytausch commented Jul 23, 2024

The rapidfuzz tokenizer only splits at spaces, not at other characters. This is the relevant source code.

Regarding your fix: I think it confuses the following pairs of licenses:

AFL-1.2 AFL-2.1
GPL-1.0 GPL-1.0+
GPL-2.0 GPL-2.0+
GPL-3.0 GPL-3.0+
LGPL-2.0 LGPL-2.0+
LGPL-2.1 LGPL-2.1+
LGPL-3.0 LGPL-3.0+
OLDAP-1.2 OLDAP-2.1
OLDAP-1.2 OLDAP-2.2.1
OLDAP-2.1 OLDAP-2.2.1
OLDAP-2.2 OLDAP-2.2.2

I obtained this list using a little Python script:

import urllib.request
import json
import re

if __name__ == '__main__':
    with urllib.request.urlopen("https://github.com/spdx/license-list-data/raw/main/json/licenses.json") as url:
        data = json.load(url)
    for i, first_license in enumerate(data["licenses"]):
        first_id = first_license["licenseId"]
        first_bag = set(re.findall(r'\w+', first_id.lower()))
        for second_license in data["licenses"][i+1:]:
            second_id = second_license["licenseId"]
            second_bag = set(re.findall(r'\w+', second_id.lower()))
            if first_bag == second_bag:
                print(first_id, second_id)

Probably we should add a test like this as well.

@ytausch
Copy link
Author

ytausch commented Jul 23, 2024

I now ensured in an additional tests that license IDs are always mapped to themselves

@ytausch
Copy link
Author

ytausch commented Jul 29, 2024

@marcelotrevisani I think the failing tests are not related to the changes I made in this PR

@marcelotrevisani
Copy link
Member

@marcelotrevisani I think the failing tests are not related to the changes I made in this PR

yes, they are not related indeed, gonna take a look and I will try to fix them later.
I am not having much time these days to maintain the project due to work and personal life, but I am trying to keep a look into this :)
Thank you for your contribution!

@marcelotrevisani
Copy link
Member

can you rebase this branch to main please?

@ytausch
Copy link
Author

ytausch commented Aug 19, 2024

Hmm, not sure which parts of this PR are still relevant since @beenje merged #552 which is also targeted at the DEC/BSD confusion.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants