Duplicates in license detection result #2170

Open
qduanmu opened this issue Aug 21, 2020 · 6 comments
qduanmu commented Aug 21, 2020

The content of the scanned file is:
{ "ZPL-2.0", new LicenseData(licenseID: "ZPL-2.0", isOsiApproved: true, isDeprecatedLicenseId: false, isFsfLibre: true) }

The license detection result is below:

    {
      "path": "test_file",
      "type": "file",
      "licenses": [
        {
          "key": "zpl-2.0",
          "score": 50.0,
          "name": "Zope Public License 2.0",
          "short_name": "ZPL 2.0",
          "category": "Permissive",
          "is_exception": false,
          "owner": "Zope Community",
          "homepage_url": "http://www.zope.org/Resources/License/",
          "text_url": "http://www.zope.org/Resources/License/",
          "reference_url": "https://enterprise.dejacode.com/urn/urn:dje:license:zpl-2.0",
          "spdx_license_key": "ZPL-2.0",
          "spdx_url": "https://spdx.org/licenses/ZPL-2.0",
          "start_line": 1,
          "end_line": 1,
          "matched_rule": {
            "identifier": "spdx_license_id_zpl-2.0_for_zpl-2.0.RULE",
            "license_expression": "zpl-2.0",
            "licenses": [
              "zpl-2.0"
            ],
            "is_license_text": false,
            "is_license_notice": false,
            "is_license_reference": true,
            "is_license_tag": false,
            "matcher": "2-aho",
            "rule_length": 3,
            "matched_length": 3,
            "match_coverage": 100.0,
            "rule_relevance": 50.0
          }
        },
        {
          "key": "zpl-2.0",
          "score": 50.0,
          "name": "Zope Public License 2.0",
          "short_name": "ZPL 2.0",
          "category": "Permissive",
          "is_exception": false,
          "owner": "Zope Community",
          "homepage_url": "http://www.zope.org/Resources/License/",
          "text_url": "http://www.zope.org/Resources/License/",
          "reference_url": "https://enterprise.dejacode.com/urn/urn:dje:license:zpl-2.0",
          "spdx_license_key": "ZPL-2.0",
          "spdx_url": "https://spdx.org/licenses/ZPL-2.0",
          "start_line": 1,
          "end_line": 1,
          "matched_rule": {
            "identifier": "spdx_license_id_zpl-2.0_for_zpl-2.0.RULE",
            "license_expression": "zpl-2.0",
            "licenses": [
              "zpl-2.0"
            ],
            "is_license_text": false,
            "is_license_notice": false,
            "is_license_reference": true,
            "is_license_tag": false,
            "matcher": "2-aho",
            "rule_length": 3,
            "matched_length": 3,
            "match_coverage": 100.0,
            "rule_relevance": 50.0
          }
        }
      ],
      "license_expressions": [
        "zpl-2.0",
        "zpl-2.0"
      ],
      "scan_errors": []
    }
  • What OS are you running on? Linux
  • What version of scancode-toolkit was used to generate the scan file? 3.1.1.post848.57dab7d24
  • What installation method was used to install/run scancode? source download
qduanmu added the bug label Aug 21, 2020
@pombredanne
Contributor

@qduanmu Hello and thank you again! I hope everything is OK for you!

If you run the scan with --license --license-text --license-text-diagnostics, the text matched in each case is "matched_text": "ZPL-2.0\",". There are two instances of it in the file, so there are two detections, as expected.
We could:

  1. create a rule with "ZPL-2.0", new LicenseData(licenseID: "ZPL-2.0" but that would be weird, as there are likely many more cases like that

  2. create a false positive or a negative rule for most of the content of your file (I assume this is coming from https://github.com/NuGet/NuGet.Client/blob/7bf0d060f3f1a680121ac17dbda01e6b15ef3b54/src/NuGet.Core/NuGet.Packaging/Licenses/NuGetLicenseData.cs ) but that would also be quite unwieldy

  3. design something new to match these few cases of code that contains a lot of licenses that are NOT the licenses of the code, such as the one you have an issue with and many others such as https://github.com/jslicense/spdx-exceptions.json/blob/master/index.json or ... for instance scancode itself.

Both 1. and 2. would be quick fixes but would not be viable for the long term. I tend to think 3. is a better but harder approach. What do you think?
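
One possible shape for option 3, sketched as a post-scan heuristic: flag a file as a "license listing" when every match is a short license reference and many distinct license keys appear. This is only an illustration, not an existing ScanCode feature. The function name and both thresholds are hypothetical; the field names (licenses, key, matched_rule, is_license_reference, rule_length) are taken from the JSON output shown above.

```python
def looks_like_license_listing(file_entry, min_distinct_keys=5, max_rule_length=4):
    """Guess whether a scanned file is a table of license ids, not licensed code.

    Hypothetical heuristic: thresholds are arbitrary and would need tuning.
    """
    matches = file_entry.get("licenses", [])
    if not matches:
        return False

    # Every match must be a short "license reference" style match,
    # like the 3-token SPDX id rule in the report above.
    rules = [match.get("matched_rule", {}) for match in matches]
    all_short_references = all(
        bool(rule.get("is_license_reference"))
        and rule.get("rule_length", 0) <= max_rule_length
        for rule in rules
    )

    # A listing mentions many different licenses; ordinary code mentions few.
    distinct_keys = {match["key"] for match in matches}
    return all_short_references and len(distinct_keys) >= min_distinct_keys
```

With these thresholds, the one-line test file from the original report would not be flagged (it has a single distinct key), while a full listing such as NuGetLicenseData.cs would likely be.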


hesa commented Apr 21, 2021

Jumping in a bit late. Stumbled on a file, gen.go, yesterday. There are two license texts in the file. Scancode (3.2.3 with -clipe) reports the following for this file:

     .....
      "license_expressions": [
        "apache-2.0",
        "apache-2.0"
      ],
     .....
      "copyrights": [
        {
          "value": "Copyright 2019 The Wuffs Authors",
          "start_line": 1,
          "end_line": 1
        },
        {
          "value": "Copyright 2019 The Wuffs Authors",
          "start_line": 58,
          "end_line": 58
        }
      ],
      "holders": [
        {
          "value": "The Wuffs Authors",
          "start_line": 1,
          "end_line": 1
        },
        {
          "value": "The Wuffs Authors",
          "start_line": 58,
          "end_line": 58
        }
      ],
     ....

So, license_expressions, copyrights and holders are all stated (by authors) and reported (by Scancode) twice.

I am not sure that verbatim copyright and/or license statements that appear multiple times should be reported as one. OK, I admit the reason I looked at this issue was because I thought it was something spooky with Scancode and I did spend some time checking my scancode report analyser for errors.

Perhaps it simply should be up to the user (machine or human) to discard duplicate entries?
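
A minimal sketch of what such user-side de-duplication could look like, assuming the per-file JSON structure shown above (license_expressions as a list of strings; copyrights and holders as lists of objects with a "value" field). The function name is hypothetical, and comparing on "value" alone is a design assumption: it treats the same statement on lines 1 and 58 as one entry and keeps the first occurrence.

```python
def dedupe_file_entry(file_entry):
    """Return a shallow copy of a scan entry with verbatim duplicates dropped.

    The order of first occurrences is preserved; the input is not modified.
    """
    result = dict(file_entry)

    # license_expressions is a plain list of strings.
    if "license_expressions" in result:
        result["license_expressions"] = list(
            dict.fromkeys(result["license_expressions"])
        )

    # copyrights and holders are lists of dicts; compare on "value" only,
    # so the same statement found on different lines counts as a duplicate.
    for field in ("copyrights", "holders"):
        if field in result:
            seen = set()
            unique = []
            for item in result[field]:
                if item["value"] not in seen:
                    seen.add(item["value"])
                    unique.append(item)
            result[field] = unique

    return result
```

Keeping this outside the scanner means the raw report still records both occurrences and their line numbers, and each consumer can decide whether to collapse them.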

@pombredanne
Contributor

@hesa Hey! 👋 So I think this is a good case for having a simplification here. There are two notices alright, and scancode detects them both correctly, but a post-processing step would do nicely!

Unrelated: Maybe you should run the latest version? 3.2.3 is getting old!


hesa commented Apr 26, 2021

I think I would prefer to do this post processing myself (i.e. let scancode report the two instances). So, for me, this issue can be closed.

Re unrelated :)

ScanCode version 21.3.31


qduanmu commented Aug 5, 2021

Thank you for your quick response, @pombredanne , hope everything goes well with you!
I haven't worked on this for quite a long time (I may be back in the near future), so I need to check the latest scancode first.

> Both 1. and 2. would be quick fixes but would not be viable for the long term. I tend to think 3. is a better but harder approach. What do you think?

I second proposal 3: design something new (such as a regex pattern or rule for the files above) to filter out their license matches as false positives, or even skip scanning those files altogether. Yes, this is a hard approach for files like https://github.com/jslicense/spdx-exceptions.json/blob/master/index.json.
I will see if I could provide some more feedback after checking the latest update.
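
As a rough illustration of the regex idea, here is a sketch that drops license matches for files whose path matches a list of known "license data" patterns, applied to the scan results after the fact. Everything here is an assumption for illustration: the two patterns, the function names, and the choice to blank the licenses and license_expressions fields rather than remove the file entry. It is not how ScanCode handles these files.

```python
import re

# Hypothetical path patterns for files that list licenses as data.
LICENSE_DATA_PATH_PATTERNS = [
    re.compile(r"(^|/)NuGetLicenseData\.cs$"),
    re.compile(r"(^|/)spdx-exceptions\.json/index\.json$"),
]


def is_license_data_path(path):
    """Return True if the path matches one of the known listing patterns."""
    return any(pattern.search(path) for pattern in LICENSE_DATA_PATH_PATTERNS)


def drop_license_data_matches(files):
    """Return new file entries with license matches cleared for listing files.

    Entries for other files are passed through unchanged.
    """
    cleaned = []
    for entry in files:
        if is_license_data_path(entry.get("path", "")):
            # Copy the entry so the caller's data is left untouched.
            entry = dict(entry, licenses=[], license_expressions=[])
        cleaned.append(entry)
    return cleaned
```

A path list like this is easy to audit but has to be maintained by hand, which is the long-term weakness of this kind of approach.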

@pombredanne
Contributor

@qduanmu Hey 👋 !

> hope everything goes well with you!

Thank you and yes, A-OK here ... and I hope for you too.
At the moment I think I went with 2. and several false positive rules were added, but that's not a satisfying solution for the long term. At least for https://raw.githubusercontent.com/jslicense/spdx-exceptions.json/master/index.json no licenses are detected.

https://raw.githubusercontent.com/NuGet/NuGet.Client/7bf0d060f3f1a680121ac17dbda01e6b15ef3b54/src/NuGet.Core/NuGet.Packaging/Licenses/NuGetLicenseData.cs is still problematic though

@AyanSinhaMahapatra ^ FYI, you may have another idea for this issue?
