Multiple values for copyright holders #182

Closed
jdaguil opened this issue Jan 22, 2016 · 7 comments
@jdaguil
Contributor

jdaguil commented Jan 22, 2016

In some cases more than one holder is detected even though only one copyright statement is detected. For example, running scancode on the scancode samples directory reports the following for samples/zlib/deflate.c, although only one copyright statement is detected:

        "statements": [
          "Copyright (c) 1995-2013 Jean-loup Gailly and Mark Adler"
        ],
        "holders": [
          "1995-2013 Jean-loup Gailly and Mark Adler",
          "Jean-loup Gailly"
        ],

Another example is in samples/JGroups/licenses/bouncycastle.txt:

        "statements": [
          "Copyright (c) 2000 - 2006 The Legion Of The Bouncy Castle"
        ],
        "holders": [
          "Legion Of The Bouncy Castle",
          "Legion Of The Bouncy"
        ],
@pombredanne
Contributor

This is definitely a flaw... though I am not sure we can guarantee a one-to-one relationship between a copyright statement and the number of holders. For instance:
Copyright (c) 2042 John Doe, Michael Foo and Jane Bar.
Technically, we should return three holders there: `John Doe`, `Michael Foo` and `Jane Bar`.

That said, we should not return dupes nor holders contained in other holders. That is a bug and, IMHO, a side effect of how the parse tree is walked there: https://github.com/nexB/scancode-toolkit/blob/01bce665468c198e0eaadf1805b82629fb2a5554/src/cluecode/copyrights.py#L706
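
To make the expected behavior concrete, here is a minimal sketch of a post-processing step that drops exact duplicates and any holder that is a substring of another holder. The function name `prune_holders` is hypothetical and this is not the code in `copyrights.py`; it only illustrates the rule on the two reported samples.

```python
def prune_holders(holders):
    """Return holders without duplicates and without entries contained in another entry."""
    # Drop exact duplicates while keeping first-seen order.
    unique = list(dict.fromkeys(holders))
    pruned = []
    for candidate in unique:
        # A holder is redundant if a different, longer holder contains it.
        contained = any(
            candidate != other and candidate in other
            for other in unique
        )
        if not contained:
            pruned.append(candidate)
    return pruned


zlib = ["1995-2013 Jean-loup Gailly and Mark Adler", "Jean-loup Gailly"]
bouncy = ["Legion Of The Bouncy Castle", "Legion Of The Bouncy"]
print(prune_holders(zlib))    # ['1995-2013 Jean-loup Gailly and Mark Adler']
print(prune_holders(bouncy))  # ['Legion Of The Bouncy Castle']
```

Note that the years still leak into the zlib holder here; that is a separate cleanup from the duplication problem.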

@jdaguil
Contributor Author

jdaguil commented Jan 27, 2016

@pombredanne I had a discussion with @mjherzog regarding the holders. Instead of returning three separate holders (John Doe, Michael Foo and Jane Bar), what do you think of returning John Doe, Michael Foo and Jane Bar as a single holder, simply removing the Copyright (c) 2042 part of the statement? A completely separate copyright statement in the same file would then yield a second holder.
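
As a rough illustration of this proposal, the sketch below derives one holder per statement by stripping the leading "Copyright (c) <years>" part. The regular expression and the name `holder_from_statement` are assumptions made for the example, not scancode's grammar-based detection, and they only cover the simple statement shapes quoted in this issue.

```python
import re

# Hypothetical prefix pattern: "Copyright", an optional "(c)" or ©,
# then any number of years or year ranges such as "2042" or "2000 - 2006".
PREFIX = re.compile(
    r"^\s*copyright\s*(?:\(c\)|©)?\s*"
    r"(?:\d{4}(?:\s*-\s*\d{4})?\s*,?\s*)*",
    re.IGNORECASE,
)


def holder_from_statement(statement):
    """Return the statement with its copyright prefix and years removed."""
    holder = PREFIX.sub("", statement).strip()
    # Simplification: a trailing period is treated as sentence punctuation,
    # which would also trim names that legitimately end with one.
    return holder.rstrip(".")


print(holder_from_statement("Copyright (c) 2042 John Doe, Michael Foo and Jane Bar."))
# John Doe, Michael Foo and Jane Bar
```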

@pombredanne
Contributor

@jdaguil splitting John Doe, Michael Foo and Jane Bar into three holders is theoretically possible but hard in practice. I would deal with this later. The hard part is recognizing actual names with some accuracy.
So for now, the focus is to get clean holders, possibly composed of multiple names when the statement lists several.

@pombredanne
Contributor

@jdaguil note that this last topic is completely different from the subject of this bug, which is a duplication bug.

@jdaguil
Contributor Author

jdaguil commented Jan 27, 2016

@pombredanne yes, let me open a different issue for the previous comment discussion

@pombredanne
Contributor

This has also been reported by @jeffmcaffer.
I am boosting the priority to v2.3.

@pombredanne pombredanne modified the milestones: v3.0, v2.3 Feb 7, 2018
pombredanne added a commit that referenced this issue Jun 18, 2018
NB: this is a breaking API change

* Each has their own list of items returned for #255
* for now holders are no longer expanded (e.g. this reverts the #182
  implementation available before). This will be reintroduced later as
  a CLI option as it is not possible to get great results for now
* the summary has been improved for #1043 and provides a much better
  holder summary. More refinements needed
* Some spurious bare SPDX ids have been removed to avoid FP #1114



Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
@pombredanne pombredanne modified the milestones: v2.3, v3.1 Nov 4, 2018
@pombredanne
Contributor

This has been fixed in the latest release and I think this is good enough now.
We really

    {
      "path": "bouncycastle.txt",
      "type": "file",
      "holders": [
        {
          "value": "The Legion Of The Bouncy Castle",
          "start_line": 5,
          "end_line": 5
        }
      ],
      "copyrights": [
        {
          "value": "Copyright (c) 2000 - 2006 The Legion Of The Bouncy Castle (http://www.bouncycastle.org)",
          "start_line": 5,
          "end_line": 5
        }
      ],
      "authors": [],
      "scan_errors": []
    }

and

    {
      "path": "deflate.c",
      "type": "file",
      "holders": [
        {
          "value": "Jean-loup Gailly and Mark Adler",
          "start_line": 2,
          "end_line": 3
        },
        {
          "value": "Jean-loup Gailly and Mark Adler",
          "start_line": 54,
          "end_line": 55
        }
      ],
      "copyrights": [
        {
          "value": "Copyright (c) 1995-2013 Jean-loup Gailly and Mark Adler",
          "start_line": 2,
          "end_line": 3
        },
        {
          "value": "Copyright 1995-2013 Jean-loup Gailly and Mark Adler",
          "start_line": 54,
          "end_line": 55
        }
      ],
      "authors": [
        {
          "value": "Leonid Broukhis.",
          "start_line": 34,
          "end_line": 35
        }
      ],
      "scan_errors": []
    }

The other part has been tracked and fixed too, in #186.
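
In the deflate.c output above, the same holder value appears once per detected statement, each with its own line range. For a consumer that wants one entry per holder, a minimal sketch of grouping such a file record by `value` could look like this. The helper name `group_holders` is hypothetical, and the record is a trimmed copy of the JSON shown in this comment.

```python
import json

# Trimmed copy of the deflate.c record quoted above.
record = json.loads("""
{
  "path": "deflate.c",
  "holders": [
    {"value": "Jean-loup Gailly and Mark Adler", "start_line": 2, "end_line": 3},
    {"value": "Jean-loup Gailly and Mark Adler", "start_line": 54, "end_line": 55}
  ]
}
""")


def group_holders(file_record):
    """Map each distinct holder value to the line ranges where it was detected."""
    grouped = {}
    for holder in file_record.get("holders", []):
        line_range = (holder["start_line"], holder["end_line"])
        grouped.setdefault(holder["value"], []).append(line_range)
    return grouped


print(group_holders(record))
# {'Jean-loup Gailly and Mark Adler': [(2, 3), (54, 55)]}
```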
