for license references can we somehow include 2-3 words after and before the detected keywords? #1122

Open
muzsielod opened this issue Jun 28, 2018 · 8 comments

Comments

@muzsielod

muzsielod commented Jun 28, 2018

Hello,

I usually scan components with ScanCode to verify the licensing of their files.
In most cases there are references to licenses, stated as "See the license in the root" or "licensed under the content of the LICENSE file". So I created a license that detects every "license" keyword in a component. As you can imagine, there are a lot of license hits even in a small component. Is there a way to create a rule, or to modify the license entry, so that ScanCode detects 2-3 words before and after the "license" keyword and prints them in the HTML report? That would help a lot in filtering out the false "license" hits in file content without having to check the files manually.

Thank you for your help.

@muzsielod
Author

Hello,

I'll be interested in something like this (\w+\W+\w+\W+(?is)(L|l)icen(s|c)e(?is)\W+\w+\W+\w+).
But as far as I know, ScanCode does not work with regular expressions. Is there a workaround that I could use to get the same result? This regex detects the keyword and two words before and after it.
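For illustration only, here is a minimal standalone Python sketch of the keyword-in-context idea behind that regex. It is not part of ScanCode, and the names `KEYWORD_IN_CONTEXT` and `keyword_contexts` are made up for this example. The pattern is a variant of the one above: the flags are passed to `re.compile` (recent Python versions reject inline flags such as `(?is)` in the middle of a pattern), and the surrounding words are optional so that hits at the start or end of the text still match.

```python
import re

# Up to two words before and after "license"/"licence" (any case).
# \b keeps "sublicense" from matching; \w* also catches "licensed", "licenses".
# \W+ also matches newlines, so the context may cross line breaks.
KEYWORD_IN_CONTEXT = re.compile(
    r"((?:\w+\W+){0,2})\b(licen[sc]e\w*)((?:\W+\w+){0,2})",
    re.IGNORECASE,
)


def keyword_contexts(text):
    """Return each keyword hit together with its surrounding words."""
    return [match.group(0) for match in KEYWORD_IN_CONTEXT.finditer(text)]
```

Because `finditer` returns non-overlapping matches, two keywords that sit closer than two words apart share part of their context rather than being reported with full, overlapping windows.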

Thanks again for your help!

@pombredanne
Contributor

@muzsielod thanks for this ticket.
So if I understand correctly, you want to detect things such as "see COPYING for license" and similar?

Getting some words reported before and after the word "license" would therefore help you figure out what the license statement is and where to look for this. Is this correct?

A few things for your consideration: in #377 there are some discussions of something similar:

Or say that a scanned directory only contains a README file with a license and notice and that all the files in that directory have a comment See README for licensing. Then the license and origin information could be extended from the README to all the files in that tree that carry this comment.

... and also here #377 (comment)

Also, a generic "see-license" license key was added with several common rules in this commit 7018e94#diff-8e8b79632a0dacaa9fc2321ab1deaead ... see also this search https://github.com/nexB/scancode-toolkit/search?q=%22see-license%22&type=Code
I am not sure this is the right way to handle this, but it can help.

You wrote also:

I'll be interested in something like this (\w+\W+\w+\W+(?is)(L|l)icen(s|c)e(?is)\W+\w+\W+\w+).
But as far as I know, ScanCode does not work with regular expressions. Is there a workaround that I could use to get the same result? This regex detects the keyword and two words before and after it.

We could add a regex-based matcher to the ScanCode license detection, but I am not sure this is the best way, since generally speaking the detection is word-based and not character-based.

Instead, as a start, I would be interested in getting the text of the examples you want to detect so we can find the best solution.

@muzsielod
Author

muzsielod commented Jul 2, 2018

Hi @pombredanne.

Getting some words reported before and after the word "license" would therefore help you figure out what the license statement is and where to look for this. Is this correct?
This statement is correct, with the addition that I can exclude a false hit from the start.

I will explain what I did below:
I created two new licenses that detect only the words "license" and "licence".
Now if I run the scan on a single *.js file, it detects 5 "license" hits (the bold text):

line 24 "License that will be placed inside of all created bundles." - which is not relevant information
line 26 "@license" - not relevant information
line 29 "Use of this source code is governed by an MIT-style license that can be" - detected as an MIT License, so no "license" hit is reported
line 30 "found in the LICENSE file at https://angular.io/**license**" - this is relevant information

The problem is that in all these cases the matched_text is just "license", so I have to check 4-5 times in this one file (some files are huge, with hundreds of lines) whether there is relevant information or just a random keyword.

My idea with the regex was to specify a license keyword in a way that tells ScanCode when to list more than just "license" in the matched_text of a detection, as it does when it detects an MIT License with a different copyright holder: "[Copyright] ([c]) [2018] [Google] [LLC]."

In my case it should look like this:
"License [that will be] placed inside of all created bundles."

ScanCode detects a lot of license references, but it misses the hit for some two-word references such as (License:MIT), and for new short licenses or license references that have not been added to the tool.
By adding some keywords as licenses, you can also identify these references, and by including some extra words before and after, you can tell from the start whether the keyword is relevant or not.

I also read "Scan deduction and summarization". I find it a good idea. However, there are cases where a single file has multiple references that (through additional text, a component name, or a project name) refer to different packages, and I think it's safer to verify such references manually, because otherwise you can miss important things. First it would be good to detect the references; from there you can review them manually.

I think this way you can have a better overview of my idea, but tell me if there is something else to be explained.

Thank you.

@pombredanne
Contributor

@muzsielod ok, so this is more or less something such as https://github.com/angular/universal/blob/89856090b4cc92612124584513ff8c717a25618c/build-config.js

I created two new licenses that detect only the words "license" and "licence".

You could instead create only one license key (maybe without a text) and multiple rules for the British and US spellings (and possibly rules for singular and plural). Case is always ignored.
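As a rough sketch of what "one key, several rules" could look like, the snippet below enumerates the spelling and plural variants and pairs each with a rule file name. The key `license-keyword` is hypothetical, and the `<key>_<n>.RULE` naming simply mirrors identifiers such as `mit_129.RULE`; the exact on-disk rule layout should be checked against the existing rules in src/licensedcode/data/rules.

```python
from itertools import product


def keyword_rule_files(key="license-keyword",
                       stems=("licens", "licenc"),
                       suffixes=("e", "es")):
    """Map hypothetical rule file names to the keyword text of each rule.

    Only spelling and plural variants are listed: case variants are not
    needed because case is ignored during detection.
    """
    texts = [stem + suffix for stem, suffix in product(stems, suffixes)]
    return {f"{key}_{number}.RULE": text
            for number, text in enumerate(texts, start=1)}
```

Calling `keyword_rule_files()` yields four entries, one each for "license", "licenses", "licence" and "licences", all pointing at the same license key.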

Note that in that file the MIT license is detected correctly using the develop branch as an MIT license:

 * Use of this source code is governed by an MIT-style license that can be
 * found in the LICENSE file at https://angular.io/license
      "licenses": [
        {
          "key": "mit",
          "score": 100.0,
          "short_name": "MIT License",
          "category": "Permissive",
          "owner": "MIT",
          "homepage_url": "http://opensource.org/licenses/mit-license.php",
          "text_url": "http://opensource.org/licenses/mit-license.php",
          "reference_url": "https://enterprise.dejacode.com/urn/urn:dje:license:mit",
          "spdx_license_key": "MIT",
          "spdx_url": "https://spdx.org/licenses/MIT",
          "start_line": 23,
          "end_line": 24,
          "matched_rule": {
            "identifier": "mit_129.RULE",
            "license_expression": "mit",
            "licenses": [
              "mit"
            ],
            "matcher": "2-aho",
            "rule_length": 25,
            "matched_length": 25,
            "match_coverage": 100.0,
            "rule_relevance": 100
          },
          "matched_text": "Use of this source code is governed by an MIT-style license that can be\n * found in the LICENSE file at https://angular.io/license"
        }
      ],
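To review such results without opening every file, a small script can list each match with its line range and matched text. This is only a sketch: the field names inside "licenses" are taken from the output above, while the top-level "files" list and the per-file "path" field are assumptions about the overall JSON layout that should be checked against a real scan.

```python
import json


def load_scan(location):
    """Load a ScanCode JSON result file from disk."""
    with open(location, encoding="utf-8") as scan_file:
        return json.load(scan_file)


def iter_license_matches(scan):
    """Yield (path, key, start_line, end_line, matched_text) per license match.

    Assumes a top-level "files" list whose entries carry "path" and "licenses".
    "matched_text" may be absent, so it defaults to an empty string.
    """
    for scanned_file in scan.get("files", []):
        for match in scanned_file.get("licenses", []):
            yield (
                scanned_file.get("path"),
                match["key"],
                match["start_line"],
                match["end_line"],
                match.get("matched_text", ""),
            )
```

A review loop could then print `f"{path}:{start}-{end} [{key}] {text}"` for each tuple and skip the matches whose text is obviously irrelevant.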

Do you really want to detect all the mentions of license or licence everywhere, with a few extra surrounding words? And review all the many false positives you will get?

How often do you find issues with licenses that are not correctly detected? (To me this would be a bug every time.) It may be simpler to fix these bugs with rules, one at a time?

Now adding this "license" word detection could be done alright, but it would have to be an extra CLI option, as it would generate a lot of noise that not many folks would care for... In any case this would not be a regex, but rather something that catches the license word and expands to the words left and right of it. This is a bit similar to what is done for SPDX-license-identifier, and it would generate regular matches, so these would be filtered alright if already part of larger matches.
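A word-based version of this idea might look like the sketch below. It is not ScanCode code: the function names, the keyword set, and the use of word positions as spans are all assumptions made for illustration. Each hit records the keyword position plus a window of words to its left and right, and hits whose keyword already lies inside a larger match are dropped.

```python
import re

WORD = re.compile(r"\w+")
KEYWORDS = {"license", "licence", "licenses", "licences"}


def keyword_hits(text, window=2):
    """Return (words, hits); each hit is (keyword_pos, start, end) in word positions.

    start and end are inclusive and clipped to the bounds of the text.
    """
    words = [word.lower() for word in WORD.findall(text)]
    hits = [
        (pos, max(0, pos - window), min(len(words) - 1, pos + window))
        for pos, word in enumerate(words)
        if word in KEYWORDS
    ]
    return words, hits


def drop_contained(hits, larger_matches):
    """Drop hits whose keyword position falls inside an existing larger match.

    larger_matches is a list of inclusive (start, end) word-position spans.
    """
    return [
        hit for hit in hits
        if not any(start <= hit[0] <= end for start, end in larger_matches)
    ]
```

Filtering on the keyword position rather than on the whole window is a design choice here: a window can spill past the edge of a larger match even when the keyword itself is already explained by that match.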

@pombredanne
Contributor

@muzsielod any feedback?

@muzsielod
Author

Hi @pombredanne,

Yes, I want to detect all the license keywords, and even add other ones if this is somehow manageable. Several other keywords, such as "permission", detect short permission notices or licenses. I'd like to create a few of these one-keyword licenses with 2-3 detected words before and after. Afterwards I'll go through the report manually (review) and verify just the files that have potential license or reference hits.

Sorry, I didn't keep a list of the incorrectly detected licenses, but if it helps, I'll try to do that in the future and send them to you. Anyhow, only a few were incorrectly detected; the bigger problem was with the short references or new licenses mentioned above. I know that for the basic user these extra keyword licenses will be seen as "noise", but I'd like to catch them all.

Can you help me or explain to me how to do that SPDX-license-identifier-like detection for the "license" key, or does this have to be a code change in ScanCode? I'd like to test it out and experiment with it if this is somehow possible, and come back with the results.

Thank you for your feedback.

@pombredanne
Contributor

@muzsielod sorry for not replying earlier... somehow this ticket slipped through unnoticed.

  1. SPDX-license-identifier-like detection is done with code.
  2. ScanCode does not use regex, but plain text rules for detection. The best way to add new detections, including short keywords or phrases, is to add new license detection rules. See src/licensedcode/data/rules for examples.
  3. Returning a few words before and after would likely mean extending the way the matched license text is returned. The current option to return whole matched lines is not surfaced as a command line option and would be a good start.
  4. Beyond this, creating a new scanner plugin to do regex keyword collection is another approach.

@pombredanne
Contributor

The current option to return whole matched lines is not surfaced as a command line option and would be a good start

This is now the default with the --license-text option... is this working OK for you?


2 participants