Proposal: Scan deduction and summarization #377

Open
pombredanne opened this issue Nov 25, 2016 · 11 comments

@pombredanne
Contributor

pombredanne commented Nov 25, 2016

Context

Scanning operates at the file level. This works well, but in many cases a scan reports too much data at too fine a level of detail. This happens when related clues are detected across several files or within the same file.

Problem

Multiple related clues in different files

For instance, if every file in a directory tree has the same license and copyright statements, then the license and origin information could be rolled up at the level of this directory and the file details could be omitted.
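
A minimal sketch of this first roll-up case, assuming the scan results have already been reduced to a mapping of file path to a set of license keys. The function name `roll_up_licenses` and the sample paths are hypothetical and are not part of ScanCode. Copyright statements could be handled the same way.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def roll_up_licenses(file_licenses):
    """Map each directory to the license set shared by every file below it.

    `file_licenses` maps a file path to a set of license keys. A directory
    is included only when all files under it carry the same non-empty set.
    """
    by_dir = defaultdict(list)
    for path, licenses in file_licenses.items():
        # Register the file under each of its ancestor directories.
        for parent in PurePosixPath(path).parents:
            by_dir[str(parent)].append(frozenset(licenses))

    rolled_up = {}
    for directory, license_sets in by_dir.items():
        first = license_sets[0]
        if first and all(s == first for s in license_sets):
            rolled_up[directory] = set(first)
    return rolled_up


# Hypothetical scan data, for illustration only.
scan = {
    "src/a.c": {"mit"},
    "src/b.c": {"mit"},
    "build/config.guess": {"gpl-3.0-plus"},
    "build/install-sh": {"mit"},
}
summary = roll_up_licenses(scan)
print(summary)  # {'src': {'mit'}}
```

Only `src` is rolled up here: `build` and the root mix several licenses, so their file-level details would still be needed.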

Or say that a scanned directory only contains a COPYING file with a license and notice and none of the files in that directory have a license or copyright. Then the license and origin information could be extended from the COPYING to all the files in that tree.

Or say that a scanned directory only contains a README file with a license and notice and that all the files in that directory have a comment See README for licensing. Then the license and origin information could be extended from the README to all the files in that tree that carry this comment.

Or say that a package is detected (such as a Maven JAR, an npm package, or similar), that the package-level metadata accurately describes the licensing of all the files in this package, and that the scan of these files does not bring new details. Then only the license and origin information from the package could be kept and the file details omitted.

Or say that a directory contains code in a mix of programming languages: the primary or main language or language stats could be rolled up at the directory level.

Or say that a directory contains both code and build scripts and that the license for the build scripts is different from that of the code (say this is some autotools MIT or FSF notice). Then the licenses for the directory could be summarized based on a classification of the code files, and the build scripts and the build script licenses would not be reported as the directory or package license.

Multiple related clues in the same file

Some scans operate on the same data in a given file. This may trigger the reporting of extra or spurious clues that could instead be considered together.

For instance, a license text may contain a copyright statement for the text of the license itself, as well as URLs and emails. Detecting licenses, copyrights, emails and URLs could report four different clues in the same scanned file and text region, when a single clue for the license should be reported instead.

Or a package metadata file typically contains origin and license information, and these would end up reported twice: as package attributes and as individual detections for license, copyright and URLs.

Solution elements

A comprehensive solution may cover some or all of these:

  • determine where to summarize and roll up clues. For instance, rolling everything up to the root directory level would rarely make sense; instead, rolling things up at a package level and finding a good directory level to use as a break point would be important
  • implement some classification of files such as code proper, test code, build scripts, etc.
  • implement some statistics, rules and/or machine learning to summarize and deduce the proper higher-level origin and license.
  • scan all the clues together in order to combine (and filter) them properly
  • combine package detection with license and copyright detection
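
The file classification and summarization elements above could be sketched as follows. This is only an illustration under stated assumptions: the pattern lists, the class names (`build`, `test`, `code`) and the helpers `classify` and `summarize_by_class` are hypothetical, and a real classifier would need a curated set of rules.

```python
from collections import Counter
from fnmatch import fnmatchcase

# Hypothetical file name patterns; a real list would need curation.
BUILD_PATTERNS = ("config.guess", "config.sub", "configure", "install-sh",
                  "Makefile*", "*.m4")
TEST_PATTERNS = ("test_*", "*_test.*")


def classify(path):
    """Return 'build', 'test' or 'code' for a POSIX-style relative path."""
    name = path.rsplit("/", 1)[-1]
    if any(fnmatchcase(name, p) for p in BUILD_PATTERNS):
        return "build"
    in_tests_dir = "/tests/" in f"/{path}"
    if in_tests_dir or any(fnmatchcase(name, p) for p in TEST_PATTERNS):
        return "test"
    return "code"


def summarize_by_class(file_licenses):
    """Count license keys separately for each class of files."""
    summary = {}
    for path, licenses in file_licenses.items():
        summary.setdefault(classify(path), Counter()).update(licenses)
    return summary


# Hypothetical scan data, for illustration only.
scan = {
    "src/a.c": ["mit"],
    "src/b.c": ["mit"],
    "scripts/config.guess": ["gpl-3.0-plus"],
}
by_class = summarize_by_class(scan)
print(by_class["code"].most_common(1))  # [('mit', 2)]
```

Keeping one counter per class means the build script licenses remain visible but do not weigh on the summary for the code proper.
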
@yahalom5776

@pombredanne Another case I see quite often is the detection of a generic clue, e.g. LGPL (with no further version info), and then another clue in the same file with the specific license information, e.g. LGPL 2.1 or later. It would help to have some logic to roll these up to the "better" result, which is LGPL 2.1 or later. This could be based on the "distance" between the clues in the file and the knowledge that LGPL and LGPL 2.1 are related (this would have to be set in the license metadata/detection definitions). Another topic where such a roll-up would be helpful is the typical GPL 2 or later with Autoconf exception headers.

I thought about this for some time and I am still a little bit worried about "auto-resolutions" if I do not know that this resolution even happened. So perhaps we could preserve the raw data of all clues found somehow to be able to retrace the finding?

Assuming licenses from clues on directory level to other files (perhaps with the condition they have no other clues themselves) is a possibility but I think it's a completely different ballgame from a complexity and (legal) risk level. Perhaps it makes more sense to start on file level for that matter. But that's just IMHO.
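
A rough sketch of the file-level roll-up of a generic clue into a nearby specific one, which also returns the raw clues so the resolution can be retraced. The relation table `REFINEMENTS`, the `Clue` type, the `max_gap` threshold and the line numbers are all assumptions made for illustration; ScanCode's license metadata does not necessarily carry such relations.

```python
from dataclasses import dataclass

# Hypothetical relation table: generic key -> related, more specific keys.
REFINEMENTS = {
    "lgpl": {"lgpl-2.1", "lgpl-2.1-plus", "lgpl-3.0"},
    "gpl": {"gpl-2.0", "gpl-2.0-plus", "gpl-3.0-plus"},
}


@dataclass(frozen=True)
class Clue:
    key: str
    start_line: int
    end_line: int


def line_gap(a, b):
    """Distance in lines between two clues (0 when they overlap)."""
    return max(0, max(a.start_line, b.start_line) - min(a.end_line, b.end_line))


def roll_up(clues, max_gap=10):
    """Return (kept, raw).

    A generic clue is dropped from `kept` when a related specific clue
    lies within `max_gap` lines. `raw` is the untouched input, preserved
    so that every automatic resolution stays traceable.
    """
    kept = []
    for clue in clues:
        specifics = REFINEMENTS.get(clue.key, set())
        superseded = any(
            other.key in specifics and line_gap(clue, other) <= max_gap
            for other in clues
        )
        if not superseded:
            kept.append(clue)
    return kept, list(clues)


# Made-up clues for a single file.
clues = [Clue("lgpl", 3, 3), Clue("lgpl-2.1-plus", 5, 12), Clue("mit", 40, 60)]
kept, raw = roll_up(clues)
print([c.key for c in kept])  # ['lgpl-2.1-plus', 'mit']
```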

@pombredanne
Contributor Author

@yahalom5776 Thanks for the feedback. This makes 100% sense to me. I agree we should always keep the raw scans: this is more about adding smarts and summaries at the package and some directory levels, but not hiding the things below these

@pombredanne
Contributor Author

@yahalom5776 If you can provide some examples for "Another topic where such a roll up would be helpful are the typical GPL 2 or later with Autoconf exception headers", this would be great.

@pkunz

pkunz commented Nov 30, 2016

On:

"For instance, if every file in a directory tree has the same license and copyright statements, then the license and origin information could be rolled up at the level of this directory and the file details could be omitted."

Yes, if not done automatically, it has to be done manually during an audit.

One has to be careful with the COPYING file. It may be the text for gpl-2.0 or lgpl-2.1, but in the head of files one may find gpl-2.0-plus or lgpl-2.1-plus. Or the 'or later' might be found in a NOTICE or README file. Also there may be a few files with licenses other than the one stated in the COPYING file.

If autotools are used (quite common), then the same set of licenses shows up in a scan and could/should be ignored, because the autotools files are copied verbatim or generated from a template. Perhaps we could make a list of such files that can be ignored.

I don't always trust the license info in the metadata of an rpm because this is put in by hand by the author of the rpm spec file who is not necessarily the author of the package.

@yahalom5776

yahalom5776 commented Dec 9, 2016

@pombredanne Sorry for the late reply but here is a similar example. It's from glibc 2.19:
glibc.zip-extract/glibc-2.19/wcsmbs/isoc99_vswscanf.c

License header:

/* Copyright (C) 1993-2014 Free Software Foundation, Inc.
   This file is part of the GNU C Library.

   The GNU C Library is free software; you can redistribute it and/or
   modify it under the terms of the GNU Lesser General Public
   License as published by the Free Software Foundation; either
   version 2.1 of the License, or (at your option) any later version.

   The GNU C Library is distributed in the hope that it will be useful,
   but WITHOUT ANY WARRANTY; without even the implied warranty of
   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
   Lesser General Public License for more details.

   You should have received a copy of the GNU Lesser General Public
   License along with the GNU C Library; if not, see
   <http://www.gnu.org/licenses/>.

   As a special exception, if you link the code in this file with
   files compiled with a GNU compiler to produce an executable,
   that does not cause the resulting executable to be covered by
   the GNU Lesser General Public License.  This exception does not
   however invalidate any other reasons why the executable file
   might be covered by the GNU Lesser General Public License.
   This exception applies to code released by its copyright holders
   in files containing the exception.  */

That's the ScanCode result according to the HTML output for that file:

glibc.zip-extract/glibc-2.19/wcsmbs/isoc99_vswscanf.c 	2 	25 	license 	lgpl-2.1-plus
glibc.zip-extract/glibc-2.19/wcsmbs/isoc99_vswscanf.c 	2 	25 	license 	lgpl-2.1-plus-linking

Correct roll-up would be

lgpl-2.1-plus-linking

in this case. Perhaps you can have a look. Thank you!

Edit: Another one from glibc 2.19, this time it is an autoconf clue:

License header of glibc.zip-extract/glibc-2.19/scripts/config.guess (Lines 1 - 32):

#! /bin/sh
# Attempt to guess a canonical system name.
#   Copyright 1992-2013 Free Software Foundation, Inc.

timestamp='2013-11-29'

# This file is free software; you can redistribute it and/or modify it
# under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful, but
# WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
# General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, see <http://www.gnu.org/licenses/>.
#
# As a special exception to the GNU General Public License, if you
# distribute this file as part of a program that contains a
# configuration script generated by Autoconf, you may include it under
# the same distribution terms that you use for the rest of that
# program.  This Exception is an additional permission under section 7
# of the GNU General Public License, version 3 ("GPLv3").
#
# Originally written by Per Bothner.
#
# You can get the latest version of this script from:
# http://git.savannah.gnu.org/gitweb/?p=config.git;a=blob_plain;f=config.guess;hb=HEAD
#
# Please send patches with a ChangeLog entry to config-patches@gnu.org.

ScanCode detection:

glibc.zip-extract/glibc-2.19/scripts/config.guess 	7 	18 	license 	gpl-3.0-plus
glibc.zip-extract/glibc-2.19/scripts/config.guess 	20 	25 	license 	gpl-3.0-autoconf
glibc.zip-extract/glibc-2.19/scripts/config.guess 	55 	56 	license 	unknown

The "unknown" detection is further down in the file and should be reviewed and handled independently IMO:

Originally written by Per Bothner.
Copyright 1992-2013 Free Software Foundation, Inc.

This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE."

@pombredanne
Contributor Author

@yahalom5776 Thanks!

For the glibc case, this is something that will be dealt with using license expressions (see #74). In this case, it would be an expression like: lgpl-2.1-plus with lgpl-2.1-plus-linking. This is because the two licenses need to be reported and are detected together.
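
As a rough illustration of the idea (and not of how #74 implements it), detections that cover the same path and line range could be joined into one expression string. The helper `combine_co_located` and its `operator` parameter are hypothetical. The input rows mirror the ScanCode HTML output quoted earlier in this thread.

```python
from collections import defaultdict


def combine_co_located(detections, operator="with"):
    """Join license keys detected on the same path and line range.

    `detections` is an iterable of (path, start_line, end_line, key)
    tuples, in the order reported by the scan.
    """
    grouped = defaultdict(list)
    for path, start, end, key in detections:
        keys = grouped[(path, start, end)]
        if key not in keys:  # skip exact duplicates
            keys.append(key)
    return {
        region: f" {operator} ".join(keys)
        for region, keys in grouped.items()
    }


path = "glibc.zip-extract/glibc-2.19/wcsmbs/isoc99_vswscanf.c"
rows = [
    (path, 2, 25, "lgpl-2.1-plus"),
    (path, 2, 25, "lgpl-2.1-plus-linking"),
]
expressions = combine_co_located(rows)
print(expressions[(path, 2, 25)])  # lgpl-2.1-plus with lgpl-2.1-plus-linking
```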

@pombredanne
Contributor Author

@yahalom5776 For the config.guess case (and in general when several licenses are detected in a single file), we have various possibilities:

  • you have several repeated licenses with the same copyright (as is the case for most detections in a config.guess), and these are good candidates for a summarization
  • you have several different licenses (such as in a top-level notice that recaps all the embedded third-party licenses), in which case it becomes hard to summarize anything

In the case of the unknown detection, we have this interesting text: "see the source for copying conditions". This could be something we detect on its own (along with the many variations on the same theme of "see this other file for licensing..."), and we could be smart about that. Is there a "See LICENSE" and did we detect a license in a LICENSE file nearby? And if so, could we infer what this "unknown" license is instead?
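
One way to picture this is a small pattern for "see X for licensing" notices, followed by a lookup of X in the same directory and then in each parent directory. The regular expression, the helper `resolve_reference` and the sample data are assumptions for illustration only. Real notices vary far more than this pattern covers.

```python
import posixpath
import re

# Illustrative pattern only; real notices vary widely.
REFERENCE_RE = re.compile(
    r"see\s+(?:the\s+)?(?:file\s+)?(?P<target>[A-Za-z][\w.-]*)\s+for\s+"
    r"(?:licensing|license|copying|details)",
    re.IGNORECASE,
)


def resolve_reference(path, text, file_licenses):
    """Return (target_path, licenses) for a license reference in `text`.

    The referenced name is looked up next to `path`, then in each parent
    directory. Returns None when there is no reference or no such file.
    """
    match = REFERENCE_RE.search(text)
    if not match:
        return None
    target = match.group("target")
    directory = posixpath.dirname(path)
    while True:
        candidate = posixpath.join(directory, target)
        if candidate in file_licenses:
            return candidate, file_licenses[candidate]
        if not directory:
            return None
        directory = posixpath.dirname(directory)


# Hypothetical scan data: only the COPYING file has a detected license.
licenses = {"lib/COPYING": {"gpl-2.0"}}
found = resolve_reference("lib/util/x.c", "/* See COPYING for details */", licenses)
print(found)  # ('lib/COPYING', {'gpl-2.0'})
```

Note that "see the source for copying conditions" matches the pattern with the target "source", which is not a file name, so it resolves to nothing. Such a notice would still need a review or a dedicated rule.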

Finally, in the case of common build scripts such as config.guess and related autotools scripts, having them classified automatically as build scripts could offer a way to deduce further what the license is and what the relative importance of these licenses is. For example, the license of the build scripts is not as important as the license of the main code proper and usually has little or no impact on the resulting license: I can build an MIT-licensed package with autotools or a GPL-licensed build script and my package will still be MIT-licensed; neither the built binaries nor the source proper will inherit the build script licensing.

pombredanne added a commit that referenced this issue Dec 13, 2016
 * detect license references such as "See COPYING for details"

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
pombredanne added a commit that referenced this issue Feb 17, 2017
 * detect license references such as "See COPYING for details"

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
@pombredanne pombredanne mentioned this issue Mar 10, 2017
4 tasks
@pombredanne pombredanne added this to the v3.0 milestone Oct 20, 2017
pombredanne added a commit that referenced this issue May 7, 2018
 * also rename CLI option
 * add tests


Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
pombredanne added a commit that referenced this issue May 7, 2018
 * this way this can run from a virtual codebase too

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
pombredanne added a commit that referenced this issue May 7, 2018
This is very basic at the moment.

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
pombredanne added a commit that referenced this issue Jun 18, 2018
The counters are not a summary

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
pombredanne added a commit that referenced this issue Jul 11, 2018
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
pombredanne added a commit that referenced this issue Jul 11, 2018
 - there is now a single summary option that summarizes whichever scan
 is available from the copyrights, licenses, programming language
 - the summary is reported either as a new codebase-level attribute
 or as both codebase-level and file/directory level when using
 --summary-with-details
 - only json output supports summaries for now

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
pombredanne added a commit that referenced this issue Jul 16, 2018
 * Fix test failures (from unstable sort order)
 * Refactor common code where relevant
 * Other minor refinements

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
pombredanne added a commit that referenced this issue Jul 18, 2018
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
pombredanne added a commit that referenced this issue Jul 18, 2018
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
pombredanne added a commit that referenced this issue Jul 18, 2018
A path pattern must be matched or not. For instance matching a directory
does not mean the children are matched.

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
pombredanne added a commit that referenced this issue Jul 18, 2018
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
pombredanne added a commit that referenced this issue Jul 18, 2018
When doing aggregations for key files or grouping by facet, we need to
recompute value summaries for each summarized attribute to get correct
summaries.

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
pombredanne added a commit that referenced this issue Jul 24, 2018
When computing summaries for #377 empty values (e.g. summaries of
None) and attributes without a summary should not be the cause of
crashes. Same for empty directories.

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
pombredanne added a commit that referenced this issue Oct 30, 2018
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
@pombredanne pombredanne modified the milestones: v3.0, v3.1 Nov 5, 2018
pombredanne added a commit that referenced this issue Nov 8, 2018
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
pombredanne added a commit that referenced this issue Nov 8, 2018
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
@pombredanne pombredanne removed this from the v3.3 milestone Sep 24, 2021