ERROR: failed to run post-scan plugin: consolidate: and False positive on very long lines #2726

FrankHeimes · 2021-10-07T12:37:37Z

Description

After scanning the Qt code base, scancode failed to run the final steps:

PS C:\My\scancode-toolkit-30.0.0> d:\MyScript.ps1
Setup plugins...
Collect file inventory...
Scan files for: info, licenses, copyrights, packages, emails, urls, generated with 30 process(es)...
[--------------------] 0

The following message repeated many times:
c:\my\scancode-toolkit-30.0.0\src\cluecode\copyrights.py:3382: FutureWarning: Possible set difference at position 3
  remove_tags = re.compile(

[####################] 29462
ERROR: failed to run post-scan plugin: consolidate:
Traceback (most recent call last):
  File "c:\my\scancode-toolkit-30.0.0\src\scancode\cli.py", line 1057, in run_codebase_plugins
    plugin.process_codebase(codebase, **kwargs)
  File "c:\my\scancode-toolkit-30.0.0\src\summarycode\plugin_consolidate.py", line 159, in process_codebase
    consolidations.extend(get_consolidated_packages(codebase))
  File "c:\my\scancode-toolkit-30.0.0\src\summarycode\plugin_consolidate.py", line 239, in get_consolidated_packages
    for _, holder, _, _ in CopyrightDetector().detect(numbered_lines,
TypeError: cannot unpack non-iterable Detection object

C:\My\scancode-toolkit-30.0.0\lib\site-packages\fingerprints\cleanup.py:54: ICUWarning: Install 'pyicu' for better text transliteration.
  text = ascii_text(text)
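
The TypeError in the traceback above is what Python raises when a loop tries to unpack several values from an object that is not iterable. A minimal sketch of that failure mode follows. The `Detection` class and `detect` function here are hypothetical stand-ins written for illustration, not scancode's actual code.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    """Hypothetical stand-in for a detection result object (not scancode's class)."""
    holder: str
    start_line: int
    end_line: int


def detect(numbered_lines):
    """Hypothetical detector: yields one object per line instead of a 4-tuple."""
    for number, text in numbered_lines:
        yield Detection(holder=text, start_line=number, end_line=number)


numbered_lines = [(1, "Example Holder")]


def unpack_as_tuples():
    # Same loop shape as in the traceback: the caller expects 4-tuples.
    return [holder for _, holder, _, _ in detect(numbered_lines)]


def read_attributes():
    # Reading attributes works because nothing is unpacked.
    return [detection.holder for detection in detect(numbered_lines)]


try:
    unpack_as_tuples()
except TypeError as exc:
    error_message = str(exc)
```

In this sketch the tuple-unpacking loop fails while attribute access succeeds, which suggests a mismatch between what the caller expects and what the detector yields.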

How To Reproduce

Download and extract Qt from https://www.qt.io/download
Install Python 3.9
Download and extract scancode 30.0.0 to C:\My\scancode-toolkit-30.0.0
Run scancode once to set it up
run "C:\My\scancode-toolkit-30.0.0\Scripts\scancode" -clpieu --license-score 60 --license-text --license-text-diagnostics --only-findings --strip-root --classify --json D:\SourceCodeLicenses.json --summary --generated --consolidate -n 30 --ignore-author "\.rc$|::|$User Name$ CString|AppDomainManager|Read\. DataRecord|^the [A-Z][A-Za-z ]+$|Fred Flintstone$FST$|Microsoft Visual|Cortana" --ignore-copyright-holder "Microsoft|BCGSoft|Cortana|Basler.*(Basler|Vision)|Allied Vision|Stemmer" --ignore */BCG/* --ignore */ConfigurationManagement/Certificate.pfx/* --ignore */Salut/* --ignore */doc/* --ignore */tutorials/* --ignore *.acf --ignore *.appxmanifest --ignore *.aux --ignore *.bin --ignore *.bmp --ignore *.config --ignore *.cur --ignore *.dat --ignore *.db --ignore *.def --ignore *.hlsl --ignore *.hlsli --ignore *.ico --ignore *.ifc --ignore *.ilk --ignore *.ipch --ignore *.ism --ignore *.jpg --ignore *.lib --ignore *.manifest --ignore *.mc --ignore *.metagen --ignore *.mp4 --ignore *.nls --ignore *.obj --ignore *.pch --ignore *.pchast --ignore *.pdb --ignore *.pfx --ignore *.png --ignore *.pri --ignore *.resfiles --ignore *.resources --ignore *.rh --ignore *.rsp --ignore *.ruleset --ignore *.snk --ignore *.svd --ignore *.tlb --ignore *.tlh --ignore *.tli --ignore *.tlog --ignore *.ver --ignore *.winmd --ignore *.xbf --ignore *.xdc --ignore *.xsd D:\ExtractedQtPackage\Qt

Note that scanning other packages (e.g. MKL from Intel) using the same command succeeds.

System configuration

AMD Ryzen 9 3950X (16 core, 32 thread), 32GB RAM, M.2 SSD 970 EVO Plus 1TB
Windows 10 Enterprise LTSC (1809), Python 3.9, Scancode 30.0.0, downloaded and extracted to C:\My


FrankHeimes · 2021-10-07T12:49:36Z

After running the script again, I received more info, this time pointing to the file that caused the error:

TypeError: cannot unpack non-iterable Detection object
Path: x64/bin/Qt5WebEngineCored.dll

That file happens to be the largest one, at 548 MB.
Could that size be the culprit?

And this time, scancode actually ended and produced a summary and output for the other files.

pombredanne · 2021-10-08T05:24:56Z

@FrankHeimes Thank you for the report. That's a sizeable DLL indeed, and the likely cause of the trouble. The difficulty here is finding a delicate balance between skipping such a file entirely, and thereby missing some important information, and finding a way to get some scan data (possibly DLL metadata and basic file info) but not others (such as license and copyright details).

Another approach could be to split such a large file into arbitrary chunks (say 5 to 10MB), run scans as usual and more efficiently on these fragments, and add a special check for scannable data and results near the chunk boundaries that would require restitching and rescanning some chunk regions.
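
One way to picture the chunking idea above is to let consecutive chunks overlap, so that anything shorter than the overlap cannot be cut in half at a boundary. This is a minimal sketch under that assumption. The function name `iter_chunks` and its parameters are hypothetical, and it works on in-memory bytes rather than streaming from disk.

```python
def iter_chunks(data: bytes, chunk_size: int, overlap: int):
    """Yield (offset, chunk) pairs where each chunk repeats the last
    `overlap` bytes of the previous one."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    step = chunk_size - overlap
    for offset in range(0, len(data), step):
        yield offset, data[offset:offset + chunk_size]
        if offset + chunk_size >= len(data):
            # This chunk already reaches the end of the data.
            break


# A statement that would straddle a boundary without overlap.
needle = b"Copyright (c) 2005 Example"
data = b"x" * 25 + needle + b"y" * 25
chunks = list(iter_chunks(data, chunk_size=40, overlap=30))
found_whole = any(needle in chunk for _, chunk in chunks)
```

With an overlap of at least the longest statement length minus one, every statement lands whole in some chunk, at the cost of rescanning the overlapping regions and deduplicating results found twice.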

Yet another one could be to have a command line option to skip files above a certain size entirely.
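
A size cutoff like the one suggested above could be as simple as a filter applied before scanning. The sketch below is an illustration only. `files_within_size` and `max_bytes` are hypothetical names, not an existing scancode option.

```python
import os
import tempfile


def files_within_size(paths, max_bytes):
    """Yield only the paths whose size on disk is at most max_bytes."""
    for path in paths:
        if os.path.getsize(path) <= max_bytes:
            yield path


# Usage with two throwaway files: one small, one above the cutoff.
with tempfile.TemporaryDirectory() as tmp:
    small = os.path.join(tmp, "small.txt")
    large = os.path.join(tmp, "large.bin")
    with open(small, "wb") as handle:
        handle.write(b"a" * 10)
    with open(large, "wb") as handle:
        handle.write(b"b" * 5000)
    kept = [os.path.basename(p)
            for p in files_within_size([small, large], max_bytes=1024)]
```

The trade-off is the one described earlier in this comment: a skipped file yields no license or copyright data at all.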

What would be your take there?

FrankHeimes · 2021-10-08T08:26:01Z

@pombredanne IMHO, binary files warrant type specific scanners, because they usually have a specific structure. So copyright data can't just appear at random locations in those files. And if it does, then it is random data! For example, the (C) character followed by some arbitrary printable characters notoriously triggers false positive matches when using trivial scanners. Taking the structure of a file into account, it may be possible to just seek beyond 99.9% of the contents of a file to examine the relevant parts. This way it doesn't matter if the file is 4kB or 4GB in size.

FrankHeimes · 2021-10-08T08:42:50Z

Last night, I ran scancode on the boost sources. As a result, it reported the consolidate error on these files:

boost/typeof/vector150.hpp
boost/typeof/vector200.hpp

These files are just 1.3MB and 2.2MB in size and appear to have "innocent" content.
However, some individual lines are as long as 5KB.

pombredanne · 2021-10-08T14:10:16Z

> IMHO, binary files warrant type specific scanners, because they usually have a specific structure. So copyright data can't just appear at random locations in those files. And if it does, then it is random data! For example, the (C) character followed by some arbitrary printable characters notoriously triggers false positive matches when using trivial scanners. Taking the structure of a file into account, it may be possible to just seek beyond 99.9% of the contents of a file to examine the relevant parts. This way it doesn't matter if the file is 4kB or 4GB in size.

@FrankHeimes you are nailing it! The thing is that each format may need specific ways. But in general compressed data does not have much one can squeeze out.... but as it happens I once found GPL references in the paths from a compressed and unextracted Zip central file directory.

And I routinely find proper license and copyright in ELF and DLLs.

I guess one approach is at least to find a way to ignore most compressed files.
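
As an illustration of the Zip central directory remark above: Python's standard `zipfile` module lists member paths from the central directory without decompressing any member data. This is a sketch of a format-aware check, not how scancode handles archives. The archive contents and the `archive_member_paths` helper are made up for the example.

```python
import io
import zipfile


def archive_member_paths(zip_source):
    """Return member paths read from the central directory.

    No member data is decompressed by this call.
    """
    with zipfile.ZipFile(zip_source) as archive:
        return archive.namelist()


# Build a small in-memory archive with made-up contents.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("vendor/libfoo/COPYING.GPL", "placeholder")
    archive.writestr("src/main.c", "int main(void) { return 0; }")
buffer.seek(0)

paths = archive_member_paths(buffer)
license_hints = [path for path in paths if "gpl" in path.lower()]
```

Path names alone are only a hint, but they are cheap to read regardless of how large the compressed payload is.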

pombredanne · 2021-10-08T14:27:42Z

> Last night, I ran scancode on the boost sources. As a result, it reported the consolidate error on these files:
> boost/typeof/vector150.hpp
> boost/typeof/vector200.hpp
> These files are just 1.3MB and 2.2MB in size and appear to have "innocent" content. However, some individual lines are as long as 5KB.

The culprit is the copyright detection on these large files. The process for this is roughly explained here:
https://github.com/nexB/scancode-toolkit/blob/c09309f99c27de4ddb0c1e6e3619b833ceb2aa6e/src/cluecode/copyrights.py#L59

The process consists in:

- prepare and cleanup text
- identify regions of text that may contain copyright (using hints). These are called "candidates".
- tag the text to recognize (e.g. lex) parts-of-speech (POS) tags to identify various copyright statements parts such as dates, companies, names ("named entities"), etc. This is done using pygmars which contains a lexer derived from NLTK POS tagger.
- feed the tagged text to a parsing grammar describing actual copyright statements (also using pygmars) and obtain a parse tree.
- Walk the parse tree and yield copyright statements, holder and authors with start and end line from the parse tree with some extra post-detection cleanups.
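
The steps above can be pictured with a toy pipeline. This is a heavily simplified sketch, not scancode's implementation: the hint pattern is hypothetical, and a single regular expression stands in for the POS tagging and grammar parsing that pygmars performs. The sample text is modeled on the two detections reported in the scan output later in this comment.

```python
import re

# Hypothetical hints, far simpler than a real detector's.
HINTS = re.compile(r"copyright|\(c\)", re.IGNORECASE)

# Stand-in for the tagging and grammar steps: one fixed statement shape.
STATEMENT = re.compile(
    r"Copyright \(c\) (?P<year>\d{4}) (?P<holder>.+)", re.IGNORECASE
)


def candidate_lines(numbered_lines):
    """Keep only the lines that contain a hint (the "candidates")."""
    for number, line in numbered_lines:
        if HINTS.search(line):
            yield number, line


def detect(text):
    """Yield (line_number, holder, year) for each recognized statement."""
    numbered = enumerate(text.splitlines(), start=1)
    for number, line in candidate_lines(numbered):
        cleaned = line.lstrip("/ *#").strip()  # drop comment markers
        match = STATEMENT.search(cleaned)
        if match:
            yield number, match.group("holder"), match.group("year")


sample = (
    "\n"
    "// Copyright (C) 2005 Arkadiy Vertleyb\n"
    "// Copyright (C) 2005 Peder Holt\n"
    "//\n"
    "// some code line\n"
)
detections = list(detect(sample))
```

Because candidates are selected per line, the cost of the later steps grows with the length of each candidate line.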

The issue is that candidate detection is based on lines, and a very long line means a very long time to lex and parse, possibly to find nothing.

One solution would be to break very long lines into chunks, which is a strategy adopted for license detection and seen in action here: https://github.com/nexB/scancode-toolkit/blob/c09309f99c27de4ddb0c1e6e3619b833ceb2aa6e/src/textcode/analysis.py#L138
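
A rough sketch of what breaking long lines could look like follows. It is not the code linked above. The `split_long_line` name and the `max_len` default are assumptions made for illustration.

```python
def split_long_line(line, max_len=1000):
    """Yield pieces of at most max_len characters, breaking at a space
    when one is available and hard-cutting otherwise."""
    while len(line) > max_len:
        cut = line.rfind(" ", 0, max_len + 1)
        if cut <= 0:
            cut = max_len  # no usable space: hard cut
        yield line[:cut]
        line = line[cut:].lstrip(" ")
    if line:
        yield line


pieces = list(split_long_line("aaa bbb ccc", max_len=7))
hard_cut = list(split_long_line("abcdefghij", max_len=4))
```

Breaking at whitespace keeps words intact, but a statement longer than `max_len` would still be split, so the limit has to stay well above the length of a typical copyright line.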

In the short term, adding --timeout 1000 to keep the scan from timing out on such files would help.

On my laptop (Intel(R) Xeon(R) CPU E3-1505M v6 @ 3.00GHz, 32GB RAM) I got vector200.hpp to scan alright with a timeout of 300 seconds:

{
  "headers": [
    {
      "tool_name": "scancode-toolkit",
      "tool_version": "30.1.0",
      "options": {
        "input": [
          "vector200.hpp"
        ],
        "--copyright": true,
        "--json-pp": "-",
        "--license": true,
        "--license-text": true,
        "--license-text-diagnostics": true,
        "--timeout": "300.0"
      },
      "notice": "Generated with ScanCode and provided on an \"AS IS\" BASIS, WITHOUT WARRANTIES\nOR CONDITIONS OF ANY KIND, either express or implied. No content created from\nScanCode should be considered or used as legal advice. Consult an Attorney\nfor any legal advice.\nScanCode is a free software code scanning tool from nexB Inc. and others.\nVisit https://github.com/nexB/scancode-toolkit/ for support and download.",
      "start_timestamp": "2021-10-08T141949.792133",
      "end_timestamp": "2021-10-08T142345.461400",
      "output_format_version": "1.0.0",
      "duration": 235.66927790641785,
      "message": null,
      "errors": [],
      "extra_data": {
        "spdx_license_list_version": "3.14",
        "files_count": 1
      }
    }
  ],
  "files": [
    {
      "path": "vector200.hpp",
      "type": "file",
      "licenses": [
        {
          "key": "boost-1.0",
          "score": 59.38,
          "name": "Boost Software License 1.0",
          "short_name": "Boost 1.0",
          "category": "Permissive",
          "is_exception": false,
          "is_unknown": false,
          "owner": "Boost",
          "homepage_url": "http://www.boost.org/users/license.html",
          "text_url": "http://www.boost.org/LICENSE_1_0.txt",
          "reference_url": "https://scancode-licensedb.aboutcode.org/boost-1.0",
          "scancode_text_url": "https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/licenses/boost-1.0.LICENSE",
          "scancode_data_url": "https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/licenses/boost-1.0.yml",
          "spdx_license_key": "BSL-1.0",
          "spdx_url": "https://spdx.org/licenses/BSL-1.0",
          "start_line": 5,
          "end_line": 6,
          "matched_rule": {
            "identifier": "boost-1.0_21.RULE",
            "license_expression": "boost-1.0",
            "licenses": [
              "boost-1.0"
            ],
            "referenced_filenames": [
              "LICENSE_1_0.txt"
            ],
            "is_license_text": false,
            "is_license_notice": true,
            "is_license_reference": false,
            "is_license_tag": false,
            "is_license_intro": false,
            "has_unknown": false,
            "matcher": "3-seq",
            "rule_length": 32,
            "matched_length": 19,
            "match_coverage": 59.38,
            "rule_relevance": 100
          },
          "matched_text": "Use modification and distribution are subject to the boost Software License,\n// Version 1.0. (See [http]://[www].[boost].[org]/LICENSE_1_0.txt)."
        }
      ],
      "license_expressions": [
        "boost-1.0"
      ],
      "percentage_of_license_text": 0.01,
      "copyrights": [
        {
          "value": "Copyright (c) 2005 Arkadiy Vertleyb",
          "start_line": 2,
          "end_line": 2
        },
        {
          "value": "Copyright (c) 2005 Peder Holt",
          "start_line": 3,
          "end_line": 3
        }
      ],
      "holders": [
        {
          "value": "Arkadiy Vertleyb",
          "start_line": 2,
          "end_line": 2
        },
        {
          "value": "Peder Holt",
          "start_line": 3,
          "end_line": 3
        }
      ],
      "authors": [],
      "scan_errors": []
    }
  ]
}
Scanning done.
Summary:        licenses, copyrights with 1 process(es)
Errors count:   0
Scan Speed:     0.00 files/sec. 
Initial counts: 1 resource(s): 1 file(s) and 0 directorie(s) 
Final counts:   1 resource(s): 1 file(s) and 0 directorie(s) 
Timings:
  scan_start: 2021-10-08T141949.792133
  scan_end:   2021-10-08T142345.461400
  setup_scan:licenses: 1.37s
  setup: 1.37s
  scan: 234.30s
  total: 235.67s
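
To check results like the above across a larger scan, the JSON output can be read back and filtered on the `scan_errors` field. This is a small sketch that uses only keys visible in the output above. The `summarize_scan` helper is hypothetical, and the sample is a trimmed copy of that output.

```python
import json


def summarize_scan(scan_json_text):
    """Return the duration, file count and paths that reported scan errors."""
    scan = json.loads(scan_json_text)
    header = scan["headers"][0]
    failed = [entry["path"] for entry in scan["files"] if entry["scan_errors"]]
    return {
        "duration": header["duration"],
        "files_count": header["extra_data"]["files_count"],
        "failed_paths": failed,
    }


# Trimmed copy of the scan output shown above.
sample = json.dumps({
    "headers": [{
        "duration": 235.66927790641785,
        "extra_data": {"files_count": 1},
    }],
    "files": [{"path": "vector200.hpp", "scan_errors": []}],
})
summary = summarize_scan(sample)
```

An empty `failed_paths` list here matches the "Errors count: 0" line in the summary.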

FrankHeimes added the bug label Oct 7, 2021

sschuberth mentioned this issue Jan 27, 2022

scancode 21.8.4 starts with warnings #2686

Open

pombredanne changed the title ~~ERROR: failed to run post-scan plugin: consolidate:~~ ERROR: failed to run post-scan plugin: consolidate: and False positive on versy long lines Mar 5, 2022

pombredanne changed the title ~~ERROR: failed to run post-scan plugin: consolidate: and False positive on versy long lines~~ ERROR: failed to run post-scan plugin: consolidate: and False positive on very long lines Mar 5, 2022

pombredanne mentioned this issue Mar 5, 2022

RFC: a plan for false positive license detection #2878

Open

This was referenced Jan 31, 2023

Enable configuration for per-file timeout aboutcode-org/scancode.io#593

Closed

Evaluate using re2 Set and Filter for copyright performance boost #3236

Open
