Provide header level details for a scan: Enhance scancode to include a log or history with useful statistics #211

Closed
DennisClark opened this issue Feb 24, 2016 · 15 comments · Fixed by #1285

Comments

@DennisClark
Contributor

A run of scancode should generate a log file with meaningful statistics, including such things as:

  • the version number of scancode that was executed
  • start date/time
  • end date/time
  • elapsed time
  • name of the library that was scanned
  • number of files scanned
  • {{anything else that makes sense and is useful}}
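To make the list above concrete, here is a minimal sketch of how such run-level statistics might be collected. The `ScanStats` record and the `collect_stats` helper are hypothetical names used only for illustration; they are not part of ScanCode. The `scan` argument is assumed to be any callable that performs the scan and returns the number of files scanned.

```python
from dataclasses import dataclass, asdict
from datetime import datetime
import json


@dataclass
class ScanStats:
    """Hypothetical record of run-level statistics (not a ScanCode class)."""
    tool_version: str
    scanned_name: str
    start: str
    end: str
    elapsed_seconds: float
    files_count: int


def collect_stats(tool_version, scanned_name, scan):
    """Call `scan` (assumed to return a file count) and time the run."""
    start = datetime.now()
    files_count = scan()
    end = datetime.now()
    return ScanStats(
        tool_version=tool_version,
        scanned_name=scanned_name,
        start=start.isoformat(),
        end=end.isoformat(),
        elapsed_seconds=(end - start).total_seconds(),
        files_count=files_count,
    )


# Stand-in scan that pretends 33 files were scanned.
stats = collect_stats("0.0.0", "samples", lambda: 33)
log_text = json.dumps(asdict(stats), indent=2)
```

Serializing the record as JSON keeps the log machine-readable, which would also make it easy to embed in the regular scan output.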
@balusarakesh
Collaborator

This log file could also record the reason for the failure when a scan is interrupted partway through.

@jdaguil jdaguil added this to the v2.0 milestone Mar 2, 2016
@pombredanne pombredanne modified the milestones: v2.0, v2.1 Aug 5, 2016
@pombredanne pombredanne modified the milestones: v2.1, v2.3 Oct 4, 2017
@pombredanne pombredanne changed the title Enhance scancode to generate a log file with useful statistics Provide header level details for a scacn: Enhance scancode to generate a log file with useful statistics Oct 4, 2017
@pombredanne
Contributor

See also aboutcode-org/aboutcode#7

@pombredanne pombredanne changed the title Provide header level details for a scacn: Enhance scancode to generate a log file with useful statistics Provide header level details for a scan: Enhance scancode to generate a log file with useful statistics Oct 17, 2017
@pombredanne pombredanne changed the title Provide header level details for a scan: Enhance scancode to generate a log file with useful statistics Provide header level details for a scan: Enhance scancode to include a log or history with useful statistics Oct 30, 2017
@sschuberth
Collaborator

As discussed in #840, having a summary of errors (e.g. in the header of the regular JSON output file) would also be beneficial.

@pombredanne
Contributor

There are some improvements in develop since #885. More work is still needed on this, though.

pombredanne added a commit that referenced this issue Jul 11, 2018
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
@pombredanne pombredanne modified the milestones: v2.3, v3.0 Nov 4, 2018
pombredanne added a commit that referenced this issue Nov 14, 2018
* This is a new data structure as designed in
  aboutcode-org/aboutcode#7
* For now, the old header-level data have been kept

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
pombredanne added a commit that referenced this issue Nov 14, 2018
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
pombredanne added a commit that referenced this issue Nov 27, 2018
* This is a new data structure as designed in
  aboutcode-org/aboutcode#7
* For now, the old header-level data have been kept

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
pombredanne added a commit that referenced this issue Nov 27, 2018
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
pombredanne added a commit that referenced this issue Nov 27, 2018
This is the original attribute name we had agreed to

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
pombredanne added a commit that referenced this issue Nov 27, 2018
As suggested by @sschuberth in
aboutcode-org/aboutcode#7 (comment)

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
pombredanne added a commit that referenced this issue Nov 27, 2018
 * Remove the top level attributes scancode_notice, scancode_version,
   etc., and move the top level files_count to an extra_data header
   attribute.

 * Update all outputs and tests accordingly
 * other minor refactorings
  * rename plugincode.output.OutputPlugin.get_results to get_files
  * remove scancode.resource.Codebase.get_headings, now obsolete

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
@pombredanne
Contributor

GH closed this automatically.... reopening!

@pombredanne pombredanne reopened this Nov 29, 2018
@pombredanne
Contributor

Here is what we have now:
$ ./scancode -clip -n4 --summary --json-pp j.son samples

{
  "headers": [
    {
      "tool_name": "scancode-toolkit",
      "tool_version": "2.9.7.post183.795fcc4",
      "options": {
        "input": "samples",
        "--copyright": true,
        "--info": true,
        "--json-pp": "j.son",
        "--license": true,
        "--package": true,
        "--processes": "4",
        "--summary": true
      },
      "notice": "Generated with ScanCode and provided on an \"AS IS\" BASIS, WITHOUT WARRANTIES\nOR CONDITIONS OF ANY KIND, either express or implied. No content created from\nScanCode should be considered or used as legal advice. Consult an Attorney\nfor any legal advice.\nScanCode is a free software code scanning tool from nexB Inc. and others.\nVisit https://github.com/nexB/scancode-toolkit/ for support and download.",
      "start_timestamp": "2018-11-29T072242.399469",
      "end_timestamp": "2018-11-29T072250.772025",
      "message": null,
      "errors": [],
      "extra_data": {
        "files_count": 33
      }
    }
  ],
...

and then the same file reprocessed

$ ./scancode --from-json j.son --only-findings --json-pp j2.son --csv j2.csv

{
  "headers": [
    {
      "tool_name": "scancode-toolkit",
      "tool_version": "2.9.7.post183.795fcc4",
      "options": {
        "input": "samples",
        "--copyright": true,
        "--info": true,
        "--json-pp": "j.son",
        "--license": true,
        "--package": true,
        "--processes": "4",
        "--summary": true
      },
      "notice": "Generated with ScanCode and provided on an \"AS IS\" BASIS, WITHOUT WARRANTIES\nOR CONDITIONS OF ANY KIND, either express or implied. No content created from\nScanCode should be considered or used as legal advice. Consult an Attorney\nfor any legal advice.\nScanCode is a free software code scanning tool from nexB Inc. and others.\nVisit https://github.com/nexB/scancode-toolkit/ for support and download.",
      "start_timestamp": "2018-11-29T072242.399469",
      "end_timestamp": "2018-11-29T072250.772025",
      "message": null,
      "errors": [],
      "extra_data": {
        "files_count": 33
      }
    },
    {
      "tool_name": "scancode-toolkit",
      "tool_version": "2.9.7.post183.795fcc4",
      "options": {
        "input": "j.son",
        "--csv": "j2.csv",
        "--from-json": true,
        "--json-pp": "j2.son",
        "--only-findings": true
      },
      "notice": "Generated with ScanCode and provided on an \"AS IS\" BASIS, WITHOUT WARRANTIES\nOR CONDITIONS OF ANY KIND, either express or implied. No content created from\nScanCode should be considered or used as legal advice. Consult an Attorney\nfor any legal advice.\nScanCode is a free software code scanning tool from nexB Inc. and others.\nVisit https://github.com/nexB/scancode-toolkit/ for support and download.",
      "start_timestamp": "2018-11-29T072338.656792",
      "end_timestamp": "2018-11-29T072338.691748",
      "message": null,
      "errors": [],
      "extra_data": {
        "files_count": 0
      }
    }
  ],
...
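The two outputs above suggest that each processing step appends one record to the `headers` list and leaves the earlier records untouched. A minimal sketch of that pattern follows. The `append_header` helper is hypothetical and is not ScanCode's actual implementation; the field names are copied from the output above, and the `notice` field is left out for brevity.

```python
def append_header(scan, tool_name, tool_version, options,
                  start_timestamp, end_timestamp, files_count):
    """Append a new header record to `scan`; earlier records are kept as-is."""
    header = {
        "tool_name": tool_name,
        "tool_version": tool_version,
        "options": options,
        "start_timestamp": start_timestamp,
        "end_timestamp": end_timestamp,
        "message": None,
        "errors": [],
        "extra_data": {"files_count": files_count},
    }
    scan.setdefault("headers", []).append(header)
    return scan


# A scan that already carries the header of the first run.
scan = {
    "headers": [
        {
            "tool_name": "scancode-toolkit",
            "options": {"input": "samples"},
            "start_timestamp": "2018-11-29T072242.399469",
        }
    ]
}

# Reprocessing adds a second record after the first one.
append_header(
    scan,
    tool_name="scancode-toolkit",
    tool_version="2.9.7.post183.795fcc4",
    options={"input": "j.son", "--from-json": True},
    start_timestamp="2018-11-29T072338.656792",
    end_timestamp="2018-11-29T072338.691748",
    files_count=0,
)
```

With this approach the oldest record stays first in the list, which matches the ordering shown in the output above.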

@pombredanne
Contributor

The only question left is about ordering: for now the topmost header item is the oldest, not the newest. Would it be better to order them the other way around?

@pombredanne
Contributor

@sschuberth we now also have a global errors attribute in the header. See this example:
$ ./scancode -clipeu --json-pp - --timeout 0.000001 --verbose tests/scancode/data/failing/patchelf.pdf

{
  "headers": [
    {
      "tool_name": "scancode-toolkit",
      "tool_version": "2.9.7.post183.795fcc4",
      "options": {
        "input": "tests/scancode/data/failing/patchelf.pdf",
        "--copyright": true,
        "--email": true,
        "--info": true,
        "--json-pp": "-",
        "--license": true,
        "--package": true,
        "--timeout": "1e-06",
        "--url": true,
        "--verbose": true
      },
      "notice": "Generated with ScanCode and provided on an \"AS IS\" BASIS, WITHOUT WARRANTIES\nOR CONDITIONS OF ANY KIND, either express or implied. No content created from\nScanCode should be considered or used as legal advice. Consult an Attorney\nfor any legal advice.\nScanCode is a free software code scanning tool from nexB Inc. and others.\nVisit https://github.com/nexB/scancode-toolkit/ for support and download.",
      "start_timestamp": "2018-11-29T073446.380441",
      "end_timestamp": "2018-11-29T073448.617701",
      "message": null,
      "errors": [
        "Path: patchelf.pdf\n  ERROR: for scanner: info:\n  ERROR: Processing interrupted: timeout after 0 seconds.\n  ERROR: for scanner: licenses:\n  ERROR: Processing interrupted: timeout after 0 seconds.\n  ERROR: for scanner: copyrights:\n  ERROR: Processing interrupted: timeout after 0 seconds.\n  ERROR: for scanner: packages:\n  ERROR: Processing interrupted: timeout after 0 seconds.\n  ERROR: for scanner: emails:\n  ERROR: Processing interrupted: timeout after 0 seconds.\n  ERROR: for scanner: urls:\n  ERROR: Processing interrupted: timeout after 0 seconds."
      ],
      "extra_data": {
        "files_count": 1
      }
    }
...
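Since every header carries its own `errors` list, a consumer could build an error summary by walking the headers. The sketch below is only an illustration: `summarize_errors` is a hypothetical helper, and it assumes that each error message starts with a `Path: <name>` line as in the output above, which may not hold for every error.

```python
def summarize_errors(scan):
    """Collect the errors of all headers into a flat list of small records."""
    summary = []
    for header in scan.get("headers", []):
        for error in header.get("errors", []):
            lines = error.splitlines()
            first_line = lines[0] if lines else ""
            # Assumption: the first line looks like "Path: <name>".
            prefix = "Path: "
            path = first_line[len(prefix):] if first_line.startswith(prefix) else None
            summary.append({
                "tool_name": header.get("tool_name"),
                "path": path,
                "message": error,
            })
    return summary


# Shortened version of the error shown above.
scan = {
    "headers": [
        {
            "tool_name": "scancode-toolkit",
            "errors": [
                "Path: patchelf.pdf\n  ERROR: for scanner: info:\n"
                "  ERROR: Processing interrupted: timeout after 0 seconds."
            ],
        },
        {"tool_name": "scancode-toolkit", "errors": []},
    ]
}

error_summary = summarize_errors(scan)
```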

@sschuberth
Collaborator

The only question left is about ordering

I don't really think it matters, as to be on the safe side you should always sort by start_timestamp anyway. But what about the data following the header? How do you know which data belongs to which header? Or will there always only be data from the last run in the file?
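As a rough sketch of sorting by `start_timestamp`: the format string below is an assumption inferred from the sample values in this thread (such as `2018-11-29T072242.399469`), and the helper names are hypothetical.

```python
from datetime import datetime

# Assumed format, inferred from the sample timestamps in this thread.
TIMESTAMP_FORMAT = "%Y-%m-%dT%H%M%S.%f"


def parse_timestamp(value):
    """Parse a header timestamp string into a datetime."""
    return datetime.strptime(value, TIMESTAMP_FORMAT)


def headers_oldest_first(headers):
    """Return the headers sorted by start time, whatever their stored order."""
    return sorted(headers, key=lambda h: parse_timestamp(h["start_timestamp"]))


def elapsed_seconds(header):
    """Elapsed time of one run, computed from its two timestamps."""
    start = parse_timestamp(header["start_timestamp"])
    end = parse_timestamp(header["end_timestamp"])
    return (end - start).total_seconds()


# Timestamps taken from the two runs above, deliberately stored newest first.
headers = [
    {"start_timestamp": "2018-11-29T072338.656792",
     "end_timestamp": "2018-11-29T072338.691748"},
    {"start_timestamp": "2018-11-29T072242.399469",
     "end_timestamp": "2018-11-29T072250.772025"},
]

ordered = headers_oldest_first(headers)
first_run_seconds = elapsed_seconds(ordered[0])
```

Parsing the strings rather than comparing them directly keeps the sort correct even if the textual format changes later, at the cost of updating the format string in one place.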

@pombredanne
Contributor

@sschuberth re

But what about the data following the header? How do you know which data belongs to which header? Or will there always only be data from the last run in the file?

The file holds the data as it is after the last run. Tracking actual changes is something to do outside.
Here the headers is just a way to document the fact a file was created through multiple tools touching it, such as multiple scancode runs, editing in aboutcode manager, matching against an index, etc

@sschuberth
Collaborator

Here the headers is just a way to document the fact a file was created through multiple tools touching it

I see. Another idea to make this more clear would be to always only keep one top-level header, and move headers from previous processing steps e.g. to the existing extra_data field.

@pombredanne
Contributor

@sschuberth re

Another idea to make this more clear would be to always only keep one top-level header, and move headers from previous processing steps e.g. to the existing extra_data field.

I am not inclined to go that way: this would mean that each tool that updates the header would need to move several data bits around instead of just appending a whole new record. I would prefer keep this simpler way unless you feel strongly about it
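For comparison, here is a sketch of what the single-header alternative might require from each tool. Everything in it is hypothetical: the `previous_headers` key and the `nest_previous_headers` helper were not proposed in this thread and only illustrate the extra shuffling involved.

```python
def nest_previous_headers(scan, new_header):
    """Keep one top-level header and move the older ones under extra_data."""
    previous = scan.get("headers", [])
    header = dict(new_header)
    extra_data = dict(header.get("extra_data") or {})
    # Hypothetical key name, used only for this illustration.
    extra_data["previous_headers"] = previous
    header["extra_data"] = extra_data
    scan["headers"] = [header]
    return scan


scan = {"headers": []}
nest_previous_headers(scan, {"tool_name": "first-run", "extra_data": {"files_count": 33}})
nest_previous_headers(scan, {"tool_name": "second-run", "extra_data": {"files_count": 0}})

top = scan["headers"][0]
nested = top["extra_data"]["previous_headers"]
```

Each step has to rebuild `extra_data` and the history nests one level deeper per run, whereas appending a record only needs a single list operation.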

@sschuberth
Collaborator

I would prefer keep this simpler way unless you feel strongly about it

No, not strongly enough 😉

@pombredanne
Contributor

@sschuberth thanks!

@pombredanne
Contributor

I am closing at last as this is now merged in develop.
Thank you all for the help and review
