Add new intro and concept documentation #8
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
1 parent
64fca82
commit 9f14dab
Showing 2 changed files with 158 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,70 @@ | ||
Why ScanCode.io? | ||
================ | ||
|
||
Modern software is built from many open source packages assembled with new code. | ||
Knowing which free and open source code packages are in use matters because: | ||
|
||
- knowing the license of third-party code is required before using it, and | ||
- you want to avoid using buggy, outdated or vulnerable components. | ||
|
||
Because it is so easy to include and reuse new code downloaded from the internet, | ||
it is often surprisingly hard to get a proper inventory of all the third-party | ||
code origins and licenses used in a software project. | ||
There are some great tools available to scan your code and help uncover these. | ||
|
||
And when you reuse only a few FOSS components in a single project, running one | ||
of these tools (such as the scancode-toolkit) by hand together | ||
with a spreadsheet may be enough to manage your software composition analysis. | ||
|
||
But when you scale up, running automated and reproducible analysis pipelines | ||
that are adapted to a software project's unique context and technology platform is | ||
difficult. This requires deploying and running multiple specialized tools | ||
and merging their results with a consistent workflow. | ||
|
||
And as reusing thousands of open source packages becomes commonplace, | ||
code scanning pipelines need to be scripted as code and run on servers backed | ||
by a database, not on a laptop. | ||
|
||
For instance, when you analyze Docker container images, there could be hundreds | ||
to thousands of system packages (such as Debian, RPM, Alpine) and application | ||
packages (such as npm, PyPI, Rubygems, Maven) installed in an image side-by-side | ||
with your own code. | ||
|
||
Taking care of all these can be hard. ScanCode.io can help organize these | ||
complex code analyses as scripted pipelines and store their results in a uniform | ||
database for automated code analysis. | ||
|
||
|
||
What is ScanPipe | ||
================ | ||
|
||
ScanPipe is a developer-friendly framework and application that helps software | ||
analysts and engineers build and manage real-life software composition analysis | ||
projects as scripted pipelines. | ||
|
||
ScanPipe was originally developed to help boost productivity of code analysts | ||
who work on a wide variety of software composition analysis projects. | ||
|
||
ScanPipe provides a unified framework for the infrastructure that is | ||
required to execute and organize these software composition analysis projects. | ||
|
||
|
||
Should I Use ScanPipe? | ||
---------------------- | ||
|
||
If you are working on a software composition analysis project, or you | ||
are planning to start a new one, consider the following questions: | ||
|
||
1. **Automation**: Is this project part of a larger compliance program and process (as opposed to a one-off) and do you need automation? | ||
2. **Complexity**: Does the project use many third-party components or technologies? | ||
3. **Reproducibility**: Is it important that results are reproducible, traceable and auditable? | ||
|
||
If you answered "yes" to any of the above, keep reading - ScanPipe can help you. | ||
If the answer is "no" to all of the above, which is a valid scenario, e.g. when you | ||
are doing small-scale analysis, ScanPipe may provide only limited benefit for you. | ||
|
||
The first set of available pipelines helps automate the analysis of Docker | ||
"container" images and virtual machine (VM) disk images that often harbor | ||
comprehensive software stacks from an operating system with its kernel through | ||
system and application packages to original and custom applications. | ||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,88 @@ | ||
ScanPipe Concepts | ||
================= | ||
|
||
Project | ||
------- | ||
|
||
A project encapsulates the analysis of software code: | ||
|
||
- it has a workspace, which is a directory that contains the software code files under analysis | ||
- it is related to one or more code analysis pipeline scripts that automate its analysis | ||
- it tracks the project's Codebase Resources, e.g. its code files and directories | ||
- it tracks the project's Discovered Packages, e.g. the origin and license of the system and application packages discovered in the codebase | ||
|
||
Multiple analysis pipelines can be run on a single project. | ||
|
||
In the database, a project is identified by its unique name. | ||
|
||
|
||
Project workspace | ||
----------------- | ||
|
||
A project workspace is the root directory where all the project files are stored. | ||
|
||
The following directories exist under this directory: | ||
|
||
- `input/` contains all the original uploaded and input files used by the project. For instance, it could be a codebase archive. | ||
- `codebase/` contains the files and directories (a.k.a. resources) tracked as CodebaseResource records in the database. | ||
- `output/` contains all output files created by the pipelines: reports, scan results, etc. | ||
- `tmp/` is a scratch pad for temporary files generated during pipeline runs. | ||
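
The layout above can be sketched with the standard library. This is only an illustration of the directory structure: the `create_workspace` helper and the `my-project` name are hypothetical and are not part of ScanPipe.

```python
import tempfile
from pathlib import Path

# Subdirectory names taken from the list above.
WORKSPACE_DIRS = ("input", "codebase", "output", "tmp")


def create_workspace(root):
    """Create the four workspace subdirectories under `root`.

    Hypothetical helper for illustration; returns a dict of name -> Path.
    """
    root = Path(root)
    paths = {}
    for name in WORKSPACE_DIRS:
        path = root / name
        # exist_ok=True makes the call safe to repeat on an existing workspace.
        path.mkdir(parents=True, exist_ok=True)
        paths[name] = path
    return paths


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp_root:
        workspace = create_workspace(Path(tmp_root) / "my-project")
        for name, path in workspace.items():
            print(name, path.is_dir())
```
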
|
||
|
||
Pipelines | ||
--------- | ||
|
||
A pipeline is a Python script that contains a series of steps, from start to end, | ||
that run in order to perform a code analysis. | ||
|
||
It usually starts from the uploaded input files, and may extract them and then | ||
generate CodebaseResource records in the database accordingly. | ||
|
||
Those resources can then be analyzed, scanned, and matched as needed. | ||
Analysis results and reports are eventually posted at the end of the pipeline run. | ||
|
||
For now, all pipelines are located in the `scanpipe.pipelines` module. | ||
Each pipeline consists of a Python script including one subclass of the "Pipeline" class. | ||
Each step is a method of the Pipeline class decorated with the @step decorator. | ||
At its end, a step states which step to execute next. | ||
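
To make the step mechanics concrete, here is a self-contained toy model of the idea. It is not the ScanPipe API: the `step` decorator, the `Pipeline` base class, its `next()` and `run()` methods, and the step names are all stand-ins written for this sketch, assuming that a run begins at a method named `start`.

```python
def step(method):
    """Mark a method as a pipeline step (hypothetical stand-in decorator)."""
    method.is_step = True
    return method


class Pipeline:
    """Toy base class: runs steps from `start` until a step names no successor."""

    def __init__(self):
        self._next_step = None
        self.executed = []

    def next(self, method):
        # Called at the end of a step to state which step runs next.
        self._next_step = method

    def run(self):
        current = self.start
        while current is not None:
            if not getattr(current, "is_step", False):
                raise TypeError(f"{current.__name__} is not decorated with @step")
            self._next_step = None
            current()
            self.executed.append(current.__name__)
            current = self._next_step


class ExamplePipeline(Pipeline):
    @step
    def start(self):
        self.next(self.collect_files)

    @step
    def collect_files(self):
        self.next(self.end)

    @step
    def end(self):
        pass  # no successor is named, so the run stops here


if __name__ == "__main__":
    pipeline = ExamplePipeline()
    pipeline.run()
    print(pipeline.executed)
```

Letting each step name its successor keeps the execution order explicit in the script itself, which fits the goal of reproducible, auditable analysis.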
|
||
One or more pipelines can be assigned to a project as a sequence. | ||
If one pipeline in a sequence completes successfully, the next pipeline in the | ||
queue for this project is run automatically until all pipelines are executed. | ||
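
The queue behavior can be sketched as follows. The `run_queue` function is hypothetical, and the sketch assumes that a failed pipeline stops the queue, which the text implies but does not state outright.

```python
from collections import deque


def run_queue(pipelines):
    """Run (name, callable) pairs in order and stop at the first failure.

    Each callable returns True on success. Returns the names of the
    pipelines that completed successfully. Hypothetical helper.
    """
    queue = deque(pipelines)
    completed = []
    while queue:
        name, run = queue.popleft()
        if not run():
            break  # assumption: later pipelines are left unexecuted
        completed.append(name)
    return completed


if __name__ == "__main__":
    names = run_queue([
        ("extract", lambda: True),
        ("scan", lambda: True),
    ])
    print(names)
```
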
|
||
|
||
Codebase Resources | ||
------------------ | ||
|
||
A project's Codebase Resources are records of its code files and directories. | ||
CodebaseResource is a database model and each record is identified by its path | ||
under the project workspace. | ||
|
||
Some of the interesting CodebaseResource attributes are: | ||
|
||
- a status used to track the analysis status for this resource. | ||
- a type (such as file, directory or symlink) | ||
- various attributes to track detected copyrights, license expressions, copyright holders, related packages. | ||
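
As a rough picture of such a record, the sketch below uses a plain dataclass rather than a database model. The class name `ResourceRecord`, the field names, and the `"scanned"` status value are illustrative guesses based on the list above; the real CodebaseResource model may differ.

```python
from dataclasses import dataclass, field


@dataclass
class ResourceRecord:
    """Illustrative stand-in for a CodebaseResource record (not the real model)."""

    path: str            # identifies the record, relative to the project workspace
    type: str = "file"   # "file", "directory" or "symlink"
    status: str = ""     # analysis status for this resource
    copyrights: list = field(default_factory=list)
    holders: list = field(default_factory=list)
    license_expressions: list = field(default_factory=list)


if __name__ == "__main__":
    resource = ResourceRecord(path="codebase/src/main.c")
    resource.license_expressions.append("mit")
    resource.status = "scanned"  # hypothetical status value
    print(resource)
```
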
|
||
In general, the attributes and their names are the same as those used in ScanCode-Toolkit for files. | ||
|
||
|
||
Discovered Packages | ||
------------------- | ||
|
||
A project's Discovered Packages are records of the system and application packages | ||
discovered in its code. | ||
DiscoveredPackage is a database model and each record is identified by its Package URL. | ||
Package URL is a grassroots effort to create informative identifiers for software | ||
packages such as Debian, RPM, npm, Maven, and PyPI packages. See https://github.com/package-url for details. | ||
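
For orientation, a Package URL has the general shape `pkg:type/namespace/name@version`. The sketch below builds that basic form only. The `make_purl` helper is hypothetical and deliberately simplified: it ignores qualifiers, subpaths, and the type-specific normalization rules of the full specification.

```python
from urllib.parse import quote


def make_purl(pkg_type, name, version=None, namespace=None):
    """Build a basic Package URL string (simplified, hypothetical helper)."""
    parts = [pkg_type.lower()]
    if namespace:
        # A namespace may have several segments, e.g. a Maven group or a GitHub owner.
        parts.extend(quote(segment, safe="") for segment in namespace.split("/"))
    parts.append(quote(name, safe=""))
    purl = "pkg:" + "/".join(parts)
    if version:
        purl += "@" + quote(version, safe="")
    return purl


if __name__ == "__main__":
    print(make_purl("npm", "lodash", "4.17.21"))
    print(make_purl("deb", "curl", "7.50.3-1", namespace="debian"))
```
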
|
||
|
||
Some of the interesting DiscoveredPackage attributes are: | ||
|
||
- type, name, version (all Package URL attributes) | ||
- homepage_url, download_url and other URLs | ||
- checksums (such as SHA1, MD5) | ||
- copyright, license_expression, declared_license | ||
|
||
|
||
In general, the attributes and their names are the same as those used in ScanCode-Toolkit for packages. |