Add new intro and concept documentation #8
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
pombredanne committed Sep 18, 2020
1 parent 64fca82 commit 9f14dab
Showing 2 changed files with 158 additions and 0 deletions.
70 changes: 70 additions & 0 deletions docs/introduction.rst
@@ -0,0 +1,70 @@
Why ScanCode.io?
================

Modern software is built from many open source packages assembled with new code.
Knowing which free and open source packages are in use matters because:

- knowing the license of third-party code is required before using it, and
- you want to avoid using buggy, outdated or vulnerable components.

Because it is so easy to include and reuse new code downloaded from the internet,
it is often surprisingly hard to get a proper inventory of all the third-party
code origins and licenses used in a software project.
There are some great tools available to scan your code and help uncover these.

And when you reuse only a few FOSS components in a single project, running one
of these tools (such as the scancode-toolkit) by hand together
with a spreadsheet may be enough to manage your software composition analysis.

But when you scale up, running automated and reproducible analysis pipelines
that are adapted to a software project's unique context and technology platform
is difficult. It requires deploying and running multiple specialized tools
and merging their results with a consistent workflow.

And now that reusing thousands of open source packages is becoming commonplace,
code scanning pipelines need to be scripted as code and run on servers backed
by a database, not on a laptop.

For instance, when you analyze Docker container images, there could be hundreds
to thousands of system packages (such as Debian, RPM, Alpine) and application
packages (such as npm, PyPI, RubyGems, Maven) installed in an image side-by-side
with your own code.

Taking care of all these can be hard. ScanCode.io can help organize these
complex code analyses as scripted pipelines and store their results in a uniform
database for automated code analysis.


What is ScanPipe
================

ScanPipe is a developer-friendly framework and application that helps software
analysts and engineers build and manage real-life software composition analysis
projects as scripted pipelines.

ScanPipe was originally developed to help boost productivity of code analysts
who work on a wide variety of software composition analysis projects.

ScanPipe provides a unified framework for the infrastructure that is
required to execute and organize these software composition analysis projects.


Should I Use ScanPipe?
----------------------

If you are working on a software composition analysis project, or you
are planning to start a new one, consider the following questions:

1. **Automation**: Is this project part of a larger compliance program and process (as opposed to a one-off) and do you need automation?
2. **Complexity**: Does the project use many third-party components or technologies?
3. **Reproducibility**: Is it important that results are reproducible, traceable and auditable?

If you answered "yes" to any of the above, keep reading - ScanPipe can help you.
If the answer is "no" to all of the above, which is a valid scenario, e.g. when you
are doing small-scale analysis, ScanPipe may provide only limited benefit for you.

The first set of available pipelines helps automate the analysis of Docker
"container" images and virtual machine (VM) disk images that often harbor
comprehensive software stacks from an operating system with its kernel through
system and application packages to original and custom applications.


88 changes: 88 additions & 0 deletions docs/scanpipe-concepts.rst
@@ -0,0 +1,88 @@
ScanPipe Concepts
=================

Project
-------

A project encapsulates the analysis of software code:

- it has a workspace, which is a directory that contains the software code files under analysis
- it is related to one or more code analysis pipeline scripts that automate its analysis
- it tracks the project Codebase Resources, e.g. its code files and directories
- it tracks the project Discovered Packages, e.g. the system and application packages discovered in the codebase, with their origin and license

Multiple analysis pipelines can be run on a single project.

In the database, a project is identified by its unique name.


Project workspace
-----------------

A project workspace is the root directory where all the project files are stored.

The following directories exist under this directory:

- `input/` contains all the original uploaded and input files used by the project. For instance, it could be a codebase archive.
- `codebase/` contains the files and directories (also known as resources) tracked as CodebaseResource records in the database.
- `output/` contains all output files created by the pipelines: reports, scan results, etc.
- `tmp/` is a scratch pad for temporary files generated during the pipeline runs.
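
As an illustration of the layout above, here is a minimal sketch that creates the four workspace directories with the standard library. The function name ``create_workspace`` and the constant ``WORKSPACE_DIRS`` are hypothetical and are not part of ScanPipe; only the directory names come from the list above.

```python
import tempfile
from pathlib import Path

# Directory names taken from the workspace description above.
WORKSPACE_DIRS = ("input", "codebase", "output", "tmp")


def create_workspace(root):
    """Create the workspace subdirectories under `root` (hypothetical helper).

    Returns a mapping of directory name to its path.
    """
    root = Path(root)
    paths = {}
    for name in WORKSPACE_DIRS:
        path = root / name
        # parents=True also creates the workspace root if it is missing.
        path.mkdir(parents=True, exist_ok=True)
        paths[name] = path
    return paths


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        workspace = create_workspace(Path(tmp) / "my-project")
        for name, path in workspace.items():
            print(name, path.is_dir())
```

Calling the helper twice on the same root is harmless because existing directories are left untouched.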


Pipelines
---------

A pipeline is a Python script that contains a series of steps, run in order
from start to end, to perform a code analysis.

It usually starts from the uploaded input files, which it may extract, and then
generates CodebaseResource records in the database accordingly.

Those resources can then be analyzed, scanned and matched as needed.
Analysis results and reports are eventually posted at the end of the pipeline run.

For now, all pipelines are located in the `scanpipe.pipelines` module.
Each pipeline consists of a Python script that includes one subclass of the "Pipeline" class.
Each step is a method of that Pipeline subclass decorated with the @step decorator.
At its end, a step states which step to execute next.
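
The step mechanism described above can be sketched as follows. This is a self-contained toy, not the ``scanpipe.pipelines`` code: the ``Pipeline`` base class, the ``step`` decorator, the ``next()`` method and the step names are all stand-ins written for this example, and the real signatures may differ.

```python
def step(method):
    """Mark a method as a pipeline step (hypothetical stand-in decorator)."""
    method.is_step = True
    return method


class Pipeline:
    """Toy base class: runs steps, starting at `start`, until none is scheduled."""

    def __init__(self):
        self._next_step = None
        self.executed = []

    def next(self, method):
        # Called at the end of a step to state which step runs next.
        self._next_step = method

    def run(self):
        current = self.start
        while current is not None:
            if not getattr(current, "is_step", False):
                raise TypeError(f"{current.__name__} is not decorated with @step")
            self._next_step = None
            current()
            self.executed.append(current.__name__)
            current = self._next_step


class ExamplePipeline(Pipeline):
    @step
    def start(self):
        self.next(self.collect_resources)

    @step
    def collect_resources(self):
        self.next(self.end)

    @step
    def end(self):
        pass  # No next step: the run stops here.


pipeline = ExamplePipeline()
pipeline.run()
print(pipeline.executed)  # ['start', 'collect_resources', 'end']
```

Letting each step name its successor keeps the control flow explicit in the script itself, at the cost of a little boilerplate at the end of every step.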

One or more pipelines can be assigned to a project as a sequence.
If one pipeline of a sequence completes successfully, the next pipeline in
the queue for this project is run automatically, until all pipelines are executed.
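
A rough sketch of that queueing behavior is shown below. It assumes that a failing pipeline stops the sequence, which the text above implies but does not state. The function ``run_pipeline_queue`` and the pipeline names are hypothetical, and pipelines are modeled as plain callables.

```python
from collections import deque


def run_pipeline_queue(pipelines):
    """Run (name, callable) pairs in order and stop at the first failure.

    Returns a tuple: (completed names, failed name or None, names not run).
    """
    queue = deque(pipelines)
    completed = []
    failed = None
    while queue:
        name, run = queue.popleft()
        try:
            run()
        except Exception:
            # Assumption: a failure halts the rest of the sequence.
            failed = name
            break
        completed.append(name)
    not_run = [name for name, _ in queue]
    return completed, failed, not_run


def ok():
    pass


def broken():
    raise RuntimeError("scan failed")


print(run_pipeline_queue([("extract", ok), ("scan", broken), ("report", ok)]))
# (['extract'], 'scan', ['report'])
```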


Codebase Resources
------------------

A project's Codebase Resources are records of its code files and directories.
CodebaseResource is a database model and each record is identified by its path
under the project workspace.

Some interesting CodebaseResource attributes are:

- a status used to track the analysis status for this resource
- a type (such as file, directory or symlink)
- various attributes to track detected copyrights, license expressions, copyright holders and related packages

In general, the attributes and their names are the same as those used in ScanCode-Toolkit for files.
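
To make the shape of such a record concrete, here is a sketch using a plain dataclass rather than the actual database model. The class name ``CodebaseResourceSketch`` and the exact field names are assumptions for illustration only; the real CodebaseResource model may name or type them differently.

```python
from dataclasses import dataclass, field


@dataclass
class CodebaseResourceSketch:
    """Illustrative stand-in for a CodebaseResource record (not the real model)."""

    path: str  # Path under the project workspace; identifies the record.
    type: str = "file"  # Such as "file", "directory" or "symlink".
    status: str = ""  # Tracks the analysis status for this resource.
    copyrights: list = field(default_factory=list)
    holders: list = field(default_factory=list)
    license_expressions: list = field(default_factory=list)
    packages: list = field(default_factory=list)  # Related packages.


# Records keyed by path, mirroring "each record is identified by its path".
resources = {}
for resource in (
    CodebaseResourceSketch(path="codebase/src", type="directory"),
    CodebaseResourceSketch(path="codebase/src/app.py", status="scanned"),
):
    resources[resource.path] = resource

print(resources["codebase/src/app.py"].status)  # scanned
```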


Discovered Packages
-------------------

A project's Discovered Packages are records of the system and application packages
discovered in its code.
DiscoveredPackage is a database model and each record is identified by its Package URL.
Package URL is a grassroots effort to create informative identifiers for software
packages such as Debian, RPM, npm, Maven and PyPI packages. See https://github.com/package-url for details.
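
For readers unfamiliar with Package URLs, the sketch below assembles one from a type, an optional namespace, a name and an optional version, following the basic ``pkg:type/namespace/name@version`` layout. It is deliberately simplified: the helper ``build_purl`` is hypothetical, and it omits qualifiers, subpaths and the type-specific normalization rules of the Package URL specification.

```python
from urllib.parse import quote


def build_purl(type, name, version=None, namespace=None):
    """Build a simplified Package URL string (hypothetical helper).

    Omits qualifiers, subpath and per-type normalization rules.
    """
    parts = ["pkg:", type.lower(), "/"]
    if namespace:
        # Each namespace segment is percent-encoded separately.
        segments = [quote(segment, safe="") for segment in namespace.split("/")]
        parts.append("/".join(segments) + "/")
    parts.append(quote(name, safe=""))
    if version:
        parts.append("@" + quote(version, safe=""))
    return "".join(parts)


print(build_purl("pypi", "django", "3.1"))
# pkg:pypi/django@3.1
print(build_purl("deb", "curl", "7.50.3-1", namespace="debian"))
# pkg:deb/debian/curl@7.50.3-1
```

For anything beyond illustration, a maintained implementation such as the ``packageurl-python`` library is a better choice than hand-built strings.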


Some interesting DiscoveredPackage attributes are:

- type, name, version (all Package URL attributes)
- homepage_url, download_url and other URLs
- checksums (such as SHA1, MD5)
- copyright, license_expression, declared_license


In general, the attributes and their names are the same as those used in ScanCode-Toolkit for packages.
