Add new intro and concept documentation #8

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
aboutcode-org · Sep 18, 2020 · 9f14dab
1 parent 64fca82
commit 9f14dab

Showing 2 changed files with 158 additions and 0 deletions.
diff --git a/docs/introduction.rst b/docs/introduction.rst
@@ -0,0 +1,70 @@
+Why ScanCode.io?
+================
+
+Modern software is built from many open source packages assembled with new code.
+Knowing which free and open source packages are in use matters because:
+
+- knowing the license of third-party code is required before using it, and
+- you want to avoid using buggy, outdated or vulnerable components.
+
+Because it is so easy to include and reuse new code downloaded from the internet,
+it is often surprisingly hard to get a proper inventory of all the third-party
+code origins and licenses used in a software project.
+There are some great tools available to scan your code and help uncover these.
+
+And when you reuse only a few FOSS components in a single project, running one
+of these tools (such as the scancode-toolkit) by hand together
+with a spreadsheet may be enough to manage your software composition analysis.
+
+But when you scale up, running automated and reproducible analysis pipelines
+that are adapted to a software project's unique context and technology platform
+is difficult. It requires deploying and running multiple specialized tools
+and merging their results with a consistent workflow.
+
+And now that reusing thousands of open source packages is becoming commonplace,
+code scanning pipelines need to be scripted as code and run on servers backed
+by a database, not on a laptop.
+
+For instance, when you analyze Docker container images, there could be hundreds
+to thousands of system packages (such as Debian, RPM, Alpine) and application
+packages (such as npm, PyPI, Rubygems, Maven) installed in an image side-by-side
+with your own code.
+
+Taking care of all these can be hard. ScanCode.io can help organize these
+complex code analyses as scripted pipelines and store their results in a uniform
+database for automated code analysis.
+
+
+What is ScanPipe
+================
+
+ScanPipe is a developer-friendly framework and application that helps software
+analysts and engineers build and manage real-life software composition analysis
+projects as scripted pipelines.
+
+ScanPipe was originally developed to help boost productivity of code analysts
+who work on a wide variety of software composition analysis projects.
+
+ScanPipe provides a unified framework for the infrastructure that is
+required to execute and organize these software composition analysis projects.
+
+
+Should I Use ScanPipe?
+----------------------
+
+If you are working on a software composition analysis project, or you
+are planning to start a new one, consider the following questions:
+
+1. **Automation**: Is this project part of a larger compliance program and process (as opposed to a one-off) and do you need automation?
+2. **Complexity**: Does the project use many third-party components or technologies?
+3. **Reproducibility**: Is it important that results are reproducible, traceable and auditable?
+
+If you answered "yes" to any of the above, keep reading - ScanPipe can help you.
+If the answer is "no" to all of the above, which is a valid scenario e.g. when you
+are doing small-scale analysis, ScanPipe may provide only limited benefit for you.
+
+The first set of available pipelines helps automate the analysis of Docker
+"container" images and virtual machine (VM) disk images that often harbor
+comprehensive software stacks from an operating system with its kernel through
+system and application packages to original and custom applications.
+
+
diff --git a/docs/scanpipe-concepts.rst b/docs/scanpipe-concepts.rst
@@ -0,0 +1,88 @@
+ScanPipe Concepts
+=================
+
+Project
+-------
+
+A project encapsulates the analysis of software code:
+
+- it has a workspace, which is a directory that contains the software code files under analysis
+- it is related to one or more code analysis pipeline scripts that automate its analysis
+- it tracks the project's Codebase Resources, e.g. its code files and directories
+- it tracks the project's Discovered Packages, e.g. the origin and license of the system and application packages discovered in the codebase
+
+Multiple analysis pipelines can be run on a single project.
+
+In the database, a project is identified by its unique name.
+
+
+Project workspace
+-----------------
+
+A project workspace is the root directory where all the project files are stored.
+
+The following directories exist under this directory:
+
+- ``input/`` contains all the original uploaded and input files used by the project. For instance, it could be a codebase archive.
+- ``codebase/`` contains the files and directories (a.k.a. resources) tracked as CodebaseResource records in the database.
+- ``output/`` contains all output files created by the pipelines: reports, scan results, etc.
+- ``tmp/`` is a scratch pad for temporary files generated during pipeline runs.
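
The layout above can be sketched in a few lines of Python. This is only an illustration of the directory structure: `create_workspace` and `WORKSPACE_DIRS` are hypothetical names, not part of ScanPipe, and the real application may create these directories differently.

```python
from pathlib import Path
import tempfile

# Subdirectory names taken from the list above.
WORKSPACE_DIRS = ("input", "codebase", "output", "tmp")


def create_workspace(root):
    """Create the four workspace subdirectories under root (hypothetical helper)."""
    root = Path(root)
    for name in WORKSPACE_DIRS:
        (root / name).mkdir(parents=True, exist_ok=True)
    return root


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        workspace = create_workspace(Path(tmp) / "my-project")
        print(sorted(p.name for p in workspace.iterdir()))
```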
+
+
+Pipelines
+---------
+
+A pipeline is a Python script that contains a series of steps, from start to end,
+that run in order to perform a code analysis.
+
+It usually starts from the uploaded input files, may extract them, and then
+generates CodebaseResource records in the database accordingly.
+
+Those resources can then be analyzed, scanned and matched as needed.
+Analysis results and reports are eventually posted at the end of a pipeline run.
+
+For now, all pipelines are located in the ``scanpipe.pipelines`` module.
+Each pipeline consists of a Python script including one subclass of the ``Pipeline`` class.
+Each step is a method of the Pipeline class decorated with the ``@step`` decorator.
+At its end, a step states which step to execute next.
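
To make the step mechanism concrete, here is a self-contained toy model of it. It is not the actual ScanPipe code: the `step` decorator, the `Pipeline` base class, the `next()` method and the step names below are all stand-ins written for illustration, assuming only what the paragraph above states (decorated methods, each naming its successor).

```python
def step(method):
    """Mark a method as a pipeline step (toy stand-in, not the ScanPipe decorator)."""
    method.is_step = True
    return method


class Pipeline:
    """Toy base class: runs steps from `start`, following each call to next()."""

    def __init__(self):
        self.executed = []
        self._next_step = None

    def next(self, next_step):
        # A step calls this at its end to state which step runs next.
        self._next_step = next_step

    def run(self):
        current = self.start
        while current is not None:
            if not getattr(current, "is_step", False):
                raise TypeError(f"{current.__name__} is not decorated with @step")
            self._next_step = None
            current()
            self.executed.append(current.__name__)
            current = self._next_step
        return self.executed


class ExamplePipeline(Pipeline):
    """Hypothetical pipeline with three steps."""

    @step
    def start(self):
        self.next(self.scan_resources)

    @step
    def scan_resources(self):
        self.next(self.end)

    @step
    def end(self):
        pass  # no call to next(): the pipeline stops here


if __name__ == "__main__":
    print(ExamplePipeline().run())
```

Having each step name its successor keeps the execution order explicit in the pipeline script itself, rather than relying on method definition order.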
+
+One or more pipelines can be assigned to a project as a sequence.
+If one pipeline of a sequence completes successfully, the next pipeline in the
+queue for this project is run automatically until all pipelines are executed.
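
The queueing behavior can be sketched as follows. `run_pipeline_queue` is a hypothetical function, and the sketch assumes that a failed pipeline halts the remaining ones, which the text above implies but does not state outright.

```python
def run_pipeline_queue(pipelines):
    """Run (name, callable) pairs in order; stop at the first one that fails.

    Each callable returns True on success and False on failure.
    Returns the names of the pipelines that completed successfully.
    """
    completed = []
    for name, run in pipelines:
        if not run():
            break  # assumption: a failure stops the rest of the queue
        completed.append(name)
    return completed
```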
+
+
+Codebase Resources
+------------------
+
+A project's Codebase Resources are records of its code files and directories.
+CodebaseResource is a database model and each record is identified by its path
+under the project workspace.
+
+Some of the interesting CodebaseResource attributes are:
+
+- a status used to track the analysis status for this resource
+- a type (such as file, directory or symlink)
+- various attributes to track detected copyrights, license expressions, copyright holders and related packages
+
+In general, the attributes and their names are the same as those used in ScanCode-Toolkit for files.
+
+
+Discovered Packages
+-------------------
+
+A project's Discovered Packages are records of the system and application packages
+discovered in its code.
+DiscoveredPackage is a database model and each record is identified by its Package URL.
+Package URL is a grassroots effort to create informative identifiers for software
+packages such as Debian, RPM, npm, Maven and PyPI packages. See https://github.com/package-url for details.
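
As a rough illustration of what a Package URL looks like, the sketch below assembles the common `pkg:type/namespace/name@version` form. `build_purl` is a hypothetical helper written for this example. It leaves out qualifiers, subpaths and the type-specific normalization rules of the Package URL specification, so it is not a substitute for a real implementation such as the `packageurl-python` library.

```python
from urllib.parse import quote


def build_purl(pkg_type, name, version=None, namespace=None):
    """Build a simplified Package URL string (hypothetical helper)."""
    parts = ["pkg:", pkg_type.lower(), "/"]
    if namespace:
        # Percent-encode each namespace segment separately, keeping the slashes.
        segments = (quote(segment, safe="") for segment in namespace.split("/"))
        parts.append("/".join(segments) + "/")
    parts.append(quote(name, safe=""))
    if version:
        parts.append("@" + quote(version, safe=""))
    return "".join(parts)


if __name__ == "__main__":
    print(build_purl("pypi", "django", "3.1"))
    print(build_purl("deb", "curl", "7.50.3-1", namespace="debian"))
```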
+
+
+Some of the interesting DiscoveredPackage attributes are:
+
+- type, name, version (all Package URL attributes)
+- homepage_url, download_url and other URLs
+- checksums (such as SHA1, MD5)
+- copyright, license_expression, declared_license
+
+
+In general, the attributes and their names are the same as those used in ScanCode-Toolkit for packages.