feat: SP-1856 Fix skip dir
matiasdaloia committed Nov 20, 2024
1 parent 58bd5c1 commit 12c80b8
Showing 3 changed files with 137 additions and 13 deletions.
128 changes: 128 additions & 0 deletions docs/source/scanoss_settings_schema.rst
@@ -25,6 +25,134 @@ The ``self`` section contains basic information about your project:
}
}
Settings
========
The ``settings`` object allows you to configure various aspects of the scanning process. Currently, it provides control over which files should be skipped during scanning through the ``skip`` property.

Skip Configuration
------------------
The ``skip`` object lets you define rules for excluding files from being scanned. This can be useful for improving scan performance and avoiding unnecessary processing of certain files.

Properties
~~~~~~~~~~

skip.patterns
^^^^^^^^^^^^^
A list of patterns that determine which files should be skipped during scanning. The patterns follow the same format as ``.gitignore`` files. For more information, see the `gitignore patterns documentation <https://git-scm.com/docs/gitignore#_pattern_format>`_.

:Type: Array of strings
:Required: No
:Example:
    .. code-block:: json

        {
            "settings": {
                "skip": {
                    "patterns": [
                        "*.log",
                        "!important.log",
                        "temp/",
                        "debug[0-9]*.txt",
                        "src/client/specific-file.js",
                        "src/nested/folder/"
                    ]
                }
            }
        }

Pattern Format Rules
''''''''''''''''''''
* Patterns are matched **relative to the scan root directory**
* A trailing slash indicates a directory (e.g., ``path/`` matches only directories)
* An asterisk ``*`` matches anything except a slash
* Two asterisks ``**`` match zero or more directories (e.g., ``path/**/folder`` matches ``path/folder``, ``path/to/folder``, and ``path/a/b/folder``)
* Range notations like ``[0-9]`` match any character in the range
* Question mark ``?`` matches any single character except a slash
Examples with Explanations
''''''''''''''''''''''''''
.. code-block:: none

    # Match all .txt files
    *.txt

    # Match all .log files except important.log
    *.log
    !important.log

    # Match all files in the build directory
    build/

    # Match all .pdf files in docs directory and its subdirectories
    docs/**/*.pdf

    # Match files like test1.js, test2.js, etc.
    test[0-9].js

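The patterns above can be tried out with the ``pathspec`` library, which ``scan_filter.py`` in this commit uses through ``PathSpec.from_lines('gitwildmatch', ...)`` and ``match_file``. The following is a minimal sketch rather than the scanner's actual code path: the ``is_skipped`` helper and the sample paths are made up for illustration.

```python
from pathspec import PathSpec

# Patterns copied from the documentation examples above.
patterns = [
    "*.log",
    "!important.log",
    "build/",
    "docs/**/*.pdf",
    "test[0-9].js",
]
spec = PathSpec.from_lines("gitwildmatch", patterns)


def is_skipped(rel_path: str) -> bool:
    """Hypothetical helper: True if rel_path (relative to the scan root) matches."""
    return spec.match_file(rel_path)


# Made-up sample paths, relative to the scan root.
samples = [
    "debug.log",
    "important.log",
    "build/out.o",
    "docs/a/b/manual.pdf",
    "test1.js",
    "test10.js",
    "src/main.py",
]
for path in samples:
    print(f"{path}: {'skip' if is_skipped(path) else 'scan'}")
```

Pattern order matters here: the negation ``!important.log`` only has an effect because it comes after ``*.log``.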
skip.sizes
^^^^^^^^^^
Rules for skipping files based on their size. A file is skipped when it is smaller than ``min`` or larger than ``max``.
:Type: Object
:Required: No
:Properties:
* ``min`` (integer): Minimum file size in bytes
* ``max`` (integer): Maximum file size in bytes (Required)
:Example:
    .. code-block:: json

        {
            "settings": {
                "skip": {
                    "sizes": {
                        "min": 100,
                        "max": 1000000
                    }
                }
            }
        }

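As a rough illustration of how these limits are applied, the sketch below mirrors the comparison in ``scan_filter.py`` from this commit (``file_size < self.min_size or file_size > self.max_size``) and its fallbacks of ``0`` and ``float('inf')``. The function names ``size_limits`` and ``should_skip_by_size`` are hypothetical and chosen only for this example.

```python
from typing import Tuple


def size_limits(skip: dict) -> Tuple[float, float]:
    """Read min/max from a ``skip`` dict, falling back to 0 and infinity."""
    sizes = skip.get("sizes", {})
    return sizes.get("min", 0), sizes.get("max", float("inf"))


def should_skip_by_size(file_size: int, min_size: float, max_size: float) -> bool:
    """A file is skipped when it falls outside the inclusive [min, max] range."""
    return file_size < min_size or file_size > max_size


skip = {"sizes": {"min": 100, "max": 1000000}}
min_size, max_size = size_limits(skip)
for size in (99, 100, 1000000, 1000001):
    action = "skip" if should_skip_by_size(size, min_size, max_size) else "scan"
    print(f"{size} bytes: {action}")
```

Both bounds are inclusive in this sketch: a file of exactly ``min`` or ``max`` bytes is still scanned.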
Complete Example
-------------------
Here's a comprehensive example combining pattern-based and size-based skipping. Entries that begin with ``#`` are comments in the gitignore format and match no files:
.. code-block:: json

    {
        "settings": {
            "skip": {
                "patterns": [
                    "# Node.js dependencies",
                    "node_modules/",
                    "# Build outputs",
                    "dist/",
                    "build/",
                    "# Logs except important ones",
                    "*.log",
                    "!important.log",
                    "# Temporary files",
                    "temp/",
                    "*.tmp",
                    "# Debug files with numbers",
                    "debug[0-9]*.txt",
                    "# All test files in any directory",
                    "**/*test.js"
                ],
                "sizes": {
                    "min": 512,
                    "max": 5242880
                }
            }
        }
    }

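To see which entries of such a list actually act as patterns, the following stdlib-only sketch parses a shortened copy of the settings and drops the ``#`` label entries and blank strings, which the gitignore format treats as comments and empty lines. ``effective_patterns`` is a hypothetical helper for illustration, not part of the scanner.

```python
import json

# Shortened copy of the complete example above.
SETTINGS_JSON = """
{
    "settings": {
        "skip": {
            "patterns": [
                "# Node.js dependencies",
                "node_modules/",
                "# Logs except important ones",
                "*.log",
                "!important.log"
            ],
            "sizes": {"min": 512, "max": 5242880}
        }
    }
}
"""


def effective_patterns(settings: dict) -> list:
    """Return the pattern strings that are neither comments nor blank."""
    patterns = settings.get("settings", {}).get("skip", {}).get("patterns", [])
    return [p for p in patterns if p.strip() and not p.startswith("#")]


settings = json.loads(SETTINGS_JSON)
for pattern in effective_patterns(settings):
    print(pattern)
```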
BOM Rules
---------
21 changes: 9 additions & 12 deletions src/scanoss/scan_filter.py
@@ -1,5 +1,5 @@
 import os
-from typing import List, Set, Tuple
+from typing import List

 from pathspec import PathSpec

@@ -230,6 +230,7 @@ def __init__(
         skip_patterns.extend(skip.get('patterns', []))

         self.skip_patterns = skip_patterns
+        self.path_spec = PathSpec.from_lines('gitwildmatch', self.skip_patterns)
         self.min_size = skip.get('sizes', {}).get('min', 0)
         self.max_size = skip.get('sizes', {}).get('max', float('inf'))

@@ -249,13 +250,11 @@ def _walk_with_ignore(self, scan_root: str) -> List[str]:
         files = []
         root = os.path.abspath(scan_root)

-        path_spec, dir_patterns = self._create_skip_path_matchers()
-
         for dirpath, dirnames, filenames in os.walk(root):
             rel_path = os.path.relpath(dirpath, root)

-            # Return early if the entire directory should be skipped
-            if any(rel_path.startswith(p) for p in dir_patterns):
+            # Early skip directories if they match any of the patterns
+            if self._should_skip_dir(rel_path):
                 self.print_debug(f'Skipping directory: {rel_path}')
                 dirnames.clear()
                 continue
@@ -268,17 +267,15 @@ def _walk_with_ignore(self, scan_root: str) -> List[str]:
                 if file_size < self.min_size or file_size > self.max_size:
                     self.print_debug(f'Skipping file: {file_rel_path} (size: {file_size})')
                     continue
-                if path_spec.match_file(file_rel_path):
+                if self.path_spec.match_file(file_rel_path):
                     self.print_debug(f'Skipping file: {file_rel_path}')
                     continue
                 else:
                     files.append(file_rel_path)

         return files

-    def _create_skip_path_matchers(self) -> Tuple[PathSpec, Set[str]]:
-        dir_patterns = {p.rstrip('/') for p in self.skip_patterns if p.endswith('/')}
-
-        path_spec = PathSpec.from_lines('gitwildmatch', self.skip_patterns)
-
-        return path_spec, dir_patterns
+    def _should_skip_dir(self, dir_rel_path: str) -> bool:
+        return any(dir_rel_path.startswith(p) for p in self.skip_patterns) or self.path_spec.match_file(
+            dir_rel_path + '/'
+        )
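The early directory skip in ``_walk_with_ignore`` relies on a documented property of ``os.walk``: when walking top-down, editing ``dirnames`` in place controls which subdirectories are visited, so ``dirnames.clear()`` stops the descent. The stdlib-only sketch below isolates that mechanism. Its ``should_skip_dir`` predicate is a hypothetical stand-in that compares whole path segments against plain directory names; it is not the pattern matching used in the commit. One reason to compare segments in a sketch like this is that a plain prefix test such as ``'src-old'.startswith('src')`` is also true for sibling directories that merely share a prefix.

```python
import os
import tempfile


def should_skip_dir(rel_path: str, skip_dirs: set) -> bool:
    """Hypothetical predicate: skip if any path segment is a listed directory name."""
    return any(part in skip_dirs for part in rel_path.split(os.sep))


def walk_with_pruning(scan_root: str, skip_dirs: set) -> list:
    """Collect file paths relative to scan_root, pruning skipped directories."""
    files = []
    root = os.path.abspath(scan_root)
    for dirpath, dirnames, filenames in os.walk(root):
        rel_path = os.path.relpath(dirpath, root)
        if should_skip_dir(rel_path, skip_dirs):
            dirnames.clear()  # in-place edit: os.walk will not descend further
            continue
        for name in filenames:
            files.append(os.path.normpath(os.path.join(rel_path, name)))
    return sorted(files)


def demo() -> list:
    """Build a small throwaway tree and walk it, skipping node_modules."""
    with tempfile.TemporaryDirectory() as tmp:
        for rel in ("src/main.py", "node_modules/pkg/index.js", "src-old/legacy.py"):
            full = os.path.join(tmp, *rel.split("/"))
            os.makedirs(os.path.dirname(full), exist_ok=True)
            with open(full, "w") as fh:
                fh.write("x")
        return walk_with_pruning(tmp, {"node_modules"})


print(demo())
```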
1 change: 0 additions & 1 deletion src/scanoss/scanner.py
@@ -401,7 +401,6 @@ def scan_folder(self, scan_dir: str) -> bool:
         if not os.path.exists(scan_dir) or not os.path.isdir(scan_dir):
             raise Exception(f'ERROR: Specified folder does not exist or is not a folder: {scan_dir}')

-        scan_dir_len = len(scan_dir) if scan_dir.endswith(os.path.sep) else len(scan_dir) + 1
         self.print_msg(f'Searching {scan_dir} for files to fingerprint...')
         spinner = None
         if not self.quiet and self.isatty: