Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add improver pipeline to flag ghost packages #644 #917 #1395 #1533

Merged
merged 16 commits into from
Aug 26, 2024

Conversation

keshav-space
Copy link
Member

@keshav-space keshav-space commented Aug 2, 2024

This PR introduces the aboutcode pipeline for defining and running improvements on data (see #1395). It also adds a new pipeline to flag ghost packages. Ghost packages are those packages that are not present in the upstream package manager index.

Screenshot from 2024-08-23 18-45-24

@keshav-space keshav-space marked this pull request as draft August 2, 2024 12:48
@keshav-space keshav-space marked this pull request as ready for review August 7, 2024 07:57
vulnerabilities/pipelines/remove_ghost_packages.py Outdated Show resolved Hide resolved
vulnerabilities/pipelines/remove_ghost_packages.py Outdated Show resolved Hide resolved
vulnerabilities/pipelines/remove_ghost_packages.py Outdated Show resolved Hide resolved
vulnerabilities/pipelines/remove_ghost_packages.py Outdated Show resolved Hide resolved
@keshav-space keshav-space changed the title Add improver pipeline to remove ghost packages #644 #917 #1395 Add improver pipeline to flag ghost packages #644 #917 #1395 Aug 21, 2024
Copy link
Member

@pombredanne pombredanne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here are a few nits for your review.

requirements.txt Show resolved Hide resolved
vulnerabilities/models.py Outdated Show resolved Hide resolved
vulnerabilities/pipelines/flag_ghost_packages.py Outdated Show resolved Hide resolved
vulnerabilities/pipelines/flag_ghost_packages.py Outdated Show resolved Hide resolved
vulnerabilities/pipelines/flag_ghost_packages.py Outdated Show resolved Hide resolved
vulnerabilities/pipelines/flag_ghost_packages.py Outdated Show resolved Hide resolved
Signed-off-by: Keshav Priyadarshi <git@keshav.space>
Signed-off-by: Keshav Priyadarshi <git@keshav.space>
Signed-off-by: Keshav Priyadarshi <git@keshav.space>
Signed-off-by: Keshav Priyadarshi <git@keshav.space>
Signed-off-by: Keshav Priyadarshi <git@keshav.space>
Signed-off-by: Keshav Priyadarshi <git@keshav.space>
Signed-off-by: Keshav Priyadarshi <git@keshav.space>
- Package can have malicious, ghost, yanked, valid and unknown status

Signed-off-by: Keshav Priyadarshi <git@keshav.space>
Signed-off-by: Keshav Priyadarshi <git@keshav.space>
Signed-off-by: Keshav Priyadarshi <git@keshav.space>
Signed-off-by: Keshav Priyadarshi <git@keshav.space>
Signed-off-by: Keshav Priyadarshi <git@keshav.space>
Signed-off-by: Keshav Priyadarshi <git@keshav.space>
Copy link
Member

@pombredanne pombredanne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks... here a few nits for your review... I wonder if there is a real need to convert versions with univers for this use case.

Also:

  • Thanks for adding logging!
  • Please add a CHANGELOG entry for logging and pipeline introduction

version_class = RANGE_CLASS_BY_SCHEMES[purl.type].version_class

try:
return {version_class(v.value) for v in versions(str(purl))}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To check for a ghost, why do we need to convert things to a version class?
Could we just check the if the version string exists?
Otherwise we are eventually ghosting all the packages with a version we cannot parse

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To check for a ghost, why do we need to convert things to a version class?
Could we just check the if the version string exists?

Simple string comparison may lead to incorrectly flagging legitimate versions as ghost packages:

"1.2.0" != "1.02.0" 
"v1.2.3" != "1.2.3"
"1.0" != "1.0.0"

Otherwise we are eventually ghosting all the packages with a version we cannot parse

If we cannot parse the version for a given base_purl, we skip those and don’t mark them as ghosts.

vulnerabilities/pipelines/flag_ghost_packages.py Outdated Show resolved Hide resolved
vulnerabilities/pipelines/flag_ghost_packages.py Outdated Show resolved Hide resolved
vulnerabilities/pipelines/flag_ghost_packages.py Outdated Show resolved Hide resolved
keshav-space and others added 3 commits August 23, 2024 20:45
Signed-off-by: Keshav Priyadarshi <git@keshav.space>

Co-authored-by: Philippe Ombredanne <pombredanne@nexb.com>
Signed-off-by: Keshav Priyadarshi <git@keshav.space>
Signed-off-by: Keshav Priyadarshi <git@keshav.space>
Copy link
Member

@pombredanne pombredanne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good to go!

@keshav-space keshav-space merged commit 1e3afdc into main Aug 26, 2024
9 checks passed
@keshav-space keshav-space deleted the 917-ghost-remover branch August 26, 2024 07:50
@armijnhemel
Copy link
Contributor

Would this also address situations where the upstream index is wrong, such as #915 ?

@keshav-space
Copy link
Member Author

Would this also address situations where the upstream index is wrong, such as #915 ?

@armijnhemel at present, this will flag the ghost packages from the following ecosystems: cargo, composer, conan, deb, gem, github, golang, hex, maven, npm, nuget, and pypi. Once we have support for alpine in https://github.com/aboutcode-org/fetchcode/blob/master/src/fetchcode/package_versions.py, this will also flag ghost packages for the alpine ecosystem.

@armijnhemel
Copy link
Contributor

Would this also address situations where the upstream index is wrong, such as #915 ?

@armijnhemel at present, this will flag the ghost packages from the following ecosystems: cargo, composer, conan, deb, gem, github, golang, hex, maven, npm, nuget, and pypi. Once we have support for alpine in https://github.com/aboutcode-org/fetchcode/blob/master/src/fetchcode/package_versions.py, this will also flag ghost packages for the alpine ecosystem.

I am not sure you understood my question: what if upstream is wrong (as is the case in Alpine see #915 )? Are you always assuming that the index is correct?

@keshav-space
Copy link
Member Author

Would this also address situations where the upstream index is wrong, such as #915 ?

@armijnhemel at present, this will flag the ghost packages from the following ecosystems: cargo, composer, conan, deb, gem, github, golang, hex, maven, npm, nuget, and pypi. Once we have support for alpine in https://github.com/aboutcode-org/fetchcode/blob/master/src/fetchcode/package_versions.py, this will also flag ghost packages for the alpine ecosystem.

I am not sure you understood my question: what if upstream is wrong (as is the case in Alpine see #915 )? Are you always assuming that the index is correct?

@armijnhemel Yes, this pipeline assumes that package index is right. We should have seperate pipeline for case where index is wrong. We will need some generalized way to conclude that package index is reporting wrong version.

@armijnhemel
Copy link
Contributor

Would this also address situations where the upstream index is wrong, such as #915 ?

@armijnhemel at present, this will flag the ghost packages from the following ecosystems: cargo, composer, conan, deb, gem, github, golang, hex, maven, npm, nuget, and pypi. Once we have support for alpine in https://github.com/aboutcode-org/fetchcode/blob/master/src/fetchcode/package_versions.py, this will also flag ghost packages for the alpine ecosystem.

I am not sure you understood my question: what if upstream is wrong (as is the case in Alpine see #915 )? Are you always assuming that the index is correct?

@armijnhemel Yes, this pipeline assumes that package index is right. We should have seperate pipeline for case where index is wrong. We will need some generalized way to conclude that package index is reporting wrong version.

So far I have only seen it once (in Alpine, see #915) so perhaps it could just be added as an exception without building a separate pipeline but I guess it might happen more often.

@pombredanne
Copy link
Member

@armijnhemel The general approach to curation is going to be:

  1. use federatedcode to store backing data in VCS repos
  2. treat a curation outcome as ultimately creating a new "advisory" like record there
  3. use an importer that treat this advisories as any other data sources, but with an increased priority, and also being aware of the historical context of this curation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Development

Successfully merging this pull request may close these issues.

Add Improver to find out Ghost packages Add improver to validate that collected purl are real.
3 participants