Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

🪄 Schema version 6 wishlist #108

Closed
1 of 20 tasks
wagoodman opened this issue May 24, 2023 · 10 comments
Closed
1 of 20 tasks

🪄 Schema version 6 wishlist #108

wagoodman opened this issue May 24, 2023 · 10 comments
Assignees
Labels
planning high level epic that should be broken into smaller tasks

Comments

@wagoodman
Copy link
Contributor

wagoodman commented May 24, 2023

As these are implemented, please edit this field to include the PR that implements it within the wishlist below:

  • Consider switching json column processing to bson or another format that is space-efficient and potentially more performant to parse json compresses better than binary data, and we already have a compressed archive, so no benefit of changing this
  • Drop the version_constraint and version_type columns and consolidate into the new package_qualifiers column
  • Remove the id table
  • Remove the RecordSource field from vulnerability metadata (indicates the feed group, which is now fictitious)
  • Be able to show disputed state for vulnerability without conflating with fixed state
  • Express and distinguish NVD vs CNA CVSS scores (but keep both)
  • Allow for different vulnerability formats to be stored in the DB at once (e.g. OSV + original format)
  • Consider a table per provider approach
  • Capture dates where available (published, updated, withdrawn)
  • Capture CISA info from NVD records (NVD should probably have it's own dedicated table)
  • Capture the built timestamp per provider workspace (already available from the vunnel metadata, just needs someplace to live in grypedb)
  • We want to be able to know what fix in information is available for any record related to a match (even downstream , No fix detected for CVE-* based vulnerabilities, although there are updates fixing them grype#1329 and No FIXED-IN values for CVE findings grype#236)
  • Want to be able to correlate / search records by PURL (not only CPEs) Integration with osv.dev grype#1267 (comment)
  • maven sha to groupid/artifactid lookup\
  • some way to capture the affected function calls when available from a source (Unified field for describing affected functions ossf/osv-schema#127)
  • retain more details about namespaces so lookups from available information in grype can be more robust when some data is omitted (🪄 Schema version 6 wishlist #108 (comment))
  • retain not-affected entries somewhere so they can be used for negative match lookups in grype (🪄 Schema version 6 wishlist #108 (comment))
  • GHSA entries should have "title" and "description" fields (right now only "description" column exists, and contains titles of GHSA entries), see GHSA descriptions only have the advisory title grype#1069
  • GHSA entries should have "published at"
  • Re-orient vulnerability records for non-os packaging ecosystems around the package type and registry? Currently they are oriented around language, but that can cause problems when a singe package registry is shared by multiple language ecosystems and also can get confusing when a single language has many packaging ecosystems (we can support the second case in the current model, but it is probably clearer to re-orient)

Focuses:

  • look at space savings for uncompressed DB while keeping matching speed a priority
@wagoodman wagoodman pinned this issue May 24, 2023
@willmurphyscode
Copy link
Contributor

This wouldn't save space, but currently when RHEL and Mariner feeds report a package as "not affected" we just drop the record in vunnel. It would be helpful if this was instead expressed in the database. (Might overlap with the disputed requirement mentioned above.)

@westonsteimel
Copy link
Contributor

westonsteimel commented Aug 18, 2023

v6 should have some way of looking up the correct namespace off of something more than just version. For instance, we're already maintaining mappings in vunnel of codenames to versions for ubuntu and debian and it would be useful to retain that information in the db itself so that grype can still find the correct vuln namespace even if one of the pieces of information might be missing (like in anchore/grype#1446)

Keeping it in the db would mean it could be updated automatically as new namespaces are added and grype could make use of it for lookups immediately whereas maintaining a static mapping in grype means we'd need to remember to maintain that mapping in multiple places (vunnel and grype), and users would need to upgrade to the latest grype for newer namespaces

@westonsteimel
Copy link
Contributor

The ability to know when a particular record was added or modified within the grype database came up in a community discussion. Although we are currently always building up the database from scratch, I think it may make sense to include something like an added, and modified timestamp column to each record in the db so that if/when we add diffing/partial updates like in #143 we'd already have the necessary columns in place and not necessarily need another schema bump at that time

@willmurphyscode
Copy link
Contributor

Updated the top comment to point to anchore/grype#1498 as a specific issue for "Capture dates where available".

@willmurphyscode
Copy link
Contributor

We need the ability to represent a CVE that affects different packages, but has different severity ratings for each package. As a concrete example, https://security-tracker.debian.org/tracker/CVE-2023-44487 lists a table with multiple severities for different packages. Here's a snippet of the relevant table on that page, in case it changes or moves:

Package Type Release Fixed Version Urgency Origin Debian Bugs
h2o source buster 2.2.5+dfsg2-2+deb10u2   DLA-3638-1  
h2o source (unstable) 2.2.5+dfsg2-8     1054232
haproxy source (unstable) 1.8.13-1      
jetty9 source buster 9.4.50-4+deb10u1   DLA-3641-1  
jetty9 source bullseye 9.4.50-4+deb11u1   DSA-5540-1  
jetty9 source bookworm 9.4.50-4+deb12u2   DSA-5540-1  
jetty9 source (unstable) 9.4.53-1      
netty source (unstable) (unfixed)     1054234
nghttp2 source buster 1.36.0-2+deb10u2   DLA-3621-1  
nghttp2 source (unstable) 1.57.0-1     1053769
nginx source (unstable) 1.24.0-2 unimportant   1053770
tomcat10 source bookworm 10.1.6-1+deb12u1   DSA-5521-1  

Note the "urgency" is marked as "unimportant" on the row for nginx. This rating translates in Vunnel to "negligible" in grype-db.

Currently, in the database, this CVE is represented like this:

-- VULNERABILITY TABLE
sqlite> select id, package_name from vulnerability where 
id="CVE-2023-44487" and namespace="debian:distro:debian:12";
id              package_name
--------------  -------------
CVE-2023-44487  h2o
CVE-2023-44487  haproxy
CVE-2023-44487  jetty9
CVE-2023-44487  netty
CVE-2023-44487  nghttp2
CVE-2023-44487  nginx
CVE-2023-44487  tomcat10
CVE-2023-44487  tomcat9
CVE-2023-44487  trafficserver
-- VULNERABILITY_METADATA table
sqlite> select id, severity from vulnerability_metadata where 
id="CVE-2023-44487" and namespace="debian:distro:debian:12";
id              severity
--------------  ----------
CVE-2023-44487  Negligible

As you can see, the database has no good way of writing down, "this CVE is more severe if matched against tomcat than against nginx," but Debian's data is clearly trying to tell us that.

I believe this would be fixed by having a proper foreign key from vulnerability to vulnerability_metadata, rather than just relying on ID+Namespace to match. We could also move the severity column to the vulnerability table, but that would probably result in a lot of duplicate values.

@westonsteimel
Copy link
Contributor

We need to better capture the relationships between identifiers between ecosystems. Currently, we have the related_vulnerabilities column, but that is on the package vuln record and requires knowing the grypedb vuln namespace that an id corresponds to, which is difficult when that namespace is something dynamic like with github where the GHSA could cover several package ecosystems. We don't really care about the namespace, only that a particular CVE corresponds to a GHSA or vice versa.

@westonsteimel
Copy link
Contributor

Tables per provider/ecosystem pair with a schema specific to the ecosystem
So we'd have a table with a lookup on provider name and ecosystem name (likely based on the package url spec, though extended in the case of generic purls so we could have specific schemas tailored to classes of binary packages like java, python interpreter, etc?) that would get the name of the table to load the relevant data from.
Should it also be separate tables for affected/not affected or the same table?

@wagoodman wagoodman added the planning high level epic that should be broken into smaller tasks label Feb 16, 2024
@wagoodman wagoodman added this to OSS Feb 16, 2024
@wagoodman wagoodman moved this to Backlog in OSS Feb 16, 2024
willmurphyscode pushed a commit that referenced this issue Mar 27, 2024
Signed-off-by: Alex Goodman <alex.goodman@anchore.com>
@asomya asomya unpinned this issue May 6, 2024
@wagoodman wagoodman self-assigned this Jun 7, 2024
@westonsteimel
Copy link
Contributor

westonsteimel commented Aug 17, 2024

@wagoodman, I was just wondering in light of all of the CDN issues that have been cropping up if it might make sense to do something like partition v6 databases per provider and then perhaps make grype smarter about what it downloads based on what it needs? Like we know currently the sles data far outnumbers the rest but if someone is never scanning sles containers there would be no need to ever fetch that subset of the data in their ci pipelines running grype

Anyways, just a very rough thought and apologies if this has already been discussed elsewhere. I'm "on vacation" so haven't yet seen all of the discussion that may have occurred on Friday.

@willmurphyscode
Copy link
Contributor

@westonsteimel we were talking about that yesterday. There might be some big performance gains by splitting the DB by provider.

The main drawback I see is that, right now, grype does 2 relatively slow things at the same time: it generates the SBOM, and it downloads the database. But I don't think it can know what database pieces to download before the SBOM is generated. Maybe we could do something like check the distro early, and then download the database for the distro while making the rest of the SBOM.

We've also talked about trying to make incremental updates to the database available, but I think partitioning by provider would be much simpler to implement.

@wagoodman
Copy link
Contributor Author

I'm going to close this since this wishlist has been converted into tangible issues for v6

@wagoodman wagoodman closed this as not planned Won't fix, can't repro, duplicate, stale Sep 25, 2024
@github-project-automation github-project-automation bot moved this from Backlog to Done in OSS Sep 25, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
planning high level epic that should be broken into smaller tasks
Projects
Archived in project
Development

No branches or pull requests

3 participants