SRE-2505: Fix Trivy scan upload to the Security tab #15201

grom72 · 2024-09-26T12:36:52Z

Do not start Trivy scan if changes not related to dependencies. Run Trivy on daily bases.
Add badge to follow cycle Trivy scans
Enable scans on request

Doc-only: true

Required-githooks: true

Before requesting gatekeeper:

Two review approvals and any prior change requests have been resolved.
Testing is complete and all tests passed or there is a reason documented in the PR why it should be force landed and forced-landing tag is set.
Features: (or Test-tag*) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.
Commit messages follows the guidelines outlined here.
Any tests skipped by the ticket being addressed have been run and passed in the PR.

Gatekeeper:

Do not start Trivy scan if changes not related to dependencies. Run Trivy on daily bases. Add badge to follow cycle Trivy scans Enable scans on request Doc-only: true Required-githooks: true Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@intel.com>

github-actions · 2024-09-26T12:37:12Z

Errors are component not formatted correctly,Ticket number suffix is not a number. See https://daosio.atlassian.net/wiki/spaces/DC/pages/11133911069/Commit+Comments,Unable to load ticket data
https://daosio.atlassian.net/browse/SRE-2505:

brianjmurrell

If we limit the scope here so that this is done by a timer how do we know when the timed run has found something new? I.e. Who gets notified and how?

grom72 · 2024-10-08T07:20:45Z

If we limit the scope here so that this is done by a timer how do we know when the timed run has found something new? I.e. Who gets notified and how?

Somebody who monitors Security issues of the project as we report all detected issues there:
https://github.com/daos-stack/daos/security/code-scanning?query=is%3Aopen+branch%3Amaster+tool%3ATrivy

There is also a tag in the README file:

daos/README.md

Line 8 in 766d9e1

    
           [![Trivy scan](https://github.com/daos-stack/daos/actions/workflows/trivy.yml/badge.svg)](https://github.com/daos-stack/daos/actions/workflows/trivy.yml)

If we keep the original version, developers will start ignoring Trivy scans as soon as they detect that Trivy reports issues that are completely out of the scope of their PRs. E.g. modifications in VOS (c language) that trigger (???) Java issues.

Additionally, Dependabot is the main protection mechanism for our code. Trivy scan mainly provides information for the SDLe process. It reacts faster than Dependabots as we observed last week, but we should not bother developers if we know that Dependabot creates a proper fix a day or two days later.

brianjmurrell · 2024-10-08T11:17:39Z

Somebody

Who exactly is this somebody? Because unless it's somebody who's job description includes being responsible for monitoring (monitoring often gets back-burnered by new high priority tasks, FWIW) these, somebody will be nobody.

There is also a tag in the README file:

No developer pays attention to those.

If we keep the original version, developers will start ignoring Trivy scans as soon as they detect that Trivy reports issues that are completely out of the scope of their PRs. E.g. modifications in VOS (c language) that trigger (???) Java issues.

Not if such scan failures fail their PR and they cannot get them landed. This is how we handle all of our linting (for example), etc. currently. Periodically somebody will discover that some new linting requirement on their PR that's not related to their code and they either have to fix it in their PR or raise the flag that it needs fixing. In the meanwhile, gatekeepers are requested to force-land. All of this keeps the issue front-and-centre and motivates a fix in quick time.

Additionally, Dependabot is the main protection mechanism for our code. Trivy scan mainly provides information for the SDLe process. It reacts faster than Dependabots as we observed last week, but we should not bother developers if we know that Dependabot creates a proper fix a day or two days later.

Is Trivy entirely related to dependencies? That is, does it not find any other kinds of vulnerabilities other than dependencies that are out of date?

grom72 · 2024-10-08T15:16:10Z

Somebody

Who exactly is this somebody? Because unless it's somebody who's job description includes being responsible for monitoring (monitoring often gets back-burnered by new high priority tasks, FWIW) these, somebody will be nobody.

Do we follow the security tab (https://github.com/daos-stack/daos/security)?
If not, what is the purpose of maintaining it?

There is also a tag in the README file:

No developer pays attention to those.

If we keep the original version, developers will start ignoring Trivy scans as soon as they detect that Trivy reports issues that are completely out of the scope of their PRs. E.g. modifications in VOS (c language) that trigger (???) Java issues.

Not if such scan failures fail their PR and they cannot get them landed. This is how we handle all of our linting (for example), etc. currently. Periodically somebody will discover that some new linting requirement on their PR that's not related to their code and they either have to fix it in their PR or raise the flag that it needs fixing. In the meanwhile, gatekeepers are requested to force-land. All of this keeps the issue front-and-centre and motivates a fix in quick time.

linting can not be fixed automatically while Trivy issues are usually fixed by Dependabot
how is it possible to have linting issue not related to modified code?
force-landing exception shows that the process does not work well - it is a workaround for too heavy validation of PR

Additionally, Dependabot is the main protection mechanism for our code. Trivy scan mainly provides information for the SDLe process. It reacts faster than Dependabots as we observed last week, but we should not bother developers if we know that Dependabot creates a proper fix a day or two days later.

Is Trivy entirely related to dependencies? That is, does it not find any other kinds of vulnerabilities other than dependencies that are out of date?

Trivy detects CVEs based on detected dependencies. Dependabot does the same, but Dependabot also creates a fix.
Finally, Trivy is somehow a guard that confirm Dependabot works well.

brianjmurrell · 2024-10-08T17:36:44Z

Do we

Who is "we"?

follow the security tab (daos-stack/daos/security)?

I don't know that anyone has been "assigned" to and is responsible for what shows up there. I would guess the answer is no.

If not, what is the purpose of maintaining it?

We can dream (that somebody might be made responsible for it) can't we? 😄

linting can not be fixed automatically while Trivy issues are usually fixed by Dependabot

Usually? Does that mean there are cases where they not fixed by Dependabot?

how is it possible to have linting issue not related to modified code?

New linting rules can be introduced.

force-landing exception shows that the process does not work well - it is a workaround for too heavy validation of PR

I disagree. Force-landing, in this case, just means that we are a large team working on a lot of code and we do not want to halt everyone's progress while surprises are taken care of and we don't want to have such things landable without a gatekeeper being aware by knowing they are force landing.

Trivy detects CVEs based on detected dependencies.

So it's entirely driven by dependencies?

grom72 · 2024-10-09T07:47:35Z

Do we

Who is "we"?

follow the security tab (daos-stack/daos/security)?

I don't know that anyone has been "assigned" to and is responsible for what shows up there. I would guess the answer is no.

If not, what is the purpose of maintaining it?

We can dream (that somebody might be made responsible for it) can't we? 😄

We have different *owners groups in daos-stack/daos. Perhaps it is time to create the security-owners group that will monitor security issues reported by Github.

linting can not be fixed automatically while Trivy issues are usually fixed by Dependabot

Usually? Does that mean there are cases where they not fixed by Dependabot?

I have not seen such a case so far. If Dependabot can not fix the issue then we have to create an exception, as it is already done for many Hadoop issues

daos/utils/trivy/.trivyignore

Line 2 in dbc2808

    
           ## CVE-2023-52428,MEDIUM,,"Denial of Service in Connect2id Nimbus JOSE+JWT","com.nimbusds:nimbus-jose-jwt","9.8.1","9.37.2",https://avd.aquasec.com/nvd/cve-2023-52428

But, I do not think it is a good approach to teach everybody how to search for exceptions and how to resolve them.

how is it possible to have linting issue not related to modified code?

New linting rules can be introduced.

force-landing exception shows that the process does not work well - it is a workaround for too heavy validation of PR

I disagree. Force-landing, in this case, just means that we are a large team working on a lot of code and we do not want to halt everyone's progress while surprises are taken care of and we don't want to have such things landable without a gatekeeper being aware by knowing they are force landing.

Yes, exactly, we use force-landing because we do not want to wait for Dependabot to fix the issue.

I want to avoid a situation that can be described in the following steps:

PR is created
Trivy reports an issue
A related ticket is created to monitor Trivy issue (who is an owner?)
PR is force-landing, as you mentioned.
Dependabot detect an issue and provide a fix
Dependabot PR shall pass Trivy scan if not -> that means a bigger problem to be handled separately
Fix is merged
Ticket from 3) can be closed (by whom?)
Original PR is marked as failed as nobody re-runs Trivy scan on merged PR

Trivy detects CVEs based on detected dependencies.

So it's entirely driven by dependencies?
Yes, it checks files that are now mentioned in PR:

      - '**/go.mod'
      - '**/pom.xml'
      - '**/requirements.txt'

compares them with CVE database and produces a report.
I do not know how Java code is released (whether at all), but in the case of Go and Python, Trivy issues are more related to project CI infrastructure rather than the final product.

Doc-only: true Required-githooks: true Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@intel.com>

brianjmurrell · 2024-10-09T13:31:31Z

We have different *owners groups in daos-stack/daos. Perhaps it is time to create the security-owners group that will monitor security issues reported by Github.

As an implementation matter, perhaps. But first and foremost, there is an administrative component that needs to be completed first. Management needs to decide that this is an important task and then needs to decide who is going to be responsible for it and to assign it to them.

I have not seen such a case so far.

You probably don't submit as many daos PRs as other developers and have not been here long enough to see it.

But, I do not think it is a good approach to teach everybody how to search for exceptions and how to resolve them.

The issue not necessarily that everyone is responsible for fixing them. The idea is that making a CI stage fail due to an issue is what drives getting it fixed. Around here, Issues that don't stop people's work just get pushed aside and continue to pile up without anyone stopping what they are doing to address them.

Yes, exactly, we use force-landing because we do not want to wait for Dependabot to fix the issue.

And if Trivy can find the issue sooner so that somebody can beat Dependabot to fixing the issue, great. But the Trivy scan has to interrupt/block people's work, requiring them to request force-landings in order for whatever issue the scan found to be pushed up in importance. If it's just a scan that affects nobody, it gets ignored. That's just how things work around here.

I want to avoid a situation that can be described in the following steps:

PR is created

Trivy reports an issue

A related ticket is created to monitor Trivy issue (who is an owner?)

Who's going to create that ticket? What prompts them to create it if the scan did not impact their PR? And yes, "who is an owner"?

PR is force-landing, as you mentioned.

The developer needing to request this is the only thing that drives a ticket being created and the issue being fixed. If no developer is impacted, nothing happens.

Doc-only: true Required-githooks: true Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@intel.com>

ryon-jensen · 2024-10-11T18:28:10Z

Can the Trivy workflow generate a notification or create a ticket so findings aren't lost?

Doc-only: true Required-githooks: true Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@intel.com>

grom72 · 2024-10-15T07:22:24Z

Can the Trivy workflow generate a notification or create a ticket so findings aren't lost?

Notification is created (if CVE detected) on Security tab:
https://github.com/daos-stack/daos/security/code-scanning?query=is%3Aopen+branch%3Amaster+tool%3ATrivy

brianjmurrell · 2024-10-15T11:56:07Z

.github/workflows/trivy.yml

+  schedule:
+    - cron: '45 8 * * *'


Does the scheduled run add any value if we are (correctly I will add) running this on both PRs and landings?

There are three usecases

PR - to inform the PR's author about new problems delivered together with the PR - this scan does not add anything to Security tab

push reports new issues introduced by PR to the Security tab if not resolved by the author

schedule reports new issues to the Security tab detected mainly due to new issues reported by CVE databases
There is an overlap between schedule and push but the schedule does not need to wait for any PR to land.

brianjmurrell · 2024-10-15T11:57:19Z

.github/workflows/trivy.yml

+    paths:
+      - '**/go.mod'
+      - '**/pom.xml'
+      - '**/requirements.txt'
+      - '**/*trivy*'


What happens if something new is added to the repo that Trivy should be scanning but of course the developer adding that something new has no idea that Trivy should be scanning it?

What happens if something new is added to the repo that Trivy should be scanning but of course the developer adding that something new has no idea that Trivy should be scanning it?

There is nothing like "something new" in Trivy. Every scan follows the Trivy configuration file, which defines precisely what shall be scanned. We need to modify the Trivy configuration to extend the scope of scans. But it is not a developer's task but rather the security engineer's one—to decide that additional checks are required. So far, Trivy scans Java, Python, and Go dependencies, and this is the basement for file path filtering.

I don't think I was very clear in my question.

The problem I see is that the above is only running Trivy scan on PRs that modify one of the listed paths.

But let's say a developer adds some new code in a new path to the source tree and that code should be scanned by Trivy. But the average developer is not going to know about this Trivy scan workflow and is not going to think about the need to add their path to the list of paths that trigger Trivy scan.

So they get their PR finished and landed without any Trivy scanning but the moment it's landed Trivy scan runs and finds the new code and finds a failure in it. Now master has to be closed and somebody has a high-priority task to resolve this new scan failure, interrupting their scheduled work and nobody else can get anything landed until the issue is resolved.

All of this could have been prevented by always scanning in PRs, not only when some whitelist of paths are modified.

The latest commit limits the list of files to be monitored by the push event.

The file list has been extended by all supported by Trivy files - https://aquasecurity.github.io/trivy/v0.56/docs/coverage/language/#supported-languages

The solution does not block any PR from being landed if not Trivy-related.

schedule run reports detected issues to the Security tab, which should be somehow monitored.
(We have to define a procedure if no one monitors Security related issues reported via Security tab)

the Security tab, which should be somehow monitored. (We have to define a procedure if no one monitors Security related issues reported via Security tab)

This needs to be defined and put in place before we can land anything that only reports there.

the Security tab, which should be somehow monitored. (We have to define a procedure if no one monitors Security related issues reported via Security tab)

This needs to be defined and put in place before we can land anything that only reports there.

We already have similar situation with Scorecard where all issues are reported only to Security tab.

We already have similar situation with Scorecard where all issues are reported only to Security tab.

Scorecard issues show up in the PR that introduces them.

https://aquasecurity.github.io/trivy/v0.56/docs/coverage/language/#supported-languages provides the full list of scanned file in the 'filesystem' scan. Keep the same condition for PR and merge trigger. Doc-only: true Required-githooks: true Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@intel.com>

Doc-only: true Required-githooks: true Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@intel.com>

brianjmurrell · 2024-10-21T14:45:46Z

.github/workflows/trivy.yml

+    paths:
+      - '**/go.mod'
+      - '**/pom.xml'
+      - '**/*gradle.lockfile'
+      - '**/*.sbt.lock'
+      - '**/requirements.txt'
+      - '**/poetry.lock'
+      - '**/Pipfile.lock'
+      - '**/*trivy*'
  pull_request:
    branches: ["master", "release/**"]
    paths:
      - '**/go.mod'
      - '**/pom.xml'
+      - '**/*gradle.lockfile'
+      - '**/*.sbt.lock'
      - '**/requirements.txt'
+      - '**/poetry.lock'
+      - '**/Pipfile.lock'
      - '**/*trivy*'


Is Trivy expensive to run? If not, why do we bother with these path filters and not just run it on every PR and landing so that new issues are caught as soon as possible?

Who is going to know to update these path filters if/when Trivy expands their coverage to other languages, or even when some already supported language changes the file that Trivy should use, etc.?

I suppose the scheduled run will catch these kinds of changes, but again, without a process/responsible person put in place to monitor the Security Tab, these changes will go unnoticed and we will not be performing complete Trivy scans on PRs or even on landings.

I have observed some limits on the Trivy database access.
(https://github.com/daos-stack/daos/actions/runs/11444700004/job/31840159315#step:4:21)

Everything that breakes the PR build, but is not related to PR-related changes is VERY expensive from the PR's author perspective. Moreover, the same problem may hit several PRs in parallel. Who will be responsible to fix it?

Trivy is configured intentionally to detect CVEs in Python/Go/Java. This defines scope of files to be scaned. I do not expect that this will change soon.

Why we can not treat schedule in the same way as we treat daily tests?
I expect that one scan per day is enough in normal case.

I have observed some limits on the Trivy database access. (daos-stack/daos/actions/runs/11444700004/job/31840159315#step:4:21)

44000/minute seems like a lot. Are we sure something did not run amok?

Everything that breakes the PR build, but is not related to PR-related changes is VERY expensive from the PR's author perspective.

No more than the status quo. Just look at the isort action/check that has been broken for a few days last week. It finally (and only, I suspect!) got updated because somebody noticed it in their PR and started asking about it/getting it fixed. If it did not break a PR I am pretty sure it would still be broken and nobody noticing.

Moreover, the same problem may hit several PRs in parallel. Who will be responsible to fix it?

Just like the isort problem last week. Somebody will raise their PR being broken because of it and somebody jumps in to fix it.

Trivy is configured intentionally to detect CVEs in Python/Go/Java. This defines scope of files to be scaned. I do not expect that this will change soon.

Which is even worse. Because when a long time goes by and they do finally change something, nobody remembers that we need to change on our end also to capture that new something.

Why we can not treat schedule in the same way as we treat daily tests? I expect that one scan per day is enough in normal case.

Phil has been assigned ownership of monitoring daily scans. Who will be assigned daily monitoring of the Security Tab? This is something that needs to be addressed before we introduce a security scan that may only publish results to the Security Tab and not be raised by people's PRs or landings. This is something that needs to be taken up with @ryon-jensen and/or the team in general to define the process and responsible part for monitoring the Security Tab.

I have observed some limits on the Trivy database access. (daos-stack/daos/actions/runs/11444700004/job/31840159315#step:4:21)

44000/minute seems like a lot. Are we sure something did not run amok?

I guess this is a limit for all Github runners - let's check in the future how many false-positive we will have.

Everything that breakes the PR build, but is not related to PR-related changes is VERY expensive from the PR's author perspective.

No more than the status quo. Just look at the isort action/check that has been broken for a few days last week. It finally (and only, I suspect!) got updated because somebody noticed it in their PR and started asking about it/getting it fixed. If it did not break a PR I am pretty sure it would still be broken and nobody noticing.

Moreover, the same problem may hit several PRs in parallel. Who will be responsible to fix it?

Just like the isort problem last week. Somebody will raise their PR being broken because of it and somebody jumps in to fix it.

"Somebody will raise it ... somebody jumps in to fix it" is not a well-defined process :(. But let's keep it as it is for a while and see how it works.

Trivy is configured intentionally to detect CVEs in Python/Go/Java. This defines scope of files to be scaned. I do not expect that this will change soon.

Which is even worse. Because when a long time goes by and they do finally change something, nobody remembers that we need to change on our end also to capture that new something.

Why we can not treat schedule in the same way as we treat daily tests? I expect that one scan per day is enough in normal case.

Phil has been assigned ownership of monitoring daily scans. Who will be assigned daily monitoring of the Security Tab? This is something that needs to be addressed before we introduce a security scan that may only publish results to the Security Tab and not be raised by people's PRs or landings. This is something that needs to be taken up with @ryon-jensen and/or the team in general to define the process and responsible part for monitoring the Security Tab.

Let's keep the schedule trigger just in case the push (to master) fails and we do not have any results updated in the Security tab until next PR lands.

Required-githooks: true Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@intel.com>

grom72 changed the title ~~Limit scope of changes that are monitored by Trivy scan~~ SRE-2505: Limit scope of changes that are monitored by Trivy scan Oct 2, 2024

grom72 requested review from brianjmurrell and ryon-jensen October 2, 2024 19:15

grom72 marked this pull request as ready for review October 2, 2024 19:16

grom72 requested a review from a team as a code owner October 2, 2024 19:16

grom72 added doc-only Changes only affect documentation, not code release-2.6.2 Targeted for release 2.6.2 labels Oct 4, 2024

grom72 mentioned this pull request Oct 4, 2024

DAOS-16659 common: Ignore Hadoop related CVE-2024-47554 #15241

Closed

18 tasks

brianjmurrell requested changes Oct 7, 2024

View reviewed changes

grom72 requested a review from brianjmurrell October 8, 2024 07:22

Fix SPDX license header

c029a80

Doc-only: true Required-githooks: true Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@intel.com>

fix: restore unnecessary cache backend

3f4c483

Doc-only: true Required-githooks: true Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@intel.com>

Merge remote-tracking branch 'origin/master' into grom72/SRE-2505-trivy

f71df83

Doc-only: true Required-githooks: true Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@intel.com>

brianjmurrell reviewed Oct 15, 2024

View reviewed changes

grom72 requested a review from brianjmurrell October 16, 2024 12:30

grom72 added 2 commits October 21, 2024 14:34

Fix: documentation

d7c35a2

Doc-only: true Required-githooks: true Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@intel.com>

brianjmurrell reviewed Oct 21, 2024

View reviewed changes

grom72 requested a review from brianjmurrell October 21, 2024 18:47

Fix: simplify triggering rules

1d52932

Required-githooks: true Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@intel.com>

grom72 changed the title ~~SRE-2505: Limit scope of changes that are monitored by Trivy scan~~ SRE-2505: Fix Trivy scan upload to the Security tab Oct 22, 2024

ryon-jensen approved these changes Oct 22, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

SRE-2505: Fix Trivy scan upload to the Security tab #15201

SRE-2505: Fix Trivy scan upload to the Security tab #15201

grom72 commented Sep 26, 2024 •

edited

Loading

github-actions bot commented Sep 26, 2024 •

edited

Loading

brianjmurrell left a comment

grom72 commented Oct 8, 2024

brianjmurrell commented Oct 8, 2024

grom72 commented Oct 8, 2024

brianjmurrell commented Oct 8, 2024

grom72 commented Oct 9, 2024

brianjmurrell commented Oct 9, 2024

ryon-jensen commented Oct 11, 2024

grom72 commented Oct 15, 2024

brianjmurrell Oct 15, 2024

grom72 Oct 16, 2024

brianjmurrell Oct 15, 2024

grom72 Oct 16, 2024

brianjmurrell Oct 16, 2024

grom72 Oct 21, 2024

brianjmurrell Oct 21, 2024

grom72 Oct 21, 2024

brianjmurrell Oct 21, 2024

brianjmurrell Oct 21, 2024

grom72 Oct 21, 2024

brianjmurrell Oct 21, 2024

grom72 Oct 22, 2024

SRE-2505: Fix Trivy scan upload to the Security tab #15201

Are you sure you want to change the base?

SRE-2505: Fix Trivy scan upload to the Security tab #15201

Conversation

grom72 commented Sep 26, 2024 • edited Loading

Before requesting gatekeeper:

Gatekeeper:

github-actions bot commented Sep 26, 2024 • edited Loading

brianjmurrell left a comment

Choose a reason for hiding this comment

grom72 commented Oct 8, 2024

brianjmurrell commented Oct 8, 2024

grom72 commented Oct 8, 2024

brianjmurrell commented Oct 8, 2024

grom72 commented Oct 9, 2024

brianjmurrell commented Oct 9, 2024

ryon-jensen commented Oct 11, 2024

grom72 commented Oct 15, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

grom72 commented Sep 26, 2024 •

edited

Loading

github-actions bot commented Sep 26, 2024 •

edited

Loading