SRE-2505: Fix Trivy scan upload to the Security tab #15201
base: master
Do not start the Trivy scan if the changes are not related to dependencies. Run Trivy on a daily basis.
Add a badge to follow the cyclic Trivy scans.
Enable scans on request.
Doc-only: true
Required-githooks: true
Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@intel.com>
Errors are: component not formatted correctly, ticket number suffix is not a number, unable to load ticket data. See https://daosio.atlassian.net/wiki/spaces/DC/pages/11133911069/Commit+Comments
If we limit the scope here so that this is done by a timer how do we know when the timed run has found something new? I.e. Who gets notified and how?
Somebody who monitors There is also a tag in the README file: Line 8 in 766d9e1
If we keep the original version, developers will start ignoring Trivy scans as soon as they notice that Trivy reports issues that are completely outside the scope of their PRs, e.g. modifications in VOS (C language) that trigger (???) Java issues. Additionally, Dependabot is the main protection mechanism for our code. The Trivy scan mainly provides information for the SDLe process. It reacts faster than Dependabot, as we observed last week, but we should not bother developers if we know that Dependabot will create a proper fix a day or two later.
Who exactly is this somebody? Because unless it's somebody whose job description includes being responsible for monitoring these (monitoring often gets back-burnered by new high-priority tasks, FWIW), somebody will be nobody.
No developer pays attention to those.
Not if such scan failures fail their PR and they cannot get them landed. This is how we handle all of our linting (for example), etc. currently. Periodically somebody will discover that some new linting requirement on their PR that's not related to their code and they either have to fix it in their PR or raise the flag that it needs fixing. In the meanwhile, gatekeepers are requested to force-land. All of this keeps the issue front-and-centre and motivates a fix in quick time.
Is Trivy entirely related to dependencies? That is, does it not find any kinds of vulnerabilities other than out-of-date dependencies?
Do we follow the
Trivy detects CVEs based on detected dependencies. Dependabot does the same, but Dependabot also creates a fix.
Who is "we"?
I don't know that anyone has been "assigned" to and is responsible for what shows up there. I would guess the answer is no.
We can dream (that somebody might be made responsible for it) can't we? 😄
Usually? Does that mean there are cases where they are not fixed by Dependabot?
New linting rules can be introduced.
I disagree. Force-landing, in this case, just means that we are a large team working on a lot of code and we do not want to halt everyone's progress while surprises are taken care of and we don't want to have such things landable without a gatekeeper being aware by knowing they are force landing.
So it's entirely driven by dependencies?
We have different
I have not seen such a case so far. If Dependabot cannot fix the issue, then we have to create an exception, as is already done for many Hadoop issues: Line 2 in dbc2808
But I do not think it is a good approach to teach everybody how to search for exceptions and how to resolve them.
Yes, exactly, we use force-landing because we do not want to wait for Dependabot to fix the issue. I want to avoid a situation that can be described in the following steps:
compares them with CVE database and produces a report. |
Doc-only: true Required-githooks: true Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@intel.com>
As an implementation matter, perhaps. But first and foremost, there is an administrative component that needs to be completed first. Management needs to decide that this is an important task and then needs to decide who is going to be responsible for it and to assign it to them.
You probably don't submit as many daos PRs as other developers and have not been here long enough to see it.
The issue is not necessarily that everyone is responsible for fixing them. The idea is that making a CI stage fail due to an issue is what drives getting it fixed. Around here, issues that don't stop people's work just get pushed aside and continue to pile up without anyone stopping what they are doing to address them.
And if Trivy can find the issue sooner so that somebody can beat Dependabot to fixing the issue, great. But the Trivy scan has to interrupt/block people's work, requiring them to request force-landings in order for whatever issue the scan found to be pushed up in importance. If it's just a scan that affects nobody, it gets ignored. That's just how things work around here.
Who's going to create that ticket? What prompts them to create it if the scan did not impact their PR? And yes, "who is an owner"?
The developer needing to request this is the only thing that drives a ticket being created and the issue being fixed. If no developer is impacted, nothing happens.
Doc-only: true Required-githooks: true Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@intel.com>
Can the Trivy workflow generate a notification or create a ticket so findings aren't lost?
Doc-only: true Required-githooks: true Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@intel.com>
A notification is created on the Security tab if a CVE is detected.
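For readers unfamiliar with how findings reach the Security tab, here is a minimal sketch. It assumes the public aquasecurity/trivy-action and github/codeql-action/upload-sarif actions; the workflow name, job name, and SARIF file name are hypothetical, and this is not necessarily how the trivy.yml in this PR is written.

```yaml
# Sketch only: scan the repository and publish results to the Security tab.
name: Trivy SARIF upload (sketch)   # hypothetical name

on:
  push:
    branches: ["master"]
  schedule:
    - cron: '45 8 * * *'            # daily run, as in the snippet under review

jobs:
  scan:                             # hypothetical job name
    runs-on: ubuntu-latest
    permissions:
      contents: read                # needed by actions/checkout
      security-events: write        # needed to upload SARIF to code scanning
    steps:
      - uses: actions/checkout@v4

      - name: Run Trivy filesystem scan
        # Pinning to a release tag or commit SHA is preferable in practice.
        uses: aquasecurity/trivy-action@master
        with:
          scan-type: 'fs'
          scan-ref: '.'
          format: 'sarif'
          output: 'trivy-results.sarif'   # hypothetical file name

      - name: Upload results to the Security tab
        uses: github/codeql-action/upload-sarif@v3
        with:
          sarif_file: 'trivy-results.sarif'
```

Uploading SARIF only makes findings visible on the Security tab. Whether anybody is notified still depends on who watches that tab, which is the open question in this thread.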
  schedule:
    - cron: '45 8 * * *'
Does the scheduled run add any value if we are (correctly I will add) running this on both PRs and landings?
There are three use cases:
- PR: informs the PR's author about new problems delivered together with the PR. This scan does not add anything to the Security tab.
- push: reports new issues introduced by the PR to the Security tab if they were not resolved by the author.
- schedule: reports new issues to the Security tab, found mainly because of new issues reported by the CVE databases.
There is an overlap between schedule and push, but schedule does not need to wait for any PR to land.
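The three use cases map onto three workflow triggers, which can be sketched as below. The cron expression and the pull_request branch list come from the snippets under review. The push branch list, the workflow_dispatch trigger (standing in for "scans on request"), and the workflow and job names are assumptions.

```yaml
# Sketch only: one workflow, several triggers.
name: Trivy scan (sketch)             # hypothetical name

on:
  workflow_dispatch:                  # manual run, i.e. a scan on request
  schedule:
    - cron: '45 8 * * *'              # daily; does not wait for a PR to land
  push:
    branches: ["master", "release/**"]   # assumed to mirror pull_request
  pull_request:
    branches: ["master", "release/**"]

jobs:
  scan:                               # hypothetical job name
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Show which use case triggered this run
        # github.event_name is pull_request, push, schedule, or workflow_dispatch
        run: echo "Triggered by ${{ github.event_name }}"
```

Steps that should only run for some use cases, such as publishing to the Security tab, can be guarded with an `if:` condition on `github.event_name`.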
.github/workflows/trivy.yml
Outdated
    paths:
      - '**/go.mod'
      - '**/pom.xml'
      - '**/requirements.txt'
      - '**/*trivy*'
What happens if something new is added to the repo that Trivy should be scanning but of course the developer adding that something new has no idea that Trivy should be scanning it?
What happens if something new is added to the repo that Trivy should be scanning but of course the developer adding that something new has no idea that Trivy should be scanning it?
There is nothing like "something new" in Trivy. Every scan follows the Trivy configuration file, which defines precisely what is scanned. To extend the scope of the scans, we need to modify the Trivy configuration. Deciding that additional checks are required is not a developer's task but rather the security engineer's. So far, Trivy scans Java, Python, and Go dependencies, and this is the basis for the file path filtering.
I don't think I was very clear in my question.
The problem I see is that the above is only running Trivy scan on PRs that modify one of the listed paths.
But let's say a developer adds some new code in a new path to the source tree and that code should be scanned by Trivy. But the average developer is not going to know about this Trivy scan workflow and is not going to think about the need to add their path to the list of paths that trigger Trivy scan.
So they get their PR finished and landed without any Trivy scanning but the moment it's landed Trivy scan runs and finds the new code and finds a failure in it. Now master has to be closed and somebody has a high-priority task to resolve this new scan failure, interrupting their scheduled work and nobody else can get anything landed until the issue is resolved.
All of this could have been prevented by always scanning in PRs, not only when a path on some whitelist is modified.
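The alternative argued for here amounts to dropping the paths filter from the pull_request trigger. A fragment as a sketch (the branch list is taken from the snippet under review; it is the `on:` section only, not a complete workflow):

```yaml
# Sketch only: trigger section of a workflow file.
on:
  pull_request:
    branches: ["master", "release/**"]
    # No 'paths:' key here, so every PR targeting these branches is scanned.
    # With a 'paths:' list, the workflow runs only when at least one changed
    # file matches one of the listed patterns.
```

The trade-off debated in this thread is that an unfiltered trigger can fail PRs whose changes have nothing to do with dependencies.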
The latest commit limits the list of files monitored by the push event.
The file list has been extended with all files supported by Trivy - https://aquasecurity.github.io/trivy/v0.56/docs/coverage/language/#supported-languages
The solution does not block any PR from landing if it is not Trivy-related.
The schedule run reports detected issues to the Security tab, which should be monitored somehow.
(We have to define a procedure if no one monitors security-related issues reported via the Security tab.)
the Security tab, which should be somehow monitored. (We have to define a procedure if no one monitors Security related issues reported via Security tab)
This needs to be defined and put in place before we can land anything that only reports there.
the Security tab, which should be somehow monitored. (We have to define a procedure if no one monitors Security related issues reported via Security tab)
This needs to be defined and put in place before we can land anything that only reports there.
We already have a similar situation with Scorecard, where all issues are reported only to the Security tab.
We already have a similar situation with Scorecard, where all issues are reported only to the Security tab.
Scorecard issues show up in the PR that introduces them.
https://aquasecurity.github.io/trivy/v0.56/docs/coverage/language/#supported-languages provides the full list of files scanned in the 'filesystem' scan. Keep the same condition for the PR and merge triggers. Doc-only: true Required-githooks: true Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@intel.com>
Doc-only: true Required-githooks: true Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@intel.com>
.github/workflows/trivy.yml
Outdated
    paths:
      - '**/go.mod'
      - '**/pom.xml'
      - '**/*gradle.lockfile'
      - '**/*.sbt.lock'
      - '**/requirements.txt'
      - '**/poetry.lock'
      - '**/Pipfile.lock'
      - '**/*trivy*'
  pull_request:
    branches: ["master", "release/**"]
    paths:
      - '**/go.mod'
      - '**/pom.xml'
      - '**/*gradle.lockfile'
      - '**/*.sbt.lock'
      - '**/requirements.txt'
      - '**/poetry.lock'
      - '**/Pipfile.lock'
      - '**/*trivy*'
Is Trivy expensive to run? If not, why do we bother with these path filters and not just run it on every PR and landing so that new issues are caught as soon as possible?
Who is going to know to update these path filters if/when Trivy expands their coverage to other languages, or even when some already supported language changes the file that Trivy should use, etc.?
I suppose the scheduled run will catch these kinds of changes, but again, without a process/responsible person put in place to monitor the Security Tab, these changes will go unnoticed and we will not be performing complete Trivy scans on PRs or even on landings.
I have observed some limits on the Trivy database access.
(https://github.com/daos-stack/daos/actions/runs/11444700004/job/31840159315#step:4:21)
Everything that breaks the PR build but is not related to the PR's changes is VERY expensive from the PR author's perspective. Moreover, the same problem may hit several PRs in parallel. Who will be responsible for fixing it?
Trivy is intentionally configured to detect CVEs in Python/Go/Java. This defines the scope of files to be scanned. I do not expect this to change soon.
Why can we not treat schedule the same way as we treat daily tests?
I expect that one scan per day is enough in the normal case.
I have observed some limits on the Trivy database access. (daos-stack/daos/actions/runs/11444700004/job/31840159315#step:4:21)
44000/minute seems like a lot. Are we sure something did not run amok?
Everything that breaks the PR build but is not related to the PR's changes is VERY expensive from the PR author's perspective.
No more than the status quo. Just look at the isort action/check that was broken for a few days last week. It finally (and only, I suspect!) got updated because somebody noticed it in their PR and started asking about it/getting it fixed. If it had not broken a PR, I am pretty sure it would still be broken with nobody noticing.
Moreover, the same problem may hit several PRs in parallel. Who will be responsible for fixing it?
Just like the isort problem last week. Somebody will raise their PR being broken because of it and somebody jumps in to fix it.
Trivy is intentionally configured to detect CVEs in Python/Go/Java. This defines the scope of files to be scanned. I do not expect this to change soon.
Which is even worse. Because when a long time goes by and they do finally change something, nobody remembers that we also need to change something on our end to capture that new something.
Why can we not treat schedule the same way as we treat daily tests? I expect that one scan per day is enough in the normal case.
Phil has been assigned ownership of monitoring daily scans. Who will be assigned daily monitoring of the Security Tab? This is something that needs to be addressed before we introduce a security scan that may only publish results to the Security Tab and not be raised by people's PRs or landings. This is something that needs to be taken up with @ryon-jensen and/or the team in general to define the process and the responsible party for monitoring the Security Tab.
I have observed some limits on the Trivy database access. (daos-stack/daos/actions/runs/11444700004/job/31840159315#step:4:21)
44000/minute seems like a lot. Are we sure something did not run amok?
I guess this is a limit for all GitHub runners. Let's check in the future how many false positives we get.
Everything that breaks the PR build but is not related to the PR's changes is VERY expensive from the PR author's perspective.
No more than the status quo. Just look at the isort action/check that was broken for a few days last week. It finally (and only, I suspect!) got updated because somebody noticed it in their PR and started asking about it/getting it fixed. If it had not broken a PR, I am pretty sure it would still be broken with nobody noticing.
Moreover, the same problem may hit several PRs in parallel. Who will be responsible for fixing it?
Just like the isort problem last week. Somebody will raise their PR being broken because of it and somebody jumps in to fix it.
"Somebody will raise it ... somebody jumps in to fix it" is not a well-defined process :(. But let's keep it as it is for a while and see how it works.
Trivy is intentionally configured to detect CVEs in Python/Go/Java. This defines the scope of files to be scanned. I do not expect this to change soon.
Which is even worse. Because when a long time goes by and they do finally change something, nobody remembers that we also need to change something on our end to capture that new something.
Why can we not treat schedule the same way as we treat daily tests? I expect that one scan per day is enough in the normal case.
Phil has been assigned ownership of monitoring daily scans. Who will be assigned daily monitoring of the Security Tab? This is something that needs to be addressed before we introduce a security scan that may only publish results to the Security Tab and not be raised by people's PRs or landings. This is something that needs to be taken up with @ryon-jensen and/or the team in general to define the process and the responsible party for monitoring the Security Tab.
Let's keep the schedule trigger just in case the push (to master) fails and we do not have any results updated in the Security tab until the next PR lands.
Required-githooks: true Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@intel.com>
Do not start the Trivy scan if the changes are not related to dependencies. Run Trivy on a daily basis.
Add a badge to follow the cyclic Trivy scans.
Enable scans on request.
Doc-only: true
Required-githooks: true