Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SRE-2505: Fix Trivy scan upload to the Security tab #15201

Open
wants to merge 7 commits into
base: master
Choose a base branch
from

Conversation

grom72
Copy link
Contributor

@grom72 grom72 commented Sep 26, 2024

Do not start Trivy scan if changes not related to dependencies. Run Trivy on daily bases.
Add badge to follow cycle Trivy scans
Enable scans on request

Doc-only: true

Required-githooks: true

Before requesting gatekeeper:

  • Two review approvals and any prior change requests have been resolved.
  • Testing is complete and all tests passed or there is a reason documented in the PR why it should be force landed and forced-landing tag is set.
  • Features: (or Test-tag*) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.
  • Commit messages follows the guidelines outlined here.
  • Any tests skipped by the ticket being addressed have been run and passed in the PR.

Gatekeeper:

  • You are the appropriate gatekeeper to be landing the patch.
  • The PR has 2 reviews by people familiar with the code, including appropriate owners.
  • Githooks were used. If not, request that user install them and check copyright dates.
  • Checkpatch issues are resolved. Pay particular attention to ones that will show up on future PRs.
  • All builds have passed. Check non-required builds for any new compiler warnings.
  • Sufficient testing is done. Check feature pragmas and test tags and that tests skipped for the ticket are run and now pass with the changes.
  • If applicable, the PR has addressed any potential version compatibility issues.
  • Check the target branch. If it is master branch, should the PR go to a feature branch? If it is a release branch, does it have merge approval in the JIRA ticket.
  • Extra checks if forced landing is requested
    • Review comments are sufficiently resolved, particularly by prior reviewers that requested changes.
    • No new NLT or valgrind warnings. Check the classic view.
    • Quick-build or Quick-functional is not used.
  • Fix the commit message upon landing. Check the standard here. Edit it to create a single commit. If necessary, ask submitter for a new summary.

Do not start Trivy scan if changes not related to dependencies.
Run Trivy on daily bases.
Add badge to follow cycle Trivy scans
Enable scans on request

Doc-only: true

Required-githooks: true

Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@intel.com>
Copy link

github-actions bot commented Sep 26, 2024

Errors are component not formatted correctly,Ticket number suffix is not a number. See https://daosio.atlassian.net/wiki/spaces/DC/pages/11133911069/Commit+Comments,Unable to load ticket data
https://daosio.atlassian.net/browse/SRE-2505:

@grom72 grom72 changed the title Limit scope of changes that are monitored by Trivy scan SRE-2505: Limit scope of changes that are monitored by Trivy scan Oct 2, 2024
@grom72 grom72 marked this pull request as ready for review October 2, 2024 19:16
@grom72 grom72 requested a review from a team as a code owner October 2, 2024 19:16
@grom72 grom72 added doc-only Changes only affect documentation, not code release-2.6.2 Targeted for release 2.6.2 labels Oct 4, 2024
Copy link
Contributor

@brianjmurrell brianjmurrell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we limit the scope here so that this is done by a timer how do we know when the timed run has found something new? I.e. Who gets notified and how?

@grom72
Copy link
Contributor Author

grom72 commented Oct 8, 2024

If we limit the scope here so that this is done by a timer how do we know when the timed run has found something new? I.e. Who gets notified and how?

Somebody who monitors Security issues of the project as we report all detected issues there:
https://github.com/daos-stack/daos/security/code-scanning?query=is%3Aopen+branch%3Amaster+tool%3ATrivy

There is also a tag in the README file:

[![Trivy scan](https://github.com/daos-stack/daos/actions/workflows/trivy.yml/badge.svg)](https://github.com/daos-stack/daos/actions/workflows/trivy.yml)

If we keep the original version, developers will start ignoring Trivy scans as soon as they detect that Trivy reports issues that are completely out of the scope of their PRs. E.g. modifications in VOS (c language) that trigger (???) Java issues.

Additionally, Dependabot is the main protection mechanism for our code. Trivy scan mainly provides information for the SDLe process. It reacts faster than Dependabots as we observed last week, but we should not bother developers if we know that Dependabot creates a proper fix a day or two days later.

@brianjmurrell
Copy link
Contributor

Somebody

Who exactly is this somebody? Because unless it's somebody who's job description includes being responsible for monitoring (monitoring often gets back-burnered by new high priority tasks, FWIW) these, somebody will be nobody.

There is also a tag in the README file:

No developer pays attention to those.

If we keep the original version, developers will start ignoring Trivy scans as soon as they detect that Trivy reports issues that are completely out of the scope of their PRs. E.g. modifications in VOS (c language) that trigger (???) Java issues.

Not if such scan failures fail their PR and they cannot get them landed. This is how we handle all of our linting (for example), etc. currently. Periodically somebody will discover that some new linting requirement on their PR that's not related to their code and they either have to fix it in their PR or raise the flag that it needs fixing. In the meanwhile, gatekeepers are requested to force-land. All of this keeps the issue front-and-centre and motivates a fix in quick time.

Additionally, Dependabot is the main protection mechanism for our code. Trivy scan mainly provides information for the SDLe process. It reacts faster than Dependabots as we observed last week, but we should not bother developers if we know that Dependabot creates a proper fix a day or two days later.

Is Trivy entirely related to dependencies? That is, does it not find any other kinds of vulnerabilities other than dependencies that are out of date?

@grom72
Copy link
Contributor Author

grom72 commented Oct 8, 2024

Somebody

Who exactly is this somebody? Because unless it's somebody who's job description includes being responsible for monitoring (monitoring often gets back-burnered by new high priority tasks, FWIW) these, somebody will be nobody.

Do we follow the security tab (https://github.com/daos-stack/daos/security)?
If not, what is the purpose of maintaining it?

There is also a tag in the README file:

No developer pays attention to those.

If we keep the original version, developers will start ignoring Trivy scans as soon as they detect that Trivy reports issues that are completely out of the scope of their PRs. E.g. modifications in VOS (c language) that trigger (???) Java issues.

Not if such scan failures fail their PR and they cannot get them landed. This is how we handle all of our linting (for example), etc. currently. Periodically somebody will discover that some new linting requirement on their PR that's not related to their code and they either have to fix it in their PR or raise the flag that it needs fixing. In the meanwhile, gatekeepers are requested to force-land. All of this keeps the issue front-and-centre and motivates a fix in quick time.

  1. linting can not be fixed automatically while Trivy issues are usually fixed by Dependabot
  2. how is it possible to have linting issue not related to modified code?
  3. force-landing exception shows that the process does not work well - it is a workaround for too heavy validation of PR

Additionally, Dependabot is the main protection mechanism for our code. Trivy scan mainly provides information for the SDLe process. It reacts faster than Dependabots as we observed last week, but we should not bother developers if we know that Dependabot creates a proper fix a day or two days later.

Is Trivy entirely related to dependencies? That is, does it not find any other kinds of vulnerabilities other than dependencies that are out of date?

Trivy detects CVEs based on detected dependencies. Dependabot does the same, but Dependabot also creates a fix.
Finally, Trivy is somehow a guard that confirm Dependabot works well.

@brianjmurrell
Copy link
Contributor

Do we

Who is "we"?

follow the security tab (daos-stack/daos/security)?

I don't know that anyone has been "assigned" to and is responsible for what shows up there. I would guess the answer is no.

If not, what is the purpose of maintaining it?

We can dream (that somebody might be made responsible for it) can't we? 😄

  1. linting can not be fixed automatically while Trivy issues are usually fixed by Dependabot

Usually? Does that mean there are cases where they not fixed by Dependabot?

  1. how is it possible to have linting issue not related to modified code?

New linting rules can be introduced.

  1. force-landing exception shows that the process does not work well - it is a workaround for too heavy validation of PR

I disagree. Force-landing, in this case, just means that we are a large team working on a lot of code and we do not want to halt everyone's progress while surprises are taken care of and we don't want to have such things landable without a gatekeeper being aware by knowing they are force landing.

Trivy detects CVEs based on detected dependencies.

So it's entirely driven by dependencies?

@grom72
Copy link
Contributor Author

grom72 commented Oct 9, 2024

Do we

Who is "we"?

follow the security tab (daos-stack/daos/security)?

I don't know that anyone has been "assigned" to and is responsible for what shows up there. I would guess the answer is no.

If not, what is the purpose of maintaining it?

We can dream (that somebody might be made responsible for it) can't we? 😄

We have different *owners groups in daos-stack/daos. Perhaps it is time to create the security-owners group that will monitor security issues reported by Github.

  1. linting can not be fixed automatically while Trivy issues are usually fixed by Dependabot

Usually? Does that mean there are cases where they not fixed by Dependabot?

I have not seen such a case so far. If Dependabot can not fix the issue then we have to create an exception, as it is already done for many Hadoop issues

## CVE-2023-52428,MEDIUM,,"Denial of Service in Connect2id Nimbus JOSE+JWT","com.nimbusds:nimbus-jose-jwt","9.8.1","9.37.2",https://avd.aquasec.com/nvd/cve-2023-52428

But, I do not think it is a good approach to teach everybody how to search for exceptions and how to resolve them.

  1. how is it possible to have linting issue not related to modified code?

New linting rules can be introduced.

  1. force-landing exception shows that the process does not work well - it is a workaround for too heavy validation of PR

I disagree. Force-landing, in this case, just means that we are a large team working on a lot of code and we do not want to halt everyone's progress while surprises are taken care of and we don't want to have such things landable without a gatekeeper being aware by knowing they are force landing.

Yes, exactly, we use force-landing because we do not want to wait for Dependabot to fix the issue.

I want to avoid a situation that can be described in the following steps:

  1. PR is created
  2. Trivy reports an issue
  3. A related ticket is created to monitor Trivy issue (who is an owner?)
  4. PR is force-landing, as you mentioned.
  5. Dependabot detect an issue and provide a fix
  6. Dependabot PR shall pass Trivy scan if not -> that means a bigger problem to be handled separately
  7. Fix is merged
  8. Ticket from 3) can be closed (by whom?)
  9. Original PR is marked as failed as nobody re-runs Trivy scan on merged PR

Trivy detects CVEs based on detected dependencies.

So it's entirely driven by dependencies?
Yes, it checks files that are now mentioned in PR:

      - '**/go.mod'
      - '**/pom.xml'
      - '**/requirements.txt'

compares them with CVE database and produces a report.
I do not know how Java code is released (whether at all), but in the case of Go and Python, Trivy issues are more related to project CI infrastructure rather than the final product.

Doc-only: true

Required-githooks: true

Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@intel.com>
@brianjmurrell
Copy link
Contributor

We have different *owners groups in daos-stack/daos. Perhaps it is time to create the security-owners group that will monitor security issues reported by Github.

As an implementation matter, perhaps. But first and foremost, there is an administrative component that needs to be completed first. Management needs to decide that this is an important task and then needs to decide who is going to be responsible for it and to assign it to them.

I have not seen such a case so far.

You probably don't submit as many daos PRs as other developers and have not been here long enough to see it.

But, I do not think it is a good approach to teach everybody how to search for exceptions and how to resolve them.

The issue not necessarily that everyone is responsible for fixing them. The idea is that making a CI stage fail due to an issue is what drives getting it fixed. Around here, Issues that don't stop people's work just get pushed aside and continue to pile up without anyone stopping what they are doing to address them.

Yes, exactly, we use force-landing because we do not want to wait for Dependabot to fix the issue.

And if Trivy can find the issue sooner so that somebody can beat Dependabot to fixing the issue, great. But the Trivy scan has to interrupt/block people's work, requiring them to request force-landings in order for whatever issue the scan found to be pushed up in importance. If it's just a scan that affects nobody, it gets ignored. That's just how things work around here.

I want to avoid a situation that can be described in the following steps:

  1. PR is created
  2. Trivy reports an issue
  3. A related ticket is created to monitor Trivy issue (who is an owner?)

Who's going to create that ticket? What prompts them to create it if the scan did not impact their PR? And yes, "who is an owner"?

  1. PR is force-landing, as you mentioned.

The developer needing to request this is the only thing that drives a ticket being created and the issue being fixed. If no developer is impacted, nothing happens.

Doc-only: true
Required-githooks: true

Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@intel.com>
@ryon-jensen
Copy link
Contributor

Can the Trivy workflow generate a notification or create a ticket so findings aren't lost?

Doc-only: true

Required-githooks: true
Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@intel.com>
@grom72
Copy link
Contributor Author

grom72 commented Oct 15, 2024

Can the Trivy workflow generate a notification or create a ticket so findings aren't lost?

Notification is created (if CVE detected) on Security tab:
https://github.com/daos-stack/daos/security/code-scanning?query=is%3Aopen+branch%3Amaster+tool%3ATrivy

Comment on lines +8 to +9
schedule:
- cron: '45 8 * * *'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the scheduled run add any value if we are (correctly I will add) running this on both PRs and landings?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are three usecases

  • PR - to inform the PR's author about new problems delivered together with the PR - this scan does not add anything to Security tab
  • push reports new issues introduced by PR to the Security tab if not resolved by the author
  • schedule reports new issues to the Security tab detected mainly due to new issues reported by CVE databases
    There is an overlap between schedule and push but the schedule does not need to wait for any PR to land.

Comment on lines 14 to 18
paths:
- '**/go.mod'
- '**/pom.xml'
- '**/requirements.txt'
- '**/*trivy*'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if something new is added to the repo that Trivy should be scanning but of course the developer adding that something new has no idea that Trivy should be scanning it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if something new is added to the repo that Trivy should be scanning but of course the developer adding that something new has no idea that Trivy should be scanning it?

There is nothing like "something new" in Trivy. Every scan follows the Trivy configuration file, which defines precisely what shall be scanned. We need to modify the Trivy configuration to extend the scope of scans. But it is not a developer's task but rather the security engineer's one—to decide that additional checks are required. So far, Trivy scans Java, Python, and Go dependencies, and this is the basement for file path filtering.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think I was very clear in my question.

The problem I see is that the above is only running Trivy scan on PRs that modify one of the listed paths.

But let's say a developer adds some new code in a new path to the source tree and that code should be scanned by Trivy. But the average developer is not going to know about this Trivy scan workflow and is not going to think about the need to add their path to the list of paths that trigger Trivy scan.

So they get their PR finished and landed without any Trivy scanning but the moment it's landed Trivy scan runs and finds the new code and finds a failure in it. Now master has to be closed and somebody has a high-priority task to resolve this new scan failure, interrupting their scheduled work and nobody else can get anything landed until the issue is resolved.

All of this could have been prevented by always scanning in PRs, not only when some whitelist of paths are modified.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The latest commit limits the list of files to be monitored by the push event.

The file list has been extended by all supported by Trivy files - https://aquasecurity.github.io/trivy/v0.56/docs/coverage/language/#supported-languages

The solution does not block any PR from being landed if not Trivy-related.

schedule run reports detected issues to the Security tab, which should be somehow monitored.
(We have to define a procedure if no one monitors Security related issues reported via Security tab)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the Security tab, which should be somehow monitored. (We have to define a procedure if no one monitors Security related issues reported via Security tab)

This needs to be defined and put in place before we can land anything that only reports there.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the Security tab, which should be somehow monitored. (We have to define a procedure if no one monitors Security related issues reported via Security tab)

This needs to be defined and put in place before we can land anything that only reports there.

We already have similar situation with Scorecard where all issues are reported only to Security tab.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already have similar situation with Scorecard where all issues are reported only to Security tab.

Scorecard issues show up in the PR that introduces them.

https://aquasecurity.github.io/trivy/v0.56/docs/coverage/language/#supported-languages
provides the full list of scanned file in the 'filesystem' scan.

Keep the same condition for PR and merge trigger.

Doc-only: true

Required-githooks: true

Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@intel.com>
Doc-only: true

Required-githooks: true

Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@intel.com>
Comment on lines 12 to 31
paths:
- '**/go.mod'
- '**/pom.xml'
- '**/*gradle.lockfile'
- '**/*.sbt.lock'
- '**/requirements.txt'
- '**/poetry.lock'
- '**/Pipfile.lock'
- '**/*trivy*'
pull_request:
branches: ["master", "release/**"]
paths:
- '**/go.mod'
- '**/pom.xml'
- '**/*gradle.lockfile'
- '**/*.sbt.lock'
- '**/requirements.txt'
- '**/poetry.lock'
- '**/Pipfile.lock'
- '**/*trivy*'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is Trivy expensive to run? If not, why do we bother with these path filters and not just run it on every PR and landing so that new issues are caught as soon as possible?

Who is going to know to update these path filters if/when Trivy expands their coverage to other languages, or even when some already supported language changes the file that Trivy should use, etc.?

I suppose the scheduled run will catch these kinds of changes, but again, without a process/responsible person put in place to monitor the Security Tab, these changes will go unnoticed and we will not be performing complete Trivy scans on PRs or even on landings.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have observed some limits on the Trivy database access.
(https://github.com/daos-stack/daos/actions/runs/11444700004/job/31840159315#step:4:21)

Everything that breakes the PR build, but is not related to PR-related changes is VERY expensive from the PR's author perspective. Moreover, the same problem may hit several PRs in parallel. Who will be responsible to fix it?

Trivy is configured intentionally to detect CVEs in Python/Go/Java. This defines scope of files to be scaned. I do not expect that this will change soon.

Why we can not treat schedule in the same way as we treat daily tests?
I expect that one scan per day is enough in normal case.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have observed some limits on the Trivy database access. (daos-stack/daos/actions/runs/11444700004/job/31840159315#step:4:21)

44000/minute seems like a lot. Are we sure something did not run amok?

Everything that breakes the PR build, but is not related to PR-related changes is VERY expensive from the PR's author perspective.

No more than the status quo. Just look at the isort action/check that has been broken for a few days last week. It finally (and only, I suspect!) got updated because somebody noticed it in their PR and started asking about it/getting it fixed. If it did not break a PR I am pretty sure it would still be broken and nobody noticing.

Moreover, the same problem may hit several PRs in parallel. Who will be responsible to fix it?

Just like the isort problem last week. Somebody will raise their PR being broken because of it and somebody jumps in to fix it.

Trivy is configured intentionally to detect CVEs in Python/Go/Java. This defines scope of files to be scaned. I do not expect that this will change soon.

Which is even worse. Because when a long time goes by and they do finally change something, nobody remembers that we need to change on our end also to capture that new something.

Why we can not treat schedule in the same way as we treat daily tests? I expect that one scan per day is enough in normal case.

Phil has been assigned ownership of monitoring daily scans. Who will be assigned daily monitoring of the Security Tab? This is something that needs to be addressed before we introduce a security scan that may only publish results to the Security Tab and not be raised by people's PRs or landings. This is something that needs to be taken up with @ryon-jensen and/or the team in general to define the process and responsible part for monitoring the Security Tab.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have observed some limits on the Trivy database access. (daos-stack/daos/actions/runs/11444700004/job/31840159315#step:4:21)

44000/minute seems like a lot. Are we sure something did not run amok?

I guess this is a limit for all Github runners - let's check in the future how many false-positive we will have.

Everything that breakes the PR build, but is not related to PR-related changes is VERY expensive from the PR's author perspective.

No more than the status quo. Just look at the isort action/check that has been broken for a few days last week. It finally (and only, I suspect!) got updated because somebody noticed it in their PR and started asking about it/getting it fixed. If it did not break a PR I am pretty sure it would still be broken and nobody noticing.

Moreover, the same problem may hit several PRs in parallel. Who will be responsible to fix it?

Just like the isort problem last week. Somebody will raise their PR being broken because of it and somebody jumps in to fix it.

"Somebody will raise it ... somebody jumps in to fix it" is not a well-defined process :(. But let's keep it as it is for a while and see how it works.

Trivy is configured intentionally to detect CVEs in Python/Go/Java. This defines scope of files to be scaned. I do not expect that this will change soon.

Which is even worse. Because when a long time goes by and they do finally change something, nobody remembers that we need to change on our end also to capture that new something.

Why we can not treat schedule in the same way as we treat daily tests? I expect that one scan per day is enough in normal case.

Phil has been assigned ownership of monitoring daily scans. Who will be assigned daily monitoring of the Security Tab? This is something that needs to be addressed before we introduce a security scan that may only publish results to the Security Tab and not be raised by people's PRs or landings. This is something that needs to be taken up with @ryon-jensen and/or the team in general to define the process and responsible part for monitoring the Security Tab.

Let's keep the schedule trigger just in case the push (to master) fails and we do not have any results updated in the Security tab until next PR lands.

Required-githooks: true

Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@intel.com>
@grom72 grom72 changed the title SRE-2505: Limit scope of changes that are monitored by Trivy scan SRE-2505: Fix Trivy scan upload to the Security tab Oct 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
doc-only Changes only affect documentation, not code release-2.6.2 Targeted for release 2.6.2
Development

Successfully merging this pull request may close these issues.

3 participants