
Report detailed Git provenance information for matches #16

Closed
bradlarsen opened this issue Dec 16, 2022 · 9 comments · Fixed by #66
Labels: enhancement (New feature or request), reporting (Related to reporting of findings)

Comments

@bradlarsen
Collaborator

Like #15, this was also demoed at Black Hat EU 2022.

When a match is found within a blob in a Git repository, detailed provenance information should be reported, including:

  • The commit(s) that first introduced the match, along with author, date, and commit message
  • The path(s) at which the blob appeared in those introducing commits
  • The repository origin URL, if available

With all this information, it is possible to generate permalinks to GitHub for matches.

@Coruscant11
Contributor

Coruscant11 commented Feb 19, 2023

Hello,

I am very curious: do you already know how you are going to use Git features to find this kind of information?
I wanted to find the file paths and commits associated with a blob in a very efficient way, but it does not seem so easy.
It is possible, but I am wary of approaches that solve the problem yet take more time than the scan itself.

Thanks!

@bradlarsen
Collaborator Author

Hi @Coruscant11. Good question!

So, Nosey Parker does its scan of Git history not by looking at commit metadata at all, but by simply scanning all blobs within the Git repository. This allows fast scanning of the content, but doesn't provide any metadata such as pathname for a blob or the commit(s) that introduced it. This makes the reporting of findings rather unfriendly to humans.

So the problem that needs to be solved is this: for a given Git blob, determine the set of (commit, pathname) pairs for the set of commit(s) that introduced that blob. With this information, Nosey Parker could emit more useful findings that include metadata.

The information needed to solve this problem is present only in indirect form in a Git repository: pathnames of blobs are encoded in a DAG of tree objects, and commit ancestry is encoded in a DAG of commit objects. The problem can be posed as a graph problem.

I've implemented this twice before in other non-public codebases, both times solving the above problem in one big operation for all blobs in a Git repo. This does indeed end up being expensive on certain large repositories, likely more expensive than the actual scanning step.

In Nosey Parker, instead of solving this (commit, pathname) problem for all blobs, it might be possible to much more quickly solve it only for the blobs that have findings reported in them. (In a typical Git repository, Nosey Parker only reports findings from a tiny fraction of all blobs.)

For a quick hack, there are ways to determine this information using Git command-line tools. But that sort of approach is slow.

I'm planning to work on this next, probably in the next couple weeks.

@Coruscant11
Contributor

Thank you for your excellent answer!

It is related to some questions I had for #4. I was a little worried about the missing commit details, but I think I'll do exactly what the JSON format does: give either the path when available, or just the blob ID. SARIF allows it.
And if one day we manage to add the commit details, then we will update 😄

I was just wondering about what you said regarding this being more expensive than the actual scanning step. Take, for example, a scan of the Linux kernel, which takes, let's say, two minutes. If we want to find all the commit and pathname information, how long could it take? I have the idea that it could take a few hours, but I do not know if I am mistaken.

This seems to be a very difficult problem, good luck! Don't hesitate to let me know if I can help you.

@bradlarsen
Collaborator Author

I was just wondering about what you said regarding this being more expensive than the actual scanning step. Take, for example, a scan of the Linux kernel, which takes, let's say, two minutes. If we want to find all the commit and pathname information, how long could it take? I have the idea that it could take a few hours, but I do not know if I am mistaken.

In one older prototype implementation I have, it takes about 6 minutes on my laptop to find that information for 500 blobs. I suspect that code could be made a few times faster, also. So, a few times slower than actually scanning, but certainly not hours.

@bradlarsen
Collaborator Author

A stopgap until this feature is implemented: a little shell script that can tell which commits in a repo contain a given blob, found on Stack Overflow:

#!/bin/sh
obj_name="$1"
shift
git log "$@" --pretty=tformat:'%T %h %s' \
| while read -r tree commit subject ; do
    if git ls-tree -r "$tree" | grep -q "$obj_name" ; then
        echo "$commit" "$subject"
    fi
done

I named this git-whichcommit and put it in my PATH. Then I can run it like this, for example in a clone of the noseyparker repo, getting commits listed one per line:

% git whichcommit f55471089e0e49e87db6fd635ecbb687c642c140
49c8837 Add references for a few rules
659136a Add a rule for Postman API keys
b97a3af Update CHANGELOG fix note
...

This could be easily adapted to also print the pathname of the object.
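As a sketch of that adaptation (untested against large repos; the `git_whichpath` function name is my own invention), one can capture the matching `ls-tree` line instead of merely testing for it, since `git ls-tree -r` prints the pathname in its last field:

```shell
# Hypothetical adaptation of git-whichcommit that also prints the pathname
# the blob appears under in each matching commit's tree.
git_whichpath() {
    obj_name="$1"
    shift
    git log "$@" --pretty=tformat:'%T %h %s' |
    while read -r tree commit subject ; do
        # Each `git ls-tree -r` line looks like "<mode> blob <id>\t<path>".
        git ls-tree -r "$tree" | grep "$obj_name" |
        while read -r mode type id path ; do
            printf '%s %s %s\n' "$commit" "$path" "$subject"
        done
    done
}
```

Like the original script, this walks every tree reachable from every listed commit, so it carries the same performance caveats.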

@Coruscant11
Contributor

Nice!
I will try it on the repository that takes a few hours with gitleaks and trufflehog but 15 seconds with noseyparker.
There are a ton of commits and findings; I wonder how much time it will take. I will keep you updated!

@bradlarsen bradlarsen changed the title Add detailed Git provenance information for matches Report detailed Git provenance information for matches Apr 19, 2023
@bradlarsen bradlarsen added the reporting Related to reporting of findings label Apr 19, 2023
bradlarsen added a commit that referenced this issue Jun 30, 2023
This commit adds blob metadata to Nosey Parker.

The scan command now collects and records some basic metadata about blobs (size in bytes, guessed mime type, guessed charset). The guessed metadata is based on path names, and at present only works on plain file inputs and not blobs found in Git history (see #16).

If Nosey Parker is built with the libmagic feature, blob metadata is collected and recorded using an additional content-based mechanism that uses libmagic, which collects this information even for blobs found in Git history that do not have pathnames. This feature slows down scanning something like 6-10x and requires additional system-installed libraries to build, and so is not enabled by default.

When scanning, by default, the metadata is collected and recorded only for blobs that have rule matches within them. The collection of blob metadata can be controlled slightly by the new `--record-all-blobs <BOOL>` command-line option; a true value causes all discovered blobs to have metadata collected and recorded, not just those with rule matches.

The report command makes use of the newly collected metadata. In all output formats, the metadata is included.

Additionally in this pull request: the performance of scanning on certain match-heavy workloads has been improved as much as 2x. This was achieved through using fewer sqlite transactions in the datastore implementation.
@bradlarsen
Collaborator Author

@Coruscant11 I'm working on adding native support to Nosey Parker to collect pathname and commit information for blobs. Hoping to merge that back soon.

In the meantime, I have discovered a different workaround for determining this information from git when all you have is the blob ID: https://stackoverflow.com/a/66662476

git whatchanged --all --find-object=$BLOB_ID

This seems to work faster than the shell script I posted above.
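Since `--find-object` is a pickaxe-style option, the `--raw` diff lines it reports also name the path at which the object changed. A small wrapper (a sketch; the `git_blob_commits` name and the awk reduction are my own, and `$NF` mishandles pathnames containing spaces) can reduce that output to commit/path pairs:

```shell
# Hypothetical helper around the --find-object workaround: print a
# "commit<TAB>path" line for each commit that added or removed the blob.
git_blob_commits() {
    blob_id="$1"
    git log --all --raw --no-renames --pretty=format:'%h' --find-object="$blob_id" |
    awk '
        /^[0-9a-f]+$/ { commit = $1; next }      # commit hash line from --pretty
        /^:/          { print commit "\t" $NF }  # raw diff line; path is last field
    '
}
```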

@Coruscant11
Contributor

Awesome! I knew this workaround, and I think it works in a pretty similar way to what you did in the past. I have to check what's happening under the hood.
In fact, if I remember correctly, its performance is similar to the script discussed earlier.
But it is much simpler to implement! 😄

In my opinion this feature should be optional, enabled on user demand, because one of the main advantages of Nosey Parker, its performance, would be drastically decreased (resolving blob provenance is very slow in huge repositories).
Or maybe disable the feature by default above a certain repository size?

Pretty curious to hear your opinion on this 😄

@bradlarsen
Collaborator Author

In my opinion this feature should be optional, enabled on user demand, because one of the main advantages of Nosey Parker, its performance, would be drastically decreased (resolving blob provenance is very slow in huge repositories).
Or maybe disable the feature by default above a certain repository size?

I have a pull request I'm actively working on (#66) that efficiently collects a bunch of additional metadata for all blobs. This includes the commit(s) that introduced the blob, the pathname it first appeared with, commit timestamps, messages, etc.

This metadata is computed using a novel (?) algorithm combining Kahn's algorithm for topological graph traversal with a priority queue to minimize memory use. In most cases the total runtime overhead of computing this information is small, perhaps 15%. This is probably many orders of magnitude faster than running the git whatchanged workaround above for each blob.

On the large/unusual repos I have tried (Linux and Homebrew Core) the overhead is much more noticeable, but still usable. In any case, the PR includes command-line options to disable metadata collection, and so avoid its overhead, if desired.

Support for computing this information natively will be merged back into Nosey Parker's main branch Sometime Real Soon now, and I'll cut a new release after that.

bradlarsen added a commit that referenced this issue Aug 16, 2023
The `scan` command now collects additional metadata about blobs found within Git repositories. Specifically, for each blob found in Git repository history, the set of commits where it was introduced and the accompanying pathname for the blob are collected. This is enabled by default, but can be controlled using the new `--git-blob-provenance={first-seen,minimal}` parameter.

Fixes #16.

There are several other improvements and new features in this commit:

* Add a new rule to detect Amazon ARN strings
* Fix a typo in the Okta API Key rule
* Update dependencies