
Report detailed Git provenance information for matches #16

Closed
bradlarsen opened this issue Dec 16, 2022 · 9 comments · Fixed by #66
Labels: enhancement (New feature or request), reporting (Related to reporting of findings)

Comments

@bradlarsen
Collaborator

Like #15, this was also demoed at Black Hat EU 2022.

When a match is found within a blob in a Git repository, detailed provenance information should be reported, including:

  • The commit(s) that first introduced the match, along with author, date, and commit message
  • The path(s) at which the blob appeared in those introducing commits
  • The repository origin URL, if available

With all this information, it is possible to generate permalinks to GitHub for matches.

@Coruscant11
Contributor

Coruscant11 commented Feb 19, 2023

Hello,

I am very curious: do you already know how you are going to use Git features to find this kind of information?
I wanted to find the file paths and commits associated with a blob in a very efficient way, but it does not seem so easy.
It is possible, but I am wary of approaches that solve the problem yet take more time than the scan itself.

Thanks!

@bradlarsen
Collaborator Author

Hi @Coruscant11. Good question!

So, Nosey Parker does its scan of Git history not by looking at commit metadata at all, but by simply scanning all blobs within the Git repository. This allows fast scanning of the content, but doesn't provide any metadata such as pathname for a blob or the commit(s) that introduced it. This makes the reporting of findings rather unfriendly to humans.

So the problem that needs to be solved is this: for a given Git blob, determine the set of (commit, pathname) pairs for the set of commit(s) that introduced that blob. With this information, Nosey Parker could emit more useful findings that include metadata.

The information needed to solve this problem is present only in indirect form in a Git repository: pathnames of blobs are encoded in a DAG of tree objects, and commit ancestry is encoded in a DAG of commit objects. The problem can be posed as a graph problem.

I've implemented this twice before in other non-public codebases, both times solving the above problem in one big operation for all blobs in a Git repo. This does indeed end up being expensive on certain large repositories, likely more expensive than the actual scanning step.

In Nosey Parker, instead of solving this (commit, pathname) problem for all blobs, it might be possible to much more quickly solve it only for the blobs that have findings reported in them. (In a typical Git repository, Nosey Parker only reports findings from a tiny fraction of all blobs.)

For a quick hack, there are ways to determine this information using Git command-line tools. But that sort of approach is slow.

I'm planning to work on this next, probably in the next couple weeks.

@Coruscant11
Contributor

Thank you for your excellent answer!

It is related to some questions I had for #4. I was a little worried about the missing commit details, but I think I'll do exactly what the JSON format does: give either the path when available, or just the blob ID. SARIF allows it.
And if one day we manage to add the commit details, then we will update 😄

I was just wondering about what you said regarding this being more expensive than the actual scanning step. Take, for example, a scan of the Linux kernel, which takes, let's say, two minutes. If we want to find all the commit and pathname information, how long could it take? I have the idea that it could take a few hours, but I do not know if I am mistaken.

This seems to be a very difficult problem, good luck! Don't hesitate to let me know if I can help you.

@bradlarsen
Collaborator Author

I was just wondering about what you said regarding this being more expensive than the actual scanning step. Take, for example, a scan of the Linux kernel, which takes, let's say, two minutes. If we want to find all the commit and pathname information, how long could it take? I have the idea that it could take a few hours, but I do not know if I am mistaken.

In one older prototype implementation I have, it takes about 6 minutes on my laptop to find that information for 500 blobs. I suspect that code could be made a few times faster, also. So, a few times slower than actually scanning, but certainly not hours.

@bradlarsen
Collaborator Author

A stopgap until this feature is implemented: a little shell script that can tell which commits in a repo contain a given blob, found on Stack Overflow:

#!/bin/sh
obj_name="$1"
shift
git log "$@" --pretty=tformat:'%T %h %s' \
| while read -r tree commit subject ; do
    if git ls-tree -r "$tree" | grep -q "$obj_name" ; then
        echo "$commit" "$subject"
    fi
done

I named this git-whichcommit and put it in my PATH. Then I can run it like this, for example in a clone of the noseyparker repo, getting commits listed one per line:

% git whichcommit f55471089e0e49e87db6fd635ecbb687c642c140
49c8837 Add references for a few rules
659136a Add a rule for Postman API keys
b97a3af Update CHANGELOG fix note
...

This could be easily adapted to also print the pathname of the object.
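As a sketch of that adaptation (untested against large repos; the `git_whichpath` function name is my own invention), one can capture the matching `ls-tree` line instead of merely testing for it, since `git ls-tree -r` prints the pathname in its last field:

```shell
# Hypothetical adaptation of git-whichcommit that also prints the pathname
# the blob appears under in each matching commit's tree.
git_whichpath() {
    obj_name="$1"
    shift
    git log "$@" --pretty=tformat:'%T %h %s' |
    while read -r tree commit subject ; do
        # Each `git ls-tree -r` line looks like "<mode> blob <id>\t<path>".
        git ls-tree -r "$tree" | grep "$obj_name" |
        while read -r mode type id path ; do
            printf '%s %s %s\n' "$commit" "$path" "$subject"
        done
    done
}
```

Like the original script, this walks every tree reachable from every listed commit, so it carries the same performance caveats.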

@Coruscant11
Contributor

Nice!
I will try it on the repository that takes a few hours with gitleaks and trufflehog but 15 seconds with noseyparker.
There are a ton of commits and findings; I wonder how much time it will take. I will keep you updated!

@bradlarsen bradlarsen changed the title Add detailed Git provenance information for matches Report detailed Git provenance information for matches Apr 19, 2023
@bradlarsen bradlarsen added the reporting Related to reporting of findings label Apr 19, 2023
bradlarsen added a commit that referenced this issue Jun 30, 2023
This commit adds blob metadata to Nosey Parker.

The scan command now collects and records some basic metadata about blobs (size in bytes, guessed mime type, guessed charset). The guessed metadata is based on path names, and at present only works on plain file inputs and not blobs found in Git history (see #16).

If Nosey Parker is built with the libmagic feature, blob metadata is collected and recorded using an additional content-based mechanism that uses libmagic, which collects this information even for blobs found in Git history that do not have pathnames. This feature slows down scanning something like 6-10x and requires additional system-installed libraries to build, and so is not enabled by default.

When scanning, by default, the metadata is collected and recorded only for blobs that have rule matches within them. The collection of blob metadata can be controlled slightly by the new `--record-all-blobs <BOOL>` command-line option; a true value causes all discovered blobs to have metadata collected and recorded, not just those with rule matches.

The report command makes use of the newly collected metadata. In all output formats, the metadata is included.

Additionally in this pull request: the performance of scanning on certain match-heavy workloads has been improved as much as 2x. This was achieved through using fewer sqlite transactions in the datastore implementation.
@bradlarsen
Collaborator Author

@Coruscant11 I'm working on adding native support to Nosey Parker to collect pathname and commit information for blobs. Hoping to merge that back soon.

In the meantime, I have discovered a different workaround for determining this information from git when all you have is the blob ID: https://stackoverflow.com/a/66662476

git whatchanged --all --find-object=$BLOB_ID

This seems to work faster than the shell script I posted above.
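Since `--find-object` is a pickaxe-style option, the `--raw` diff lines it reports also name the path at which the object changed. A small wrapper (a sketch; the `git_blob_commits` name and the awk reduction are my own, and `$NF` mishandles pathnames containing spaces) can reduce that output to commit/path pairs:

```shell
# Hypothetical helper around the --find-object workaround: print a
# "commit<TAB>path" line for each commit that added or removed the blob.
git_blob_commits() {
    blob_id="$1"
    git log --all --raw --no-renames --pretty=format:'%h' --find-object="$blob_id" |
    awk '
        /^[0-9a-f]+$/ { commit = $1; next }      # commit hash line from --pretty
        /^:/          { print commit "\t" $NF }  # raw diff line; path is last field
    '
}
```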

@Coruscant11
Contributor

Awesome! I knew this workaround, and I think it works in a pretty similar way to what you did in the past. I have to check what's happening under the hood.
In fact, if I remember correctly, its performance is similar to the script discussed earlier.
But it is much simpler to implement! 😄

In my opinion this feature should be optional, enabled on user demand, because one of the main advantages of Nosey Parker, its performance, would be drastically decreased (resolving blob provenance is very slow in huge repositories).
Or maybe disable the feature by default above a certain repository size?

Pretty curious to hear your opinion on this 😄

@bradlarsen
Collaborator Author

In my opinion this feature should be optional, enabled on user demand, because one of the main advantages of Nosey Parker, its performance, would be drastically decreased (resolving blob provenance is very slow in huge repositories).
Or maybe disable the feature by default above a certain repository size?

I have a pull request I'm actively working on (#66) that efficiently collects a bunch of additional metadata for all blobs. This includes the commit(s) that introduced the blob, the pathname it first appeared with, commit timestamps, messages, etc.

This metadata is computed using a novel (?) algorithm combining Kahn's algorithm for topological graph traversal with a priority queue to minimize memory use. In most cases the total runtime overhead of computing this information is small, perhaps 15%. This is probably many orders of magnitude faster than running the git whatchanged workaround above for each blob.

On the large/unusual repos I have tried (Linux and Homebrew Core) the overhead is much more noticeable, but still usable. In any case, the PR includes command-line options to disable metadata collection, and so avoid its overhead, if desired.

Support for computing this information natively will be merged back into Nosey Parker's main branch Sometime Real Soon now, and I'll cut a new release after that.

bradlarsen added a commit that referenced this issue Aug 16, 2023
The `scan` command now collects additional metadata about blobs found within Git repositories. Specifically, for each blob found in Git repository history, the set of commits where it was introduced and the accompanying pathname for the blob are collected. This is enabled by default, but can be controlled using the new `--git-blob-provenance={first-seen,minimal}` parameter.

Fixes #16.

There are several other improvements and new features in this commit:

* Add a new rule to detect Amazon ARN strings
* Fix a typo in the Okta API Key rule
* Update dependencies