Sporadic test failure: scan::appmaker::scan_workflow_from_git_url #179

Closed
bradlarsen opened this issue Apr 26, 2024 · 13 comments
Labels
bug (Something isn't working) · content discovery (Related to enumerating or specifying content to scan)


@bradlarsen (Collaborator) commented Apr 26, 2024

Describe the bug
Sometimes, rarely, the scan::appmaker::scan_workflow_from_git_url test case fails.

To Reproduce
🥴

Expected behavior
The test case should always succeed

Actual behavior
See this CI run log: https://github.com/praetorian-inc/noseyparker/actions/runs/8851458936/job/24308038487

failures:

---- scan::appmaker::scan_workflow_from_git_url stdout ----
thread 'scan::appmaker::scan_workflow_from_git_url' panicked at /rustc/25ef9e3d85d934b27d9dada2f9dd52b1dc63bb04/library/core/src/ops/function.rs:250:5:
Unexpected stdout, failed var.is_match((?m)^Scanned (549.97|550.05) MiB from 7,92[68] blobs in .*; 23/23 new matches$)
├── var: Scanned 550.05 MiB from 7,927 blobs in 1 second (729.58 MiB/s); 23/23 new matches
│   
│    Rule                         Total Findings   Total Matches 
│   ─────────────────────────────────────────────────────────────
│    AWS API Key                               3               3 
│    AWS S3 Bucket (path style)                3              13 
│    Amazon Resource Name                      3               3 
│    Generic Secret                            3               3 
│    AWS Secret Access Key                     1               1 
│   
│   Run the `report` command next to show finding details.
└── var as str: Scanned 550.05 MiB from 7,927 blobs in 1 second (729.58 MiB/s); 23/23 new matches
    
     Rule                         Total Findings   Total Matches 
    ─────────────────────────────────────────────────────────────
     AWS API Key                               3               3 
     AWS S3 Bucket (path style)                3              13 
     Amazon Resource Name                      3               3 
     Generic Secret                            3               3 
     AWS Secret Access Key                     1               1 
    
    Run the `report` command next to show finding details.

command=`"/home/runner/work/noseyparker/noseyparker/target/release/noseyparker-cli" "scan" "--datastore=/tmp/.tmpexTCIC/datastore" "--git-url=https://github.com/mozilla-appmaker/appmaker" "--ruleset=all"`
code=0
stdout=\```
Scanned 550.05 MiB from 7,927 blobs in 1 second (729.58 MiB/s); 23/23 new matches
error: test failed, to rerun pass `-p noseyparker-cli --test test_noseyparker`

 Rule                         Total Findings   Total Matches 
─────────────────────────────────────────────────────────────
 AWS API Key                               3               3 
 AWS S3 Bucket (path style)                3              13 
 Amazon Resource Name                      3               3 
 Generic Secret                            3               3 
 AWS Secret Access Key                     1               1 

Run the `report` command next to show finding details.
\```

stderr="2024-04-26T16:38:47.052703Z ERROR noseyparker_cli::cmd_scan: Failed to read blob 803c41c9e0c0bd42ef1d430001f994af24a7c535 from Git repository at /tmp/.tmpexTCIC/datastore/clones/https/github.com/mozilla-appmaker/appmaker: An object with id 803c41c9e0c0bd42ef1d430001f994af24a7c535 could not be found\n"

stack backtrace:
   0: rust_begin_unwind
   1: core::panicking::panic_fmt
   2: assert_cmd::assert::AssertError::panic::panic_cold_display
   3: core::ops::function::FnOnce::call_once
   4: test_noseyparker::scan::appmaker::scan_workflow_from_git_url
   5: core::ops::function::FnOnce::call_once
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.


failures:
    scan::appmaker::scan_workflow_from_git_url

test result: FAILED. 70 passed; 1 failed; 0 ignored; 0 measured; 0 filtered out; finished in 7.60s

Error: Process completed with exit code 101.
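
For context on what is being asserted: the backtrace shows the panic originating in assert_cmd's assertion machinery, which checks the scan summary line against a regular expression. A minimal sketch of that kind of stdout regex check (hypothetical arguments and a simplified regex, not the project's actual test code) could look like this:

```rust
// Hedged sketch of an assert_cmd + predicates check similar to the failing assertion.
// The binary name, arguments, and exact regex are assumptions, not the real test.
use assert_cmd::Command;
use predicates::str::is_match;

#[test]
fn scan_workflow_from_git_url() {
    Command::cargo_bin("noseyparker-cli")
        .unwrap()
        .args([
            "scan",
            "--datastore=/tmp/datastore", // hypothetical datastore path
            "--git-url=https://github.com/mozilla-appmaker/appmaker",
            "--ruleset=all",
        ])
        .assert()
        .success()
        // The summary line varies slightly between runs (scanned size, blob count),
        // so the expectation is a regex rather than an exact string.
        .stdout(is_match(r"(?m)^Scanned .* MiB from [\d,]+ blobs in .*; 23/23 new matches$").unwrap());
}
```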

Output of noseyparker --version

noseyparker 0.18.0-dev

Build Configuration:

    Build Timestamp:    2024-04-26T16:38:12.626689961Z

    Commit Timestamp:   2024-04-26T12:30:46.000000000-04:00
    Commit Branch:      main
    Commit SHA:         0cce95910be38058cc0cf00b548a892e3ed5dc35

    Cargo Features:     default,disable_trace,log,mimalloc,release
    Debug:              false
    Optimization:       3
    Target Triple:      x86_64-unknown-linux-gnu

Build System:

    OS:                 Ubuntu
    OS Version:         Linux 20.04 Ubuntu

    CPU Vendor:         AuthenticAMD
    CPU Brand:          AMD EPYC 7763 64-Core Processor
    CPU Cores:          2

    rustc Version:      1.77.2
    rustc Channel:      stable
    rustc Host Triple:  x86_64-unknown-linux-gnu
    rustc Commit Date:  2024-04-09
    rustc Commit SHA:   25ef9e3d85d934b27d9dada2f9dd52b1dc63bb04
    rustc LLVM Version: 17.0

Additional context
I think I've only ever seen this in GitHub Actions CI, though I'm not certain.

I suspect there is some race condition in the gix library.

@bradlarsen added the bug (Something isn't working) and content discovery (Related to enumerating or specifying content to scan) labels on Apr 26, 2024
@bradlarsen (Collaborator, Author)

I was able to reproduce locally!

I ran the following in a loop on a 14-core M3 MacBook Pro, and after 99 attempts, saw a similar error message:

$ i=0; while true; do ((i = i + 1)) && echo "### take2.$i ###" && rm -rf appmaker.np && cargo run -F release -r -q -- scan -d appmaker.np appmaker.git -j20; done
...
### take2.99 ###
2024-04-26T17:05:48.307767Z ERROR noseyparker_cli::cmd_scan: Failed to read blob 0413e897ddb651d58e86dc2363c631b6b7909820 from Git repository at appmaker.git: An object with id 0413e897ddb651d58e86dc2363c631b6b7909820 could not be found
Scanned 549.96 MiB from 7,925 blobs in 0 seconds (2.09 GiB/s); 4/4 new matches

 Rule                    Total Findings   Total Matches 
────────────────────────────────────────────────────────
 Generic Secret                       3               3 
 AWS Secret Access Key                1               1 

Run the `report` command next to show finding details.
...

@bradlarsen (Collaborator, Author)

This is difficult to reproduce. I've had to run hundreds or thousands of times in a loop to see that error message again.

@bradlarsen (Collaborator, Author) commented Apr 26, 2024

Interestingly, the appmaker.git clone — which I obtained with git clone --bare https://github.com/mozilla-appmaker/appmaker appmaker.git — contains only a single packfile:

% tree appmaker.git
appmaker.git
├── HEAD
├── config
├── description
├── hooks
│   ├── applypatch-msg.sample
│   ├── commit-msg.sample
│   ├── fsmonitor-watchman.sample
│   ├── post-update.sample
│   ├── pre-applypatch.sample
│   ├── pre-commit.sample
│   ├── pre-merge-commit.sample
│   ├── pre-push.sample
│   ├── pre-rebase.sample
│   ├── pre-receive.sample
│   ├── prepare-commit-msg.sample
│   ├── push-to-checkout.sample
│   └── update.sample
├── info
│   └── exclude
├── objects
│   ├── info
│   └── pack
│       ├── pack-fd9127ec007dda804dc00ea7a61b1cfa0ba55921.idx
│       └── pack-fd9127ec007dda804dc00ea7a61b1cfa0ba55921.pack
├── packed-refs
└── refs
    ├── heads
    └── tags

9 directories, 20 files

The version of git on my laptop, where I've been able to reproduce very rarely:

$ git --version
git version 2.39.3 (Apple Git-146)

@bradlarsen (Collaborator, Author)

@Byron FYI, there appears to be a race condition somewhere in gix, either in its reading code or in a caching layer, which I can intermittently trigger in Nosey Parker.

Specifically, when scanning a Git repository, Nosey Parker first enumerates all object IDs, building a vector of blob IDs. Then in a later stage, that vector of blob IDs is used in a Rayon pipeline to read each blob and scan it, in parallel. At the rate of around one in every few hundred attempts when using multiple worker threads, I see gix failing to read one of those blob objects whose ID was previously enumerated.
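
To make the shape of that pipeline concrete, here is a rough sketch (not Nosey Parker's actual code; enumerate_blob_ids and scan_blob are hypothetical stubs standing in for the enumeration and rule-matching stages):

```rust
// Illustrative sketch of the enumerate-then-scan shape described above; not the real code.
use rayon::prelude::*;

/// Hypothetical stand-in for the enumeration phase; the real code walks the
/// repository's object database and collects the IDs of all blobs.
fn enumerate_blob_ids(_repo: &gix::Repository) -> Vec<gix::ObjectId> {
    Vec::new() // elided
}

/// Hypothetical stand-in for the scanning stage (rule matching elided).
fn scan_blob(_data: &[u8]) {}

fn scan_repo(repo_path: &std::path::Path) -> anyhow::Result<()> {
    // Phase 1: open the repository and enumerate blob IDs into a vector.
    let repo = gix::open(repo_path)?;
    let blob_ids = enumerate_blob_ids(&repo);

    // Phase 2: read and scan each blob in parallel. Each Rayon worker opens its
    // own repository handle, since a `gix::Repository` isn't shared across threads.
    blob_ids.par_iter().for_each_init(
        || gix::open(repo_path).expect("repository should reopen"),
        |repo, id| match repo.find_object(*id) {
            Ok(obj) => scan_blob(&obj.data),
            // This is the intermittent failure: an ID that was just enumerated
            // occasionally cannot be found when read back later.
            Err(err) => eprintln!("Failed to read blob {id}: {err}"),
        },
    );
    Ok(())
}
```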

Last week I tried to isolate where the behavior is coming from by disabling all caching mechanisms in gix, but that didn't make the problem go away. Either it's in the core ODB reading code, or (more likely) I failed to effectively disable all of gix's caching mechanisms.

This is not a super high priority Nosey Parker issue for me right now, but I wanted to make you aware of it. I'm happy to do further investigation if you have suggestions to better debug this. Thank you!

@Byron commented May 11, 2024

Thanks so much for letting me know!
Bugs in the ODB portion of the gix codebase are among my most feared, as they tend to be spurious, hard to reproduce, and typically hard to fix (after all, who says a fix doesn't just make them much less likely, rather than gone?).

I agree that it's most definitely a race condition: somehow a thread doesn't notice that a new pack-index file containing the object has become available, probably because another thread has only just loaded it.

My plan here is to write a program to hopefully reproduce the issue, ideally with a repository with a lot of pack files to make its occurrence more likely. From there, it should be possible to fix it or at least improve the situation.

@bradlarsen (Collaborator, Author)

@Byron Thank you! I could look at this more next week. Let me know if I can help.

@Byron commented May 11, 2024

Thank you! I am slowly getting back into it and noticed something interesting: The code to get the initial set of objects also uses a parallel implementation that calls odb.header(), which is when multiple threads trigger loading all the packs (racily) right after the indices have been pre-loaded to support the iteration. From what I can tell, this never hits a race-condition, probably because indices are pre-loaded here.

After the initial iteration, the whole ODB would be loaded/memmapped, indices and packs, but here it seems the repository is re-opened so it starts at zero. Now find_object() runs into the problem occasionally. What I am saying is that in theory, re-using the same repository instance that was used for counting could possibly fix the issue.
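
To illustrate that suggestion (a hedged sketch under the same assumptions as the earlier one, not gix internals or Nosey Parker's actual code): instead of re-opening the repository from disk for the scanning phase, the instance used for counting could be kept and converted into a thread-safe handle, with each worker deriving a thread-local view from it, so the already-initialized ODB state is reused.

```rust
// Illustrative sketch of reusing one repository instance across both phases.
// `enumerate_blob_ids` and `scan_blob` are the same hypothetical stubs as above.
use rayon::prelude::*;

fn scan_repo_reusing_handle(repo_path: &std::path::Path) -> anyhow::Result<()> {
    // Phase 1: open once and enumerate; by the end, the ODB has loaded its indices/packs.
    let repo = gix::open(repo_path)?;
    let blob_ids = enumerate_blob_ids(&repo);

    // Instead of calling `gix::open()` again, keep this instance and make it shareable.
    let shared = repo.into_sync();

    // Phase 2: every worker derives a thread-local handle from the shared instance,
    // so the object-database state initialized during counting is reused.
    blob_ids.par_iter().for_each_init(
        || shared.to_thread_local(),
        |repo, id| {
            if let Ok(obj) = repo.find_object(*id) {
                scan_blob(&obj.data);
            }
        },
    );
    Ok(())
}
```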

Chances are that the underlying issue is somewhere in the code that dynamically loads indices, as all of a sudden it fails to find an index with the object, at least in one thread (maybe another thread did successfully load it, or is in the process of doing so, and somehow this thread doesn't get to wait for that).

Hopefully I will have a repro soon.

@bradlarsen (Collaborator, Author)

> After the initial iteration, the whole ODB would be loaded/memmapped, indices and packs, but here it seems the repository is re-opened so it starts at zero.

Yes, you're right — Nosey Parker enumerates repos in one phase, and then in the next phase reopens each repo and reads the blobs from it. This re-opening is done to avoid running out of memory, as otherwise it might need to keep thousands of repos open at once.

I intend to rework this scheme eventually, to start scanning immediately rather than first enumerating everything and then scanning it. But that's a medium-term project for me.

Byron added a commit to Byron/gitoxide that referenced this issue May 11, 2024
Byron added a commit to Byron/gitoxide that referenced this issue May 11, 2024
@Byron commented May 11, 2024

Great news! I could reproduce the issue, and can do so consistently. Having fewer packs seems to be the key here, and from what I can tell, this situation puts threads into a waiting loop; my guess is that something there goes wrong, so they don't actually pick up newly loaded indices.

❯ hyperfine -m100 './target/release/gix -r boto -t10 --trace odb stats --extra-header-lookup' --show-output
Benchmark 1: ./target/release/gix -r boto -t10 --trace odb stats --extra-header-lookup
 18:04:54 counting done 41.4k objects in 0.01s (4.4M objects/s)
 18:04:54  tracing ERROR    🚨 [error]: Object that is known to be present wasn't found | err: NotFound { oid: Sha1(2558c83eb8f313829b7573f70081c2c7ab1581c4) }
 18:04:54 re-counting done 41.4k objects in 0.10s (394.8k objects/s)
 18:04:54     tracing INFO     run [ 116ms | 98.39% / 100.00% ]
 18:04:54     tracing INFO     ┝━ ThreadSafeRepository::discover() [ 1.59ms | 0.09% / 1.37% ]
 18:04:54     tracing INFO     │  ┕━ open_from_paths() [ 1.48ms | 1.22% / 1.28% ]
 18:04:54     tracing INFO     │     ┕━ gix_odb::Store::at() [ 70.0µs | 0.06% ]
 18:04:54     tracing INFO     ┕━ ThreadSafeRepository::open() [ 277µs | 0.06% / 0.24% ]
 18:04:54     tracing INFO        ┕━ open_from_paths() [ 208µs | 0.15% / 0.18% ]
 18:04:54     tracing INFO           ┕━ gix_odb::Store::at() [ 38.2µs | 0.03% ]
Only JSON is implemented - using that instead
Error: At least one object couldn't be looked up even though it must exist
Error: Command terminated with non-zero exit code: 1. Use the '-i'/'--ignore-failure' option if you want to ignore this. Alternatively, use the '--show-output' option to debug what went wrong.

Byron added a commit to Byron/gitoxide that referenced this issue May 11, 2024
Byron added a commit to Byron/gitoxide that referenced this issue May 11, 2024
Byron added a commit to Byron/gitoxide that referenced this issue May 11, 2024
Byron added a commit to Byron/gitoxide that referenced this issue May 11, 2024
@Byron commented May 11, 2024

With the latest modification, I cannot reproduce the issue anymore, even after a thousand runs, where previously it would usually trigger within 50 iterations.

❯ hyperfine -m1000 './target/release/gix -r boto -t10 --trace odb stats --extra-header-lookup'
Benchmark 1: ./target/release/gix -r boto -t10 --trace odb stats --extra-header-lookup
  Time (mean ± σ):     124.9 ms ±   2.2 ms    [User: 157.1 ms, System: 15.8 ms]
  Range (min … max):   116.3 ms … 137.4 ms    1000 runs

The only unfortunate thing is that I wasn't able to fix it in a way that avoids re-reading the on-disk state to pick up newly added packs, which shouldn't be needed here. On the bright side, it's quite a rare issue in the first place, so this inefficiency probably doesn't matter much in practice.

After a little more 'tuning' I think this issue is a thing of the past, as it can't be reproduced anymore even in 10,000 runs.

❯ hyperfine -m10000 './target/release/gix -r boto -t10 --trace odb stats --extra-header-lookup'
Benchmark 1: ./target/release/gix -r boto -t10 --trace odb stats --extra-header-lookup
  Time (mean ± σ):     130.4 ms ±   2.5 ms    [User: 165.3 ms, System: 16.6 ms]
  Range (min … max):   113.8 ms … 173.8 ms    10000 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

It's not impossible that more races will be discovered, perhaps in long-running servers, but we'll get there when we get there; the trampling-herd problem here is now certainly resolved.

Byron added a commit to Byron/gitoxide that referenced this issue May 12, 2024
…ng a single index.

The motivating example is here: praetorian-inc/noseyparker#179

Previously, it was possible for a trampling herd of threads to consolidate the
disk state. Most of them would be 'needs-init' threads which could notice that
the initialization already happened, and just use that.

But a thread might be late for the party and somehow manages to not get any
newly loaded index, and thus tries to consolidate with what's on disk again.
Then it would again determine no change, and return nothing, causing the caller
to abort and not find objects it should find because it wouldn't see the index
that it should have seen.

The reason the thread got into this mess is that the 'is-load-ongoing' flagging
was racy itself, so it would not wait for ongoing loads and would just conclude that
nothing happened. An extra delay (by yielding) now assures it either sees the loading
state and waits for it, or sees the newly loaded indices.

Note that this issue can be reproduced with:

```
'./target/release/gix -r repo-with-one-pack -t10 --trace odb stats --extra-header-lookup'
```
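
The failure mode described in this commit message can be illustrated generically. Below is a much-simplified model (not gix's actual data structures or code) of the caller-side pattern: a lookup that misses keeps yielding while another thread's index load is in progress, instead of concluding "nothing changed" and reporting the object as missing.

```rust
// Much-simplified illustration of the race described in the commit message above;
// this is NOT gix's implementation, only a model of the behavior it describes.
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Mutex;

struct SharedOdbState {
    load_in_progress: AtomicBool,       // the 'is-load-ongoing' style flag
    loaded_indices: Mutex<Vec<String>>, // stand-in for memory-mapped pack indices
}

/// Returns true if `wanted` is covered by a known index. A late thread that misses
/// does not immediately give up: while a load is in progress it yields and re-checks,
/// so it either observes the loading state or sees the newly loaded indices.
fn lookup_with_wait(state: &SharedOdbState, wanted: &str) -> bool {
    loop {
        if state.loaded_indices.lock().unwrap().iter().any(|name| name == wanted) {
            return true;
        }
        if state.load_in_progress.load(Ordering::Acquire) {
            // Another thread is consolidating the on-disk state; wait for it
            // rather than concluding "no change" (the buggy behavior).
            std::thread::yield_now();
            continue;
        }
        // No load in progress and nothing new: the object genuinely isn't there.
        return false;
    }
}
```
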
@bradlarsen (Collaborator, Author)

Thank you @Byron for the speedy response! I will give the main branch of gitoxide a try in Nosey Parker.

@Byron commented May 22, 2024

Gix v0.63 was just released, and it should fix this issue for good :).

@bradlarsen (Collaborator, Author)

Yes, upgrading gix from 0.62 to 0.63 seems to fix this problem for me. Thank you @Byron!

bradlarsen added a commit that referenced this issue May 28, 2024