Sporadic test failure: scan::appmaker::scan_workflow_from_git_url #179

Closed
bradlarsen opened this issue Apr 26, 2024 · 13 comments
Labels
bug (Something isn't working) · content discovery (Related to enumerating or specifying content to scan)


@bradlarsen (Collaborator) commented Apr 26, 2024

Describe the bug
Sometimes, rarely, the scan::appmaker::scan_workflow_from_git_url test case fails.

To Reproduce
🥴

Expected behavior
The test case should always succeed

Actual behavior
See this CI run log: https://github.com/praetorian-inc/noseyparker/actions/runs/8851458936/job/24308038487

failures:

---- scan::appmaker::scan_workflow_from_git_url stdout ----
thread 'scan::appmaker::scan_workflow_from_git_url' panicked at /rustc/25ef9e3d85d934b27d9dada2f9dd52b1dc63bb04/library/core/src/ops/function.rs:250:5:
Unexpected stdout, failed var.is_match((?m)^Scanned (549.97|550.05) MiB from 7,92[68] blobs in .*; 23/23 new matches$)
├── var: Scanned 550.05 MiB from 7,927 blobs in 1 second (729.58 MiB/s); 23/23 new matches
│   
│    Rule                         Total Findings   Total Matches 
│   ─────────────────────────────────────────────────────────────
│    AWS API Key                               3               3 
│    AWS S3 Bucket (path style)                3              13 
│    Amazon Resource Name                      3               3 
│    Generic Secret                            3               3 
│    AWS Secret Access Key                     1               1 
│   
│   Run the `report` command next to show finding details.
└── var as str: Scanned 550.05 MiB from 7,927 blobs in 1 second (729.58 MiB/s); 23/23 new matches
    
     Rule                         Total Findings   Total Matches 
    ─────────────────────────────────────────────────────────────
     AWS API Key                               3               3 
     AWS S3 Bucket (path style)                3              13 
     Amazon Resource Name                      3               3 
     Generic Secret                            3               3 
     AWS Secret Access Key                     1               1 
    
    Run the `report` command next to show finding details.

command=`"/home/runner/work/noseyparker/noseyparker/target/release/noseyparker-cli" "scan" "--datastore=/tmp/.tmpexTCIC/datastore" "--git-url=https://github.com/mozilla-appmaker/appmaker" "--ruleset=all"`
code=0
stdout=\```
Scanned 550.05 MiB from 7,927 blobs in 1 second (729.58 MiB/s); 23/23 new matches
error: test failed, to rerun pass `-p noseyparker-cli --test test_noseyparker`

 Rule                         Total Findings   Total Matches 
─────────────────────────────────────────────────────────────
 AWS API Key                               3               3 
 AWS S3 Bucket (path style)                3              13 
 Amazon Resource Name                      3               3 
 Generic Secret                            3               3 
 AWS Secret Access Key                     1               1 

Run the `report` command next to show finding details.
\```

stderr="2024-04-26T16:38:47.052703Z ERROR noseyparker_cli::cmd_scan: Failed to read blob 803c41c9e0c0bd42ef1d430001f994af24a7c535 from Git repository at /tmp/.tmpexTCIC/datastore/clones/https/github.com/mozilla-appmaker/appmaker: An object with id 803c41c9e0c0bd42ef1d430001f994af24a7c535 could not be found\n"

stack backtrace:
   0: rust_begin_unwind
   1: core::panicking::panic_fmt
   2: assert_cmd::assert::AssertError::panic::panic_cold_display
   3: core::ops::function::FnOnce::call_once
   4: test_noseyparker::scan::appmaker::scan_workflow_from_git_url
   5: core::ops::function::FnOnce::call_once
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.


failures:
    scan::appmaker::scan_workflow_from_git_url

test result: FAILED. 70 passed; 1 failed; 0 ignored; 0 measured; 0 filtered out; finished in 7.60s

Error: Process completed with exit code 101.
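
For context on what is being asserted: the backtrace shows the panic originating in assert_cmd's assertion machinery, which checks the scan summary line against a regular expression. A minimal sketch of that kind of stdout regex check (hypothetical arguments and a simplified regex, not the project's actual test code) could look like this:

```rust
// Hedged sketch of an assert_cmd + predicates check similar to the failing assertion.
// The binary name, arguments, and exact regex are assumptions, not the real test.
use assert_cmd::Command;
use predicates::str::is_match;

#[test]
fn scan_workflow_from_git_url() {
    Command::cargo_bin("noseyparker-cli")
        .unwrap()
        .args([
            "scan",
            "--datastore=/tmp/datastore", // hypothetical datastore path
            "--git-url=https://github.com/mozilla-appmaker/appmaker",
            "--ruleset=all",
        ])
        .assert()
        .success()
        // The summary line varies slightly between runs (scanned size, blob count),
        // so the expectation is a regex rather than an exact string.
        .stdout(is_match(r"(?m)^Scanned .* MiB from [\d,]+ blobs in .*; 23/23 new matches$").unwrap());
}
```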

Output of noseyparker --version

noseyparker 0.18.0-dev

Build Configuration:

    Build Timestamp:    2024-04-26T16:38:12.626689961Z

    Commit Timestamp:   2024-04-26T12:30:46.000000000-04:00
    Commit Branch:      main
    Commit SHA:         0cce95910be38058cc0cf00b548a892e3ed5dc35

    Cargo Features:     default,disable_trace,log,mimalloc,release
    Debug:              false
    Optimization:       3
    Target Triple:      x86_64-unknown-linux-gnu

Build System:

    OS:                 Ubuntu
    OS Version:         Linux 20.04 Ubuntu

    CPU Vendor:         AuthenticAMD
    CPU Brand:          AMD EPYC 7763 64-Core Processor
    CPU Cores:          2

    rustc Version:      1.77.2
    rustc Channel:      stable
    rustc Host Triple:  x86_64-unknown-linux-gnu
    rustc Commit Date:  2024-04-09
    rustc Commit SHA:   25ef9e3d85d934b27d9dada2f9dd52b1dc63bb04
    rustc LLVM Version: 17.0

Additional context
I think I've only ever seen this in GitHub Actions CI, though I'm not certain.

I suspect there is some race condition in the gix library.

@bradlarsen added the bug (Something isn't working) and content discovery (Related to enumerating or specifying content to scan) labels on Apr 26, 2024
@bradlarsen (Collaborator, Author)

I was able to reproduce locally!

I ran the following in a loop on a 14-core M3 MacBook Pro, and after 99 attempts, saw a similar error message:

$ i=0; while true; do ((i = i + 1)) && echo "### take2.$i ###" && rm -rf appmaker.np && cargo run -F release -r -q -- scan -d appmaker.np appmaker.git -j20; done
...
### take2.99 ###
2024-04-26T17:05:48.307767Z ERROR noseyparker_cli::cmd_scan: Failed to read blob 0413e897ddb651d58e86dc2363c631b6b7909820 from Git repository at appmaker.git: An object with id 0413e897ddb651d58e86dc2363c631b6b7909820 could not be found
Scanned 549.96 MiB from 7,925 blobs in 0 seconds (2.09 GiB/s); 4/4 new matches

 Rule                    Total Findings   Total Matches 
────────────────────────────────────────────────────────
 Generic Secret                       3               3 
 AWS Secret Access Key                1               1 

Run the `report` command next to show finding details.
...

@bradlarsen (Collaborator, Author)

This is difficult to reproduce. I've had to run hundreds or thousands of times in a loop to see that error message again.

@bradlarsen (Collaborator, Author) commented Apr 26, 2024

Interestingly, the appmaker.git clone — which I obtained with git clone --bare https://github.com/mozilla-appmaker/appmaker appmaker.git — contains only a single packfile:

% tree appmaker.git
appmaker.git
├── HEAD
├── config
├── description
├── hooks
│   ├── applypatch-msg.sample
│   ├── commit-msg.sample
│   ├── fsmonitor-watchman.sample
│   ├── post-update.sample
│   ├── pre-applypatch.sample
│   ├── pre-commit.sample
│   ├── pre-merge-commit.sample
│   ├── pre-push.sample
│   ├── pre-rebase.sample
│   ├── pre-receive.sample
│   ├── prepare-commit-msg.sample
│   ├── push-to-checkout.sample
│   └── update.sample
├── info
│   └── exclude
├── objects
│   ├── info
│   └── pack
│       ├── pack-fd9127ec007dda804dc00ea7a61b1cfa0ba55921.idx
│       └── pack-fd9127ec007dda804dc00ea7a61b1cfa0ba55921.pack
├── packed-refs
└── refs
    ├── heads
    └── tags

9 directories, 20 files

The version of git on my laptop, where I've been able to reproduce very rarely:

$ git --version
git version 2.39.3 (Apple Git-146)

@bradlarsen (Collaborator, Author)

@Byron FYI, there appears to be a race condition somewhere in gix, either in its reading code or in a caching layer, which I can intermittently trigger in Nosey Parker.

Specifically, when scanning a Git repository, Nosey Parker first enumerates all object IDs, building a vector of blob IDs. Then in a later stage, that vector of blob IDs is used in a Rayon pipeline to read each blob and scan it, in parallel. At the rate of around one in every few hundred attempts when using multiple worker threads, I see gix failing to read one of those blob objects whose ID was previously enumerated.
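
To make the shape of that pipeline concrete, here is a rough sketch (not Nosey Parker's actual code; enumerate_blob_ids and scan_blob are hypothetical stubs standing in for the enumeration and rule-matching stages):

```rust
// Illustrative sketch of the enumerate-then-scan shape described above; not the real code.
use rayon::prelude::*;

/// Hypothetical stand-in for the enumeration phase; the real code walks the
/// repository's object database and collects the IDs of all blobs.
fn enumerate_blob_ids(_repo: &gix::Repository) -> Vec<gix::ObjectId> {
    Vec::new() // elided
}

/// Hypothetical stand-in for the scanning stage (rule matching elided).
fn scan_blob(_data: &[u8]) {}

fn scan_repo(repo_path: &std::path::Path) -> anyhow::Result<()> {
    // Phase 1: open the repository and enumerate blob IDs into a vector.
    let repo = gix::open(repo_path)?;
    let blob_ids = enumerate_blob_ids(&repo);

    // Phase 2: read and scan each blob in parallel. Each Rayon worker opens its
    // own repository handle, since a `gix::Repository` isn't shared across threads.
    blob_ids.par_iter().for_each_init(
        || gix::open(repo_path).expect("repository should reopen"),
        |repo, id| match repo.find_object(*id) {
            Ok(obj) => scan_blob(&obj.data),
            // This is the intermittent failure: an ID that was just enumerated
            // occasionally cannot be found when read back later.
            Err(err) => eprintln!("Failed to read blob {id}: {err}"),
        },
    );
    Ok(())
}
```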

Last week I tried to isolate where the behavior is coming from by disabling all caching mechanisms in gix, but that didn't make the problem go away. Either it's in the core ODB reading code, or (more likely) I failed to effectively disable all of gix's caching mechanisms.

This is not a super high priority Nosey Parker issue for me right now, but I wanted to make you aware of it. I'm happy to do further investigation if you have suggestions to better debug this. Thank you!

@Byron commented May 11, 2024

Thanks so much for letting me know!
Bugs in the ODB portion of the gix codebase are among my most feared, as they tend to be spurious, hard to reproduce, and typically hard to fix (after all, who says a fix doesn't just make them much less likely, rather than gone?).

I agree that it's most definitely a race condition: somehow a thread doesn't notice that a new pack-index file containing the object has become available, probably because another thread has only just loaded it.

My plan here is to write a program to hopefully reproduce the issue, ideally with a repository with a lot of pack files to make its occurrence more likely. From there, it should be possible to fix it or at least improve the situation.

@bradlarsen (Collaborator, Author)

@Byron Thank you! I could look at this more next week. Let me know if I can help.

@Byron commented May 11, 2024

Thank you! I am slowly getting back into it and noticed something interesting: The code to get the initial set of objects also uses a parallel implementation that calls odb.header(), which is when multiple threads trigger loading all the packs (racily) right after the indices have been pre-loaded to support the iteration. From what I can tell, this never hits a race-condition, probably because indices are pre-loaded here.

After the initial iteration, the whole ODB would be loaded/memmapped, indices and packs, but here it seems the repository is re-opened so it starts at zero. Now find_object() runs into the problem occasionally. What I am saying is that in theory, re-using the same repository instance that was used for counting could possibly fix the issue.
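
To illustrate that suggestion (a hedged sketch under the same assumptions as the earlier one, not gix internals or Nosey Parker's actual code): instead of re-opening the repository from disk for the scanning phase, the instance used for counting could be kept and converted into a thread-safe handle, with each worker deriving a thread-local view from it, so the already-initialized ODB state is reused.

```rust
// Illustrative sketch of reusing one repository instance across both phases.
// `enumerate_blob_ids` and `scan_blob` are the same hypothetical stubs as above.
use rayon::prelude::*;

fn scan_repo_reusing_handle(repo_path: &std::path::Path) -> anyhow::Result<()> {
    // Phase 1: open once and enumerate; by the end, the ODB has loaded its indices/packs.
    let repo = gix::open(repo_path)?;
    let blob_ids = enumerate_blob_ids(&repo);

    // Instead of calling `gix::open()` again, keep this instance and make it shareable.
    let shared = repo.into_sync();

    // Phase 2: every worker derives a thread-local handle from the shared instance,
    // so the object-database state initialized during counting is reused.
    blob_ids.par_iter().for_each_init(
        || shared.to_thread_local(),
        |repo, id| {
            if let Ok(obj) = repo.find_object(*id) {
                scan_blob(&obj.data);
            }
        },
    );
    Ok(())
}
```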

Chances are that the underlying issue is somewhere in the code that dynamically loads indices, as all of a sudden it fails to find an index with the object, at least in one thread (maybe another thread did successfully load it, or is in the process of doing so, and somehow this thread doesn't get to wait for that).

Hopefully I will have a repro soon.

@bradlarsen (Collaborator, Author)

> After the initial iteration, the whole ODB would be loaded/memmapped, indices and packs, but here it seems the repository is re-opened so it starts at zero.

Yes, you're right — Nosey Parker enumerates repos in one phase, and then in the next phase reopens each repo and reads the blobs from it. This re-opening is done to avoid running out of memory, as otherwise it might need to keep thousands of repos open at once.

I intend to rework this scheme eventually, to start scanning immediately rather than first enumerating everything and then scanning it. But that's a medium-term project for me.

Byron added a commit to Byron/gitoxide that referenced this issue May 11, 2024
Byron added a commit to Byron/gitoxide that referenced this issue May 11, 2024
@Byron commented May 11, 2024

Great news! I could reproduce the issue, and can do so consistently. Having fewer packs seems to be the key here, and from what I can tell, this situation puts threads into a waiting loop; my guess is that something there goes wrong, so they don't actually pick up newly loaded indices.

❯ hyperfine -m100 './target/release/gix -r boto -t10 --trace odb stats --extra-header-lookup' --show-output
Benchmark 1: ./target/release/gix -r boto -t10 --trace odb stats --extra-header-lookup
 18:04:54 counting done 41.4k objects in 0.01s (4.4M objects/s)
 18:04:54  tracing ERROR    🚨 [error]: Object that is known to be present wasn't found | err: NotFound { oid: Sha1(2558c83eb8f313829b7573f70081c2c7ab1581c4) }
 18:04:54 re-counting done 41.4k objects in 0.10s (394.8k objects/s)
 18:04:54     tracing INFO     run [ 116ms | 98.39% / 100.00% ]
 18:04:54     tracing INFO     ┝━ ThreadSafeRepository::discover() [ 1.59ms | 0.09% / 1.37% ]
 18:04:54     tracing INFO     │  ┕━ open_from_paths() [ 1.48ms | 1.22% / 1.28% ]
 18:04:54     tracing INFO     │     ┕━ gix_odb::Store::at() [ 70.0µs | 0.06% ]
 18:04:54     tracing INFO     ┕━ ThreadSafeRepository::open() [ 277µs | 0.06% / 0.24% ]
 18:04:54     tracing INFO        ┕━ open_from_paths() [ 208µs | 0.15% / 0.18% ]
 18:04:54     tracing INFO           ┕━ gix_odb::Store::at() [ 38.2µs | 0.03% ]
Only JSON is implemented - using that instead
Error: At least one object couldn't be looked up even though it must exist
Error: Command terminated with non-zero exit code: 1. Use the '-i'/'--ignore-failure' option if you want to ignore this. Alternatively, use the '--show-output' option to debug what went wrong.

Byron added a commit to Byron/gitoxide that referenced this issue May 11, 2024
Byron added a commit to Byron/gitoxide that referenced this issue May 11, 2024
Byron added a commit to Byron/gitoxide that referenced this issue May 11, 2024
Byron added a commit to Byron/gitoxide that referenced this issue May 11, 2024
@Byron commented May 11, 2024

With the latest modification, I cannot reproduce the issue anymore, even after a thousand runs, where previously it would usually trigger within 50 iterations.

❯ hyperfine -m1000 './target/release/gix -r boto -t10 --trace odb stats --extra-header-lookup'
Benchmark 1: ./target/release/gix -r boto -t10 --trace odb stats --extra-header-lookup
  Time (mean ± σ):     124.9 ms ±   2.2 ms    [User: 157.1 ms, System: 15.8 ms]
  Range (min … max):   116.3 ms … 137.4 ms    1000 runs

The only unfortunate thing is that I wasn't able to fix it in a way that avoids re-reading the on-disk state to pick up newly added packs, which shouldn't be needed here. On the bright side, it's quite a rare issue in the first place, so this inefficiency probably doesn't matter much in practice.

After a little more 'tuning' I think this issue is a thing of the past, as it can't be reproduced anymore even in 10,000 runs.

❯ hyperfine -m10000 './target/release/gix -r boto -t10 --trace odb stats --extra-header-lookup'
Benchmark 1: ./target/release/gix -r boto -t10 --trace odb stats --extra-header-lookup
  Time (mean ± σ):     130.4 ms ±   2.5 ms    [User: 165.3 ms, System: 16.6 ms]
  Range (min … max):   113.8 ms … 173.8 ms    10000 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

It's not impossible that more races will be discovered, perhaps in long-running servers, but we'll get there when we get there; the trampling-herd problem here is now certainly resolved.

Byron added a commit to Byron/gitoxide that referenced this issue May 12, 2024
…ng a single index.

The motivating example is here: praetorian-inc/noseyparker#179

Previously, it was possible for a trampling herd of threads to consolidate the
disk state. Most of them would be 'needs-init' threads which could notice that
the initialization already happened, and just use that.

But a thread might be late for the party and somehow manages to not get any
newly loaded index, and thus tries to consolidate with what's on disk again.
Then it would again determine no change, and return nothing, causing the caller
to abort and not find objects it should find because it wouldn't see the index
that it should have seen.

The reason the thread got into this mess is that the 'is-load-ongoing' flagging
was racy itself, so it would not wait for ongoing loads and would just conclude that
nothing happened. An extra delay (by yielding) now assures it either sees the loading
state and waits for it, or sees the newly loaded indices.

Note that this issue can be reproduced with:

```
'./target/release/gix -r repo-with-one-pack -t10 --trace odb stats --extra-header-lookup'
```
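
The failure mode described in this commit message can be illustrated generically. Below is a much-simplified model (not gix's actual data structures or code) of the caller-side pattern: a lookup that misses keeps yielding while another thread's index load is in progress, instead of concluding "nothing changed" and reporting the object as missing.

```rust
// Much-simplified illustration of the race described in the commit message above;
// this is NOT gix's implementation, only a model of the behavior it describes.
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Mutex;

struct SharedOdbState {
    load_in_progress: AtomicBool,       // the 'is-load-ongoing' style flag
    loaded_indices: Mutex<Vec<String>>, // stand-in for memory-mapped pack indices
}

/// Returns true if `wanted` is covered by a known index. A late thread that misses
/// does not immediately give up: while a load is in progress it yields and re-checks,
/// so it either observes the loading state or sees the newly loaded indices.
fn lookup_with_wait(state: &SharedOdbState, wanted: &str) -> bool {
    loop {
        if state.loaded_indices.lock().unwrap().iter().any(|name| name == wanted) {
            return true;
        }
        if state.load_in_progress.load(Ordering::Acquire) {
            // Another thread is consolidating the on-disk state; wait for it
            // rather than concluding "no change" (the buggy behavior).
            std::thread::yield_now();
            continue;
        }
        // No load in progress and nothing new: the object genuinely isn't there.
        return false;
    }
}
```
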
@bradlarsen (Collaborator, Author)

Thank you @Byron for the speedy response! I will give the main branch of gitoxide a try in Nosey Parker.

@Byron commented May 22, 2024

Gix v0.63 was just released, and it should fix this issue for good :).

@bradlarsen (Collaborator, Author)

Yes, upgrading gix from 0.62 to 0.63 seems to fix this problem for me. Thank you @Byron!

bradlarsen added a commit that referenced this issue May 28, 2024