New unexpected build failures on 20240603.1.0 #10004

Closed · 2 of 14 tasks

llvm-beanz opened this issue Jun 6, 2024 · 105 comments · Fixed by Mause/duckdb-java#6 or duckdb/duckdb-java#32

Comments

@llvm-beanz

Description

Our PR builds had been working as expected until last night, when the runners updated to the 20240603.1.0 image. An impacted PR is here:

microsoft/DirectXShaderCompiler#6668

Earlier iterations of the PR built successfully, but the builds began failing once the image was updated.

See a failing build here:
https://dev.azure.com/DirectXShaderCompiler/public/_build/results?buildId=6383&view=results

And a previously successful one here:
https://dev.azure.com/DirectXShaderCompiler/public/_build/results?buildId=6371&view=results

Platforms affected

  • Azure DevOps
  • GitHub Actions - Standard Runners
  • GitHub Actions - Larger Runners

Runner images affected

  • Ubuntu 20.04
  • Ubuntu 22.04
  • Ubuntu 24.04
  • macOS 11
  • macOS 12
  • macOS 13
  • macOS 13 Arm64
  • macOS 14
  • macOS 14 Arm64
  • Windows Server 2019
  • Windows Server 2022

Image version and build link

Image: 20240603.1.0
https://dev.azure.com/DirectXShaderCompiler/public/_build/results?buildId=6383&view=results

Is it regression?

20240514.3.0

Expected behavior

Our builds should work?
https://dev.azure.com/DirectXShaderCompiler/public/_build/results?buildId=6371&view=results

Actual behavior

The build fails with errors we don't encounter locally or on older VM images.

Repro steps

We cannot reproduce outside the VM image.

@past-due

past-due commented Jun 6, 2024

We are also seeing new issues in the 20240603.1.0 Windows Server 2022 image, on GitHub Actions - Standard Runners.

Specifically, in our case, a built executable yields an access violation when we attempt to run it in later stages of our build process.

@Esvandiary

Esvandiary commented Jun 6, 2024

We're also seeing issues with windows-2022 runners starting sometime in the last two days; re-running previously successful jobs consistently results in failures. We're building a C++ project using CMake and VS2022.

In our case, the build process itself succeeds, but trying to run any of the resulting executables later in the workflow fails with an access violation. Local builds of the same code do not exhibit any issues.

@ScottTodd

We're also seeing similar issues on windows-2022.

The errors are all segfaults when trying to run executable files.

@past-due

past-due commented Jun 6, 2024

The issue may be due to a combination of the ordering of the PATH environment variable, older versions of vcruntime on the path (e.g. the one bundled with Python), and changes in the VS 2022 17.10 STL that require the latest vcruntime.

When running the following on windows-2022:20240603.1.0 I see:
(Get-Command vcruntime140.dll).Path     = C:\hostedtoolcache\windows\Python\3.9.13\x64\vcruntime140.dll
(Get-Command vcruntime140.dll).Version  = 14.29.30139.0

But the system-installed version (at C:\Windows\system32\vcruntime140.dll) is 14.40.33810.00

As noted in the VS 2022 17.10 STL release notes:

Reference: https://github.com/microsoft/STL/releases/tag/vs-2022-17.10
Another reference: https://developercommunity.visualstudio.com/t/Access-violation-in-_Thrd_yield-after-up/10664660?sort=active

And, in fact, I can confirm that built executables run correctly on a VM where I've ensured that vcruntime140.dll version 14.40.33810.00 is what's loaded.
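For anyone trying to reproduce this in isolation, a minimal sketch of the failing pattern is simply any std::mutex use in a binary built with the VS 2022 17.10 toolset, then run with an older msvcp140.dll first on PATH; something along these lines (illustrative only, not taken from any affected project):

```cpp
// Illustrative minimal repro (not from any affected project): any use of
// std::mutex in a binary built with the VS 2022 17.10 toolset. When an older
// msvcp140.dll (e.g. the copy bundled with Python) is found first on PATH,
// reports in this thread show an access violation around _Thrd_yield / mutex
// locking; with 14.40.33810.00 or newer loaded, the program completes normally.
#include <iostream>
#include <mutex>

std::mutex g_mutex;  // constexpr-constructed by the 17.10 STL headers

int main() {
    std::lock_guard<std::mutex> lock(g_mutex);
    std::cout << "locked and released std::mutex without crashing\n";
    return 0;
}
```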


Possible recommendations for windows images:

@firthm01

firthm01 commented Jun 6, 2024

Also seeing segfaults running executables from a windows-latest (20240603.1.0) runner. Was fine on 20240514.3.0.
Private repo so I can't share runs, but it looks very similar to the other reports. In our case we're running pluginval.exe against some audio plugins built by the runner. Interestingly, it's not an immediate segfault; it runs through several test phases before failing.

Starting tests in: pluginval / Editor...
D:\a\_temp\7e7e13d1-ac8c-4be7-bda4-e292e2e5bb48.sh: line 3:  1317 Segmentation fault      ./pluginval.exe --strictness-level 10 --verbose --validate "build/plugin/[REDACTED].vst3"
##[error]Process completed with exit code 139.

@Esvandiary

Esvandiary commented Jun 6, 2024

DLL confusion due to PATH issues makes a lot of sense - I had a build succeed just now simply by changing the workflow to build in Debug rather than Release. The executables would then be looking for d-suffixed DLLs, which wouldn't conflict.

@rouault

rouault commented Jun 6, 2024

I also see a regression in the GDAL (https://github.com/OSGeo/gdal) CI related to that change, causing crashes during test execution:

@mkruskal-google

mkruskal-google commented Jun 6, 2024

We hit this too and traced it back to std::mutex usage in https://github.com/abseil/abseil-cpp. This looks like https://developercommunity.visualstudio.com/t/Access-violation-in-_Thrd_yield-after-up/10664660#T-N10668856, which suggests an older incompatible version of msvcp140.dll is being used in this image.

We also only see this in optimized builds. Our debug-built executables still work.

The stacktrace we got:

*** SIGSEGV received at time=1717698962 ***
@ 00007FF638B1F1D2 (unknown) `__scrt_common_main_seh'::`1'::filt$0
@ 00007FF9C733EFF0 (unknown) _C_specific_handler
@ 00007FF9D5A843BF (unknown) _chkstk
@ 00007FF9D5A1186E (unknown) RtlVirtualUnwind2
@ 00007FF9D5A833AE (unknown) KiUserExceptionDispatcher
@ 00007FF9C72B3278 (unknown) Thrd_yield
@ 00007FF638B13566 (unknown) absl::lts_20240116::time_internal::cctz::time_zone::Impl::LoadTimeZone
@ 00007FF638B1245F (unknown) absl::lts_20240116::time_internal::cctz::local_time_zone
@ 00007FF638B1184E (unknown) absl::lts_20240116::InitializeLog

@henryruhs

henryruhs commented Jun 6, 2024

My Python code using ctypes.windll.kernel32.GetShortPathName to shorten paths stopped working after the runner updated to image windows-2022, version 20240603.1.0.

🟢 20240514.3.0: https://github.com/facefusion/facefusion/actions/runs/9401500581/job/25893418056
🔴 20240603.1: https://github.com/facefusion/facefusion/actions/runs/9407074170/job/25912005903

@bduffany

bduffany commented Jun 6, 2024

20240603.1.0 broke us too; we're consistently seeing nondescript "failed to execute command" errors in some protoc.exe executions (google protobuf compiler).

@rouault

rouault commented Jun 6, 2024

I've attempted that in https://github.com/rouault/gdal/actions/runs/9407632196/job/25913839874, copying c:\Windows\system32\vcruntime140.dll into a specific directory and putting it at the front of the PATH, but it appears that the version in c:\Windows\system32\ is only 14.32.31326, not >= 14.40.33810.00. I would argue that the runner image should be fixed to have a recent enough vcruntime140.dll at the front of the PATH.

A workaround I found when reading https://developercommunity.visualstudio.com/t/Access-violation-in-_Thrd_yield-after-up/10664660#T-N10668856 is that you can define "/D_DISABLE_CONSTEXPR_MUTEX_CONSTRUCTOR" when building your software to revert to a std::mutex constructor compatible with older vcruntimes: rouault/gdal@c4ab31f. My builds work fine with that workaround.
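For reference, a source-level sketch of that opt-out (illustrative only; the macro has to be visible before <mutex> is included anywhere in the translation unit, which is why passing it project-wide via /D, as in the linked commit, is the safer route):

```cpp
// Sketch: opt out of the constexpr std::mutex constructor introduced in the
// VS 2022 17.10 STL so the binary stays compatible with older msvcp140.dll /
// vcruntime140.dll. In practice define it project-wide, e.g.
//   cl /D_DISABLE_CONSTEXPR_MUTEX_CONSTRUCTOR ...
// rather than relying on the define preceding every <mutex> include.
#define _DISABLE_CONSTEXPR_MUTEX_CONSTRUCTOR
#include <mutex>

std::mutex m;  // falls back to the older, runtime-compatible constructor

int main() {
    std::scoped_lock lock(m);
    return 0;
}
```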

rouault added a commit to rouault/gdal that referenced this issue Jun 6, 2024
…atest 20240603.1.0

Cf actions/runner-images#10004

Other approach attempted in
9ff56b3
didn't result in finding a recent enough vcruntime140.dll
rouault added a commit to OSGeo/gdal that referenced this issue Jun 6, 2024
…atest 20240603.1.0

Cf actions/runner-images#10004

Other approach attempted in
rouault@9ff56b3
didn't result in finding a recent enough vcruntime140.dll
@ScottTodd

Can we expect a rollback or some other fix to the runner images, or will affected projects need to use one of the listed workarounds? Is there a way to use older/stable runner images instead of this new version?

@randombit

Seeing this also

Runner image 20240514.3.0 works fine

https://github.com/randombit/botan/actions/runs/9408188903/job/25918967600

Runner image 20240603.1.0, built binaries fail with error code 3221225477

https://github.com/randombit/botan/actions/runs/9408188903/job/25918967856

Same code in both cases; the first (working) run is the PR, the second (failing) is the merge of that PR into master.

randombit added a commit to randombit/botan that referenced this issue Jun 7, 2024
Visual C++ devs (Microsoft) did something idiotic, and made it so that
if code compiled with the latest compiler runs against an older
runtime it ... crashes without any message.

https://developercommunity.visualstudio.com/t/Access-violation-in-_Thrd_yield-after-up/10664660#T-N10668856

Then Github (Microsoft) shipped an image with a new compiler and an
old runtime, so that compiling anything and trying to run it fails

actions/runner-images#10004

Truly extraordinary
@YOU54F

YOU54F commented Jun 7, 2024

Getting similar errors today with the new images, simply executing curl to download a binary. This affects the windows-2019 and windows-2022 images. (I was using windows-latest, but tried windows-2019 with the same result.)

🔵  Downloading ffi v0.4.24 for pact_ffi-windows-x86_64.dll.gz
Error: Process completed with exit code 43.

Here is a re-run today, on the newer runner image, of a job that passed yesterday.

🟢 20240514.3.0 - https://github.com/YOU54F/pact-js-core/actions/runs/9392917514/job/25868610705#step:1:9
🔴 20240603.1.0 - https://github.com/YOU54F/pact-js-core/actions/runs/9392917514/job/25920895001#step:1:9

curl --output foo --write-out "%{http_code}" --location https://github.com/pact-foundation/pact-ruby-standalone/releases/download/v2.4.4/pact-2.4.4-windows-x86_64.zip
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
curl: libcurl function was given a bad argument
000

I believe my error is curl/curl#13845, probably caused by the image update updating curl's dependencies.

JasonMarechal25 pushed a commit to AntaresSimulatorTeam/Antares_Simulator that referenced this issue Jun 21, 2024
Revert "Add `/D_DISABLE_CONSTEXPR_MUTEX_CONSTRUCTOR` to fix segfault
#2151

This reverts commit a121c75.

actions/runner-images#10004
sophimao added a commit to sophimao/fpga-runtime-for-opencl that referenced this issue Jun 21, 2024
Sedeniono added a commit to Sedeniono/tiny-optional that referenced this issue Jun 22, 2024
According to actions/runner-images#10004, the
problems with the Windows runners have been fixed. They prevented
successful runs for release builds.

So this commit reverts 20c1fd8.
@BrianMouncer

@llvm-beanz - We are closing this issue, as your builds that were failing are now successful after the 20240610 image deployment.

https://dev.azure.com/DirectXShaderCompiler/public/_build/results?buildId=6395&view=results https://dev.azure.com/DirectXShaderCompiler/public/_build?definitionId=1&_a=summary

Thank you. Kindly reach out to us if you still face any new issues.

@ijunaidm Do you have more information on this change? In our case, the JVM itself is loading an older version of the VC runtime from its deployed folder, so even if we remove or update the VC DLLs in system32 and other paths on the agent image, the JVM still loads an older version of the runtime before we are even called, and our native code will crash on load if we build on the latest 20240610.

The 20240610 image is unusable for us unless we modify our code and work with all our partner teams to have them also start building with the _DISABLE_CONSTEXPR_MUTEX_CONSTRUCTOR flag defined.

This does not seem like a good path forward. How long will _DISABLE_CONSTEXPR_MUTEX_CONSTRUCTOR be supported? What other std incompatibilities are in there that we do not know about yet and are not covered by this one compiler flag?

Do you know if there is a customer impact issue for this in the DevDiv team?

@BrianMouncer

BrianMouncer commented Jun 22, 2024

@kishorekumar-anchala, the latest image seems to work for us.
There has to be a better way to resolve issues like this more quickly. This was an enormous disruption to our team and teams across the world.
Do you have some sort of RCA that you've performed which captures how you can prevent issues like this in the future and/or respond more quickly to them if they occur?

@kishorekumar-anchala: I concur with @llvm-beanz's RCA proposal. The number of man hours spent on investigations is astonishing. We pored over every commit and reran everything piece by piece, only to come to the conclusion that it wasn't our fault. This cost us three days before we found this specific ticket, which offered some insight. Couple that with the inability to run tests for days on end before some resolution was on the horizon: this was a very expensive break that disrupted delivery schedules. Now fold in everyone else who scrambled in a similar fashion. And for some, the image is still effectively broken and still running up the tab.

I had never considered Azure hosted images to be unreliable until now. There was no communication that an image was being updated in a substantial way, and no real way to figure out whether the issue was known or being addressed.

This lack of communication also bit our team the week prior, when we learned (via a series of failed tests) that gcc 13 had been pulled off a set of images without any meaningful announcement. Some might argue that the gcc changes might be posted via tasks in the runner repos, but I don't monitor them. Instead, I always reference a page such as Microsoft-hosted agents to understand the current configuration options and scan the updates list. I've never seen images degrade in tooling and stability as they have in the past few weeks. Please consider how to inform customers more effectively as well as how to respond more efficiently.

Initially I don't think this is on the hosted image maintainers. My point of view is that this is a failure on the Visual Studio team, which allowed such a change to ship in the first place. The only area where the hosted image folks could have improved would be to have rolled back the compiler upgrade as soon as they realized what happened, then messaged it to the community and pushed back on the compiler team to unbreak the std classes in the VC runtime... or, at a bare minimum, published a high-visibility breaking-change message as soon as they root-caused the issue, rather than just resolving this GH issue because one person's builds are passing while dozens of others are still broken...

@MarkCallow

In our case, the JVM itself is loading an older version of the VC runtime

@BrianMouncer #10020 and #10055 also cover the JVM issues.

@MarkCallow

Initially I don't think this is on the hosted image maintainers.

It is. They deployed an image in which the VC runtime installed in C:\Windows\system32 did not match the version of the image's compiler, so all tests running newly compiled code failed. They also, as you said, should have rolled back the image as soon as the magnitude of the issue became apparent.

As for MS, they need to fix the compiler tool chain so that the dynamic linker ensures code is linking to a compatible version of the runtime and fails with an understandable error message when the versions mismatch.
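Until something like that exists, one way to surface the mismatch instead of crashing is a small diagnostic that reports which msvcp140.dll the process actually loaded and its file version. A sketch (assumptions: Windows, MSVC, linking Version.lib; not taken from any project in this thread):

```cpp
// Diagnostic sketch: print the path and file version of the msvcp140.dll that
// this process actually loaded, so a stale runtime (from a JVM, Python, etc.
// earlier on PATH) is visible instead of causing a silent access violation.
#include <windows.h>
#include <cstdio>
#include <vector>
#pragma comment(lib, "version.lib")

int main() {
    HMODULE mod = GetModuleHandleW(L"msvcp140.dll");
    if (!mod) {
        std::puts("msvcp140.dll is not loaded in this process");
        return 1;
    }

    wchar_t path[MAX_PATH] = {};
    GetModuleFileNameW(mod, path, MAX_PATH);
    std::printf("loaded from: %ls\n", path);

    DWORD ignored = 0;
    DWORD size = GetFileVersionInfoSizeW(path, &ignored);
    if (size == 0) return 1;

    std::vector<unsigned char> data(size);
    if (!GetFileVersionInfoW(path, 0, size, data.data())) return 1;

    VS_FIXEDFILEINFO* info = nullptr;
    UINT len = 0;
    if (VerQueryValueW(data.data(), L"\\", reinterpret_cast<void**>(&info), &len) && info) {
        // Binaries built with the VS 2022 17.10 toolset expect 14.40 or newer here.
        std::printf("file version: %u.%u.%u.%u\n",
                    (unsigned)HIWORD(info->dwFileVersionMS), (unsigned)LOWORD(info->dwFileVersionMS),
                    (unsigned)HIWORD(info->dwFileVersionLS), (unsigned)LOWORD(info->dwFileVersionLS));
    }
    return 0;
}
```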

@mprather

Initially I don't think this is on the hosted image maintainers.

@BrianMouncer It's a maintainer's issue. I have no gripes with the VC team, as a matter of fact. The images were improperly built. As supporting evidence, we re-ran pipelines that had previously succeeded with May's image and found those pipelines failed with the new June image. Then that same code started working again when the fixed image was deployed. That is not a code issue and certainly not a runtime issue. It's an image configuration issue.

We lucked out that our 3rd party libraries started working with the fix. We never had to touch our code because it never explicitly dealt with the mutex issue. So, yes, magically on the 13th of the month, everything started working for us with no code change. It's an image issue. The sticky part is that the issue still exists for some - #10020 is still active and very much a concern for a lot of folks, since the new configuration is still a blocking issue for them.

It's worth noting that the reason I also called out the gcc issue was to illustrate that two different platforms suffered from the lack of clear communication, and both at practically the same time. That's a lot of expensive deviation due to debugging, workarounds, etc. The Windows issue illustrates a need for better testing, monitoring, and response. The Ubuntu issues took an image set backwards. Both issues highlight a lack of communication. We spent a lot of time (man hours and pipeline time) on issues that really shouldn't have happened in the first place (the Windows C++ issue) or that, at the very least, should have come with sufficient notice to allow for planned changes (the Ubuntu issue).

sophimao added a commit to intel/fpga-runtime-for-opencl that referenced this issue Jun 25, 2024
@MarkCallow

Is the #define _DISABLE_CONSTEXPR_MUTEX_CONSTRUCTOR workaround needed with clangcl as well as msvc?

@MarkCallow

MarkCallow commented Jun 27, 2024

Is the #define _DISABLE_CONSTEXPR_MUTEX_CONSTRUCTOR workaround needed with clangcl as well as msvc?

I found my own answer after looking back at CI build logs. Usually an msvc build failed first, cancelling the other builds, but I found an occurrence where the runner image was updated after the msvc jobs had completed but before a clangcl job ran. It suffered crashes and SEH exceptions in the way familiar to everyone here. The workaround is also needed with ClangCL.

MarkCallow added a commit to KhronosGroup/KTX-Software that referenced this issue Jun 30, 2024
Instead of removing the older version of the vcruntime from the Temurin
JVM installation in the GitHub Actions runner image, define
`_DISABLE_CONSTEXPR_MUTEX_CONSTRUCTOR` when compiling libktx and
${ASTCENC_LIB_TARGET}. This makes the code compatible with older VC
runtimes, removing the burden on users to ensure their JVM installation
uses the latest VC runtime.

See actions/runner-images#10055. For further
background see actions/runner-images#10004 and
https://developercommunity.visualstudio.com/t/Access-violation-in-_Thrd_yield-after-up/10664660#T-N10669129-N10678728.

Includes 2 other minor changes:

1. Move the compiler info dump in `CMakeLists.txt` to before first use
    of the compiler info and recode it to use `cmake_print_variables`.
2. Disable dump of system and platform info in
    `tests/loadtests/CMakeLists.txt`.
vyazelenko added a commit to real-logic/aeron that referenced this issue Jul 15, 2024
pow2clk pushed a commit to pow2clk/DirectXShaderCompiler that referenced this issue Jul 16, 2024
This PR contains two changes:
1) Moves a pragma to disable a warning, which seems to be required by
the new compiler.
2) Adds a preprocessor define to work around the crashes caused by the
runner image's mismatched C++ runtime versions.

The second change we will want to revert once the runner images are
fixed. The issue tracking the runner images is:

actions/runner-images#10004

Related microsoft#6668

(cherry picked from commit 0b9acdb)
pow2clk pushed a commit to pow2clk/DirectXShaderCompiler that referenced this issue Jul 16, 2024
This removes the hack introduced in microsoft#6683 to work around issues in the
GitHub and ADO runner image:
actions/runner-images#10004

Rumor has it the runner images are now fixed... let's see.

Fixes microsoft#6674

(cherry picked from commit 98bb80a)
pthom added a commit to pthom/imgui_bundle that referenced this issue Aug 3, 2024
…ex::lock (_DISABLE_CONSTEXPR_MUTEX_CONSTRUCTOR)

if (WIN32 AND IMGUI_BUNDLE_BUILD_PYTHON)
    # Windows: workaround against msvc Runtime incompatibilities when using std::mutex::lock
    # Early 2024, msvcp140.dll was updated, and Python 3.11/3.12 are shipped with their own older version of msvcp140.dll
    # As a consequence the python library will happily crash at customer site, not bothering to mention
    # the fact that the loaded version of msvcp140.dll is incompatible...
    # See:
    #    https://developercommunity.visualstudio.com/t/Access-violation-in-_Thrd_yield-after-up/10664660
    #    actions/runner-images#10004
    #    #239 (comment)
    add_compile_definitions(_DISABLE_CONSTEXPR_MUTEX_CONSTRUCTOR)
endif()