Errors when using OpenBLAS through NumPy on Windows 10 2004 #2709
Comments
While I agree that it looks "OpenBLAS related", I am not yet convinced it is a bug in OpenBLAS itself, particularly if it only happens with/after the Win10 update. The error messages look as if argument passing from C (BLAS) to FORTRAN (LAPACK) functions took a hit. |
I ran openblas_utext.exe without any issues. Is there an equivalent to the |
The executables in |
Indeed, pip will install a binary by default. The parameters reported as illegal values are all integers. This might mean that LAPACK and BLAS have different ideas of integer sizes. Could you copy the numpy directory from the working version and test again? It just looks like a broken wheel package pulled from PyPI. |
I tried the nightly from 5/7/20 but this also failed. It has a different version than 1.19 although it isn't obvious how it differs other than it has a different hash. I'll try to build myself in the next few days. |
It is not directly made by OpenBLAS, could you point to hashes you talk about so that we can have a look at package content? |
@bashtage what would be interesting is to compare the numpy wheels' hashes in the pip caches. Looking into the archives, they include a 0.3.9-dev version, which could be anything from the development tree (maybe known broken) between 0.3.9 and 0.3.10. |
From NumPy 1.19.0: libopenblas.NOIJJG62EMASZI6NYURL6JBKM4EVBGM7.gfortran-win_amd64.dll
From NumPy nightly 5/7/2020: libopenblas.VN4TFHCG6GCB7E53NOJHJWQ4PL7VXRV3.gfortran-win_amd64
[It is my assumption that the large string is a hash] |
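A small sketch of how the two bundled DLL names above could be compared programmatically. The helper name `bundled_dll_tag` is hypothetical, and the code only extracts the middle token of the file name; it does not establish that the token is a content hash, which remains the assumption stated above.

```python
from pathlib import PurePath


def bundled_dll_tag(filename: str) -> str:
    """Return the middle token of 'libopenblas.<TAG>.gfortran-win_amd64[.dll]'.

    Hypothetical helper: it only splits the name on dots and does not
    verify that <TAG> is a hash of the file contents.
    """
    parts = PurePath(filename).name.split(".")
    if len(parts) < 3 or parts[0] != "libopenblas":
        raise ValueError(f"unexpected file name: {filename!r}")
    return parts[1]


# The two names quoted in the thread.
names = [
    "libopenblas.NOIJJG62EMASZI6NYURL6JBKM4EVBGM7.gfortran-win_amd64.dll",
    "libopenblas.VN4TFHCG6GCB7E53NOJHJWQ4PL7VXRV3.gfortran-win_amd64",
]
tags = [bundled_dll_tag(n) for n in names]
same_build = tags[0] == tags[1]
print(tags, same_build)
```

Differing tags only show that the two wheels ship different binaries; which OpenBLAS commit each one corresponds to would still have to come from the NumPy side.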
numpy should know which hash they packaged, most likely it is a snapshot from just a few days before the 0.3.10 release to get the fix for inadvertent use of avx512 on haswell (in dynamic_arch builds created on skx). most likely a red herring when the issue seems so closely tied with a microsoft patch. |
Is the hash on the working system matching any of these two? |
There is a (small) chance that #2729 may have fixed this (inconsistent declarations of the size of the GEMM buffer in DYNAMIC_ARCH builds) |
@bashtage Any news on this ? |
NumPy is using head in their latest nightlies and it is still producing the same issues. Edit: I do think it is more and more likely a Windows bug. It even appears when coretype is set to Prescott. I had planned to reach out to a Microsoft Python advocate to see if I can get any bandwidth beyond reporting it through standard channels. It won't show up in CI for a very long time since most CI instances are using build 17xxx of Windows, while 19041+ is required for the bug to appear. |
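For anyone wanting to repeat the core-type experiment, here is a minimal sketch of setting `OPENBLAS_CORETYPE` for a child Python process. The variable has to be in the environment before OpenBLAS is loaded, which is why a fresh process is used. The helper names are hypothetical, and the child command here only echoes the variable; in a real test it would be replaced by the failing NumPy call.

```python
import os
import subprocess
import sys


def env_with_coretype(coretype, base_env=None):
    """Return a copy of the environment with OPENBLAS_CORETYPE set."""
    env = dict(os.environ if base_env is None else base_env)
    env["OPENBLAS_CORETYPE"] = coretype
    return env


def run_with_coretype(coretype, code):
    """Run `code` in a fresh interpreter so the setting is seen at load time."""
    return subprocess.run(
        [sys.executable, "-c", code],
        env=env_with_coretype(coretype),
        capture_output=True,
        text=True,
    )


# Placeholder child command: it just prints the variable it received.
result = run_with_coretype(
    "Prescott", "import os; print(os.environ['OPENBLAS_CORETYPE'])"
)
print(result.stdout.strip())
```

Running the reproducer once per core type (Prescott, Nehalem, Sandybridge, Haswell, Zen) in this way keeps each run isolated from the others.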
What is your CPU? A CPU-Z screenshot would be nice |
I'm on Zen by default. I see the same errors when manually setting the core
type to Haswell, Sandybridge, Zen and Nehalem.
…On Wed, 29 Jul 2020 at 20:31 Andrew ***@***.***> wrote:
Prescott
What is your CPU? Like CPU-Z screenshot would be nice
Should be fixable setting OPENBLAS_CORETYPE= SKYLAKE/HASWELL/SANDYBRIDGE
for AVX512/AVX2/AVX respectively.
|
So it should be ZEN. There is a recent change to detect 3rd generation ZEN CPUs, #2744, coming downstream in due time. |
@brada4 this has all already been done and discussed here and on the numpy ticket and I do not see what Zen3 has to do with it |
A little more information. I used the instructions in the wiki to use CMake and vstudio 2019 to build OpenBLAS. The error still reproduces. Tracking it down, the first error is here in a call to |
The two Fortran files you found are both from the original reference (or "netlib") BLAS (one "accidentally" included as a part of Reference-LAPACK, the other dead code from an earlier idea to provide a built-in verification). |
Surprisingly, adding print statements to Is there a set of tests I can run in OpenBLAS that might expose the problem (sorry for the noise if you have already tried that)? I wonder if it is critical that the matrix be 13 x 13 ? |
With mingw or clang you can add a line |
My cpu is "model name : AMD Ryzen 9 3900X 12-Core Processor" so I assume that is KERNEL.ZEN. |
|
Forcing the use of the C-based
|
That's another NaN then - the check in DLASCL is for double argument "cfrom" zero or NaN. And we still do not really know where they come from, or do we ? |
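The check described above can be sketched in a few lines. This is a Python rendering of the condition only, not the LAPACK source; the function name is hypothetical, and the returned code -4 assumes the reference LAPACK argument order, in which `cfrom` is the fourth argument of DLASCL.

```python
import math


def dlascl_cfrom_check(cfrom: float) -> int:
    """Mimic the argument check on DLASCL's `cfrom`.

    Returns -4 (assumed position of `cfrom` in the argument list) when
    the value is zero or NaN, and 0 otherwise.
    """
    if cfrom == 0.0 or math.isnan(cfrom):
        return -4
    return 0


print(dlascl_cfrom_check(2.0))           # valid scale factor
print(dlascl_cfrom_check(0.0))           # rejected: zero
print(dlascl_cfrom_check(float("nan")))  # rejected: NaN
```

A NaN arriving here means an earlier computation, such as a norm, already produced it, so the illegal-value message is a symptom rather than the origin.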
Looking at the registers and flags in a debugger, the only thing that seems strange at the point where NumPy calls in to OpenBLAS is that ST0 is NAN when |
Any idea if this register is NaN under Linux? Guessing no. |
(#695 is what I had somewhere in the back of my mind, but at first glance it does/did not involve any fpu-using microkernels). |
I know I am persistent, but there is a difference between
Maybe one should |
Maybe, but my current reading of that description is that the implied |
I did. It didn't solve the problem. There is a comment to that effect above. Edit: ahh, you mean |
Seems most likely to me now that the mystery .S file "causing" the ZLALSD/DLASCL test failure is simply znrm2.S , all the other files with fpu instructions are either completely dead or extensions unlikely to be called from numpy |
I have taken the trouble to disassemble all object files from https://github.com/xianyi/OpenBLAS/releases/download/v0.3.10/OpenBLAS-0.3.10-x64.zip and take a look at FPU and MMX usage:
The following objects are using the
The Probably using the |
Thanks - the xROTG ones derive from C files (in OpenBLAS/interface) so any fpu usage there is/was up to the compiler. xNRM2 and xSUM are quite familiar names now (where the xSUM is a BLAS extension probably not used by numpy). (Still more inclined to replace nrm2.S with its generic C version specifically on Windows though) |
Unfortunately the scattered FPU state has to be healed somewhere: in the numpy or the OpenBLAS codebase. Adding two instructions to 2 assembler files seems the simplest method to me. In numpy one has to extend the build system to add MASM, add a helper function at the right place and so on. MS has a problematic perspective on the WIN64 API and FPU usage: the user has to be aware of the fact that the FPU state is volatile between calls (this includes a corrupt state, I guess). So the user is responsible for getting the FPU into a usable state before use. I'm sure nobody expected that on WIN64! On the other hand: nobody expected the Spanish Inquisition... The most problematic aspect is that the erratic FPU state stays pending if not healed somewhere and may pop up later in the codebase. |
Yuck. I did not envisage OpenBLAS as a Windows repair tool but it is hard to argue about the "somewhere". Still I consider the corrupt fpu state a genuine bug in Win10.19041 and would hope Microsoft treats it as such. (And if EMMS/FCLEX actually does the trick I would still prefer to make that addition to the (z)nrm2.S files) |
MSVC 2019 does not emit MMX or FPU in 64bit mode. |
Does not seem to make it safer to just leave the pending exception dangling, at best it would make it somebody else's problem ? (Assuming some other third-party library gets loaded as well, like we saw with libhdfs in another recent issue) |
@brada, yes, MSVC does not emit MMX or FPU in 64bit mode. But the MS runtime libraries are using FPU code and they use it in an unexpected way, hence the problems. As long as OpenBLAS is using FPU or MMX code itself in Win64 in the gcc builds, OpenBLAS must care about this problem. |
Do we know by now that the FCLEX (or FNCLEX) plus EMMS combo is sufficient to clear this sorry state ? |
I will try it as soon as I can. I have to set up my development machine first. |
Further to this, do we know that routines using only MMX are definitely affected by this DLL bug as well, or is this just an assumption based on them sharing the same registers ? (Thinking MMX is a bit more likely to still be in use than x87, so there should be more users and programs affected by the bug) |
@martin-frbg, I'm not an assembler guru. My ideas about what happens with this crazy issue are described here: #2709 (comment) and are the result of reflection after skimming numerous documents and stackoverflow/blog contributions. A huge time sink, btw. EDIT: we should send MS an invoice :-( I have an idea about why MMX is not affected: maybe because MMX uses the registers as they are and not as a circulating stack. But this is speculation. So from the MMX point of view the registers can simply be used and erroneous values will be overwritten. The documentation says: "The EMMS instruction must be used to clear the MMX technology state at the end of all MMX technology procedures or subroutines and before calling other procedures or subroutines that may execute x87 floating-point instructions. If a floating-point instruction loads one of the registers in the x87 FPU data register stack before the x87 FPU tag word has been reset by the EMMS instruction, an x87 floating-point register stack overflow can occur that will result in an x87 floating-point exception or incorrect result. EMMS operation is the same in non-64-bit modes and 64-bit mode." I have drawn two conclusions from that:
But again: I'm not an assembler guru. |
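The documentation passage quoted above can be illustrated with a toy model of the x87 tag word. This is a deliberately simplified sketch in Python, not real FPU behaviour in every detail: the class `ToyX87` and its methods are hypothetical, and it models only the tag word, the top-of-stack pointer and the masked response to a stack overflow (a NaN result). It does not model the `fmod` problem in `ucrtbase.dll` itself.

```python
import math


class ToyX87:
    """Toy model of the x87 register stack: 8 slots plus an 'empty' tag each."""

    def __init__(self):
        self.empty = [True] * 8   # tag word: True means the slot is free
        self.regs = [0.0] * 8
        self.top = 0              # top-of-stack pointer
        self.invalid_flag = False

    def mmx_op(self):
        # Any MMX instruction marks all tags valid and resets TOP to 0.
        self.empty = [False] * 8
        self.top = 0

    def emms(self):
        # EMMS marks every slot empty again.
        self.empty = [True] * 8

    def fld(self, value):
        # Push: decrement TOP, then store. A non-empty target slot is a
        # stack overflow; with the exception masked the result is NaN.
        self.top = (self.top - 1) % 8
        if not self.empty[self.top]:
            self.invalid_flag = True
            value = math.nan
        self.regs[self.top] = value
        self.empty[self.top] = False

    def st0(self):
        return self.regs[self.top]


stale = ToyX87()
stale.mmx_op()          # tags left "valid", no EMMS afterwards
stale.fld(1.5)
print(stale.st0())      # NaN: the push hit a slot still tagged as in use

clean = ToyX87()
clean.mmx_op()
clean.emms()            # tag word reset before x87 code runs
clean.fld(1.5)
print(clean.st0())      # the pushed value survives
```

In this model a stale tag word is enough to turn an ordinary load into a NaN, which is consistent with the ST0 observation earlier in the thread, though the model cannot confirm that this is what happened on the affected Windows build.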
It looks as if the behaviour that used to apply to userspace WDM (printer drivers), where the MMX state is corrupted across a context switch, has somehow reached generic applications too. It would be interesting to see whether the Windows Vista compatibility layer fixes this. |
@brada, the error has been localized; see numpy/numpy#16744 (comment). It is caused by the use of fmod from ucrtbase.dll (10.0.19041.488). See https://developercommunity.visualstudio.com/content/problem/1207405/fmod-after-an-update-to-windows-2004-is-causing-a.html and https://developercommunity.visualstudio.com/content/problem/1208774/fpu-exception-in-fmod0-x-in-windows-10-version-200.html. |
@martin-frbg, I think you are right. |
Commenting on this closed issue to update that in numpy/numpy#17547 I added code to call |
@mattip Belt and braces</uk translation> |
It seems the issue causing Numpy to break when installing on Windows has finally been resolved, according to the original issue reporting this bug: OpenMathLib/OpenBLAS#2709 Unpinning because we don't need this pinned anymore.
OpenBLAS issue on Windows 10 2004 has been fixed (see numpy/numpy#16744 and OpenMathLib/OpenBLAS#2709)
This issue only appears on Windows 10 2004 (19041). It does not appear on Windows 10 1909 (18363).
On a fresh install of NumPy from pip on a 2004 machine, e.g,
open
ipython
and enter
This produces an error:
What makes me suspect that it is OpenBLAS related is:
On the other hand, it may well be a Windows bug since:
Any help is appreciated.
The related NumPy issue is numpy/numpy#16744.
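The build numbers quoted in this report (18363 unaffected, 19041 affected) can be turned into a quick check. This is a sketch under stated assumptions: the function names are hypothetical, the version string is assumed to have the `major.minor.build` form that `platform.version()` returns on Windows 10, and the threshold only reflects what was reported in this thread, not any later Windows update.

```python
import platform

# Lowest build on which the errors were reported in this thread.
FIRST_REPORTED_BUILD = 19041


def parse_windows_build(version: str) -> int:
    """Extract the build number from a string such as '10.0.19041'."""
    return int(version.split(".")[2])


def in_reported_range(build: int) -> bool:
    """True if the build is at or above the first build reported as failing."""
    return build >= FIRST_REPORTED_BUILD


print(in_reported_range(parse_windows_build("10.0.18363")))  # 1909
print(in_reported_range(parse_windows_build("10.0.19041")))  # 2004

# On a Windows machine the current build could be checked like this.
if platform.system() == "Windows":
    print(in_reported_range(parse_windows_build(platform.version())))
```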