Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Errors when using OpenBLAS through NumPy on Windows 10 2004 #2709

Closed
bashtage opened this issue Jul 8, 2020 · 81 comments · Fixed by #2881
Closed

Errors when using OpenBLAS through NumPy on Windows 10 2004 #2709

bashtage opened this issue Jul 8, 2020 · 81 comments · Fixed by #2881
Labels
Bug in other software Compiler, Virtual Machine, etc. bug affecting OpenBLAS

Comments

@bashtage
Copy link

bashtage commented Jul 8, 2020

This issue only appears on Windows 10 2004 (19041). It does not appear on Windows 10 1909 (18363).

On a fresh install of NumPy from pip on a 2004 machine, e.g,

pip install numpy ipython

open ipython and enter

import numpy as np
a = np.arange(13 * 13, dtype=np.float64)
a.shape = (13, 13)
a = a % 17
va, ve = np.linalg.eig(a)

This produces an error:

 ** On entry to DGEBAL parameter number  3 had an illegal value
 ** On entry to DGEHRD  parameter number  2 had an illegal value
 ** On entry to DORGHR DORGQR parameter number  2 had an illegal value
 ** On entry to DHSEQR parameter number  4 had an illegal value
---------------------------------------------------------------------------
LinAlgError                               Traceback (most recent call last)
<ipython-input-1-bad305f0dfc7> in <module>
      3 a.shape = (13, 13)
      4 a = a % 17
----> 5 va, ve = np.linalg.eig(a)

<__array_function__ internals> in eig(*args, **kwargs)

c:\anaconda\envs\py-pip\lib\site-packages\numpy\linalg\linalg.py in eig(a)
   1322         _raise_linalgerror_eigenvalues_nonconvergence)
   1323     signature = 'D->DD' if isComplexType(t) else 'd->DD'
-> 1324     w, vt = _umath_linalg.eig(a, signature=signature, extobj=extobj)
   1325
   1326     if not isComplexType(t) and all(w.imag == 0.0):

c:\anaconda\envs\py-pip\lib\site-packages\numpy\linalg\linalg.py in _raise_linalgerror_eigenvalues_nonconvergence(err, flag)
     92
     93 def _raise_linalgerror_eigenvalues_nonconvergence(err, flag):
---> 94     raise LinAlgError("Eigenvalues did not converge")
     95
     96 def _raise_linalgerror_svd_nonconvergence(err, flag):

LinAlgError: Eigenvalues did not converge

What makes me suspect that it is OpenBLAS related is:

  1. It does not happen when using MKL.
  2. It does not happen when using LAPACK-lite that ships with NumPy and is used as a fallback.

On the other hand, it may likely be a Windows bug since:

  1. It does not occur before 2004.
  2. It does not occur when using 32-bit Python on the same system.
  3. It does not occur when running Pythin in WSL on the same machine whether using WSL1, which runs Linux binaries directly on Windows, or WSL2, which runs them in a hypervisor.

Any help is appreciated.

The related NumPy issue is numpy/numpy#16744.

@martin-frbg
Copy link
Collaborator

While I agree that it looks "OpenBLAS related", I am not yet convinced it is a bug in OpenBLAS itself, particularly if it only happens with/after the Win10 update. The error messages look as if argument passing from C (BLAS) to FORTRAN (LAPACK) functions took a hit.
Perhaps it would make sense to try and build OpenBLAS on a Win10-2004 system (assuming that your pip install just pulled a precompiled binary). Btw it might also be relevant if this affects only particular cpu generations.

@bashtage
Copy link
Author

bashtage commented Jul 8, 2020

I ran openblas_utext.exe without any issues. Is there an equivalent to the lapack-test directive in the *nix make file?

@martin-frbg
Copy link
Collaborator

The executables in test (if you have them) are at least fortran files calling into C routines, but the lapack-test is currently not available in the cmake build. I'll see if it is sufficient to add its subdirectory to the main CMakeLists.txt but I suspect some things may need adjusting compared to the stock cmake support in Reference-LAPACK.

@brada4
Copy link
Contributor

brada4 commented Jul 9, 2020

Indeed pip will install binary by default.
It needs parameter --no-binary :all: to force compilation of all module libraries.

The illegal value parameters are all integers. This might mean that LAPACK and BLAS has different idea of integer sizes. Could you copy numpy directory from working version and test again?It just looks as broken wheel package pulled from pypi.

@bashtage
Copy link
Author

bashtage commented Jul 9, 2020

I tried the nightly from 5/7/20 but this also failed. It has a different version than 1.19 although it isn't obvious how it differs other than it has a different hash.

I'll try to build myself in the next few days.

@brada4
Copy link
Contributor

brada4 commented Jul 9, 2020

It is not directly made by OpenBLAS, could you point to hashes you talk about so that we can have a look at package content?

https://pypi.org/project/numpy/#files

@brada4
Copy link
Contributor

brada4 commented Jul 10, 2020

@bashtage what would be interesting to compare numpy wheels (' hashes) in pip caches. Looking into archives they include 0.3.9-dev version which is anything from development tree (maybe known broken) between 0.3.9 and 0.3.10.
I think we need to add git tag as once proposed out of fact that downstreams do not stick to the releases.

@bashtage
Copy link
Author

From NumPy 1.19.0:

libopenblas.NOIJJG62EMASZI6NYURL6JBKM4EVBGM7.gfortran-win_amd64.dll

strings indicates this is 3.9

From NumPy nightly 5/7/2020:

libopenblas.VN4TFHCG6GCB7E53NOJHJWQ4PL7VXRV3.gfortran-win_amd64

strings indicates this is 3.10

[It is my assumption that the large string is a hash]

@martin-frbg
Copy link
Collaborator

numpy should know which hash they packaged, most likely it is a snapshot from just a few days before the 0.3.10 release to get the fix for inadvertent use of avx512 on haswell (in dynamic_arch builds created on skx). most likely a red herring when the issue seems so closely tied with a microsoft patch.

@brada4
Copy link
Contributor

brada4 commented Jul 10, 2020

Is the hash on the working system matching any of these two?

@martin-frbg
Copy link
Collaborator

There is a (small) chance that #2729 may have fixed this (inconsistent declarations of the size of the GEMM buffer in DYNAMIC_ARCH builds)

@martin-frbg
Copy link
Collaborator

@bashtage Any news on this ?

@bashtage
Copy link
Author

bashtage commented Jul 29, 2020

NumPy is using head in their latest nightlies and it is still producing the same issues.

Edit: I do think it is more and more likely a Windows bug. It even appears when coretype is set to Prescott. I had planned to reach out to a Microsoft Python advocate to see if I can get any bandwidth beyond reporting it through standard channels. It won't show up in CI for a very long time since most CI instances are using build 17xxx of Windows, while 19041+ is required for the bug to appear.

@brada4
Copy link
Contributor

brada4 commented Jul 29, 2020

Prescott

What is your CPU? Like CPU-Z screenshot would be nice
Should be fixable setting OPENBLAS_CORETYPE= SKYLAKEX/HASWELL/SANDYBRIDGE for AVX512/AVX2/AVX respectively.

@bashtage
Copy link
Author

bashtage commented Jul 29, 2020 via email

@brada4
Copy link
Contributor

brada4 commented Jul 29, 2020

So it should be ZEN. there is recent change to detect 3td generation ZEN CPUs #2744 , coming downstream in due time.
The illegal value error indeed is disturbing.

@martin-frbg
Copy link
Collaborator

@brada4 this has all already been done and discussed here and on the numpy ticket and I do not see what Zen3 has to do with it

@martin-frbg martin-frbg added the Bug in other software Compiler, Virtual Machine, etc. bug affecting OpenBLAS label Jul 30, 2020
@mattip
Copy link
Contributor

mattip commented Sep 30, 2020

A little more information. I used the instructions in the wiki to use CMake and vstudio 2019 to build OpenBLAS. The error still reproduces.

Tracking it down, the first error is here in a call to DNRM2 which results in a NAN even though the data seems valid. (I used WRITE(*,*) A in a loop just to make sure). My fortran foo is not strong enough to figure out which DNRM2 is being called. It is neither the one in reference/dnrm2f.f nor the one in lapack-netlib/BLAS/SRC/dnrm2.f: adding WRITE(*,*) N to the top of both those function did not print anything. Any hints as to how I can find where the implementation might be? Might there be, in one of the implementations of this function, a variable that is not initialized when it is compiled on windows?

@martin-frbg
Copy link
Collaborator

The two Fortran files you found are both from the original reference (or "netlib") BLAS (one "accidentally" included as a part of Reference-LAPACK, the other dead code from an earlier idea to provide a built-in verification).
The code that actually gets called is interface/nrm2.c which forwards to the appropriate cpu-specific microkernel, named NRM2_K in the file (which on x86_64 with a non-Microsoft compiler would be kernel/x86_64/nrm2.S according to the KERNEL files in the x86_64 directory). As you are building with VS,
which does not cope with the assembly code, the generic implementation from kernel/arm/nrm2.c is substituted.
(the CNAME macro there gets replaced with DNRM2, and FLOAT with double). That code looks quite straightforward to me, but maybe instrumenting it with print statements will provide some insight ?

@mattip
Copy link
Contributor

mattip commented Oct 1, 2020

Surprisingly, adding print statements to kernle/arm/nrm2.c did not do anything. I am pretty sure the build is using the assembly version via clang (maybe because of flang?), and in the numpy builds it is using mingw, so maybe also getting the assembly kernel. I see compiled kernels in How can I convince OpenBLAS to use the C version instead, as a way to check whether there is some new problem with the assembly compilation or register use, or is there a way to add a fprint("A\n") to the assembly to positively prove I am using it?

Is there a set of tests I can run in OpenBLAS that might expose the problem (sorry for the noise if you have already tried that)? I wonder if it is critical that the matrix be 13 x 13 ?

@martin-frbg
Copy link
Collaborator

With mingw or clang you can add a line
DNRM2KERNEL=../arm/nrm2.c
to kernel/x86_64/KERNEL.HASWELL (or whatever matches your cpu)

@mattip
Copy link
Contributor

mattip commented Oct 1, 2020

My cpu is "model name : AMD Ryzen 9 3900X 12-Core Processor" so I assume that is KERNEL.ZEN.
What do I need to touch to get that to be rebuilt? I don't seem to be able to trigger the replacement of the assembly version with the C version.

@martin-frbg
Copy link
Collaborator

make clean should do it - if you are using cmake, it may be necessary to recreate the build directory however

@mattip
Copy link
Contributor

mattip commented Oct 1, 2020

Forcing the use of the C-based ../arm/nrm2.c kernel fixes the dnrm-related failures. When running the entire NumPy test suite with the patch in the detail, I still have one failure: dgelsd reports ** On entry to DLASCL parameter number 4 had an illegal value

diff --git a/kernel/x86_64/KERNEL b/kernel/x86_64/KERNEL
index 4a2e13be..e1b1e6e4 100644
--- a/kernel/x86_64/KERNEL
+++ b/kernel/x86_64/KERNEL
@@ -259,7 +259,8 @@ SNRM2KERNEL = nrm2_sse.S
 endif

 ifndef DNRM2KERNEL
-DNRM2KERNEL = nrm2.S
+//DNRM2KERNEL = nrm2.S
+DNRM2KERNEL      =  ../arm/nrm2.c
 endif

 ifndef QNRM2KERNEL

@martin-frbg
Copy link
Collaborator

That's another NaN then - the check in DLASCL is for double argument "cfrom" zero or NaN. And we still do not really know where they come from, or do we ?

@mattip
Copy link
Contributor

mattip commented Oct 2, 2020

Looking at the registers and flags in a debugger, the only thing that seems strange at the point where NumPy calls in to OpenBLAS is that ST0 is NAN when fmod() is used, and +0 when it is not used.

@mattip
Copy link
Contributor

mattip commented Oct 2, 2020

It is legal to call fstp without previously pushing anything onto the fpu stack? Here is the first part of the implementation of fmod, the ST0 register gets messed up on line 98

@bashtage
Copy link
Author

bashtage commented Oct 2, 2020

Looking at the registers and flags in a debugger, the only thing that seems strange at the point where NumPy calls in to OpenBLAS is that ST0 is NAN when fmod() is used, and +0 when it is not used.

Any idea if this register is NaN under Linux? Guessing no.

@martin-frbg
Copy link
Collaborator

martin-frbg commented Oct 5, 2020

(#695 is what I had somewhere in the back of my mind, but at first glance it does/did not involve any fpu-using microkernels).
Yes, using the generic c codes on Windows would probably be easier (need to check if we can ifdef on OSNAME in the KERNEL files already)

@carlkl
Copy link

carlkl commented Oct 6, 2020

I know I am persistent, but there is a difference between FCLEX and FNCLEX:

Clears the floating-point exception flags (PE, UE, OE, ZE, DE, and IE), the exception summary status flag (ES), 
the stack fault flag (SF), and the busy flag (B) in the FPU status word. The FCLEX instruction checks for and 
handles any pending unmasked floating-point exceptions before clearing the exception flags; 
the FNCLEX instruction does not.

The assembler issues two instructions for the FCLEX instruction (an FWAIT instruction followed by an FNCLEX instruction), 
and the processor executes each of these instructions separately. If an exception is generated for either of these 
instructions, the save EIP points to the instruction that caused the exception.

Maybe one should FCLEX give a try.

@martin-frbg
Copy link
Collaborator

Maybe, but my current reading of that description is that the implied FWAIT in FCLEX would only raise the exception that we are trying to clear (and all this assumes that it is only a pending exception flag we need to worry about, not an invalid fpu stack in general)

@mattip
Copy link
Contributor

mattip commented Oct 6, 2020

Maybe one should FCLEX give a try.

I did. It didn't solve the problem.There is a comment to that effect above.

Edit: ahh, you mean fnclex -> fclex. Sorry.

@martin-frbg
Copy link
Collaborator

Seems most likely to me now that the mystery .S file "causing" the ZLALSD/DLASCL test failure is simply znrm2.S , all the other files with fpu instructions are either completely dead or extensions unlikely to be called from numpy

@carlkl
Copy link

carlkl commented Oct 7, 2020

I have taken the trouble to disassemble all object files from https://github.com/xianyi/OpenBLAS/releases/download/v0.3.10/OpenBLAS-0.3.10-x64.zip and take a look at FPU and MMX usage:

FPU instructions:

cblas_drotg.obj
cblas_srotg.obj

crotg.obj
drotg.obj
srotg.obj
zrotg.obj

csum_k_<TARGETNAME>.obj
dsum_k_<TARGETNAME>.obj
ssum_k_<TARGETNAME>.obj
zsum_k_<TARGETNAME>.obj

dnrm2_k_<TARGETNAME>.obj
znrm2_k_<TARGETNAME>.obj

MMX instructions:

dgemm_oncopy_BARCELONA.obj.asm
dgemm_otcopy_BARCELONA.obj.asm
sgemm_oncopy_BARCELONA.obj.asm
sgemm_otcopy_BARCELONA.obj.asm
strsm_kernel_LN_BARCELONA.obj.asm
strsm_kernel_LN_CORE2.obj.asm
strsm_kernel_LN_PRESCOTT.obj.asm
strsm_kernel_LT_BARCELONA.obj.asm
strsm_kernel_LT_CORE2.obj.asm
strsm_kernel_LT_PRESCOTT.obj.asm
strsm_kernel_RN_BARCELONA.obj.asm
strsm_kernel_RN_CORE2.obj.asm
strsm_kernel_RN_PRESCOTT.obj.asm
strsm_kernel_RT_BARCELONA.obj.asm
strsm_kernel_RT_CORE2.obj.asm
strsm_kernel_RT_PRESCOTT.obj.asm

The following objects are using the EMMS instruction to RESET and emtpy the FPU registers after MMX usage:

dgemm_oncopy_BARCELONA.obj.asm
dgemm_otcopy_BARCELONA.obj.asm
sgemm_oncopy_BARCELONA.obj.asm
sgemm_otcopy_BARCELONA.obj.asm

The EMMS instruction markes the FPU stack as emtpy and prepares it for FPU usages after use of MMX instructions. This is necessary as the MMX and the FPU registers are aligned. EMMS should also be able to prepare the FPU stack for further usage in case it is corrupted.

Probably using the FCLEX and EMMS instruction instead of FNINIT in the beginning of the *nrm2.S assembler files is sufficient to heal bad behaviour from any third party library (or MS VC runtime DLL) to the FPU registers and stack.
The benefit of these two instructions vs FNINIT is performance and the fact it does not change the FPU precision and is therefore consistent with CONSISTENT_FPCSR. This means it can be added without any considerations.
However, this should be tested by @mattip first.

@martin-frbg
Copy link
Collaborator

Thanks - the xROTG ones derive from C files (in OpenBLAS/interface) so any fpu usage there is/was up to the compiler. xNRM2 and xSUM are quite familiar names now (where the xSUM is a BLAS extension probably not used by numpy). (Still more inclined to replace nrm2.S with its generic C version specifically on Windows though)

@carlkl
Copy link

carlkl commented Oct 7, 2020

Unfortunately the scattered FPU state has to be healed somewhere: in the numpy or the OpenBLAS codebase. Adding two instructions to 2 assembler files seems the most simple method to me. In numpy one has to extend the build system to add MASM, add a helper function at the right place and so on.

MS has a problematic perspective for the WIN64 API and FPU uasge: the user has to be aware of the fact tat the FPU state is volatile between calls (this includes a corrupt state I guess). So the user is responsible to get the FPU in a usuable state before usage. I'm sure nobody expected that an WIN64! On the other hand: nobody expected the spanish inquistion...

The most problematic aspect is, that the erratic FPU state is pending if not healed somewhere and may pop up later in the codebase.

@martin-frbg
Copy link
Collaborator

Yuck. I did not envisage OpenBLAS as a Windows repair tool but it is hard to argue about the "somewhere". Still I consider the corrupt fpu state a genuine bug in Win10.19041 and would hope Microsoft treats it as such. (And if EMMS/FCLEX actually does the trick I would still prefer to make that addition to the (z)nrm2.S files #ifdef OS_WINDOWS)

@brada4
Copy link
Contributor

brada4 commented Oct 8, 2020

MSVC 2019 does not emit MMX or FPU in 64bit mode.

@martin-frbg
Copy link
Collaborator

Does not seem to make it safer to just leave the pending exception dangling, at best it would make it somebody else's problem ? (Assuming some other third-pary library gets loaded as well, like we saw with libhdfs in another recent issue)

@carlkl
Copy link

carlkl commented Oct 8, 2020

@brada, yes, MSVC does not emit MMX or FPU in 64bit mode. But the MS runtime libraries are using FPU code and the use it in an unexpected way, hence the problems. As long as OpenBLAS is using FPU or MMX code itself in Win64 in the gcc builds OpenBLAS must care about this problem.
FCLEX and EMMS instructions are not harmful and costs only some cycles. So why not add them in the right place?

@martin-frbg
Copy link
Collaborator

Do we know by know that the FCLEX (or FNCLEX) plut EMMS combo is sufficient to clear this sorry state ?

@carlkl
Copy link

carlkl commented Oct 8, 2020

I will try it as soon as I can. I have to setup my development machine first.

@martin-frbg
Copy link
Collaborator

Further to this, do we know that routines using only MMX are definitely affected by this DLL bug as well, or is this just an assumption based on them sharing the same registers ? (Thinking MMX is a bit more likely to still be in use than x87, so there should be more users and programs affected by the bug)

@carlkl
Copy link

carlkl commented Oct 8, 2020

I will try it as soon as I can. I have to setup my development machine first.

@carlkl
Copy link

carlkl commented Oct 8, 2020

@martin-frbg, I'm not a assembler guru. My idee what happens with this crazy issue are described here: #2709 (comment) and are the results of reflection after scamming numerous documents, stackoverflow/blog contributions. A huge time sink btw. EDIT: we should file MS an invoice :-(

I have an idea about why MMX is not affected: Maybe because MMX uses the registers as is and not as a circulating stack. But this is speculation. So from the MMX point of view the registers can be simply used and erroneous values will be overwritten.

The documentation says:

The EMMS instruction must be used to clear the MMX technology state at the end of all MMX technology procedures or subroutines and before calling other procedures or subroutines that may execute x87 floating-point instructions. If a floating-point instruction loads one of the registers in the x87 FPU data register stack before the x87 FPU tag word has been reset by the EMMS instruction, an x87 floating-point register stack overflow can occur that will result in an x87 floating-point exception or incorrect result.

EMMS operation is the same in non-64-bit modes and 64-bit mode.

I have drawn two conclusions from that:

  1. Before using the FPU EMMS can heal the the FPU state.
  2. This instruction is not needed for MMX usage

But again: I'm not an assembler guru.

@brada4
Copy link
Contributor

brada4 commented Oct 9, 2020

---, yes, MSVC does not emit MMX or FPU in 64bit mode. But the MS runtime libraries are using FPU code and the use it in an unexpected way, hence the problems. As long as OpenBLAS is using FPU or MMX code itself in Win64 in the gcc builds OpenBLAS must care about this problem.
FCLEX and EMMS instructions are not harmful and costs only some cycles. So why not add them in the right place?

It looks like somehow the stuf that applied to userspace WDM (printer drivers) that mmx state is corrupted over context switch, came to generic applications too. Would be interesting to see if windows vista compatibility layer fixes this.

@martin-frbg
Copy link
Collaborator

@carlkl think you may be tagging an unrelated user in your replies to brada4 here. @brada4 from your suggestions I am not sure if you have been following this thread...

@carlkl
Copy link

carlkl commented Oct 9, 2020

@martin-frbg, I think you are right.

@mattip
Copy link
Contributor

mattip commented Oct 13, 2020

Commenting on this closed issue to update that in numpy/numpy#17547 I added code to call emms after a call to fmod in windows, so now we have both a belt and suspenders (does that work in English?)

@bashtage
Copy link
Author

@mattip Belt and bracers</uk translation>

brianlechthaler added a commit to qbitkit/qbitkit that referenced this issue Feb 5, 2021
It seems the issue causing Numpy to break when installing on Windows has finally been resolved, according to the original issue reporting this bug: OpenMathLib/OpenBLAS#2709

Unpinning because we don't need this pinned anymore.
sam210723 added a commit to sam210723/xrit-rx that referenced this issue Sep 20, 2021
OpenBLAS issue on Windows 10 2004 has been fixed (see numpy/numpy#16744 and OpenMathLib/OpenBLAS#2709)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug in other software Compiler, Virtual Machine, etc. bug affecting OpenBLAS
Projects
None yet
5 participants