Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Calling zpotrs_ introduces nans in unrelated results of cpotrf_ with Nehalem kernels #695

Closed
tkelman opened this issue Nov 23, 2015 · 11 comments

Comments

@tkelman
Copy link
Contributor

tkelman commented Nov 23, 2015

This was running on a Haswell with Ubuntu 14.04. I have 4 separate arrays here, A1, A2, A3, and B. The values in A1 and A3 are exactly equal, but running the same call to cpotrf_ before vs after an unrelated zpotrs_ call on A2 and B produces different results. The second call has somehow introduced some unexpected NaNs. If I comment out the zpotrs_ call then there are no NaNs.

/*
git clone https://github.com/xianyi/OpenBLAS
# develop at e31948ceb0ffe0afcbed2e528cca3d4e1366c421, also v0.2.15
cd OpenBLAS
make TARGET=NEHALEM -j16
# save this file as 13243.c
gcc 13243.c -o 13243 -L. -lopenblas
LD_LIBRARY_PATH=. ./13243
*/
#include <stdint.h>
#include <complex.h>
#include <stdio.h>
#define BLASINT int32_t
void cpotrf_(char*, BLASINT*, complex float*, BLASINT*, BLASINT*);
void zpotrs_(char*, BLASINT*, BLASINT*, complex double*,
    BLASINT*, complex double*, BLASINT*, BLASINT*);
int main() {
complex float A1[100] = {5.8525753+0.0*I, -0.79540455-0.7066077*I, 0.98274714-1.3824869*I, 2.619998-1.8532984*I, -1.8306153+1.2336911*I, 0.32275113-0.015575029*I, 2.1968813-1.0640624*I, 0.27894387-0.97911835*I, 3.0476584-0.18548489*I, 0.3842994-0.7050991*I,
-0.79540455+0.7066077*I, 8.313246+0.0*I, -1.8076122+0.8882447*I, 0.47806996-0.48494184*I, 0.5096429+0.5395974*I, -0.7285097+0.10360408*I, -1.1760061+2.7146957*I, -0.4271084-0.042899966*I, -1.7228563-2.8335886*I, 1.8942566-0.6389735*I,
0.98274714+1.3824869*I, -1.8076122-0.8882447*I, 9.367975+0.0*I, -0.1838578-0.6468568*I, -1.8338387-0.7064959*I, 0.041852742+0.6556877*I, 2.5673025-1.9732997*I, -1.1148382+0.15693812*I, 2.4704504+1.0389464*I, 1.0858271+1.298006*I,
2.619998+1.8532984*I, 0.47806996+0.48494184*I, -0.1838578+0.6468568*I, 3.1117508+0.0*I, -1.956626-0.22825956*I, 0.07081801+0.31801307*I, 0.3698375+0.5400855*I, 0.80686307-1.5315914*I, 1.5649154+1.6229297*I, -0.112077385-1.2014246*I,
-1.8306153-1.2336911*I, 0.5096429-0.5395974*I, -1.8338387+0.7064959*I, -1.956626+0.22825956*I, 3.6439795+0.0*I, -0.2594722-0.48786148*I, -0.47636223+0.27821827*I, -0.61608654+2.01858*I, -2.7767487-1.7693765*I, 0.048102796+0.9741874*I,
0.32275113+0.015575029*I, -0.7285097-0.10360408*I, 0.041852742-0.6556877*I, 0.07081801-0.31801307*I, -0.2594722+0.48786148*I, 3.624376+0.0*I, -1.6697118-0.4017511*I, -1.4397877+0.7550918*I, -0.31456697+1.0403451*I, -0.31978557-0.13701046*I,
2.1968813+1.0640624*I, -1.1760061-2.7146957*I, 2.5673025+1.9732997*I, 0.3698375-0.5400855*I, -0.47636223-0.27821827*I, -1.6697118+0.4017511*I, 6.8273163+0.0*I, -0.10051322-0.24303961*I, 1.4415971-0.29750675*I, 1.221786+0.85654986*I,
0.27894387+0.97911835*I, -0.4271084+0.042899966*I, -1.1148382-0.15693812*I, 0.80686307+1.5315914*I, -0.61608654-2.01858*I, -1.4397877-0.7550918*I, -0.10051322+0.24303961*I, 3.4057708+0.0*I, -0.5856801+1.0203559*I, 0.7103452-0.8422135*I,
3.0476584+0.18548489*I, -1.7228563+2.8335886*I, 2.4704504-1.0389464*I, 1.5649154-1.6229297*I, -2.7767487+1.7693765*I, -0.31456697-1.0403451*I, 1.4415971+0.29750675*I, -0.5856801-1.0203559*I, 7.005772+0.0*I, -0.9617417+1.2486815*I,
0.3842994+0.7050991*I, 1.8942566+0.6389735*I, 1.0858271-1.298006*I, -0.112077385+1.2014246*I, 0.048102796-0.9741874*I, -0.31978557+0.13701046*I, 1.221786-0.85654986*I, 0.7103452+0.8422135*I, -0.9617417-1.2486815*I, 3.4629636+0.0*I};
char up = 'U';
BLASINT n = 10;
BLASINT info[1];
cpotrf_(&up, &n, A1, &n, info);
printf("%g+%g*I\n", creal(A1[91]), cimag(A1[91]));
// 0.653214+0.27415*I

complex double A2[100] = {3.0607147216796875+0.0*I, -0.5905849933624268-0.29020825028419495*I, 0.321084201335907+0.45168760418891907*I, 0.8387917876243591-0.644718587398529*I, -0.3642411530017853+0.051274992525577545*I, 0.8071482181549072+0.33944568037986755*I, 0.013674172572791576+0.21422699093818665*I, 0.35476258397102356+0.42408594489097595*I, -0.5991537570953369-0.23082709312438965*I, -0.0600702166557312-0.2113417387008667*I,
-0.7954045534133911+0.7066076993942261*I, 2.807175397872925+0.0*I, -0.1691000759601593+0.313548743724823*I, -0.30911174416542053+0.7447023987770081*I, -0.22347848117351532+0.03316075727343559*I, -0.4088296890258789-1.0214389562606812*I, -0.2344931811094284+0.08056317269802094*I, 0.793269693851471-0.17507623136043549*I, 0.03163455054163933+0.20559945702552795*I, 0.13581633567810059-0.2110036462545395*I,
0.9827471375465393+1.3824869394302368*I, -1.8076121807098389-0.8882446885108948*I, 2.3277781009674072+0.0*I, 0.830405056476593-0.19296252727508545*I, 0.1394239068031311-0.5260677933692932*I, 1.239942193031311-0.09915469586849213*I, 0.06731037050485611-0.059320636093616486*I, 0.11507681757211685-0.1984301060438156*I, -0.6843825578689575+0.4647614359855652*I, 1.213119387626648-0.7757048010826111*I,
2.619997978210449+1.8532984256744385*I, 0.4780699610710144+0.48494184017181396*I, -0.18385779857635498+0.6468567848205566*I, 2.0811400413513184+0.0*I, -0.035075582563877106+0.09732913225889206*I, 0.27337002754211426-0.9032229781150818*I, -0.8374675512313843+0.0479498989880085*I, 0.6916252374649048+0.45711082220077515*I, 0.1883818507194519+0.06482727080583572*I, -0.32384994626045227+0.05857187137007713*I,
-1.8306152820587158-1.2336910963058472*I, 0.5096428990364075-0.5395973920822144*I, -1.833838701248169+0.7064958810806274*I, -1.956626057624817+0.22825956344604492*I, 1.706615924835205+0.0*I, -0.2895336151123047+0.17579378187656403*I, -0.923172116279602-0.4530014097690582*I, 0.5040621757507324-0.37026339769363403*I, -0.2824432849884033-1.0374568700790405*I, 0.1399831622838974+0.4977008104324341*I,
0.32275113463401794+0.015575028955936432*I, -0.7285097241401672-0.10360407829284668*I, 0.041852742433547974-0.655687689781189*I, 0.07081800699234009-0.318013072013855*I, -0.25947219133377075+0.4878614842891693*I, 1.5735365152359009+0.0*I, -0.2647853195667267-0.26654252409935*I, -0.6190430521965027-0.24699924886226654*I, -0.6288471221923828+0.48154571652412415*I, 0.02446540631353855-0.2611822783946991*I,
2.1968812942504883+1.0640623569488525*I, -1.1760060787200928-2.714695692062378*I, 2.5673024654388428+1.9732997417449951*I, 0.3698374927043915-0.54008549451828*I, -0.4763622283935547-0.27821826934814453*I, -1.6697118282318115+0.4017511010169983*I, 1.2674795389175415+0.0*I, 0.3079095482826233-0.07258892804384232*I, -0.5929520130157471-0.038360968232154846*I, 0.04388086497783661-0.025549031794071198*I,
0.27894386649131775+0.9791183471679688*I, -0.42710840702056885+0.0428999662399292*I, -1.1148382425308228-0.1569381207227707*I, 0.8068630695343018+1.5315914154052734*I, -0.6160865426063538-2.0185799598693848*I, -1.439787745475769-0.7550917863845825*I, -0.10051321983337402+0.24303960800170898*I, 0.9066106081008911+0.0*I, 0.05315789580345154-0.06136537343263626*I, -0.21304509043693542+0.6494344472885132*I,
3.0476584434509277+0.1854848861694336*I, -1.7228562831878662+2.8335886001586914*I, 2.4704504013061523-1.0389463901519775*I, 1.564915418624878-1.6229296922683716*I, -2.7767486572265625+1.769376516342163*I, -0.314566969871521-1.0403450727462769*I, 1.4415971040725708+0.29750674962997437*I, -0.5856801271438599-1.0203559398651123*I, 0.5668219923973083+0.0*I, 0.033351436257362366-0.07832501083612442*I,
0.3842993974685669+0.7050991058349609*I, 1.894256591796875+0.6389734745025635*I, 1.085827112197876-1.2980060577392578*I, -0.11207738518714905+1.2014245986938477*I, 0.04810279607772827-0.9741873741149902*I, -0.31978556513786316+0.13701045513153076*I, 1.2217860221862793-0.856549859046936*I, 0.7103452086448669+0.84221351146698*I, -0.9617416858673096-1.2486815452575684*I, 0.0756804421544075+0.0*I};
complex double B[20] = {-0.21782716937787788-0.9222220085490986*I, -0.7620356655676837+0.15533508334193666*I, -0.905011814118756+0.2847570854574069*I, -0.3451346708401685+1.076948486041297*I, 0.25336108035924787+0.975317836492159*I, 0.11192755545114-0.1603741874112385*I, -0.20604111555491242+0.10570814584017311*I, -1.0568488936791578-0.06025820467086475*I, -0.6650468984506477-0.5000967284800251*I, -1.0509472322215125+0.5022165705328413*I,
-0.727775859267237+0.50638268521728*I, 0.39947219167701153-0.4576746001199889*I, -0.7122162951294634-0.630289556702497*I, 0.9870834574024372-0.2825689605519449*I, 0.0628393808469436-0.1253397353973715*I, 0.8439562576196216+1.0850814110398734*I, 0.562377322638969-0.2578030745663871*I, 0.12696236014017806-0.09853584666755086*I, -0.023682508769195098+0.18093440285319276*I, -0.7264975746431271+0.31670415674097235*I};
char lo = 'L';
BLASINT nrhs = 2;
zpotrs_(&lo, &n, &nrhs, A2, &n, B, &n, info);

// note that this is exactly equal to A1
complex float A3[100] = {5.8525753+0.0*I, -0.79540455-0.7066077*I, 0.98274714-1.3824869*I, 2.619998-1.8532984*I, -1.8306153+1.2336911*I, 0.32275113-0.015575029*I, 2.1968813-1.0640624*I, 0.27894387-0.97911835*I, 3.0476584-0.18548489*I, 0.3842994-0.7050991*I,
-0.79540455+0.7066077*I, 8.313246+0.0*I, -1.8076122+0.8882447*I, 0.47806996-0.48494184*I, 0.5096429+0.5395974*I, -0.7285097+0.10360408*I, -1.1760061+2.7146957*I, -0.4271084-0.042899966*I, -1.7228563-2.8335886*I, 1.8942566-0.6389735*I,
0.98274714+1.3824869*I, -1.8076122-0.8882447*I, 9.367975+0.0*I, -0.1838578-0.6468568*I, -1.8338387-0.7064959*I, 0.041852742+0.6556877*I, 2.5673025-1.9732997*I, -1.1148382+0.15693812*I, 2.4704504+1.0389464*I, 1.0858271+1.298006*I,
2.619998+1.8532984*I, 0.47806996+0.48494184*I, -0.1838578+0.6468568*I, 3.1117508+0.0*I, -1.956626-0.22825956*I, 0.07081801+0.31801307*I, 0.3698375+0.5400855*I, 0.80686307-1.5315914*I, 1.5649154+1.6229297*I, -0.112077385-1.2014246*I,
-1.8306153-1.2336911*I, 0.5096429-0.5395974*I, -1.8338387+0.7064959*I, -1.956626+0.22825956*I, 3.6439795+0.0*I, -0.2594722-0.48786148*I, -0.47636223+0.27821827*I, -0.61608654+2.01858*I, -2.7767487-1.7693765*I, 0.048102796+0.9741874*I,
0.32275113+0.015575029*I, -0.7285097-0.10360408*I, 0.041852742-0.6556877*I, 0.07081801-0.31801307*I, -0.2594722+0.48786148*I, 3.624376+0.0*I, -1.6697118-0.4017511*I, -1.4397877+0.7550918*I, -0.31456697+1.0403451*I, -0.31978557-0.13701046*I,
2.1968813+1.0640624*I, -1.1760061-2.7146957*I, 2.5673025+1.9732997*I, 0.3698375-0.5400855*I, -0.47636223-0.27821827*I, -1.6697118+0.4017511*I, 6.8273163+0.0*I, -0.10051322-0.24303961*I, 1.4415971-0.29750675*I, 1.221786+0.85654986*I,
0.27894387+0.97911835*I, -0.4271084+0.042899966*I, -1.1148382-0.15693812*I, 0.80686307+1.5315914*I, -0.61608654-2.01858*I, -1.4397877-0.7550918*I, -0.10051322+0.24303961*I, 3.4057708+0.0*I, -0.5856801+1.0203559*I, 0.7103452-0.8422135*I,
3.0476584+0.18548489*I, -1.7228563+2.8335886*I, 2.4704504-1.0389464*I, 1.5649154-1.6229297*I, -2.7767487+1.7693765*I, -0.31456697-1.0403451*I, 1.4415971+0.29750675*I, -0.5856801-1.0203559*I, 7.005772+0.0*I, -0.9617417+1.2486815*I,
0.3842994+0.7050991*I, 1.8942566+0.6389735*I, 1.0858271-1.298006*I, -0.112077385+1.2014246*I, 0.048102796-0.9741874*I, -0.31978557+0.13701046*I, 1.221786-0.85654986*I, 0.7103452+0.8422135*I, -0.9617417-1.2486815*I, 3.4629636+0.0*I};
cpotrf_(&up, &n, A3, &n, info);
printf("%g+%g*I\n", creal(A3[91]), cimag(A3[91]));
// -nan+-nan*I
}

Found in, and reduced from, JuliaLang/LinearAlgebra.jl#260

@martin-frbg
Copy link
Collaborator

Can confirm for single- and multithreaded builds on Nehalem as well - using separate variables for the n and info in the zpotrs call (in case one of them were inadvertently IN+OUT) did not help. valgrind does not complain about uninitialized arrays in calls for either function. Have not tried with completely stock lapack in place of the optimized potr functions.

@tkelman
Copy link
Contributor Author

tkelman commented Nov 23, 2015

In JuliaLang/LinearAlgebra.jl#260 it was reported that this happens on every core type except sandy bridge, haswell, and the recent AMD families.

Comparing Cygwin 64's build (dynamic arch, but only blas from openblas, lapack uses netlib) vs Julia's mingw-w64 build (dynamic arch including openblas' lapack - also ILP64 and renamed symbols, but that doesn't appear to matter here):

$ gcc 13243.c -o 13243 -llapack

Tony@TK-samsung ~/julia
$ ./13243
0.653214+0.27415*I
0.653214+0.27415*I

Tony@TK-samsung ~/julia
$ OPENBLAS_CORETYPE=Nehalem ./13243
0.653214+0.27415*I
0.653214+0.27415*I

Tony@TK-samsung ~/julia
$ x86_64-w64-mingw32-gcc 13243.c -o usr/bin/13243 -Lusr/bin -lopenblas64_

Tony@TK-samsung ~/julia
$ usr/bin/13243
0.653214+0.27415*I
0.653214+0.27415*I

Tony@TK-samsung ~/julia
$ OPENBLAS_CORETYPE=Nehalem usr/bin/13243
0.653214+0.27415*I
-1.#QNAN+-1.#QNAN*I

@martin-frbg
Copy link
Collaborator

Dumping the first few members of the sa array in potrf_U_single.c right before the call to POTF2_U
shows them to be markedly different on the second invocation of cpotrf_ - at least sa[0] thru sa[11]
were zero on the first invocation, but contain nonzero values on the invocation after zpotrs. Not sure if this is significant, as sa and sb are probably just work arrays, put perhaps the problem actually is
that blas_memory_alloc does not zero the memory ?

Adding "feenableexcept(FE_DIVBYZERO | FE_INVALID | FE_OVERFLOW);" to the test case, I get the first invalid operation trap here:
cscal_k () at ../kernel/x86_64/zscal_sse.S:77
77 comiss %xmm0, %xmm15
(gdb) up
JuliaLang/julia#1 0x0000000000404e1f in cpotf2_U (args=0x7fffffffcbe0, range_m=0x0, range_n=0x0, sa=0x7ffff5827020, sb=0x7ffff5923020, myid=0) at zpotf2_U.c:91
91 SCAL_K(i, 0, 0, ONE / ajj[0], ZERO,
(gdb) up
JuliaLang/julia#2 0x0000000000403f92 in cpotrf_U_single (args=0x7fffffffcbe0, range_m=0x0, range_n=0x0, sa=0x7ffff5827020, sb=0x7ffff5923020, myid=0) at potrf_U_single.c:115
115 info = POTF2_U(args, NULL, range_n, sa, sb, 0);
(gdb) up
JuliaLang/julia#3 0x000000000040316e in cpotrf_ (UPLO=0x7fffffffd76f "UU\324\032@", N=0x7fffffffd768, a=0x7fffffffccb0, ldA=0x7fffffffd768, Info=0x7fffffffd760) at lapack/zpotrf.c:120
120 *Info = (potrf_single[uplo])(&args, NULL, NULL, sa, sb, 0);
(gdb) up
JuliaLang/julia#4 0x0000000000402f6b in main () at zpotrs.c:71
71 cpotrf_(&up, &n, A3, &n, info);

Sorry if this is all just a red herring.

@martin-frbg
Copy link
Collaborator

Barking up a few more trees I now see that the call to cdotc_k() at line 77 of zpotf2_U.c suddenly returns a NaN on the second invocation of cpotrf_ when loop variable j==2 (up to this point, the ajj[0] produced by this function matched the values seen during the first invocation of cpotrf_, and array "a" appears to be undamaged on entry). This seems to implicate zdot_sse.S, that was last changed to fix spurious NaNs in relation to JuliaLang/julia#189. (Un)fortunately my ignorance of x86 assembly language prevents me from venturing further.

@martin-frbg
Copy link
Collaborator

Sorry again, zdot is just another victim, the culprit now seems to be GEMV_U (i.e. cgemv_t.S) that starts introducing NaNs into a[22],a[23],a[42],a[43],a[62],a[63],a[82],a[83] during the j==1 iteration of the aforementioned loop in zpotf2_U.c (all other values in this instance of the a matrix, and all values in "a" before and after invocations of GEMV_U up to that point being identical to those occuring during the first cpotrf_() call).
The ...gemv_t.S saw increases in STACKSIZE and several internal changes in commit 23965f1 "fixed overflow internal buffer bug" , perhaps related although that change dates from May 2013 ?

@martin-frbg
Copy link
Collaborator

BTW matzeri's fix for JuliaLang/julia#697 has no effect on this problem, although it also touches gemv.

@martin-frbg
Copy link
Collaborator

Replacing kernel/x86_64/cgemv_t.S with the pre-23965f1 version did not fix anything either, while forcing the use of the generic C implementation (setting CGEMVTKERNEL = ../arm/zgemv_t.c in KERNEL.NEHALEM as seen in KERNEL.generic) did lead to correct results.
Playing with the STACKSIZE and GEMV_UNROLL parameters in cgemv_t.S did nothing, and I am unable to debug the actual assembly code. The NaNs appearing at elements separated by a stride of 20 do appear to hint at some kind of address miscalculation that ends up fetching spurious data from the stack ?

@jeromerobert
Copy link
Contributor

I do reproduce this bug with TARGET=NEHALEM but not with TARGET=SANDYBRIDGE yet the share the same cgemv_t.S.

@martin-frbg
Copy link
Collaborator

Perhaps it depends on what gets left in the buffer - Nehalem seems to have optimized trsm kernels (that appear to be referenced by the potrf functions) while Sandybridge and Haswell use the generic c implementations. Not sure when I get around to testing this hypothesis (and it would not explain why zeroing the buffer on entry to cgem as in JuliaLang/julia#746 does not help)

@martin-frbg
Copy link
Collaborator

Turns out to have nothing to do with trsm kernel variants - and I can actually take the entire KERNEL.SANDYBRIDGE setup file and substitute it for KERNEL.NEHALEM (most of the definitions therein being obviously unrelated to potrf/gemv), and the test case still fails in the same way unless I also make it use the generic CGEMVTKERNEL . I hope that some of my ramblings prove to be useful in debugging.

@xianyi
Copy link
Collaborator

xianyi commented Mar 1, 2016

@martin-frbg , I plan to set CGEMVTKERNEL to C kernel by default. Then, we improve the performance for some CPU architectures.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants