Investigate decomp_suite failures with dynpicard option #39
Comments
Cases that are missing data
These in fact are cases where the model segfaulted/crashed, so it is the data for the case itself that is missing, not the data that we are comparing against. EDIT 2022/05: reported in CICE-Consortium#608 and subsequently fixed.

- daley_intel_smoke_gx3_8x2x8x10x20_debug_droundrobin_dynpicard_maskhalo_run2day: core file reveals ...
- daley_intel_smoke_gx3_1x6x25x29x16_debug_droundrobin_dynpicard_run2day: core file reveals same as above (need to switch thread in the core with ...)
- daley_intel_smoke_gx3_1x8x30x20x32_debug_droundrobin_dynpicard_run2day: core file reveals same as above.

Cases that fail to run
- daley_intel_restart_gx1_64x1x16x16x10_dwghtfile_dynpicard, daley_intel_smoke_gx1_64x1x16x16x10_debug_dwghtfile_dynpicard_run2day: missing an input file, see CICE-Consortium#602 (comment)
- daley_intel_restart_gx3_1x8x30x20x32_droundrobin_dynpicard: "ERROR: bad departure points"
- daley_intel_smoke_gbox180_16x1x6x6x60_debug_debugblocks_dspacecurve_dynpicard_run2day: NaNs in the gridbox_corners array... probably not related to the VP solver. EDIT: the above was already reported in CICE-Consortium#599 (comment) ("Problems in ice_grid.F90"), on the gbox128 grid.

The other 3 cases are mentioned above in the MISS section.
Cases that fail 'test'
- daley_intel_restart_gx3_4x2x25x29x4_dslenderX2_dynpicard: fails to restart exactly
- daley_intel_restart_gx3_20x2x5x4x30_dsectrobin_dynpicard_short: fails to restart exactly
- daley_intel_restart_gx3_1x4x25x29x16_droundrobin_dynpicard: fails to restart exactly
- daley_intel_restart_gx3_16x2x3x3x100_droundrobin_dynpicard: fails to restart exactly

The other cases also fail to run (see above).
The other failing tests fail because they are not BFB.
I fixed the buggy OpenMP directive:

diff --git i/cicecore/cicedynB/dynamics/ice_dyn_vp.F90 w/cicecore/cicedynB/dynamics/ice_dyn_vp.F90
index 457a73a..367d29e 100644
--- i/cicecore/cicedynB/dynamics/ice_dyn_vp.F90
+++ w/cicecore/cicedynB/dynamics/ice_dyn_vp.F90
@@ -3507,7 +3507,7 @@ subroutine precondition(zetaD , &
wx = vx
wy = vy
elseif (precond_type == 'diag') then ! Jacobi preconditioner (diagonal)
- !$OMP PARALLEL DO PRIVATE(iblk)
+ !$OMP PARALLEL DO PRIVATE(iblk, ij, i, j)
do iblk = 1, nblocks
do ij = 1, icellu(iblk)
i = indxui(ij, iblk)

So let's do another round.

Cases that are missing data

- daley_intel_smoke_gx3_1x6x25x29x16_debug_droundrobin_dynpicard_run2day: run segfaulted, because of an out-of-range access in the computation of the norm...
- daley_intel_smoke_gx3_1x8x30x20x32_debug_droundrobin_dynpicard_run2day: same as above

Cases that failed to run
- daley_intel_restart_gx1_64x1x16x16x10_dwghtfile_dynpicard, daley_intel_smoke_gx1_64x1x16x16x10_debug_dwghtfile_dynpicard_run2day: same as above (missing file)
- daley_intel_restart_gx3_1x4x25x29x16_droundrobin_dynpicard, daley_intel_restart_gx3_1x8x30x20x32_droundrobin_dynpicard: (horizontal_remap) ERROR: bad departure points

The next two are the same as in the first suite above.
Note: re-running the same suite a second time leads to different results. This really suggests that there is some non-reproducibility in the code...
Just a quick update. I'm playing with the OpenMP in the entire code and tested evp, eap, and vp. I can also confirm that running different thread counts with vp produces different answers.

If "re-running the same suite a second time leads to different results", that suggests the code is not bit-for-bit reproducible when rerun? I tried to test that, and for my quick tests the same run does seem to be reproducible. That's a little too bad, because that would be an easier problem to debug.

I also tested a 32x1x16x16x16 and a 64x1x16x16x16 case and they are not bit-for-bit. Same decomp, no OpenMP, just a different block distribution. If I get a chance, I will try to look into this more. At this point, I will probably defer further OpenMP optimization with vp. I think there are several tasks to do.
Hi Tony, thanks for these details and tests. This issue is definitely still on my list; I hope I'll have time to go back to the VP solver this winter/early spring. I'll take a look at the PR when you submit it.
I'm finally going back to this. I've re-run the decomp_suite. Summary:

$ ./results.csh | tail -5
203 measured results of 203 total results
157 of 203 tests PASSED
0 of 203 tests PENDING
0 of 203 tests MISSING data
46 of 203 tests FAILED

Cases that fail "run":

$ ./results.csh | \grep FAIL | \grep ' run'
FAIL daley_intel_restart_gx3_6x2x50x58x1_droundrobin_dynpicard run
FAIL daley_intel_smoke_gx3_6x2x50x58x1_debug_droundrobin_dynpicard_run2day run -1 -1 -1

On banting I also get these failures:

- daley_intel_restart_gx3_6x2x50x58x1_droundrobin_dynpicard fails at the first run of the test with SIGILL or SIGSEGV (varies). Usually no core file is produced (I got a core once, but it was of limited use since this case is not compiled in debug mode; when I recompiled in debug mode I did not get a core...).
- daley_intel_smoke_gx3_6x2x50x58x1_debug_droundrobin_dynpicard_run2day fails with SIGSEGV; I got a core on one machine but not the other, and a "stack smashing detected" error that I've never seen. Here is the backtrace:
Cases that fail "test":

$ ./results.csh | \grep FAIL | \grep ' test'
FAIL daley_intel_restart_gx3_6x2x50x58x1_droundrobin_dynpicard test
FAIL daley_intel_smoke_gx3_6x2x50x58x1_debug_droundrobin_dynpicard_run2day test

These are the same as above (if "run" fails, "test" fails).

Cases that fail "bfbcomp":

First, the decomp cases:

$ ./results.csh | \grep FAIL | \grep daley_intel_decomp
FAIL daley_intel_decomp_gx3_4x2x25x29x5_dynpicard_slenderX1 bfbcomp daley_intel_decomp_gx3_4x2x25x29x5_dynpicard_squarepop
FAIL daley_intel_decomp_gx3_4x2x25x29x5_dynpicard_roundrobin bfbcomp daley_intel_decomp_gx3_4x2x25x29x5_dynpicard_squarepop
FAIL daley_intel_decomp_gx3_4x2x25x29x5_dynpicard_sectcart bfbcomp daley_intel_decomp_gx3_4x2x25x29x5_dynpicard_squarepop
FAIL daley_intel_decomp_gx3_4x2x25x29x5_dynpicard_sectrobin bfbcomp daley_intel_decomp_gx3_4x2x25x29x5_dynpicard_squarepop
FAIL daley_intel_decomp_gx3_4x2x25x29x5_dynpicard_spacecurve bfbcomp daley_intel_decomp_gx3_4x2x25x29x5_dynpicard_squarepop
FAIL daley_intel_decomp_gx3_4x2x25x29x5_dynpicard_rakeX1 bfbcomp daley_intel_decomp_gx3_4x2x25x29x5_dynpicard_squarepop

And then the rest:

$ ./results.csh | \grep FAIL | \grep different-data
FAIL daley_intel_restart_gx3_1x1x50x58x4_droundrobin_dynpicard_thread bfbcomp daley_intel_restart_gx3_4x2x25x29x4_dslenderX2_dynpicard different-data
FAIL daley_intel_restart_gx3_4x1x25x116x1_dslenderX1_dynpicard_thread bfbcomp daley_intel_restart_gx3_4x2x25x29x4_dslenderX2_dynpicard different-data
FAIL daley_intel_restart_gx3_6x2x4x29x18_dspacecurve_dynpicard bfbcomp daley_intel_restart_gx3_4x2x25x29x4_dslenderX2_dynpicard different-data
FAIL daley_intel_restart_gx3_8x2x8x10x20_droundrobin_dynpicard bfbcomp daley_intel_restart_gx3_4x2x25x29x4_dslenderX2_dynpicard different-data
FAIL daley_intel_restart_gx3_5x2x33x23x4_droundrobin_dynpicard bfbcomp daley_intel_restart_gx3_4x2x25x29x4_dslenderX2_dynpicard different-data
FAIL daley_intel_restart_gx3_4x2x19x19x10_droundrobin_dynpicard bfbcomp daley_intel_restart_gx3_4x2x25x29x4_dslenderX2_dynpicard different-data
FAIL daley_intel_restart_gx3_20x2x5x4x30_dsectrobin_dynpicard_short bfbcomp daley_intel_restart_gx3_4x2x25x29x4_dslenderX2_dynpicard different-data
FAIL daley_intel_restart_gx3_16x2x5x10x20_drakeX2_dynpicard bfbcomp daley_intel_restart_gx3_4x2x25x29x4_dslenderX2_dynpicard different-data
FAIL daley_intel_restart_gx3_8x2x8x10x20_droundrobin_dynpicard_maskhalo bfbcomp daley_intel_restart_gx3_4x2x25x29x4_dslenderX2_dynpicard different-data
FAIL daley_intel_restart_gx3_1x4x25x29x16_droundrobin_dynpicard bfbcomp daley_intel_restart_gx3_4x2x25x29x4_dslenderX2_dynpicard different-data
FAIL daley_intel_restart_gx3_1x8x30x20x32_droundrobin_dynpicard bfbcomp daley_intel_restart_gx3_4x2x25x29x4_dslenderX2_dynpicard different-data
FAIL daley_intel_restart_gx3_1x1x120x125x1_droundrobin_dynpicard_thread bfbcomp daley_intel_restart_gx3_4x2x25x29x4_dslenderX2_dynpicard different-data
FAIL daley_intel_restart_gx3_16x2x1x1x800_droundrobin_dynpicard bfbcomp daley_intel_restart_gx3_4x2x25x29x4_dslenderX2_dynpicard different-data
FAIL daley_intel_restart_gx3_16x2x2x2x200_droundrobin_dynpicard bfbcomp daley_intel_restart_gx3_4x2x25x29x4_dslenderX2_dynpicard different-data
FAIL daley_intel_restart_gx3_16x2x3x3x100_droundrobin_dynpicard bfbcomp daley_intel_restart_gx3_4x2x25x29x4_dslenderX2_dynpicard different-data
FAIL daley_intel_restart_gx3_16x2x8x8x80_dspiralcenter_dynpicard bfbcomp daley_intel_restart_gx3_4x2x25x29x4_dslenderX2_dynpicard different-data
FAIL daley_intel_restart_gx3_10x1x10x29x4_dsquarepop_dynpicard_thread bfbcomp daley_intel_restart_gx3_4x2x25x29x4_dslenderX2_dynpicard different-data
FAIL daley_intel_restart_gx3_8x1x25x29x4_drakeX2_dynpicard_thread bfbcomp daley_intel_restart_gx3_4x2x25x29x4_dslenderX2_dynpicard different-data
FAIL daley_intel_smoke_gx3_1x1x25x58x8_debug_droundrobin_dynpicard_run2day_thread bfbcomp daley_intel_smoke_gx3_4x2x25x29x4_debug_dslenderX2_dynpicard_run2day different-data
FAIL daley_intel_smoke_gx3_20x1x5x116x1_debug_dslenderX1_dynpicard_run2day_thread bfbcomp daley_intel_smoke_gx3_4x2x25x29x4_debug_dslenderX2_dynpicard_run2day different-data
FAIL daley_intel_smoke_gx3_6x2x4x29x18_debug_dspacecurve_dynpicard_run2day bfbcomp daley_intel_smoke_gx3_4x2x25x29x4_debug_dslenderX2_dynpicard_run2day different-data
FAIL daley_intel_smoke_gx3_8x2x10x12x16_debug_droundrobin_dynpicard_run2day bfbcomp daley_intel_smoke_gx3_4x2x25x29x4_debug_dslenderX2_dynpicard_run2day different-data
FAIL daley_intel_smoke_gx3_5x2x33x23x4_debug_droundrobin_dynpicard_run2day bfbcomp daley_intel_smoke_gx3_4x2x25x29x4_debug_dslenderX2_dynpicard_run2day different-data
FAIL daley_intel_smoke_gx3_4x2x19x19x10_debug_droundrobin_dynpicard_run2day bfbcomp daley_intel_smoke_gx3_4x2x25x29x4_debug_dslenderX2_dynpicard_run2day different-data
FAIL daley_intel_smoke_gx3_20x2x5x4x30_debug_dsectrobin_dynpicard_run2day_short bfbcomp daley_intel_smoke_gx3_4x2x25x29x4_debug_dslenderX2_dynpicard_run2day different-data
FAIL daley_intel_smoke_gx3_16x2x5x10x20_debug_drakeX2_dynpicard_run2day bfbcomp daley_intel_smoke_gx3_4x2x25x29x4_debug_dslenderX2_dynpicard_run2day different-data
FAIL daley_intel_smoke_gx3_8x2x8x10x20_debug_droundrobin_dynpicard_maskhalo_run2day bfbcomp daley_intel_smoke_gx3_4x2x25x29x4_debug_dslenderX2_dynpicard_run2day different-data
FAIL daley_intel_smoke_gx3_1x6x25x29x16_debug_droundrobin_dynpicard_run2day bfbcomp daley_intel_smoke_gx3_4x2x25x29x4_debug_dslenderX2_dynpicard_run2day different-data
FAIL daley_intel_smoke_gx3_1x8x30x20x32_debug_droundrobin_dynpicard_run2day bfbcomp daley_intel_smoke_gx3_4x2x25x29x4_debug_dslenderX2_dynpicard_run2day different-data
FAIL daley_intel_smoke_gx3_1x1x120x125x1_debug_droundrobin_dynpicard_run2day_thread bfbcomp daley_intel_smoke_gx3_4x2x25x29x4_debug_dslenderX2_dynpicard_run2day different-data
FAIL daley_intel_smoke_gx3_16x2x1x1x800_debug_droundrobin_dynpicard_run2day bfbcomp daley_intel_smoke_gx3_4x2x25x29x4_debug_dslenderX2_dynpicard_run2day different-data
FAIL daley_intel_smoke_gx3_16x2x2x2x200_debug_droundrobin_dynpicard_run2day bfbcomp daley_intel_smoke_gx3_4x2x25x29x4_debug_dslenderX2_dynpicard_run2day different-data
FAIL daley_intel_smoke_gx3_16x2x3x3x100_debug_droundrobin_dynpicard_run2day bfbcomp daley_intel_smoke_gx3_4x2x25x29x4_debug_dslenderX2_dynpicard_run2day different-data
FAIL daley_intel_smoke_gx3_16x2x8x8x80_debug_dspiralcenter_dynpicard_run2day bfbcomp daley_intel_smoke_gx3_4x2x25x29x4_debug_dslenderX2_dynpicard_run2day different-data
FAIL daley_intel_smoke_gx3_10x1x10x29x4_debug_dsquarepop_dynpicard_run2day_thread bfbcomp daley_intel_smoke_gx3_4x2x25x29x4_debug_dslenderX2_dynpicard_run2day different-data
FAIL daley_intel_smoke_gx3_8x1x25x29x4_debug_drakeX2_dynpicard_run2day_thread bfbcomp daley_intel_smoke_gx3_4x2x25x29x4_debug_dslenderX2_dynpicard_run2day different-data

The good news: ...
daley_intel_restart_gx3_6x2x50x58x1_droundrobin_dynpicard
Running with the memory debugging library, however, hides the segfault and the code runs correctly ... On XC-50, all executables use static linking, so it's not possible for DDT to preload its memory debugging library; you have to relink your executable with DDT's memory debugging library.
Note: Adding ...

This is not easy to understand, as the wording is weird. A few notes:

In practice, ...
OK, I tested the failing tests above without ... Uncommenting the variable in the machine file (the comment was added in 8c23df8 ("Update version and copyright." (CICE-Consortium#691), 2022-02-23)) makes both tests pass.
OK, so with the little code modifications mentioned in #40 (comment), which I will push tomorrow, the tests pass [1] with:

precond = diag     # or ident, not yet tested
bfbflag = reprosum # maybe works with ddpdd and lsum16, not yet tested

This is very encouraging, as it shows not only that the OpenMP implementation is OK, but also that we did not "miss" anything MPI-related (like halo updates, etc.) in the VP implementation.

EDIT: forgot the end note: [1] I do have one failure, ...
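As background, here is a minimal sketch of the general idea behind an order-independent ("reprosum") global sum. It is not the actual ice_reprosum implementation (the real algorithm is more elaborate); the program and its variable names are only illustrative:

! Minimal sketch, NOT the actual ice_reprosum algorithm: convert each value
! to a scaled integer using a common exponent and sum the integers.  Integer
! addition is associative, so the result does not depend on the order in
! which contributions are added, i.e. on the decomposition or pe count.
program reprosum_sketch
  use, intrinsic :: iso_fortran_env, only: real64, int64
  implicit none
  real(real64) :: x(4) = [3.25_real64, -2.5_real64, 1.0e-4_real64, 1.0e6_real64]
  integer(int64) :: acc
  integer :: emax, i
  real(real64) :: scaling

  emax = maxval(exponent(x))          ! largest binary exponent in the data
  scaling = 2.0_real64**(40 - emax)   ! keep about 40 bits of each value

  acc = 0_int64
  do i = 1, size(x)
     acc = acc + nint(x(i)*scaling, int64)   ! associative integer additions
  end do

  print *, 'order-independent sum ~', real(acc, real64)/scaling
end program reprosum_sketch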
OK, unsurprisingly it also passes with ...
However, I have 3 failures with ...
ppp6_intel_smoke_gx3_6x2x50x58x1_debug_droundrobin_dynpicard_reprosum_run2day
No core. Running in DDT reveals the failure is here, at line 963: CICE/cicecore/cicedynB/infrastructure/comm/mpi/ice_reprosum.F90, lines 943 to 964 at bce31c2.
EDIT: reading the code, it seems impossible for ...

EDIT2: ... (CICE/cicecore/cicedynB/infrastructure/comm/mpi/ice_reprosum.F90, lines 659 to 680 at bce31c2), just before calling ... (CICE/cicecore/cicedynB/infrastructure/comm/mpi/ice_reprosum.F90, lines 667 to 669 at bce31c2), overflows... I'm not sure though if it's normal for ...

It ultimately comes from here: CICE/cicecore/cicedynB/infrastructure/comm/mpi/ice_reprosum.F90, lines 589 to 594 at bce31c2.

OK, so it is ... Note that I can't print ...
ppp6_intel_smoke_gx3_20x2x5x4x30_debug_dsectrobin_dynpicard_reprosum_run2day_short

Similar to the above:
The core file is truncated.
ppp6_intel_smoke_gx3_10x1x10x29x4_debug_dsquarepop_dynpicard_reprosum_run2day_thread
Only a single core (!), it is fortunately usable:
This is when computing the norm of the residual vector (Fx,Fy) just after it's been computed, so it's a bit mysterious... (well, not too much, since the global sum is over all points whereas before we were summing only ice points (using ...)).

EDIT: what is mysterious is that I should get this failure also with ...
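As an aside, here is a minimal standalone sketch (hypothetical array and index names, nothing from the CICE routines) of why switching from a sum over packed ice-point indices to a sum over the whole array exposes points that were never initialized:

! Summing only over packed "ice point" indices never touches the other
! points; summing the whole array reads every element, including
! uninitialized ones (garbage, or a floating-point exception if memory is
! filled with signalling NaNs, as in debug builds).
program sum_over_ice_points
  use, intrinsic :: iso_fortran_env, only: real64
  implicit none
  integer, parameter :: n = 5
  real(real64) :: fx(n)            ! deliberately NOT fully initialized
  integer :: indx(2) = [2, 4]      ! packed indices of the "ice" points
  integer :: ij
  real(real64) :: s_ice, s_all

  fx(2) = 1.0_real64               ! only the ice points receive values
  fx(4) = 2.0_real64

  s_ice = 0.0_real64
  do ij = 1, size(indx)            ! sum over ice points only: well defined
     s_ice = s_ice + fx(indx(ij))**2
  end do

  s_all = sum(fx**2)               ! sum over the whole array: also reads
                                   ! the uninitialized points

  print *, 's_ice =', s_ice        ! always 5.0
  print *, 's_all =', s_all        ! undefined: garbage or NaN
end program sum_over_ice_points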
So I reran with ...

EDIT: I did rebase, though...

EDIT2: I see the same failure when running with ...
OK, so in the end all 3 failures are at the same place, where we compute the norm of (Fx,Fy). These are the only variables given to ...

It's still weird that I would get the failure only in certain decompositions, though... EDIT: not that weird, since it was using uninitialized values, so anything can happen...
If I initialize (Fx,Fy) to 0, it fixes those errors, but then I ran the suite again from scratch and got some new failures (MPI aborts, bad departure points, etc.). EDIT: here are the failures (suite: ...):
I get the same results (bgen then bcmp from the same commit on ...).

I then ran the same suite with a (self-compiled) OpenMPI instead of Intel MPI, and it seems that I do not get any of these errors (still on ...).
OK, so I dug a bit into this and found this Intel MPI variable: I_MPI_CBWR. Apparently OpenMPI does that out of the box, and it seems Cray MPT does too, at least under the circumstances under which we were running on daley/banting (exclusive nodes). I did find some references to ...

I ran 2 ...
Thanks @phil-blain, that's some rough debugging. Yuck. Do we understand why the dynpicard is particularly susceptible? Why don't we see this with some other configurations?
Because dynpicard uses global sums (...).

I'll walk my steps backwards from here; I think I got to the bottom of it now.
OK. I hope MPI reductions are bit-for-bit for the same pe count / decomposition. You are finding that to be true, correct? Just to clarify, are you just seeing different results with different pe counts/decompositions? Is the global reduction in dynpicard using the internal CICE global sum method yet?
> I hope MPI reductions are bit-for-bit for the same pe count / decomposition. You are finding that to be true, correct?

Not for Intel MPI, no, unless I set this variable (I_MPI_CBWR).

> Just to clarify, are you just seeing different results with different pe counts/decompositions?

Yes, with the code in ...

> Is the global reduction in dynpicard using the internal CICE global sum method yet?

Not with the code on ...
Interesting and surprising! What machine is that? In my experience, this is a requirement of MPI in most installations and I've never seen non-reproducibility for POP-based runs, and I check it a lot (in CESM/RASM/etc). POP has a lot of global sums, so it's a good test. I assume this is just a setting on this one particular machine?
That's what I'd expect. I think the bfbcomp testing has benefited from the fact that there were no global sums (or similar) in CICE up to now.
Let me know if I can help. I think it's perfectly fine to do some "bfbcomp" testing with slower global sums for the dynpicard in particular, but to use the fastest global sums in production and other testing. The separate issue is whether the CICE global sum implementation is slower than it should be. Thanks.
It's one of our new Lenovo clusters (see https://www.hpcwire.com/off-the-wire/canadian-weather-forecasts-to-run-on-nvidia-powered-system/). I was also surprised, but if you follow the links to stackoverflow/stackexchange which I posted above, it is clearly indicated that the MPI standard only recommends that repeated runs yield the same results for collective reductions. Apparently OpenMPI follows that recommendation, but Intel MPI has to be convinced with that variable. It's an environment variable for Intel MPI, so no, it's not specific to that machine.

With Intel MPI, the non-reproducibility is (as far as I understand) linked to the pinning of MPI processes to specific CPUs. So if from run to run the ranks are pinned to different CPUs, then the reductions might give different results because the reduction algorithm takes advantage of the processor topology. If you always run on machines with exclusive node access, then it's possible that the pinning is always the same, so you do not notice the difference. That was the case on our previous Crays.
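To illustrate the order dependence (a generic sketch, nothing CICE- or MPI-specific): floating-point addition is not associative, so a reduction that pairs its operands differently from one run to the next can legitimately return different bits, which the rest of the model then amplifies:

! The same three numbers summed in two different orders give two different
! results; an MPI reduction whose internal order changes with process
! placement can do exactly this.
program fp_not_associative
  use, intrinsic :: iso_fortran_env, only: real64
  implicit none
  real(real64) :: a, b, c
  a =  1.0e16_real64
  b = -1.0e16_real64
  c =  1.0_real64
  print *, '(a + b) + c =', (a + b) + c   ! prints 1.0
  print *, 'a + (b + c) =', a + (b + c)   ! prints 0.0
end program fp_not_associative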
> I think the bfbcomp testing has benefited from the fact that there were no global sums (or similar) in CICE up to now.

Indeed.
> I think it's perfectly fine to do some "bfbcomp" testing with slower global sums for the dynpicard in particular, but to use the fastest global sums in production and other testing.

Yes, that's my plan. But I noticed that even with ...
OK, retracing my steps back. I ran 2 ...
Next step: back to my new code. I ran a ...

This is a bit unfortunate, especially the restart failures. To me this hints at a bug in the code.
I next ran the same thing, but adding ...
And I next re-ran a debug suite, with ...

Nothing unexpected here; all bfbcomp tests and all restart tests passed this time.
I reran a second identical suite, baseline-comparing with the previous one [suite: ...].

Differences start at the second time step, and they do not start at the last decimal at all:

diff --git 1/home/phb001/data/ppp6/cice/baselines//decomp-vp-repro-debug-cbwr-dynpicard/ppp6_intel_smoke_gx3_1x6x25x29x16_debug_diag1_droundrobin_dynpicard_reprosum_run2day/cice.runlog.220805-174305 2/home/phb001/data/ppp6/cice/runs//ppp6_intel_smoke_gx3_1x6x25x29x16_debug_diag1_droundrobin_dynpicard_reprosum_run2day.220808-115550/cice.runlog.220808-155837
index 5924f69..d515b0d 100644
--- 1/home/phb001/data/ppp6/cice/baselines//decomp-vp-repro-debug-cbwr-dynpicard/ppp6_intel_smoke_gx3_1x6x25x29x16_debug_diag1_droundrobin_dynpicard_reprosum_run2day/cice.runlog.220805-174305
+++ 2/home/phb001/data/ppp6/cice/runs//ppp6_intel_smoke_gx3_1x6x25x29x16_debug_diag1_droundrobin_dynpicard_reprosum_run2day.220808-115550/cice.runlog.220808-155837
@@ -922,47 +922,47 @@ heat used (W/m^2) = 2.70247926206599542 21.66078047047012589
istep1: 2 idate: 20050101 sec: 7200
(JRA55_data) reading forcing file 1st ts = /space/hall6/sitestore/eccc/cmd/e/sice500//CICE_data/forcing/gx3/JRA55/8XDAILY/JRA55_gx3_03hr_forcing_2005.nc
Arctic Antarctic
-total ice area (km^2) = 1.55991254493588358E+07 1.56018697755621299E+07
+total ice area (km^2) = 1.55989861815027408E+07 1.56018575740428381E+07
total ice extent(km^2) = 1.57251572666864432E+07 1.93395172319125347E+07
-total ice volume (m^3) = 1.48535756598763418E+13 2.40341818246218164E+13
-total snw volume (m^3) = 1.96453741997257983E+12 5.12234084165053809E+12
-tot kinetic energy (J) = 1.02831514062509187E+14 2.19297132090383406E+14
-rms ice speed (m/s) = 0.12005519150472969 0.13595187216180987
-average albedo = 0.96921950449670136 0.80142868450106208
-max ice volume (m) = 3.77905590440176198 2.86245209411921220
-max ice speed (m/s) = 0.49255344388362082 0.34786466500096180
+total ice volume (m^3) = 1.48535756598763867E+13 2.40341818246218164E+13
+total snw volume (m^3) = 1.96452855810116943E+12 5.12233980000631152E+12
+tot kinetic energy (J) = 1.10879416990511859E+14 2.30027827620289031E+14
+rms ice speed (m/s) = 0.12466465545409244 0.13923836296261000
+average albedo = 0.96921968750838638 0.80142904447165109
+max ice volume (m) = 3.77907249513972854 2.86247619373129503
+max ice speed (m/s) = 0.48479403870651594 0.35054796852363712
max strength (kN/m) = 129.27453302836647708 58.25651456094256275
----------------------------
arwt rain h2o kg in dt = 1.45524672839061462E+11 5.77214180149894043E+11

This is really hard for me to understand; I would expect any numerical error to accumulate slowly and start in the last decimals...
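For what it's worth, here is one generic way to get differences of this magnitude (a sketch, not the CICE solver): if a perturbation changes the iteration at which an iterative solver's convergence test trips, the two answers differ at the level of the solver tolerance rather than in the last bit:

! Two runs of the same fixed-point iteration, stopped after a different
! number of iterations, differ by far more than machine epsilon.
program iteration_count_difference
  use, intrinsic :: iso_fortran_env, only: real64
  implicit none
  real(real64) :: xa, xb
  integer :: k

  xa = 1.0_real64
  do k = 1, 20                     ! x <- cos(x) converges linearly to ~0.739085
     xa = cos(xa)
  end do

  xb = 1.0_real64
  do k = 1, 21                     ! one extra iteration
     xb = cos(xb)
  end do

  print *, 'after 20 iterations:', xa
  print *, 'after 21 iterations:', xb
  print *, 'difference         :', abs(xa - xb)   ! ~1e-4, nowhere near the last decimal
end program iteration_count_difference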
The above was mistakenly without ...
I then ran 2 suites with ...

First suite:

Second suite:
I then took a step back and ran the ...

I took the time to fix two bugs:
My initial fix for the first bug (ef5858e) was not sufficient as I still had two failures:
This led me to complete the bugfix in 52fd683.

First suite (note: compiled at ef5858e, only ...):

Second suite:
Then I ran again 2 ... All passed (bfbcomp, restart, compares).
Next, I ran the ...

Both run failures are "bad departure points". I recompiled the first one with ...

EDIT: the PR for that bugfix is here: CICE-Consortium#758
With that bug fixed (a4cf10e) I ran a second suite (bcmp) [suite: ...].
I checked ... and I checked ...

It seems it is these 4 tests:

A few remarks:
So I ran the ...
OK, let's get to the bottom of the "bad departure points" error. I cooked myself up a stress test suite by creating one ... I used the ...
OK, so this points to some weird OpenMP stuff in the new code.
So I scrutinized my commits and found the error: 693fd29

diff --git a/cicecore/cicedynB/dynamics/ice_dyn_vp.F90 b/cicecore/cicedynB/dynamics/ice_dyn_vp.F90
index d90a2a8..87c87ec 100644
--- a/cicecore/cicedynB/dynamics/ice_dyn_vp.F90
+++ b/cicecore/cicedynB/dynamics/ice_dyn_vp.F90
@@ -878,6 +878,8 @@ subroutine anderson_solver (icellt , icellu , &
vrel (:,:,iblk))
! Compute nonlinear residual norm (PDE residual)
+ Fx = c0
+ Fy = c0
call matvec (nx_block , ny_block , &
icellu (iblk) , icellt (iblk), &
indxui (:,iblk) , indxuj (:,iblk), &

The problem is that we are inside an OpenMP loop here, but we initialize the whole ...

call residual_vec (nx_block , ny_block , &
icellu (iblk), &
indxui (:,iblk), indxuj (:,iblk), &
bx (:,:,iblk), by (:,:,iblk), &
Au (:,:,iblk), Av (:,:,iblk), &
Fx (:,:,iblk), Fy (:,:,iblk))
enddo
!$OMP END PARALLEL DO
nlres_norm = sqrt(global_sum_prod(Fx(:,:,:), Fx(:,:,:), distrb_info, field_loc_NEcorner) + &
global_sum_prod(Fy(:,:,:), Fy(:,:,:), distrb_info, field_loc_NEcorner))
if (my_task == master_task .and. monitor_nonlin) then
write(nu_diag, '(a,i4,a,d26.16)') "monitor_nonlin: iter_nonlin= ", it_nl, &
" nonlin_res_L2norm= ", nlres_norm
endif

... was identically zero. I checked that by running my stress test suite with ...

! Compute relative tolerance at first iteration
if (it_nl == 0) then
tol_nl = reltol_nonlin*nlres_norm
endif
! Check for nonlinear convergence
if (nlres_norm < tol_nl) then
exit

In the failing runs, the aborts happened after the solver exited after only 1 nonlinear iteration, so I guess the solution was not "solved" enough, and that led to the "bad departure points" error.
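For illustration, here is a minimal standalone sketch (hypothetical arrays, not the CICE code) of this class of bug: inside an !$OMP PARALLEL DO over blocks, assigning to a whole shared array wipes out values that other threads have already written; each thread must only touch its own block, and the initialization belongs before the loop (and, as in the earlier fix, the loop/work indices must be PRIVATE):

! "f_bad" is zeroed in full inside the parallel loop, so a thread can erase
! blocks already filled by other threads; "f_good" is initialized once,
! outside the loop, and each iteration writes only its own block.
program omp_block_race
  implicit none
  integer, parameter :: nx = 4, nblocks = 8
  real(kind=8) :: f_bad(nx, nblocks), f_good(nx, nblocks)
  integer :: iblk, i

  f_good = 0.0d0                   ! correct: initialize once, before the loop

  !$OMP PARALLEL DO PRIVATE(iblk, i)
  do iblk = 1, nblocks
     f_bad = 0.0d0                 ! BUG: zeroes every block, not just block iblk
     do i = 1, nx
        f_bad (i, iblk) = real(iblk, kind=8)
        f_good(i, iblk) = real(iblk, kind=8)
     end do
  end do
  !$OMP END PARALLEL DO

  ! With more than one thread, some blocks of f_bad may end up zero, so this
  ! sum can change from run to run; sum(f_good) is always 144.
  print *, 'sum(f_bad)  =', sum(f_bad)
  print *, 'sum(f_good) =', sum(f_good)
end program omp_block_race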
Fixed in be571c5.
With this fix, the ...

So it seems I got to the bottom of everything.

EDIT: OK, new suites ...
Excellent @phil-blain, looks like this was a real challenging bug to sort out!
Thanks! Yeah, OpenMP is tricky! It definitely did not help that the failures would disappear when compiling in debug mode!
A little recap with new suites (these are all with ...):
And here are similar tests with EVP, ...

diff --git a/./configuration/scripts/tests/baseline.script b/./configuration/scripts/tests/baseline.script
index bb8f50a..82a770b 100644
--- a/./configuration/scripts/tests/baseline.script
+++ b/./configuration/scripts/tests/baseline.script
@@ -65,7 +65,7 @@ if (${ICE_BASECOM} != ${ICE_SPVAL}) then
${ICE_CASEDIR}/casescripts/comparebfb.csh ${base_dir} ${test_dir}
set bfbstatus = $status
- if ( ${bfbstatus} != 0 ) then
+ #if ( ${bfbstatus} != 0 ) then
set test_file = `ls -1t ${ICE_RUNDIR}/cice.runlog* | head -1`
set base_file = `ls -1t ${ICE_BASELINE}/${ICE_BASECOM}/${ICE_TESTNAME}/cice.runlog* | head -1`
@@ -97,7 +97,7 @@ if (${ICE_BASECOM} != ${ICE_SPVAL}) then
endif
endif
- endif
+ #endif
endif
So all in all, the same behaviour as the Picard solver with respect to global sums.
I ran an EVP decomp suite pair with ... The 3 ...
I reran the decomp_suite with VP, and ...
Original issue description:

Running the decomp_suite with the VP dynamics results in some segfaults (due to NaN initialisation), some errors ("bad departure points") and some non-BFB restarts; see CICE-Consortium#518. I'll use this issue to document my findings in investigating those.