
Constant size matrix errors for estimator properties #2413

Closed
jtkrogel opened this issue Apr 29, 2020 · 15 comments · Fixed by #2435
@jtkrogel
Contributor

I am getting fatal aborts when attempting to run with current develop. The inputs work for a build from April 2019 and fail with develop.

The abort is right at the start of VMC, related to property (observable) handling:

=========================================================
  Start VMC
  File Root qmc.g000.s000 append = no 
=========================================================
Resetting walkers
  Adding 9 walkers to 0 existing sets
  Total number of walkers: 3.6000000000e+01
  Total weight: 3.6000000000e+01
  Resetting Properties of the walkers 1 x 8114
ERROR You cannot resize a constant size matrix beyond its initial max dimensions

This is identical to aborts present in the nightly tests:
https://cdash.qmcpack.org/CDash/testDetails.php?test=8024789&build=113981

The code change that likely caused the current abort was introduced in #2213. The nightly test goes from steadily passing to steadily failing around the merge of #2213:

https://cdash.qmcpack.org/CDash/testDetails.php?test=8026616&build=114000

Aborts of this type are similar to those seen in #2238.

The abort likely triggers every time an observable adds properties to scalar.dat. Currently both the skall and force estimators do this.

Given that standard estimators are affected, I think the in-process release (#2412) should be gated until this issue is fixed.

@jtkrogel jtkrogel added the bug label Apr 29, 2020
@jtkrogel jtkrogel added this to the v3.9.2 milestone Apr 29, 2020
@PDoakORNL
Contributor

Since the default build includes this:

SET(WALKER_MAX_PROPERTIES 256 CACHE STRING "Maximum number of properties tracked by walkers")

this is working as it should. You can't have 8114 properties per walker unless you configure with

cmake ... -DWALKER_MAX_PROPERTIES=8114

or larger.
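For readers unfamiliar with the data structure involved, here is a minimal sketch of a fixed-capacity matrix like the one producing the abort. The class and member names are illustrative only, not the actual QMCPACK implementation:

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Illustrative sketch: a matrix whose backing storage is sized once from a
// build-time maximum (e.g. WALKER_MAX_PROPERTIES) and which refuses any
// resize beyond that capacity, producing the error message quoted above.
class ConstantSizeMatrix
{
public:
  ConstantSizeMatrix(std::size_t rows, std::size_t max_cols)
      : rows_(rows), max_cols_(max_cols), data_(rows * max_cols)
  {}

  void resize(std::size_t rows, std::size_t cols)
  {
    if (rows > rows_ || cols > max_cols_)
      throw std::runtime_error(
          "You cannot resize a constant size matrix beyond its initial max dimensions");
    cols_ = cols;
  }

  std::size_t cols() const { return cols_; }

private:
  std::size_t rows_;
  std::size_t cols_ = 0;
  std::size_t max_cols_;
  std::vector<double> data_;
};
```

With a capacity of 256, asking for 8114 columns (one per requested property) throws, which matches the abort in the log above.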

@jtkrogel
Contributor Author

This is not working as it should. From QMCPACK input, users can request the skall estimator which adds elements to the property list so the data can be written to scalar.dat (the only type of output supported for this estimator).

The number of elements added depends on the system under study, and so hard wiring it into the build request is untenable.

@PDoakORNL PDoakORNL self-assigned this Apr 29, 2020
@jtkrogel
Contributor Author

Confirming that removal of the skall estimator from my input results in no abort.

@PDoakORNL
Contributor

i.e. exactly the same issue as in #2238.
I think a better error message is a good temporary fix.
I will see if I can catch the exception in the estimator or base class and issue a useful message such as:
"you must build with WALKER_MAX_PROPERTIES >= X to use this estimator"
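A rough sketch of the kind of check and message proposed here (the function name, constant, and wording are all hypothetical, not the actual QMCPACK API):

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Illustrative build-time default, standing in for WALKER_MAX_PROPERTIES.
constexpr std::size_t WALKER_MAX_PROPERTIES = 256;

// Hypothetical guard run during estimator setup: if the estimator needs
// more property slots than the build allows, fail early with an actionable
// rebuild hint instead of the generic matrix-resize abort.
void addEstimatorProperties(std::size_t required)
{
  if (required > WALKER_MAX_PROPERTIES)
    throw std::runtime_error(
        "This estimator needs " + std::to_string(required) +
        " walker properties; rebuild with cmake -DWALKER_MAX_PROPERTIES=" +
        std::to_string(required) + " or larger.");
}
```

The point is only that the check happens where the property count is known, so the message can name the exact value the user must pass to cmake.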

@jtkrogel
Contributor Author

This would at least provide a workable route in the short term. In general, I think having build flags depend on runtime requests (user input) undermines the robustness and usability of the code.

@jtkrogel
Contributor Author

At issue here are primarily the finite size corrections, for which we recently obtained a post-processing tool:

#2329

@prckent
Contributor

prckent commented Apr 29, 2020

Since we don't want people to have to recompile the code for routine finite size corrections, this will need to be addressed somehow. Improving the error message would be a good temporary fix. With better plumbing, we ought to be able to compute the maximum number of properties before creating any walkers.

@PDoakORNL
Contributor

PDoakORNL commented Apr 29, 2020

The life cycle of the walker elements and the MPI send/receive buffers was greatly complicated by dynamically sized properties. It was hard to track who was modifying properties and when, which created many opportunities for overwrites and overreads.

I'd be in favor of setting WALKER_MAX_PROPERTIES rather high by default, since it lives in host memory. The only reason it's so low right now is in case someone wants to run a huge number of walkers, in which case you need to make some tradeoffs. That is what we support now, since I have the impression that supporting huge numbers of walkers is more important.

However, by my calculation at double precision:
16000 walker properties for 10000 walkers costs about 1.3 GB of RAM for all the walker properties. This is a per-rank cost, so I don't think it would be particularly awful. I'd have to explore in more detail, but it's safe to guess this gets copied at least once somewhere, so to be safe let's say it's an order of magnitude bigger. It should still easily fit in system RAM. Of course, if you stuff something that scales as N^2 or worse into properties...

And of course there could be a cache issue caused by this (the local energies are now strided more widely through system RAM). Even with the dynamic Matrix that used to be here there was a possible design problem, though it is only worth pursuing if it proves to be a performance bottleneck.

I'd also mention that an input scheme that finished parsing and translating input to an intermediate representation before beginning construction of "simulation" objects would allow WALKER_MAX_PROPERTIES to be dropped while preserving the ability to size the properties data structure at construction time for the walkers.

@jtkrogel
Contributor Author

I think another good target for a longer-term fix is to move output data from estimators like this one to stat.h5. This would avoid the properties buffer entirely (making a small fixed-length properties buffer more robust), but would require significant updates to the qmcfinitesize post-processing tool.

@camelto2
Contributor

@jtkrogel skall does output to stat.h5; I have been using it exclusively with the qmcfinitesize tool on v3.8.0. The qmcfinitesize tool doesn't do twist averaging or anything yet, so all of that post-processing initially needs to be done by hand from the stat.h5 file and passed to the tool as an ASCII file.

I have actually never tested the scalar output for skall; I used it with the qmcfinitesize tool because twist averaging is not currently in the code.

@jtkrogel
Contributor Author

That's good. This makes it all the easier. I think all that is needed, then, is to transition forces to stat.h5 and generally wall off access to scalar.dat for more than a handful of values per observable (actual scalars).

@prckent prckent modified the milestones: v3.9.2, v3.9.3 Apr 29, 2020
@jptowns
Contributor

jptowns commented Apr 30, 2020

Just to add: if you add hdf5="yes" to the skall xml, it works just fine on develop. It crashes without it, as you pointed out. Perhaps we should make hdf5 output the default.
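For reference, the workaround looks something like this in the estimator input block. The name/source/target attributes here are assumed for illustration; the hdf5="yes" attribute is the relevant part:

```xml
<estimator type="skall" name="SkAll" source="ion0" target="e" hdf5="yes"/>
```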

@jtkrogel
Contributor Author

From this POV, fixes here do not merit an additional release IMO.

I would be in favor of removing the hdf5 option from the skall input xml and having it always do hdf output.

@prckent
Contributor

prckent commented Apr 30, 2020

It is a good suggestion to only support hdf5 output. Then we only have one code path and one format. Win.

@PDoakORNL
Contributor

I am still working on this but hit some complications in testing the fix. Should be in tomorrow.

ye-luo added a commit that referenced this issue May 7, 2020