
Constant size matrix errors for estimator properties #2413

Closed
jtkrogel opened this issue Apr 29, 2020 · 15 comments · Fixed by #2435
@jtkrogel
Contributor

I am getting fatal aborts when attempting to run with current develop. The inputs work for a build from April 2019 and fail with develop.

The abort is right at the start of VMC, related to property (observable) handling:

=========================================================
  Start VMC
  File Root qmc.g000.s000 append = no 
=========================================================
Resetting walkers
  Adding 9 walkers to 0 existing sets
  Total number of walkers: 3.6000000000e+01
  Total weight: 3.6000000000e+01
  Resetting Properties of the walkers 1 x 8114
ERROR You cannot resize a constant size matrix beyond its initial max dimensions

This is identical to aborts present in the nightly tests:
https://cdash.qmcpack.org/CDash/testDetails.php?test=8024789&build=113981

The code change that likely caused the current abort was introduced in #2213. The nightly test goes from steadily passing to steadily failing around the merge of #2213:

https://cdash.qmcpack.org/CDash/testDetails.php?test=8026616&build=114000

Aborts of this type are similar to those seen in #2238.

The abort likely triggers every time an observable adds properties to scalar.dat. Currently both the skall and force estimators do this.

Given that standard estimators are affected, I think the in-process release (#2412) should be gated until this issue is fixed.

@jtkrogel jtkrogel added the bug label Apr 29, 2020
@jtkrogel jtkrogel added this to the v3.9.2 milestone Apr 29, 2020
@PDoakORNL
Contributor

Since the default build includes this:

SET(WALKER_MAX_PROPERTIES 256 CACHE STRING "Maximum number of properties tracked by walkers")

this is working as it should. You can't have 8114 properties per walker unless you configure with

cmake ... -DWALKER_MAX_PROPERTIES=8114

or larger.
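For readers unfamiliar with the data structure involved, here is a minimal sketch of a fixed-capacity matrix like the one producing the abort. The class and member names are illustrative only, not the actual QMCPACK implementation:

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Illustrative sketch: a matrix whose backing storage is sized once from a
// build-time maximum (e.g. WALKER_MAX_PROPERTIES) and which refuses any
// resize beyond that capacity, producing the error message quoted above.
class ConstantSizeMatrix
{
public:
  ConstantSizeMatrix(std::size_t rows, std::size_t max_cols)
      : rows_(rows), max_cols_(max_cols), data_(rows * max_cols)
  {}

  void resize(std::size_t rows, std::size_t cols)
  {
    if (rows > rows_ || cols > max_cols_)
      throw std::runtime_error(
          "You cannot resize a constant size matrix beyond its initial max dimensions");
    cols_ = cols;
  }

  std::size_t cols() const { return cols_; }

private:
  std::size_t rows_;
  std::size_t cols_ = 0;
  std::size_t max_cols_;
  std::vector<double> data_;
};
```

With a capacity of 256, asking for 8114 columns (one per requested property) throws, which matches the abort in the log above.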

@jtkrogel
Contributor Author

This is not working as it should. From QMCPACK input, users can request the skall estimator which adds elements to the property list so the data can be written to scalar.dat (the only type of output supported for this estimator).

The number of elements added depends on the system under study, and so hard wiring it into the build request is untenable.

@PDoakORNL PDoakORNL self-assigned this Apr 29, 2020
@jtkrogel
Contributor Author

Confirming that removal of the skall estimator from my input results in no abort.

@PDoakORNL
Contributor

i.e. exactly the same issue as in #2238.
I think a better error message is a good temporary fix.
I will see if I can catch the exception in the estimator or base class and issue a useful message such as:
"you must build with WALKER_MAX_PROPERTIES >= X to use this estimator"
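A rough sketch of the kind of check and message proposed here (the function name, constant, and wording are all hypothetical, not the actual QMCPACK API):

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Illustrative build-time default, standing in for WALKER_MAX_PROPERTIES.
constexpr std::size_t WALKER_MAX_PROPERTIES = 256;

// Hypothetical guard run during estimator setup: if the estimator needs
// more property slots than the build allows, fail early with an actionable
// rebuild hint instead of the generic matrix-resize abort.
void addEstimatorProperties(std::size_t required)
{
  if (required > WALKER_MAX_PROPERTIES)
    throw std::runtime_error(
        "This estimator needs " + std::to_string(required) +
        " walker properties; rebuild with cmake -DWALKER_MAX_PROPERTIES=" +
        std::to_string(required) + " or larger.");
}
```

The point is only that the check happens where the property count is known, so the message can name the exact value the user must pass to cmake.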

@jtkrogel
Contributor Author

This would at least provide a workable route in the short term. In general, I think having build flags depend on runtime requests (user input) undermines the robustness and usability of the code.

@jtkrogel
Contributor Author

At issue here are primarily the finite size corrections, for which we recently obtained a post-processing tool:

#2329

@prckent
Contributor

prckent commented Apr 29, 2020

Since we don't want people to have to recompile the code for routine finite size corrections, this will need to be addressed somehow. Improving the error message would be a good temporary fix. With better plumbing, we ought to be able to compute the maximum number of properties before creating any walkers.

@PDoakORNL
Contributor

PDoakORNL commented Apr 29, 2020

The life cycle of the walker elements and the MPI send/receive buffers was greatly complicated by dynamically sized properties. It was hard to track who was modifying properties and when, which created many opportunities for overwrites and overreads.

I'd be in favor of setting WALKER_MAX_PROPERTIES rather high by default, since it lives in host memory. The only reason it's so low right now is in case someone wants to run a huge number of walkers, in which case you need to make some tradeoffs. That is what we support now, since I have the impression that supporting huge numbers of walkers is more important.

However, by my calculation at double precision:
16000 walker properties for 10000 walkers costs about 1.3 GB of RAM for all the walker properties. This is a per-rank cost, so I don't think it would be particularly awful. I'd have to explore in more detail, but it's safe to guess this gets copied at least once somewhere, so to be safe let's say it's an order of magnitude bigger. It should still easily fit in system RAM. Of course, if you stuff something that scales as N^2 or worse into properties...

And of course there could be a cache issue caused by this (the local energies are now strided more widely through system RAM). Even with the dynamic Matrix that used to be here there was a possible design problem, though it is only worth pursuing if it proves to be a performance bottleneck.

I'd also mention that an input scheme that finished parsing and translating input to an intermediate representation before beginning construction of "simulation" objects would allow WALKER_MAX_PROPERTIES to be dropped while preserving the ability to size the properties data structure at construction time for the walkers.

@jtkrogel
Contributor Author

I think another good target for a longer-term fix is to move output data from estimators like this one to stat.h5. This would avoid the properties buffer entirely (making a small fixed-length properties buffer more robust), but would require significant updates to the qmcfinitesize post-processing tool.

@camelto2
Contributor

@jtkrogel skall does output to stat.h5; I have been using it exclusively with the qmcfinitesize tool on v3.8.0. The qmcfinitesize tool doesn't do twist averaging or anything yet, so all of that post-processing initially needs to be done by hand from the stat.h5 file and passed to the tool as an ASCII file.

I have actually never tested the scalar output for skall; I used it with the qmcfinitesize tool because twist averaging is not currently in the code.

@jtkrogel
Contributor Author

That's good. This makes it all the easier. I think all that is needed, then, is to transition forces to stat.h5 and generally wall off access to scalar.dat for more than a handful of values per observable (actual scalars).

@prckent prckent modified the milestones: v3.9.2, v3.9.3 Apr 29, 2020
@jptowns
Contributor

jptowns commented Apr 30, 2020

Just to add: if you add hdf5="yes" to the skall xml, it works just fine on develop. It crashes without it, as you pointed out. Perhaps we should make hdf5 output the default.
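For reference, the workaround looks something like this in the estimator input block. The name/source/target attributes here are assumed for illustration; the hdf5="yes" attribute is the relevant part:

```xml
<estimator type="skall" name="SkAll" source="ion0" target="e" hdf5="yes"/>
```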

@jtkrogel
Contributor Author

From this POV, fixes here do not merit an additional release IMO.

I would be in favor of removing the hdf5 option from the skall input xml and having it always do hdf output.

@prckent
Contributor

prckent commented Apr 30, 2020

It is a good suggestion to only support hdf5 output. Then we only have one code path and one format. Win.

@PDoakORNL
Contributor

I am still working on this but hit some complications in testing the fix. Should be in tomorrow.

ye-luo added a commit that referenced this issue May 7, 2020