Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

randomize_grid does not randomize the PP grid as expected #4362

Closed
markdewing opened this issue Dec 12, 2022 · 16 comments
Closed

randomize_grid does not randomize the PP grid as expected #4362

markdewing opened this issue Dec 12, 2022 · 16 comments
Labels

Comments

@markdewing
Copy link
Contributor

markdewing commented Dec 12, 2022

I would expect the function NonLocalECPComponent::randomize_grid to randomize the grid origin across the entire sphere.
But it does not - it's missing some coverage around the "poles". See the attached graph.

The root cause is the initialization of cth from a random number. It is currently cth = myRNG() - 0.5. I think it should be cth = 1.0 - 2*myRNG().

I'm not sure this issue has a practical impact, since the rotation is only meant to change the origin of the spherical integration grid and not cover the sphere by itself. If the grid is sufficiently dense, the other points will cover the missing regions from the origin. (But I didn't actually check this)

There is another minor annoyance in that when initializing the current code with all zeros for the rng values, the result is not the identity rotation. With the corrected initialization, using zeros does result in the identity rotation.

I tried to find the origin of the formula in randomize_grid. Nothing online seemed to match, and I couldn't recreate it through multiplying 2x2 rotation matrices (though that search was not exhaustive).

Plot of random rotations applied an input vector (0,0, 1)
random_rotation

The script used to generate the plot

check_grid_randomize.tar.gz

@prckent
Copy link
Contributor

prckent commented Dec 12, 2022

Thanks Mark. Indeed this seems like a bug, since we shouldn't rely on grid density or symmetry to get the correct result. It should be correct even if we use one point, which it will not be based on your plot.

The functions in question are at https://github.com/QMCPACK/qmcpack/blob/develop/src/QMCHamiltonians/NonLocalECPComponent.cpp#L868 (we have two variants)

It is also interesting that we use 3 random numbers for this.

@prckent prckent added the bug label Dec 12, 2022
@jtkrogel
Copy link
Contributor

This is how the code looked when randomize_grid was introduced in Mar. 2005 (commit: a9b5aad, file NonLocalPPotential.cpp):

  void NonLocalPPotential::RadialPotentialSet::randomize_grid(){
    const double twopi=6.28318531;
    double phi,psi,sph,cph,sth,cth,sps,cps;
    Tensor<double,3> rmat;
    /* Rotation matrix with Euler angles defined as: 
       1) counterclockwise rotation around z (phi)
       2) clockwise rotation around y (theta)
       3) counterclockwise rotation around z (psi) */
    phi=twopi*Random(); sph=sin(phi); cph=cos(phi);
    psi=twopi*Random(); sps=sin(psi); cps=cos(psi);
    cth=2*(Random()-0.5); sth=sqrt(1-cth*cth);
    rmat(0,0)= cph*cth*cps-sph*sps;
    rmat(0,1)= sph*cth*cps+cph*sps;
    rmat(0,2)=-sth*cps;
    rmat(1,0)=-cph*cth*sps-sph*cps;
    rmat(1,1)=-sph*cth*sps+cph*cps;
    rmat(1,2)= sth*sps;
    rmat(2,0)= cph*sth;
    rmat(2,1)= sph*sth;
    rmat(2,2)= cth;
    for (vector<PosType>::iterator it=sgridxyz_m.begin(); 
	it != sgridxyz_m.end(); ++it) *it=dot(rmat,*it);
  }

Notice 1) more comments, 2) the additional factor of 2 in cth that is now missing.

@jtkrogel
Copy link
Contributor

The factor of two was removed when randomize_grid was moved to the header file a week later in commit 26eaba3

@jtkrogel
Copy link
Contributor

The overall method may be closely related to details contained in Murnahan (https://doi.org/10.1073%2Fpnas.36.11.670) and Miles (https://doi.org/10.2307/2333716).

@jtkrogel
Copy link
Contributor

Is the density uniform if you simply propagate a single vector many times with the uncorrected code?

@markdewing
Copy link
Contributor Author

The plot shown (and the script) applies many random rotations to a single vector (unit vector along the z-axis). Do you mean some other scheme - like applying random rotations to random vectors? I didn't measure the resulting distribution in any more detail than just visual inspection.

@prckent
Copy link
Contributor

prckent commented Dec 12, 2022

It is quite likely that this is a very small bias in practical calculations since we have large default integration grids and the points used in the integration grids (which are referenced to electron positions) will vary considerably unlike the fixed (0,0,1) in this test. However we should fix it since there are cases when there could be a bias, including for coarse grids. That large runs are OK can be tested since we will see if the statistical tests change meaningfully. Tests have also been done vs Hartree-Fock orbitals in several cases, and the problem did not show up.

The fix unfortunately likely requires updating a great many deterministic tests. Mark, can you please check the situation with your fix? Then we will know how much work is entailed. We also clearly need a unit test that calls this function 10^6 times and verifies that we can evaluate a simple numerical integral correctly. e.g. We would presumably fail to integrate the surface area of a sphere with the current code and single point grids.

@markdewing
Copy link
Contributor Author

Using a real, full precision build with the corrected grid randomization:

  • all the unit tests pass (not surprising, since the grid randomization function uses the rng directly, making it hard to write tests)
  • 272 out of 967 deterministic tests fail
  • 3 out 575 short tests involving systems with psuedopotentials failed, but these failures are likely random statistical failures.

(I had left some stats for the tests earlier, but deleted them after I discovered I had disabled the grid randomization entirely)

@jtkrogel
Copy link
Contributor

jtkrogel commented Dec 13, 2022

I agree with Paul. That we haven't seen a discrepancy before now in reference testing suggests any bias is very small. Likely worthwhile to explicitly test with a long DMC run (longer than a "long" test) of a small molecule (e.g. H2O) before and after the fix.

markdewing added a commit to markdewing/qmcpack that referenced this issue Dec 13, 2022
Extract the storage and generation of the random rotation matrix to make
it easier to test.

Fix the issue in QMCPACK#4362 where the distribution of the rotations does not
cover the entire sphere.
@prckent
Copy link
Contributor

prckent commented Dec 13, 2022

To gather statics, I am currently running 10 sets of all the long tests, with and without the proposed fix. This will take ~a day to run.

Added 14 Dec: Had to restart due to not setting adequate test timeouts. Rerunning with a smaller ensemble. So far it looks like the main breakage is in the deterministic tests.

@prckent
Copy link
Contributor

prckent commented Dec 13, 2022

@markdewing put the matrix rotater in Numerics and remove all physics references? This is only maths/numerics.

@markdewing
Copy link
Contributor Author

I had an idea for incremental fixing of the deterministic tests. One of the features I want for testing is to be able to turn off the grid randomization from the input file. Implementing that feature would not cause a breaking change. Input files and deterministic tests could be updated individually to have the grid randomization turned off. Then applying the fix to the randomization wouldn't break the tests. The downside is then the tests would need to be updated again if we want grid randomization to be included in the test results. But maybe only a subset of them would need to get this second update.

@prckent
Copy link
Contributor

prckent commented Dec 14, 2022

Under what circumstances is the grid reoriented today? Some QMC codes reorient every single evaluation.

@camelto2
Copy link
Contributor

Under what circumstances is the grid reoriented today? Some QMC codes reorient every single evaluation.

It looks like it is every evaluation...for example see NonLocalECPotential::evaluateImp. gets randomized a few lines into the funciton

@prckent
Copy link
Contributor

prckent commented Dec 15, 2022

Test results from 4 runs of a real CPU build gcc no mpi, ctest -R 'deterministic|long'. No short or converter (etc.) tests run.

$ grep "Test.*Failed" build_real/out_?
build_real/out_1:1019/1089 Test #1724: long-diamondC_2x1x1-Gaussian_pp_MSD-1-16-totenergy ................................................***Failed    0.04 sec
build_real/out_1:1021/1089 Test #1726: long-diamondC_2x1x1-Gaussian_pp_MSD-1-16-potential ................................................***Failed    0.04 sec
build_real/out_2:  35/1089 Test #1724: long-diamondC_2x1x1-Gaussian_pp_MSD-1-16-totenergy ................................................***Failed    0.04 sec
build_real/out_2:  36/1089 Test #1726: long-diamondC_2x1x1-Gaussian_pp_MSD-1-16-potential ................................................***Failed    0.05 sec
build_real/out_2: 978/1089 Test  #334: long-heg_14_gamma-sjb-1-16-totenergy ..............................................................***Failed    0.05 sec
build_real/out_3:  35/1089 Test #1724: long-diamondC_2x1x1-Gaussian_pp_MSD-1-16-totenergy ................................................***Failed    0.03 sec
build_real/out_3:  36/1089 Test #1726: long-diamondC_2x1x1-Gaussian_pp_MSD-1-16-potential ................................................***Failed    0.04 sec
build_real/out_3: 132/1089 Test #1389: long-diamondC_2x1x1_pp-vmc_sdj-1-16-flux ..........................................................***Failed    0.08 sec
build_real/out_3: 171/1089 Test #1014: long-diamondC_1x1x1_pp-vmc_sdj_excited-1-16-flux ..................................................***Failed    0.07 sec
build_real/out_3: 550/1089 Test  #999: long-diamondC_1x1x1_pp-vmc-dmc-allp_sdj-1-16-2-totenergy ..........................................***Failed    0.05 sec
build_real/out_3: 952/1089 Test #1844: long-monoO_1x1x1_pp-vmc_sdj3-1-16-flux ............................................................***Failed    0.05 sec
build_real/out_4: 786/1089 Test #1388: long-diamondC_2x1x1_pp-vmc_sdj-1-16-samples .......................................................***Failed    0.04 sec
build_real/out_4: 793/1089 Test  #972: long-diamondC_1x1x1_pp-vmc_sdj-1-16-flux ..........................................................***Failed    0.05 sec
build_real/out_4: 802/1089 Test #1839: long-monoO_1x1x1_pp-vmc_sdj-1-16-samples ..........................................................***Failed    0.04 sec
build_real/out_4: 809/1089 Test  #449: long-O_ae_uhf_pyscf-vmc_hf_noj-1-16-totenergy .....................................................***Failed    0.03 sec
build_real/out_4: 810/1089 Test  #448: long-O_ae_uhf_pyscf-vmc_hf_noj-1-16-kinetic .......................................................***Failed    0.03 sec
build_real/out_4: 811/1089 Test  #451: long-O_ae_uhf_pyscf-vmc_hf_noj-1-16-samples .......................................................***Failed    0.03 sec
build_real/out_4: 813/1089 Test  #450: long-O_ae_uhf_pyscf-vmc_hf_noj-1-16-eeenergy ......................................................***Failed    0.04 sec
build_real/out_4: 868/1089 Test #1724: long-diamondC_2x1x1-Gaussian_pp_MSD-1-16-totenergy ................................................***Failed    0.03 sec
build_real/out_4: 869/1089 Test #1726: long-diamondC_2x1x1-Gaussian_pp_MSD-1-16-potential ................................................***Failed    0.03 sec
$ grep "Test.*Failed" build_real/out_?|wc -l
20

So, an average of 5 failed tests per run, and only long run failures.

Now with the proposed fix, too many deterministic tests fail than should be listed here:

$ grep "Test.*Failed" build_fixed_real/out_?|head -30
build_fixed_real/out_1: 116/1089 Test  #384: deterministic-C2_pp_msdj-H5-param-grad-1-1-CIcoeff_4 ..............................................***Failed    0.04 sec
build_fixed_real/out_1: 117/1089 Test  #385: deterministic-C2_pp_msdj-H5-param-grad-1-1-ud_4 ...................................................***Failed    0.04 sec
build_fixed_real/out_1: 118/1089 Test  #386: deterministic-C2_pp_msdj-H5-param-grad-1-1-udC_25 .................................................***Failed    0.04 sec
build_fixed_real/out_1: 131/1089 Test  #498: deterministic-LiH_pp-vmc_hf_sdj_xml-1-1-totenergy .................................................***Failed    0.03 sec
build_fixed_real/out_1: 134/1089 Test  #501: deterministic-LiH_pp-vmc_hf_sdj_xml-1-1-potential .................................................***Failed    0.03 sec
build_fixed_real/out_1: 137/1089 Test  #504: deterministic-LiH_pp-vmc_hf_sdj_xml-1-1-nonlocalecp ...............................................***Failed    0.04 sec
build_fixed_real/out_1: 141/1089 Test  #508: deterministic-LiH_pp-vmc_hf_sdj_hdf5-1-1-totenergy ................................................***Failed    0.04 sec
build_fixed_real/out_1: 143/1089 Test  #511: deterministic-LiH_pp-vmc_hf_sdj_hdf5-1-1-potential ................................................***Failed    0.03 sec
build_fixed_real/out_1: 146/1089 Test  #514: deterministic-LiH_pp-vmc_hf_sdj_hdf5-1-1-nonlocalecp ..............................................***Failed    0.04 sec
build_fixed_real/out_1: 150/1089 Test  #576: deterministic-FeCO6_gms-vmc_noj-1-1-totenergy .....................................................***Failed    0.03 sec
build_fixed_real/out_1: 152/1089 Test  #579: deterministic-FeCO6_gms-vmc_noj-1-1-potential .....................................................***Failed    0.03 sec
build_fixed_real/out_1: 157/1089 Test  #582: deterministic-FeCO6_gms-vmc_noj-1-1-nonlocalecp ...................................................***Failed    0.03 sec
build_fixed_real/out_1: 159/1089 Test  #585: deterministic-FeCO6_gms-vmcbatch_noj-1-1-totenergy ................................................***Failed    0.04 sec
build_fixed_real/out_1: 162/1089 Test  #588: deterministic-FeCO6_gms-vmcbatch_noj-1-1-potential ................................................***Failed    0.03 sec
build_fixed_real/out_1: 164/1089 Test  #591: deterministic-FeCO6_gms-vmcbatch_noj-1-1-nonlocalecp ..............................................***Failed    0.03 sec
build_fixed_real/out_1: 167/1089 Test  #604: deterministic-FeCO6_pyscf-vmc_noj-1-1-totenergy ...................................................***Failed    0.04 sec
build_fixed_real/out_1: 169/1089 Test  #607: deterministic-FeCO6_pyscf-vmc_noj-1-1-potential ...................................................***Failed    0.03 sec
build_fixed_real/out_1: 172/1089 Test  #610: deterministic-FeCO6_pyscf-vmc_noj-1-1-nonlocalecp .................................................***Failed    0.04 sec
build_fixed_real/out_1: 176/1089 Test  #613: deterministic-FeCO6_pyscf-vmcbatch_noj-1-1-totenergy ..............................................***Failed    0.04 sec
build_fixed_real/out_1: 178/1089 Test  #616: deterministic-FeCO6_pyscf-vmcbatch_noj-1-1-potential ..............................................***Failed    0.04 sec
build_fixed_real/out_1: 185/1089 Test  #619: deterministic-FeCO6_pyscf-vmcbatch_noj-1-1-nonlocalecp ............................................***Failed    0.05 sec
build_fixed_real/out_1: 230/1089 Test #1026: deterministic-diamondC_1x1x1_pp-dmc-estimator-spindensity-1-1-check ...............................***Failed    0.50 sec
build_fixed_real/out_1: 232/1089 Test #1041: deterministic-diamondC_1x1x1_pp-vmc_sdj-1-1-totenergy .............................................***Failed    0.03 sec
build_fixed_real/out_1: 236/1089 Test #1020: deterministic-diamondC_1x1x1_pp-dmc-estimator-density-1-1-check ...................................***Failed    0.63 sec
build_fixed_real/out_1: 237/1089 Test #1034: deterministic-diamondC_1x1x1_pp-dmc-estimator-1rdm-1-1-check ......................................***Failed    0.49 sec
build_fixed_real/out_1: 240/1089 Test #1044: deterministic-diamondC_1x1x1_pp-vmc_sdj-1-1-potential .............................................***Failed    0.04 sec
build_fixed_real/out_1: 246/1089 Test #1047: deterministic-diamondC_1x1x1_pp-vmc_sdj-1-1-nonlocalecp ...........................................***Failed    0.05 sec
build_fixed_real/out_1: 247/1089 Test #1052: deterministic-diamondC_1x1x1_pp-vmc_sdj_excited-1-1-totenergy .....................................***Failed    0.04 sec
build_fixed_real/out_1: 250/1089 Test #1055: deterministic-diamondC_1x1x1_pp-vmc_sdj_excited-1-1-potential .....................................***Failed    0.03 sec
build_fixed_real/out_1: 257/1089 Test #1058: deterministic-diamondC_1x1x1_pp-vmc_sdj_excited-1-1-nonlocalecp ...................................***Failed    0.05 sec

$ grep "Test.*Failed" build_fixed_real/out_?|wc -l
1100

An average 275 failures per run.

$ grep "Test.*Failed" build_fixed_real/out_?| grep long
build_fixed_real/out_1: 828/1089 Test  #355: long-li2_sto-sj_dmc-1-16-variance .................................................................***Failed    0.06 sec
build_fixed_real/out_1:1016/1089 Test #1697: long-diamondC_1x1x1-Gaussian_pp_MSD-1-16-localecp .................................................***Failed    0.03 sec
build_fixed_real/out_1:1019/1089 Test #1724: long-diamondC_2x1x1-Gaussian_pp_MSD-1-16-totenergy ................................................***Failed    0.04 sec
build_fixed_real/out_1:1021/1089 Test #1726: long-diamondC_2x1x1-Gaussian_pp_MSD-1-16-potential ................................................***Failed    0.04 sec
build_fixed_real/out_2: 870/1089 Test #1724: long-diamondC_2x1x1-Gaussian_pp_MSD-1-16-totenergy ................................................***Failed    0.03 sec
build_fixed_real/out_2: 871/1089 Test #1726: long-diamondC_2x1x1-Gaussian_pp_MSD-1-16-potential ................................................***Failed    0.03 sec
build_fixed_real/out_3:  37/1089 Test #1724: long-diamondC_2x1x1-Gaussian_pp_MSD-1-16-totenergy ................................................***Failed    0.04 sec
build_fixed_real/out_3:  38/1089 Test #1726: long-diamondC_2x1x1-Gaussian_pp_MSD-1-16-potential ................................................***Failed    0.04 sec
build_fixed_real/out_3: 324/1089 Test #1389: long-diamondC_2x1x1_pp-vmc_sdj-1-16-flux ..........................................................***Failed    0.04 sec
build_fixed_real/out_3: 328/1089 Test #1013: long-diamondC_1x1x1_pp-vmc_sdj_excited-1-16-samples ...............................................***Failed    0.04 sec
build_fixed_real/out_3: 345/1089 Test  #976: long-diamondC_1x1x1_pp-vmc_sdj-meshf-1-16-samples .................................................***Failed    0.04 sec
build_fixed_real/out_3: 360/1089 Test  #357: long-li2_sto-sj_dmc-1-16-2-variance ...............................................................***Failed    0.04 sec
build_fixed_real/out_3: 367/1089 Test  #858: long-bccH_1x1x1_ae-vmc_sdj-1-16-samples ...........................................................***Failed    0.03 sec
build_fixed_real/out_4: 780/1089 Test  #977: long-diamondC_1x1x1_pp-vmc_sdj-meshf-1-16-flux ....................................................***Failed    0.07 sec
build_fixed_real/out_4: 874/1089 Test #1724: long-diamondC_2x1x1-Gaussian_pp_MSD-1-16-totenergy ................................................***Failed    0.03 sec
build_fixed_real/out_4: 875/1089 Test #1726: long-diamondC_2x1x1-Gaussian_pp_MSD-1-16-potential ................................................***Failed    0.04 sec
$ grep "Test.*Failed" build_fixed_real/out_?| grep long|wc -l
16

An average 4 failed long runs. This is actually one better than the unfixed code, and it seems that there were problems unrelated to the fix (SIGHUP?) with both long-diamondC_1x1x1_pp-vmc_sdj_excited-1-16 and long-diamondC_1x1x1_pp-vmc_sdj_meshf-1-16, explaining the reported samples failures. And none of the long nonlocal checks fails, so we know that the nonlocal pseudopotential energies are consistent.

Therefore for the elements and levels of statistical accuracies of our test set it is fair to conclude that the problem averages out but our deterministic tests are highly sensitive to it.

@ye-luo
Copy link
Contributor

ye-luo commented Jan 23, 2023

@prckent close?

@prckent prckent closed this as completed Jan 23, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

5 participants