
GPU CUDA delayed updates #1279

Merged 33 commits into QMCPACK:develop on Feb 7, 2019
Conversation

@atillack commented Dec 20, 2018

To get things moving forward, here is the pull request for my GPU delayed updates code. I've presented some aspects of this work (including profiles) at the annual ECP meeting and at Nvidia's GTC earlier this year (2018), for those who want a more in-depth view. It is labeled work in progress, and here is my todo list:

  • implement VMC, VMC w/ drift, and DMC parts

  • a bit of code merging is needed to use the config changes from Ye's CPU delayed updates instead of the ones I've been using

    • the only difference is configurability per QMC block instead of globally in the slaterdeterminant definition, but I can live with losing that
  • extend to complex code path

  • code cleanup (leftover from some of the different strategies tried)

I tested and profiled the code extensively using the NiO system and see about a 1.5x speedup for DMC blocks of the 256-atom NiO cell on Summit. This is about what one would expect if the runtime of the update_inverse kernels were reduced to close to nothing per update step.
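As a rough Amdahl-style sanity check (my own back-of-the-envelope estimate, not a measured breakdown), the observed speedup is consistent with the inverse updates originally taking roughly a third of the block time:

\[
  S \;\approx\; \frac{1}{1 - f}, \qquad S \approx 1.5 \;\Longrightarrow\; f \approx \tfrac{1}{3},
\]

where f is the fraction of the DMC block runtime spent in the update_inverse kernels, assuming the delayed-update scheme reduces that cost to nearly zero.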

Andreas Tillack added 8 commits December 22, 2017 14:09
…o drift work.

The drift version currently is not overly optimized and usually slower compared
to the original code path. Based on runtime traces, the performance degradation
is due mostly to:

- Two rows of the updated A inverse are needed for calculating gradients in
  DiracDeterminant_CUDA's calc_gradient and det_lookahead functions. Without
  drift, only the calculation in det_lookahead is required.
- The calc_lemma_gradient kernel in determinant_update.cu may be optimized
  further.
…update algorithm. Code in this case chooses k=0 (old code path).
…rnel in determinant_update.cu.

Speed optimization now comes down to optimizing the two kernels update_onemove and calc_lemma_column.
Conflicts:
	src/Particle/MCWalkerConfiguration.h
	src/QMCDrivers/VMC/VMC_CUDA.cpp
	src/QMCWaveFunctions/EinsplineSet.h
	src/QMCWaveFunctions/EinsplineSetCuda.cpp
	src/QMCWaveFunctions/Fermion/DiracDeterminantBase.h
	src/QMCWaveFunctions/Fermion/DiracDeterminantCUDA.cpp
	src/QMCWaveFunctions/Fermion/DiracDeterminantCUDA.h
	src/QMCWaveFunctions/Fermion/SlaterDet.h
	src/QMCWaveFunctions/Jastrow/OneBodyJastrowOrbitalBspline.cpp
	src/QMCWaveFunctions/Jastrow/OneBodyJastrowOrbitalBspline.h
	src/QMCWaveFunctions/Jastrow/TwoBodyJastrowOrbitalBspline.cpp
	src/QMCWaveFunctions/Jastrow/TwoBodyJastrowOrbitalBspline.h
	src/QMCWaveFunctions/OrbitalBase.h
	src/QMCWaveFunctions/SPOSetBase.cpp
	src/QMCWaveFunctions/SPOSetBase.h
	src/QMCWaveFunctions/TrialWaveFunction.h
	src/QMCWaveFunctions/TrialWaveFunction_CUDA.cpp
… Jastrows).

The code has performance and correctness similar to the previous version.
@ghost ghost assigned atillack Dec 20, 2018
@ghost ghost added the in progress label Dec 20, 2018
@qmc-robot

Can one of the maintainers verify this patch?

@PDoakORNL

ok to test

@prckent commented Dec 20, 2018

To help development & merge velocity, you could do the complex implementation in a later PR if you wished. The GPU code did not support the complex build until #84 by you and @yingwaili.

@ye-luo commented Dec 21, 2018

Okay to test

@ghost ghost assigned ye-luo Dec 21, 2018
@@ -29,6 +29,26 @@ cublas_inverse (cublasHandle_t handle,
int N, int rowStride, int numMats,
bool useHigherPrecision = true);

void
cublas_lemma_mats (cublasHandle_t handle,
@ye-luo Dec 21, 2018

cuda_inverse.h/cu is for matrix inversion.
cublas_lemma_mats, cublas_ainv_row, cublas_smw_update are not generic wrapper functions of cublas.
Please move them to src/QMCWaveFunctions/Fermion

Contributor

No production code should be in sandbox. Put distinct functionality in different files or simply rename the cuda_inverse.h file e.g. cuda_matrices.h

Contributor

The path I put previously was wrong; it was just my local path. Functions under Numerics should not be used by only a specific algorithm, so cuda_matrices.h is not good. Please create a new set of .h and .cu files under src/QMCWaveFunctions/Fermion.

Contributor

OK

@ye-luo commented Dec 21, 2018

SM1 needs a fix at the moment; short-diamondC_2x1x1_pp-vmc_sdj-1-16 is failing.

Regarding the complex code path, you can either fix it or protect it with a macro, depending on how much effort is needed.

@ye-luo ye-luo self-requested a review December 21, 2018 08:51
@atillack commented Jan 2, 2019

Happy New Year! The hamiltonian unit test should work now. Thank you @ye-luo for your fix.

@prckent prckent added this to the V3.7.0 Release milestone Jan 3, 2019
@atillack commented Jan 3, 2019

Complex code path implemented and tested to be working on NiO S32 on SummitDev.

Andreas Tillack added 5 commits January 4, 2019 13:19
This way, delay_rank sets the delay rank for the entire run and this
also allows overriding the delay rank per QMC block in the config file
for GPU delayed updates.
@atillack commented Feb 5, 2019

@ye-luo Thanks. Yes, I'll add a warning and cap DU at 64.

@atillack commented Feb 5, 2019

@ye-luo Great idea! Please check the current commit (I am doing the same right now). I added a synchronization at the only place that differs between drift and no drift, where we come out of gpu::kernelStream into the default stream.
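For reference, the kind of hand-off I mean looks roughly like the sketch below (a minimal illustration only, not the actual QMCPACK code; the locally created stream stands in for gpu::kernelStream):

#include <cuda_runtime.h>

// Minimal sketch: hand results produced on a non-default stream over to the
// default stream. In QMCPACK the non-default stream is gpu::kernelStream;
// here a locally created stream is used for illustration.
void handoff_sketch()
{
  cudaStream_t kernelStream;
  cudaStreamCreate(&kernelStream);

  // ... delayed-update kernels launched on kernelStream write device buffers ...

  // Option A: block the host until kernelStream drains, so work submitted to
  // the default stream afterwards is guaranteed to see the results.
  cudaStreamSynchronize(kernelStream);

  // Option B (no host blocking): make the default stream wait on an event
  // recorded in kernelStream.
  cudaEvent_t done;
  cudaEventCreateWithFlags(&done, cudaEventDisableTiming);
  cudaEventRecord(done, kernelStream);
  cudaStreamWaitEvent(0 /* legacy default stream */, done, 0);

  // ... kernels launched on the default stream may now consume those buffers ...

  cudaEventDestroy(done);
  cudaStreamDestroy(kernelStream);
}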

@ye-luo commented Feb 5, 2019

@atillack Sadly, adding this stream synchronization doesn't help. What I meant was synchronization of threads within a kernel that has some launch parameter of 64 (block size?).

@atillack commented Feb 5, 2019

@ye-luo Yeah, unfortunately there is no other somewhat suspicious location that stands out. I'll add the warning and cap at k = 64 and work on finding the fix later.

What makes me suspicious of numerical precision being the culprit is that with a Psi.recompute call after each step, everything seems fine (this is my NiO 128-atom data, analyzed over the last 10 of 20 blocks of 1 step each):

                            LocalEnergy               Variance           ratio 
k = 1:
avg  series 1  -11782.962089 +/- 25.977604   473.905789 +/- 40.549188   0.0402 
k = 2:
avg  series 1  -11782.905483 +/- 25.761714   503.596322 +/- 47.569417   0.0427 
k = 4:
avg  series 1  -11783.675988 +/- 25.934824   498.788050 +/- 40.720720   0.0423 
k = 8:
avg  series 1  -11783.242652 +/- 25.829249   486.117116 +/- 44.136982   0.0413 
k = 16:
avg  series 1  -11783.116865 +/- 26.016353   498.418402 +/- 44.100940   0.0423 
k = 32:
avg  series 1  -11783.063903 +/- 25.766134   515.954829 +/- 42.950156   0.0438 
k = 64:
avg  series 1  -11782.884046 +/- 25.917708   486.847984 +/- 53.088808   0.0413 
k = 128:
avg  series 1  -11782.704666 +/- 25.630978   505.765987 +/- 34.084771   0.0429 

@ye-luo commented Feb 5, 2019

Psi.recompute wipes out any bad stuff accumulated during the PbyP moves. If the sampling goes wrong, the average value will go wrong even if each individual Psi is correct. Could you try to run the VMC wider (more nodes) and longer (more blocks), and also check k = 80?
Are you using 1 walker per GPU? Try 32.

@atillack commented Feb 5, 2019

I am running with 4 GPUs and 128 walkers/GPU.

@ye-luo commented Feb 5, 2019

I don't understand why your error bar is around 25.
I'm expecting some value around sqrt(400/(4*128*20)) = 0.197642354 Hartree.
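Spelled out (taking the variance of roughly 400 Ha^2 from the table above and assuming independent samples, one per walker per block):

\[
  \sigma_{\bar E} \;\approx\; \sqrt{\frac{\mathrm{Var}[E_L]}{N_{\mathrm{samples}}}}
  \;=\; \sqrt{\frac{400}{4 \times 128 \times 20}} \;\approx\; 0.198~\mathrm{Ha},
\]

with N_samples = 4 GPUs x 128 walkers/GPU x 20 blocks.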

@atillack commented Feb 5, 2019

Psi.recompute on the GPU only updates the A inverse. This is why, when AinvU (V'A^-1 * dU) and the lemma matrix were slightly inconsistent, the SMW update (A^-1' = A^-1 - AinvU * Lemma^-1 * (V'A^-1)) started accumulating those errors, leading to the observed NaNs.

I think there might still be some error accumulation going on that gets larger for bigger delay ranks. Part of the fix could be to go to higher precision for the SMW update (like we do for the full A^-1 update).
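For readers following along, this is the standard Sherman-Morrison-Woodbury block update restated in textbook notation (the exact definition of the lemma matrix in the code may differ by sign or convention; this is just the identity, not new code):

\[
  (A + U V^{\mathsf T})^{-1}
  \;=\; A^{-1} \;-\; \underbrace{(A^{-1} U)}_{\text{AinvU}}\,
        \underbrace{\bigl(I + V^{\mathsf T} A^{-1} U\bigr)^{-1}}_{\text{Lemma}^{-1}}\,
        \bigl(V^{\mathsf T} A^{-1}\bigr),
\]

where U and V collect the k delayed rank-1 updates, so the only explicit inversion is of a small k x k matrix.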

Here is the k = 80 run data (same as above, NiO 128 atoms, 4 GPUs, 128 walkers/GPU, 1 warmup step, 1 step per block, 20 blocks, last 10 are analyzed):
k = 80:
avg series 1 -11782.911679 +/- 25.797730 479.054177 +/- 50.768536 0.0407

@atillack commented Feb 5, 2019

@ye-luo The error bar is likely that large because the VMC block I am running is the very first one and is not yet equilibrated. I get the same error bar that you get when I run the "official" input files.

@ye-luo commented Feb 5, 2019

Recomputing A^-1 should not be the source. It is also applied in the CPU mixed-precision code, and there I can safely go up to k = 512. I'm starting to worry about the V100, since I was running on Summit. Were you running on SummitDev?

@atillack commented Feb 5, 2019

The current numbers are from SummitDev, but I also get similar results on Summit. By the way, the k = 80 run above was accidentally at k = 64 (the cap-at-64 code works! I forgot I had already enabled it after lunch). Here is the k = 80 data:
LocalEnergy Variance ratio
avg series 1 -11784.798230 +/- 25.943351 548.457051 +/- 38.092702 0.0465

…ue to observed numerical errors beyond that.
@atillack commented Feb 5, 2019

@ye-luo Here is what I get when I run with k=80 using the following VMC input block:

<qmc method="vmc" move="pbyp" gpu="yes" kdelay="80">
  <estimator name="LocalEnergy" hdf5="no" />
  <parameter name="walkers">128</parameter>
  <parameter name="stepsbetweensamples"> 1 </parameter>
  <parameter name="warmupSteps"> 5 </parameter>
  <parameter name="substeps"> 5 </parameter>
  <parameter name="steps"> 2 </parameter>
  <parameter name="blocks"> 5 </parameter>
  <parameter name="timestep"> 1.0 </parameter>
  <parameter name="usedrift"> no </parameter>
</qmc>

                        LocalEnergy               Variance           ratio 

NiO-fcc-S32-vmc series 1 -11865.977789 +/- 0.700271 408.558316 +/- 4.459451 0.0344

What is your VMC block?

@prckent commented Feb 5, 2019

Q. Does Kepler (e.g. on Titan) have the same behaviors?

@ye-luo commented Feb 5, 2019

@atillack Your VMC block seems fine. I don't set kdelay via the VMC block; I did <slaterdeterminant delay_rank="80">. The error bar in your last reply seems reasonable. Anyway, this is not the real issue.
Your k = 80 results confuse me. Are they actually all capped at k = 64?
What is the real k = 80 result? Is it problematic on SummitDev?

@atillack commented Feb 5, 2019

@ye-luo The most recent k=80 result I posted was on SummitDev and is truly k=80. It makes no difference how the delay rank is set, but the individual section setting can override the global delay rank setting.

@atillack commented Feb 5, 2019

@ye-luo I have working results up to k=128, but I also run into the diverging variance at k=256. From my experience, recomputing during the warmup steps helps, which makes me believe it is numerical instability when more than a handful of steps go by with continuous use of SMW updates rather than full ones.

@ye-luo commented Feb 5, 2019

@atillack I tried SummitDev and got correct numbers for k=80, 96, and 128. So I believe the issue I'm encountering is related to Summit.
@prckent I'm afraid some kernels written for pre-Volta architectures may no longer be safe on Volta. I have not gotten a chance to try Titan yet.

Unless there is any other concern, I will approve and merge the code tomorrow. The CI machine is under maintenance today.

@atillack commented Feb 5, 2019

@ye-luo @prckent I am running on Titan right now.

@atillack commented Feb 6, 2019

@ye-luo @prckent It's fixed and working now!

Thanks @ye-luo for the idea of looking for CUDA threading issues. In the kernel that finishes the calculation of the lemma and AinvU matrices, there was the possibility of one thread changing data underneath another thread that was still trying to use it...
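Schematically, the hazard was of the following shape (a made-up minimal kernel for illustration, not the actual determinant_update.cu code): every thread overwrites its own entry while also needing the old value of an entry owned by another thread, so a barrier is required between the read and write phases.

// Hypothetical sketch of a read-then-write hazard and its fix; not the real
// determinant_update.cu kernel. Assumes a single thread block works on this
// k-sized problem, since __syncthreads only synchronizes within a block.
__global__ void finish_lemma_ainvu_sketch(float* lemma, float* ainvu, int k)
{
  int tid = threadIdx.x;

  // Phase 1: read a value that belongs to another thread.
  float old_val = 0.0f;
  if (tid < k)
    old_val = ainvu[(tid + 1) % k];

  // The fix: make sure every thread has finished reading before any thread
  // overwrites the buffer. Without this barrier, a fast thread's write in
  // Phase 2 can land underneath a slower thread's read in Phase 1.
  __syncthreads();

  // Phase 2: overwrite our own entry and use the consistently read old value.
  if (tid < k)
  {
    ainvu[tid] = 0.5f * ainvu[tid];
    lemma[tid] += old_val;
  }
}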

Here is my current SummitDev @ k=256 run:
k = 256:
NiO-fcc-S32-vmc series 1 -11865.801500 +/- 0.545776 426.167319 +/- 23.228991 0.0359

Rerunning the other tests...

@atillack commented Feb 6, 2019

@ye-luo @prckent Everything is working. Here is my current VMC w/o drift series on SummitDev:

                            LocalEnergy               Variance           ratio 
k = 1:
NiO-fcc-S32-vmc  series 1  -11865.578301 +/- 0.851429   404.349249 +/- 11.021452   0.0341 
k = 2:
NiO-fcc-S32-vmc  series 1  -11865.349947 +/- 0.916794   396.919920 +/- 7.895504   0.0335 
k = 4:
NiO-fcc-S32-vmc  series 1  -11866.000019 +/- 0.772399   424.014092 +/- 22.672240   0.0357 
k = 8:
NiO-fcc-S32-vmc  series 1  -11865.711592 +/- 0.675608   403.633995 +/- 5.521675   0.0340 
k = 16:
NiO-fcc-S32-vmc  series 1  -11865.642270 +/- 0.453154   401.845648 +/- 11.238411   0.0339 
k = 32:
NiO-fcc-S32-vmc  series 1  -11865.794974 +/- 0.935828   409.392814 +/- 12.941130   0.0345 
k = 64:
NiO-fcc-S32-vmc  series 1  -11865.310495 +/- 0.519327   403.773517 +/- 11.573975   0.0340 
k = 128:
NiO-fcc-S32-vmc  series 1  -11865.702223 +/- 0.690361   421.433592 +/- 6.115831   0.0355 
k = 256:
NiO-fcc-S32-vmc  series 1  -11865.801500 +/- 0.545776   426.167319 +/- 23.228991   0.0359 

@ye-luo ye-luo changed the title [WIP] GPU delayed updates GPU delayed updates Feb 7, 2019
@ye-luo ye-luo changed the title GPU delayed updates GPU CUDA delayed updates Feb 7, 2019
@ye-luo commented Feb 7, 2019

I verified that runs with up to k = 256 are correct on Summit.

@ye-luo ye-luo merged commit bf1a98e into QMCPACK:develop Feb 7, 2019
@ghost ghost removed the in progress label Feb 7, 2019