GPU CUDA delayed updates #1279
Conversation
…o drift work. The drift version is currently not heavily optimized and is usually slower than the original code path. Based on runtime traces, the performance degradation is mostly due to:
- Two rows of the updated A inverse are needed for calculating gradients in DiracDeterminant_CUDA's calc_gradient and det_lookahead functions. Without drift, only the calculation in det_lookahead is required.
- The calc_lemma_gradient kernel in determinant_update.cu may be optimized further.
…update algorithm. The code in this case chooses k=0 (the old code path).
…rnel in determinant_update.cu. Speed optimization now comes down to optimizing the two kernels update_onemove and calc_lemma_column.
Conflicts:
	src/Particle/MCWalkerConfiguration.h
	src/QMCDrivers/VMC/VMC_CUDA.cpp
	src/QMCWaveFunctions/EinsplineSet.h
	src/QMCWaveFunctions/EinsplineSetCuda.cpp
	src/QMCWaveFunctions/Fermion/DiracDeterminantBase.h
	src/QMCWaveFunctions/Fermion/DiracDeterminantCUDA.cpp
	src/QMCWaveFunctions/Fermion/DiracDeterminantCUDA.h
	src/QMCWaveFunctions/Fermion/SlaterDet.h
	src/QMCWaveFunctions/Jastrow/OneBodyJastrowOrbitalBspline.cpp
	src/QMCWaveFunctions/Jastrow/OneBodyJastrowOrbitalBspline.h
	src/QMCWaveFunctions/Jastrow/TwoBodyJastrowOrbitalBspline.cpp
	src/QMCWaveFunctions/Jastrow/TwoBodyJastrowOrbitalBspline.h
	src/QMCWaveFunctions/OrbitalBase.h
	src/QMCWaveFunctions/SPOSetBase.cpp
	src/QMCWaveFunctions/SPOSetBase.h
	src/QMCWaveFunctions/TrialWaveFunction.h
	src/QMCWaveFunctions/TrialWaveFunction_CUDA.cpp
… Jastrows). The code has performance and correctness similar to the previous version.
Can one of the maintainers verify this patch?
ok to test
To help development & merge velocity, you could do the complex implementation in a later PR if you wished. The GPU code did not have complex support until #84 by you and @yingwaili
Okay to test |
src/Numerics/CUDA/cuda_inverse.h
Outdated
@@ -29,6 +29,26 @@ cublas_inverse (cublasHandle_t handle,
                int N, int rowStride, int numMats,
                bool useHigherPrecision = true);

void
cublas_lemma_mats (cublasHandle_t handle,
cuda_inverse.h/cu is for matrix inversion.
cublas_lemma_mats, cublas_ainv_row, cublas_smw_update are not generic wrapper functions of cublas.
Please move them to src/QMCWaveFunctions/Fermion
No production code should be in the sandbox. Put distinct functionality in different files or simply rename the cuda_inverse.h file, e.g. cuda_matrices.h.
The path I put previously was wrong. It was just my local path. Functions under Numerics should not be used by only a specific algorithm, so cuda_matrices.h is not good. Please create a new set of .h and .cu files under src/QMCWaveFunctions/Fermion.
OK
SM1 needs a fix at the moment. Regarding the complex code path, you can either fix it or protect it with a macro, depending on how much effort is needed.
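A minimal sketch of the macro-guard option, assuming the build exposes a preprocessor switch along the lines of QMC_COMPLEX for complex builds (the exact macro name and message are assumptions for illustration):

```cpp
// Illustrative guard for a not-yet-supported complex path.
// QMC_COMPLEX is assumed to be the compile-time switch selecting complex wavefunctions.
#ifdef QMC_COMPLEX
#error "GPU delayed updates do not yet support the complex code path; disable delayed updates or build with real wavefunctions."
#else
// real-valued GPU delayed-update implementation goes here
#endif
```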
Happy New Year! The Hamiltonian unit test should work now. Thank you @ye-luo for your fix.
Complex code path implemented and tested to be working on NiO S32 on SummitDev.
…pdate.cu and added missing functions.
This way, delay_rank sets the delay rank for the entire run, and the delay rank can also be overridden per QMC block in the config file for GPU delayed updates.
@ye-luo Thanks. Yes, I'll add a warning and cap the delayed-update (DU) rank at 64.
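A minimal host-side sketch of the behavior described in the two comments above (a per-block setting overriding the global delay_rank, with a warning and cap at 64); the function and variable names here are illustrative, not the actual QMCPACK input-handling code:

```cpp
#include <iostream>

// Illustrative only: choose the effective delay rank for one QMC block.
// globalDelayRank comes from the run-level delay_rank setting; blockDelayRank
// is the per-block value (<= 0 meaning "not specified in this block").
int effectiveDelayRank(int globalDelayRank, int blockDelayRank)
{
  int k = (blockDelayRank > 0) ? blockDelayRank : globalDelayRank;
  if (k > 64)
  {
    std::cerr << "Warning: delay rank " << k
              << " exceeds the supported maximum of 64 for GPU delayed updates; capping at 64.\n";
    k = 64;
  }
  return k;
}
```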
@ye-luo Great idea! Please check the current commit (I am doing the same right now). I added a synchronization at the only place that differs between drift and no drift, where we're coming out of gpu::kernelStream into the default stream.
@atillack Sadly, adding this stream synchronization doesn't help. I mean synchronization of threads within a kernel that has some launch parameter of 64 (block size?).
@ye-luo Yeah, unfortunately there is no other somewhat suspicious location that stands out. I'll add the warning and cap at k = 64 and work on finding the fix later. What makes me suspicious of numerical precision being the culprit is that with a Psi.recompute call after each step, everything seems fine (this is my NiO 128 atom data analyzed over the last 10 steps of 20 blocks of 1 step each):
Psi.recompute wipes out any bad stuff accumulated during the PbyP move. If the sampling goes wrong, the average value will go wrong even if each individual Psi is correct. Could you try running the VMC wider (more nodes) and longer (more blocks), and also check k = 80?
I am running with 4 GPUs and 128 walkers/GPU.
I didn't understand why your error bar was around 25.
Psi.recompute on GPU only updates the A inverse. This is why, when AinvU (V'A^-1 * dU) and the lemma matrix were slightly inconsistent, the SMW update (A^-1' = A^-1 - AinvU Lemma^-1 (V'A^-1)) started accumulating those errors, leading to the observed NaNs. I think there might still be some error accumulation going on that gets larger for bigger delay ranks. Part of the fix could be to go to higher precision for the SMW update (like we do for the full A^-1 update). Here is the k = 80 run data (same as above, NiO 128 atoms, 4 GPUs, 128 walkers/GPU, 1 warmup step, 1 step per block, 20 blocks, last 10 are analyzed):
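For readability, the low-rank (Sherman–Morrison–Woodbury) update being discussed can be written in generic Woodbury form; the symbol names below just mirror the quantities mentioned in the comment and are not tied to specific variables in the code:

$$
\left(A + U V^{T}\right)^{-1} = A^{-1} - A^{-1} U \left(I + V^{T} A^{-1} U\right)^{-1} V^{T} A^{-1}
$$

so with $\mathrm{AinvU} = A^{-1}U$ and $\mathrm{Lemma} = I + V^{T} A^{-1} U$, the applied rank-k correction is $A'^{-1} = A^{-1} - \mathrm{AinvU}\,\mathrm{Lemma}^{-1}\left(V^{T} A^{-1}\right)$.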
@ye-luo The error bar is likely that large because the VMC block I am running is the very first one and is not yet equilibrated. I get the same error bar that you get when I run the "official" input files.
The recomputing of A^-1 should not be the source. It is also applied in CPU mixed precision, and I can safely go up to k = 512. I am starting to worry about the V100 since I was running on Summit. Were you running on SummitDev?
The current numbers are on SummitDev, but I also get similar results on Summit. Btw, the k = 80 above was accidentally at k = 64 (the cap-at-64 code works! I forgot I had already enabled it after lunch). Here is the k = 80 data:
…ue to observed numerical errors beyond that.
@ye-luo Here is what I get when I run with k=80 using the following VMC input block:
NiO-fcc-S32-vmc series 1 -11865.977789 +/- 0.700271 408.558316 +/- 4.459451 0.0344
What is your VMC block?
Q. Does Kepler (e.g. on Titan) show the same behavior?
@atillack Your VMC block seems fine. I don't set kdelay via the VMC block. I did
@ye-luo The most recent k=80 result I posted was on SummitDev and is truly k=80. It makes no difference how the delay rank is set, but the individual section setting can override the global delay rank setting.
@ye-luo I have working results up to k=128, but I also run into the diverging variance at k=256. From my experience, recomputing during the warmup steps helps, which makes me believe it is numerical instability when more than a handful of steps go by with continuous use of SMW updates rather than full ones.
@atillack I tried SummitDev and got correct numbers for k=80, 96, 128, so I believe the issue I'm encountering is related to Summit. Unless there is any other concern, I will approve and merge the code tomorrow. The CI machine is under maintenance today.
@ye-luo @prckent It's fixed and working now! Thanks @ye-luo for the idea of looking for CUDA threading issues: in the kernel that finishes the calculation of the lemma and ainvu matrices, there was the possibility of changing data underneath another thread still trying to use that data. Here is my current SummitDev run at k=256: Rerunning the other tests...
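A minimal CUDA sketch of the class of hazard described above (an illustrative kernel, not the actual determinant_update.cu code): without a barrier between producing shared data and consuming it, one thread can overwrite or read data another thread is still working on.

```cuda
// Illustrative kernel: each thread finishes one element of a small shared
// array that other threads then read. Without __syncthreads() between the
// write and the read, a fast thread can consume stale data, or a later write
// can change data underneath a thread still using it.
__global__ void finish_and_use(float* out, const float* in, int n)
{
  extern __shared__ float col[];          // one value per thread (dynamic shared memory)
  int tid = threadIdx.x;
  if (tid < n)
    col[tid] = 2.0f * in[tid];            // "finish" this thread's element

  __syncthreads();                        // barrier: all writes visible before any read

  if (tid < n)
  {
    float sum = 0.0f;
    for (int j = 0; j < n; ++j)           // every thread reads every element
      sum += col[j];
    out[tid] = sum;
  }

  // If col[] were reused for a second stage, another __syncthreads() would be
  // needed before overwriting it, so no thread still reads the old data.
}
```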
@ye-luo @prckent Everything is working. Here is my current VMC w/o drift series on SummitDev:
I verified that runs with up to k = 256 are correct on Summit.
To get things moving forward, here is the pull request for my GPU delayed updates code. I've been presenting some aspects of this work (including profiles) at the annual ECP meeting and at Nvidia's GTC earlier this year (2018), for those interested in a more in-depth view. It is labeled work in progress and here is my todo list:
- implement VMC, VMC w/ drift, and DMC parts
- a bit of code merging is needed to use the config changes from Ye's CPU delayed updates instead of the ones I've been using
- extend to complex code path
- code cleanup (leftover from some of the different strategies tried)
I tested and profiled the code extensively using the NiO system and see about a 1.5x speedup for DMC blocks of the 256-atom NiO on Summit. This is about what one would expect if the runtime of the update_inverse kernels were reduced to close to nothing per update step.
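For readers wanting a concrete picture of the delayed (rank-k) update that this PR applies once every k accepted moves, here is a plain C++ sketch of the final application step. It assumes the intermediate products A^-1*U (N x k), Lemma^-1 (k x k), and V'A^-1 (k x N) have already been formed; the real GPU code performs this with (batched, partly higher-precision) cuBLAS GEMMs rather than explicit loops, so this is an illustration of the math, not the actual implementation.

```cpp
#include <vector>

// Illustrative only: apply  Ainv <- Ainv - AinvU * LemmaInv * VtAinv
// for an N x N inverse with delay rank k. Matrices are stored row-major
// in std::vector<double>.
void applyDelayedUpdate(std::vector<double>& Ainv,            // N x N
                        const std::vector<double>& AinvU,     // N x k
                        const std::vector<double>& LemmaInv,  // k x k
                        const std::vector<double>& VtAinv,    // k x N
                        int N, int k)
{
  // temp = LemmaInv * VtAinv   (k x N)
  std::vector<double> temp(static_cast<size_t>(k) * N, 0.0);
  for (int i = 0; i < k; ++i)
    for (int j = 0; j < N; ++j)
    {
      double s = 0.0;
      for (int m = 0; m < k; ++m)
        s += LemmaInv[i * k + m] * VtAinv[m * N + j];
      temp[i * N + j] = s;
    }

  // Ainv -= AinvU * temp       (rank-k correction of the N x N inverse)
  for (int i = 0; i < N; ++i)
    for (int j = 0; j < N; ++j)
    {
      double s = 0.0;
      for (int m = 0; m < k; ++m)
        s += AinvU[i * k + m] * temp[m * N + j];
      Ainv[i * N + j] -= s;
    }
}
```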