-
Notifications
You must be signed in to change notification settings - Fork 139
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Mpi walker transmission fix #2232
Mpi walker transmission fix #2232
Conversation
@@ -22,14 +22,14 @@ ENDIF(QMC_OMP) | |||
# Set clang specific flags (which we always want) | |||
ADD_DEFINITIONS( -Drestrict=__restrict__ ) | |||
|
|||
SET(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} -fomit-frame-pointer -fstrict-aliasing") | |||
SET(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -fomit-frame-pointer -fstrict-aliasing -D__forceinline=inline") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was this causing problems in debug build?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes it appears that the first -fomit -fno-omit wins.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CMAKE_XXX_FLAGS is added to all builds.
For a build release CMake invokes both $CMAKE_XXX_FLAGS $CMAKE_XXX_FLAGS_RELEASE
. If all the builds are intended, add to CMAKE_XXX_FLAGS. If release builds are intended, add to CMAKE_XXX_FLAGS_RELEASE
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure what you mean the CMAKE_C(XX)_FLAGS should not contain -f-omit-frame-pointer
if you don't want that in the flags for the debug build. This fixes that. While leaving this optimization in the default RelWithDebugInfo and Release.
++it_walker_elecs; | ||
} | ||
for(int i = 0; i < mcp_walkers_.size(); ++i) | ||
walker_elecs_[i].get().loadWalker(mcp_walkers_[i], true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Like this much better than iterators.
@@ -30,7 +30,7 @@ TEST_CASE("DMCBatched::calc_default_local_walkers", "[drivers]") | |||
|
|||
|
|||
auto testWRTWalkersPerRank = [&dtest](int walkers_per_rank) { | |||
DMCBatched dmc_batched(std::move(dtest())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need to fix here if it works now, but it is unsafe. When I was working on #2221, I noticed the MCPopulation move constructor and move assign operator cannot be compiler generated by default because ParticleSet cannot be moved. The reason is that the compiler tries to use copy assign operator of ParticleSet which should be either deleted or implemented just like the copy constructor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since most of this PR has been long exploration of the unintended consequences of "unsafe" things I'd like to just leave this fixed.
@@ -42,7 +42,6 @@ OperatorBase::OperatorBase() : myIndex(-1), Dependants(0), Value(0.0), tWalker(0 | |||
*/ | |||
void OperatorBase::mw_evaluate(const RefVector<OperatorBase>& O_list, const RefVector<ParticleSet>& P_list) | |||
{ | |||
#pragma omp parallel for |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do not change this. This is needed so far. I verified that this works very well with OMP_NUM_THREADS=16 and 1 crowd in the input.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So now this is OMP_NUM_THREADS=1,16 behavior but induced by forcing one crowd? I'll put this back in for now (as I expect this has to do with AMD support), but we need to continue to abstract threading and the actual runtime thread setup needs to be handled in one place and not via a proprietary colleciton of ENV. The adhoc nature of OMP thread safety and its opaqueness when reading the source and compiling time is going to continue to be an issue until we do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If running 1 crowd is requested, the first OpenMP level is an inactive threading level, there is no need to set nested threading. When using a list of thread numbers like OMP_NUM_THREADS=1,16, they refer to active threading levels. I used this on Summit to compare with the CUDA code which is conceptually 1 crowd.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I get what you mean about the semantics of openmp but this throws threading control back to implicit openmp threading. In application threading model we should be using to reason about thread scopes and synchronization this is a 1,16
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can discuss in person with more details. This raw OpenMP implementation doesn't ignore the crowd level concurrency. Both crowd level and walker level are needed.
…cpack into mpi_walker_transmission_fix
General comment: Getting MPI working with DMC is blocking the release. Once this is in the focus should be on getting correct statistics. Threading philosophy and design discussions will need to be done, but later. |
@@ -457,11 +457,13 @@ void DMCBatched::setMultiplicities(const DMCDriverInput& dmcdriver_input, | |||
auto setMultiplicity = [&dmcdriver_input, &rng](MCPWalker& walker) { | |||
constexpr RealType onehalf(0.5); | |||
constexpr RealType cone(1); | |||
RealType M = walker.Weight; | |||
RealType M; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I submitted #2243 to follow up the remaining issue here.
Not sure if this should fix the non-MPI case as well or not.
|
I ran the same input file with 2 MPI and the result is still far from reference value -21.84497500 listed above.
More fixes are on the way? |
putting this here mostly to get CI, but nearly there.