high overhead on blue gene #85

Closed
mwkrentel opened this issue May 21, 2018 · 4 comments

Comments

@mwkrentel

Hpcrun seems to add high overhead on Blue Gene. The master branch slows
the OpenMP solve phase of AMG 2006 by more than 2x. The ompt-tr4 branch
with the LLVM libomp runtime slows it even more.

This is with AMG 2006 on mira/cetus at ANL, 8 nodes, 8 MPI ranks,
16 openmp threads, problem size (-r) 16,16,16. AMG compiled with gnu,
flags '-g -O2', run with WALLCLOCK at 8500 (118 samples/sec).

AMG 2006 native, no toolkit.

wall clock time = 13.350482 seconds
wall clock time = 205.818907 seconds
wall clock time = 16.934752 seconds

Toolkit master, regular libgomp.

wall clock time = 31.799200 seconds
wall clock time = 241.473654 seconds
wall clock time = 43.120992 seconds

Branch ompt-tr4 with llvm libomp runtime and OMP_IDLE.

wall clock time = 35.795240 seconds
wall clock time = 247.430433 seconds
wall clock time = 72.394108 seconds

That's about 2.4x to 2.5x for phases 1 and 3 with master, and about
2.7x (phase 1) to over 4x (phase 3) for ompt.
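
For reference, the slowdown factors can be recomputed from the wall clock times listed above. This is only a small sketch of the arithmetic; the list names and the `slowdown` helper are made up for illustration, and the three entries are taken to be phases 1 to 3 in the order printed.

```python
# Wall clock times (seconds) copied from the three runs above, phases 1-3.
native = [13.350482, 205.818907, 16.934752]   # no toolkit
master = [31.799200, 241.473654, 43.120992]   # toolkit master, libgomp
ompt = [35.795240, 247.430433, 72.394108]     # ompt-tr4, LLVM libomp

def slowdown(measured, baseline):
    """Per-phase ratio of measured time to baseline time."""
    return [m / b for m, b in zip(measured, baseline)]

master_ratio = slowdown(master, native)
ompt_ratio = slowdown(ompt, native)

for phase, (m, o) in enumerate(zip(master_ratio, ompt_ratio), start=1):
    print(f"phase {phase}: master {m:.2f}x, ompt-tr4 {o:.2f}x")
```

Phase 2 is much less affected than phases 1 and 3 in both configurations (roughly 1.2x).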

@mwkrentel

mwkrentel commented May 22, 2018

Happens on mira/cetus (blue gene) with WALLCLOCK@8500 and
PAPI_TOT_CYC@14,500,000.

Does NOT happen on theta (x86, Cray XYZ) with REALTIME@8500.

Does NOT happen on biou (power7) with REALTIME@8500, only 2% overhead.

I could try poman (power 8), but if it doesn't happen on biou, then
it's not going to happen on poman, and it's probably blue gene specific.

@jmellorcrummey

@mwkrentel

I tried inserting PAPI interrupts into openmp regions directly,
outside of hpcrun. With PAPI_TOT_CYC at 8,000,000 (200/sec) and
16 threads, I get 1-2% overhead.
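
The sample rates quoted in this thread follow from the event periods. A minimal sketch of the conversion, assuming the WALLCLOCK period is in microseconds and a 1.6 GHz core clock on Blue Gene (the clock is an assumption here; it is the value implied by 8,000,000 cycles giving 200/sec). The function names are hypothetical.

```python
def rate_from_period_usec(period_usec):
    """Samples per second for a timer event with the given period in usec."""
    return 1_000_000 / period_usec

def rate_from_cycle_threshold(threshold_cycles, clock_hz):
    """Samples per second for a cycle-counter event with the given threshold."""
    return clock_hz / threshold_cycles

# Assumed core clock, inferred from "8,000,000 (200/sec)" above.
BGQ_CLOCK_HZ = 1.6e9

wallclock_rate = rate_from_period_usec(8500)                       # WALLCLOCK@8500
papi_test_rate = rate_from_cycle_threshold(8_000_000, BGQ_CLOCK_HZ)
papi_hpcrun_rate = rate_from_cycle_threshold(14_500_000, BGQ_CLOCK_HZ)

print(f"WALLCLOCK@8500:          {wallclock_rate:.1f} samples/sec")
print(f"PAPI_TOT_CYC@8,000,000:  {papi_test_rate:.1f} samples/sec")
print(f"PAPI_TOT_CYC@14,500,000: {papi_hpcrun_rate:.1f} samples/sec")
```

Under that assumption the standalone PAPI test sampled at a higher rate than either hpcrun configuration, so the sample rate alone does not account for the slowdown.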

So, it's not interrupts breaking something in the MPI or openmp
synchronization.

This suggests that there's something inside the hpcrun interrupt
handler that's taking too long. Maybe something in the concurrent
skip list, maybe something that synchronizes between threads.

But the big mystery remains why this happens on blue gene but not on
power7 and not on KNL. Some key part of the bug must be blue gene
specific.

@mwkrentel

Now fixed in commit ccb6bf7, at least for blue gene and powerpc.

Turns out that inside the unwinder, libunwind was calling mmap() which
is very slow on blue gene, causing some samples to take 350,000 usec.
The solution is to turn off calls to libunwind for powerpc, including
blue gene.

But I can tell there is a similar problem to a smaller degree on theta
at ANL. On mira (blue gene), the calls to libunwind to compute new
unwind intervals take 350,000 usec; on theta they take 25,000 usec,
compared to approx 100 usec for most unwinds.
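
To get a feel for how much a few slow samples matter, here is a rough single-thread model: overhead is the sample rate times the mean time per sample. The per-sample costs are the figures quoted above; the fraction of samples that hit the slow libunwind path is not reported in this thread, so the 1% used below is purely a made-up illustration, and the model ignores other threads stalling at barriers.

```python
NORMAL_UNWIND_USEC = 100                                 # approx cost of most unwinds
SLOW_UNWIND_USEC = {"mira": 350_000, "theta": 25_000}    # slow libunwind path
SAMPLES_PER_SEC = 1_000_000 / 8500                       # WALLCLOCK@8500

def overhead_fraction(slow_usec, slow_fraction,
                      rate=SAMPLES_PER_SEC, normal_usec=NORMAL_UNWIND_USEC):
    """Fraction of each second spent in the sample handler (one thread)."""
    mean_usec = slow_fraction * slow_usec + (1 - slow_fraction) * normal_usec
    return rate * mean_usec / 1_000_000

# How many normal unwinds one slow sample costs.
cost_ratio = {m: usec / NORMAL_UNWIND_USEC for m, usec in SLOW_UNWIND_USEC.items()}

HYPOTHETICAL_SLOW_FRACTION = 0.01   # assumption, not a measured value
for machine, slow_usec in SLOW_UNWIND_USEC.items():
    base = overhead_fraction(slow_usec, 0.0)
    slow = overhead_fraction(slow_usec, HYPOTHETICAL_SLOW_FRACTION)
    print(f"{machine}: one slow sample = {cost_ratio[machine]:.0f} normal unwinds, "
          f"overhead {base:.1%} with no slow samples, "
          f"{slow:.1%} with {HYPOTHETICAL_SLOW_FRACTION:.0%} slow")
```

Even a small share of 350,000 usec samples dominates the total cost in this model, which fits the fix of avoiding the libunwind path on powerpc.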

@dxnguyen2014

dxnguyen2014 commented Jul 5, 2018 via email
