PMIx performance issue #394

Closed
nkogteva opened this issue Feb 13, 2015 · 35 comments

@nkogteva

We compared the execution time of the following intervals of mpi_init:

  • time from start to completion of rte_init
  • time from completion of rte_init to modex
  • time to execute modex
  • time from modex to first barrier
  • time to execute barrier
  • time from barrier to complete mpi_init

for the 1.8 and trunk versions (configured with --enable-timing), measured on the vpid-0 process using the c_hello test.
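For reference, the timed application does no communication at all; a minimal sketch of what a c_hello-style test looks like (not the exact orte/test source) is:

```c
/* Minimal hello-world sketch: no communication, so everything being
 * measured above is MPI_Init/MPI_Finalize (launch, modex, barrier) cost. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello, world, I am %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}
```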

We noticed that the time from the modex to the first barrier is excessive in the trunk version. We also compared the time of the main PMIx procedures using the pmix test (orte/test/mpi/pmix.c), and it appears that PMIx_Get is the most time-consuming procedure.
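To show which steps the pmix-test timers correspond to, here is a hand-written sketch using the standalone PMIx client API (PMIx v2-style signatures; the "my.endpoint" key is made up). The actual orte/test/mpi/pmix.c test drives OMPI's internal components, so its calls differ, but the put/commit, fence, and get phases map onto modex_send, fence, and modex_recv respectively:

```c
#include <stdio.h>
#include <pmix.h>

int main(void)
{
    pmix_proc_t myproc, peer;
    pmix_value_t val, *rval = NULL;
    char info[] = "some-endpoint-info";

    if (PMIX_SUCCESS != PMIx_Init(&myproc, NULL, 0)) {
        return 1;
    }

    /* "modex_send": cache a key/value pair locally; no communication yet */
    PMIX_VALUE_LOAD(&val, info, PMIX_STRING);
    PMIx_Put(PMIX_GLOBAL, "my.endpoint", &val);
    PMIx_Commit();

    /* "fence": collective that exchanges the committed data */
    PMIx_Fence(NULL, 0, NULL, 0);

    /* "modex_recv": retrieve a peer's value (rank 0 here) */
    PMIX_PROC_LOAD(&peer, myproc.nspace, 0);
    if (PMIX_SUCCESS == PMIx_Get(&peer, "my.endpoint", NULL, 0, &rval)) {
        printf("rank %u got rank 0's endpoint info\n", myproc.rank);
        PMIX_VALUE_RELEASE(rval);
    }

    PMIx_Finalize(NULL, 0);
    return 0;
}
```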

Further investigation is needed.

@nkogteva nkogteva added this to the Open MPI 1.9 milestone Feb 13, 2015
@jladd-mlnx
Member

@nkogteva Please include the data you collected on the TACC system.

@nkogteva
Author

mpi_init intervals (c_hello test: release vs. trunk):
[plots: time from barrier to complete mpi_init; time from completion of rte_init to modex; time from modex to first barrier; time to execute barrier; time to execute modex]

time utility output (c_hello test: release vs. trunk):
[plots: real; sys; user]

pmix intervals (pmix test: trunk only):
[plots: fence; modex_recv; modex_send; total]

@hppritcha
Member

@nkogteva in collecting the data, how many MPI ranks/node were you using?

@jladd-mlnx
Member

@jsquyres @rhc54 @hppritcha Adding other folks to this issue. This data was collected on the Stampede system at TACC. There are two sets of experimental data here:

  1. Launching a simple MPI hello world application.
  2. Launching a native PMIx unit test with mpirun and invoking PMIx calls directly. The experiments following "pmix intervals (pmix test: trunk only):" are the second such set of tests.

Nadia told me this morning that all of the runs had 8 processes per node, although looking at the Stampede system overview, I think she may be mistaken. Each Stampede node has dual 8-core Xeon E5 processors. @nkogteva, please confirm the number of processes per node that you ran in these experiments; better yet, please provide a representative command line.

@rhc54
Contributor

rhc54 commented Feb 18, 2015

Looking at the numbers, it might be good to check that you have coll/ml turned "off" here, as it will otherwise bias the data.

@rhc54
Contributor

rhc54 commented Feb 18, 2015

Looking at your unit test, there is clearly something wrong in the measurements. The "modex_send" call cannot be a function of the number of nodes - all it does is cause the provided value to be packed into a local buffer, and then it returns. So there is nothing about modex_send that extends beyond the proc itself, and the time has to be a constant (and pretty darned small, as the scale on your y-axis shows).
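A purely conceptual sketch of that point (this is not the actual ORTE/PMIx code, just an illustration of why the cost cannot depend on the job size): modex_send amounts to appending the value to a process-local buffer and returning.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical local cache, for illustration only. */
typedef struct {
    char  *bytes;
    size_t len;
    size_t cap;
} local_modex_cache_t;

/* Serialize a value into the local cache.  Cost is O(length of the value);
 * nothing here depends on the number of procs or nodes in the job. */
static int modex_send_local(local_modex_cache_t *cache,
                            const void *value, size_t len)
{
    if (cache->len + len > cache->cap) {
        size_t newcap = 2 * (cache->cap + len);
        char *tmp = realloc(cache->bytes, newcap);
        if (NULL == tmp) {
            return -1;                 /* out of memory */
        }
        cache->bytes = tmp;
        cache->cap = newcap;
    }
    memcpy(cache->bytes + cache->len, value, len);
    cache->len += len;
    return 0;                          /* no communication happened */
}
```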

I'm working on this area right now, so perhaps it is worth holding off until we get the revised pmix support done. I'm bringing in the pmix-1.0.0 code and instrumenting it with Artem's timing support, so hopefully we'll get some detailed data.

@jladd-mlnx
Member

We have always set -mca coll ^ml, but @nkogteva, can you confirm that you did have ml turned off? @rhc54, I'm not sure we can conclude that the modex sends are really scaling like O(N) as opposed to O(1), or whether that's just noisy data. How reproducible is this, @nkogteva? How many trials were done per data point?

@nkogteva
Author

@hppritcha @jladd-mlnx @rhc54

I re-ran the test with the -mca coll ^ml option (it was missing previously).

My command line for 128 launched nodes:
/usr/bin/time -o <time_utility_output_file_name> -p <path_to_ompi>/mpirun -bind-to core --map-by node -display-map -mca mtl '^mxm' -mca btl tcp,sm,self -mca btl_openib_if_include mlx4_0:1 -mca plm rsh -mca coll '^ml' -np 1024 -mca oob tcp -mca grpcomm_rcd_priority 100 -mca orte_direct_modex_cutoff 100000 -mca dstore hash -mca ompi_timing true <path_to_test>/c_hello

New results for pmix test:
[plots: fence; modex_recv; modex_send; total]

As you can see, MODEX_SEND is not exactly a function of the number of nodes; it varies from run to run. In any case, the MODEX_SEND time is small. MODEX_RECV, however, still shows very high execution times.

Please also note that this data comes from a single run. I'm still gathering data to plot average times over at least 10 runs. The single-run plots are attached here to focus attention on the MODEX_RECV problem.

@rhc54
Contributor

rhc54 commented Feb 19, 2015

Thanks Nadia! That cleared things up nicely.

I don't find anything too surprising about the data here. I suspect we are doing a collective behind the scenes (i.e., the fence call returns right away, but a collective is done to exchange the data), and that is why modex_recv looks logarithmic. I wouldn't be too concerned about it for now.

As for it "taking too long" - I would suggest that this isn't necessarily true. It depends on the algorithm and mode we are using. Remember, at this time the BTLs are demanding data for every process in the job, so regardless of direct modex or not, the total time for startup is going to be much longer than we would like.

Frankly, until the BTLs change, the shape of the curves you are seeing is actually about as good as we can do. We can hopefully reduce the time, but the scaling law itself won't change.

One other thing to check: be sure that you build with --disable-debug and with some degree of optimization (e.g., -O2), as that will significantly impact performance.

@hppritcha
Member

Isn't the real issue here the delta between the 1.8 and master timings? I think we need to understand the results in Nadia's "real" (wall-clock time) graph. I'd be interested in --disable-debug builds of both 1.8 and master.

@rhc54
Contributor

rhc54 commented Feb 19, 2015

As I've said before, we know the current pmix server implementation in ORTE isn't optimized, and I'm working on it now. So I'm not really concerned at the moment.

@rhc54
Contributor

rhc54 commented Feb 20, 2015

I saw that Elena committed some changes to the grpcomm components that sounded like they directly impact the results here. Thanks Elena!

@nkogteva when you get a chance, could you please update these graphs?

@nkogteva
Author

nkogteva commented Mar 3, 2015

Looks like we have a significant improvement in modex performance (especially in modex_recv) after the fixes in grpcomm. Total pmix test execution time was reduced by more than a factor of 20:

[plots: fence; modex_recv; modex_send; total]

But we still have some problems with mpi_init time, due to long intervals between the modex and the first barrier on some processes, and the corresponding barrier wait on the processes that wait for them. For example, for 1024 processes on 128 nodes it takes 18-20 seconds.

[plots: time from barrier to complete mpi_init; time from completion of rte_init to modex; time from modex to first barrier; time to execute barrier; time to execute modex]

As a result, real time was cut roughly in half by the pmix fixes, but it still remains high. However, this does not appear to be a pmix-related issue.

[plots: real; sys; user]

@jladd-mlnx
Member

Great data!!! Great work, Nadia and Elena!!


@rhc54
Contributor

rhc54 commented Mar 3, 2015

Indeed - thanks! The time from modex to barrier would indicate that someone is doing something with a quadratic signature during their add_procs or component init procedures. We should probably instrument MPI_Init more thoroughly to identify the precise step that is causing the problem.

@bosilca
Member

bosilca commented Mar 3, 2015

For the sake of completeness, here are some results from 2011 [1]. They are based on an Open MPI 1.6 spin-off (revision 24321). We compared it with MPICH2 (1.3.2p1 with Hydra over rsh and slurm), as well as MVAPICH (1.2) using the ScELA launcher. Some of these results were obtained with the vanilla Open MPI, and some (the ones marked with prsh) with an optimized launcher developed at UTK.

These results are interesting because, for most of the cases (with the exception of the ML collective), little has changed during the initialization stages in terms of modex exchange or process registration. So this is almost a fair, one-to-one comparison with the results posted in this thread. Moreover, these results measure an MPI application composed of just MPI_Init & MPI_Finalize, so we are talking about a full modex plus the required synchronizations.

[plots: mvapich_vs_ompi; mpich_vs_ompi]

[1] Bosilca, G., Herault, T., Rezmerita, A., Dongarra, J., "On Scalability for MPI Runtime Systems," Proceedings of the 2011 IEEE International Conference on Cluster Computing, IEEE Computer Society, Austin, TX, pp. 187-195, September 2011. http://icl.cs.utk.edu/news_pub/submissions/06061054.pdf

@jladd-mlnx
Member

Thanks, George. This is very enlightening.


@rhc54
Contributor

rhc54 commented Mar 4, 2015

Actually, what @bosilca said isn't quite true. The old ORTE used a binomial tree as its fanout. However, we switched that in the trunk to use the radix tree with a default fanout of 64, so we should look much more like the routed delta-ary tree curve.
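For a back-of-the-envelope sense of the difference (illustrative arithmetic only, not ORTE code): with a fanout of 64 the routing tree is much shallower than the old binomial tree.

```c
#include <math.h>
#include <stdio.h>

int main(void)
{
    const int nodes[] = { 64, 128, 512, 900, 4096 };

    for (size_t i = 0; i < sizeof(nodes) / sizeof(nodes[0]); i++) {
        /* depth of a binomial tree over N daemons vs. a radix tree with fanout 64 */
        double binomial = ceil(log2(nodes[i]));
        double radix64  = ceil(log(nodes[i]) / log(64.0));
        printf("%5d nodes: binomial depth %2.0f, radix-64 depth %1.0f\n",
               nodes[i], binomial, radix64);
    }
    return 0;
}
```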

@bosilca
Member

bosilca commented Mar 4, 2015

@rhc54 If ORTE had ever exhibited the scalable behavior you describe, we would not be here discussing it.

@rhc54
Contributor

rhc54 commented Mar 4, 2015

Ah now, George - don't be that way. My point was only that we haven't used the binomial fanout for quite some time now, and so your graphs are somewhat out-of-date.

ORTE has exhibited very good scaling behavior in the past, despite your feelings about it. According to your own data, we do quite well when launching in a Slurm environment, which means that the modex algorithms in ORTE itself were good: we launch the daemons using Slurm, but we use our own internal modex in that scenario. We also beat the pants off of MPICH2 hydra at the upper scales in that scenario, as shown in your own graph.

So it is only the rsh launcher that has been in question. We have put very little effort into optimizing it for one simple reason: nobody cares about it at scale, and it works fine at smaller scales.

Since the time of your publication, we also revamped the rsh launcher so it should be more parallel. I honestly don't know what it would look like now at scale, nor do I particularly care. I only improved it as part of switching to the state machine implementation.

We are here discussing this today because we are trying to characterize the current state of the PMIx implementation, which modified the collective algorithms behind the modex as well as the overall data exchange.

@hppritcha
Member

George, for the data in the cited paper, were the apps dynamically or statically linked? Also, which BTLs were enabled?

@bosilca
Member

bosilca commented Mar 5, 2015

Hopefully there has been significant progress on ORTE's design and infrastructure, with a palpable impact on its scalability. I might have failed to correctly read @nkogteva's results, but based on them I can hardly quantify all these improvements you are mentioning.

Superimpose @nkogteva's results on the graphs presented above. What I see from the new scalability graphs is that at 120 processes one needs more than 4 seconds for the release version and over 10 seconds for the trunk. I'll let you do the fit and see how this would behave at 900 nodes, but without spoiling the surprise, it looks pretty grim.

Now, let's address prsh. That ORTE using Slurm had similar (but still worse) performance compared with a launcher based on ssh is nothing to be excited about. Of course, one can pat oneself on the back about how good ORTE is compared with other MPIs (MPICH or MVAPICH). What I see instead is that there were some issues in ORTE regarding application startup and the modex, and that, based on the last set of results, these issues still remain to be addressed.

@bosilca
Member

bosilca commented Mar 5, 2015

@hppritcha the cluster was heterogeneous, more like a cluster of clusters. Some of these clusters were IB-based, others MX, and a few had only 1G Ethernet. We didn't restrict the BTLs on the second graph (the one going up to 900 nodes). The applications were dynamically linked, but the binaries were staged on the nodes before launching the application (no NFS in our experimental setting). The first graph (the one going up to 140) is from an IB cluster, with only the IB, SM, and SELF BTLs enabled.

@jladd-mlnx
Member

@bosilca I think it might be worthwhile for my team to try to reproduce your experiments. It should be pretty straightforward, as @nkogteva has automated the process. What version of OMPI was used?

@bosilca
Member

bosilca commented Mar 5, 2015

r24321. We used a Debian 5.0.8 that came with SLURM 1.3.6.


@hppritcha
Member

I've gotten some data off of the Edison system at NERSC, a 5500-node Cray XC30. It's not practical to get more than about 500 nodes except during DST, though. I've managed to get data comparing job launch/shutdown times for a simple hello world (no communication) using the native aprun launcher, with 8 ranks per node. I'm not seeing any difference in performance between master and the 1.8 branch using aprun. I configured with --disable-dlopen --disable-debug for these runs, otherwise no special configure options (well, for master; for 1.8 it's a configuration nightmare). coll/ml is disabled. Times are the output of "time", i.e. time aprun -n X -N 8 ./a.out. BTL selection was restricted to self,vader,ugni.

[plot: job launch/shutdown time vs. node count, aprun, master vs. 1.8 branch]

I'm endeavoring to get data using mpirun, but I'm seeing hangs above 128 hosts, or at least the job hasn't completed launching within the 10 minutes I give the job script.

Note that the relatively high startup time even at small node counts is most likely an artifact of the machine not being in dedicated mode, as well as of the size of the machine. There are complications with the management of the RDMA credentials required for all jobs that add some overhead as the number of jobs running on the system grows. ALPS manages the RDMA credentials. Generally there are around 500-1000 jobs running on the system at any one time.

@rhc54
Contributor

rhc54 commented Mar 12, 2015

I'm afraid that doesn't really tell us anything except to characterize Cray PMI's performance. When you direct launch with aprun, OMPI just uses the available PMI for all barriers and modex operations. So none of the internal code in question (i.e., PMIx) is invoked.

The real question here is why you are having hangs with mpirun at such small scale. We aren't seeing that elsewhere, so it is a concern.

@hppritcha
Member

The data is useful in helping to sort out whether any performance difference between 1.8 and master is due to the BTLs. I'd say that, at least on the system where this data was collected, the answer appears to be no.

@jladd-mlnx
Member

Nadia fixed the last remaining big performance bug in #482. With this fix, we are seeing end-to-end hello world times on the order of 90 seconds at 16K ranks across roughly 800 nodes. We are now focusing on ratcheting down the shared-memory performance.

@rhc54
Contributor

rhc54 commented Mar 18, 2015

I have a favor to ask: please quit touching the pmix native and server code in master. I'm trying to complete the pmix integration and have to fight through a bunch of conflicts every time you touch it. The only solution is for me to just drop whatever you changed, as the code is changing too much and your changes are too out-of-date to be relevant.

@nkogteva
Author

Latest results (1 run, 0-vpid, hello_c):
[plots: time from completion of rte_init to modex; time to execute modex; time from modex to first barrier; time to execute barrier; time from barrier to complete mpi_init; real; sys; user]

@jladd-mlnx
Member

Nice, @nkogteva. Looks like you resolved the performance issues. Great work! Now let's help @elenash drill into and analyze the shared-memory DB work.

@hppritcha
Member

can this be closed now?

@rhc54
Contributor

rhc54 commented Apr 20, 2015

I think so, but I'll let Josh be the final decider.


@jladd-mlnx
Member

This can be closed.

jsquyres pushed a commit to jsquyres/ompi that referenced this issue Sep 21, 2016
Default allocated nodes to the UP state