WeeklyTelcon_20160809
- Dialup Info: (Do not post to public mailing list or public wiki)
- Attendees:
- Geoff Paulsen
- Ralph
- Arm Patinyasakdikul
- Brian
- Edgar Gabriel
- Nathan Hjelm
- Todd Kordenbrock
- Josh Hursey
- Artem Polyakov
- Milestones
- 1.10.4
- A few PRs to pull in. Want folks to focus on 2.0. Once 2.0.1 is out, might begin work on 1.10.4.
- Wiki
- 2.0.1 PRs that are reviewed and approved
- Blocker Issues
- Milestones
- A bunch of outstanding 2.0.1 PRs that are not reviewed yet!
- Need Reviews. "Jeff and Howard aren't going to wait".
- With the exception of the performance issue, freezing TODAY! Will release 2.0.1 next Tuesday.
- 2.0.1 issues:
- Performance issues on devel list (Issue 1943???)
- Howard said Nathan said he had a fix, but didn't see it yesterday.
- Issue is adding all RDMA endpoints for use by one-sided, but OB1 will try to stripe across ALL RDMA endpoints (slowing things down).
- Nathan is working on a patch, but it's crashing for Fujitsu. On some systems the openib BTL is terrible for on-node; with CMA or XPMEM you never want to use the openib BTL on-node.
- Nathan should have something for this today. Working on the Fujitsu system.
- Disqualifying the openib component, but not falling back to send/recv.
- PML_OB1_ALL_RDMA - if true, use all RDMA-capable BTLs; defaults to false (same behavior as 1.10: for send/recv, ignore BTLs that are on the RDMA list but not on the EAGER list).
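- For reference, a hedged sketch of the runtime knobs under discussion; the exact MCA parameter name behind PML_OB1_ALL_RDMA is an assumption here, so verify it with ompi_info:

```sh
# Hedged sketch, not verified against 2.0.1: "pml_ob1_use_all_rdma"
# is a guess at the MCA parameter behind PML_OB1_ALL_RDMA; list the
# real ob1 parameters with:
#   ompi_info --param pml ob1 --level 9

# Keep ob1 from striping across every RDMA-capable endpoint:
mpirun --mca pml_ob1_use_all_rdma 0 ./a.out

# Or exclude the openib BTL altogether (e.g., single-node runs on
# systems with CMA/XPMEM, where openib is a poor on-node choice):
mpirun --mca btl ^openib ./a.out
```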
- Two blocker issues:
- PETSc MPI_Request_free - listed as a blocker.
- Need a reproducer (a hedged sketch of one appears after this blocker's bullets). Something is doing an OBJ_RELEASE on a datatype.
- Nathan can look until Thursday.
- Looks like we're stuck.
- Nathan is hitting this in OSC pt2pt.
- Design flaw: the request completion callback calls start, which fires the callback again, which calls start (unbounded recursion).
- Instead, put the request on a list and process the list from the progress loop (see the sketch below).
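- A generic sketch of the proposed restructuring (illustrative names only, not the actual osc/pt2pt code): the completion callback enqueues the request instead of re-entering start, and the progress loop drains the queue iteratively.

```c
/* Illustrative pattern only (not the real osc/pt2pt structures):
 * replace callback->start recursion with a deferred list that the
 * progress loop drains. */
#include <stddef.h>

typedef struct request {
    struct request *next;
} request_t;

static request_t *pending = NULL;  /* requests waiting to be (re)started */

/* Completion callback: the old flaw called request_start() directly
 * here, which could fire this callback again -> unbounded recursion.
 * Now we only enqueue the follow-on work. */
static void request_complete_cb(request_t *req)
{
    req->next = pending;
    pending = req;
}

/* Stand-in for posting the next operation on the request. */
static void request_start(request_t *req)
{
    (void) req;  /* ... post the operation; the transport later invokes
                    request_complete_cb() from the progress engine ... */
}

/* Progress loop: drains the list iteratively, so the callback chain
 * becomes a bounded loop instead of recursion. */
static void progress(void)
{
    while (NULL != pending) {
        request_t *req = pending;
        pending = req->next;
        request_start(req);
    }
}
```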
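- On the reproducer for the PETSc blocker: a minimal hedged sketch of the usage pattern in question (not PETSc's actual code); it frees the request and the derived datatype while a send is still in flight, so the library's internal release of the datatype is the last reference. Run with at least 2 ranks.

```c
/* Hypothetical reproducer sketch: MPI_Request_free and MPI_Type_free
 * are called while the send is still in flight, which is legal MPI.
 * A crash here would point at the datatype refcount bug. */
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, buf[64] = {0};
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Datatype vec;
    MPI_Type_contiguous(64, MPI_INT, &vec);
    MPI_Type_commit(&vec);

    if (0 == rank) {
        MPI_Request req;
        MPI_Isend(buf, 1, vec, 1, 0, MPI_COMM_WORLD, &req);
        MPI_Request_free(&req);   /* give up the handle before completion */
    } else if (1 == rank) {
        MPI_Recv(buf, 1, vec, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
    MPI_Type_free(&vec);          /* drop our datatype reference, too */

    MPI_Barrier(MPI_COMM_WORLD);  /* let the freed request actually finish */
    MPI_Finalize();
    return 0;
}
```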
- Disable atomics on the thread-single path (atomics cost 20-30% in message rate).
- Just needs to be closed out. If the datatype or communicator is intrinsic, don't use atomics (see the sketch below).
- IBM was going to do this, but probably for 2.0.2.
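- A hedged sketch of the intrinsic-object part of this optimization (type and function names are illustrative, not Open MPI's actual OBJ_RETAIN/OBJ_RELEASE internals): skip the atomic refcount update when the object is predefined, since predefined objects are never destroyed. The thread-single case would add a similar runtime check.

```c
/* Illustrative only: conditional refcounting that avoids atomics for
 * intrinsic (predefined) objects, which live for the whole job anyway. */
#include <stdatomic.h>
#include <stdbool.h>

typedef struct obj {
    atomic_int refcount;
    bool       is_intrinsic;  /* e.g. MPI_INT, MPI_COMM_WORLD */
} obj_t;

static inline void obj_retain(obj_t *o)
{
    if (o->is_intrinsic) {
        return;  /* predefined objects are never freed: no atomic needed */
    }
    atomic_fetch_add_explicit(&o->refcount, 1, memory_order_relaxed);
}

static inline void obj_release(obj_t *o)
{
    if (o->is_intrinsic) {
        return;
    }
    if (1 == atomic_fetch_sub_explicit(&o->refcount, 1, memory_order_acq_rel)) {
        /* last reference dropped: run the destructor and free here */
    }
}
```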
- Need Reviews: 6 of 15 need to be reviewed.
- Nathan is on one (waiting on Howard for a fix for it).
- Ryan has 3. Jeff has a few.
- Would like PMIx 1.1.5 in OMPI 2.0.1
- Created issue 1930 on Master
- Howard would like to see the impact of NOT merging in PMIX 1.1.5.
- Ralph - SLURM support for 2.0 would be broken without this.
- Ralph - There is another reason to do this upgrade as well; it was requested because of a problem.
- Artem would need to review.
- OpenSHMEM in 2.0 is DOA??? (carried over from last week)
- is this still an issue? https://www.mail-archive.com/users@lists.open-mpi.org/msg00052.html
- This has been fixed already in 2.0.1.
- Howard mentioned he would like to add a performance regression test, especially for OB1 performance.
- Old Issue 194, non-uniform BTL usage.
- There is a performance regression test in MTT. Today we output a number and no one monitors it.
- Want some historical tracking. Ralph will take an action item to put this into the Python client.
- 2.1.0: No date at the moment.
- Mellanox needs PMIx 2.0 in 2.1.0
- PMIx will release a 2.0 that just adds the shared-memory data store, but doesn't have everything else they were targeting for 2.0.0.
- This should come out Early September.
- This is the piece that Mellanox and IBM are interested in.
- Put items requested on the wiki (e.g., PMIx direct modex, OpenSHMEM, stability improvements)
- What do people want to see for 2.1.0?
- Finalize the list in Dallas meeting
- Hopefully target a Sept./Oct. release; not a Supercomputing goal.
Review Master MTT testing (https://mtt.open-mpi.org/)
- May not have been updated to go to the new lists. Haven't seen these in a while.
- Josh can update MTT with the new email list names.
- Ick! Lots and lots of failures. A bunch from Cisco, because of the upgrade to RHEL 7; maybe driver issues.
- Over 1200 failures on master on PowerPC.
- IBM is running both MTT and Jenkins, but wanted to run with the XL compiler. Nysal has a patch.
- Only 42 tests ran on master.
- ppc64 with the XL compiler segfaults in OB1 send.
- On 2.x, Josh is playing with the configuration. Hoping to ramp this up, but starting slowly.
- MTT development: meeting every week; a lot of progress has been made.
- The Python client is ready to go for non-combinatorial runs. (Howard's intern is working on combinatorial.)
- The standard list of tests should work. Ralph is testing today. Intel can run against their version of the MTT database.
- Should be done in next month.
- Ralph is adding provisioning plugin.
- Infrastructure migration: most of it is migrated now.
- Jeff switched over the GitHub webhooks; they now go through hostgator.
- The only things left to migrate are MTT and the Jenkins master.
- Brian has gotten permission to host MTT and the Jenkins master, if the community would like. Discuss at the face-to-face.
- If you are coming, make sure to register for the event and put your name on the wiki.
- Facilities are the same as last year (but possibly a different room the first day).
- Assessment of the request handling refactoring
- Artem provided a summary of results.
- 1.10 vs. master
- We need to run the benchmark more broadly, and some deeper analysis on the results.
- Action items
- Arm to provide a 2.0 version of the benchmark for the community
- Artem to setup a wiki page with details on how to run, as a place to coordinate results
- Folks please run the benchmarks
- Make sure this item stays on the agenda until resolved
- Wiki is here: https://github.com/open-mpi/ompi/wiki/Request-refactoring-test
- Status Updates:
- LANL
- Houston
- IBM
- Status Update Rotation:
- LANL, Houston, IBM
- Cisco, ORNL, UTK, NVIDIA
- Mellanox, Sandia, Intel