WeeklyTelcon_20200331
Geoffrey Paulsen edited this page Mar 31, 2020 · 1 revision
- Dialup Info: (Do not post to public mailing list or public wiki)
- Geoffrey Paulsen (IBM)
- Jeff Squyres (Cisco)
- Austen Lauria (IBM)
- Akshay Venkatesh (NVIDIA)
- Brian Barrett (AWS)
- Brendan Cunningham (Intel)
- Edgar Gabriel (UH)
- Erik Zeiske
- George Bosilca (UTK)
- Howard Pritchard (LANL)
- Joseph Schuchart
- Josh Hursey (IBM)
- Joshua Ladd (Mellanox)
- Matthew Dosanjh (Sandia)
- Thomas Naughton (ORNL)
- Noah Evans (Sandia)
- Ralph Castain (Intel)
- Scott Breyer (Sandia?)
- William Zhang (AWS)
- Geoffroy Vallee (ARM)
- Harumi Kuno (HPE)
- Michael Heinz (Intel)
- Shintaro Iwasaki
- Todd Kordenbrock (Sandia)
- David Bernholdt (ORNL)
- Artem Polyakov (Mellanox)
- Nathan Hjelm (Google)
- Charles Shereda (LLNL)
- Brandon Yates (Intel)
- Mark Allen (IBM)
- Matias Cabral (Intel)
- Xin Zhao (Mellanox)
- mohan (AWS)
-
MTT -
- If you change your MTT setup to start PRRTE at the beginning of the session and then just use prun:
- Run times can be cut in half or more.
- This is good, but we also need to test the mpirun wrapper.
- Cisco is converting half of its MPI installs to use PRRTE/prun.
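A minimal sketch of the two launch patterns discussed above, written as a command planner rather than as real MTT configuration. The function names are hypothetical, and the `--daemonize` and `-n` flags are assumptions that should be checked against the installed PRRTE and Open MPI versions. It only builds the command lists; it does not run anything.

```python
from typing import List


def persistent_dvm_plan(tests: List[List[str]], nprocs: int = 2) -> List[List[str]]:
    """Start PRRTE once, launch every test with prun, then shut it down."""
    # Assumed flag: start the PRRTE DVM in the background, once per session.
    plan = [["prte", "--daemonize"]]
    for test in tests:
        # Each test reuses the already-running DVM through prun.
        plan.append(["prun", "-n", str(nprocs), *test])
    # Tear the DVM down at the end of the session.
    plan.append(["pterm"])
    return plan


def mpirun_plan(tests: List[List[str]], nprocs: int = 2) -> List[List[str]]:
    """Launch every test through the mpirun wrapper (one startup per test)."""
    return [["mpirun", "-n", str(nprocs), *test] for test in tests]


if __name__ == "__main__":
    for cmd in persistent_dvm_plan([["./ring"], ["./hello"]], nprocs=4):
        print(" ".join(cmd))
```

Keeping both planners side by side reflects the note that the prun path is faster but the mpirun wrapper still needs coverage.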
-
OMPI master submodule pointers setup to track PMIx and PRRTE master.
- Jeff discussed an idea for some integration with PRRTE: putting a specific string in a PRRTE PR would automatically open an Open MPI PR to update the PRRTE submodule once that PRRTE PR is merged to PRRTE master.
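One way the trigger-string idea could work is sketched below. This is only an illustration: the marker text `ompi-submodule-update` and the function name are hypothetical, no such convention was agreed in the meeting, and the step that would actually open the Open MPI PR is left out.

```python
import re
from typing import Optional

# Hypothetical marker: must appear alone on a line in the PRRTE PR description.
MARKER = re.compile(r"^\s*ompi-submodule-update\s*$", re.IGNORECASE | re.MULTILINE)


def wants_submodule_update(pr_body: Optional[str], merged: bool) -> bool:
    """Return True if a merged PR asks for an Open MPI submodule bump."""
    if not merged or not pr_body:
        # Unmerged PRs and empty descriptions never trigger anything.
        return False
    return MARKER.search(pr_body) is not None
```

Requiring the marker on its own line avoids triggering on a PR that merely mentions the string in passing.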
Blockers All Open Blockers
- v3.0.6 and v3.1.6 are hopefully the last on those branches.
- Removing from weekly meetings.
- Advise users to move to v4.0.x
Review v4.0.x Milestones v4.0.4
- v4.0.4 in the works.
- Ralph opened PR 7581 (RSH scaling).
- Thomas Naughton signed off on it.
- PR 7579 - UCX PML
- UCX OSHMEM - Josh Ladd signed up to have someone review.
- PMIx-v3.1.x - Update 3/23: Until we hear a problem, we won't backport a PMIx PR or ship another PMIx v3.1.x
-
Schedule:
- Feature Freeze: April 30
- Release: End of June
-
Austen took an initial stab at the issues and is starting a Google Sheets document of v5.0 features.
- Today we went through all of the items in the Google Sheets document (https://docs.google.com/spreadsheets/d/1OXxoxT9P_YLtepHg6vsW3-vp4pdzGQgyknNbkzenYvw/edit#gid=0), which were taken from the face-to-face wiki.
- Josh Ladd led us to gather owners and a status for each of the various tasks.
-
Updated status in the Google Sheets document above.
-
PMIx v4.0.0 - on track
- Schedule: IBM needs the branch by mid-April.
- PMIx probably won't have all v4.0.0 requirements done by mid-April.
- Remaining issues include:
- Finalizing the current PMIx standard version.
- Issue captured in https://github.com/pmix/pmix-standard/issues/189
- Concern that we need more institutional knowledge of PMIx and PRRTE
- Weekly Thursday Meetings for interested newcomers:
-
PRRTE v2.0 - on track
-
Remove OSC pt2pt - at risk. Needs some work from PSM2/OFI before we can remove it.
-
Discussed Multithreaded Framework
- Concerns about some non-posix implementations and MPI progress in general.
- see https://github.com/open-mpi/ompi/pull/6578
- Consensus that we want the framework / reorganization (using pthread as default)
- Will address a few other PR specific issues before merging.
- Greater progress issues in various components can be discussed in the future.
-
Issues not tracked on spreadsheet.
- Some of the PMIx / PRRTE integration isn't right in Open MPI.
- libopal isn't slurped into Open-MPI correctly (related to 7560)
- Jeff and Brian will meet Friday and
-
Hierarchical collectives
- If someone wants to do this, PMIx has much of the needed information already.
- Not too hard to do, and they're much faster. Will be in the next version of a competitor MPI.
- Probably not for v5.0
-
PR7566 - can't merge until Mellanox CI testing rev.
- How do we handle this?
- Link here to the PR on the Mellanox HPC repo.
-
Static linking is failing on master right now.
- Issue 7560
- May be an issue in static build support in PMIx and PRRTE, as well as in how we're pulling them in.
- Affects everything, just masked at the moment because static linking is broken.
- Jeff will investigate
- No progress.
-
SLURM PMIx plugin has been locked on PMIx v2 for some time.
- There are some NEW PMIx calls that SHOULD be added to bring it up to date.
- Ralph has started a PR, but needs help.
- So for now, there's some optional info that won't be passed correctly.
- No OMPI_INFO for now.
- Ralph gets pinged occasionally.
- The priority of this is unclear.
-
MTT on master is looking pretty good.
- Deferred.
- Scale testing: PRs have to opt into it.
Review Master Master Pull Requests
- CI testing only checks that the build succeeds and that the tests ran; it doesn't check HOW they ran.
- Environment setup can be a bit different.
- For example, no permissions in /tmp: a test might pass on one machine and fail on another that lacks /tmp permissions.
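The /tmp example above suggests a pre-flight check that a CI job could run before the tests, so an environment difference is reported as such rather than as a test failure. The sketch below is an assumption about how such a check might look, not something the CI does today; the function name is hypothetical.

```python
import os
import tempfile


def can_create_files(directory: str) -> bool:
    """Return True if a temporary file can be created inside `directory`."""
    try:
        # The file is removed automatically when the context manager exits.
        with tempfile.NamedTemporaryFile(dir=directory):
            return True
    except OSError:
        # Covers a missing directory as well as missing permissions.
        return False


if __name__ == "__main__":
    tmp = tempfile.gettempdir()
    status = "usable" if can_create_files(tmp) else "NOT usable"
    print(f"{tmp} is {status} for temporary files")
```

Creating a real file is used instead of `os.access` because it exercises the same path a test would take.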