
WeeklyTelcon_20160531

Jeff Squyres edited this page Nov 18, 2016 · 1 revision

Open MPI Weekly Telcon


  • Dialup Info: (Do not post to public mailing list or public wiki)

Attendees

  • Jeff Squyres
  • Arm Patinyasakdikul
  • Edgar Gabriel
  • Howard Pritchard
  • Nathan Hjelm
  • Ralph Castain
  • Sylvain Jeaugey
  • Todd Kordenbrock

Agenda

Review 1.10

Review 2.0.x

  • Wiki: https://github.com/open-mpi/ompi/wiki/Releasev20
  • Blocker Issues: https://github.com/open-mpi/ompi/issues?utf8=%E2%9C%93&q=is%3Aopen+milestone%3Av2.0.0+label%3Ablocker
  • Milestones: https://github.com/open-mpi/ompi-release/milestones/v2.0.0
    • PR 1174: needs a minor tweak, then we'll put it in
    • PR 1199: request revamp
      • Nathan found 2 new issues
        • one thread adds callback while another thread is calling that callback (Nathan working on this right now -- PR probably in the next few hours)
        • PR 1729: minor thread leak in persistent communications / callbacks can be lost (very old bug -- dates back to 2005!). Need George PR.
    • One more PR coming from Nathan with an XRC fix
    • Nathan has a 1-line uGNI fix that he'd like to get in -- will send to Howard
    • NVIDIA CUDA build failed in MTT: the fix was just merged
  • Fallout from request overhaul
    • How's it look on master?
      • looking good; other than missing CM, we're turning up mostly other genuine threading bugs
      • all PMLs should be good now
      • George would like to fix a few error paths (maybe v2.0.1)
    • Is there a consolidated PR for v2.x?
      • Yes: PR 1199

Review Master MTT testing (https://mtt.open-mpi.org/)

  • Reminder from last week
    • 23 pull requests are open on master, some since last October. Not TODAY (since we want George's multithreaded work in first), but we should either bring them in or kill them.

MTT Dev status:

Logistics

  • MPI Forum next week
    • Will be there: Jeff, Howard, Nathan, Sylvain

Status Updates:

  • Mellanox: not here
  • Sandia:
    • tracking down bug in rendezvous protocol. Will roll to v2.x -- may have to wait for v2.0.1.
  • Intel: added options to mpirun:
    • --timeout (reminder, in case you didn't know it existed); exits with ETIMEDOUT if timeout expires (110 on Linux / OS X)
    • --report-state-on-timeout
    • --get-stack-traces
    • Added ability to launch N daemons on a node (just for ORTE scale testing; only works with rsh): MCA param ras_base_multiplier. NOT for MPI performance testing! Only for ORTE scale testing.
    • Working on PMIx event notification stuff. ULFM comes in after that.
    • Going to use OPAL MCA stuff for Warewulf rewrite.
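
As a rough illustration of the mpirun options listed in the Intel update, the Python sketch below assembles a command line with `--timeout`, `--report-state-on-timeout`, and `--get-stack-traces`, then checks whether the exit status matches ETIMEDOUT. The helper names (`build_mpirun_command`, `timed_out`, `run_with_timeout`) are hypothetical and not part of Open MPI. The sketch assumes `--timeout` takes a value in seconds, that `-np` sets the process count, and that the installed Open MPI build includes these options. The numeric value of ETIMEDOUT can differ between platforms, so the check uses `errno.ETIMEDOUT` rather than a hard-coded number.

```python
import errno
import subprocess


def build_mpirun_command(program, nprocs, timeout_s,
                         report_state=True, stack_traces=False):
    """Assemble an mpirun argv using the options named in these notes."""
    cmd = ["mpirun", "-np", str(nprocs), "--timeout", str(timeout_s)]
    if report_state:
        cmd.append("--report-state-on-timeout")
    if stack_traces:
        cmd.append("--get-stack-traces")
    cmd.append(program)
    return cmd


def timed_out(returncode):
    """Return True if the exit status equals this platform's ETIMEDOUT."""
    return returncode == errno.ETIMEDOUT


def run_with_timeout(program, nprocs, timeout_s):
    """Hypothetical wrapper: run the job and report whether it timed out.

    Assumption: mpirun is on PATH and supports the options above.
    """
    result = subprocess.run(build_mpirun_command(program, nprocs, timeout_s))
    return result.returncode, timed_out(result.returncode)
```

A caller might use `run_with_timeout("./a.out", 4, 30)` in a test harness to tell a hung job apart from an ordinary failure.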

Status Update Rotation

  1. Mellanox, Sandia, Intel
  2. LANL, Houston, IBM
  3. Cisco, ORNL, UTK, NVIDIA

Back to 2016 WeeklyTelcon-2016
