
WeeklyTelcon_20180122

Geoffrey Paulsen edited this page Jan 15, 2019 · 1 revision

Open MPI Weekly Telecon


  • Dialup Info: (Do not post to public mailing list or public wiki)

Attendees

  • Geoff Paulsen
  • Jeff Squyres
  • akvenkatesh
  • Artem
  • Brian
  • Edgar Gabriel
  • Geoffroy Vallee
  • Howard
  • Josh Ladd
  • Josh Hursey
  • Matthew Dosanjh
  • Mohan
  • Todd Kordenbrock
  • Nathan

Agenda/New Business

  • News: Ralph will not be able to work on Open MPI anymore. He will continue to work on PMIx, but not even the Open MPI PMIx merge.
  • Mellanox will step up and help with PMIx and ORTE integration issues.
  • IBM can help with bugfixing, but cannot own ORTE.
  • Need a v3.1 release engineer to help; Brian will send email to devel-core.
  • Ralph offered to have a brain dump day. Email Brian if interested.
  • MPI Forum is in Portland in a little over a month.
  • Face2Face -
    • Brian will email to see about co-locating Open MPI with PMIx with ORTE.
    • If it's not an issue, then resolve next week.

Minutes

Review v2.x Milestones v2.1.3

  • No chance to look at.
  • Pretty quiet, ready to go

Review v3.0.x Milestones v3.0.1

  • Schedule: RC2 is actively building now. [50%]
    • On 3.x series trying to cut RCs on nightly tarballs.
    • Didn't get RC last week
    • Will get RC today.
  • Blocker on v3.1.x
    • PR4516
    • May not be a blocker.
  • Target v3.0.x in PR4715
    • Review required.
  • Will Pull in PR4716
    • Issue 4563
      • Not seeing this on the little ARM boxes here; Jenkins uses --disable-builtin-atomics.
  • Comm Spawn - Documentation PR ready or pulled
  • Issue 4509
    • We believe this is closed. Asked Nathan to close.
  • Issue - hwloc can't handle CUDA from a different location
    • On master, specifically disabling hwloc CUDA.
    • External component does NOT disable build, since
  • 4677 - hwloc2 WIP. Can't get to it until the weekend.

Review v3.1.x Milestones v3.1.0

  • SCHEDULE:
  • BLOCKER:
    • OSC monitoring fix (doesn't build with Portals 4)
    • PMIx 2.1 PR4605
      • PR4746
      • Ralph - there is cleanup issue with PMIx 2.1, but we have cleanup issues today
      • Mellanox will help work on this.
    • UCX one sided violating PR4688
    • Issue 4303
      • Probably just need to build a patch.

Review Master Master Pull Requests

  • Issue 4686
    • Jeff tried to reproduce and failed.
    • Thought HCOLL was an issue, Artem took out, and put back.
    • Something going on in there. Possibly atomic related.
    • Might need Nathan's attention.
    • Someone could try reverting the one change to atomics to see if that caused it.
    • Mellanox will try to reproduce after reverting atomic change. Timing issue.
  • Dynamic operations: a TON of segfaults. All in opal_progress, during ompi_sync_wait multi-credit.
    • Something is wrong with atomics. Intercomm_create or Spawn.
    • Cisco is tickling the most, and will look at.
    • Delayed.
  • PR4697 got resolved and merged to master.
    • Opal Progress change looks good for most interconnects.
    • TCP performance regression was resolved and merged to master.
    • Going to PR this into v3.1.x
    • George is unhappy with this
    • Don't have any non-OS wrappers for TLS
    • Master now checks for C11. Can we make it default?
    • Mac Sierra may/may not even with _Thread_local
    • Would be nice if we could require C11 for v4.0
  • Regular expression creation.
    • PR4710
    • Someone created a test and put it in make-check rather than MTT.
    • Then made the component static so that they don't have to do make install
    • Don't think we should be adding tests to make-check
    • Question - Is there a Regex library we could use? Reg-ex is hard.
    • This is working pretty well, but did add Framework to allow for future components.
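
For the TLS discussion under PR4697 above, a minimal sketch of what a compiler-level (non-OS) thread-local wrapper could look like. EXAMPLE_THREAD_LOCAL, example_counter and example_next_id are hypothetical names for illustration, not Open MPI symbols. The preprocessor version check is an assumption standing in for a configure-time compile test; it is not necessarily what master does, and a real check would need to compile and run a test program, since a compiler can advertise C11 without working TLS on a given platform.

```c
/* Hypothetical wrapper macro (not an Open MPI name): choose a
 * thread-local storage keyword from what the compiler advertises. */
#if defined(__STDC_VERSION__) && __STDC_VERSION__ >= 201112L
#  define EXAMPLE_THREAD_LOCAL _Thread_local   /* C11 keyword */
#elif defined(__GNUC__)
#  define EXAMPLE_THREAD_LOCAL __thread        /* GNU extension fallback */
#else
#  error "this sketch has no thread-local storage keyword for this compiler"
#endif

/* Each thread gets its own copy of this counter, initialized to 0. */
static EXAMPLE_THREAD_LOCAL int example_counter = 0;

/* Return the next per-thread id; no locking is needed because the
 * counter is private to the calling thread. */
int example_next_id(void)
{
    example_counter += 1;
    return example_counter;
}
```

Calling example_next_id() twice from the same thread returns 1 and then 2; a second thread would start again from 1 with its own copy.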

Process

  • Change behavior of opal_check_package
    • Brian will send email to devel
    • Make it more explicit when it finds issues
    • Issue 4423
  • When your PR has been accepted into a release branch, please go to the issue, and remove the target of the release branch that it was just merged into. Attempting to automate this in the future.

MTT / Jenkins Testing Dev

  • New Topic - We currently can't write unit tests against components.
    • Some way to say "this unit test is against this component".
    • Intel went through and did this internally for orte. Already hosted in public domain.
      • Ralph will send link to Brian to take a look.
  • Python Client can't report back to database.

Other Discussion

Next Face-to-face

  • Probably looking at March or early April
    • San Jose or Dallas
      • Geoff will send out two Doodles for date and time.

Abandoning OpenIB BTL

  • Discuss abandoning openib btl.
    • LANL is no longer paying anyone to maintain openib btl.
      • Nathan has a UCX BTL
    • ETA on GPU in UCX - basic minus CUDA IPC in test now.
    • Any warning message if on iWarp
    • What's the roadmap for this? 3.x or 4.x?

Oldest PR

Oldest Issue

Next face-to-face meeting

  • Pushed date to late Feb or March.

Status Updates:

Status Update Rotation

  1. Mellanox, Sandia, Intel
  2. LANL, Houston, IBM, Fujitsu
  3. Amazon,
  4. Cisco, ORNL, UTK, NVIDIA

Back to 2017 WeeklyTelcon-2017
