WeeklyTelcon_20160809


Open MPI Weekly Telcon


  • Dialup Info: (Do not post to public mailing list or public wiki)

Attendees

  • Geoff Paulsen
  • Ralph
  • Arm Patinyasakdikul
  • Brian
  • Edgar Gabriel
  • Nathan Hjelm
  • Todd Kordenbrock
  • Josh Hursey
  • Artem Polyakov

Agenda

Review 1.10

  • Milestones
  • 1.10.4
    • A few PRs to pull in; want folks to focus on 2.0 for now. Once 2.0.1 is out, work on 1.10.4 might begin.

Review 2.0.x

  • Wiki
  • 2.0.1 PRs that are reviewed and approved
  • Blocker Issues
  • Milestones
  • A number of 2.0.1 PRs are outstanding but not reviewed yet!
    • Need reviews. "Jeff and Howard aren't going to wait."
    • With the exception of the performance issue, freezing TODAY; 2.0.1 will be released next Tuesday.
  • 2.0.1 issues:
    • Performance issue reported on the devel list (Issue 1943?)
      • Howard said Nathan had a fix, but he hadn't seen it as of yesterday.
        • The issue: all RDMA endpoints are added for use by one-sided operations, but OB1 then tries to stripe send/recv traffic across ALL RDMA endpoints (slowing things down).
        • Nathan is working on a patch, but it crashes for Fujitsu. On some systems openib is terrible for on-node traffic; with CMA or XPMEM available you never want to use the openib BTL on-node.
      • Nathan should have something for this today; he is working on the Fujitsu system.
      • Disqualifying the openib component, but not falling back to send/recv.
      • PML_OB1_ALL_RDMA: if true, all RDMA endpoints are used; it defaults to false (same behavior as 1.10: BTLs on the RDMA list but not on the eager list are ignored for send/recv). See the first sketch after this list.
    • Two blocker issues:
      • PETSc MPI_Request_free - listed as a blocker.
        • Need a reproducer. Something is doing an OBJ_RELEASE on a datatype.
        • Nathan can look at it until Thursday.
        • Looks like we're stuck.
        • Nathan is hitting this in OSC pt2pt.
        • Design flaw: the request completion callback calls start, which calls the request callback, which calls start again.
          • Instead, put the request on a list and process that list from the progress loop (see the second sketch after this list).
      • Disable atomics on the single-threaded path (atomics cost 20-30% in message rate).
        • Just needs to be closed. If the datatype or communicator is intrinsic, don't use atomics (see the third sketch after this list).
          • IBM was going to do this, but it will probably land in 2.0.2.
    • Need Reviews: 6 of 15 need to be reviewed.
      • Nathan is on one (waiting on Howard for a fix).
      • Ryan has 3. Jeff has a few.
    • Would like PMIx 1.1.5 in OMPI 2.0.1.
      • Created issue 1930 on master.
      • Howard would like to see the impact of NOT merging in PMIx 1.1.5.
        • Ralph - SLURM support in 2.0 would be broken without this.
        • Ralph - there are other reasons to do this upgrade; it was requested because of a problem.
      • Artem would need to review.
    • OpenSHMEM in 2.0 is DOA? (Carried over from last week.)
    • Howard mentioned he would like to add a performance regression test, especially for OB1 performance.
    • Old Issue 194, non-uniform BTL usage.
      • There is a performance regression test in MTT. Today it outputs a number, but no one tracks it.
      • Want some historical tracking. Ralph will take an action to put this into the Python client.
  • 2.1.0 : No date at the moment.
    • Mellanox needs PMIx 2.0 in 2.1.0
    • PMIx will release a 2.0 that just has shared-memory data as an addition,
      • but doesn't have everything else they were targeting for 2.0.0.
      • This should come out in early September.
      • This is the piece that Mellanox and IBM are interested in.
    • Put items requested on the wiki (e.g., PMIx direct modex, OpenSHMEM, stability improvements)
    • What do people want to see for 2.1.0?
    • Finalize the list at the Dallas meeting.
    • Hopefully target a Sept./Oct. release; Supercomputing is not the goal.
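
Sketch 1 (for the OB1 RDMA striping / PML_OB1_ALL_RDMA item above): a minimal, self-contained C illustration of the selection rule described in the notes; it is not Open MPI source, and all structure and function names are hypothetical. Only the rule itself comes from the discussion: by default, endpoints that are on the RDMA list but not on the eager list are ignored when striping send/recv traffic, and the opt-in flag re-enables them.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical view of an endpoint's capabilities. */
struct endpoint {
    const char *name;
    bool on_rdma_list;   /* usable for one-sided RDMA   */
    bool on_eager_list;  /* usable for eager send/recv  */
};

/* Hypothetical knob mirroring the PML_OB1_ALL_RDMA boolean from the notes;
 * defaults to false, i.e. the 1.10-like behavior. */
static bool use_all_rdma = false;

/* Should point-to-point traffic stripe across this endpoint? */
static bool usable_for_pt2pt(const struct endpoint *ep)
{
    if (use_all_rdma) {
        /* Opt-in: stripe across every RDMA-capable endpoint as well. */
        return ep->on_eager_list || ep->on_rdma_list;
    }
    /* Default: RDMA-only endpoints stay reserved for one-sided use. */
    return ep->on_eager_list;
}

int main(void)
{
    struct endpoint eps[] = {
        { "eager-only endpoint", false, true  },
        { "RDMA-only endpoint",  true,  false },
    };
    for (size_t i = 0; i < sizeof(eps) / sizeof(eps[0]); ++i) {
        printf("%-20s used for send/recv: %s\n", eps[i].name,
               usable_for_pt2pt(&eps[i]) ? "yes" : "no");
    }
    return 0;
}
```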
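
Sketch 2 (for the OSC pt2pt request-callback design flaw above): a minimal C illustration of the fix suggested in the notes, using hypothetical names. The completion callback only appends the request to a pending list; the progress loop later drains that list and calls start, so the callback never re-enters itself.

```c
#include <stdio.h>

#define MAX_PENDING 16

/* Hypothetical one-sided (OSC) request. */
struct osc_request {
    int id;
    int restart_count;
};

/* Pending-restart list filled by the completion callback. */
static struct osc_request *pending[MAX_PENDING];
static int num_pending = 0;

static void request_start(struct osc_request *req)
{
    req->restart_count++;
    printf("starting request %d (restart #%d)\n", req->id, req->restart_count);
}

/* Completion callback: do NOT call request_start() here (that is the
 * recursion described above); just queue the request. */
static void request_complete_cb(struct osc_request *req)
{
    if (num_pending < MAX_PENDING) {
        pending[num_pending++] = req;
    }
}

/* Progress loop: drain the list outside of any callback context. */
static void progress(void)
{
    while (num_pending > 0) {
        request_start(pending[--num_pending]);
    }
}

int main(void)
{
    struct osc_request r = { .id = 1, .restart_count = 0 };
    request_complete_cb(&r);  /* completion arrives                        */
    progress();               /* restart happens here, not in the callback */
    return 0;
}
```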
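
Sketch 3 (for the "disable atomics" blocker above): one plausible reading of the change, with hypothetical names; the notes only say "if the datatype or communicator is intrinsic, don't use atomics." Intrinsic (predefined) objects are never destroyed, so their reference counts can be left alone, and the single-threaded path can use a plain increment.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical reference-counted object (datatype, communicator, ...). */
struct obj {
    bool intrinsic;          /* predefined object, never destroyed */
    _Atomic int ref_count;
};

static void obj_retain(struct obj *o, bool threads_enabled)
{
    if (o->intrinsic) {
        /* Predefined objects are never freed; no need to count at all. */
        return;
    }
    if (!threads_enabled) {
        /* Single-threaded path: plain (relaxed) increment, no atomic RMW. */
        atomic_store_explicit(&o->ref_count,
                              atomic_load_explicit(&o->ref_count,
                                                   memory_order_relaxed) + 1,
                              memory_order_relaxed);
    } else {
        /* General path: full atomic increment. */
        atomic_fetch_add(&o->ref_count, 1);
    }
}

int main(void)
{
    struct obj comm_world = { .intrinsic = true,  .ref_count = 1 };
    struct obj user_comm  = { .intrinsic = false, .ref_count = 1 };

    obj_retain(&comm_world, true);   /* skipped: predefined object   */
    obj_retain(&user_comm,  false);  /* cheap non-atomic increment   */

    return atomic_load(&user_comm.ref_count) == 2 ? 0 : 1;
}
```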

Review Master MTT testing (https://mtt.open-mpi.org/)

  • MTT mail may not have been updated to go to the new lists; haven't seen it in a while.
    • Josh can update MTT with the new email list names.
    • Ick! Lots and lots of failures. A bunch from Cisco because of the upgrade to RHEL 7; maybe driver issues.
    • Over 1200 failures on master on PowerPC.
    • IBM is running both MTT and Jenkins, but wanted to run with the XL compiler; Nysal has a patch.
      • Only 42 tests ran on master.
      • ppc64 with the XL compiler segfaults in OB1 send.
      • On 2.x, Josh is playing with the configuration; hoping to ramp this up, but starting slowly.

MTT Dev status:

  • Meeting every week; a lot of progress has been made.
  • The Python client is ready to go for non-combinatorial runs (Howard's intern is working on combinatorial runs).
    • The standard list of tests should work. Ralph is testing today. Intel can run against their version of the MTT database.
    • Should be done within the next month.
    • Ralph is adding a provisioning plugin.

Website migration

  • Most of it is migrated now.
    • Jeff switched over the GitHub webhooks; they now go through HostGator.
    • The only things left to migrate are MTT and the Jenkins master.
    • Brian has gotten permission to host MTT and the Jenkins master, if the community would like; discuss at the face-to-face.

Open MPI Developer's Meeting

  • August 2016

  • If you are coming, make sure to register for the event and put your name on the wiki.

  • Facilities are the same as last year (but possibly a different room the first day).

  • Assessment of the request handling refactoring

    • Artem provided a summary of results
    • 1.10 vs. master
    • We need to run the benchmark more broadly and do some deeper analysis of the results.
    • Action items
      1. Arm to provide a 2.0 version of the benchmark for the community
      2. Artem to set up a wiki page with details on how to run, as a place to coordinate results
      3. Folks, please run the benchmarks
      4. Make sure this item stays on the agenda until resolved
      5. Wiki is here: https://github.com/open-mpi/ompi/wiki/Request-refactoring-test

Status Updates: (skipped this week)

  1. LANL
  2. Houston
  3. IBM

Status Update Rotation

  1. LANL, Houston, IBM
  2. Cisco, ORNL, UTK, NVIDIA
  3. Mellanox, Sandia, Intel

Back to 2016 WeeklyTelcon-2016
