WMCore Developers on shift

Alan Malta Rodrigues edited this page Jun 20, 2022 · 23 revisions

This document serves as a short list of the responsibilities to be covered by the WMCore developers during shift weeks.

The developers in the WMCore team usually share the load of operational responsibilities. A portion of those are recurring, such as following meetings and providing support to other teams, and they tend to take a lot of time, during which it is difficult to follow a parallel task that requires strong concentration. The shift week is a week during which one developer is dedicated to covering most of the operational activities on a regular schedule, so during that week their time is mostly filled with meetings and debugging. A non-exhaustive list is provided below.

Responsibilities:

  • Meetings - Besides our own weekly meeting, we need to cover a set of regular meetings with other teams, during which we try to provide useful technical information on the pieces of the WMCore system each team uses. For some of the meetings we have an agreement with the people leading the meeting to have the WMCore section at the beginning, but it is useful to stay to the very end even though not active, because we are often asked questions which pop up while discussions are ongoing. During those meetings we also keep the other teams up to date with our schedule of regular deployments and updates, as well as with major changes or important bug fixes concerning them.
  • Producing internal reports - The WMCore developer on shift serves as the contact between the outside world and the rest of the team. After every meeting (we tend to keep that interval short, while the info is still fresh), they provide a list of the topics discussed during that meeting, together with the replies they could or could not give and any outcomes if a firm decision has been taken. In some cases these result in action items for us, so we need to be sure each of us is on track. If a GH issue needs to be created to follow such an action, most of the time we ask the person who brought up the topic to create the GH issue according to the templates we have provided, and we follow through there.
  • Support - within a reasonable response time, where possible.
    • During those weeks many teams ask questions through the various channels of communication we follow, concerning internals of the system about which only we can provide information. Many of them concern not only different APIs and system behavior but also policies discussed far back in time and long forgotten.
    • Many times we need to provide support in debugging issues (especially with the P&R Team) which exceed the level of knowledge about the system itself, not only of the people using it and asking the question, but also our own.
  • System monitoring - We need to constantly monitor the health of the system - 24/7. We need to be sure that:
    • we provide uninterrupted usage for everybody who depends on the WMCore system
    • we do not have components down, resulting in stuck load and overfilling the system in a short amount of time
    • we provide the service bug free and, above all, make sure the way the whole system works does not result in data loss or corruption, e.g. because of continuous misbehavior of a component or an overlooked bug - this is a difficult task in general, not only during shift weeks.
  • Debugging:
    • on demand
    • on system failure
    • on bug discovery - not always leading to an immediate system failure
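
As an illustration of the "no components down" check listed under System monitoring above, here is a minimal sketch. It is not the monitoring WMCore actually uses. It assumes that a last-heartbeat timestamp is available for each component, and the component names, the `find_stale_components` helper and the 30-minute threshold are all hypothetical.

```python
from datetime import datetime, timedelta, timezone


def find_stale_components(last_heartbeats, now, max_age=timedelta(minutes=30)):
    """Return the sorted names of components whose last heartbeat is older than max_age.

    last_heartbeats: dict mapping a component name to the datetime of its last heartbeat.
    now: the reference time to compare against.
    """
    return sorted(name for name, seen in last_heartbeats.items() if now - seen > max_age)


# Hypothetical data: component names and timestamps are made up for illustration.
now = datetime(2022, 6, 20, 12, 0, tzinfo=timezone.utc)
heartbeats = {
    "ComponentA": now - timedelta(minutes=5),   # healthy
    "ComponentB": now - timedelta(hours=2),     # stale
    "ComponentC": now - timedelta(minutes=45),  # stale
}

stale = find_stale_components(heartbeats, now)
print(stale)  # ['ComponentB', 'ComponentC']
```

A check of this kind only flags candidates for a closer look; whether a flagged component is really down still has to be confirmed by the person on shift.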

Examples of typical debugging issues:

A good place to look:

Here is a wiki we started long ago for accumulating well-known misbehavior cases and possible actions to mitigate their effects (it still needs to be updated on a regular basis, though): https://github.com/dmwm/WMCore/wiki/trouble-shooting

Possible responsibilities agreed upon in the past, but which could not be fitted in fairly, because of the strong misalignment between the schedules of deployment cycles and the shift week rotation:

  • Release validation - we decided to follow that in GitHub issues and assign them by mutual agreement
  • Monitoring and support for the CMSWEB Team during regular central services deployments - this more or less still holds as a pure responsibility of the person on shift, even though sometimes one of us needs to follow a few consecutive cycles.
  • WMAgent deployment campaigns - currently mostly driven by Alan, for many reasons, but we can cover for him at any time if needed. The draining and monitoring is still a shared responsibility.

Channels to follow:

  • Slack channels:
    • P&R - all of them
    • WMCore - all of them
    • T0 - the T0 people follow our channels, but we also have the T0-dev channel in our Slack space
    • Rucio - mostly cms, cms-ops, cms-consistency
  • Mattermost channels:
    • All that may concern us in the O&C group (e.g. SI..) - people usually tag us explicitly if we are needed somewhere
    • DMWM is a must

Meetings to follow:

  • Monday:
    • WMCore
    • Compops
  • Tuesday: we used to have some, but it is now a day free of meetings
  • Wednesday:
    • T0
    • O&C
    • P&R
  • Thursday: free so far.
  • Friday:
    • P&R development

Monitoring we use: