Monitor Taskcluster PR and master runs #14210

foolip · 2018-11-23T09:59:43Z

Our Taskcluster setup is amazing, but not so reliable that we never have to worry about it, and also it's possible to break it with our own code changes. Issues so far:

The earliest ones were just part of getting the setup stable, but we will likely continue to have issues.

We should monitor at least:

Runs for PRs and master commits reliably starting and finishing. (Taskcluster status checks not finishing and blocking PRs #14165 for PRs)
That runs for master succeed.
That failure rates for PRs don't suddenly grow drastically.
That results also end up in wpt.fyi.

For humans to notice and act we need to either send email or have status who up in something like https://foolip.github.io/ecosystem-infra-rotation/ (in use) or https://foolip.github.io/wpt-rotation/ (failed experiment).

@lukebjerring given the wpt.fyi connection, do you have thoughts on how we should do this?

Also cc @web-platform-tests/admins for ideas.

(Verify correctness of Taskcluster PR checks is about verifying that the stability jobs on Taskcluster work as well as they did on Travis.)

foolip · 2018-11-23T10:01:35Z

In #13818 I've done some work to monitor Azure Pipelines in an ad-hoc manner. Turning that into something real and sharing it across CI systems might be one path.

Relevant technical detail: Taskcluster uses the Statuses API and Azure Pipelines the Checks API.

lukebjerring · 2018-11-23T12:43:11Z

Eventually, wpt.fyi is intending to respond to Taskcluster having kicked off and become pending until it hears further. If we set up some kind of timeout, this will cover the last item in your monitoring list.

foolip · 2018-11-23T18:50:29Z

Sounds good. If we also have a custom status check that's triggered for the same things as any of our runs (including master) perhaps that could implicitly monitor the other CIs in a way that avoids building a separate monitoring dashboard for this? Kind of a stretch?

foolip · 2018-11-27T09:53:50Z

In #13274 (comment) I reported a failure (https://tools.taskcluster.net/groups/ZklAzb_fTueVrBAWR0kGBA) that should have come up in monitoring. Had that been the daily commit (and we were no longer running Buildbot) we would have no aligned runs for that day.

@lukebjerring, is that a case that would be caught by the time of wpt.fyi monitoring you're considering?

Also explicitly poking @jgraham for ideas.

foolip · 2018-12-13T11:57:42Z

I've written a script that lists non-success statuses in the past week.

NodeJS script

'use strict';

const octokit = require('@octokit/rest')();

// https://github.com/octokit/rest.js/#pagination
function paginate(method, parameters) {
  const options = method.endpoint.merge(parameters);
  return octokit.paginate(options);
}

async function main() {
  octokit.authenticate({
    type: 'token',
    token: process.env.GH_TOKEN
  });

  const WEEK_MS = 7*24*3600*1000;
  const weekAgo = new Date(Date.now() - WEEK_MS).toISOString();

  const commits = await paginate(octokit.repos.listCommits, {
    owner: 'web-platform-tests',
    repo: 'wpt',
    since: weekAgo,
    per_page: 100
  });

  console.log(`Found ${commits.length} commits since ${weekAgo}`);

  for (const commit of commits) {
    // no checks on master yet
    /*
    const checks = await paginate(octokit.checks.listForRef, {
      owner: 'web-platform-tests',
      repo: 'wpt',
      ref: commit.sha,
    });
    */

    const statuses = await paginate(octokit.repos.listStatusesForRef, {
      owner: 'web-platform-tests',
      repo: 'wpt',
      ref: commit.sha,
    });

    // Statuses are in reverse chronological order, so only consider the first
    // (latest).
    const status = statuses.find(s => s.context === 'Taskcluster (push)');
    if (!status) {
      continue;
    }

    if (status.state !== 'success') {
      console.log(`${status.state}: ${status.target_url} (for ${commit.sha})`);
    }
  }
}

main();

The current output is:

failure: https://tools.taskcluster.net/task-group-inspector/#/ZNAaG-moRCG_2CPh3zNmGw (for 69d00d5f81ac1b29673878368055d24390819cc8)
failure: https://tools.taskcluster.net/task-group-inspector/#/cBwRwlUfQy-lpIr4lRMN_g (for e4a221e6c4aead59a34a234a9a3b1d7d161b98af)
failure: https://tools.taskcluster.net/task-group-inspector/#/UulPfqZwR7aaYmX4Bv8PHw (for ed68c260a90dae67b3a0781875e9e82473af95f8)
failure: https://tools.taskcluster.net/task-group-inspector/#/L-umDIXXQNWiWGAOpEt0mg (for 008bfc93426ab6973a843cc201f347b90ae3028f)

3 of them have a single failed task that might pass on rerun. https://tools.taskcluster.net/groups/cBwRwlUfQy-lpIr4lRMN_g failed all the firefox tasks, but also in a way that might pass on rerun.

Hexcles · 2018-12-13T17:52:12Z

Here's one weird example of flaky tests reported by Taskcluster:

#14469
https://tools.taskcluster.net/groups/V7chQRdcQ6iqnYZI-Q0MLw/tasks/YtWQy41TT6SraDjsPmbFtg/runs/0/logs/public%2Flogs%2Flive.log#L261

I retried the test a couple times by force pushing to the branch, and I observe the same result: the first run of the second affected test always fails. I can't reproduce the flakiness locally.

jgraham · 2018-12-13T19:19:38Z

Did you run it locally in the taskcluster docker image, using broadly the same command in in automation?

foolip · 2018-12-17T12:54:59Z

Filed https://github.com/octokit/rest.js/issues/1161 for an issue I've run into with the monitoring.

foolip · 2018-12-20T12:25:19Z

I've found two types of errors during clone of the repo now:
#14617
#14618

Monitoring should be able to distinguish infra failures from real failures in stability jobs.

foolip · 2019-01-25T12:12:26Z

In #15040 Taskcluster didn't start, which I haven't seen since #14165. That's another failure mode that monitoring should flag.

foolip · 2019-03-15T16:31:59Z

In #15820 Taskcluster again didn't start, so this is an error condition we really should monitor for.

foolip · 2019-04-25T13:05:33Z

I refreshed my script a bit today to look for PRs with to Taskcluster status.

'use strict';

const octokit = require('@octokit/rest')({
  auth: `token ${process.env.GH_TOKEN}`,
});

// https://github.com/octokit/rest.js/#pagination
function paginate(method, parameters) {
  const options = method.endpoint.merge(parameters);
  return octokit.paginate(options);
}

function getChecksForRef(ref) {
  return paginate(octokit.checks.listForRef, {
    owner: 'web-platform-tests',
    repo: 'wpt',
    ref,
    per_page: 100,
  });
}

async function getStatusesForRef(ref) {
  const statuses = await paginate(octokit.repos.listStatusesForRef, {
    owner: 'web-platform-tests',
    repo: 'wpt',
    ref,
    per_page: 100,
  });

  // Statuses are in reverse chronological order, so filter out all but the
  // first for each unique context string.
  const seenContexts = new Set;
  return statuses.filter(status => {
    const context = status.context;
    if (seenContexts.has(context)) {
      return false;
    }
    seenContexts.add(context);
    return true;
  });
}

function isRecentlyPendingCheck(check, maxAge = 6*3600) {
  if (check.conclusion === null && check.updated_at) {
    const updated = Date.parse(check.updated_at);
    const age = (Date.now() - updated) / 1000;
    if (age < maxAge) {
      return true;
    }
  }
  return false;
}

function isRecentlyPendingStatus(status, maxAge = 2*3600) {
  if (status.state === 'pending' && status.updated_at) {
    const updated = Date.parse(status.updated_at);
    const age = (Date.now() - updated) / 1000;
    if (age < maxAge) {
      return true;
    }
  }
  return false;
}

async function checkMaster(since) {
  const commits = await paginate(octokit.repos.listCommits, {
    owner: 'web-platform-tests',
    repo: 'wpt',
    since,
    per_page: 100
  });

  console.log(`Found ${commits.length} commits since ${since}`);

  for (const commit of commits) {
    const checks = await getChecksForRef(commit.sha);
    for (const check of checks) {
      if (check.conclusion === 'success' || check.conclusion === 'neutral') {
        continue;
      }

      if (isRecentlyPendingCheck(check)) {
        continue;
      }

      console.log(`${check.conclusion}: ${check.details_url} (for ${commit.sha})`);
    }

    const statuses = await getStatusesForRef(commit.sha);
    const status = statuses.find(s => s.context === 'Taskcluster (push)');
    if (!status) {
      continue;
    }

    if (isRecentlyPendingStatus(status)) {
      continue;
    }

    if (status.state !== 'success') {
      console.log(`${status.state}: ${status.target_url} (for ${commit.sha})`);
    }
  }
}

async function checkPRs(since) {
  const prs = await paginate(octokit.search.issuesAndPullRequests, {
    q: `repo:web-platform-tests/wpt is:pr is:open updated:>${since}`,
    per_page: 100,
  });

  console.log(`Found ${prs.length} PRs since ${since}`);

  for (const pr of prs) {
    const commits = await paginate(octokit.pulls.listCommits, {
      owner: 'web-platform-tests',
      repo: 'wpt',
      pull_number: pr.number,
      per_page: 100,
    });

    if (commits.length >= 100) {
      console.warn(`Ignoring PR #${pr.number} because it has too many (>=100) commits`);
      continue;
    }

    // only look at the final commit
    const commit = commits[commits.length - 1];

    const statuses = await getStatusesForRef(commit.sha);
    const status = statuses.find(s => s.context === 'Taskcluster (pull_request)');
    if (!status) {
      console.log(`No Taskcluster status for ${pr.html_url}`);
      continue;
    }

    if (isRecentlyPendingStatus(status)) {
      continue;
    }

    if (status.state !== 'success' && status.state !== 'failure') {
      console.log(`${status.state}: ${status.target_url} (for ${pr.html_url})`);
    }
  }
}

async function main() {
  const WEEK_MS = 7*24*3600*1000;
  const weekAgo = new Date(Date.now() - WEEK_MS).toISOString();
  // Get rid of milliseconds, GitHub doesn't support it.
  const since = weekAgo.replace(/\.[0-9]+Z/, 'Z');

  //await checkMaster(since);
  await checkPRs(since);
}

main();

Result is:

Found 109 PRs since 2019-04-18T12:55:01Z
No Taskcluster status for https://github.com/web-platform-tests/wpt/pull/12805
No Taskcluster status for https://github.com/web-platform-tests/wpt/pull/12803
No Taskcluster status for https://github.com/web-platform-tests/wpt/pull/10420

The first two are IDL sync PRs that don't matter, and I've commented on #10420 to explain what's wrong.

So it's good news, it doesn't look like Taskcluster is missing PRs at a high rate at least. I haven't looked beyond a week back.

foolip · 2019-04-25T13:07:57Z

@Hexcles do you think any of this monitoring could be incorporated with your work on results receiver validation? There's a big difference between a CI system failing to start at all and the results being bad, but we're interested in the overall system working to deliver results, so it'd be great to not have to monitor the different parts separately. I don't know how practical that is though.

foolip · 2019-09-23T12:35:55Z

@jgraham what do you think the right layer for finding out what Taskcluster runs should have been triggered is?

If the problem described in #19207 is very rare then we could look at all tagged commits and all PRs, but that wouldn't catch when Taskcluster is failing to even start. Would one have to monitor webhooks to work out what was requested?

@Hexcles this is similar to the question of how we'd record pending runs for wpt.fyi before they're finished, do you have thoughts?

stephenmcgruer · 2020-05-04T20:56:05Z

This falls under @LukeZielinski 's monitoring/metrics efforts, but I'm not sure what the priority is for TaskCluster runs. Luke - feel free to downgrade to backlog.

stephenmcgruer · 2020-07-12T15:36:20Z

We started to look at this in Q2, and did land some monitoring setup in https://github.com/ecosystem-Infra/wptmon that monitored Taskcluster runs. However:

It never progressed past an experiment,
it hasn't been updated for the switch to the Checks API, and
we aren't scheduling anymore time on this currently.

As such, dropping to backlog and removing Luke.

foolip · 2021-04-30T13:46:04Z

Back in 2019 I wrote some scripts in https://github.com/foolip/wpt-stats which I seemingly never mentioned here. They still exist and do some checking of both master and PR runs, but don't really summarize it into anything useful.

foolip · 2021-05-06T16:36:58Z

I'll go ahead and close this, it's not concrete enough to act on.

foolip mentioned this issue Nov 23, 2018

Enable Azure Pipelines #13818

Closed

foolip added infra priority:roadmap labels Nov 27, 2018

Hexcles mentioned this issue Dec 3, 2018

Taskcluster Chrome runs for commit 71fe0a6c3d (epochs/daily) failed #13989

Closed

foolip self-assigned this Dec 13, 2018

foolip mentioned this issue Dec 18, 2018

Safari check runs not showing up for most PRs web-platform-tests/wpt.fyi#945

Closed

foolip mentioned this issue Jan 9, 2019

Monitor Azure Pipelines runs #14763

Closed

foolip mentioned this issue Jan 25, 2019

Trusted Types: Implement Origin Trial (TrustedDOMTypes) #15040

Merged

foolip mentioned this issue Feb 15, 2019

Cause of Taskcluster failures are too hard to figure out #15412

Closed

gsnedders added the Taskcluster label Feb 19, 2019

foolip mentioned this issue Apr 25, 2019

Master runs from Taskcluster via Checks API cannot be ingested web-platform-tests/wpt.fyi#1240

Closed

foolip mentioned this issue Aug 22, 2019

Commits touching many files fail to run in Taskcluster #18608

Closed

This was referenced Sep 23, 2019

[Taskcluster] run Chrome and Firefox beta daily #19212

Closed

Chrome/Firefox beta runs aren't reliable #19206

Closed

foolip assigned stephenmcgruer and unassigned foolip Oct 25, 2019

stephenmcgruer assigned LukeZielinski and unassigned stephenmcgruer May 4, 2020

stephenmcgruer added priority:backlog and removed priority:roadmap labels Jul 12, 2020

stephenmcgruer unassigned LukeZielinski Jul 12, 2020

foolip closed this as completed May 6, 2021

foolip mentioned this issue May 6, 2021

Suggestion: track (and drive down) a metric like "percent of times a trivial PR to master fails CI" #21706

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Monitor Taskcluster PR and master runs #14210

Monitor Taskcluster PR and master runs #14210

foolip commented Nov 23, 2018

foolip commented Nov 23, 2018

lukebjerring commented Nov 23, 2018

foolip commented Nov 23, 2018

foolip commented Nov 27, 2018

foolip commented Dec 13, 2018

Hexcles commented Dec 13, 2018

jgraham commented Dec 13, 2018

foolip commented Dec 17, 2018

foolip commented Dec 20, 2018

foolip commented Jan 25, 2019

foolip commented Mar 15, 2019

foolip commented Apr 25, 2019

foolip commented Apr 25, 2019

foolip commented Sep 23, 2019

stephenmcgruer commented May 4, 2020

stephenmcgruer commented Jul 12, 2020

foolip commented Apr 30, 2021

foolip commented May 6, 2021

Monitor Taskcluster PR and master runs #14210

Monitor Taskcluster PR and master runs #14210

Comments

foolip commented Nov 23, 2018

foolip commented Nov 23, 2018

lukebjerring commented Nov 23, 2018

foolip commented Nov 23, 2018

foolip commented Nov 27, 2018

foolip commented Dec 13, 2018

Hexcles commented Dec 13, 2018

jgraham commented Dec 13, 2018

foolip commented Dec 17, 2018

foolip commented Dec 20, 2018

foolip commented Jan 25, 2019

foolip commented Mar 15, 2019

foolip commented Apr 25, 2019

foolip commented Apr 25, 2019

foolip commented Sep 23, 2019

stephenmcgruer commented May 4, 2020

stephenmcgruer commented Jul 12, 2020

foolip commented Apr 30, 2021

foolip commented May 6, 2021