-
Notifications
You must be signed in to change notification settings - Fork 3.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Monitor Taskcluster PR and master runs #14210
Comments
In #13818 I've done some work to monitor Azure Pipelines in an ad-hoc manner. Turning that into something real and sharing it across CI systems might be one path. Relevant technical detail: Taskcluster uses the Statuses API and Azure Pipelines the Checks API. |
Eventually, wpt.fyi is intending to respond to Taskcluster having kicked off and become pending until it hears further. If we set up some kind of timeout, this will cover the last item in your monitoring list. |
Sounds good. If we also have a custom status check that's triggered for the same things as any of our runs (including master) perhaps that could implicitly monitor the other CIs in a way that avoids building a separate monitoring dashboard for this? Kind of a stretch? |
In #13274 (comment) I reported a failure (https://tools.taskcluster.net/groups/ZklAzb_fTueVrBAWR0kGBA) that should have come up in monitoring. Had that been the daily commit (and we were no longer running Buildbot) we would have no aligned runs for that day. @lukebjerring, is that a case that would be caught by the time of wpt.fyi monitoring you're considering? Also explicitly poking @jgraham for ideas. |
I've written a script that lists non-success statuses in the past week. NodeJS script'use strict';
const octokit = require('@octokit/rest')();
// https://github.com/octokit/rest.js/#pagination
function paginate(method, parameters) {
const options = method.endpoint.merge(parameters);
return octokit.paginate(options);
}
async function main() {
octokit.authenticate({
type: 'token',
token: process.env.GH_TOKEN
});
const WEEK_MS = 7*24*3600*1000;
const weekAgo = new Date(Date.now() - WEEK_MS).toISOString();
const commits = await paginate(octokit.repos.listCommits, {
owner: 'web-platform-tests',
repo: 'wpt',
since: weekAgo,
per_page: 100
});
console.log(`Found ${commits.length} commits since ${weekAgo}`);
for (const commit of commits) {
// no checks on master yet
/*
const checks = await paginate(octokit.checks.listForRef, {
owner: 'web-platform-tests',
repo: 'wpt',
ref: commit.sha,
});
*/
const statuses = await paginate(octokit.repos.listStatusesForRef, {
owner: 'web-platform-tests',
repo: 'wpt',
ref: commit.sha,
});
// Statuses are in reverse chronological order, so only consider the first
// (latest).
const status = statuses.find(s => s.context === 'Taskcluster (push)');
if (!status) {
continue;
}
if (status.state !== 'success') {
console.log(`${status.state}: ${status.target_url} (for ${commit.sha})`);
}
}
}
main(); The current output is:
3 of them have a single failed task that might pass on rerun. https://tools.taskcluster.net/groups/cBwRwlUfQy-lpIr4lRMN_g failed all the firefox tasks, but also in a way that might pass on rerun. |
Here's one weird example of flaky tests reported by Taskcluster: #14469 I retried the test a couple times by force pushing to the branch, and I observe the same result: the first run of the second affected test always fails. I can't reproduce the flakiness locally. |
Did you run it locally in the taskcluster docker image, using broadly the same command in in automation? |
Filed https://github.com/octokit/rest.js/issues/1161 for an issue I've run into with the monitoring. |
In #15820 Taskcluster again didn't start, so this is an error condition we really should monitor for. |
I refreshed my script a bit today to look for PRs with to Taskcluster status.'use strict';
const octokit = require('@octokit/rest')({
auth: `token ${process.env.GH_TOKEN}`,
});
// https://github.com/octokit/rest.js/#pagination
function paginate(method, parameters) {
const options = method.endpoint.merge(parameters);
return octokit.paginate(options);
}
function getChecksForRef(ref) {
return paginate(octokit.checks.listForRef, {
owner: 'web-platform-tests',
repo: 'wpt',
ref,
per_page: 100,
});
}
async function getStatusesForRef(ref) {
const statuses = await paginate(octokit.repos.listStatusesForRef, {
owner: 'web-platform-tests',
repo: 'wpt',
ref,
per_page: 100,
});
// Statuses are in reverse chronological order, so filter out all but the
// first for each unique context string.
const seenContexts = new Set;
return statuses.filter(status => {
const context = status.context;
if (seenContexts.has(context)) {
return false;
}
seenContexts.add(context);
return true;
});
}
function isRecentlyPendingCheck(check, maxAge = 6*3600) {
if (check.conclusion === null && check.updated_at) {
const updated = Date.parse(check.updated_at);
const age = (Date.now() - updated) / 1000;
if (age < maxAge) {
return true;
}
}
return false;
}
function isRecentlyPendingStatus(status, maxAge = 2*3600) {
if (status.state === 'pending' && status.updated_at) {
const updated = Date.parse(status.updated_at);
const age = (Date.now() - updated) / 1000;
if (age < maxAge) {
return true;
}
}
return false;
}
async function checkMaster(since) {
const commits = await paginate(octokit.repos.listCommits, {
owner: 'web-platform-tests',
repo: 'wpt',
since,
per_page: 100
});
console.log(`Found ${commits.length} commits since ${since}`);
for (const commit of commits) {
const checks = await getChecksForRef(commit.sha);
for (const check of checks) {
if (check.conclusion === 'success' || check.conclusion === 'neutral') {
continue;
}
if (isRecentlyPendingCheck(check)) {
continue;
}
console.log(`${check.conclusion}: ${check.details_url} (for ${commit.sha})`);
}
const statuses = await getStatusesForRef(commit.sha);
const status = statuses.find(s => s.context === 'Taskcluster (push)');
if (!status) {
continue;
}
if (isRecentlyPendingStatus(status)) {
continue;
}
if (status.state !== 'success') {
console.log(`${status.state}: ${status.target_url} (for ${commit.sha})`);
}
}
}
async function checkPRs(since) {
const prs = await paginate(octokit.search.issuesAndPullRequests, {
q: `repo:web-platform-tests/wpt is:pr is:open updated:>${since}`,
per_page: 100,
});
console.log(`Found ${prs.length} PRs since ${since}`);
for (const pr of prs) {
const commits = await paginate(octokit.pulls.listCommits, {
owner: 'web-platform-tests',
repo: 'wpt',
pull_number: pr.number,
per_page: 100,
});
if (commits.length >= 100) {
console.warn(`Ignoring PR #${pr.number} because it has too many (>=100) commits`);
continue;
}
// only look at the final commit
const commit = commits[commits.length - 1];
const statuses = await getStatusesForRef(commit.sha);
const status = statuses.find(s => s.context === 'Taskcluster (pull_request)');
if (!status) {
console.log(`No Taskcluster status for ${pr.html_url}`);
continue;
}
if (isRecentlyPendingStatus(status)) {
continue;
}
if (status.state !== 'success' && status.state !== 'failure') {
console.log(`${status.state}: ${status.target_url} (for ${pr.html_url})`);
}
}
}
async function main() {
const WEEK_MS = 7*24*3600*1000;
const weekAgo = new Date(Date.now() - WEEK_MS).toISOString();
// Get rid of milliseconds, GitHub doesn't support it.
const since = weekAgo.replace(/\.[0-9]+Z/, 'Z');
//await checkMaster(since);
await checkPRs(since);
}
main(); Result is:
The first two are IDL sync PRs that don't matter, and I've commented on #10420 to explain what's wrong. So it's good news, it doesn't look like Taskcluster is missing PRs at a high rate at least. I haven't looked beyond a week back. |
@Hexcles do you think any of this monitoring could be incorporated with your work on results receiver validation? There's a big difference between a CI system failing to start at all and the results being bad, but we're interested in the overall system working to deliver results, so it'd be great to not have to monitor the different parts separately. I don't know how practical that is though. |
@jgraham what do you think the right layer for finding out what Taskcluster runs should have been triggered is? If the problem described in #19207 is very rare then we could look at all tagged commits and all PRs, but that wouldn't catch when Taskcluster is failing to even start. Would one have to monitor webhooks to work out what was requested? @Hexcles this is similar to the question of how we'd record pending runs for wpt.fyi before they're finished, do you have thoughts? |
This falls under @LukeZielinski 's monitoring/metrics efforts, but I'm not sure what the priority is for TaskCluster runs. Luke - feel free to downgrade to backlog. |
We started to look at this in Q2, and did land some monitoring setup in https://github.com/ecosystem-Infra/wptmon that monitored Taskcluster runs. However:
As such, dropping to backlog and removing Luke. |
Back in 2019 I wrote some scripts in https://github.com/foolip/wpt-stats which I seemingly never mentioned here. They still exist and do some checking of both master and PR runs, but don't really summarize it into anything useful. |
I'll go ahead and close this, it's not concrete enough to act on. |
Our Taskcluster setup is amazing, but not so reliable that we never have to worry about it, and also it's possible to break it with our own code changes. Issues so far:
The earliest ones were just part of getting the setup stable, but we will likely continue to have issues.
We should monitor at least:
For humans to notice and act we need to either send email or have status who up in something like https://foolip.github.io/ecosystem-infra-rotation/ (in use) or https://foolip.github.io/wpt-rotation/ (failed experiment).
@lukebjerring given the wpt.fyi connection, do you have thoughts on how we should do this?
Also cc @web-platform-tests/admins for ideas.
(Verify correctness of Taskcluster PR checks is about verifying that the stability jobs on Taskcluster work as well as they did on Travis.)
The text was updated successfully, but these errors were encountered: