Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Validate pull requests in TaskCluster #12657

Merged
merged 20 commits into from
Sep 12, 2018

Conversation

jugglinmike
Copy link
Contributor

@jugglinmike jugglinmike commented Aug 23, 2018

Extend the configuration for the TaskCluster service to generate tasks in
response to GitHub pull requests. Each pull request should create one cluster
containing four tasks, all of which operate on the affected tests as identified
by wpt tests-affected:

  • run the tests in Chrome and publish the results as a gzipped artifact
  • run the tests in Firefox and publish the results as a gzipped artifact
  • verify the stability of the tests in Chrome
  • verify the stability of the tests in Firefox

The former two tasks are intended for future use in comparing test results with
data available on https://wpt.fyi. This will be informative only. The latter
two tasks are intended to verify that the affected tests are stable. This will
influence whether or not the pull request may be merged.

A demonstration of expected functionality is available on Bocoup's fork of WPT:

It seems wise to vet this in WPT for a bit before allowing the results to
control whether pull requests may be merged. The configuration proposed here
will permit unstable pull requests, but it stores the status of the check in a
dedicated artifact. After some time, we can use this to compare the behavior
between the current TravisCI-powered checks and this new one. If they align, we
can revert the second commit.

Due to some changes to indentation, using -w or the ?w=1 query string
parameter
may make this change set easier to review.

[fixes #10503]

@jugglinmike jugglinmike force-pushed the taskcluster-stability-2 branch from 70c22b2 to c3f109b Compare August 24, 2018 00:37
@jugglinmike
Copy link
Contributor Author

I had to iterate on this a bit after changing the upstream repository, but this patch is functioning as expected. I've rebased and pushed up a couple of commits to demonstrate how the CI handles patches with no affected tests and how it handles patches with one or more affected tests (and unstable ones, at that). Those final two commits should not be merged to master.

Copy link
Contributor

@jgraham jgraham left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very very excited to see this land. But I think the implementation got a little hard to maintain by trying to force everything to go via a shell script.

.taskcluster.yml Outdated
owner: ${event.pusher.email}
source: ${event.repository.url}
payload:
image: jugglinmike/web-platform-tests:0.18
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We really need to set up a shared dockerhub account :)

.taskcluster.yml Outdated
- name: wpt-${browser.name}-${browser.channel}-stability
description: >-
Verify that all tests affected by a pull request are stable
when executed in ${browser.name}. As of 2018-08-23, this task
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove the extra text from the description; I don't think it's that helpful (we're more likely to forget to remove it later).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like there to be some contributor-facing indication of this behavior.
I appreciate the concern about bitrot, though, so I've moved it to a statement
logged to standard out immediately following the command invocation. Does that
seem safe enough to you?

.taskcluster.yml Outdated
# Bash removes null bytes from string values when set as
# environment variables. This invalidates the output of `wpt
# affected-tests` because it uses the null byte as a separator
# between test names. The list of effected tests is
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That should be affected. But this complexity seems like a clue that a better solution is required.

# to be the name of the browser under test. This restricts the syntax available
# to consumers: value-accepting options must be specified using the equals sign
# (`=`).
for argument in $@; do
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like this exceeded my complexity threshold for a bash script. If we want a single entry point for all the things, let's make it a python script instead.

@gsnedders
Copy link
Member

From #10503:

In TravisCI today, we script unnecessary jobs to exit early. This avoids unnecessary work, but it also makes task reports a little noisy (since they include meaningless results for those no-op jobs).

A nice-to-have improvement would be to use TaskCluster's "decision tasks" feature to avoid scheduling unnecessary tasks.

At the same time, I'm unconvinced we should block doing this on having that working, because it's not a regression currently.

@jugglinmike jugglinmike force-pushed the taskcluster-stability-2 branch from 82e452d to ad8f76b Compare August 29, 2018 01:57
@jugglinmike jugglinmike force-pushed the taskcluster-stability-2 branch from 569c1b6 to 0d6c2a2 Compare August 29, 2018 02:03
@jugglinmike
Copy link
Contributor Author

@jgraham I've re-implemented ci_taskcluster.sh to taskcluster-run.py (the
ci_ prefix seemed superfluous given its location in a directory named ci/).
You were right to recommend this; the logic that wired the commands together
was definitely too complex to be readable in Bash.

This introduces more indirection, and I'm on the fence about whether it's
justified. The alternative is to thin out the wrapper script, and while that
makes the .taskcluster.yml file more literal, it also requires a fair amount
of repetition. I've implemented the more concise version with a separate commit
so you can more clearly see what I'm talking about. Do you have an opinion
about this?

tools/ci/taskcluster-run.py Show resolved Hide resolved
import subprocess

browser_specific_args = {
"firefox": ["--install-browser", "--reftest-internal"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

--reftest-internal is the default, I think.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you're right

tools/wptrunner/wptrunner/wptcommandline.py:    if kwargs["reftest_internal"] is None:
tools/wptrunner/wptrunner/wptcommandline.py-        # Default to the internal reftest implementation on Linux and OSX
tools/wptrunner/wptrunner/wptcommandline.py:        kwargs["reftest_internal"] = sys.platform.startswith("linux") or sys.platform.startswith("darwin")

I wanted to verify with the person who added this flag, and it turns out that was you:

Explicitly set Firefox to use the fast reftest runner.

This should happen by default on Linux, but it doesn't hurt to be explicit.

Have you had a change of heart? Or should we keep the flag for the sake of explicitness?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's remove the flag and aim for a situation where we can use the same flags for all browsers.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You got it.

@@ -0,0 +1,101 @@
#!/usr/bin/env python
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, I think I'm OK with landing this as-is, but I wonder what the effect would be of moving the logic in this file into wpt run directly? If one added --commit-range as an argument to that function then the only things that wouldn't directly fit would be getting the arguments right per-browser and gzipping artifacts. Those could perhaps be moved out into the task definitions.

}

def tests_affected(commit_range):
output = subprocess.check_output([
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could, of course, just import and use the function directly rather than going via a process.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jgraham wouldn't that require hacking sys.path? If so, I'm in favour of forking.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. I think it makes more long term sense to move all of this into wpt run and not have another wrapper script at all. So I don't object to defering that change here.

Copy link
Member

@Hexcles Hexcles left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The implementation LGTM. I didn't look into the Travis & TC failures though. Could you take a look?

tools/ci/taskcluster-run.py Show resolved Hide resolved
}

def tests_affected(commit_range):
output = subprocess.check_output([
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jgraham wouldn't that require hacking sys.path? If so, I'm in favour of forking.

@jugglinmike
Copy link
Contributor Author

@jgraham Instead of installing Firefox via wpt run, we could install it via wpt install firefox browser from within start.sh. Admittedly, that script is a little more difficult to maintain since it's built into the Docker image. However, refactoring like this will make the script's behavior more consistent, and (depending on how you feel about the technically-unnecessary argument), it could allow us to remove all browser-specific run arguments.

@jugglinmike
Copy link
Contributor Author

Yes. I think it makes more long term sense to move all of this into wpt run and not have another wrapper script at all. So I don't object to defering that change here.

This aspect of the wrapper is motivated by my naive attempt to print the affected tests to standard out. Without that, it would be a simple matter of piping wpt tests-affected into wpt run. Then again, after reviewing the xargs docs, it looks like the --verbose flag would get us the functionality without the need for all the encoding nonsense:

$ ./wpt tests-affected --null master... | xargs --verbose --null ./wpt run firefox
INFO:manifest:Updating manifest
./wpt run firefox dom/events/CustomEvent.html 
Using certutil /usr/bin/certutil
(etc.)

That serves the same purpose as the logging statement in this patch:

logger.info("Executing command: %s" % " ".join(command))

Considering that we may not need browser-specific arguments, either, we may be able to compose these tasks in the .taskcluster.yml file, after all. Look ma, no wrapper. I'd be happy to keep iterating in this direction, but since we're also considering the long-term, it would be helpful for me to understand your design sensibilities.

Do either of you see value in maintaining separation between the various sub-commands? I don't mean to push the UNIX philosophy just for the sake of it, and I know that technically speaking, these are all implemented in the same application. That said, as someone who's been collecting results for a while, I appreciate the ability to compose tasks without patching WPT. I also find it easier to find features that are exposed in their own terms rather than as arguments to the increasingly-large wpt run subcommand. This could also allow us to be more strict in the contract we offer to consumers--if we were able to formally document a stable Python API (or remove it altogether), then I think the contribution experience would be even better.

@jgraham
Copy link
Contributor

jgraham commented Aug 30, 2018

@jgraham Instead of installing Firefox via wpt run, we could install it via wpt install firefox browser from within start.sh. Admittedly, that script is a little more difficult to maintain since it's built into the Docker image. However, refactoring like this will make the script's behavior more consistent, and (depending on how you feel about the technically-unnecessary argument), it could allow us to remove all browser-specific run arguments.

I was thinking about the opposite; making wpt install chrome browser (and hence wpt run --install-browser chrome) actually work in the case that you're on Ubuntu/Debian Linux and can get root (and maybe adding more configurations in the future). Ideally I'd like to have less in the start.sh script becase updating the docker image is such a pain.

Do either of you see value in maintaining separation between the various sub-commands? I don't mean to push the UNIX philosophy just for the sake of it, and I know that technically speaking, these are all implemented in the same application. That said, as someone who's been collecting results for a while, I appreciate the ability to compose tasks without patching WPT. I also find it easier to find features that are exposed in their own terms rather than as arguments to the increasingly-large wpt run subcommand. This could also allow us to be more strict in the contract we offer to consumers--if we were able to formally document a stable Python API (or remove it altogether), then I think the contribution experience would be even better.

So, I think there are just a series of tradeoffs here and no clearly obvious right answer. Having a series of independent subcommands is great because, as you say, we can script different things together using common utilities, and we get a cleaner separation between concerns for testing. But on the other hand it has real and serious disadvantages:

  • It makes it harder to reproduce results locally. Having a single command that can just be copied and pasted locally is a big advanatage for reproducing results, especially compared to something that requies CI-specific setup (gecko CI has an egregious case of this where the way things run on CI is via a wrapper application that's very difficult to get running on a local machine, and it's awful. Nothing we are doing is that bad, but it's a concern).

  • It's not very portable. Bash scripts aren't going to work on Windows unless people are using WSL, so it becomes particuarly hard for Windows users to figure out how to reproduce what they saw in CI.

  • We end up with a collection of single-purpose shell scripts that themselves represent significant complexity; although I like the idea of lots of small composable utilities (and indeed that's how the wpt cli is structured in large part), it is often overlooked that the composition itself is a program that has to be maintained.

So I don't have a single answer at the moment, but my inclination such as it is is to keep building composable utilities for flexibility and experimentation, but to make the most common patterns baked in features (particularly of wpt run which is largely a wrapper around other functionaility anyway) so that they are portable and easy to run locally.

The simplest solution would be to perform a full repository fetch for every pull request. I don't know what effect this would have on build times, though maybe we could use TaskCluster's caching feature to optimize that.

So, the current start.sh script is supposed to perform a shallow clone and then if we don't have the commit we are looking for extend it to a full clone. But I think that for PRs we probably want to modify the script to pull from the refs/pulls namespace directly rather than first pulling in some of master and then extending it with the relevant PR. We can get the number of commits in the PR from the GH API so can figure out how much we need to pull to get the whole PR plus the merge base.

@jugglinmike
Copy link
Contributor Author

So I don't have a single answer at the moment, but my inclination such as it
is is to keep building composable utilities for flexibility and
experimentation, but to make the most common patterns baked in features
(particularly of wpt run which is largely a wrapper around other
functionaility anyway) so that they are portable and easy to run locally.

Sounds good to me. Thanks for taking the time to write all of that up!

We can get the number of commits in the PR from the GH API so can figure out
how much we need to pull to get the whole PR plus the merge base.

I've pushed up a commit to do this. It necessitated a change to start.sh
because we can't use git clone to initialize a repository from an arbitrary
reference such as refs/pull/1/head (the command only supports branch names
and tag names). I renamed the variable REV to REVISION in order to avoid
confusion with the new REF variable.

That said, the strategy may be incomplete. If we test from the tip of the pull
request branch, then we will not be including any commits that may have landed
in master since the pull request was opened. That means the result reported
by TaskCluster might not reflect what we'd see after the patch was merged.

Contributors could get a more accurate picture by rebasing their patch, but I
don't know if many people would expect this. We could perform a merge in
TaskCluster (or maybe use the merge_commit_sha from GitHub). Would it be an
issue that we couldn't validate unmergable pull requests?

That idea highlights an underlying problem: we will only run these tasks in
response to a pull request event. Even with a "test merge" strategy, the
results will fall out of date if master advances and no new commits are
pushed. So staleness may be unavoidable to some extent.

These are all questions that I'm sure TravisCI et. al have worked out already,
but at the moment, I can't bring to mind any evidence of how they operate. I'll
look into that tomorrow, but I thought I'd leave some rambling here in case
@jgraham (or anyone else) feels like this issue is worth discussing.

This improves the authenticity of the reported results because it
simulates how the patch will behave after it is merged. This also mimics
the behavior of the TravisCI continuous integration platform.
The `start.sh` script now supports all git references, so this
computation is no longer necessary.
@jugglinmike
Copy link
Contributor Author

I think wpt run will do this be default now, if possible.

wpt run does attempt to download a manifest file by default, but wpt tests-affected (which precedes run in this script) does not. That seems appropriate for a lower-level utility like wpt tests-affected, so I've persisted the invocation of wpt manifest-download.

Regarding revision selection: a quick review of some TravisCI logs showed that they use a GitHub-provided reference I was not aware of: refs/pull/{pull request number}/merge. Since TravisCI has precedent in this project and on GitHub at large, I've updated the TaskCluster configuration to mimic this behavior. It means that the results will tend to reflect how the patch will behave after being merged to master, regardless of whether the author has rebased. There are a few other benefits, too. The diffing logic is simplified since we can just compare the first parent. Also, because the git repository now includes a commit present in master, we are downloading the generated manifest (rather than creating it from scratch).

I'm removing the "do not merge yet label"; this should be ready for another review. @jgraham and/or @Hexcles: would you mind?

@jugglinmike
Copy link
Contributor Author

@jgraham @Hexcles In master today, TravisCI is configured to verify stability using the command ./wpt check-stability. This patch configures TaskCluster to verify stability using the command wpt run --verify. I chose this command because check-stability performs extra git/GitHub operations and because it includes TravisCI-specific code. I assumed validation heuristics of the commands were equivalent, but while researching an issue filed against the results-collection project, I learned that they differ.

I walked through an example in that issue report, but to summarize:

  • ./wpt check-stability interleaves the tests
  • ./wpt run --verify executes all iterations of each test separately

Is this difference intentional? The former allowed an unstable test to pass through undetected, but that doesn't necessarily mean the latter is technically superior.

@jgraham
Copy link
Contributor

jgraham commented Sep 4, 2018

The difference is semi-intentional in that the --verify behaviour matches what other suites do on mozilla-central where the approach to test verification is to verify every test one at a time until some time limit is reached. This is an alternative apporach to the problem of how to handle time limits in the case that many tests need to be validated. I guess it also means that if one test is leaking state the others aren't affected (although that's bad in itself; doing this really really well would require running whole directories which has a significant time penalty).

Anyway, my general feeling is that the difference is OK and we should try out this patch as is without trying to change the semantics.

.taskcluster.yml Outdated Show resolved Hide resolved
import subprocess

browser_specific_args = {
"firefox": ["--install-browser", "--reftest-internal"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's remove the flag and aim for a situation where we can use the same flags for all browsers.

tools/docker/start.sh Show resolved Hide resolved
tools/ci/taskcluster-run.py Show resolved Hide resolved
`--reftest-internal` is enabled by default in GNU/Linux environments.
@jugglinmike
Copy link
Contributor Author

I've included another commit to avoid a runtime exception for pull requests that have zero affected tests. We can validate that prior to merging when we remove the intentional instability.

@jugglinmike
Copy link
Contributor Author

@jgraham I've resolved the conflicts introduced by gh-12679 and triggered both types of jobs:

Could you take another look?

@jugglinmike
Copy link
Contributor Author

gh-12878 was recently merged to master. That introduced a conflict in this branch: it modified tools/ci/ci_taskcluster.sh which this patch removes in favor of a new Python script, taskcluster-run.py. The new script includes the semantics that gh-12878 introduced in the old script, so no change was necessary to resolve the conflict.

Copy link
Member

@Hexcles Hexcles left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

owner: ${event.pull_request.user.login}@users.noreply.github.com
source: ${event.repository.url}
payload:
image: jugglinmike/web-platform-tests:0.21
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A somewhat unrelated question: how can we create a public/shareable DockerHub account?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know, but I think we all want that

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Run PRs on Taskcluster
5 participants