Tests with Isolate.run pause indefinitely. #520

Closed
HosseinYousefi opened this issue Nov 21, 2023 · 13 comments · Fixed by #595

@HosseinYousefi (Member) commented Nov 21, 2023

This never prints 42 when it's run using --pause-isolates-on-exit.

import 'dart:isolate';

void main() async {
  print(await Isolate.run(() => 42));
}

And since coverage uses the --pause-isolates-on-exit flag (presumably to communicate with the VM service), the tests with Isolate.run time out.

@dcharkes (Contributor)

cc @lrhn for any insights about Isolate.run.
cc @cbracken who added the first uses of --pause-isolates-on-exit. Are we correct in presuming this is needed to get coverage?
cc @bkonyi for any insights on the service protocol and coverage.

@lrhn (Member) commented Nov 21, 2023

The Isolate.run implementation calls Isolate.exit(responsePort, resultMessage); when it's done.
That should shut down the isolate (preventing any further code from executing) and send a message on responsePort
before any onExit listener messages are sent (those send a null on the same port, which will already have been closed once the result message arrived).

If --pause-isolates-on-exit blocks the exiting isolate before the resultMessage is sent,
the original isolate won't receive the message, and the await won't complete.

In that case, it's pretty much working as intended.

And it pretty much has to work that way, because the resultMessage isn't so much sent as its heap is merged into the receiver's heap, at which point it no longer belongs to the exiting isolate. If paused after that, it would be impossible to do anything with the paused isolate anyway, since its heap is no longer its own.

So "pause-isolates-on-exit" probably has to mean "pause-before-merging-heaps", which means it happens before, and blocks, the result message being received.

@dcharkes (Contributor)

Thanks @lrhn! That's what @HosseinYousefi and I thought as well, thanks for confirming!

So the question is how that works together with coverage: do all the isolates need to stay in a paused state to get full coverage info, or do we just need the main isolate to pause on exit? E.g. would a hypothetical --pause-main-isolate-on-exit flag suffice for coverage collection?

@cbracken (Member)

@lrhn covered it above, yep. --pause-isolates-on-exit was indeed added so that we could keep all isolates around in a wedged state once they've wrapped up, then collect coverage from them before finally unwedging and bailing out.

The knock-on effect of the stuck resultMessage wasn't something I considered.

I can't remember if we currently wait for all isolates to be paused before collecting coverage, but that sounds plausible. Perhaps we could collect coverage on each isolate once we determine it's paused, then unpause them once we've collected, in an iterative fashion until we're down to the last one? Apologies, it's been years since I've looked at this code.

@dcharkes (Contributor) commented Nov 22, 2023

Thanks @cbracken!

Perhaps we could collect coverage on each isolate once we determine it's paused, then unpause them once we've collected, in an iterative fashion until we're down to the last one?

@bkonyi You probably know if this is easy to do with the dev protocol.

@bkonyi (Contributor) commented Nov 22, 2023

Yep, this isn't hard to do if each isolate was configured to pause on exit (simply providing --pause-isolates-on-exit will accomplish that).

First, you'll want to listen to the VM service's Isolate stream to get isolate events:

// `service` is an already-connected VmService from package:vm_service.
service.onIsolateEvent.listen((event) {
  if (event.kind == EventKind.kIsolateStart) {
    // An isolate has started.
  } else if (event.kind == EventKind.kIsolateExit) {
    // An isolate has exited.
  }
});
await service.streamListen(EventStreams.kIsolate);

Then, since an isolate might have paused before you started listening to the stream, you'll want to check whether any isolates are already paused:

// Get the current VM state.
final vm = await service.getVM();

for (final isolateRef in vm.isolates!) {
  // Get the isolate's current state.
  final isolate = await service.getIsolate(isolateRef.id!);
  if (isolate.pauseEvent?.kind == EventKind.kPauseExit) {
    // The isolate has paused at exit.
  }
}

You'll need to keep track of how many isolates have been spawned and how many have exited, but then it should be easy to collect coverage and resume each isolate to cause the process to exit.
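
To make that concrete, here is a rough sketch of the collect-then-resume loop using package:vm_service, assuming `service` is an already-connected VmService and noting that the PauseExit events arrive on the Debug stream. This is illustrative only, not the exact package:coverage code:

import 'package:vm_service/vm_service.dart';

// Illustrative only: gather coverage from an isolate paused at exit,
// then resume it so the VM can eventually shut down.
Future<void> collectAndResume(VmService service, String isolateId) async {
  final report = await service.getSourceReport(
    isolateId,
    [SourceReportKind.kCoverage],
    reportLines: true,
  );
  // ... merge `report` into the accumulated hit counts here ...
  await service.resume(isolateId);
}

Future<void> collectOnExit(VmService service) async {
  // React to isolates pausing at exit from now on.
  service.onDebugEvent.listen((event) async {
    if (event.kind == EventKind.kPauseExit) {
      await collectAndResume(service, event.isolate!.id!);
    }
  });
  await service.streamListen(EventStreams.kDebug);

  // Catch any isolate that paused before we subscribed.
  final vm = await service.getVM();
  for (final ref in vm.isolates ?? <IsolateRef>[]) {
    final isolate = await service.getIsolate(ref.id!);
    if (isolate.pauseEvent?.kind == EventKind.kPauseExit) {
      await collectAndResume(service, ref.id!);
    }
  }
}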

@liamappelbe (Contributor)

Just getting to this now. To summarize, the current flow is: we start the test, wait for all the isolates to be paused, collect all the coverage, then resume all the isolates.

The proposed new flow is that we start the test, listen for individual isolates' PauseExit events, collect coverage for each paused isolate, then resume it. We keep going with this process until there are no isolates left. We also need to take care to handle the case where some isolate pause events are missed because they fire before the listener is set up.

Unfortunately that approach won't quite work in the presence of isolate groups. The coverage counters are shared between all the isolates in a group. In the current flow we only collect coverage for one isolate in a group. If we naively apply that logic in the new flow, we'll undercount coverage by only collecting coverage for the entire group when its first isolate exits. If we instead omit this logic and collect coverage for all isolates, we'll overcount coverage (that's less problematic, but users have complained about it before).

So instead we need to collect coverage for an isolate group only when its final isolate exits. I'm not quite sure of the best way to do this. My goal is generally to minimize the number of RPCs. My current thinking is that when I get a pause event, I can query vm.isolates, which returns a list of isolate refs, which contain the isolate's group ID. If the paused isolate is the only isolate in that group, I can collect coverage. That's 1 RPC per pause event. @bkonyi Can you think of a cleaner solution?
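
For concreteness, the check I have in mind looks roughly like this (a sketch only, relying on the isolateGroupId field on IsolateRef; whether there's a cheaper way is exactly the question):

import 'package:vm_service/vm_service.dart';

// Sketch: one getVM() RPC per pause event. Collect group coverage only
// if the paused isolate is the last live isolate listed for its group.
Future<bool> isLastInGroup(VmService service, IsolateRef paused) async {
  final vm = await service.getVM();
  final sameGroup = (vm.isolates ?? <IsolateRef>[])
      .where((ref) => ref.isolateGroupId == paused.isolateGroupId);
  return sameGroup.length == 1;
}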

@liamappelbe (Contributor)

I've got a mostly working solution, except for one annoying issue. If you're gathering coverage on a test where the main isolate doesn't wait for its child isolates to finish, the main isolate can finish first. In that case the flow is:

  1. Main isolate starts. Coverage tool sees it.
  2. Main isolate spawns child. Coverage tool sees that there are 2 isolates in the group.
  3. Main isolate doesn't await child, so it finishes, and enters the pause-on-exit state.
  4. Coverage tool sees that the main isolate is not the last isolate in its group, so doesn't gather coverage. Instead it just resumes the main isolate, allowing it to exit.
  5. Child isolate finishes, and enters the pause-on-exit state.
  6. Coverage tool sees that it's the last isolate, so starts collecting coverage.
  7. Once coverage is collected, the child isolate is resumed and exits.

The problem is I'm seeing the VM service being disposed early, some time around step 6 or 7 (e.g. getSourceReport: (112) Service has disappeared). I think what's happening is that the lifetime of the VM service is tied to the main isolate, and it's being shut down before the child isolate is collected and resumed.

Potential fixes:

  1. I can fix this particular test by having the main isolate wait for the child isolate to exit. I've verified this works, but this is not a general solution. The tool needs to be robust to this case.
  2. Change the VM service to stay alive until all isolates are closed. I don't know how feasible this is, and I'd rather not wait for Dart 3.7 to fix this bug.
  3. Delay resuming the main isolate until all other isolates have been resumed (sketched below). I've verified this works, and it's my preferred solution. The only issue is that I'm not sure how to reliably detect the main isolate. My current (super hacky) solution is to check isolateRef.name == "main". @bkonyi is there a better way of detecting which isolate is the main isolate?
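
The deferred resume in option 3 would look roughly like this (a sketch built on the name == "main" heuristic; the coverage collection itself is elided):

import 'package:vm_service/vm_service.dart';

// Sketch of option 3: hold the main isolate at its exit pause and only
// resume it once every other isolate has actually gone away, so the VM
// service stays alive while child isolates are collected.
class MainLastResumer {
  String? _pendingMainId;

  // Called for Debug-stream PauseExit events.
  Future<void> onPauseExit(VmService service, IsolateRef paused) async {
    if (paused.name == 'main') {
      // Super hacky main-isolate check, as noted above. Resuming the
      // main isolate starts VM shutdown, so park it for now.
      _pendingMainId = paused.id;
    } else {
      // (collect coverage here when this is the group's last isolate)
      await service.resume(paused.id!);
    }
  }

  // Called for Isolate-stream IsolateExit events.
  Future<void> onIsolateExit(VmService service) async {
    if (_pendingMainId == null) return;
    final vm = await service.getVM();
    final others = (vm.isolates ?? <IsolateRef>[])
        .where((ref) => ref.id != _pendingMainId);
    if (others.isEmpty) {
      await service.resume(_pendingMainId!);
      _pendingMainId = null;
    }
  }
}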

@bkonyi (Contributor) commented Sep 12, 2024

I think what's happening is that the lifetime of the VM service is tied to the main isolate, and it's being shut down before the child isolate is collected and resumed.

As soon as the main isolate exits, the VM starts shutting down, which will kill all the isolates immediately (although the service isolate should be one of the last to shut down).

  2. Change the VM service to stay alive until all isolates are closed. I don't know how feasible this is, and I'd rather not wait for Dart 3.7 to fix this bug.

The VM service is only shut down after all isolates are killed.

  3. Delay resuming the main isolate until all other isolates have been resumed. I've verified this works, and this is my preferred solution. The only issue is I'm not sure how to reliably detect the main isolate. My current (super hacky) solution is to check isolateRef.name == "main". @bkonyi is there a better way of detecting which isolate is the main isolate?

Unfortunately the definition of the "main" isolate is embedder specific, so there's no generic way to query whether or not an isolate is the main isolate. For the standalone VM, checking for the isolate named "main" will work, but it's purely an implementation detail and could change in the future (at least in theory). The Flutter engine may assign a different name to the main isolate in Flutter apps as well.

@kenzieschmoll, does DevTools just assume the first non-system isolate to start is the main isolate or is there another check that's more consistent that's used elsewhere?

@liamappelbe (Contributor)

@kenzieschmoll, does DevTools just assume the first non-system isolate to start is the main isolate or is there another check that's more consistent that's used elsewhere?

Looking at that code link, they're catching the first EventKind.kIsolateStart event. But package:coverage usually doesn't subscribe to the event stream early enough to catch that.

Another approach I thought of was to assume the first isolate in service.getVM().isolates was the main isolate (under the assumption that no isolates have shut down yet and shuffled the order, because they're all paused at exit). Still pretty hacky, but less hacky than checking the isolate name.

Martin suggested looking for the isolate that has id == originId, but I don't see originId in the VM service spec so that might be a VM specific solution.
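
For illustration, the first-listed-isolate heuristic would be something like this (sketch only; it leans entirely on the assumption above that nothing has exited and reshuffled the list):

import 'package:vm_service/vm_service.dart';

// Sketch: treat the first non-system isolate reported by getVM() as the
// main isolate. Only sound while no isolate has exited yet (e.g. while
// everything is still paused at exit).
Future<IsolateRef?> guessMainIsolate(VmService service) async {
  final vm = await service.getVM();
  final nonSystem = (vm.isolates ?? <IsolateRef>[])
      .where((ref) => ref.isSystemIsolate != true);
  return nonSystem.isEmpty ? null : nonSystem.first;
}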

@aam (Contributor) commented Sep 13, 2024

Looking at that code link, they're catching the first EventKind.kIsolateStart event. But package:coverage usually doesn't subscribe to the event stream early enough to catch that.

Can you launch the app in pause-isolates-on-start mode, thereby guaranteeing that you get a full picture of what is started and in what order?

@liamappelbe (Contributor)

Can you launch the app in pause-isolates-on-start mode, thereby guaranteeing that you get a full picture of what is started and in what order?

Maybe, but many of our users start the tests themselves, then gather coverage separately. We'd have to convince all of them to change how they invoke the tests.

@brianquinlan (Contributor)

FYI, I just spent a day trying to figure out why some tests were timing out due to this issue. Since the code under test involves native code that uses condition variables, it never occurred to me that this could be an infrastructure problem ;-)
