Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Do not crash on unexpected exception #48367

Merged
merged 4 commits into from
Oct 12, 2020
Merged

Conversation

tmat
Copy link
Member

@tmat tmat commented Oct 6, 2020

Avoid crashing the host process when an unexpected exception is thrown.

When an unexpected exception is thrown in ServiceHub process it needs to be propagated to the client so that we report actionable diagnostics in the log and in the info-bar. If the process is crashed via fail-fast we can't report good error. See e.g. #47234 (comment)

Proposal

ReportAndCatch, ReportAndCatchUnlessCanceled

The caller declares it is able to recover from the exception at this call site (catches the exception - returns true).
This uses NonFatal handler - csc ignores it, VS/OOP reports non-fatal Watson
In the compiler this would only be used for exceptions from analyzers.

ReportAndPropagate, ReportAndPropagateUnlessCanceled

The caller declares it is not able to recover at this point and the exception needs to be propagated (returns false).
This uses Fatal handler - csc fail-fast, VS/OOP reports non-fatal Watson
We would never use fail fast in these handlers in VS/OOP. We could potentially have env variable/reg key that forces fatal watson for testing/manual investigations

Review strategy: commit by commit.

@tmat tmat requested review from a team as code owners October 6, 2020 19:05
@tmat
Copy link
Member Author

tmat commented Oct 6, 2020

@dotnet/roslyn-compiler @dotnet/roslyn-ide PTAL

@jasonmalinowski
Copy link
Member

@tmat I know this sounds funny what I'm about to propose, but rather than changing this which also impacts other hosts (and anything running in devenv.exe itself), should we have the code in service hub set the 'fatal' handler to be non-fatal as well?

@sharwell sharwell marked this pull request as draft October 6, 2020 21:48
@sharwell
Copy link
Member

sharwell commented Oct 6, 2020

@tmat I marked this as draft so it doesn't accidentally get merged before a design review.

@sharwell
Copy link
Member

sharwell commented Oct 6, 2020

This is related to #47891 as well

Copy link
Member

@sharwell sharwell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Consider adding a data property to the exception indicating that it has already been reported through telemetry. Otherwise we could see large numbers of duplicate reports as the stack unwinds.

@jaredpar
Copy link
Member

jaredpar commented Oct 7, 2020

Avoid crashing the host process when an unexpected exception is thrown.
When an unexpected exception is thrown in ServiceHub process ...

This change doesn't seem like it impacts only ServiceHub though. This seems to change the policy for all hosts. Correct?

@tmat
Copy link
Member Author

tmat commented Oct 7, 2020

This change doesn't seem like it impacts only ServiceHub though. This seems to change the policy for all hosts. Correct?

Correct. I think it's better not to fail-fast in any of our hosts.

@tmat
Copy link
Member Author

tmat commented Oct 7, 2020

@tmat I know this sounds funny what I'm about to propose, but rather than changing this which also impacts other hosts (and anything running in devenv.exe itself), should we have the code in service hub set the 'fatal' handler to be non-fatal as well?

I don't think so. I think we need to consider at each call-site what the intended semantics is. In some rare cases we might want to actually fail fast. In most cases we should either recover or report NFE and propagate the exception.

@jaredpar
Copy link
Member

jaredpar commented Oct 7, 2020

Correct. I think it's better not to fail-fast in any of our hosts.

The hosts include the command line compilers and we generally feel that fail-fast there is highly beneficial to us.

@tmat
Copy link
Member Author

tmat commented Oct 7, 2020

The hosts include the command line compilers and we generally feel that fail-fast there is highly beneficial to us.

I see - because we don't report NFW from command line compilers, right? Should command line compilers set the handler for NFW to fail fast then?

@jaredpar
Copy link
Member

jaredpar commented Oct 7, 2020

Should command line compilers set the handler for NFW to fail fast then?

Yes. I think we should probably essentially have a method on the vein of UseHostPolicyUnlessCanceled that we use here so it's explicit that it's a host decision. Then we can focus on the hosts picking the right defaults here.

@tmat
Copy link
Member Author

tmat commented Oct 7, 2020

The handlers always used host policy - that has been the default.

Looking again at call sites in the compiler:

  • In most call sites exceptions are definitely not expected and we want to crash in csc but not in VS/OOP.
  • When we see an exception from analyzer we want to log NFW if in VS/OOP and do nothing in csc (not fail-fast).

So maybe we need to do what @jasonmalinowski suggested and also rename the methods to be clear.

Proposal:

ReportAndCatch, ReportAndCatchUnlessCanceled

  • The caller declares it is able to recover from the exception at this call site (catches the exception - returns true).
  • This uses NonFatal handler - csc ignores it, VS/OOP reports non-fatal Watson
  • In the compiler this would only be used for exceptions from analyzers.

ReportAndPropagate, ReportAndPropagateUnlessCanceled

  • The caller declares it is not able to recover at this point and the exception needs to be propagated (returns false).
  • This uses Fatal handler - csc fail-fast, VS/OOP reports non-fatal Watson

We would never use fail fast in these handlers in VS/OOP. We could potentially have env variable/reg key that forces fatal watson for testing/manual investigations

@jaredpar
Copy link
Member

jaredpar commented Oct 7, 2020

@tmat love that summary in its entirety (including the env var trick to help with investigations)

@tmat
Copy link
Member Author

tmat commented Oct 7, 2020

Cool. I'll update the PR.

@jasonmalinowski
Copy link
Member

The proposal makes sense to me @tmat. I'd say let's communicate well to the compiler team that what might have been 'fatal' in the command line may not be fatal in the VS/OOP case. If the compiler gets a bunch of non-fatal Watsons from VS those might not be possible to hit in the command line case because we marched along after a bad failure and had downstream issues. In the VS/OOP case a fatal exception is data loss, and so marching along is the right thing in many cases but it can be surprising.

@tmat
Copy link
Member Author

tmat commented Oct 7, 2020

we marched along after a bad failure and had downstream issues.

The first filter that will encounter the unexpected exception will report NFW - at this point it's the same stack/state as Fatal Watson from command line.
The handler marks the exception so that we don't report any more Watsons for it. If our code recovers from the exception (ReportAndCatch is used) then there is no unexpected downstream behavior (unless we have a bug in the recovery). If the code does not recover (ReportAndPropagate) then the exception is not caught and the same exception is propagated to the caller (but since it's marked we no longer report Watsons for it). It may happen that we translate one exception to another (e.g. wrap the exception in outer exception). But then we will see what the inner exception was and it's clear Watsons for the outer exception are different from the inner one (the outer exception no longer has a stack that originates in the compiler).

@tmat tmat force-pushed the ReportAndPropagate branch from 1ca6fae to e76e6ec Compare October 7, 2020 19:45
@tmat tmat marked this pull request as ready for review October 9, 2020 17:39
@tmat tmat requested a review from a team as a code owner October 9, 2020 17:39
@tmat
Copy link
Member Author

tmat commented Oct 9, 2020

@sharwell PTAL - I have changed the approach from the original PR.

Copy link
Member

@cston cston left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Compiler changes LGTM.

@ryzngard
Copy link
Contributor

ryzngard commented Oct 9, 2020

Design lgtm, we should consider capturing dumps for VS/OOP in ReportAndPropagate and ReportAndPropagateUnlessCanceled cases, and maybe using severity on the FaultError to indicate that these are more problematic. Right now all of our FaultEvent calls use the same severity, maybe we should set severity to something more drastic as well. This will help us prioritize accordingly with telemetry we get.

@tmat
Copy link
Member Author

tmat commented Oct 10, 2020

@ryzngard We do capture dumps in all Report* handlers when running in VS/OOP - both fatal and non-fatal handlers end up reporting NFE. See 1366d93#diff-92b0fbadf21baf9af53e946ec4a2b433R30-R31.

Right now all of our FaultEvent calls use the same severity, maybe we should set severity to something more drastic as well. This will help us prioritize accordingly with telemetry we get.

We can do that, but I'd consider that a separate work item - feel free to file an issue if you think it would help.
I think it might be sufficient to prioritize either based on hit count. In either case a feature won't work due to a bug.

@tmat
Copy link
Member Author

tmat commented Oct 10, 2020

@sharwell OK to merge?

@sharwell
Copy link
Member

@tmat I'll take a look later today but I'm guessing it's in line with what I was already asking for

@ryzngard
Copy link
Contributor

@ryzngard We do capture dumps in all Report* handlers when running in VS/OOP - both fatal and non-fatal handlers end up reporting NFE. See 1366d93#diff-92b0fbadf21baf9af53e946ec4a2b433R30-R31.

Right now all of our FaultEvent calls use the same severity, maybe we should set severity to something more drastic as well. This will help us prioritize accordingly with telemetry we get.

We can do that, but I'd consider that a separate work item - feel free to file an issue if you think it would help.
I think it might be sufficient to prioritize either based on hit count. In either case a feature won't work due to a bug.

NFE != dump, so something to consider. For crashes we always* get a dump, and we no longer will be.

@sharwell
Copy link
Member

@ryzngard that's true, but even for fatal errors the default heap dump collected is often insufficient and we have to request a full dump. In this case, the two options are close to the same.

@tmat
Copy link
Member Author

tmat commented Oct 12, 2020

@sharwell Ping.

@tmat tmat merged commit 9f82a08 into dotnet:master Oct 12, 2020
@ghost ghost added this to the Next milestone Oct 12, 2020
@tmat tmat deleted the ReportAndPropagate branch October 12, 2020 19:23
@Cosifne Cosifne modified the milestones: Next, 16.9.P1 Oct 12, 2020
333fred added a commit that referenced this pull request Oct 13, 2020
…features/interpolated-string-constants

* upstream/master: (295 commits)
  Update F1 Keywords to differentiate between semantics of default keyword (#48500)
  Default constructor suggestion between members (#48318) (#48503)
  Adjust ERR_PartialMisplaced diagnostic message (#48524)
  Refactor ChangedText.Merge and add fuzz testing (#48420)
  Apply suggestions from code review
  Do not crash on unexpected exception (#48367)
  Reference the contributing doc in 'Analyzer Suggestion' issue template
  Apply suggestions from code review
  Hardcode skipped Regex diagnostic ID as it is not available in CodeStyle layer
  Add using
  Skip help link for Regex diagnostic analyzer
  Add contributing doc for IDE code style analyzer documentation
  Make db lock static to investigate issue.
  Update dependencies from https://github.com/dotnet/roslyn build 20201012.2 (#48513)
  Hook up help link even for AbstractCodeQualityDiagnosticAnalyzer
  Add destructor intellisense test for record (#48297)
  Remove unused method (#48429)
  Fix bug
  Update src/EditorFeatures/Core.Wpf/InlineHints/InlineHintsTag.cs
  Add more test
  ...
333fred added a commit to 333fred/roslyn that referenced this pull request Oct 13, 2020
* upstream/master: (68 commits)
  Update F1 Keywords to differentiate between semantics of default keyword (dotnet#48500)
  Default constructor suggestion between members (dotnet#48318) (dotnet#48503)
  Adjust ERR_PartialMisplaced diagnostic message (dotnet#48524)
  Refactor ChangedText.Merge and add fuzz testing (dotnet#48420)
  Apply suggestions from code review
  Do not crash on unexpected exception (dotnet#48367)
  Reference the contributing doc in 'Analyzer Suggestion' issue template
  Apply suggestions from code review
  Hardcode skipped Regex diagnostic ID as it is not available in CodeStyle layer
  Add using
  Skip help link for Regex diagnostic analyzer
  Add contributing doc for IDE code style analyzer documentation
  Make db lock static to investigate issue.
  Update dependencies from https://github.com/dotnet/roslyn build 20201012.2 (dotnet#48513)
  Hook up help link even for AbstractCodeQualityDiagnosticAnalyzer
  Add destructor intellisense test for record (dotnet#48297)
  Remove unused method (dotnet#48429)
  Fix bug
  Update src/EditorFeatures/Core.Wpf/InlineHints/InlineHintsTag.cs
  Add more test
  ...
@Cosifne Cosifne modified the milestones: 16.9.P1, 16.9.P2 Oct 26, 2020
@sharwell
Copy link
Member

@Cosifne It looks like this is missing a milestone

@sharwell
Copy link
Member

@tmat It looks like this is missing a milestone

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants