Do not crash on unexpected exception #48367

tmat · 2020-10-06T19:05:13Z

Avoid crashing the host process when an unexpected exception is thrown.

When an unexpected exception is thrown in the ServiceHub process, it needs to be propagated to the client so that we can report actionable diagnostics in the log and in the info bar. If the process is crashed via fail-fast, we can't report a good error. See e.g. #47234 (comment)

Proposal

ReportAndCatch, ReportAndCatchUnlessCanceled

The caller declares that it is able to recover from the exception at this call site (the handler returns true, so the exception is caught).
This uses the NonFatal handler: csc ignores it, VS/OOP reports a non-fatal Watson.
In the compiler this would only be used for exceptions from analyzers.

ReportAndPropagate, ReportAndPropagateUnlessCanceled

The caller declares that it is not able to recover at this point and the exception needs to be propagated (the handler returns false).
This uses the Fatal handler: csc fails fast, VS/OOP reports a non-fatal Watson.
We would never fail fast in these handlers in VS/OOP. We could potentially have an environment variable or registry key that forces a fatal Watson for testing and manual investigations.
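
To make the return-value contract concrete, here is a minimal Python sketch (not the Roslyn code). It assumes the handlers are used like C# exception filters, where a true result catches the exception and a false result lets it propagate. `guarded`, `OperationCanceled`, and the `reports` list are hypothetical names for illustration, and the `UnlessCanceled` behavior shown (cancellation is neither reported nor caught) is an assumption based on the method names.

```python
class OperationCanceled(Exception):
    """Stand-in for .NET's OperationCanceledException."""


reports = []  # records which handler kind was asked to report which exception


def report_and_catch(exc):
    """Call site can recover: report via the non-fatal handler, then catch."""
    reports.append(("non_fatal", exc))
    return True


def report_and_propagate(exc):
    """Call site cannot recover: report via the fatal handler, then propagate."""
    reports.append(("fatal", exc))
    return False


def report_and_catch_unless_canceled(exc):
    # Assumption: cancellation is expected, so it is not reported and not caught.
    if isinstance(exc, OperationCanceled):
        return False
    return report_and_catch(exc)


def report_and_propagate_unless_canceled(exc):
    # Assumption: same treatment of cancellation as above.
    if isinstance(exc, OperationCanceled):
        return False
    return report_and_propagate(exc)


def guarded(action, handler, fallback=None):
    """Rough emulation of `catch (Exception e) when (handler(e))`."""
    try:
        return action()
    except Exception as exc:
        if handler(exc):
            return fallback  # handler returned True: the exception is caught
        raise  # handler returned False: the same exception propagates
```

Unlike a real exception filter, this emulation runs the handler after the `except` clause has been entered, so it only models the catch-or-propagate decision, not the point in the unwind at which the report is taken.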

Review strategy: commit by commit.

tmat · 2020-10-06T19:05:34Z

@dotnet/roslyn-compiler @dotnet/roslyn-ide PTAL

jasonmalinowski · 2020-10-06T19:22:45Z

@tmat I know what I'm about to propose sounds funny, but rather than changing this, which also impacts other hosts (and anything running in devenv.exe itself), should we have the code in ServiceHub set the 'fatal' handler to be non-fatal as well?

sharwell · 2020-10-06T21:49:21Z

@tmat I marked this as draft so it doesn't accidentally get merged before a design review.

sharwell · 2020-10-06T21:49:58Z

This is related to #47891 as well

src/Features/Core/Portable/SolutionCrawler/WorkCoordinator.IncrementalAnalyzerProcessor.cs

src/VisualStudio/Core/Def/Implementation/Venus/ContainedLanguage.cs

src/VisualStudio/Core/Def/Telemetry/VSTelemetryLogger.cs

src/VisualStudio/VisualBasic/Impl/LanguageService/VisualBasicPackage.vb

src/Workspaces/CoreTest/UtilityTest/ExceptionHelpersTests.cs

src/Workspaces/SharedUtilitiesAndExtensions/Compiler/Core/Utilities/AsyncLazy`1.cs

sharwell

💡 Consider adding a data property to the exception indicating that it has already been reported through telemetry. Otherwise we could see large numbers of duplicate reports as the stack unwinds.

jaredpar · 2020-10-07T00:20:09Z

> Avoid crashing the host process when an unexpected exception is thrown.
> When an unexpected exception is thrown in ServiceHub process ...

This change doesn't seem to impact only ServiceHub, though. It seems to change the policy for all hosts. Correct?

tmat · 2020-10-07T00:49:53Z

> This change doesn't seem to impact only ServiceHub, though. It seems to change the policy for all hosts. Correct?

Correct. I think it's better not to fail-fast in any of our hosts.

tmat · 2020-10-07T00:54:56Z

> @tmat I know what I'm about to propose sounds funny, but rather than changing this, which also impacts other hosts (and anything running in devenv.exe itself), should we have the code in ServiceHub set the 'fatal' handler to be non-fatal as well?

I don't think so. I think we need to consider at each call site what the intended semantics are. In some rare cases we might actually want to fail fast. In most cases we should either recover or report an NFE and propagate the exception.

jaredpar · 2020-10-07T07:28:45Z

> Correct. I think it's better not to fail-fast in any of our hosts.

The hosts include the command line compilers and we generally feel that fail-fast there is highly beneficial to us.

tmat · 2020-10-07T18:02:38Z

> The hosts include the command line compilers and we generally feel that fail-fast there is highly beneficial to us.

I see - because we don't report NFW from command line compilers, right? Should command line compilers set the handler for NFW to fail fast then?

jaredpar · 2020-10-07T18:19:34Z

> Should command line compilers set the handler for NFW to fail fast then?

Yes. I think we should have a method in the vein of UseHostPolicyUnlessCanceled that we use here, so it's explicit that it's a host decision. Then we can focus on the hosts picking the right defaults.

tmat · 2020-10-07T18:40:20Z

The handlers always used host policy - that has been the default.

Looking again at call sites in the compiler:

In most call sites exceptions are definitely not expected, and we want to crash in csc but not in VS/OOP.
When we see an exception from an analyzer, we want to log an NFW in VS/OOP and do nothing in csc (not fail-fast).

So maybe we need to do what @jasonmalinowski suggested and also rename the methods to be clear.

Proposal:

ReportAndCatch, ReportAndCatchUnlessCanceled

The caller declares that it is able to recover from the exception at this call site (the handler returns true, so the exception is caught).
This uses the NonFatal handler: csc ignores it, VS/OOP reports a non-fatal Watson.
In the compiler this would only be used for exceptions from analyzers.

ReportAndPropagate, ReportAndPropagateUnlessCanceled

The caller declares that it is not able to recover at this point and the exception needs to be propagated (the handler returns false).
This uses the Fatal handler: csc fails fast, VS/OOP reports a non-fatal Watson.

We would never fail fast in these handlers in VS/OOP. We could potentially have an environment variable or registry key that forces a fatal Watson for testing and manual investigations.
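
The per-host split in this proposal can be sketched as a lookup of handler pairs. This is an illustrative Python model, not the actual implementation: `handlers_for_host`, `FailFast`, and the `EXAMPLE_FORCE_FATAL_WATSON` variable name are all made up here, and fail-fast is modeled as raising an exception rather than terminating the process.

```python
import os

FORCE_FATAL_ENV = "EXAMPLE_FORCE_FATAL_WATSON"  # hypothetical variable name


class FailFast(BaseException):
    """Stand-in for terminating the process immediately."""


def fail_fast(exc):
    raise FailFast(repr(exc))


def ignore(exc):
    pass


def make_nfw_reporter(log):
    """Build a handler that records a non-fatal Watson (NFW) in `log`."""
    def report(exc):
        log.append(f"NFW: {exc!r}")
    return report


def handlers_for_host(host, log, environ=os.environ):
    """Return (fatal_handler, non_fatal_handler) for the given host."""
    if host == "csc":
        # Command-line compiler: fatal crashes, non-fatal is ignored.
        return fail_fast, ignore
    if host == "vs_oop":
        nfw = make_nfw_reporter(log)
        if environ.get(FORCE_FATAL_ENV) == "1":
            # Opt-in override for testing and manual investigations.
            return fail_fast, nfw
        # Default: both handlers report a non-fatal Watson, never fail-fast.
        return nfw, nfw
    raise ValueError(f"unknown host: {host}")
```

Keeping the choice in one host-level function means call sites only state whether they can recover, while each host decides what "fatal" means for it.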

jaredpar · 2020-10-07T18:45:45Z

@tmat love that summary in its entirety (including the env var trick to help with investigations)

tmat · 2020-10-07T18:46:23Z

Cool. I'll update the PR.

jasonmalinowski · 2020-10-07T18:59:46Z

The proposal makes sense to me @tmat. I'd say let's communicate well to the compiler team that what might have been 'fatal' in the command line may not be fatal in the VS/OOP case. If the compiler gets a bunch of non-fatal Watsons from VS those might not be possible to hit in the command line case because we marched along after a bad failure and had downstream issues. In the VS/OOP case a fatal exception is data loss, and so marching along is the right thing in many cases but it can be surprising.

tmat · 2020-10-07T19:44:21Z

> we marched along after a bad failure and had downstream issues.

The first filter that encounters the unexpected exception will report an NFW. At that point the stack and state are the same as for a fatal Watson from the command line.
The handler marks the exception so that we don't report any more Watsons for it. If our code recovers from the exception (ReportAndCatch is used), there is no unexpected downstream behavior (unless we have a bug in the recovery). If the code does not recover (ReportAndPropagate), the exception is not caught and the same exception is propagated to the caller, but since it's marked we no longer report Watsons for it. It may happen that we translate one exception to another (e.g. wrap the exception in an outer exception). In that case we will still see what the inner exception was, and it's clear that Watsons for the outer exception are different from those for the inner one (the outer exception no longer has a stack that originates in the compiler).
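
A small Python model of this flow, assuming the mark is a flag stored on the exception object itself. The `_reported` attribute, the `watsons` list, and the function names are hypothetical; the real handlers may mark the exception differently.

```python
watsons = []  # each entry: (exception type name, inner exception type name or None)


def report_once(exc):
    """Report `exc` unless an earlier handler on the unwind path already did."""
    if getattr(exc, "_reported", False):
        return
    exc._reported = True  # mark so later filters stay silent
    inner = type(exc.__cause__).__name__ if exc.__cause__ is not None else None
    watsons.append((type(exc).__name__, inner))


def compiler_step():
    raise IndexError("bad index")


def feature():
    try:
        compiler_step()
    except Exception as exc:
        report_once(exc)  # first filter: reports with the original stack
        raise             # ReportAndPropagate style: same exception continues


def translating_layer():
    try:
        feature()
    except Exception as exc:
        report_once(exc)  # already marked: nothing new is reported
        raise RuntimeError("feature failed") from exc  # wrap in an outer exception


def caller():
    try:
        translating_layer()
    except Exception as exc:
        report_once(exc)  # outer exception is a new object, so it is reported
        raise
```

In this model, running `caller()` records two entries: one for the original `IndexError`, and one for the wrapping `RuntimeError` that still names the `IndexError` as its inner exception.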

tmat · 2020-10-09T17:40:11Z

@sharwell PTAL - I have changed the approach from the original PR.

cston

Compiler changes LGTM.

ryzngard · 2020-10-09T19:33:51Z

Design LGTM. We should consider capturing dumps for VS/OOP in the ReportAndPropagate and ReportAndPropagateUnlessCanceled cases, and maybe using severity on the FaultEvent to indicate that these are more problematic. Right now all of our FaultEvent calls use the same severity, maybe we should set severity to something more drastic as well. This will help us prioritize accordingly with the telemetry we get.

tmat · 2020-10-10T17:32:10Z

@ryzngard We do capture dumps in all Report* handlers when running in VS/OOP - both fatal and non-fatal handlers end up reporting NFE. See 1366d93#diff-92b0fbadf21baf9af53e946ec4a2b433R30-R31.

> Right now all of our FaultEvent calls use the same severity, maybe we should set severity to something more drastic as well. This will help us prioritize accordingly with telemetry we get.

We can do that, but I'd consider that a separate work item - feel free to file an issue if you think it would help.
I think it might be sufficient to prioritize based on hit count. In either case a feature won't work due to a bug.

tmat · 2020-10-10T22:43:54Z

@sharwell OK to merge?

sharwell · 2020-10-10T23:05:05Z

@tmat I'll take a look later today but I'm guessing it's in line with what I was already asking for

ryzngard · 2020-10-11T01:13:52Z

> @ryzngard We do capture dumps in all Report* handlers when running in VS/OOP - both fatal and non-fatal handlers end up reporting NFE. See 1366d93#diff-92b0fbadf21baf9af53e946ec4a2b433R30-R31.
>
> > Right now all of our FaultEvent calls use the same severity, maybe we should set severity to something more drastic as well. This will help us prioritize accordingly with telemetry we get.
>
> We can do that, but I'd consider that a separate work item - feel free to file an issue if you think it would help.
> I think it might be sufficient to prioritize based on hit count. In either case a feature won't work due to a bug.

NFE != dump, so that's something to consider. For crashes we always* get a dump, and now we no longer will.

sharwell · 2020-10-11T01:34:58Z

@ryzngard that's true, but even for fatal errors the default heap dump collected is often insufficient and we have to request a full dump. In this case, the two options are close to the same.

tmat · 2020-10-12T18:55:56Z

@sharwell Ping.

sharwell · 2021-04-23T15:44:48Z

@Cosifne It looks like this is missing a milestone

sharwell · 2021-07-14T14:27:21Z

@tmat It looks like this is missing a milestone

tmat requested review from a team as code owners October 6, 2020 19:05

sharwell marked this pull request as draft October 6, 2020 21:48