Test Host randomly crashing with message "##[error]The active test run was aborted. Reason: Test host process crashed" #4819

ibiglari · 2024-01-11T09:39:53Z

We have an MFC application and are using Microsoft::VisualStudio::CppUnitTestFramework for testing. VSTest@2 step is failing almost 50% of the time with the message ##[error]The active test run was aborted. Reason: Test host process crashed after the last step of the test routine is finished, as apparent from the log messages printed by Microsoft::VisualStudio::CppUnitTestFramework::Logger::WriteMessage(). The test successfully finishes when being executed directly via Visual Studio 17.8.4.

Pipeline Step:

  - task: VSTest@2
    displayName: Integration Tests
    inputs:
      testSelector: 'testAssemblies'
      testAssemblyVer2: 'Release\IntegrationTests.dll'
      searchFolder: '$(System.DefaultWorkingDirectory)'
      resultsFolder: '$(System.DefaultWorkingDirectory)/CoverageResults'
      runTestsInIsolation: true
      codeCoverageEnabled: true
      otherConsoleOptions: '/collect:"Code Coverage;Format=Xml" /platform:x86 /Diag:vstestlog.diag'
      testRunTitle: 'Run tests'
      rerunFailedTests: false

Steps to reproduce

Since this is a random occurrence, re-queuing the job usually resolves the issue.

Expected behavior

No crash during test execution

Actual behavior

Random crashes after last step of the test routine

Diagnostic logs

vstestlog.datacollector.24-01-11_08-49-42_80953_4.diag.log
vstestlog.diag.log
vstestlog.host.24-01-11_08-49-51_57026_4.diag.log

Environment

Agent name: 'Azure Pipelines 2'
Agent machine name: 'xxx'
Current agent version: '3.232.1'
Operating System
Microsoft Windows Server 2022
10.0.20348
Datacenter
Runner Image
Image: windows-2022
Version: 20240108.1.0
Included Software: https://github.com/actions/runner-images/blob/win22/20240108.1/images/windows/Windows2022-Readme.md
Image Release: https://github.com/actions/runner-images/releases/tag/win22%2F20240108.1
Runner Image Provisioner
2.0.321.1
Current image version: '20240108.1.0'
Agent running as: 'VssAdministrator'

The text was updated successfully, but these errors were encountered:

nohwnd · 2024-01-16T11:16:10Z

@jakubch1 this looks like a CodeCoverage issue. Testhost just crashes and there is no error reported. Please have a look.

jakubch1 · 2024-01-16T11:44:16Z

@ibiglari you are using dynamic native instrumentation which is not fully stable.
Please check: https://learn.microsoft.com/en-us/visualstudio/test/customizing-code-coverage-analysis?view=vs-2022#static-and-dynamic-native-instrumentation and enable for your projects: https://learn.microsoft.com/en-us/cpp/build/reference/profile-performance-tools-profiler?view=msvc-170 to use static native instrumentation.

After doing above you should not see in logs such errors:

TpTrace Information: 0 : 6604, 1, 2024/01/11, 08:49:47.363, 12024538561, datacollector.exe, StaticInstrumenter: searching in directory: D:\a\1\s\Release, recursive: True
TpTrace Information: 0 : 6604, 1, 2024/01/11, 08:49:50.835, 12059210729, datacollector.exe, StaticInstrumenter: native instrumentation result for D:\a\1\s\Release\IntegrationTests.dll: ErrorNoFixupData. Time taken: 00:00:03.4388736
TpTrace Information: 0 : 6604, 6, 2024/01/11, 08:49:50.929, 12060148132, datacollector.exe, StaticInstrumenter: native instrumentation result for D:\a\1\s\Release\SOFTWARE-TO-BE-TESTED.exe: ErrorNoFixupData. Time taken: 00:00:03.5328777

ibiglari · 2024-01-19T21:42:42Z

@jakubch1 so if I add /PROFILE to linker options, code coverage will automatically use static instrumentation instead of dynamic instrumentation? If not, how do I instruct test host to use static instrumentation in the pipeline?

ibiglari · 2024-01-20T23:37:42Z

@jakubch1 I'm afraid adding /PROFILE did not fix the issue. Logs attached
vstestlog.datacollector.24-01-20_23-20-38_09648_4.log
vstestlog.host.24-01-20_23-20-51_39345_4.log
vstestlog.log

Linker command line as extracted from build pipeline logs:
Link: C:\Program Files\Microsoft Visual Studio\2022\Enterprise\VC\Tools\MSVC\14.37.32822\bin\HostX86\x86\link.exe /ERRORREPORT:QUEUE /OUT:"D:\a\1\s\Release\SOFTWARE-TO-BE-TESTED.exe" /INCREMENTAL:NO /NOLOGO /LIBPATH:D:\a\_temp\LIBS\OpenSSL\HQ\lib /LIBPATH:CsdBase\Win32\Release /LIBPATH:CsdSystem\Win32\Release /LIBPATH:DbLib\Win32\Release /LIBPATH:Interfaces\Win32\Release /LIBPATH:Utils\Win32\Release odbc32.lib psapi.lib libcrypto.lib libssl.lib Crypt32.lib version.lib dbghelp.lib csdbase.lib csdsystem.lib dblib.lib interfaces.lib utils.lib /MANIFEST /MANIFESTUAC:"level='asInvoker' uiAccess='false'" /manifest:embed /manifestinput:"C:\Program Files\Microsoft Visual Studio\2022\Enterprise\VC\Tools\MSVC\14.37.32822\Include\Manifest\dpiaware.manifest" /DEBUG /PDB:".\Release/SOFTWARE-TO-BE-TESTED.pdb" /MAP:"SOFTWARE-TO-BE-TESTED-Release.map" /SUBSYSTEM:WINDOWS /TLBID:1 /DYNAMICBASE /NXCOMPAT /IMPLIB:"D:\a\1\s\Release\SOFTWARE-TO-BE-TESTED.lib" /MACHINE:X86 /PROFILE /SAFESEH Release\Eftsrv.res

jakubch1 · 2024-01-22T07:53:23Z

@jakubch1 so if I add /PROFILE to linker options, code coverage will automatically use static instrumentation instead of dynamic instrumentation? If not, how do I instruct test host to use static instrumentation in the pipeline?

Yes. Static instrumentation is better and more reliable so we have it enabled by default but if /PROFILE is missing static instrumentation is not working and we fallback to use dynamic instrumentation.

Please try to configure below flags in runsettings:

<EnableStaticNativeInstrumentation>True</EnableStaticNativeInstrumentation>
<EnableDynamicNativeInstrumentation>False</EnableDynamicNativeInstrumentation>

Full runsettings samples: https://github.com/microsoft/codecoverage/blob/main/docs/configuration.md

ibiglari · 2024-01-22T10:13:01Z

@jakubch1 Even after updating .runsettings I am seeing the same behaviour.

Please try to configure below flags in runsettings:
<EnableStaticNativeInstrumentation>True</EnableStaticNativeInstrumentation>
<EnableDynamicNativeInstrumentation>False</EnableDynamicNativeInstrumentation>
Full runsettings samples: https://github.com/microsoft/codecoverage/blob/main/docs/configuration.md

jakubch1 · 2024-01-22T13:34:00Z

@ibiglari could you please share logs?

ibiglari · 2024-01-22T20:23:13Z

@jakubch1 Sorry forgot to attah
vstestlog.log
vstestlog.datacollector.24-01-22_08-10-45_20726_4.log
vstestlog.host.24-01-22_08-11-01_73850_4.log

jakubch1 · 2024-01-22T21:06:54Z

@ibiglari something is wrong as I don't see

<EnableStaticNativeInstrumentation>True</EnableStaticNativeInstrumentation>
<EnableDynamicNativeInstrumentation>False</EnableDynamicNativeInstrumentation>

inside logs. Can you double check that this was added into <CodeCoverage> tag?

Did you configure:

runSettingsFile: # string. Settings file.

inside task: VSTest@2?

ibiglari · 2024-01-22T21:37:12Z

@jakubch1

Added runSettingsFile to VSTest@2 task, double checked the pipeline execution log.

.
.
.
##########################################################################
DtaExecutionHost version 19.229.34210.1.
Starting TestExecution Model...
Result Attachments will be stored in LogStore
			<DataCollector friendlyName="Code Coverage" uri="datacollector://Microsoft/CodeCoverage/2.0" assemblyQualifiedName="Microsoft.VisualStudio.Coverage.DynamicCoverageDataCollector, Microsoft.VisualStudio.TraceCollector, Version=11.0.0.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a">
				<Configuration>
					<CodeCoverage>
						<UseVerifiableInstrumentation>False</UseVerifiableInstrumentation>
						<EnableStaticNativeInstrumentation>True</EnableStaticNativeInstrumentation>
						<EnableDynamicNativeInstrumentation>False</EnableDynamicNativeInstrumentation>
					</CodeCoverage>
				</Configuration>
			</DataCollector>
		</DataCollectors>
	</DataCollectionRunSettings>	
	<Catch2Adapter>
		<FilenameFilter>^*Test</FilenameFilter>
	</Catch2Adapter>
</RunSettings>
Updated Run Settings:
<RunSettings>
  <DataCollectionRunSettings>
    <DataCollectors>
      <DataCollector friendlyName="Code Coverage" uri="datacollector://Microsoft/CodeCoverage/2.0" assemblyQualifiedName="Microsoft.VisualStudio.Coverage.DynamicCoverageDataCollector, Microsoft.VisualStudio.TraceCollector, Version=11.0.0.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a">
        <Configuration>
          <CodeCoverage>
            <UseVerifiableInstrumentation>False</UseVerifiableInstrumentation>
            <EnableStaticNativeInstrumentation>True</EnableStaticNativeInstrumentation>
            <EnableDynamicNativeInstrumentation>False</EnableDynamicNativeInstrumentation>
          </CodeCoverage>
        </Configuration>
      </DataCollector>
    </DataCollectors>
  </DataCollectionRunSettings>
  <Catch2Adapter>
    <FilenameFilter>^*Test</FilenameFilter>
  </Catch2Adapter>
  <RunConfiguration>
    <BatchSize>1000</BatchSize>
    <ResultsDirectory>D:\a\1\s\CoverageResults</ResultsDirectory>
  </RunConfiguration>
</RunSettings>
**************** Starting test execution *********************
.
.
.

New logs attached.

vstestlog.log
vstestlog.host.24-01-22_21-22-54_31244_4.log

vstestlog.datacollector.24-01-22_21-22-41_03781_4.log

jakubch1 · 2024-01-23T07:21:09Z

@ibiglari currently you are instrumenting only 1 file: D:\a\1\s\IntegrationTests\Utils\Reflector\ReflectrService.dll. Is it required? Please try to exclude it to see what happen:

<ModulePaths><Exclude><ModulePath>.*ReflectrService\.dll$</ModulePath></Exclude></ModulePaths>

(inside <CodeCoverage>)

ibiglari · 2024-01-23T21:58:43Z

@jakubch1 No joy. Still seeing the crash. Logs attached.

Also, not sure what you mean by instrumenting only 1 file? Our test project is IntegrationTests,dll and pipeline console reflects this:

**************** Starting test execution *********************
C:\Program Files\Microsoft Visual Studio\2022\Enterprise\Common7\IDE\Extensions\TestPlatform\vstest.console.exe "@D:\a\_temp\ds03vzkkih5.tmp"
Microsoft (R) Test Execution Command Line Tool Version 17.8.0 (x64)
Copyright (c) Microsoft Corporation.  All rights reserved.
vstest.console.exe "D:\a\1\s\Release\IntegrationTests.dll"
/Settings:"D:\a\_temp\wv4dzzinw1c.tmp.runsettings"
/EnableCodeCoverage
/InIsolation
/Logger:"trx"
/TestAdapterPath:"D:\a\1\s"
/collect:"Code Coverage;Format=Xml" /platform:x86 /Diag:vstestlog.diag
Starting test execution, please wait...
Logging Vstest Diagnostics in file: D:\a\1\s\vstestlog.diag
A total of 1 test files matched the specified pattern.
4.1736
Data collection : Logging DataCollector Diagnostics in file: D:\a\1\s\vstestlog.datacollector.24-01-23_21-27-55_24149_4.diag

vstestlog.log
vstestlog.datacollector.24-01-23_21-27-55_24149_4.log
vstestlog.host.24-01-23_21-28-13_32590_4.log

jakubch1 · 2024-01-24T07:51:57Z

@ibiglari could you please try to run it without code coverage enabled? We should check if this is even coverage related. Based on log currently your build doesn't make any instrumentation.

ibiglari · 2024-01-24T23:57:12Z

@jakubch1 I find that hard to believe, as I can see code coverage in SonarCloud. However, I will try and get back to you.

ibiglari · 2024-01-25T12:54:59Z

@jakubch1 Turning off code coverage leads to successful execution of the test.

Logs attached
vstestlog.log
vstestlog.datacollector.24-01-25_02-45-39_84177_4.log
vstestlog.host.24-01-25_02-45-55_30456_4.log

jakubch1 · 2024-01-25T13:28:41Z

@ibiglari thank you for info. Could you please experiment with below 4 flags:

<EnableStaticNativeInstrumentation>False</EnableStaticNativeInstrumentation>
<EnableDynamicNativeInstrumentation>False</EnableDynamicNativeInstrumentation>
<EnableStaticManagedInstrumentation>False</EnableStaticManagedInstrumentation>
<EnableDynamicManagedInstrumentation>False</EnableDynamicManagedInstrumentation>

I expect if all flags are False you don't have issue. Then you could 1 by 1 enable and check. If you can share your binaries I can also make this testing on my side. You can share privately binaries here: https://developercommunity.visualstudio.com/home

ibiglari · 2024-01-26T10:21:59Z

@jakubch1 I'm afraid I had a crash with all four set to false.
vstestlog.datacollector.24-01-26_10-07-10_50220_4.log
vstestlog.host.24-01-26_10-07-13_39508_4.log
vstestlog.log

jakubch1 · 2024-01-26T11:09:05Z

@ibiglari any idea what is "Waiting for HQ Thread to Terminate..." ?

ibiglari · 2024-01-26T11:11:24Z

@jakubch1 Yes we set a mutex and wait for a thread to terminate... That's one of the log messages from the test routine. Why?

jakubch1 · 2024-01-26T11:44:53Z

I saw this in Test Platform logs. Could you please try this:
https://github.com/microsoft/vstest/blob/main/docs/troubleshooting.md#collect-logs-and-crash-dumphang

ibiglari · 2024-01-26T22:31:55Z

@jakubch1 Now this is funny. If I add /Blame:"CollectDump;DumpType=Full;CollectHangDump;TestTimeout=30min;HangDumpType=Full" to the command line parameters for VSTest@2 task it doesn't crash. I can only guess this parameter changes the internal state of test host somehow?

The problem is, since this error happens randomly, it might be pure luck. I have to run the queue multiple times to ascertain whether a change in parameters causes a crash or not

ibiglari · 2024-01-29T06:38:09Z

@jakubch1 Disregard previous comment. Test host crashed again, and this time this message was added to execution log:

Data collector 'Blame' message: procdump.exe could not be found, please make sure that the executable is available on PATH. Alternatively set PROCDUMP_PATH environment variable to a directory that contains procdump.exe executable.

How do I make this available to my DevOps build agent? Do I need to manually include the binary?

jakubch1 · 2024-01-29T07:57:11Z

@ibiglari please install proc dump (https://github.com/microsoft/vstest/blob/main/docs/troubleshooting.md#collect-process-dump-using-procdump-on-windows-ie-outofmemory) and set PROCDUMP_PATH when running your tests.

ibiglari · 2024-01-29T09:06:38Z

@jakubch1 Found an easier way. However, this meant the test ran for 40 minutes instead of 9 minutes, and then the pipeline timed out. Is that to be expected when using proc dump?

ibiglari · 2024-02-01T04:29:27Z

@jakubch1 After adjusting job's timeout, I can confirm test host does not crash when proc dump is attached to it. However, once the test method is finished, there is a delay of about 30 minutes before test host returns and the pipeline can continue. I can only surmise that proc dump somehow changes internal state of test host. Any advise?

jakubch1 · 2024-02-01T07:46:42Z

@ibiglari in that case please remove /Blame:"CollectDump;DumpType=Full;CollectHangDump;TestTimeout=30min;HangDumpType=Full" stuff and before VSTest@2 step add step:

    - task: PowerShell@2
      displayName: 'Download and enable ProcDump'
      inputs:
        targetType: inline
        script: |
          Add-MpPreference -ExclusionPath $(Build.SourcesDirectory) -Force
          Set-MpPreference -DisableRealtimeMonitoring $true -Force
          Invoke-WebRequest -Uri "https://download.sysinternals.com/files/Procdump.zip" -OutFile $(Agent.TempDirectory)\Procdump.zip
          Expand-Archive -LiteralPath $(Agent.TempDirectory)\Procdump.zip -DestinationPath $(Agent.ToolsDirectory)\Procdump -Force
          mkdir -p "$(Build.SourcesDirectory)\artifacts\dumps"
          $(Agent.ToolsDirectory)\Procdump\procdump.exe -ma -i $(Build.SourcesDirectory)\artifacts\dumps -accepteula

and after VSTest@2 and step:

    - task: PowerShell@2
      displayName: 'Disable ProcDump'
      inputs:
        targetType: inline
        script: |
          $(Agent.ToolsDirectory)\Procdump\procdump.exe -u -accepteula
          Set-MpPreference -DisableRealtimeMonitoring $false -Force
      condition: always()

then please reproduce crash issue and provide us dumps. Please try to repro with:

<EnableStaticNativeInstrumentation>False</EnableStaticNativeInstrumentation>
<EnableDynamicNativeInstrumentation>False</EnableDynamicNativeInstrumentation>
<EnableStaticManagedInstrumentation>False</EnableStaticManagedInstrumentation>
<EnableDynamicManagedInstrumentation>False</EnableDynamicManagedInstrumentation>

ibiglari · 2024-02-01T21:40:42Z

@jakubch1 Will advise. I was installing procdump using a PowerShell task:

  - powershell: |
      choco install procdump -y
    displayName: Install Procdump
    condition: and(succeeded(), eq(variables['Agent.OS'], 'Windows_NT'))

I guess I need to remove it, right?

ibiglari · 2024-02-08T22:19:04Z

@jakubch1 Just an update,

I have not been able to replicate the crash for the past week. The random nature of this crash makes it rather hard to investigate, but as soon as it happens, I will provide the logs here.

tig · 2024-07-08T00:06:35Z

Likely related to

#2952

which is intermittent for me to (although it fails most of the time).

Super frustrating that the vstest team does not seem to think the lack of diagnosabiliy here is a priority. I've wasted hours and hours on this.

nohwnd · 2024-07-08T06:34:00Z

@tig what tools do you lack to diagnose the problem? In the linked issue you are getting full call stack on screen and you have a memory dump.

tig · 2024-07-08T15:52:42Z

@tig what tools do you lack to diagnose the problem? In the linked issue you are getting full call stack on screen and you have a memory dump.

You're kidding right? Have you read all the reports where people are complaining about how hard these issues are to diagnose?

This is not the fault of the customer but the vstest infrastructure and documentation.

No matter what I tried I could not get a dump to be created on Linux. Only because I decided to try the mac runner did I get a dump.

I have no experience reading crash dumps, like 99.99% of dotnet devs.

It seems the team would rather tell customers "you're holding it wrong" than actually fix the infrastructure to make these problems easy to diagnose.

nohwnd · 2024-07-08T17:44:28Z

It seems the team would rather tell customers "you're holding it wrong" than actually fix the infrastructure to make these problems easy to diagnose.

The "Test host process crashed" error means that the hosting process crashed. In many cases it is the user's test code crashing the process for a reason we cannot control. If there was no testhost, but just a plain other .NET process, it would crash the same.

I was not kidding, just honestly asking, what would be required to make this easy to diagnose? Stack trace and memory dumps seems the closest we can get to interactively debugging the problem.

tig · 2024-07-08T19:07:40Z

I was not kidding, just honestly asking, what would be required to make this easy to diagnose? Stack trace and memory dumps seems the closest we can get to interactively debugging the problem.

Here's one obvious example:

#2952 (comment)

Making it so that crash dumps actually get generated when they should would be helpful too: #2952 (comment)

Making is so that a sequence file can be generated regardless of whether the tests failed, or a crash happened would help.

nohwnd · 2024-07-09T07:45:54Z

Making is so that a sequence file can be generated regardless of whether the tests failed, or a crash happened would help.

Sequence file just tells you which tests did not finish (you can also find that in the diag logs), when all tests finish the file is barely useful.

Making it so that crash dumps actually get generated when they should would be helpful too: #2952 (comment)

You told me that memory dumps have no value to you, because you cannot understand them as 99% of other developers. Are there other tools that would help you diagnose the problem? We are building a new platform for running tests and if you think there is a better way that is not memory dumps we can take that into account. :)

For creating the dumps (on .NET) we rely on the built-in .NET memory dump reporting, but because vstest_dump_path is set we don't upload any of the dumps from the run even when it is created.

nohwnd · 2024-07-11T16:39:55Z

Closing in favor of #2952 I would still like to hear your opinion on how to make it easier for devs to debug crashes, when not using memory dumps or stack trace @tig I do agree that the blame collector is far from perfect, but at least on .NET we are offloading all the work to the standard .NET functionality.

nohwnd closed this as not planned Won't fix, can't repro, duplicate, stale Jan 16, 2024

nohwnd reopened this Jan 16, 2024

nohwnd self-assigned this Jul 9, 2024

nohwnd closed this as completed Jul 11, 2024

Test Host randomly crashing with message "##[error]The active test run was aborted. Reason: Test host process crashed" #4819

Test Host randomly crashing with message "##[error]The active test run was aborted. Reason: Test host process crashed" #4819

Comments

ibiglari commented Jan 11, 2024

Pipeline Step:

Steps to reproduce

Expected behavior

Actual behavior

Diagnostic logs

Environment

nohwnd commented Jan 16, 2024

jakubch1 commented Jan 16, 2024

ibiglari commented Jan 19, 2024

ibiglari commented Jan 20, 2024

jakubch1 commented Jan 22, 2024

ibiglari commented Jan 22, 2024

jakubch1 commented Jan 22, 2024

ibiglari commented Jan 22, 2024

jakubch1 commented Jan 22, 2024

ibiglari commented Jan 22, 2024 • edited Loading

jakubch1 commented Jan 23, 2024

ibiglari commented Jan 23, 2024

jakubch1 commented Jan 24, 2024

ibiglari commented Jan 24, 2024

ibiglari commented Jan 25, 2024

jakubch1 commented Jan 25, 2024

ibiglari commented Jan 26, 2024

jakubch1 commented Jan 26, 2024

ibiglari commented Jan 26, 2024

jakubch1 commented Jan 26, 2024

ibiglari commented Jan 26, 2024 • edited Loading

ibiglari commented Jan 29, 2024

jakubch1 commented Jan 29, 2024

ibiglari commented Jan 29, 2024

ibiglari commented Feb 1, 2024

jakubch1 commented Feb 1, 2024

ibiglari commented Feb 1, 2024

ibiglari commented Feb 8, 2024

tig commented Jul 8, 2024

nohwnd commented Jul 8, 2024

tig commented Jul 8, 2024

nohwnd commented Jul 8, 2024

tig commented Jul 8, 2024

nohwnd commented Jul 9, 2024

nohwnd commented Jul 11, 2024

ibiglari commented Jan 22, 2024 •

edited

Loading

ibiglari commented Jan 26, 2024 •

edited

Loading