Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Test Host randomly crashing with message "##[error]The active test run was aborted. Reason: Test host process crashed" #4819

Closed
ibiglari opened this issue Jan 11, 2024 · 35 comments
Assignees

Comments

@ibiglari
Copy link

We have an MFC application and are using Microsoft::VisualStudio::CppUnitTestFramework for testing. VSTest@2 step is failing almost 50% of the time with the message ##[error]The active test run was aborted. Reason: Test host process crashed after the last step of the test routine is finished, as apparent from the log messages printed by Microsoft::VisualStudio::CppUnitTestFramework::Logger::WriteMessage(). The test successfully finishes when being executed directly via Visual Studio 17.8.4.

Pipeline Step:

  - task: VSTest@2
    displayName: Integration Tests
    inputs:
      testSelector: 'testAssemblies'
      testAssemblyVer2: 'Release\IntegrationTests.dll'
      searchFolder: '$(System.DefaultWorkingDirectory)'
      resultsFolder: '$(System.DefaultWorkingDirectory)/CoverageResults'
      runTestsInIsolation: true
      codeCoverageEnabled: true
      otherConsoleOptions: '/collect:"Code Coverage;Format=Xml" /platform:x86 /Diag:vstestlog.diag'
      testRunTitle: 'Run tests'
      rerunFailedTests: false

Steps to reproduce

Since this is a random occurrence, re-queuing the job usually resolves the issue.

Expected behavior

No crash during test execution

Actual behavior

Random crashes after last step of the test routine

Diagnostic logs

vstestlog.datacollector.24-01-11_08-49-42_80953_4.diag.log
vstestlog.diag.log
vstestlog.host.24-01-11_08-49-51_57026_4.diag.log

Environment

Agent name: 'Azure Pipelines 2'
Agent machine name: 'xxx'
Current agent version: '3.232.1'
Operating System
Microsoft Windows Server 2022
10.0.20348
Datacenter
Runner Image
Image: windows-2022
Version: 20240108.1.0
Included Software: https://github.com/actions/runner-images/blob/win22/20240108.1/images/windows/Windows2022-Readme.md
Image Release: https://github.com/actions/runner-images/releases/tag/win22%2F20240108.1
Runner Image Provisioner
2.0.321.1
Current image version: '20240108.1.0'
Agent running as: 'VssAdministrator'
@nohwnd
Copy link
Member

nohwnd commented Jan 16, 2024

@jakubch1 this looks like a CodeCoverage issue. Testhost just crashes and there is no error reported. Please have a look.

@jakubch1
Copy link
Member

@ibiglari you are using dynamic native instrumentation which is not fully stable.
Please check: https://learn.microsoft.com/en-us/visualstudio/test/customizing-code-coverage-analysis?view=vs-2022#static-and-dynamic-native-instrumentation and enable for your projects: https://learn.microsoft.com/en-us/cpp/build/reference/profile-performance-tools-profiler?view=msvc-170 to use static native instrumentation.

After doing above you should not see in logs such errors:

TpTrace Information: 0 : 6604, 1, 2024/01/11, 08:49:47.363, 12024538561, datacollector.exe, StaticInstrumenter: searching in directory: D:\a\1\s\Release, recursive: True
TpTrace Information: 0 : 6604, 1, 2024/01/11, 08:49:50.835, 12059210729, datacollector.exe, StaticInstrumenter: native instrumentation result for D:\a\1\s\Release\IntegrationTests.dll: ErrorNoFixupData. Time taken: 00:00:03.4388736
TpTrace Information: 0 : 6604, 6, 2024/01/11, 08:49:50.929, 12060148132, datacollector.exe, StaticInstrumenter: native instrumentation result for D:\a\1\s\Release\SOFTWARE-TO-BE-TESTED.exe: ErrorNoFixupData. Time taken: 00:00:03.5328777

@nohwnd nohwnd closed this as not planned Won't fix, can't repro, duplicate, stale Jan 16, 2024
@nohwnd nohwnd reopened this Jan 16, 2024
@ibiglari
Copy link
Author

@jakubch1 so if I add /PROFILE to linker options, code coverage will automatically use static instrumentation instead of dynamic instrumentation? If not, how do I instruct test host to use static instrumentation in the pipeline?

@ibiglari
Copy link
Author

@jakubch1 I'm afraid adding /PROFILE did not fix the issue. Logs attached
vstestlog.datacollector.24-01-20_23-20-38_09648_4.log
vstestlog.host.24-01-20_23-20-51_39345_4.log
vstestlog.log

Linker command line as extracted from build pipeline logs:
Link: C:\Program Files\Microsoft Visual Studio\2022\Enterprise\VC\Tools\MSVC\14.37.32822\bin\HostX86\x86\link.exe /ERRORREPORT:QUEUE /OUT:"D:\a\1\s\Release\SOFTWARE-TO-BE-TESTED.exe" /INCREMENTAL:NO /NOLOGO /LIBPATH:D:\a\_temp\LIBS\OpenSSL\HQ\lib /LIBPATH:CsdBase\Win32\Release /LIBPATH:CsdSystem\Win32\Release /LIBPATH:DbLib\Win32\Release /LIBPATH:Interfaces\Win32\Release /LIBPATH:Utils\Win32\Release odbc32.lib psapi.lib libcrypto.lib libssl.lib Crypt32.lib version.lib dbghelp.lib csdbase.lib csdsystem.lib dblib.lib interfaces.lib utils.lib /MANIFEST /MANIFESTUAC:"level='asInvoker' uiAccess='false'" /manifest:embed /manifestinput:"C:\Program Files\Microsoft Visual Studio\2022\Enterprise\VC\Tools\MSVC\14.37.32822\Include\Manifest\dpiaware.manifest" /DEBUG /PDB:".\Release/SOFTWARE-TO-BE-TESTED.pdb" /MAP:"SOFTWARE-TO-BE-TESTED-Release.map" /SUBSYSTEM:WINDOWS /TLBID:1 /DYNAMICBASE /NXCOMPAT /IMPLIB:"D:\a\1\s\Release\SOFTWARE-TO-BE-TESTED.lib" /MACHINE:X86 /PROFILE /SAFESEH Release\Eftsrv.res

@jakubch1
Copy link
Member

@jakubch1 so if I add /PROFILE to linker options, code coverage will automatically use static instrumentation instead of dynamic instrumentation? If not, how do I instruct test host to use static instrumentation in the pipeline?

Yes. Static instrumentation is better and more reliable so we have it enabled by default but if /PROFILE is missing static instrumentation is not working and we fallback to use dynamic instrumentation.

Please try to configure below flags in runsettings:

<EnableStaticNativeInstrumentation>True</EnableStaticNativeInstrumentation>
<EnableDynamicNativeInstrumentation>False</EnableDynamicNativeInstrumentation>

Full runsettings samples: https://github.com/microsoft/codecoverage/blob/main/docs/configuration.md

@ibiglari
Copy link
Author

@jakubch1 Even after updating .runsettings I am seeing the same behaviour.

Please try to configure below flags in runsettings:

<EnableStaticNativeInstrumentation>True</EnableStaticNativeInstrumentation>
<EnableDynamicNativeInstrumentation>False</EnableDynamicNativeInstrumentation>

Full runsettings samples: https://github.com/microsoft/codecoverage/blob/main/docs/configuration.md

@jakubch1
Copy link
Member

@ibiglari could you please share logs?

@ibiglari
Copy link
Author

@jakubch1
Copy link
Member

@ibiglari something is wrong as I don't see

<EnableStaticNativeInstrumentation>True</EnableStaticNativeInstrumentation>
<EnableDynamicNativeInstrumentation>False</EnableDynamicNativeInstrumentation>

inside logs. Can you double check that this was added into <CodeCoverage> tag?

Did you configure:

runSettingsFile: # string. Settings file. 

inside task: VSTest@2?

@ibiglari
Copy link
Author

ibiglari commented Jan 22, 2024

@jakubch1

Added runSettingsFile to VSTest@2 task, double checked the pipeline execution log.

.
.
.
##########################################################################
DtaExecutionHost version 19.229.34210.1.
Starting TestExecution Model...
Result Attachments will be stored in LogStore
			<DataCollector friendlyName="Code Coverage" uri="datacollector://Microsoft/CodeCoverage/2.0" assemblyQualifiedName="Microsoft.VisualStudio.Coverage.DynamicCoverageDataCollector, Microsoft.VisualStudio.TraceCollector, Version=11.0.0.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a">
				<Configuration>
					<CodeCoverage>
						<UseVerifiableInstrumentation>False</UseVerifiableInstrumentation>
						<EnableStaticNativeInstrumentation>True</EnableStaticNativeInstrumentation>
						<EnableDynamicNativeInstrumentation>False</EnableDynamicNativeInstrumentation>
					</CodeCoverage>
				</Configuration>
			</DataCollector>
		</DataCollectors>
	</DataCollectionRunSettings>	
	<Catch2Adapter>
		<FilenameFilter>^*Test</FilenameFilter>
	</Catch2Adapter>
</RunSettings>
Updated Run Settings:
<RunSettings>
  <DataCollectionRunSettings>
    <DataCollectors>
      <DataCollector friendlyName="Code Coverage" uri="datacollector://Microsoft/CodeCoverage/2.0" assemblyQualifiedName="Microsoft.VisualStudio.Coverage.DynamicCoverageDataCollector, Microsoft.VisualStudio.TraceCollector, Version=11.0.0.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a">
        <Configuration>
          <CodeCoverage>
            <UseVerifiableInstrumentation>False</UseVerifiableInstrumentation>
            <EnableStaticNativeInstrumentation>True</EnableStaticNativeInstrumentation>
            <EnableDynamicNativeInstrumentation>False</EnableDynamicNativeInstrumentation>
          </CodeCoverage>
        </Configuration>
      </DataCollector>
    </DataCollectors>
  </DataCollectionRunSettings>
  <Catch2Adapter>
    <FilenameFilter>^*Test</FilenameFilter>
  </Catch2Adapter>
  <RunConfiguration>
    <BatchSize>1000</BatchSize>
    <ResultsDirectory>D:\a\1\s\CoverageResults</ResultsDirectory>
  </RunConfiguration>
</RunSettings>
**************** Starting test execution *********************
.
.
.

New logs attached.

vstestlog.log
vstestlog.host.24-01-22_21-22-54_31244_4.log

vstestlog.datacollector.24-01-22_21-22-41_03781_4.log

@jakubch1
Copy link
Member

@ibiglari currently you are instrumenting only 1 file: D:\a\1\s\IntegrationTests\Utils\Reflector\ReflectrService.dll. Is it required? Please try to exclude it to see what happen:

<ModulePaths><Exclude><ModulePath>.*ReflectrService\.dll$</ModulePath></Exclude></ModulePaths>

(inside <CodeCoverage>)

@ibiglari
Copy link
Author

@jakubch1 No joy. Still seeing the crash. Logs attached.

Also, not sure what you mean by instrumenting only 1 file? Our test project is IntegrationTests,dll and pipeline console reflects this:

**************** Starting test execution *********************
C:\Program Files\Microsoft Visual Studio\2022\Enterprise\Common7\IDE\Extensions\TestPlatform\vstest.console.exe "@D:\a\_temp\ds03vzkkih5.tmp"
Microsoft (R) Test Execution Command Line Tool Version 17.8.0 (x64)
Copyright (c) Microsoft Corporation.  All rights reserved.
vstest.console.exe "D:\a\1\s\Release\IntegrationTests.dll"
/Settings:"D:\a\_temp\wv4dzzinw1c.tmp.runsettings"
/EnableCodeCoverage
/InIsolation
/Logger:"trx"
/TestAdapterPath:"D:\a\1\s"
/collect:"Code Coverage;Format=Xml" /platform:x86 /Diag:vstestlog.diag
Starting test execution, please wait...
Logging Vstest Diagnostics in file: D:\a\1\s\vstestlog.diag
A total of 1 test files matched the specified pattern.
4.1736
Data collection : Logging DataCollector Diagnostics in file: D:\a\1\s\vstestlog.datacollector.24-01-23_21-27-55_24149_4.diag

vstestlog.log
vstestlog.datacollector.24-01-23_21-27-55_24149_4.log
vstestlog.host.24-01-23_21-28-13_32590_4.log

@jakubch1
Copy link
Member

@ibiglari could you please try to run it without code coverage enabled? We should check if this is even coverage related. Based on log currently your build doesn't make any instrumentation.

@ibiglari
Copy link
Author

@jakubch1 I find that hard to believe, as I can see code coverage in SonarCloud. However, I will try and get back to you.

@ibiglari
Copy link
Author

@jakubch1 Turning off code coverage leads to successful execution of the test.

Logs attached
vstestlog.log
vstestlog.datacollector.24-01-25_02-45-39_84177_4.log
vstestlog.host.24-01-25_02-45-55_30456_4.log

@jakubch1
Copy link
Member

@ibiglari thank you for info. Could you please experiment with below 4 flags:

<EnableStaticNativeInstrumentation>False</EnableStaticNativeInstrumentation>
<EnableDynamicNativeInstrumentation>False</EnableDynamicNativeInstrumentation>
<EnableStaticManagedInstrumentation>False</EnableStaticManagedInstrumentation>
<EnableDynamicManagedInstrumentation>False</EnableDynamicManagedInstrumentation>

I expect if all flags are False you don't have issue. Then you could 1 by 1 enable and check. If you can share your binaries I can also make this testing on my side. You can share privately binaries here: https://developercommunity.visualstudio.com/home

@ibiglari
Copy link
Author

@jakubch1
Copy link
Member

@ibiglari any idea what is "Waiting for HQ Thread to Terminate..." ?

@ibiglari
Copy link
Author

@jakubch1 Yes we set a mutex and wait for a thread to terminate... That's one of the log messages from the test routine. Why?

@jakubch1
Copy link
Member

I saw this in Test Platform logs. Could you please try this:
https://github.com/microsoft/vstest/blob/main/docs/troubleshooting.md#collect-logs-and-crash-dumphang

@ibiglari
Copy link
Author

ibiglari commented Jan 26, 2024

@jakubch1 Now this is funny. If I add /Blame:"CollectDump;DumpType=Full;CollectHangDump;TestTimeout=30min;HangDumpType=Full" to the command line parameters for VSTest@2 task it doesn't crash. I can only guess this parameter changes the internal state of test host somehow?

The problem is, since this error happens randomly, it might be pure luck. I have to run the queue multiple times to ascertain whether a change in parameters causes a crash or not

@ibiglari
Copy link
Author

@jakubch1 Disregard previous comment. Test host crashed again, and this time this message was added to execution log:

Data collector 'Blame' message: procdump.exe could not be found, please make sure that the executable is available on PATH. Alternatively set PROCDUMP_PATH environment variable to a directory that contains procdump.exe executable.

How do I make this available to my DevOps build agent? Do I need to manually include the binary?

@jakubch1
Copy link
Member

@ibiglari
Copy link
Author

@jakubch1 Found an easier way. However, this meant the test ran for 40 minutes instead of 9 minutes, and then the pipeline timed out. Is that to be expected when using proc dump?

@ibiglari
Copy link
Author

ibiglari commented Feb 1, 2024

@jakubch1 After adjusting job's timeout, I can confirm test host does not crash when proc dump is attached to it. However, once the test method is finished, there is a delay of about 30 minutes before test host returns and the pipeline can continue. I can only surmise that proc dump somehow changes internal state of test host. Any advise?

@jakubch1
Copy link
Member

jakubch1 commented Feb 1, 2024

@ibiglari in that case please remove /Blame:"CollectDump;DumpType=Full;CollectHangDump;TestTimeout=30min;HangDumpType=Full" stuff and before VSTest@2 step add step:

    - task: PowerShell@2
      displayName: 'Download and enable ProcDump'
      inputs:
        targetType: inline
        script: |
          Add-MpPreference -ExclusionPath $(Build.SourcesDirectory) -Force
          Set-MpPreference -DisableRealtimeMonitoring $true -Force
          Invoke-WebRequest -Uri "https://download.sysinternals.com/files/Procdump.zip" -OutFile $(Agent.TempDirectory)\Procdump.zip
          Expand-Archive -LiteralPath $(Agent.TempDirectory)\Procdump.zip -DestinationPath $(Agent.ToolsDirectory)\Procdump -Force
          mkdir -p "$(Build.SourcesDirectory)\artifacts\dumps"
          $(Agent.ToolsDirectory)\Procdump\procdump.exe -ma -i $(Build.SourcesDirectory)\artifacts\dumps -accepteula

and after VSTest@2 and step:

    - task: PowerShell@2
      displayName: 'Disable ProcDump'
      inputs:
        targetType: inline
        script: |
          $(Agent.ToolsDirectory)\Procdump\procdump.exe -u -accepteula
          Set-MpPreference -DisableRealtimeMonitoring $false -Force
      condition: always()

then please reproduce crash issue and provide us dumps. Please try to repro with:

<EnableStaticNativeInstrumentation>False</EnableStaticNativeInstrumentation>
<EnableDynamicNativeInstrumentation>False</EnableDynamicNativeInstrumentation>
<EnableStaticManagedInstrumentation>False</EnableStaticManagedInstrumentation>
<EnableDynamicManagedInstrumentation>False</EnableDynamicManagedInstrumentation>

@ibiglari
Copy link
Author

ibiglari commented Feb 1, 2024

@jakubch1 Will advise. I was installing procdump using a PowerShell task:

  - powershell: |
      choco install procdump -y
    displayName: Install Procdump
    condition: and(succeeded(), eq(variables['Agent.OS'], 'Windows_NT'))

I guess I need to remove it, right?

@ibiglari
Copy link
Author

ibiglari commented Feb 8, 2024

@jakubch1 Just an update,

I have not been able to replicate the crash for the past week. The random nature of this crash makes it rather hard to investigate, but as soon as it happens, I will provide the logs here.

@tig
Copy link

tig commented Jul 8, 2024

Likely related to

#2952

which is intermittent for me to (although it fails most of the time).

Super frustrating that the vstest team does not seem to think the lack of diagnosabiliy here is a priority. I've wasted hours and hours on this.

@nohwnd
Copy link
Member

nohwnd commented Jul 8, 2024

@tig what tools do you lack to diagnose the problem? In the linked issue you are getting full call stack on screen and you have a memory dump.

@tig
Copy link

tig commented Jul 8, 2024

@tig what tools do you lack to diagnose the problem? In the linked issue you are getting full call stack on screen and you have a memory dump.

You're kidding right? Have you read all the reports where people are complaining about how hard these issues are to diagnose?

This is not the fault of the customer but the vstest infrastructure and documentation.

No matter what I tried I could not get a dump to be created on Linux. Only because I decided to try the mac runner did I get a dump.

I have no experience reading crash dumps, like 99.99% of dotnet devs.

It seems the team would rather tell customers "you're holding it wrong" than actually fix the infrastructure to make these problems easy to diagnose.

@nohwnd
Copy link
Member

nohwnd commented Jul 8, 2024

It seems the team would rather tell customers "you're holding it wrong" than actually fix the infrastructure to make these problems easy to diagnose.

The "Test host process crashed" error means that the hosting process crashed. In many cases it is the user's test code crashing the process for a reason we cannot control. If there was no testhost, but just a plain other .NET process, it would crash the same.

I was not kidding, just honestly asking, what would be required to make this easy to diagnose? Stack trace and memory dumps seems the closest we can get to interactively debugging the problem.

@tig
Copy link

tig commented Jul 8, 2024

I was not kidding, just honestly asking, what would be required to make this easy to diagnose? Stack trace and memory dumps seems the closest we can get to interactively debugging the problem.

Here's one obvious example:

image
#2952 (comment)

Making it so that crash dumps actually get generated when they should would be helpful too: #2952 (comment)

Making is so that a sequence file can be generated regardless of whether the tests failed, or a crash happened would help.

@nohwnd
Copy link
Member

nohwnd commented Jul 9, 2024

Making is so that a sequence file can be generated regardless of whether the tests failed, or a crash happened would help.

Sequence file just tells you which tests did not finish (you can also find that in the diag logs), when all tests finish the file is barely useful.

Making it so that crash dumps actually get generated when they should would be helpful too: #2952 (comment)

You told me that memory dumps have no value to you, because you cannot understand them as 99% of other developers. Are there other tools that would help you diagnose the problem? We are building a new platform for running tests and if you think there is a better way that is not memory dumps we can take that into account. :)

For creating the dumps (on .NET) we rely on the built-in .NET memory dump reporting, but because vstest_dump_path is set we don't upload any of the dumps from the run even when it is created.

@nohwnd nohwnd self-assigned this Jul 9, 2024
@nohwnd
Copy link
Member

nohwnd commented Jul 11, 2024

Closing in favor of #2952 I would still like to hear your opinion on how to make it easier for devs to debug crashes, when not using memory dumps or stack trace @tig I do agree that the blame collector is far from perfect, but at least on .NET we are offloading all the work to the standard .NET functionality.

@nohwnd nohwnd closed this as completed Jul 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants