"Test host process crashed" error hard to diagnose #2952
Comments
FYI @AArnott who has been investigating this recently |
I have found out what my problem was. It's that infamous 100ms bug striking again. I can't even fathom how much that bug has cost our ecosystem in terms of productivity loss. I personally wasted four full days tracking this down. I had to add Finally I saw this:
I think it's this issue: vstest/src/Microsoft.TestPlatform.CrossPlatEngine/Client/ProxyOperationManager.cs Lines 211 to 215 in d10bcbb
The process is busy doing something, then we impatiently kill it after 100ms, AND WE DON'T TELL THE USER WE KILLED IT. All the user sees is:
So then the user is sent on a wild goose chase for 4 days trying to figure out various ways to deploy procdump.exe to the CI agent, only to find out that it doesn't work anyway because we only explicitly pass StackOverflowException and AccessViolationException as arguments to procdump and ignore other types of exceptions. We killed the process and we know it, but we don't log it, don't publish the dump, and don't raise any MSBuild errors, leaving the user helpless, frustrated, blocked, and not knowing how to proceed. |
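The behaviour being asked for here, a runner that says so when it gives up waiting and kills a child process, can be sketched in a few lines. The Python below is only an illustration and is not vstest's code: `wait_or_kill`, its parameters, and the message wording are hypothetical names chosen for the example, and the 0.1 second timeout simply mirrors the 100ms mentioned above.

```python
import subprocess
import sys


def wait_or_kill(proc, timeout_s):
    """Wait for proc to exit; if it does not, kill it and report that.

    Returns (exit_code, message). message is None when the process
    exited on its own within the timeout. (Hypothetical helper.)
    """
    try:
        return proc.wait(timeout=timeout_s), None
    except subprocess.TimeoutExpired:
        proc.kill()
        code = proc.wait()  # reap the process we just killed
        message = (
            f"Process {proc.pid} ({proc.args!r}) did not exit within "
            f"{timeout_s:.1f}s and was killed by the runner."
        )
        return code, message


if __name__ == "__main__":
    # A child that is still busy when the short timeout expires.
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(30)"]
    )
    code, message = wait_or_kill(child, timeout_s=0.1)
    if message:
        # Surface the kill instead of leaving only a generic crash error.
        print(message, file=sys.stderr)
```

The point of returning the message rather than swallowing the timeout is that the caller can put it in the build log, so a forced kill is distinguishable from a genuine crash.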
I agree it could be easier if you use dotnet test. But we also have vstest yaml task (and classic pipeline task) which has all those cool levers, which could have saved you those 4 days.
The error above looks more like testhost crashed, rather than vstest.console just failing to await it. Please upload the logs if you can, or point me to the build. Or is it one of the builds you shared in chat? |
Yes, it's the build I mentioned in chat. Things should work regardless of whether I'm using dotnet test or the VSTest yaml task. Where are the docs for the VSTest task? Where are these screenshots from? What is the yaml syntax if I wanted to switch my pipeline from dotnet test to VSTest? dotnet test experience should be just as well supported as the VSTest task. |
I've switched to the LocalDumps registry key and will report how that goes.

- powershell: |
    $key = "HKLM:\\SOFTWARE\Microsoft\Windows\Windows Error Reporting\LocalDumps"
    $LogDir = "$(Build.ArtifactStagingDirectory)\build_logs"
    New-Item -Path $key -ErrorAction SilentlyContinue
    New-ItemProperty -Path $key -Name 'DumpType' -PropertyType 'DWord' -Value 2 -Force
    New-ItemProperty -Path $key -Name 'DumpCount' -PropertyType 'DWord' -Value 10 -Force
    New-ItemProperty -Path $key -Name 'DumpFolder' -PropertyType 'String' -Value $LogDir -Force
  displayName: Enable LocalDumps in registry
  continueOnError: true

- powershell: |
    $key = "HKLM:\\SOFTWARE\Microsoft\Windows\Windows Error Reporting\LocalDumps"
    New-Item -Path $key -ErrorAction SilentlyContinue
    New-ItemProperty -Path $key -Name 'DumpType' -PropertyType 'DWord' -Value 0 -Force
  displayName: Disable LocalDumps in registry
  continueOnError: true
  condition: always()
|
Not sure if this is the same issue. We've also just started encountering this error, seemingly at random: we're trying to add one test and it's failing with a completely different error. Log notes here:
@nohwnd @KirillOsenkov any updates on getting more diagnostics? We've wasted a couple of weeks pulling our hair out trying to understand what's going wrong here. |
Have you tried the LocalDumps registry key to upload dumps from your CI? |
I was wondering if you had any update on this issue? I am seeing a stack trace that looks just like the one @michael-hawker mentioned.
This worked fine in .NET 5 VS 2019. In .NET 6 VS 2022 I am seeing this issue. I was wondering if I can provide something to help diagnose this issue. Until this is fixed I will have to turn off my unit test. |
@Daniellled do you have diag logs that you could share? Please add --diag:logs/log.txt to collect them; multiple files will be written into the logs folder. Please share all of them if you can. |
I think that I'm seeing the same thing when trying to use the blame collector in dotnet test. It is causing my CI to fail tests and not upload any of the results or the blame crashes. There is a simple repro:
See this output:
I think that if any collector is attempting to gather data at the end of the test run, then this timeout will be hit and none of the attachments will be uploaded. |
Was this file not created at all? The error below just means that the testhost was killed (the dumper tool did that because it was hanging), but it is the data collector that is responsible for waiting for the dump to finish and passing the file as an attachment to vstest.console. It shows up on the screen, so the file should have been created successfully.
|
The file gets created, but it doesn't get copied into the correct location to be uploaded to artifacts in my pipelines. Normally trx files and their corresponding dumps would get uploaded to artifacts. I've had to resort to scanning the temp directory for test results and dumps when the blame collector captures hangs. |
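A workaround of the kind described above, scanning a directory for results and dumps and copying them somewhere the pipeline publishes from, might look roughly like this. It is a sketch under assumptions: the `*.trx` and `*.dmp` patterns, the function name `collect_test_artifacts`, and the naming scheme for copied files are all made up for illustration, and the actual file names and locations depend on the runner and platform.

```python
import shutil
from pathlib import Path


def collect_test_artifacts(search_root, dest_dir, patterns=("*.trx", "*.dmp")):
    """Copy files matching patterns under search_root into dest_dir.

    Returns the sorted list of copied destination paths.
    (Hypothetical helper; the patterns are assumptions.)
    """
    root = Path(search_root).resolve()
    dest = Path(dest_dir).resolve()
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    for pattern in patterns:
        # Materialize the matches first so copying cannot affect the scan.
        for src in sorted(root.rglob(pattern)):
            # Skip directories and anything already inside the destination.
            if not src.is_file() or dest in src.parents:
                continue
            # Prefix with the parent folder name to reduce name collisions.
            target = dest / f"{src.parent.name}_{src.name}"
            shutil.copy2(src, target)
            copied.append(target)
    return sorted(copied)
```

In a pipeline step this could be called as `collect_test_artifacts(tempfile.gettempdir(), "artifacts/test-dumps")` before the publish step runs. Two files with the same name in identically named folders would still overwrite each other, so a real script may want a stronger naming scheme.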
Getting the same with net6.0 but also net5.0, 3.x and 2.x by varying the global json. Thus the .NET Core version does not seem to matter. I believe that a .NET Core SDK update has resulted in this bug, quite possibly by modifying a global value/location because |
In dotnet/runtime#66515 this was because of an AV in the test itself. That's not a vs test bug. But IMO vstest should not dump the NotifyDataAvailable stack at this point. NotifyDataAvailable should catch the exception, which should be an IOException (it's been unwrapped below I believe) since that stack is purely a detail of vstest implementation when the test itself spontaneously quits. It makes it look like vstest has a bug, which in this case it does not.
|
I agree, I've been planning to fix that message for a long time because that error is just confusing. |
Same issue here with xUnit on Unix using |
@KundM did you try enabling diag logs and looking at them? Or collecting crash dumps? What target framework is your project targeting? |
Any updates on this issue? I encountered the same error message |
Just FYI that I've resolved this issue and found out it was due to an old version of Microsoft.NET.Test.Sdk. Upgrading it to 16.10.0 for the test projects fixed my issue completely. |
I'm encountering this issue randomly with the latest .NET 6 SDK. It will happen in random test projects (using XUnit) with no predictability (whether the test project itself was modified doesn't always seem to matter when running the whole of |
The same issue with Playwright.NET, xUnit and Microsoft.NET.Test.Sdk 17.4.0 in .netcoreapp3.1 project. Tests run on TeamCity agent. |
Hi everyone! I am running into this issue while running
I also looked at the If any of the people involved (@nohwnd or @KirillOsenkov), would be kind enough to have a look at the attached |
Also just started getting this error in the last week. Had to disable our tests to get CI running, which is not ideal. This is using .NET 7 on Linux GitLab runners |
This is what --blame does: you will get a sequence file which has a list of all finished and unfinished tests. And if you enable --diag you will get a .datacollector.log that also has entries for tests that started and ended. --blame-crash is then there to observe your tests (on Windows you need procdump).
If the runtime writes it to the console of the app that is running we are able to capture that from vstest.console and write it to screen and log. That is what we do for .NET and .NET Framework applications. The experience with .NET is way better because they handle stack overflows and other exceptions "in-process" and write them to the screen with callstack before the process dies. In .NET Framework it simply writes the error without call stack and the process is killed. The info is written to event log, but we don't grab that info, because we cannot know exactly what entry to take and don't want to expose info of other processes. We can look at how it behaves for UWP if you have a specific crash in mind. |
Quick update from my side: Looking at the procdump files, it looked like an issue with a specific DLL file in my build folder. After letting it rest for a while and then updating all packages at some point (including |
Just wanted to point out something that has not been made clear in this thread so far. |
@sanikolov Some issues might not manifest in the development environment (e.g. only in CI, or remote machines with different configurations). I think it's reasonable to want to more easily diagnose failures after the fact, especially in the case of a simple internal timeout from msbuild. Personally, I think it would also be helpful here if there were an easy way to set up process level isolation with method level parallelization akin to #3424. Then we could more easily identify the failing test and get a simplified procdump to diagnose. I note the change in isolation level might also prevent the crash from occurring, but that would probably be OK for my affected case (as it is due to third-party proprietary DLLs which the upstream provider won't/is unable to fix). |
Hi all - I was just hoping someone could either help or summarize the state of this issue. I'll start with what's brought me here. We've recently started to migrate to Linux and have started running our unit test dlls using dotnet test on a Linux host. When running on Windows the tests pass with no issue. So we followed some of the steps here and added
and all I found was
That exception is repeated a few times. From some searching, this appears to be the error you get when the test host unexpectedly terminates, but what I can't find is why. I see near the start of this issue someone mentions that the above stack trace is useless and should be improved #2952 (comment). However, I'm not seeing any real mention of a 'problem' after that. It seems it gets into the weeds about adding diag logs etc., but I've done all that and it sheds no light. Jump ahead a few wasted days: we found a null ref exception in the test. It's handled and causes no issues on Windows. However, if we prevent that exception from being thrown, the tests run fine on Linux, with no crashed process. My theory at the moment is that for some reason that exception is causing some kind of 'freeze' in the process, which then gets terminated because....??? Anyway, I've chatted quite enough. Can anyone help us with how we could go about diagnosing such issues in the future? (I can share full logs privately, but can't share them via GitHub) |
Have you also added --blame-crash? Which version of the .NET SDK / vstest.console are you on? And could you post a simple repro of your problem? Where was the null ref failing? My experience with crashes on .NET (not .NET Framework) is that you almost always get the full stack trace on screen in the crashing process. In the past we had problems picking this text up in the top level process, because the output streams API behavior differs on Windows and Linux, but that should have been fixed for a long time (since netcoreapp3.1 I think).
Printing that error on crash / abort, was addressed, but it still shows up in the diag logs, because sometimes it is useful. |
Hello! I really appreciate you responding. I did add blame crash and did get a crash dump, which I'll say two things about.
I probably didn't make this clear in my original comment: it very much appears that my tests are not 'terminating', they are being 'terminated'. The tests themselves have no issue; they pass on a Windows host 100% of the time. I mentioned the null reference exception because it did change the outcome on Linux, but that null ref is always handled. What we changed was just to not throw it and instead go down the same path we'd take when the exception was caught. If I update my tests to actually fail/crash/have a problem then, as you say, I do see that on screen. So, not to repeat myself, but I do think the issue I'm having is that my tests are being terminated by the 'wrapping' test process, not that the tests themselves are terminating. Someone initially posted about an 'infamous 100ms bug' #2952 (comment). What bug is this, and is there more information on what the fix was/will be and any workarounds? Regarding versions, here's what I'm running with: vstest.console.dll, Version: 17.0.3+cc7fb0593127e24f55ce016fb3ac85b5b2857fec (this is from the diag.log file); dotnet sdk version: 6.0.122. I'm running the test dll by just copying a .net6.0 output folder over to a Linux host and running dotnet test ./mytests.dll |
We had this issue pop up in the last few days running our unit tests via a GitHub action.

- name: Test
  run: |
    sed -i 's/"stopOnFail": false/"stopOnFail": true/g' UnitTests/xunit.runner.json
    dotnet test --no-restore --verbosity normal --collect:"XPlat Code Coverage" --settings UnitTests/coverlet.runsettings --blame
    mv -v UnitTests/TestResults/*/*.* UnitTests/TestResults/

I just switched our runs-on from If I run the tests locally in WSL they pass. |
Hello, is there any progress? We also encountered this on CI/CD |
Checking back on this, hoping something has been fixed in the meantime. It has not. |
I've now spent several more hours on this with no joy. Latest logs/dumps attached. I can't get a crash dump to collect, and --blame is not creating a Can someone see anything in these logs that might point to the culprit? logs.txt Recall that all this works fine on Windows, but fails on Here's my command: dotnet test --verbosity normal --collect:"XPlat Code Coverage" --settings UnitTests/coverlet.runsettings --diag:logs/logs.txt --blame --blame-crash --blame-hang --blame-hang-timeout 60s --blame-crash-collect-always |
There is no sequence file because all tests finished running, see the last line in your screenshot. In the logs that you uploaded you are specifying VSTEST_DUMP_PATH, which means you are responsible for uploading the created dumps (rather than vstest uploading them as attachments for you). Are you doing that? You would need to inspect the dump using the dotnet dump tool; you are probably looking for the locals in the callstack to see what data cannot be serialized. In any case this looks like an XUnit bug, there is a currently active issue that has the same error message: |
I never set VSTEST_DUMP_PATH explicitly. It's being set automatically somewhere. |
We don't set it, so it must be you or your system. I am just pointing that variable out because it disables uploading the dumps automatically; some systems prefer to collect all the dumps themselves from various places to have them in a single view. |
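Since the variable can end up set without anyone on the team doing it deliberately, a CI step that reports its state can save some guessing. A minimal sketch, assuming only what is said above about VSTEST_DUMP_PATH (when it is set, dumps are not uploaded as attachments automatically); the function name `dump_upload_mode` and the message wording are hypothetical, and treating an empty value as unset is an assumption of this example.

```python
import os


def dump_upload_mode(env=None):
    """Describe who is expected to upload dumps, based on VSTEST_DUMP_PATH.

    env defaults to the process environment; an empty value is treated
    as unset (an assumption of this sketch). Hypothetical helper.
    """
    env = os.environ if env is None else env
    path = env.get("VSTEST_DUMP_PATH")
    if path:
        return (
            f"VSTEST_DUMP_PATH={path}: dumps are written there and "
            "must be uploaded by the pipeline itself."
        )
    return (
        "VSTEST_DUMP_PATH is not set: dumps are expected to be "
        "handled as test run attachments."
    )


if __name__ == "__main__":
    # Print the state at the start of the test job so it lands in the CI log.
    print(dump_upload_mode())
```

Running this as the first step of the test job makes it obvious from the log whether a separate dump publishing step is needed.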
For the record and others stumbling on this, the "infamous 100ms bug" mentioned under #2952 (comment) is #2379. It's here in the code. While the default is still 100ms, you can now set the |
Our CI has an intermittent failure:
As far as I can tell, the message is printed by this code:
vstest/src/Microsoft.TestPlatform.CommunicationUtilities/TestRequestSender.cs
Line 667 in eff66c0
It would be really helpful here if it printed the full path to the process executable that crashed. I took a quick look but I couldn't find where to get the executable full path from. Also it seems that the error output stream contents were empty (clientExitErrorMessage), so when the test process crashes it should print the full exception stack to Console.Error so that the stack appears in the CI log.