
Deadlock in GC when attached profiler calls ICorProfilerInfo4::EnumThreads using .NET9 runtime #110062

Closed · obeton opened this issue Nov 21, 2024 · 6 comments · Fixed by #110548

Labels: area-Diagnostics-coreclr, in-pr
Milestone: 10.0.0

obeton commented Nov 21, 2024

Description

Background: I am a member of the team responsible for maintaining a closed-source .NET profiler.

Issue: I am currently validating .NET 9 with our profiler, and while running regression tests I observed a process stall when the profiler is attached to a process running on the .NET 9 runtime.

Root Cause: I've narrowed the cause down to an invocation of ICorProfilerInfo4::EnumThreads inside our profiler's overridden ICorProfilerCallback::RuntimeSuspendFinished method, which I detail in the 'Actual behavior' section.

Reproduction Steps

In the profiler DLL project, have a class like this:

```cpp
class ProfilerCallback : public ICorProfilerCallback5
{
public:
    virtual COM_METHOD(HRESULT) Initialize(IUnknown* pICorProfilerInfoUnk)
    {
        return pICorProfilerInfoUnk->QueryInterface(IID_ICorProfilerInfo8, (void**)&m_pProfilerInfo);
    }

    COM_METHOD(HRESULT) RuntimeSuspendFinished()
    {
        // Enumerating threads here is what triggers the deadlock described below.
        ICorProfilerThreadEnum* threadEnum = nullptr;
        HRESULT enumThreadsHR = m_pProfilerInfo->EnumThreads(&threadEnum);
        if (threadEnum != nullptr)
            threadEnum->Release();
        return enumThreadsHR;
    }

    ICorProfilerInfo8* m_pProfilerInfo = nullptr;
};
```

Attach this as a profiler (following the profiler attach guide; see the note after these steps).
Run the target process and observe a stall/deadlock after RuntimeSuspendFinished is called.
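
For reference, if the profiler is loaded at process start (rather than attached to a running process), the setup comes down to the standard CoreCLR profiler environment variables set before launching the target; the CLSID and path below are placeholders, not this profiler's real values:

    CORECLR_ENABLE_PROFILING=1
    CORECLR_PROFILER={CLSID of the profiler}
    CORECLR_PROFILER_PATH=<full path to the profiler DLL>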

Expected behavior

Process should not deadlock

Actual behavior

Process deadlocks

This is the chain of calls that leads to a deadlock:

  1. Inside RuntimeSuspendFinished, EnumThreads is called
  2. Via StateHolder destructor in ProfilerThreadEnum::Init, ThreadStore::s_pThreadStore->m_HoldingThread is set to NULL
  3. RuntimeSuspendFinished returns
  4. WKS::GCHeap::GarbageCollectGeneration is called which then calls Thread::RareDisablePreemptiveGC
  5. ThreadStore::HoldingThreadStore now returns FALSE because m_HoldingThread == NULL, so the early exit from Thread::RareDisablePreemptiveGC is not taken (a simplified model of steps 2 and 5 follows this list)
  6. Later, a call to GCHeapUtilities::GetGCHeap()->WaitUntilGCComplete() is made, which deadlocks because GCHeap::SetWaitForGCEvent() hasn't been called yet by ThreadSuspend::RestartEE
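
To make the interaction easier to follow, here is a small standalone model of steps 2 and 5 (plain C++, not CoreCLR code, and it deliberately ignores the actual blocking/waiting): a holder that unconditionally clears the owner field on release breaks the "am I the holder?" check for the outer acquisition that is still logically held.

```cpp
#include <cassert>
#include <cstdio>
#include <thread>

// Simplified stand-in for ThreadStore: tracks only who holds the lock.
struct ThreadStoreModel {
    std::thread::id m_HoldingThread{};   // analogue of ThreadStore::m_HoldingThread

    void Lock()   { m_HoldingThread = std::this_thread::get_id(); }
    void Unlock() { m_HoldingThread = std::thread::id{}; }   // cleared even if an outer acquisition is still live
    bool HoldingThreadStore() const { return m_HoldingThread == std::this_thread::get_id(); }
};

int main() {
    ThreadStoreModel ts;

    ts.Lock();        // SuspendEE: the suspending thread takes the thread store lock
    assert(ts.HoldingThreadStore());

    ts.Lock();        // EnumThreads' holder re-acquires it recursively...
    ts.Unlock();      // ...and its destructor clears m_HoldingThread (step 2)

    // Step 5: the early-exit check in RareDisablePreemptiveGC now sees "not holding",
    // so the suspending thread falls through to WaitUntilGCComplete and deadlocks.
    std::printf("HoldingThreadStore: %s\n", ts.HoldingThreadStore() ? "true" : "false");   // prints "false"
    return 0;
}
```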

Below are the relevant stack traces:

Resetting holding thread:
    coreclr.dll!ThreadSuspend::UnlockThreadStore(int bThreadDestroyed, ThreadSuspend::SUSPEND_REASON) Line 1934 C++
    [Inline Frame] coreclr.dll!ThreadStore::UnlockThreadStore() Line 5110 C++
    [Inline Frame] coreclr.dll!StateHolder<&ThreadStore::LockThreadStore,&ThreadStore::UnlockThreadStore>::Release() Line 359 C++
    [Inline Frame] coreclr.dll!StateHolder<&ThreadStore::LockThreadStore,&ThreadStore::UnlockThreadStore>::{dtor}() Line 340 C++
    coreclr.dll!ProfilerThreadEnum::Init() Line 585 C++
    coreclr.dll!ProfToEEInterfaceImpl::EnumThreads(ICorProfilerThreadEnum * * ppEnum) Line 10366 C++
    CoreRewriter_x64.dll!ProfilerCallback::RuntimeSuspendFinished() Line 2417 C++
    coreclr.dll!EEToProfInterfaceImpl::RuntimeSuspendFinished() Line 5056 C++
    coreclr.dll!ProfControlBlock::DoProfilerCallbackHelper<int (__cdecl*)(ProfilerInfo *),long (__cdecl*)(EEToProfInterfaceImpl *)>(ProfilerInfo * pProfilerInfo, int(*)(ProfilerInfo *) condition, HRESULT(*)(EEToProfInterfaceImpl *) callback, HRESULT * pHR) Line 284 C++
    coreclr.dll!ThreadSuspend::SuspendEE(ThreadSuspend::SUSPEND_REASON reason) Line 5647 C++
    coreclr.dll!GCToEEInterface::SuspendEE(SUSPEND_REASON reason) Line 51 C++
    coreclr.dll!WKS::GCHeap::GarbageCollectGeneration(unsigned int gen, gc_reason reason) Line 51030 C++

Checking null holding thread:
    coreclr.dll!ThreadStore::HoldingThreadStore(Thread * pThread) Line 7624 C++
    coreclr.dll!Thread::RareDisablePreemptiveGC() Line 2123 C++
    [Inline Frame] coreclr.dll!WKS::gc_heap::disable_preemptive(bool) Line 1683 C++
    coreclr.dll!WKS::GCHeap::GarbageCollectGeneration(unsigned int gen, gc_reason reason) Line 51031 C++

Deadlock:
    coreclr.dll!WKS::GCHeap::WaitUntilGCComplete(bool bConsiderGCStart) Line 238 C++
    coreclr.dll!Thread::RareDisablePreemptiveGC() Line 2212 C++
    [Inline Frame] coreclr.dll!WKS::gc_heap::disable_preemptive(bool) Line 1683 C++
    coreclr.dll!WKS::GCHeap::GarbageCollectGeneration(unsigned int gen, gc_reason reason) Line 51031 C++

Regression?

We have not encountered this GC deadlock with .NET 8, and presumably not with earlier versions.

Known Workarounds

I haven't tried it yet, but there is probably some way to force ThreadStore::s_pThreadStore->m_HoldingThread back into its previous, correct state after calling EnumThreads. (A different, purely profiler-side mitigation is sketched below.)
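
An untested, profiler-side alternative that sidesteps the problem rather than fixing it: remember why the runtime is suspending via ICorProfilerCallback::RuntimeSuspendStarted and skip EnumThreads when the suspension was triggered by a GC. The sketch below extends the ProfilerCallback class from the reproduction steps; COR_PRF_SUSPEND_REASON, COR_PRF_SUSPEND_FOR_GC and COR_PRF_SUSPEND_OTHER come from corprof.h, and other suspend reasons (e.g. GC prep) may need the same treatment.

```cpp
    COM_METHOD(HRESULT) RuntimeSuspendStarted(COR_PRF_SUSPEND_REASON suspendReason)
    {
        // Record the reason so RuntimeSuspendFinished can tell GC suspends apart
        // from profiler-requested ones.
        m_lastSuspendReason = suspendReason;
        return S_OK;
    }

    COM_METHOD(HRESULT) RuntimeSuspendFinished()
    {
        if (m_lastSuspendReason == COR_PRF_SUSPEND_FOR_GC)
            return S_OK;   // don't re-take the thread store lock during a GC-initiated suspend

        ICorProfilerThreadEnum* threadEnum = nullptr;
        HRESULT hr = m_pProfilerInfo->EnumThreads(&threadEnum);
        if (threadEnum != nullptr)
            threadEnum->Release();
        return hr;
    }

    COR_PRF_SUSPEND_REASON m_lastSuspendReason = COR_PRF_SUSPEND_OTHER;
```

The obvious drawback is that the profiler then never enumerates threads on GC-triggered suspensions, which may or may not be acceptable depending on what the enumeration is used for.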

Configuration

.NET: 9.0.100

Processor: 13th Gen Intel(R) Core(TM) i9-13900H, 2.60 GHz
Installed RAM: 32.0 GB (31.7 GB usable)
System type: 64-bit operating system, x64-based processor
This has also been replicated on build server. I am not aware of the specs for it.

Other information

The thread store lock release in ProfilerThreadEnum::Init (the StateHolder destructor in the first stack trace above) causes ThreadStore::s_pThreadStore->m_HoldingThread to be set to NULL, which has knock-on effects that later lead to a deadlock.

I have minidumps, but they are too large to upload here, so let me know if there is another way I can send them.


Tagging subscribers to this area: @tommcdon
See info in area-owners.md if you want to be subscribed.

dotnet-policy-service bot added the untriaged label (New issue has not been triaged by the area owner) on Nov 21, 2024
tommcdon added this to the 10.0.0 milestone on Nov 25, 2024
tommcdon removed the untriaged label on Nov 25, 2024
VSadov (Member) commented Dec 7, 2024

The reason m_HoldingThread is set to NULL when the thread store lock is released is that the thread store assumes the lock is not taken recursively (i.e. that the same thread does not acquire it again while already holding it).

Among some other things holding the thread store lock ensures that no runtime-managed threads are concurrently added or removed.

Enumerating threads would typically require that the thread store lock is acquired. Also RuntimeSuspendFinished will be called when the lock is already acquired ("all threads suspended" would be meaningless if threads could be added/removed).
I did not dig into this too deeply, but I suspect that the only reason the sample in the description works is that the thread store lock actually can be acquired recursively (possibly only in release build).

It looks like it has been this way for many releases. I am not sure why it started causing problems only now; perhaps some dependency on m_HoldingThread changed, but the root cause is that m_HoldingThread does not behave correctly in the case of recursive use.

Enforcing that the lock is not taken recursively might be a breaking change. It would most certainly rule out the entire scenario where threads are enumerated in RuntimeSuspendFinished.

On the other hand, there is no danger in actually supporting recursive use, so perhaps the right fix is to make the thread store lock a proper recursive lock - e.g. not clearing m_HoldingThread when releasing while the recursion count is still above zero, and similar (a minimal sketch of that idea follows this comment).

At least this is my current theory.
I am not very familiar with profiler APIs, so I may need to look into this a bit more.
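
For what it's worth, here is a small standalone model (plain C++, not the actual CoreCLR change) of the recursion-aware release described above: only the outermost release clears m_HoldingThread, so a nested acquire/release pair no longer disturbs the outer owner.

```cpp
#include <cassert>
#include <thread>

// Simplified model of a thread-store-like lock that tolerates recursive use.
struct RecursiveThreadStoreModel {
    std::thread::id m_HoldingThread{};
    int m_recursionCount = 0;

    void Lock() {
        // (a real lock would block here if another thread owned it)
        m_HoldingThread = std::this_thread::get_id();
        ++m_recursionCount;
    }
    void Unlock() {
        assert(m_HoldingThread == std::this_thread::get_id());
        if (--m_recursionCount == 0)
            m_HoldingThread = std::thread::id{};   // cleared only on the outermost release
    }
    bool HoldingThreadStore() const { return m_HoldingThread == std::this_thread::get_id(); }
};

int main() {
    RecursiveThreadStoreModel ts;
    ts.Lock();                  // outer acquisition (SuspendEE)
    ts.Lock(); ts.Unlock();     // nested acquire/release (EnumThreads' holder)
    assert(ts.HoldingThreadStore());   // outer ownership is still intact
    ts.Unlock();
    return 0;
}
```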

mdh1418 (Member) commented Dec 9, 2024

Thanks for pointing out the problematic recursive ThreadStoreLock!

I looked into it a bit more, and the ThreadStoreLock is being acquired recursively because EnumThreads acquires it (EnumThreads -> ProfilerThreadEnum::Init -> ThreadStoreLockHolder(!g_profControlBlock.fProfilerRequestedRuntimeSuspend)).

The RuntimeSuspendFinished called during a GC is not a profiler-requested runtime suspend, so the ThreadStoreLockHolder in ProfilerThreadEnum::Init doesn't recognize that the ThreadStoreLock shouldn't be (re)acquired. As such, when it releases via UnlockThreadStore, we violate the expectation that the ThreadStoreLock will still be held until RestartEE. I'm thinking that most of the time, RareDisablePreemptiveGC would have exited early, as stated in the issue, because of this check:

    if (ThreadStore::HoldingThreadStore(this))
    {
        // A thread that performs GC may switch modes inside GC and come here. We would not want to
        // suspend the thread that is responsible for the suspension.
        // Ideally that would be the only case when a thread that holds the thread store lock
        // may want to enter coop mode, but we have a number of other scenarios where TSL is acquired
        // in coop mode or when a TSL-owning thread tries to switch to coop.
        // We will handle all cases in the same way - by allowing the thread into coop mode without a wait.
        goto Exit;
    }

Do we need to update the condition the ThreadStoreLockHolder in ProfilerThreadEnum::Init uses to acquire the ThreadStoreLock, so that it accounts for an ongoing GC?
I'm also wondering how all the other instances of ThreadStoreLockHolder avoid reacquiring the ThreadStoreLock - or do those instances not care about reacquiring it because the contexts of those code paths don't assert that the ThreadStoreLock should continue to be held?

dotnet-policy-service bot added the in-pr label (There is an active PR which will close this issue when it is merged) on Dec 9, 2024
obeton (Author) commented Dec 11, 2024

Sincere thanks for the quick turnaround! Any idea which release this will be included in?

mdh1418 (Member) commented Dec 12, 2024

Currently, it will only be included in .NET 10, which can be previewed in the latest .NET SDK builds once the next version is out. Would that work for you, or do you need it in .NET 9?

obeton (Author) commented Dec 12, 2024

This issue currently makes our profiler unusable with any application running on .NET 9, so I would say a fix for .NET 9 is critical.
