Triton Crash with Signal 11 while using python backend #7400

Open
burling opened this issue Jul 1, 2024 · 1 comment
burling commented Jul 1, 2024

Description
While using the Python vLLM backend, Triton crashed with signal 11. The model had been loaded and warmed up for some time before the crash occurred.
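
For context, the vLLM backend is implemented on top of Triton's Python backend and typically runs in decoupled mode. A minimal sketch of a decoupled model.py of that general shape is shown below; the tensor names "text_input"/"text_output" are illustrative only and not the actual vLLM backend interface:

import triton_python_backend_utils as pb_utils

class TritonPythonModel:
    def initialize(self, args):
        # args["model_config"] holds the model configuration as a JSON string.
        pass

    def execute(self, requests):
        # In decoupled mode, responses go through a response sender instead of
        # being returned from execute().
        for request in requests:
            sender = request.get_response_sender()
            text = pb_utils.get_input_tensor_by_name(request, "text_input").as_numpy()
            out = pb_utils.Tensor("text_output", text)
            sender.send(pb_utils.InferenceResponse(output_tensors=[out]))
            # Mark the request as complete so the gRPC frontend can finalize it.
            sender.send(flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL)
        return None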

Triton Information
What version of Triton are you using?

  • Triton: v2.42.0
  • Python backend: r24.01
  • GPU: A100
  • OS: CentOS 7

Are you using the Triton container or did you build it yourself?
Yes

Trace info:

Signal (11) received.
 0# triton::server::(anonymous namespace)::ErrorSignalHandler(int) at triton_signal.cc:?
 1# 0x00007F2477AC8B50 in /usr/lib64/libc.so.6
 2# 0x00007F24780CE7F2 in /usr/lib64/libm.so.6
 3# 0x00007F24780CF49C in /usr/lib64/libm.so.6
 4# pow in /usr/lib64/libm.so.6
 5# grpc_core::chttp2::TransportFlowControl::PeriodicUpdate() in /opt/tritonserver/bin/tritonserver
 6# finish_bdp_ping_locked(void*, absl::lts_20220623::Status) at chttp2_transport.cc:?
 7# grpc_combiner_continue_exec_ctx() in /opt/tritonserver/bin/tritonserver
 8# grpc_core::ExecCtx::Flush() in /opt/tritonserver/bin/tritonserver
 9# end_worker(grpc_pollset*, grpc_pollset_worker*, grpc_pollset_worker**) at ev_epoll1_linux.cc:?
10# pollset_work(grpc_pollset*, grpc_pollset_worker**, grpc_core::Timestamp) at ev_epoll1_linux.cc:?
11# pollset_work(grpc_pollset*, grpc_pollset_worker**, grpc_core::Timestamp) at ev_posix.cc:?
12# grpc_pollset_work(grpc_pollset*, grpc_pollset_worker**, grpc_core::Timestamp) in /opt/tritonserver/bin/tritonserver
13# cq_next(grpc_completion_queue*, gpr_timespec, void*) at completion_queue.cc:?
14# grpc::CompletionQueue::AsyncNextInternal(void**, bool*, gpr_timespec) in /opt/tritonserver/bin/tritonserver
15# triton::server::grpc::InferHandler<inference::GRPCInferenceService::WithAsyncMethod_ServerLive<inference::GRPCInferenceService::WithAsyncMethod_ServerReady<inference::GRPCInferenceService::WithAsyncMethod_ModelReady<inference::GRPCInferenceService::WithAsyncMethod_ServerMetadata<inference::GRPCInferenceService::WithAsyncMethod_ModelMetadata<inference::GRPCInferenceService::WithAsyncMethod_ModelInfer<inference::GRPCInferenceService::WithAsyncMethod_ModelStreamInfer<inference::GRPCInferenceService::WithAsyncMethod_ModelConfig<inference::GRPCInferenceService::WithAsyncMethod_ModelStatistics<inference::GRPCInferenceService::WithAsyncMethod_RepositoryIndex<inference::GRPCInferenceService::WithAsyncMethod_RepositoryModelLoad<inference::GRPCInferenceService::WithAsyncMethod_RepositoryModelUnload<inference::GRPCInferenceService::WithAsyncMethod_SystemSharedMemoryStatus<inference::GRPCInferenceService::WithAsyncMethod_SystemSharedMemoryRegister<inference::GRPCInferenceService::WithAsyncMethod_SystemSharedMemoryUnregister<inference::GRPCInferenceService::WithAsyncMethod_CudaSharedMemoryStatus<inference::GRPCInferenceService::WithAsyncMethod_CudaSharedMemoryRegister<inference::GRPCInferenceService::WithAsyncMethod_CudaSharedMemoryUnregister<inference::GRPCInferenceService::WithAsyncMethod_TraceSetting<inference::GRPCInferenceService::WithAsyncMethod_LogSettings<inference::GRPCInferenceService::Service> > > > > > > > > > > > > > > > > > > >, grpc::ServerAsyncResponseWriter<inference::ModelInferResponse>, inference::ModelInferRequest, inference::ModelInferResponse>::Start()::{lambda()#1}::operator()() const in /opt/tritonserver/bin/tritonserver
16# 0x00007F247849BB13 in /usr/lib64/libstdc++.so.6
17# 0x00007F24787761CA in /usr/lib64/libpthread.so.0
18# clone in /usr/lib64/libc.so.6

Markovvn1w commented Jul 2, 2024

I am getting a very similar problem, though I am not sure it is exactly the same error. I also have a decoupled Python backend. After starting tritonserver I run a stress test that sends a large number of requests to the server. Within the first 10 minutes of testing I quite consistently hit this error, which completely crashes my tritonserver. Unfortunately I have a custom build of tritonserver based on 24.05, so I don't know how relevant this information is. However, I did not have this problem on version 23.10.

E0702 19:02:48.658289 148148 infer_handler.h:187] "[INTERNAL] Attempting to access current response when it is not ready"
Signal (11) received.
0.773678183555603
 0# 0x0000561EA6BD83ED in tritonserver
 1# 0x00007F6E5E5D3090 in /usr/lib/x86_64-linux-gnu/libc.so.6
 2# 0x0000561EA6C4DBE4 in tritonserver
 3# 0x0000561EA6C4E740 in tritonserver
 4# 0x0000561EA6C46DFA in tritonserver
 5# 0x0000561EA6C31AB5 in tritonserver
 6# 0x00007F6E5E9D4793 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
 7# 0x00007F6E5EB64609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0
 8# clone in /usr/lib/x86_64-linux-gnu/libc.so.6

Segmentation fault (core dumped)

I assume the error occurs because of this check; however, I have no clue why that is the case:

ResponseType* GetCurrentResponse()
{
  std::lock_guard<std::mutex> lock(mtx_);
  if (current_index_ >= ready_count_) {
    LOG_ERROR << "[INTERNAL] Attempting to access current response when it "
                 "is not ready";
    return nullptr;
  }
  return responses_[current_index_];
}
