gRPC Segfaults in Triton 24.05 due to Low Request Cancellation Timeout #7368

AshwinAmbal opened this issue Jun 24, 2024 · 4 comments

@AshwinAmbal

Description
We use gRPC to query Triton for Model Ready, Model Metadata, and Model Inference requests. When the Triton server runs for a sustained period of time, we get unexpected segfaults (Signal 11 received). The stack trace of the segfault is attached to this issue, but the crash cannot be predicted and happens across our servers at irregular intervals.

Triton Information
What version of Triton are you using?
Version 24.05
I also built a CPU-only version with debug symbols and reproduced the same issue.

Are you using the Triton container or did you build it yourself?
I can reproduce the issue both in the Triton container from NGC and in my custom build.

To Reproduce
Steps to reproduce the behavior.

  1. Use xDS with gRPC client-side load balancing to route requests from the client to multiple Triton servers
  2. Use a basic Golang client to query the model in Triton (like here). [Uses gRPC v1.63.2; a minimal client sketch follows these steps.]
  3. Use a TensorFlow DNN model with only numeric or string features
  4. Set up a model repository agent as a sidecar within the same pod as Triton to copy models from S3 to the pod, then trigger Triton with a default config.pbtxt as an HTTP payload, as described here.
  5. Set up Triton 24.05 as a container within the same pod and use the startup command:
   tritonserver --model-store=/model_repo \
                   --model-control-mode=explicit \
                   --exit-on-error=true \
                   --strict-readiness=true \
                   --allow-cpu-metrics=true
  6. Add TensorFlow Decision Forests (TF-DF) to Triton to support GBT models:
RUN wget https://files.pythonhosted.org/packages/b8/1a/f1a21d24357b9f760e791c7b54804535421de1f1ee08a456d3a7f7ec7bbb/tensorflow_decision_forests-1.8.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl && \
    unzip ./tensorflow_decision_forests-1.8.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl -d ./tensorflow_decision_forests && \
    cp ./tensorflow_decision_forests/tensorflow_decision_forests/tensorflow/ops/inference/inference.so /home/inference.so
  7. Use the following environment variables:
    - name: TF_ENABLE_ONEDNN_OPTS
      value: "0"
    - name: LD_PRELOAD
      value: /usr/lib/x86_64-linux-gnu/libtcmalloc.so.4
    - name: GRPC_VERBOSITY
      value: INFO
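
For reference, a minimal sketch of the Go client flow from step 2. This is illustrative rather than our exact code: the generated-stub import path, server address, and model name are placeholders, and the stubs are assumed to be generated from Triton's grpc_service.proto as in the linked example.

package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"

	// Stubs generated from Triton's grpc_service.proto; import path is a placeholder.
	triton "example.com/internal/tritonpb"
)

func main() {
	conn, err := grpc.Dial("triton:8001", grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatalf("dial: %v", err)
	}
	defer conn.Close()
	client := triton.NewGRPCInferenceServiceClient(conn)

	// Per-request deadline; in production this varies per request.
	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()

	// Model Ready check followed by an inference request.
	ready, err := client.ModelReady(ctx, &triton.ModelReadyRequest{Name: "my_model", Version: "1"})
	if err != nil || !ready.GetReady() {
		log.Fatalf("model not ready: %v", err)
	}

	resp, err := client.ModelInfer(ctx, &triton.ModelInferRequest{
		ModelName:    "my_model",
		ModelVersion: "1",
		// Inputs and outputs omitted for brevity.
	})
	if err != nil {
		log.Fatalf("infer: %v", err)
	}
	_ = resp
}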

Description of node
c6i.16xlarge instance from AWS

Error occurred

2024-06-24 14:36:06.709 {"stream":"stderr","logtag":"F","log":"Signal (11) received."}
2024-06-24 14:36:06.868 {"stream":"stderr","logtag":"F","log":" 0# 0x0000564E2B34EA9D in tritonserver"}
2024-06-24 14:36:06.868 {"stream":"stderr","logtag":"F","log":" 1# 0x0000782BD912F520 in /lib/x86_64-linux-gnu/libc.so.6"}
2024-06-24 14:36:06.868 {"stream":"stderr","logtag":"F","log":" 2# 0x0000564E2B3B3706 in tritonserver"}
2024-06-24 14:36:06.868 {"stream":"stderr","logtag":"F","log":" 3# 0x0000564E2B3AD954 in tritonserver"}
2024-06-24 14:36:06.868 {"stream":"stderr","logtag":"F","log":" 4# 0x0000564E2B3AE85B in tritonserver"}
2024-06-24 14:36:06.868 {"stream":"stderr","logtag":"F","log":" 5# 0x0000564E2B3A40B9 in tritonserver"}
2024-06-24 14:36:06.868 {"stream":"stderr","logtag":"F","log":" 6# 0x0000782BD93F2253 in /lib/x86_64-linux-gnu/libstdc++.so.6"}
2024-06-24 14:36:06.868 {"stream":"stderr","logtag":"F","log":" 7# 0x0000782BD9181AC3 in /lib/x86_64-linux-gnu/libc.so.6"}
2024-06-24 14:36:06.868 {"stream":"stderr","logtag":"F","log":" 8# 0x0000782BD9213850 in /lib/x86_64-linux-gnu/libc.so.6"}

GDB trace obtained by building a debug container as described here:

Thread 11 "tritonserver" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff1e65000 (LWP 155)]
0x0000555555719724 in triton::server::grpc::InferHandlerState<grpc::ServerAsyncResponseWriter<inference::ModelInferResponse>, inference::ModelInferRequest, inference::ModelInferResponse>::Context::IsCancelled (this=0x0) at /workspace/src/grpc/infer_handler.h:669
669     /workspace/src/grpc/infer_handler.h: No such file or directory.
(gdb) bt
#0  0x0000555555719724 in triton::server::grpc::InferHandlerState<grpc::ServerAsyncResponseWriter<inference::ModelInferResponse>, inference::ModelInferRequest, inference::ModelInferResponse>::Context::IsCancelled (this=0x0) at /workspace/src/grpc/infer_handler.h:669
#1  0x0000555555715e48 in triton::server::grpc::InferHandlerState<grpc::ServerAsyncResponseWriter<inference::ModelInferResponse>, inference::ModelInferRequest, inference::ModelInferResponse>::IsGrpcContextCancelled (this=0x555564d52600) at /workspace/src/grpc/infer_handler.h:1034
#2  0x00005555557100c1 in triton::server::grpc::ModelInferHandler::Process (this=0x5555570c2800, state=0x555564d52600, rpc_ok=true) at /workspace/src/grpc/infer_handler.cc:696
#3  0x00005555556f638d in _ZZN6triton6server4grpc12InferHandlerIN9inference20GRPCInferenceService26WithAsyncMethod_ServerLiveINS4_27WithAsyncMethod_ServerReadyINS4_26WithAsyncMethod_ModelReadyINS4_30WithAsyncMethod_ServerMetadataINS4_29WithAsyncMethod_ModelMetadataINS4_26WithAsyncMethod_ModelInferINS4_32WithAsyncMethod_ModelStreamInferINS4_27WithAsyncMethod_ModelConfigINS4_31WithAsyncMethod_ModelStatisticsINS4_31WithAsyncMethod_RepositoryIndexINS4_35WithAsyncMethod_RepositoryModelLoadINS4_37WithAsyncMethod_RepositoryModelUnloadINS4_40WithAsyncMethod_SystemSharedMemoryStatusINS4_42WithAsyncMethod_SystemSharedMemoryRegisterINS4_44WithAsyncMethod_SystemSharedMemoryUnregisterINS4_38WithAsyncMethod_CudaSharedMemoryStatusINS4_40WithAsyncMethod_CudaSharedMemoryRegisterINS4_42WithAsyncMethod_CudaSharedMemoryUnregisterINS4_28WithAsyncMethod_TraceSettingINS4_27WithAsyncMethod_LogSettingsINS4_7ServiceEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEN4grpc25ServerAsyncResponseWriterINS3_18ModelInferResponseEEENS3_17ModelInferRequestES1C_E5StartEvENKUlvE_clEv (__closure=0x555557081168) at /workspace/src/grpc/infer_handler.h:1310
#4  0x000055555570b2a3 in _ZSt13__invoke_implIvZN6triton6server4grpc12InferHandlerIN9inference20GRPCInferenceService26WithAsyncMethod_ServerLiveINS5_27WithAsyncMethod_ServerReadyINS5_26WithAsyncMethod_ModelReadyINS5_30WithAsyncMethod_ServerMetadataINS5_29WithAsyncMethod_ModelMetadataINS5_26WithAsyncMethod_ModelInferINS5_32WithAsyncMethod_ModelStreamInferINS5_27WithAsyncMethod_ModelConfigINS5_31WithAsyncMethod_ModelStatisticsINS5_31WithAsyncMethod_RepositoryIndexINS5_35WithAsyncMethod_RepositoryModelLoadINS5_37WithAsyncMethod_RepositoryModelUnloadINS5_40WithAsyncMethod_SystemSharedMemoryStatusINS5_42WithAsyncMethod_SystemSharedMemoryRegisterINS5_44WithAsyncMethod_SystemSharedMemoryUnregisterINS5_38WithAsyncMethod_CudaSharedMemoryStatusINS5_40WithAsyncMethod_CudaSharedMemoryRegisterINS5_42WithAsyncMethod_CudaSharedMemoryUnregisterINS5_28WithAsyncMethod_TraceSettingINS5_27WithAsyncMethod_LogSettingsINS5_7ServiceEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEN4grpc25ServerAsyncResponseWriterINS4_18ModelInferResponseEEENS4_17ModelInferRequestES1D_E5StartEvEUlvE_JEET_St14__invoke_otherOT0_DpOT1_ (__f=...) at /usr/include/c++/11/bits/invoke.h:61
#5  0x000055555570b1ff in _ZSt8__invokeIZN6triton6server4grpc12InferHandlerIN9inference20GRPCInferenceService26WithAsyncMethod_ServerLiveINS5_27WithAsyncMethod_ServerReadyINS5_26WithAsyncMethod_ModelReadyINS5_30WithAsyncMethod_ServerMetadataINS5_29WithAsyncMethod_ModelMetadataINS5_26WithAsyncMethod_ModelInferINS5_32WithAsyncMethod_ModelStreamInferINS5_27WithAsyncMethod_ModelConfigINS5_31WithAsyncMethod_ModelStatisticsINS5_31WithAsyncMethod_RepositoryIndexINS5_35WithAsyncMethod_RepositoryModelLoadINS5_37WithAsyncMethod_RepositoryModelUnloadINS5_40WithAsyncMethod_SystemSharedMemoryStatusINS5_42WithAsyncMethod_SystemSharedMemoryRegisterINS5_44WithAsyncMethod_SystemSharedMemoryUnregisterINS5_38WithAsyncMethod_CudaSharedMemoryStatusINS5_40WithAsyncMethod_CudaSharedMemoryRegisterINS5_42WithAsyncMethod_CudaSharedMemoryUnregisterINS5_28WithAsyncMethod_TraceSettingINS5_27WithAsyncMethod_LogSettingsINS5_7ServiceEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEN4grpc25ServerAsyncResponseWriterINS4_18ModelInferResponseEEENS4_17ModelInferRequestES1D_E5StartEvEUlvE_JEENSt15__invoke_resultIT_JDpT0_EE4typeEOS1J_DpOS1K_ (__fn=...) at /usr/include/c++/11/bits/invoke.h:96
#6  0x000055555570b170 in _ZNSt6thread8_InvokerISt5tupleIJZN6triton6server4grpc12InferHandlerIN9inference20GRPCInferenceService26WithAsyncMethod_ServerLiveINS7_27WithAsyncMethod_ServerReadyINS7_26WithAsyncMethod_ModelReadyINS7_30WithAsyncMethod_ServerMetadataINS7_29WithAsyncMethod_ModelMetadataINS7_26WithAsyncMethod_ModelInferINS7_32WithAsyncMethod_ModelStreamInferINS7_27WithAsyncMethod_ModelConfigINS7_31WithAsyncMethod_ModelStatisticsINS7_31WithAsyncMethod_RepositoryIndexINS7_35WithAsyncMethod_RepositoryModelLoadINS7_37WithAsyncMethod_RepositoryModelUnloadINS7_40WithAsyncMethod_SystemSharedMemoryStatusINS7_42WithAsyncMethod_SystemSharedMemoryRegisterINS7_44WithAsyncMethod_SystemSharedMemoryUnregisterINS7_38WithAsyncMethod_CudaSharedMemoryStatusINS7_40WithAsyncMethod_CudaSharedMemoryRegisterINS7_42WithAsyncMethod_CudaSharedMemoryUnregisterINS7_28WithAsyncMethod_TraceSettingINS7_27WithAsyncMethod_LogSettingsINS7_7ServiceEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEN4grpc25ServerAsyncResponseWriterINS6_18ModelInferResponseEEENS6_17ModelInferRequestES1F_E5StartEvEUlvE_EEE9_M_invokeIJLm0EEEEvSt12_Index_tupleIJXspT_EEE (this=0x555557081168) at /usr/include/c++/11/bits/std_thread.h:259
#7  0x000055555570b120 in _ZNSt6thread8_InvokerISt5tupleIJZN6triton6server4grpc12InferHandlerIN9inference20GRPCInferenceService26WithAsyncMethod_ServerLiveINS7_27WithAsyncMethod_ServerReadyINS7_26WithAsyncMethod_ModelReadyINS7_30WithAsyncMethod_ServerMetadataINS7_29WithAsyncMethod_ModelMetadataINS7_26WithAsyncMethod_ModelInferINS7_32WithAsyncMethod_ModelStreamInferINS7_27WithAsyncMethod_ModelConfigINS7_31WithAsyncMethod_ModelStatisticsINS7_31WithAsyncMethod_RepositoryIndexINS7_35WithAsyncMethod_RepositoryModelLoadINS7_37WithAsyncMethod_RepositoryModelUnloadINS7_40WithAsyncMethod_SystemSharedMemoryStatusINS7_42WithAsyncMethod_SystemSharedMemoryRegisterINS7_44WithAsyncMethod_SystemSharedMemoryUnregisterINS7_38WithAsyncMethod_CudaSharedMemoryStatusINS7_40WithAsyncMethod_CudaSharedMemoryRegisterINS7_42WithAsyncMethod_CudaSharedMemoryUnregisterINS7_28WithAsyncMethod_TraceSettingINS7_27WithAsyncMethod_LogSettingsINS7_7ServiceEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEN4grpc25ServerAsyncResponseWriterINS6_18ModelInferResponseEEENS6_17ModelInferRequestES1F_E5StartEvEUlvE_EEEclEv (this=0x555557081168) at /usr/include/c++/11/bits/std_thread.h:266
#8  0x000055555570b0dc in _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJZN6triton6server4grpc12InferHandlerIN9inference20GRPCInferenceService26WithAsyncMethod_ServerLiveINS8_27WithAsyncMethod_ServerReadyINS8_26WithAsyncMethod_ModelReadyINS8_30WithAsyncMethod_ServerMetadataINS8_29WithAsyncMethod_ModelMetadataINS8_26WithAsyncMethod_ModelInferINS8_32WithAsyncMethod_ModelStreamInferINS8_27WithAsyncMethod_ModelConfigINS8_31WithAsyncMethod_ModelStatisticsINS8_31WithAsyncMethod_RepositoryIndexINS8_35WithAsyncMethod_RepositoryModelLoadINS8_37WithAsyncMethod_RepositoryModelUnloadINS8_40WithAsyncMethod_SystemSharedMemoryStatusINS8_42WithAsyncMethod_SystemSharedMemoryRegisterINS8_44WithAsyncMethod_SystemSharedMemoryUnregisterINS8_38WithAsyncMethod_CudaSharedMemoryStatusINS8_40WithAsyncMethod_CudaSharedMemoryRegisterINS8_42WithAsyncMethod_CudaSharedMemoryUnregisterINS8_28WithAsyncMethod_TraceSettingINS8_27WithAsyncMethod_LogSettingsINS8_7ServiceEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEN4grpc25ServerAsyncResponseWriterINS7_18ModelInferResponseEEENS7_17ModelInferRequestES1G_E5StartEvEUlvE_EEEEE6_M_runEv (this=0x555557081160) at /usr/include/c++/11/bits/std_thread.h:211
#9  0x00007ffff6974253 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#10 0x00007ffff6703ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#11 0x00007ffff6795850 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

Describe the models (framework, inputs, outputs), ideally include the model configuration file (if using an ensemble include the model configuration file for that as well).

Model configuration can be seen here.
The model is trained with TensorFlow 2.13 and is a saved_model.pb artifact.

Expected behavior
No Segfaults or server crashes

As we can see, the issue starts in the gRPC InferHandlerState and goes deeper into the Triton code, which I am trying to study myself. I thought I would raise this issue here since it seems major, and I would like to get more eyes on it from the Triton community.

Please let me know if you need any more information from my end.

Thanks

AshwinAmbal changed the title from "Segfaults with Triton 24.05 gRPC" to "gRPC Segfaults in Triton 24.05" on Jun 24, 2024
@AshwinAmbal (Author)

On a quick study of the stack trace, this could also be related to a low context timeout on the request (10 milliseconds); the crash surfaces in Context::IsCancelled above. I will run some experiments to confirm the behavior.

AshwinAmbal changed the title from "gRPC Segfaults in Triton 24.05" to "gRPC Segfaults in Triton 24.05 due to Request Cancellation" on Jun 25, 2024
@AshwinAmbal (Author) commented Jun 25, 2024

I've confirmed that the issue is request cancellation. In production we have varying timeouts per inference request, and one particular set of requests had the timeout set in the range of 1-4 ms for end-to-end inference. This caused the segmentation fault, and increasing the timeout resolved the issue.
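
For illustration, the triggering pattern looks roughly like this (a sketch, not our production code; client is the generated GRPCInferenceService client and req is a populated ModelInferRequest):

// A per-request context deadline in the low single-digit milliseconds
// (1 - 4 ms in our case). When the deadline expires before Triton responds,
// gRPC cancels the call and the cancellation surfaces on the server side.
ctx, cancel := context.WithTimeout(context.Background(), 2*time.Millisecond)
defer cancel()

resp, err := client.ModelInfer(ctx, req)
if err != nil {
	// Typically a DeadlineExceeded status here.
	return nil, err
}
return resp, nil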

I was also reading up on this, and it seems the request cancellation feature is still under development and is currently only supported for gRPC Python, as seen here and here. We use gRPC Golang, on the other hand.

@tanmayv25 this may be of interest to you as I see you have already worked on this part of the code here.

cc: @Tabrizian @dyastremsky @rmccorm4 as well

Let me know if you need any more details here.

For now, we are working around this by wrapping the call in a Goroutine with its own timeout and not setting a timeout on the inference request itself.

Thanks

@tanmayv25 (Contributor) commented Jul 1, 2024

Thanks @AshwinAmbal for digging into this and sharing the results of your experimentation.

So, if the timeout value is very small, you run into this segfault? You are not sending request cancellation explicitly from the client, right? Would you mind sharing your model execution time and the rest of the latency breakdown?

Can you update the title of this issue to reflect the current issue?

@AshwinAmbal (Author) commented Jul 2, 2024

Hi @tanmayv25,

So, if the timeout value is very small, you run into this segfault? You are not sending request cancellation explicitly from the client, right?

Yes. Low context timeouts sent from the client over gRPC cause the segfault. We were triggering request cancellation by setting the context timeout, as done here, except that at times our context timeouts range between 1 ms and 4 ms, which causes the segfault.

Hence, to work around this issue we have created Goroutines that send the inference request to Triton with a high (or no) context timeout, while the caller applies the timeout we expect for the request. If that timeout (1 ms - 4 ms) is reached, the main routine returns without waiting for the Goroutine to finish, while the Goroutine itself only completes after the inference response is received from Triton.

For example, pseudocode for the main routine is as follows:

func getPrediction(client inference.GRPCInferenceServiceClient, req *inference.ModelInferRequest, timeout time.Duration) (*inference.ModelInferResponse, error) {
	resChan := make(chan *inference.ModelInferResponse, 1)
	errChan := make(chan error, 1)

	go func() {
		// High timeout (or no timeout) on the inference request itself.
		res, err := client.ModelInfer(context.Background(), req)
		if err != nil {
			errChan <- err
			return
		}
		resChan <- res
	}()

	t := time.NewTimer(timeout) // Goroutine timeout of 1 - 4 ms
	defer t.Stop()

	select {
	case r := <-resChan:
		// process the result of model inference and return it
		return r, nil
	case err := <-errChan:
		return nil, err
	case <-t.C:
		return nil, errors.New("triton inference time out")
	}
}

Please also note that we are using Triton for CPU-only inference at this point.

Would you mind sharing your model execution time and rest of the latency breakdown?

The Average Inference Request Duration for the model is 1.04 ms as reported by Triton (nv_inference_request_duration_us / nv_inference_count).
The E2E Inference Request Duration reported by the client for this particular model [including network RTT] is as follows:

Avg: 1.81 ms
p50: 1.74 ms
p95: 2.77 ms
p99: 3.13 ms

Can you update the title of this issue to reflect the current issue?

I believe the issue is the request cancellation timeout being low. I will update the title accordingly.

Let me know if you need any more details.

Thanks

AshwinAmbal changed the title from "gRPC Segfaults in Triton 24.05 due to Request Cancellation" to "gRPC Segfaults in Triton 24.05 due to Low Request Cancellation Timeout" on Jul 2, 2024