
spend more than 20ms on preprocess on jetson nano, batch=1 #109

Open · JinRanYAO opened this issue Apr 7, 2024 · 7 comments

JinRanYAO commented Apr 7, 2024

Hello, thank you for your excellent work. I am trying to use your yolov8-pose code on a Jetson Nano for real-time detection, with batch=1 and image shape 640(h)x384(w). I get correct results, but inference takes 40+ ms and preprocessing takes another 20+ ms, which seems too long for preprocessing. Is anything wrong, and is there anything I can do to optimize it?

FeiYull (Owner) commented Apr 7, 2024

@JinRanYAO Is the data you're testing a picture or a video?

FeiYull (Owner) commented Apr 7, 2024

@JinRanYAO Try the following command to build an FP16 engine; it roughly doubles inference performance:

```bash
./trtexec --onnx=yolov8n-pose.onnx --saveEngine=yolov8n-pose-fp16.trt --buildOnly --minShapes=images:1x3x640x640 --optShapes=images:2x3x640x640 --maxShapes=images:4x3x640x640 --fp16
```

【FP32】:
[04/07/2024-09:15:16] [I] preprocess time = 0.841472; infer time = 5.80734; postprocess time = 0.186192
[04/07/2024-09:15:16] [I] preprocess time = 0.837504; infer time = 5.76032; postprocess time = 0.13976
[04/07/2024-09:15:16] [I] preprocess time = 0.845184; infer time = 5.75726; postprocess time = 0.209248
[04/07/2024-09:15:16] [I] preprocess time = 0.839952; infer time = 5.76222; postprocess time = 0.170016
[04/07/2024-09:15:16] [I] preprocess time = 0.844816; infer time = 5.76472; postprocess time = 0.146288
[04/07/2024-09:15:16] [I] preprocess time = 0.838784; infer time = 5.76434; postprocess time = 0.203216
[04/07/2024-09:15:16] [I] preprocess time = 0.808864; infer time = 5.5223; postprocess time = 0.150368
[04/07/2024-09:15:16] [I] preprocess time = 0.811856; infer time = 5.52139; postprocess time = 0.184
[04/07/2024-09:15:16] [I] preprocess time = 0.80856; infer time = 5.52371; postprocess time = 0.20792
[04/07/2024-09:15:16] [I] preprocess time = 0.809776; infer time = 5.51814; postprocess time = 0.168032
[04/07/2024-09:15:16] [I] preprocess time = 0.810064; infer time = 5.5215; postprocess time = 0.208496
[04/07/2024-09:15:16] [I] preprocess time = 0.811216; infer time = 5.51797; postprocess time = 0.201968
[04/07/2024-09:15:16] [I] preprocess time = 0.809136; infer time = 5.51658; postprocess time = 0.179296

【FP16】:
[04/07/2024-09:15:26] [I] preprocess time = 0.84056; infer time = 2.59362; postprocess time = 0.177744
[04/07/2024-09:15:26] [I] preprocess time = 0.84752; infer time = 2.43448; postprocess time = 0.132512
[04/07/2024-09:15:26] [I] preprocess time = 0.840256; infer time = 2.42754; postprocess time = 0.206288
[04/07/2024-09:15:26] [I] preprocess time = 0.841216; infer time = 2.43272; postprocess time = 0.160144
[04/07/2024-09:15:26] [I] preprocess time = 0.840736; infer time = 2.42774; postprocess time = 0.137648
[04/07/2024-09:15:26] [I] preprocess time = 0.841296; infer time = 2.4313; postprocess time = 0.194464
[04/07/2024-09:15:26] [I] preprocess time = 0.840992; infer time = 2.43011; postprocess time = 0.149072
[04/07/2024-09:15:26] [I] preprocess time = 0.83664; infer time = 2.43083; postprocess time = 0.184176
[04/07/2024-09:15:26] [I] preprocess time = 0.841136; infer time = 2.4283; postprocess time = 0.20736
[04/07/2024-09:15:26] [I] preprocess time = 0.844864; infer time = 2.4312; postprocess time = 0.165424
[04/07/2024-09:15:26] [I] preprocess time = 0.842; infer time = 2.42846; postprocess time = 0.207552
[04/07/2024-09:15:26] [I] preprocess time = 0.8444; infer time = 2.43054; postprocess time = 0.203488
[04/07/2024-09:15:26] [I] preprocess time = 0.84024; infer time = 2.43106; postprocess time = 0.179952

JinRanYAO (Author) commented Apr 7, 2024

@FeiYull Thank you for your quick reply!

1. My project is built on ROS, so I don't use utils::InputStream. I call yolov8.init() at the beginning, and when I receive an image I run the following code for each frame. I think this is like using utils::InputStream::IMAGE? Is this code reasonable, or can it be improved anywhere?

   ```cpp
   imgs_batch.emplace_back(frame.clone());
   yolov8.copy(imgs_batch);
   utils::DeviceTimer d_t1; yolov8.preprocess(imgs_batch);  float t1 = d_t1.getUsedTime();
   utils::DeviceTimer d_t2; yolov8.infer();                 float t2 = d_t2.getUsedTime();
   utils::DeviceTimer d_t3; yolov8.postprocess(imgs_batch); float t3 = d_t3.getUsedTime();
   float avg_times[3] = { t1, t2, t3 };
   sample::gLogInfo << "preprocess time = " << avg_times[0] << "; "
       "infer time = " << avg_times[1] << "; "
       "postprocess time = " << avg_times[2] << std::endl;
   yolov8.reset();
   imgs_batch.clear();
   ```

2. Thanks, I tried FP16, and the inference time decreased from 40 ms to 30 ms, but preprocessing still takes 20 ms. Can I use INT8 to get faster inference? (See the sketch after this list.)

3. Additionally, my raw image size is 1920x1080. Is too much time being spent on the resize?
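
For reference, a minimal sketch of an INT8 build with the same shapes, assuming the same trtexec binary as above; trtexec's --calib flag loads a pre-generated calibration cache (the cache file name here is a placeholder), and without calibration INT8 accuracy will suffer. Note also that the Jetson Nano's Maxwell GPU lacks fast INT8 support, so INT8 is unlikely to bring much speedup there:

```bash
# hypothetical INT8 build; yolov8n-pose-calib.cache must be generated separately
./trtexec --onnx=yolov8n-pose.onnx --saveEngine=yolov8n-pose-int8.trt --buildOnly \
    --minShapes=images:1x3x640x640 --optShapes=images:2x3x640x640 --maxShapes=images:4x3x640x640 \
    --int8 --calib=yolov8n-pose-calib.cache
```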

FeiYull (Owner) commented Apr 7, 2024

@JinRanYAO
I recommend stepping into YOLOv8Pose::preprocess and measuring the time spent by each internal step:

```cpp
void YOLOv8Pose::preprocess(const std::vector<cv::Mat>& imgsBatch)
```
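
For reference, a minimal instrumentation sketch, assuming preprocess internally calls resizeDevice, bgr2rgbDevice, normDevice, and an HWC-to-CHW kernel (hwc2chwDevice is an assumed name); the arguments are elided since they depend on class members:

```cpp
// hypothetical timing of each preprocessing stage, using the repo's own
// utils::DeviceTimer as in the per-frame loop above
utils::DeviceTimer t0; resizeDevice(/* ... */);   float t_resize  = t0.getUsedTime();
utils::DeviceTimer t1; bgr2rgbDevice(/* ... */);  float t_bgr2rgb = t1.getUsedTime();
utils::DeviceTimer t2; normDevice(/* ... */);     float t_norm    = t2.getUsedTime();
utils::DeviceTimer t3; hwc2chwDevice(/* ... */);  float t_hwc2chw = t3.getUsedTime();  // name assumed
sample::gLogInfo << "resize = " << t_resize << "; bgr2rgb = " << t_bgr2rgb
                 << "; norm = " << t_norm << "; hwc2chw = " << t_hwc2chw << std::endl;
```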

JinRanYAO (Author) commented

@FeiYull It seems that resize, bgr2rgb, norm, and hwc2chw each cost about the same, roughly 5 ms per step. Could I use the equivalent OpenCV functions when I receive the image, instead of running these steps here?
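
For comparison, a minimal CPU-side equivalent sketch using OpenCV's cv::dnn::blobFromImage, which performs resize, BGR-to-RGB swap, scaling, and HWC-to-CHW in one call; whether this beats the GPU path on a Nano would need measuring:

```cpp
#include <opencv2/dnn.hpp>
#include <opencv2/imgproc.hpp>

// hypothetical CPU preprocessing for one frame; output blob is NCHW float32
cv::Mat preprocess_cpu(const cv::Mat& frame)
{
    return cv::dnn::blobFromImage(
        frame,               // input BGR frame
        1.0 / 255.0,         // scale factor
        cv::Size(384, 640),  // network input size (width, height)
        cv::Scalar(),        // no mean subtraction
        true,                // swapRB: BGR -> RGB
        false);              // no center crop
}
```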

FeiYull (Owner) commented Apr 8, 2024

@JinRanYAO You can merge the following operations into one:

1. resizeDevice
2. bgr2rgbDevice
3. normDevice

Inside the CUDA kernel called by resizeDevice, modify the following:

[modify before]

```cpp
pdst[0] = c0;
pdst[1] = c1;
pdst[2] = c2;
```

[modify after]

```cpp
// bgr2rgb
pdst[0] = c2;
pdst[1] = c1;
pdst[2] = c0;

// normalization
// float scale = 255.f;
// float means[3] = { 0.f, 0.f, 0.f };
// float stds[3]  = { 1.f, 1.f, 1.f };
pdst[0] = (pdst[0] / scale - means[0]) / stds[0];
pdst[1] = (pdst[1] / scale - means[1]) / stds[1];
pdst[2] = (pdst[2] / scale - means[2]) / stds[2];
```

JinRanYAO (Author) commented Apr 8, 2024

@FeiYull Thanks for your advice. The preprocess time decreased to 8 ms after merging resize, bgr2rgb, and norm into one kernel. Then I resize the image to the engine input size as soon as it is received, so yolov8-pose uses the same src_size and dst_size. Finally, I simplified the preprocess code by removing the affine matrix and interpolation to save more time. Here is my code now:

```cpp
// Fused preprocessing kernel: the frame is already resized to the network
// input size on the host, so no interpolation or affine transform is needed.
// Each thread handles one destination pixel: BGR -> RGB swap plus scaling.
__global__
void resize_rgb_padding_device_kernel(unsigned char* src, int src_width, int src_height, int src_area, int src_volume,
        float* dst, int dst_width, int dst_height, int dst_area, int dst_volume,
        int batch_size, float padding_value, float inv_scale)
{
    int dx = blockDim.x * blockIdx.x + threadIdx.x;  // pixel index within one image
    int dy = blockDim.y * blockIdx.y + threadIdx.y;  // image index within the batch

    if (dx < dst_area && dy < batch_size)
    {
        int dst_y = dx / dst_width;
        int dst_x = dx % dst_width;

        // source pixel (BGR, uchar) and destination pixel (RGB, float, still HWC)
        unsigned char* v = src + dy * src_volume + dst_y * src_width * 3 + dst_x * 3;
        float* pdst = dst + dy * dst_volume + dst_y * dst_width * 3 + dst_x * 3;

        // read BGR, write RGB, scale to [0, 1]
        pdst[0] = (v[2] + 0.5f) * inv_scale;
        pdst[1] = (v[1] + 0.5f) * inv_scale;
        pdst[2] = (v[0] + 0.5f) * inv_scale;
    }
}
```

After simplifying, the preprocess time decreased to about 6 ms, with correct inference results. Is this code all right, or can anything be improved?
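
For completeness, one plausible way to launch this kernel, assuming one thread per destination pixel on x and one image per grid row on y; the device buffer pointers and the scale value are placeholders:

```cpp
// hypothetical launch configuration for the fused kernel above
dim3 block(256, 1);
dim3 grid((dst_area + block.x - 1) / block.x, batch_size);
resize_rgb_padding_device_kernel<<<grid, block>>>(
    src_device, src_width, src_height, src_area, src_volume,
    dst_device, dst_width, dst_height, dst_area, dst_volume,
    batch_size, 114.f, 1.f / 255.f);  // inv_scale = 1/255 maps uchar to [0, 1]
```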
