
spend more than 20ms on preprocess on jetson nano, batch=1 #109

Open · JinRanYAO opened this issue Apr 7, 2024 · 7 comments

JinRanYAO commented Apr 7, 2024

Hello, thank you for your excellent work. I am trying to use your yolov8-pose code on a Jetson Nano for real-time detection, with batch=1 and image shape 640(h)x384(w). I get correct results, but inference takes 40+ ms and preprocessing takes another 20+ ms, which seems too long for preprocessing. Is anything wrong, and is there anything I can do to optimize it?

FeiYull (Owner) commented Apr 7, 2024

@JinRanYAO Is the data you're testing a picture or a video?

FeiYull (Owner) commented Apr 7, 2024

@JinRanYAO Try the following command to build an FP16 engine; it roughly doubles inference performance:

```bash
./trtexec --onnx=yolov8n-pose.onnx --saveEngine=yolov8n-pose-fp16.trt --buildOnly --minShapes=images:1x3x640x640 --optShapes=images:2x3x640x640 --maxShapes=images:4x3x640x640 --fp16
```

【FP32】:
[04/07/2024-09:15:16] [I] preprocess time = 0.841472; infer time = 5.80734; postprocess time = 0.186192
[04/07/2024-09:15:16] [I] preprocess time = 0.837504; infer time = 5.76032; postprocess time = 0.13976
[04/07/2024-09:15:16] [I] preprocess time = 0.845184; infer time = 5.75726; postprocess time = 0.209248
[04/07/2024-09:15:16] [I] preprocess time = 0.839952; infer time = 5.76222; postprocess time = 0.170016
[04/07/2024-09:15:16] [I] preprocess time = 0.844816; infer time = 5.76472; postprocess time = 0.146288
[04/07/2024-09:15:16] [I] preprocess time = 0.838784; infer time = 5.76434; postprocess time = 0.203216
[04/07/2024-09:15:16] [I] preprocess time = 0.808864; infer time = 5.5223; postprocess time = 0.150368
[04/07/2024-09:15:16] [I] preprocess time = 0.811856; infer time = 5.52139; postprocess time = 0.184
[04/07/2024-09:15:16] [I] preprocess time = 0.80856; infer time = 5.52371; postprocess time = 0.20792
[04/07/2024-09:15:16] [I] preprocess time = 0.809776; infer time = 5.51814; postprocess time = 0.168032
[04/07/2024-09:15:16] [I] preprocess time = 0.810064; infer time = 5.5215; postprocess time = 0.208496
[04/07/2024-09:15:16] [I] preprocess time = 0.811216; infer time = 5.51797; postprocess time = 0.201968
[04/07/2024-09:15:16] [I] preprocess time = 0.809136; infer time = 5.51658; postprocess time = 0.179296

【FP16】:
[04/07/2024-09:15:26] [I] preprocess time = 0.84056; infer time = 2.59362; postprocess time = 0.177744
[04/07/2024-09:15:26] [I] preprocess time = 0.84752; infer time = 2.43448; postprocess time = 0.132512
[04/07/2024-09:15:26] [I] preprocess time = 0.840256; infer time = 2.42754; postprocess time = 0.206288
[04/07/2024-09:15:26] [I] preprocess time = 0.841216; infer time = 2.43272; postprocess time = 0.160144
[04/07/2024-09:15:26] [I] preprocess time = 0.840736; infer time = 2.42774; postprocess time = 0.137648
[04/07/2024-09:15:26] [I] preprocess time = 0.841296; infer time = 2.4313; postprocess time = 0.194464
[04/07/2024-09:15:26] [I] preprocess time = 0.840992; infer time = 2.43011; postprocess time = 0.149072
[04/07/2024-09:15:26] [I] preprocess time = 0.83664; infer time = 2.43083; postprocess time = 0.184176
[04/07/2024-09:15:26] [I] preprocess time = 0.841136; infer time = 2.4283; postprocess time = 0.20736
[04/07/2024-09:15:26] [I] preprocess time = 0.844864; infer time = 2.4312; postprocess time = 0.165424
[04/07/2024-09:15:26] [I] preprocess time = 0.842; infer time = 2.42846; postprocess time = 0.207552
[04/07/2024-09:15:26] [I] preprocess time = 0.8444; infer time = 2.43054; postprocess time = 0.203488
[04/07/2024-09:15:26] [I] preprocess time = 0.84024; infer time = 2.43106; postprocess time = 0.179952

JinRanYAO (Author) commented Apr 7, 2024

@FeiYull Thank you for your quick reply!

1. My project is built on ROS, so I don't use utils::InputStream. I call yolov8.init() at the beginning, and when I receive an image I run the following code for each frame. I think this is like using utils::InputStream::IMAGE? Is this code reasonable, or can it be improved anywhere?

   ```cpp
   imgs_batch.emplace_back(frame.clone());
   yolov8.copy(imgs_batch);
   utils::DeviceTimer d_t1; yolov8.preprocess(imgs_batch);  float t1 = d_t1.getUsedTime();
   utils::DeviceTimer d_t2; yolov8.infer();                 float t2 = d_t2.getUsedTime();
   utils::DeviceTimer d_t3; yolov8.postprocess(imgs_batch); float t3 = d_t3.getUsedTime();
   float avg_times[3] = { t1, t2, t3 };
   sample::gLogInfo << "preprocess time = " << avg_times[0] << "; "
       "infer time = " << avg_times[1] << "; "
       "postprocess time = " << avg_times[2] << std::endl;
   yolov8.reset();
   imgs_batch.clear();
   ```

2. Thanks, I tried FP16, and the inference time decreased from 40 ms to 30 ms, but preprocessing still takes 20 ms. Can I use INT8 to get faster inference? (See the sketch after this list.)

3. Additionally, my raw image size is 1920x1080. Is too much time being spent on the resize?
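
For reference, a minimal sketch of an INT8 build with the same shapes, assuming the same trtexec binary as above; trtexec's --calib flag loads a pre-generated calibration cache (the cache file name here is a placeholder), and without calibration INT8 accuracy will suffer. Note also that the Jetson Nano's Maxwell GPU lacks fast INT8 support, so INT8 is unlikely to bring much speedup there:

```bash
# hypothetical INT8 build; yolov8n-pose-calib.cache must be generated separately
./trtexec --onnx=yolov8n-pose.onnx --saveEngine=yolov8n-pose-int8.trt --buildOnly \
    --minShapes=images:1x3x640x640 --optShapes=images:2x3x640x640 --maxShapes=images:4x3x640x640 \
    --int8 --calib=yolov8n-pose-calib.cache
```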

FeiYull (Owner) commented Apr 7, 2024

@JinRanYAO
I recommend stepping into YOLOv8Pose::preprocess and measuring the time spent by each internal step:

```cpp
void YOLOv8Pose::preprocess(const std::vector<cv::Mat>& imgsBatch)
```
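
For reference, a minimal instrumentation sketch, assuming preprocess internally calls resizeDevice, bgr2rgbDevice, normDevice, and an HWC-to-CHW kernel (hwc2chwDevice is an assumed name); the arguments are elided since they depend on class members:

```cpp
// hypothetical timing of each preprocessing stage, using the repo's own
// utils::DeviceTimer as in the per-frame loop above
utils::DeviceTimer t0; resizeDevice(/* ... */);   float t_resize  = t0.getUsedTime();
utils::DeviceTimer t1; bgr2rgbDevice(/* ... */);  float t_bgr2rgb = t1.getUsedTime();
utils::DeviceTimer t2; normDevice(/* ... */);     float t_norm    = t2.getUsedTime();
utils::DeviceTimer t3; hwc2chwDevice(/* ... */);  float t_hwc2chw = t3.getUsedTime();  // name assumed
sample::gLogInfo << "resize = " << t_resize << "; bgr2rgb = " << t_bgr2rgb
                 << "; norm = " << t_norm << "; hwc2chw = " << t_hwc2chw << std::endl;
```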

JinRanYAO (Author) commented

@FeiYull It seems that resize, bgr2rgb, norm, and hwc2chw each cost about the same, roughly 5 ms per step. Could I use the equivalent OpenCV functions when I receive the image, instead of running these steps here?
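
For comparison, a minimal CPU-side equivalent sketch using OpenCV's cv::dnn::blobFromImage, which performs resize, BGR-to-RGB swap, scaling, and HWC-to-CHW in one call; whether this beats the GPU path on a Nano would need measuring:

```cpp
#include <opencv2/dnn.hpp>
#include <opencv2/imgproc.hpp>

// hypothetical CPU preprocessing for one frame; output blob is NCHW float32
cv::Mat preprocess_cpu(const cv::Mat& frame)
{
    return cv::dnn::blobFromImage(
        frame,               // input BGR frame
        1.0 / 255.0,         // scale factor
        cv::Size(384, 640),  // network input size (width, height)
        cv::Scalar(),        // no mean subtraction
        true,                // swapRB: BGR -> RGB
        false);              // no center crop
}
```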

FeiYull (Owner) commented Apr 8, 2024

@JinRanYAO You can merge the following operations into one:

1. resizeDevice
2. bgr2rgbDevice
3. normDevice

Inside the CUDA kernel called by resizeDevice, modify the following:

[modify before]

```cpp
pdst[0] = c0;
pdst[1] = c1;
pdst[2] = c2;
```

[modify after]

```cpp
// bgr2rgb
pdst[0] = c2;
pdst[1] = c1;
pdst[2] = c0;

// normalization
// float scale = 255.f;
// float means[3] = { 0.f, 0.f, 0.f };
// float stds[3]  = { 1.f, 1.f, 1.f };
pdst[0] = (pdst[0] / scale - means[0]) / stds[0];
pdst[1] = (pdst[1] / scale - means[1]) / stds[1];
pdst[2] = (pdst[2] / scale - means[2]) / stds[2];
```

JinRanYAO (Author) commented Apr 8, 2024

@FeiYull Thanks for your advice. The preprocess time decreased to 8 ms after merging resize, bgr2rgb, and norm into one kernel. Then I resize the image to the engine input size as soon as it is received, so yolov8-pose uses the same src_size and dst_size. Finally, I simplified the preprocess code by removing the affine matrix and interpolation to save more time. Here is my code now:

```cpp
// Fused preprocessing kernel: the frame is already resized to the network
// input size on the host, so no interpolation or affine transform is needed.
// Each thread handles one destination pixel: BGR -> RGB swap plus scaling.
__global__
void resize_rgb_padding_device_kernel(unsigned char* src, int src_width, int src_height, int src_area, int src_volume,
        float* dst, int dst_width, int dst_height, int dst_area, int dst_volume,
        int batch_size, float padding_value, float inv_scale)
{
    int dx = blockDim.x * blockIdx.x + threadIdx.x;  // pixel index within one image
    int dy = blockDim.y * blockIdx.y + threadIdx.y;  // image index within the batch

    if (dx < dst_area && dy < batch_size)
    {
        int dst_y = dx / dst_width;
        int dst_x = dx % dst_width;

        // source pixel (BGR, uchar) and destination pixel (RGB, float, still HWC)
        unsigned char* v = src + dy * src_volume + dst_y * src_width * 3 + dst_x * 3;
        float* pdst = dst + dy * dst_volume + dst_y * dst_width * 3 + dst_x * 3;

        // read BGR, write RGB, scale to [0, 1]
        pdst[0] = (v[2] + 0.5f) * inv_scale;
        pdst[1] = (v[1] + 0.5f) * inv_scale;
        pdst[2] = (v[0] + 0.5f) * inv_scale;
    }
}
```

After simplifying, the preprocess time decreased to about 6 ms, with correct inference results. Is this code all right, or can anything be improved?
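
For completeness, one plausible way to launch this kernel, assuming one thread per destination pixel on x and one image per grid row on y; the device buffer pointers and the scale value are placeholders:

```cpp
// hypothetical launch configuration for the fused kernel above
dim3 block(256, 1);
dim3 grid((dst_area + block.x - 1) / block.x, batch_size);
resize_rgb_padding_device_kernel<<<grid, block>>>(
    src_device, src_width, src_height, src_area, src_volume,
    dst_device, dst_width, dst_height, dst_area, dst_volume,
    batch_size, 114.f, 1.f / 255.f);  // inv_scale = 1/255 maps uchar to [0, 1]
```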
