The Center of Attention: Center-Keypoint Grouping via Attention for Multi-Person Pose Estimation #2

eehoeskrap · 2023-02-04T10:55:43Z

Paper : https://arxiv.org/abs/2110.05132
GitHub : https://github.com/dvl-tum/center-group

This time I'm reading the paper behind "CenterAttention", which was cited as the SOTA among bottom-up methods in Contextual Instance Decoupling for Robust Multi-Person Pose Estimation, the paper I reviewed last time. It was accepted at ICCV 2021.

Paper overview

The paper proposes CenterGroup, an attention-based framework that estimates human poses from identity-agnostic keypoints and person center predictions in an image. The approach uses a transformer to obtain context-aware embeddings for all detected keypoints and centers, then applies multi-head attention to group joints directly to their corresponding person centers. Most bottom-up methods rely on non-learnable clustering at inference time, whereas CenterGroup uses a grouping mechanism that is trained end-to-end together with the keypoint detector. As a result, the authors report inference up to 2.5x faster than competing bottom-up methods. Grouping joints starting from the person centers seems to be the main contribution.

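Existing problems

The paper argues that two-step (top-down) approaches are inefficient because they need a separate person detector, and that their performance degrades under heavy occlusion. Bottom-up methods take a different approach: they first detect identity-agnostic keypoints and then group them into separate poses. The term identity-agnostic keypoint is a bit tricky; put simply, it seems to mean keypoints that are detected without yet knowing which person they belong to. That is a typical trait of bottom-up methods! Recent work has improved these methods considerably, but the grouping step still depends on optimization algorithms, so it is not end-to-end and is also slow. In addition, the training objective generally does not match the actual inference procedure: the model can learn similarities between keypoints, but at test time the grouping is performed by a separate, non-differentiable algorithm.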

Main contribution

Performs pose estimation by grouping keypoint and person center predictions with a multi-head attention formulation that allows the model to be trained end-to-end
Uses a transformer to encode the dependencies between keypoints and centers detected in a bottom-up fashion, yielding context-enhanced embeddings that improve the proposed grouping scheme
Achieves state-of-the-art results with an end-to-end framework that is up to 2.5x faster than SOTA methods

Method

1. Keypoint and center detection
The locations of identity-agnostic keypoints and person centers are obtained by heatmap regression, following HigherHRNet. The output is a variable number of high-scoring joint and person center detections.
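
To make the "variable number of high-scoring detections" part concrete, here is a minimal sketch of pulling peaks out of a single heatmap. This is not the authors' code: the 3x3 local-maximum rule, the 0.5 threshold, and the names `extract_peaks` and `toy_heatmap` are all assumptions for illustration, and a real pipeline would run this per joint type and for the center heatmap.

```python
from typing import List, Tuple

Peak = Tuple[int, int, float]  # (row, col, score)


def extract_peaks(heatmap: List[List[float]], threshold: float = 0.5) -> List[Peak]:
    """Return every strict local maximum whose score reaches the threshold."""
    height, width = len(heatmap), len(heatmap[0])
    peaks: List[Peak] = []
    for r in range(height):
        for c in range(width):
            score = heatmap[r][c]
            if score < threshold:
                continue
            # Compare against the 3x3 neighborhood, clipped at the borders.
            neighbors = [
                heatmap[rr][cc]
                for rr in range(max(0, r - 1), min(height, r + 2))
                for cc in range(max(0, c - 1), min(width, c + 2))
                if (rr, cc) != (r, c)
            ]
            if all(score > n for n in neighbors):
                peaks.append((r, c, score))
    return peaks


# Hypothetical 4x4 heatmap containing two clear peaks.
toy_heatmap = [
    [0.0, 0.1, 0.0, 0.0],
    [0.1, 0.9, 0.1, 0.0],
    [0.0, 0.1, 0.0, 0.2],
    [0.0, 0.0, 0.2, 0.7],
]
peaks = extract_peaks(toy_heatmap)
print(peaks)  # [(1, 1, 0.9), (3, 3, 0.7)]
```

The number of returned peaks depends on the image content, which is why the later stages have to handle a variable-length set of detections.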

2. Encoding keypoints and centers
For every detected keypoint and center, features are extracted from the CNN backbone and complemented with an additional embedding that encodes spatial location. These embeddings are fed to a transformer, which produces updated embeddings enriched with context information.
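
The encoding step can be sketched roughly as follows. Everything specific here is an assumption on my part: the sinusoidal form of the location encoding, adding it to the feature (rather than concatenating), and a single self-attention pass with identity projections standing in for the transformer. The function names and toy values are hypothetical; the actual model uses learned layers.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]


def positional_encoding(x: float, y: float, dim: int) -> Vector:
    """Sinusoidal encoding of a 2D location: first half for x, second half for y."""
    assert dim % 4 == 0, "dim must be divisible by 4"
    quarter = dim // 4
    encoding: Vector = []
    for coord in (x, y):
        for i in range(quarter):
            freq = 1.0 / (10000 ** (i / quarter))
            encoding.extend([math.sin(coord * freq), math.cos(coord * freq)])
    return encoding


def build_tokens(features: Sequence[Vector], locations: Sequence[Tuple[float, float]]) -> List[Vector]:
    """Add the location encoding to each detection's backbone feature."""
    tokens = []
    for feat, (x, y) in zip(features, locations):
        pos = positional_encoding(x, y, len(feat))
        tokens.append([f + p for f, p in zip(feat, pos)])
    return tokens


def self_attention(tokens: Sequence[Vector]) -> List[Vector]:
    """One attention pass with identity projections (a real layer learns Q/K/V)."""
    dim = len(tokens[0])
    updated = []
    for query in tokens:
        scores = [sum(a * b for a, b in zip(query, key)) / math.sqrt(dim) for key in tokens]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Each output is a weighted mix of all tokens, i.e. it carries context.
        updated.append([sum(w * tok[d] for w, tok in zip(weights, tokens)) for d in range(dim)])
    return updated


# Hypothetical backbone features and (x, y) locations for three detections.
features = [[0.5] * 8, [0.1] * 8, [0.9] * 8]
locations = [(12.0, 30.0), (14.0, 31.0), (80.0, 5.0)]
tokens = build_tokens(features, locations)
context = self_attention(tokens)
print(len(context), len(context[0]))  # 3 8
```

The point of the sketch is only the data flow: one token per detection goes in, and one context-aware embedding per detection comes out.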

3. Keypoint grouping
Using the embeddings from the previous step, dot-product attention is computed between person centers and keypoints and then normalized to obtain a soft assignment. The transformer embeddings are also used to classify each center node as a true or false positive and to determine the visibility of each keypoint.
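
A minimal sketch of the grouping idea, under my own assumptions: centers act as queries, the candidate keypoints of one joint type act as keys, scores are scaled dot products, and the normalization is a softmax over the candidates. The learned projections, the multiple heads, and the center and visibility classifiers are left out, and `soft_assign` and the toy embeddings are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_assign(center_emb: List[Vector], keypoint_emb: List[Vector]) -> List[Vector]:
    """Return weights[i][j]: how strongly center i attends to keypoint j."""
    dim = len(center_emb[0])
    weights = []
    for center in center_emb:
        scores = [
            sum(a * b for a, b in zip(center, kp)) / math.sqrt(dim)
            for kp in keypoint_emb
        ]
        weights.append(softmax(scores))
    return weights


# Two hypothetical person centers and two candidate detections of one joint type.
center_emb = [[4.0, 0.0], [0.0, 4.0]]
wrist_emb = [[0.0, 3.0], [3.0, 0.0]]

weights = soft_assign(center_emb, wrist_emb)
# Hard assignment for readability: the keypoint with the largest weight per center.
assigned = [max(range(len(row)), key=row.__getitem__) for row in weights]
print(assigned)  # [1, 0]
```

Because the softmax is differentiable, a loss on these weights can send gradients back into the embeddings, which is what allows the grouping to be trained together with the detector instead of being a separate post-processing step.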

Experiments

The benchmark results on COCO and CrowdPose are as follows.

Full version of the paper review
https://eehoeskrap.tistory.com/684

eehoeskrap · 2023-02-04T11:58:47Z

The qualitative results are as follows.

eehoeskrap added HPE ICCV International Conference on Computer Vision 2021 MPPE Multi-Person Pose Estimation labels Feb 4, 2023
