A hybrid network for estimating 3D interacting hand pose from a single RGB image.
- Source :
- Signal, Image & Video Processing; Jun2024, Vol. 18 Issue 4, p3801-3814, 14p
- Publication Year :
- 2024
Abstract
- The estimation of 3D interacting hand pose from a single RGB image is a challenging problem. In two-handed interactions, the hands tend to occlude each other and are self-similar. In this study, a simple, accurate end-to-end framework called HybridPoseNet is proposed for estimating 3D interacting hand pose. The hybrid network employs an encoder-decoder architecture. More specifically, the feature encoder is a hybrid structure that combines a convolutional neural network (CNN) with a transformer to encode hand features. An ordinary CNN is employed to extract the local detailed features of a given image, and a vision transformer is used to capture the long-distance spatial interactions between the cross-positional feature vectors. Moreover, the 3D pose decoder is based on left and right network branches, which are fused via a feature enhancement module (FEM). The FEM helps reduce the ambiguity in appearance caused by the self-similarity of the hands. The decoder lifts the 2D pose to the 3D pose by estimating two depth components. The ablation experiments demonstrate the effectiveness of each module in the network. In addition, comprehensive experiments on the InterHand2.6M dataset show that the proposed method outperforms previous state-of-the-art methods for estimating interacting hand pose. [ABSTRACT FROM AUTHOR]
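The abstract states that the decoder lifts the 2D pose to 3D by estimating two depth components, but this record does not define them. The following is a minimal sketch of one common way such a lift can work, not the authors' implementation. It assumes the two components are a per-joint depth relative to a root joint and an absolute root depth, and that pinhole camera intrinsics are known. The function name `lift_to_3d` and all parameter names are hypothetical.

```python
def lift_to_3d(joints_uv, rel_depths, root_depth, focal, principal):
    """Back-project 2D joints to camera-space 3D with a pinhole model.

    Assumptions (not taken from the paper):
      joints_uv  : list of (u, v) pixel coordinates, one per joint
      rel_depths : per-joint depth relative to the root joint
      root_depth : absolute depth of the root joint
      focal      : (fx, fy) focal lengths in pixels
      principal  : (cx, cy) principal point in pixels
    Depth units are arbitrary but must be consistent.
    """
    fx, fy = focal
    cx, cy = principal
    joints_xyz = []
    for (u, v), dz in zip(joints_uv, rel_depths):
        z = root_depth + dz          # absolute depth of this joint
        x = (u - cx) * z / fx        # invert u = fx * x / z + cx
        y = (v - cy) * z / fy        # invert v = fy * y / z + cy
        joints_xyz.append((x, y, z))
    return joints_xyz


# Illustrative values only: two joints, 1000 px focal length.
example = lift_to_3d(
    joints_uv=[(320.0, 240.0), (420.0, 240.0)],
    rel_depths=[0.0, 20.0],
    root_depth=500.0,
    focal=(1000.0, 1000.0),
    principal=(320.0, 240.0),
)
```

Under these assumptions, a joint at the principal point maps to the optical axis, and horizontal offset in pixels scales with depth divided by focal length.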
Details
- Language :
- English
- ISSN :
- 18631703
- Volume :
- 18
- Issue :
- 4
- Database :
- Complementary Index
- Journal :
- Signal, Image & Video Processing
- Publication Type :
- Academic Journal
- Accession number :
- 176251116
- Full Text :
- https://doi.org/10.1007/s11760-024-03043-1