Scalable frame resolution for efficient continuous sign language recognition.

Authors :
Hu, Lianyu
Gao, Liqing
Liu, Zekang
Feng, Wei
Source :
Pattern Recognition. Jan2024, Vol. 145, pN.PAG-N.PAG. 1p.
Publication Year :
2024

Abstract

In this paper, we explore the spatial redundancy in continuous sign language recognition (CSLR), aiming to improve its efficiency. Despite recent advances in accuracy, state-of-the-art CSLR methods typically require large amounts of computation and memory, which hinders fast inference under limited computation/memory budgets. Based on the simple observation that not all frames are equally important for CSLR, we propose AdaSize, which models the frame resolution decision as an end-to-end learnable task to save unnecessary computation. Specifically, a lightweight 2D convolutional neural network (CNN) first quickly browses the input frames at a low resolution (e.g., 112 × 112). These coarse, cheap features are sent to a recurrent policy network that dynamically determines the desired resolution for each frame. Once the optimal resolution for each frame is decided, the frames, now at different resolutions, are fed into the following backbones to extract representative features. Finally, these features pass through a sequence of temporal modules and a classifier to predict sentences. Extensive experiments on four large-scale datasets, PHOENIX14, PHOENIX14-T, CSL-Daily and CSL, demonstrate the effectiveness of AdaSize. AdaSize consistently achieves accuracy comparable to state-of-the-art CSLR methods with only 0.38 × the computation and 0.41 × the memory usage, and at 1.25 × the throughput. Comparisons with commonly used lightweight backbones and other efficient methods verify the superiority of AdaSize under similar computational/memory budgets. Finally, we plot the frame resolution decisions made by AdaSize, hoping to provide insight into the inherent spatial redundancy in videos.

• Sign language is the main communication tool of hearing-impaired people in their daily life.
• Current sign language systems incur large computational costs and high inference latency.
• A novel adaptive method that improves network efficiency by switching among various resolutions.
• Both online and offline ablations on the generalizability of the proposed method.
[ABSTRACT FROM AUTHOR]
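
To give a rough sense of why per-frame resolution choices can reduce computation, the sketch below estimates the backbone cost of a clip relative to processing every frame at full resolution. It is not the authors' method or code: the function name, the 224-pixel full resolution, and the list of per-frame decisions are all assumptions made for illustration, and it relies on the approximation that a convolutional backbone's cost grows in proportion to the number of input pixels. It also ignores the cost of the low-resolution browsing CNN and the policy network.

```python
# Illustrative only: names and numbers are assumptions, not taken from the paper.

def relative_backbone_cost(chosen_sides, full_side=224):
    """Average per-frame cost relative to running every frame at full_side.

    Assumes backbone cost is proportional to the pixel count (side * side)
    of each square input frame, which holds only approximately for CNNs.
    """
    if not chosen_sides:
        raise ValueError("chosen_sides must not be empty")
    full_pixels = full_side * full_side
    total_pixels = sum(side * side for side in chosen_sides)
    return total_pixels / (len(chosen_sides) * full_pixels)


# Made-up resolution decisions for an 8-frame clip (side length in pixels).
decisions = [224, 112, 112, 160, 224, 112, 96, 160]
ratio = relative_backbone_cost(decisions)
print(f"estimated backbone cost: {ratio:.2f}x of full resolution")
```

Under this approximation, a frame processed at 112 × 112 costs about a quarter of a frame processed at 224 × 224, so the overall saving depends on how often the policy selects the smaller sizes.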

Details

Language :
English
ISSN :
00313203
Volume :
145
Database :
Academic Search Index
Journal :
Pattern Recognition
Publication Type :
Academic Journal
Accession number :
172778063
Full Text :
https://doi.org/10.1016/j.patcog.2023.109903