End-to-End Video Text Spotting with Transformer.
- Source :
- International Journal of Computer Vision. Sep 2024, Vol. 132 Issue 9, p4019-4035. 17p.
- Publication Year :
- 2024
Abstract
- Recent video text spotting (VTS) methods usually require a three-stage pipeline, i.e., detecting text in individual images, recognizing the localized text, and tracking text streams with post-processing to generate final results. Previous methods typically follow the tracking-by-match paradigm and develop sophisticated pipelines, which is not an effective solution. In this paper, rooted in Transformer sequence modeling, we propose a simple yet effective end-to-end trainable video text DEtection, Tracking, and Recognition framework (TransDeTR), which views the VTS task as a direct long-range temporal modeling problem. TransDeTR has two main advantages: (1) Unlike the explicit match paradigm between adjacent frames, the proposed TransDeTR tracks and recognizes each text instance implicitly with a distinct query, termed a 'text query', over a long-range temporal sequence (more than 7 frames). (2) TransDeTR is the first end-to-end trainable video text spotting framework, which simultaneously addresses the three sub-tasks (i.e., text detection, tracking, and recognition). Extensive experiments on four video text datasets (e.g., ICDAR2013 Video, ICDAR2015 Video) demonstrate that TransDeTR achieves state-of-the-art performance, with up to 11.0% improvement on detection, tracking, and spotting tasks. Code can be found at: https://github.com/weijiawu/TransDETR. [ABSTRACT FROM AUTHOR]
- Subjects :
- *STREAMING media
- *VIDEOS
Details
- Language :
- English
- ISSN :
- 09205691
- Volume :
- 132
- Issue :
- 9
- Database :
- Academic Search Index
- Journal :
- International Journal of Computer Vision
- Publication Type :
- Academic Journal
- Accession number :
- 179277915
- Full Text :
- https://doi.org/10.1007/s11263-024-02063-1