
Regularizing cross-attention learning for end-to-end speech translation with ASR and MT attention matrices.

Authors :
Zhao, Xiaohu
Sun, Haoran
Lei, Yikun
Xiong, Deyi
Source :
Expert Systems with Applications. Aug 2024, Vol. 247.
Publication Year :
2024

Abstract

The cross-attention mechanism enables the Transformer to capture correspondences between the input and output. However, in end-to-end (E2E) speech-to-text translation (ST), the learned cross-attention weights often fail to correspond to actual alignments, since speech and text must be aligned across different modalities and languages. In this paper, we present a simple yet effective method, regularized cross-attention learning (RCAL), for end-to-end speech translation in a multitask learning (MTL) framework. RCAL leverages knowledge from auxiliary automatic speech recognition (ASR) and machine translation (MT) tasks to generate a teacher cross-attention matrix, which serves as prior alignment knowledge to guide cross-attention learning in the ST task. An additional loss function is introduced as part of the MTL framework to facilitate this process. We conducted experiments on the MuST-C benchmark dataset to evaluate the effectiveness of RCAL. The results demonstrate that the proposed approach yields significant improvements over the baseline, with an average gain of +0.8 BLEU across four translation directions in two experimental settings, outperforming state-of-the-art E2E and cascaded speech translation models. Further analysis and visualization reveal that the model with RCAL effectively learns high-quality alignment information from the auxiliary ASR and MT tasks, thereby improving ST alignment quality. Moreover, experiments with different amounts of MT and ST data provide strong evidence of the model's robustness across scenarios.

• Exploring how to mitigate the modality gap for E2E ST in the decoder.
• Transferring alignment knowledge from the ASR and MT tasks to the E2E ST task.
• Proposing two strategies to filter low-quality attention weights.
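The abstract describes the mechanism only at a high level. Below is a minimal sketch of one plausible reading, assuming the teacher matrix is formed by multiplying the MT (target-to-source) and ASR (source-to-speech) cross-attention matrices, and that the regularizer is a KL-divergence term between the ST cross-attention and this teacher. The names teacher_attention and rcal_loss are hypothetical; the paper's exact composition, loss form, and its two strategies for filtering low-quality attention weights are not specified in this record.

```python
import torch
import torch.nn.functional as F

def teacher_attention(asr_attn: torch.Tensor, mt_attn: torch.Tensor) -> torch.Tensor:
    """Compose auxiliary-task attention into a speech-to-target teacher matrix.

    asr_attn: (src_len, speech_len) row-stochastic ASR decoder cross-attention
    mt_attn:  (tgt_len, src_len)    row-stochastic MT decoder cross-attention
    Returns a (tgt_len, speech_len) row-stochastic alignment: chaining
    target-to-source and source-to-speech weights links each target token
    directly to speech frames.
    """
    teacher = mt_attn @ asr_attn
    # Renormalize rows (a safeguard if some weights were filtered beforehand).
    return teacher / teacher.sum(dim=-1, keepdim=True).clamp_min(1e-9)

def rcal_loss(st_attn: torch.Tensor, teacher: torch.Tensor) -> torch.Tensor:
    """KL divergence pulling each ST cross-attention row toward the teacher row."""
    return F.kl_div(st_attn.clamp_min(1e-9).log(), teacher.detach(),
                    reduction="batchmean")

# Toy usage: 5 target tokens, 7 source tokens, 20 speech frames.
mt = torch.softmax(torch.randn(5, 7), dim=-1)
asr = torch.softmax(torch.randn(7, 20), dim=-1)
st = torch.softmax(torch.randn(5, 20), dim=-1)
reg = rcal_loss(st, teacher_attention(asr, mt))
```

In a full MTL setup this term would simply be added to the combined objective, e.g. L = L_ST + L_ASR + L_MT + λ·reg, with the teacher detached so that gradients flow only into the ST cross-attention.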

Details

Language :
English
ISSN :
0957-4174
Volume :
247
Database :
Academic Search Index
Journal :
Expert Systems with Applications
Publication Type :
Academic Journal
Accession number :
176407649
Full Text :
https://doi.org/10.1016/j.eswa.2024.123241