ASQ: An Ultra-Low Bit Rate ASR-Oriented Speech Quantization Method
- Authors
- Ye, Lingxuan; Gao, Changfeng; Cheng, Gaofeng; Luo, Liuping; Zhao, Qingwei
- Abstract
For efficient transmission of speech signals, speech compression methodologies have attracted significant research attention for decades and are widely used in automatic speech recognition (ASR) services. However, most speech codecs are perception-oriented: they retain redundant information and introduce distortion, both of which harm ASR systems. Recently, the emergence of neural network-based models has significantly advanced both ASR systems and speech coding, laying the foundation for a speech compression method optimized specifically for ASR. In this letter, we propose an ASR-oriented Speech Quantization (ASQ) method to reduce communication costs for speech recognition systems. In the proposed method, a speech quantization model first converts the speech into low bit rate tokens. The tokens are then transmitted to the server and recognized by a quantized speech recognition model. The two models can be jointly trained in the end-to-end (E2E) style. To mitigate the performance degradation introduced by the quantization components, we design an entropy-guided 3-stage training method that encourages the model to fully utilize the token space and promotes recognition accuracy. Experimental results on the LibriSpeech corpus show that, compared to an existing non-quantized ASR model with a 256 kbps transmission bit rate, the proposed method can achieve a transmission bit rate of 0.6 kbps with no effect on word error rate (WER). It also significantly surpasses the 2-step pipeline that first compresses speech with a codec and then recognizes it, at a several-times-lower bit rate.
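The bit-rate figures in the abstract can be sanity-checked with simple arithmetic: raw PCM speech at 16 kHz with 16-bit samples costs 256 kbps, while a discrete-token stream costs (frames per second) × log2(codebook size) bits per second. The sketch below illustrates this; the 100 tokens/s and 64-entry codebook used to reach 0.6 kbps are hypothetical illustrative values, not configurations stated in the abstract.

```python
import math

def pcm_bitrate_bps(sample_rate_hz: int, bits_per_sample: int) -> int:
    """Bit rate of raw PCM audio in bits per second."""
    return sample_rate_hz * bits_per_sample

def token_bitrate_bps(frame_rate_hz: float, codebook_size: int) -> float:
    """Bit rate of a quantized token stream: one codebook index per frame,
    each costing log2(codebook_size) bits."""
    return frame_rate_hz * math.log2(codebook_size)

# Uncompressed baseline cited in the abstract: 16 kHz x 16 bits = 256 kbps.
print(pcm_bitrate_bps(16_000, 16))   # 256000

# Hypothetical configuration reaching 0.6 kbps:
# 100 tokens/s from a 64-entry codebook (6 bits per token).
print(token_bitrate_bps(100, 64))    # 600.0
```

Any combination of frame rate and codebook size whose product of rate and bits-per-token equals 600 bps would match the reported figure; the entropy-guided training described above aims to keep the effective bits per token close to this log2 upper bound by using the full token space.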
- Published
- 2024