
Cepstral and acoustic ternary pattern based hybrid feature extraction approach for end-to-end Bangla speech recognition.

Authors :
Dua, Mohit
Akanksha
Dua, Shelza
Source :
Journal of Ambient Intelligence & Humanized Computing; Dec2023, Vol. 14 Issue 12, p16903-16919, 17p
Publication Year :
2023

Abstract

Over the last three decades, a great deal of work has gone into building Automatic Speech Recognition (ASR) systems for well-established languages such as English and Chinese. Research on Large Vocabulary Continuous Speech Recognition (LVCSR) systems for low-resource languages is also growing rapidly, particularly in corpus-focused areas. Hence, large corpora for low-resource languages such as Bangla need to be benchmarked so that biased results can be avoided and limitations can be handled. The proposed work uses an openly available large-scale Bangla speech corpus provided by Google. It combines image-inspired features with well-explored cepstral features to build the front-end feature extraction phase, and it integrates a Convolutional Neural Network (CNN) and a bi-directional Long Short-Term Memory (bi-LSTM) network with the Connectionist Temporal Classification (CTC) loss function to implement the back-end acoustic model. The experiments employ static and dynamic features of the Mel-frequency Cepstral Coefficients (MFCC), Constant Q Cepstral Coefficients (CQCC), and Gammatone Cepstral Coefficients (GTCC) techniques, one at a time, together with Acoustic Ternary Patterns (ATP) features. The work investigates the effect of these hybrid front-end approaches with a CNN, with a bi-LSTM, and with the integration of the two models. The novelty of the paper lies in showing that fusing ATP with cepstral features improves the performance of the proposed low-resource language ASR system: the proposed combination of ATP-dynamic CQCC features with the integrated back-end acoustic model shows a relative improvement of 10–15% in Word Error Rate (WER) over all other combinations tested. Further, to exploit the noise-robust nature of GTCC features, the ATP-dynamic GTCC features with the integrated CNN-bi-LSTM back-end model are also evaluated in a noisy scenario. [ABSTRACT FROM AUTHOR]
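
The abstract reports its headline result as a relative improvement in Word Error Rate. As a minimal sketch of how that metric is conventionally computed (this is the standard word-level edit-distance definition, not code from the paper), the helper names `word_error_rate` and `relative_improvement` and the sample values below are illustrative assumptions only:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER: (substitutions + deletions + insertions) / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # Word-level Levenshtein distance, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_improvement(baseline_wer: float, new_wer: float) -> float:
    """Fractional WER reduction relative to a baseline system."""
    return (baseline_wer - new_wer) / baseline_wer


if __name__ == "__main__":
    # Hypothetical transcripts and WER values, for illustration only.
    print(word_error_rate("a b c d", "a x c"))
    print(relative_improvement(0.20, 0.17))
```

Under this definition, a relative improvement of 10–15% means the new WER is 85–90% of the baseline WER; it is not an absolute reduction of 10–15 percentage points.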

Details

Language :
English
ISSN :
18685137
Volume :
14
Issue :
12
Database :
Complementary Index
Journal :
Journal of Ambient Intelligence & Humanized Computing
Publication Type :
Academic Journal
Accession number :
174472438
Full Text :
https://doi.org/10.1007/s12652-023-04706-6