Descriptor: "Electrical Engineering and Systems Science - Audio and Speech Processing" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Electrical Engineering and Systems Science - Audio and Speech Processing"' showing total 41,389 results

251. Mitigating Unauthorized Speech Synthesis for Voice Protection

Author: Zhang, Zhisheng, Yang, Qianyi, Wang, Derui, Huang, Pengyang, Cao, Yuxin, Ye, Kai, and Hao, Jie
Subjects: Computer Science - Sound, Computer Science - Artificial Intelligence, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: In recent years, it has become possible to perfectly replicate a speaker's voice from just a few speech samples, while malicious voice exploitation (e.g., telecom fraud for illegal financial gain) has brought huge hazards to our daily lives. Therefore, it is crucial to protect publicly accessible speech data that contains sensitive information, such as personal voiceprints. Most previous defense methods have focused on spoofing speaker verification systems in timbre similarity, but the synthesized deepfake speech is still of high quality. In response to the rising hazards, we devise an effective, transferable, and robust proactive protection technology named Pivotal Objective Perturbation (POP) that applies imperceptible error-minimizing noise to original speech samples to prevent them from being effectively learned by text-to-speech (TTS) synthesis models, so that high-quality deepfake speech cannot be generated. We conduct extensive experiments on state-of-the-art (SOTA) TTS models utilizing objective and subjective metrics to comprehensively evaluate our proposed method. The experimental results demonstrate outstanding effectiveness and transferability across various models. Compared to the speech unclarity score of 21.94% from voice synthesizers trained on samples without protection, POP-protected samples significantly increase it to 127.31%. Moreover, our method shows robustness against noise reduction and data augmentation techniques, thereby greatly reducing potential hazards., Comment: Accepted to ACM CCS Workshop (LAMPS) 2024
Published: 2024

252. RDSinger: Reference-based Diffusion Network for Singing Voice Synthesis

Author: Sui, Kehan, Xiang, Jinxu, and Jin, Fang
Subjects: Computer Science - Sound, Computer Science - Artificial Intelligence, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Singing voice synthesis (SVS) aims to produce high-fidelity singing audio from music scores, requiring a detailed understanding of notes, pitch, and duration, unlike text-to-speech tasks. Although diffusion models have shown exceptional performance in various generative tasks like image and video creation, their application in SVS is hindered by time complexity and the challenge of capturing acoustic features, particularly during pitch transitions. Some networks learn from the prior distribution and use the compressed latent state as a better start in the diffusion model, but the denoising step doesn't consistently improve quality over the entire duration. We introduce RDSinger, a reference-based denoising diffusion network that generates high-quality audio for SVS tasks. Our approach is inspired by Animate Anyone, a diffusion image network that maintains intricate appearance features from reference images. RDSinger utilizes FastSpeech2 mel-spectrogram as a reference to mitigate denoising step artifacts. Additionally, existing models could be influenced by misleading information on the compressed latent state during pitch transitions. We address this issue by applying Gaussian blur on partial reference mel-spectrogram and adjusting loss weights in these regions. Extensive ablation studies demonstrate the efficiency of our method. Evaluations on OpenCpop, a Chinese singing dataset, show that RDSinger outperforms current state-of-the-art SVS methods in performance.
Published: 2024

253. A Tutorial on Clinical Speech AI Development: From Data Collection to Model Validation

Author: Ng, Si-Ioi, Xu, Lingfeng, Siegert, Ingo, Cummins, Nicholas, Benway, Nina R., Liss, Julie, and Berisha, Visar
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Artificial Intelligence, Computer Science - Sound
Abstract: There has been a surge of interest in leveraging speech as a marker of health for a wide spectrum of conditions. The underlying premise is that any neurological, mental, or physical deficits that impact speech production can be objectively assessed via automated analysis of speech. Recent advances in speech-based Artificial Intelligence (AI) models for diagnosing and tracking mental health, cognitive, and motor disorders often use supervised learning, similar to mainstream speech technologies like recognition and verification. However, clinical speech AI has distinct challenges, including the need for specific elicitation tasks, small available datasets, diverse speech representations, and uncertain diagnostic labels. As a result, application of the standard supervised learning paradigm may lead to models that perform well in controlled settings but fail to generalize in real-world clinical deployments. With translation into real-world clinical scenarios in mind, this tutorial paper provides an overview of the key components required for robust development of clinical speech AI. Specifically, this paper will cover the design of speech elicitation tasks and protocols most appropriate for different clinical conditions, collection of data and verification of hardware, development and validation of speech representations designed to measure clinical constructs of interest, development of reliable and robust clinical prediction models, and ethical and participant considerations for clinical speech AI. The goal is to provide comprehensive guidance on building models whose inputs and outputs link to the more interpretable and clinically meaningful aspects of speech, that can be interrogated and clinically validated on clinical datasets, and that adhere to ethical, privacy, and security considerations by design., Comment: 76 pages, 24 figures
Published: 2024

254. Audio Classification of Low Feature Spectrograms Utilizing Convolutional Neural Networks

Author: Elias, Noel
Subjects: Computer Science - Sound, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Modern-day audio signal classification techniques lack the ability to classify low feature audio signals in the form of spectrographic temporal frequency data representations. Additionally, currently utilized techniques rely on full diverse data sets that are often not representative of real-world distributions. This paper derives several first-of-its-kind machine learning methodologies to analyze these low feature audio spectrograms given data distributions that may have normalized, skewed, or even limited training sets. In particular, this paper proposes several novel customized convolutional architectures to extract identifying features using binary, one-class, and siamese approaches to identify the spectrographic signature of a given audio signal. Utilizing these novel convolutional architectures as well as the proposed classification methods, these experiments demonstrate state-of-the-art classification accuracy and improved efficiency over traditional audio classification methods.
Published: 2024

255. A Novel Score-CAM based Denoiser for Spectrographic Signature Extraction without Ground Truth

Author: Elias, Noel
Subjects: Computer Science - Sound, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Sonar based audio classification techniques are a growing area of research in the field of underwater acoustics. Usually, underwater noise picked up by passive sonar transducers contains all types of signals that travel through the ocean and is transformed into spectrographic images. As a result, the corresponding spectrograms intended to display the temporal-frequency data of a certain object often include the tonal regions of abundant extraneous noise that can effectively interfere with a 'contact'. So, a majority of spectrographic samples extracted from underwater audio signals are rendered unusable due to their clutter and lack the required distinguishability between different objects. With limited clean true data for supervised training, creating classification models for these audio signals is severely bottlenecked. This paper derives several new techniques to combat this problem by developing a novel Score-CAM based denoiser to extract an object's signature from noisy spectrographic data without being given any ground truth data. In particular, this paper proposes a novel generative adversarial network architecture for learning and producing spectrographic training data in similar distributions to low-feature spectrogram inputs. In addition, this paper also proposes a generalizable class activation mapping based denoiser for different distributions of acoustic data, even real-world data distributions. Utilizing these novel architectures and proposed denoising techniques, these experiments demonstrate state-of-the-art noise reduction accuracy and improved classification accuracy over current audio classification standards. As such, this approach has applications not only to audio data but also to countless data distributions used all around the world for machine learning.
Published: 2024

256. Enhancing TTS Stability in Hebrew using Discrete Semantic Units

Author: Zeldes, Ella, Tal, Or, and Adi, Yossi
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: This study introduces a refined approach to Text-to-Speech (TTS) generation that significantly enhances sampling stability across languages, with a particular focus on Hebrew. By leveraging discrete semantic units with higher phonetic correlation obtained from a self-supervised model, our method addresses the inherent instability often encountered in TTS systems, especially those dealing with non-diacriticized scripts like Hebrew. Utilizing HuBERT codes, our model generates discrete representations that are optimized for TTS tasks, thereby reducing the dependency on diacritic-based text processing. This advancement not only simplifies the language modeling process but also improves the robustness and controllability of the speech output due to the disentanglement properties of the semantic units. The inclusion of a speaker embedding in the vocoder further aids in capturing the unique vocal characteristics of the speaker, contributing to the naturalness of the synthesized speech. Our experimental results demonstrate that this approach not only maintains high performance in Hebrew but also shows adaptability to English, underscoring its effectiveness in enhancing stability in TTS systems universally. Our method, named LOTHM (Language of The Hebrew Man), outperforms existing methods in terms of stability while achieving naturalness and speaker similarity on par with previous methods, making it a compelling choice for future speech synthesis applications. Samples can be found on our page pages.cs.huji.ac.il/adiyoss-lab/LoTHM .
Published: 2024

257. Knowledge Distillation for Real-Time Classification of Early Media in Voice Communications

Author: Altwlkany, Kemal, Hadžić, Hadžem, Kurić, Amar, and Lacic, Emanuel
Subjects: Computer Science - Sound, Computer Science - Artificial Intelligence, Computer Science - Multimedia, Electrical Engineering and Systems Science - Audio and Speech Processing, I.2.0
Abstract: This paper investigates the industrial setting of real-time classification of early media exchanged during the initialization phase of voice calls. We explore the application of state-of-the-art audio tagging models and highlight some limitations when applied to the classification of early media. While most existing approaches leverage convolutional neural networks, we propose a novel approach for low-resource requirements based on gradient-boosted trees. Our approach not only demonstrates a substantial improvement in runtime performance, but also exhibits a comparable accuracy. We show that leveraging knowledge distillation and class aggregation techniques to train a simpler and smaller model accelerates the classification of early media in voice calls. We provide a detailed analysis of the results on a proprietary and publicly available dataset, regarding accuracy and runtime performance. We additionally report a case study of the achieved performance improvements at a regional data center in India.
Published: 2024

258. Simultaneous Diarization and Separation of Meetings through the Integration of Statistical Mixture Models

Author: Cord-Landwehr, Tobias, Boeddeker, Christoph, and Haeb-Umbach, Reinhold
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: We propose an approach for simultaneous diarization and separation of meeting data. It consists of a complex Angular Central Gaussian Mixture Model (cACGMM) for speech source separation, and a von-Mises-Fisher Mixture Model (VMFMM) for diarization in a joint statistical framework. Through the integration, both spatial and spectral information are exploited for diarization and separation. We also develop a method for counting the number of active speakers in a segment of a meeting to support block-wise processing. While the total number of speakers in a meeting may be known, it is usually not known on a per-segment level. With the proposed speaker counting, joint diarization and source separation can be done segment-by-segment, and the permutation problem across segments is solved, thus allowing for block-online processing in the future. Experimental results on the LibriCSS meeting corpus show that the integrated approach outperforms a cascaded approach of diarization and speech enhancement in terms of WER, both on a per-segment and on a per-meeting level., Comment: Submitted to ICASSP2025
Published: 2024

259. OmniSep: Unified Omni-Modality Sound Separation with Query-Mixup

Author: Cheng, Xize, Zheng, Siqi, Wang, Zehan, Fang, Minghui, Zhang, Ziang, Huang, Rongjie, Ma, Ziyang, Ji, Shengpeng, Zuo, Jialong, Jin, Tao, and Zhao, Zhou
Subjects: Computer Science - Sound, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Multimedia, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Scaling up has brought tremendous success in the fields of vision and language in recent years. When it comes to audio, however, researchers encounter a major challenge in scaling up the training data, as most natural audio contains diverse interfering signals. To address this limitation, we introduce Omni-modal Sound Separation (OmniSep), a novel framework capable of isolating clean soundtracks based on omni-modal queries, encompassing both single-modal and multi-modal composed queries. Specifically, we introduce the Query-Mixup strategy, which blends query features from different modalities during training. This enables OmniSep to optimize multiple modalities concurrently, effectively bringing all modalities under a unified framework for sound separation. We further enhance this flexibility by allowing queries to influence sound separation positively or negatively, facilitating the retention or removal of specific sounds as desired. Finally, OmniSep employs a retrieval-augmented approach known as Query-Aug, which enables open-vocabulary sound separation. Experimental evaluations on MUSIC, VGGSOUND-CLEAN+, and MUSIC-CLEAN+ datasets demonstrate the effectiveness of OmniSep, achieving state-of-the-art performance in text-, image-, and audio-queried sound separation tasks. For samples and further information, please visit the demo page at https://omnisep.github.io/., Comment: Work in progress
Published: 2024

260. ST-ITO: Controlling Audio Effects for Style Transfer with Inference-Time Optimization

Author: Steinmetz, Christian J., Singh, Shubhr, Comunità, Marco, Ibnyahya, Ilias, Yuan, Shanxin, Benetos, Emmanouil, and Reiss, Joshua D.
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Audio production style transfer is the task of processing an input to impart stylistic elements from a reference recording. Existing approaches often train a neural network to estimate control parameters for a set of audio effects. However, these approaches are limited in that they can only control a fixed set of effects, where the effects must be differentiable or otherwise employ specialized training techniques. In this work, we introduce ST-ITO, Style Transfer with Inference-Time Optimization, an approach that instead searches the parameter space of an audio effect chain at inference. This method enables control of arbitrary audio effect chains, including unseen and non-differentiable effects. Our approach employs a learned metric of audio production style, which we train through a simple and scalable self-supervised pretraining strategy, along with a gradient-free optimizer. Due to the limited existing evaluation methods for audio production style transfer, we introduce a multi-part benchmark to evaluate audio production style metrics and style transfer systems. This evaluation demonstrates that our audio representation better captures attributes related to audio production and enables expressive style transfer via control of arbitrary audio effects., Comment: Accepted to ISMIR 2024. Code available https://github.com/csteinmetz1/st-ito
Published: 2024

261. Multilingual Standalone Trustworthy Voice-Based Social Network for Disaster Situations

Author: Behravan, Majid, Mohammadrezaei, Elham, Azab, Mohamed, and Gracanin, Denis
Subjects: Computer Science - Human-Computer Interaction, Computer Science - Artificial Intelligence, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing, I.2.7, K.4.4
Abstract: In disaster scenarios, effective communication is crucial, yet language barriers often hinder timely and accurate information dissemination, exacerbating vulnerabilities and complicating response efforts. This paper presents a novel, multilingual, voice-based social network specifically designed to address these challenges. The proposed system integrates advanced artificial intelligence (AI) with blockchain technology to enable secure, asynchronous voice communication across multiple languages. The application operates independently of external servers, ensuring reliability even in compromised environments by functioning offline through local networks. Key features include AI-driven real-time translation of voice messages, ensuring seamless cross-linguistic communication, and blockchain-enabled storage for secure, immutable records of all interactions, safeguarding message integrity. Designed for cross-platform use, the system offers consistent performance across devices, from mobile phones to desktops, making it highly adaptable in diverse disaster situations. Evaluation metrics demonstrate high accuracy in speech recognition and translation, low latency, and user satisfaction, validating the system's effectiveness in enhancing communication during crises. This solution represents a significant advancement in disaster communication, bridging language gaps to support more inclusive and efficient emergency response., Comment: Accepted for publication in IEEE UEMCON 2024, to appear in December 2024. 7 pages, 3 figures
Published: 2024

262. Meta-Learning Approaches for Improving Detection of Unseen Speech Deepfakes

Author: Kukanov, Ivan, Laakkonen, Janne, Kinnunen, Tomi, and Hautamäki, Ville
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Artificial Intelligence, Computer Science - Sound
Abstract: Current speech deepfake detection approaches perform satisfactorily against known adversaries; however, generalization to unseen attacks remains an open challenge. The proliferation of speech deepfakes on social media underscores the need for systems that can generalize to unseen attacks not observed during training. We address this problem from the perspective of meta-learning, aiming to learn attack-invariant features to adapt to unseen attacks with very few samples available. This approach is promising since generating a large-scale training dataset is often expensive or infeasible. Our experiments demonstrated an improvement in the Equal Error Rate (EER) from 21.67% to 10.42% on the InTheWild dataset, using just 96 samples from the unseen dataset. Continuous few-shot adaptation ensures that the system remains up-to-date., Comment: 6 pages, accepted to the IEEE Spoken Language Technology Workshop (SLT) 2024
Published: 2024
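
Note on the metric: the Equal Error Rate quoted in the abstract above is the operating point at which the false-acceptance and false-rejection rates coincide. The following is only a minimal sketch of how such a value can be estimated from detector scores; the function name, the threshold sweep, and the toy scores are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def equal_error_rate(scores, labels):
    """Estimate the EER from scores (higher = more likely the positive class).

    Hypothetical helper for illustration: sweeps every observed score as a
    threshold and returns the mean of FAR and FRR where they are closest.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    best_gap, eer = np.inf, 1.0
    for t in np.unique(scores):
        accept = scores >= t
        far = np.mean(accept[~labels])   # negatives wrongly accepted
        frr = np.mean(~accept[labels])   # positives wrongly rejected
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2.0
    return eer

# Toy data (made up): one positive scores below one negative.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
toy_labels = [1, 1, 1, 0, 1, 0, 0, 0]
print(equal_error_rate(toy_scores, toy_labels))
```

Published evaluations usually interpolate along the ROC curve rather than sweeping raw thresholds, so values from this sketch may differ slightly on small sets.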

263. Using Confidence Scores to Improve Eyes-free Detection of Speech Recognition Errors

Author: Nowrin, Sadia and Vertanen, Keith
Subjects: Computer Science - Human-Computer Interaction, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Conversational systems rely heavily on speech recognition to interpret and respond to user commands and queries. Nevertheless, recognition errors may occur, which can significantly affect the performance of such systems. While visual feedback can help detect errors, it may not always be practical, especially for people who are blind or low-vision. In this study, we investigate ways to improve error detection by manipulating the audio output of the transcribed text based on the recognizer's confidence level in its result. Our findings show that selectively slowing down the audio when the recognizer exhibited uncertainty led to a relative increase of 12% in participants' error detection ability compared to uniformly slowing down the audio.
Published: 2024

264. Automatic Estimation of Singing Voice Musical Dynamics

Author: Narang, Jyoti, Tamer, Nazif Can, De La Vega, Viviana, and Serra, Xavier
Subjects: Computer Science - Sound, Computer Science - Information Retrieval, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Musical dynamics form a core part of expressive singing voice performances. However, automatic analysis of musical dynamics for singing voice has received limited attention partly due to the scarcity of suitable datasets and a lack of clear evaluation frameworks. To address this challenge, we propose a methodology for dataset curation. Employing the proposed methodology, we compile a dataset comprising 509 musical dynamics annotated singing voice performances, aligned with 163 score files, leveraging state-of-the-art source separation and alignment techniques. The scores are sourced from the OpenScore Lieder corpus of romantic-era compositions, widely known for its wealth of expressive annotations. Utilizing the curated dataset, we train a multi-head attention based CNN model with varying window sizes to evaluate the effectiveness of estimating musical dynamics. We explored two distinct perceptually motivated input representations for the model training: log-Mel spectrum and bark-scale based features. For testing, we manually curate another dataset of 25 musical dynamics annotated performances in collaboration with a professional vocalist. We conclude through our experiments that bark-scale based features outperform log-Mel-features for the task of singing voice dynamics prediction. The dataset along with the code is shared publicly for further research on the topic., Comment: To be published in ISMIR 2024, 6 pages
Published: 2024
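
Note on the representations: the abstract above compares log-Mel and bark-scale features, both of which warp the frequency axis to approximate auditory perception. As a rough illustration only, the sketch below uses two widely cited conversion formulas (the HTK-style mel formula and Traunmüller's Bark approximation); the abstract does not say which variants the authors used, so these choices and the function names are assumptions.

```python
import math

def hz_to_mel(f_hz):
    """HTK-style mel scale: 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def hz_to_bark(f_hz):
    """Traunmueller's Bark approximation: 26.81 * f / (1960 + f) - 0.53."""
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53

# Compare how the two scales place a few frequencies.
for f in (100.0, 1000.0, 4000.0, 8000.0):
    print(f, round(hz_to_mel(f), 1), round(hz_to_bark(f), 2))
```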

265. MidiTok Visualizer: a tool for visualization and analysis of tokenized MIDI symbolic music

Author: Wiszenko, Michał, Stefański, Kacper, Malesa, Piotr, Pokorzyński, Łukasz, and Modrzejewski, Mateusz
Subjects: Computer Science - Sound, Computer Science - Artificial Intelligence, Computer Science - Multimedia, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Symbolic music research plays a crucial role in music-related machine learning, but MIDI data can be complex for those without musical expertise. To address this issue, we present MidiTok Visualizer, a web application designed to facilitate the exploration and visualization of various MIDI tokenization methods from the MidiTok Python package. MidiTok Visualizer offers numerous customizable parameters, enabling users to upload MIDI files to visualize tokenized data alongside an interactive piano roll., Comment: in Extended Abstracts for the Late-Breaking Demo Session of the 25th Int. Society for Music Information Retrieval Conf., San Francisco, United States, 2024
Published: 2024

266. Symbotunes: unified hub for symbolic music generative models

Author: Skierś, Paweł, Łazarski, Maksymilian, Kopeć, Michał, and Modrzejewski, Mateusz
Subjects: Computer Science - Sound, Computer Science - Artificial Intelligence, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Implementations of popular symbolic music generative models often differ significantly in terms of the libraries utilized and overall project structure. Therefore, directly comparing the methods or becoming acquainted with them may present challenges. To mitigate this issue we introduce Symbotunes, an open-source unified hub for symbolic music generative models. Symbotunes contains modern Python implementations of well-known methods for symbolic music generation, as well as a unified pipeline for generating and training.
Published: 2024

267. MusicFlow: Cascaded Flow Matching for Text Guided Music Generation

Author: Prajwal, K R, Shi, Bowen, Lee, Matthew, Vyas, Apoorv, Tjandra, Andros, Luthra, Mahi, Guo, Baishan, Wang, Huiyu, Afouras, Triantafyllos, Kant, David, and Hsu, Wei-Ning
Subjects: Computer Science - Sound, Computer Science - Artificial Intelligence, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: We introduce MusicFlow, a cascaded text-to-music generation model based on flow matching. Based on self-supervised representations to bridge between text descriptions and music audios, we construct two flow matching networks to model the conditional distribution of semantic and acoustic features. Additionally, we leverage masked prediction as the training objective, enabling the model to generalize to other tasks such as music infilling and continuation in a zero-shot manner. Experiments on MusicCaps reveal that the music generated by MusicFlow exhibits superior quality and text coherence despite being over 2 to 5 times smaller and requiring 5 times fewer iterative steps. Simultaneously, the model can perform other music generation tasks and achieves competitive performance in music infilling and continuation. Our code and model will be publicly available., Comment: ICML 2024
Published: 2024

268. Conditional GAN for Enhancing Diffusion Models in Efficient and Authentic Global Gesture Generation from Audios

Author: Cheng, Yongkang, Liang, Mingjiang, Huang, Shaoli, Han, Gaoge, Ning, Jifeng, and Liu, Wei
Subjects: Computer Science - Sound, Computer Science - Artificial Intelligence, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Graphics, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Audio-driven simultaneous gesture generation is vital for human-computer communication, AI games, and film production. While previous research has shown promise, there are still limitations. Methods based on VAEs are accompanied by issues of local jitter and global instability, whereas methods based on diffusion models are hampered by low generation efficiency. This is because the denoising process of DDPM in the latter relies on the assumption that the noise added at each step is sampled from a unimodal distribution, and the noise values are small. DDIM borrows the idea from the Euler method for solving differential equations, disrupts the Markov chain process, and increases the noise step size to reduce the number of denoising steps, thereby accelerating generation. However, simply increasing the step size during the step-by-step denoising process causes the results to gradually deviate from the original data distribution, leading to a significant drop in the quality of the generated actions and the emergence of unnatural artifacts. In this paper, we break the assumptions of DDPM and achieve breakthrough progress in denoising speed and fidelity. Specifically, we introduce a conditional GAN to capture audio control signals and implicitly match the multimodal denoising distribution between the diffusion and denoising steps within the same sampling step, aiming to sample larger noise values and apply fewer denoising steps for high-speed generation., Comment: Accepted by WACV 2025 (Round 1)
Published: 2024

269. An approach to hummed-tune and song sequences matching

Author: Pham, Loc Bao, Luong, Huong Hoang, Tran, Phu Thien, Ngo, Phuc Hoang, Nguyen, Vi Hoang, and Nguyen, Thinh
Subjects: Computer Science - Sound, Computer Science - Artificial Intelligence, Computer Science - Information Retrieval, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: A melody stuck in your head, also known as an "earworm", is tough to get rid of unless you listen to it again or sing it out loud. But what if you cannot find the name of that song? It must be an intolerable feeling. Recognizing a song name based on a humming sound is not an easy task for a human being and should be done by machines. However, there is no research paper published about hummed-tune recognition. This work is adapted from the Hum2Song Zalo AI Challenge 2021, a competition about querying the name of a song from a user's hummed tune, which is similar to Google's Hum to Search. This paper covers details about the pre-processing of data from the original type (mp3) to a usable form for training and inference. In training an embedding model for the feature extraction phase, we ran experiments with some state-of-the-art architectures, such as ResNet, VGG, AlexNet, and MobileNetV2. For the inference phase, we use the Faiss module to effectively search for a song that matches the sequence of humming sound. The result comes at nearly 94% in the MRR@10 metric on the public test set, along with the top 1 result on the public leaderboard.
Published: 2024
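
Note on the metric: MRR@10, cited in the abstract above, averages the reciprocal rank of the correct song within each query's top 10 retrieved candidates, counting a query as 0 when the song is absent from that list. The sketch below is an illustrative assumption of how this could be computed; the function name and the toy song identifiers are hypothetical and not drawn from the challenge data.

```python
def mrr_at_k(ranked_lists, targets, k=10):
    """Mean reciprocal rank, truncated at the top-k results per query.

    ranked_lists: one ranked list of candidate ids per query (best first).
    targets: the correct id for each query.
    """
    total = 0.0
    for ranked, target in zip(ranked_lists, targets):
        for rank, item in enumerate(ranked[:k], start=1):
            if item == target:
                total += 1.0 / rank   # first hit decides the score
                break
    return total / len(targets)

# Toy retrieval results (made up): hit at rank 1, hit at rank 2, miss.
results = [["song_a", "song_b", "song_c"],
           ["song_x", "song_y", "song_z"],
           ["song_p", "song_q"]]
truth = ["song_a", "song_y", "song_missing"]
print(mrr_at_k(results, truth))  # (1 + 1/2 + 0) / 3 = 0.5
```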

270. Get Large Language Models Ready to Speak: A Late-fusion Approach for Speech Generation

Author: Shen, Maohao, Zhang, Shun, Wu, Jilong, Xiu, Zhiping, AlBadawy, Ehab, Lu, Yiting, Seltzer, Mike, and He, Qing
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Large language models (LLMs) have revolutionized natural language processing (NLP) with impressive performance across various text-based tasks. However, the extension of text-dominant LLMs to speech generation tasks remains under-explored. In this work, we introduce a text-to-speech (TTS) system powered by a fine-tuned Llama model, named TTS-Llama, that achieves state-of-the-art speech synthesis performance. Building on TTS-Llama, we further propose MoLE-Llama, a text-and-speech multimodal LLM developed through purely late-fusion parameter-efficient fine-tuning (PEFT) and a mixture-of-experts architecture. Extensive empirical results demonstrate MoLE-Llama's competitive performance on both text-only question-answering (QA) and TTS tasks, mitigating the catastrophic forgetting issue in either modality. Finally, we further explore MoLE-Llama in text-in-speech-out QA tasks, demonstrating its great potential as a multimodal dialog system capable of speech generation.
Published: 2024

271. Improving Speech-based Emotion Recognition with Contextual Utterance Analysis and LLMs

Author: Zhang, Enshi and Poellabauer, Christian
Subjects: Computer Science - Computation and Language, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Speech Emotion Recognition (SER) focuses on identifying emotional states from spoken language. The 2024 IEEE SLT-GenSEC Challenge on Post Automatic Speech Recognition (ASR) Emotion Recognition tasks participants to explore the capabilities of large language models (LLMs) for emotion recognition using only text data. We propose a novel approach that first refines all available transcriptions to ensure data reliability. We then segment each complete conversation into smaller dialogues and use these dialogues as context to predict the emotion of the target utterance within the dialogue. Finally, we investigated different context lengths and prompting techniques to improve prediction accuracy. Our best submission exceeded the baseline by 20% in unweighted accuracy, achieving the best performance in the challenge. All our experiments' codes, prediction results, and log files are publicly available.
Published: 2024

272. Enhancing Lie Detection Accuracy: A Comparative Study of Classic ML, CNN, and GCN Models using Audio-Visual Features

Author: Abdelwahab, Abdelrahman, Vishnubhatla, Akshaj, Vaswani, Ayaan, Bharathulwar, Advait, and Kommaraju, Arnav
Subjects: Computer Science - Multimedia, Computer Science - Artificial Intelligence, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Inaccuracies in polygraph tests often lead to wrongful convictions, false information, and bias, all of which have significant consequences for both legal and political systems. Recently, analyzing facial micro-expressions has emerged as a method for detecting deception; however, current models have not reached high accuracy and generalizability. The purpose of this study is to aid in remedying these problems. The unique multimodal transformer architecture used in this study improves upon previous approaches by using auditory inputs, visual facial micro-expressions, and manually transcribed gesture annotations, moving closer to a reliable non-invasive lie detection model. Visual and auditory features were extracted using the Vision Transformer and OpenSmile models respectively, which were then concatenated with the transcriptions of participants' micro-expressions and gestures. Various models were trained for the classification of lies and truths using these processed and concatenated features. The CNN Conv1D multimodal model achieved an average accuracy of 95.4%. However, further research is still required to create higher-quality datasets and even more generalized models for more diverse applications., Comment: 11 pages, 18 figures
Published: 2024

273. Analyzing long-term rhythm variations in Mising and Assamese using frequency domain correlates

Author: Gogoi, Parismita, Sarmah, Priyankoo, and Prasanna, S. R. M.
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: The current work explores long-term speech rhythm variations to classify Mising and Assamese, two low-resourced languages from Assam, Northeast India. We study the temporal information of speech rhythm embedded in low-frequency (LF) spectrograms derived from amplitude (AM) and frequency modulation (FM) envelopes. This quantitative frequency domain analysis of rhythm is supported by the idea of rhythm formant analysis (RFA), originally proposed by Gibbon [1]. We carry out the investigation by extracting features derived from trajectories of the first six rhythm formants along with two-dimensional discrete cosine transform-based characterizations of the AM and FM LF spectrograms. The derived features are fed as input to a machine learning tool to contrast the rhythms of Assamese and Mising. In this way, an improved methodology for empirically investigating rhythm variation structure without prior annotation of the larger unit of the speech signal is illustrated for two low-resourced languages of Northeast India., Comment: Submitted to International Journal of Asian Language Processing (IJALP)
Published: 2024

274. emg2qwerty: A Large Dataset with Baselines for Touch Typing using Surface Electromyography

Author: Sivakumar, Viswanath, Seely, Jeffrey, Du, Alan, Bittner, Sean R, Berenzweig, Adam, Bolarinwa, Anuoluwapo, Gramfort, Alexandre, and Mandel, Michael I
Subjects: Computer Science - Machine Learning, Computer Science - Human-Computer Interaction, Electrical Engineering and Systems Science - Audio and Speech Processing, I.2.1, I.2.7, H.5.2, H.1.2
Abstract: Surface electromyography (sEMG) non-invasively measures signals generated by muscle activity with sufficient sensitivity to detect individual spinal neurons and richness to identify dozens of gestures and their nuances. Wearable wrist-based sEMG sensors have the potential to offer low friction, subtle, information rich, always available human-computer inputs. To this end, we introduce emg2qwerty, a large-scale dataset of non-invasive electromyographic signals recorded at the wrists while touch typing on a QWERTY keyboard, together with ground-truth annotations and reproducible baselines. With 1,135 sessions spanning 108 users and 346 hours of recording, this is the largest such public dataset to date. These data demonstrate non-trivial, but well defined hierarchical relationships both in terms of the generative process, from neurons to muscles and muscle combinations, as well as in terms of domain shift across users and user sessions. Applying standard modeling techniques from the closely related field of Automatic Speech Recognition (ASR), we show strong baseline performance on predicting key-presses using sEMG signals alone. We believe the richness of this task and dataset will facilitate progress in several problems of interest to both the machine learning and neuroscientific communities. Dataset and code can be accessed at https://github.com/facebookresearch/emg2qwerty., Comment: Submitted to NeurIPS 2024 Datasets and Benchmarks Track
Published: 2024

275. Personality Analysis from Online Short Video Platforms with Multi-domain Adaptation

Author: An, Sixu, Sun, Xiangguo, Li, Yicong, Yang, Yu, and Xu, Guandong
Subjects: Computer Science - Multimedia, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computers and Society, Computer Science - Machine Learning, Computer Science - Social and Information Networks, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Personality analysis from online short videos has gained prominence due to its applications in personalized recommendation systems, sentiment analysis, and human-computer interaction. Traditional assessment methods, such as questionnaires based on the Big Five Personality Framework, are limited by self-report biases and are impractical for large-scale or real-time analysis. Leveraging the rich, multi-modal data present in short videos offers a promising alternative for more accurate personality inference. However, integrating these diverse and asynchronous modalities poses significant challenges, particularly in aligning time-varying data and ensuring models generalize well to new domains with limited labeled data. In this paper, we propose a novel multi-modal personality analysis framework that addresses these challenges by synchronizing and integrating features from multiple modalities and enhancing model generalization through domain adaptation. We introduce a timestamp-based modality alignment mechanism that synchronizes data based on spoken word timestamps, ensuring accurate correspondence across modalities and facilitating effective feature integration. To capture temporal dependencies and inter-modal interactions, we employ Bidirectional Long Short-Term Memory networks and self-attention mechanisms, allowing the model to focus on the most informative features for personality prediction. Furthermore, we develop a gradient-based domain adaptation method that transfers knowledge from multiple source domains to improve performance in target domains with scarce labeled data. Extensive experiments on real-world datasets demonstrate that our framework significantly outperforms existing methods in personality prediction tasks, highlighting its effectiveness in capturing complex behavioral cues and robustness in adapting to new domains.
Published: 2024

276. Do Discrete Self-Supervised Representations of Speech Capture Tone Distinctions?

Author: Osakuade, Opeyemi and King, Simon
Subjects: Computer Science - Computation and Language, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Discrete representations of speech, obtained from Self-Supervised Learning (SSL) foundation models, are widely used, especially where there are limited data for the downstream task, such as for a low-resource language. Typically, discretization of speech into a sequence of symbols is achieved by unsupervised clustering of the latents from an SSL model. Our study evaluates whether discrete symbols - found using k-means - adequately capture tone in two example languages, Mandarin and Yoruba. We compare latent vectors with discrete symbols, obtained from HuBERT base, MandarinHuBERT, or XLS-R, for vowel and tone classification. We find that using discrete symbols leads to a substantial loss of tone information, even for language-specialised SSL models. We suggest that discretization needs to be task-aware, particularly for tone-dependent downstream tasks., Comment: Submitted to ICASSP 2025
Published: 2024

277. GPT-4o System Card

Author: OpenAI, Hurst, Aaron, Lerer, Adam, Goucher, Adam P., Perelman, Adam, Ramesh, Aditya, Clark, Aidan, Ostrow, AJ, Welihinda, Akila, Hayes, Alan, Radford, Alec, Mądry, Aleksander, Baker-Whitcomb, Alex, Beutel, Alex, Borzunov, Alex, Carney, Alex, Chow, Alex, Kirillov, Alex, Nichol, Alex, Paino, Alex, Renzin, Alex, Passos, Alex Tachard, Kirillov, Alexander, Christakis, Alexi, Conneau, Alexis, Kamali, Ali, Jabri, Allan, Moyer, Allison, Tam, Allison, Crookes, Amadou, Tootoochian, Amin, Tootoonchian, Amin, Kumar, Ananya, Vallone, Andrea, Karpathy, Andrej, Braunstein, Andrew, Cann, Andrew, Codispoti, Andrew, Galu, Andrew, Kondrich, Andrew, Tulloch, Andrew, Mishchenko, Andrey, Baek, Angela, Jiang, Angela, Pelisse, Antoine, Woodford, Antonia, Gosalia, Anuj, Dhar, Arka, Pantuliano, Ashley, Nayak, Avi, Oliver, Avital, Zoph, Barret, Ghorbani, Behrooz, Leimberger, Ben, Rossen, Ben, Sokolowsky, Ben, Wang, Ben, Zweig, Benjamin, Hoover, Beth, Samic, Blake, McGrew, Bob, Spero, Bobby, Giertler, Bogo, Cheng, Bowen, Lightcap, Brad, Walkin, Brandon, Quinn, Brendan, Guarraci, Brian, Hsu, Brian, Kellogg, Bright, Eastman, Brydon, Lugaresi, Camillo, Wainwright, Carroll, Bassin, Cary, Hudson, Cary, Chu, Casey, Nelson, Chad, Li, Chak, Shern, Chan Jun, Conger, Channing, Barette, Charlotte, Voss, Chelsea, Ding, Chen, Lu, Cheng, Zhang, Chong, Beaumont, Chris, Hallacy, Chris, Koch, Chris, Gibson, Christian, Kim, Christina, Choi, Christine, McLeavey, Christine, Hesse, Christopher, Fischer, Claudia, Winter, Clemens, Czarnecki, Coley, Jarvis, Colin, Wei, Colin, Koumouzelis, Constantin, Sherburn, Dane, Kappler, Daniel, Levin, Daniel, Levy, Daniel, Carr, David, Farhi, David, Mely, David, Robinson, David, Sasaki, David, Jin, Denny, Valladares, Dev, Tsipras, Dimitris, Li, Doug, Nguyen, Duc Phong, Findlay, Duncan, Oiwoh, Edede, Wong, Edmund, Asdar, Ehsan, Proehl, Elizabeth, Yang, Elizabeth, Antonow, Eric, Kramer, Eric, Peterson, Eric, Sigler, Eric, Wallace, Eric, Brevdo, Eugene, Mays, Evan, 
Khorasani, Farzad, Such, Felipe Petroski, Raso, Filippo, Zhang, Francis, von Lohmann, Fred, Sulit, Freddie, Goh, Gabriel, Oden, Gene, Salmon, Geoff, Starace, Giulio, Brockman, Greg, Salman, Hadi, Bao, Haiming, Hu, Haitang, Wong, Hannah, Wang, Haoyu, Schmidt, Heather, Whitney, Heather, Jun, Heewoo, Kirchner, Hendrik, Pinto, Henrique Ponde de Oliveira, Ren, Hongyu, Chang, Huiwen, Chung, Hyung Won, Kivlichan, Ian, O'Connell, Ian, Osband, Ian, Silber, Ian, Sohl, Ian, Okuyucu, Ibrahim, Lan, Ikai, Kostrikov, Ilya, Sutskever, Ilya, Kanitscheider, Ingmar, Gulrajani, Ishaan, Coxon, Jacob, Menick, Jacob, Pachocki, Jakub, Aung, James, Betker, James, Crooks, James, Lennon, James, Kiros, Jamie, Leike, Jan, Park, Jane, Kwon, Jason, Phang, Jason, Teplitz, Jason, Wei, Jason, Wolfe, Jason, Chen, Jay, Harris, Jeff, Varavva, Jenia, Lee, Jessica Gan, Shieh, Jessica, Lin, Ji, Yu, Jiahui, Weng, Jiayi, Tang, Jie, Yu, Jieqi, Jang, Joanne, Candela, Joaquin Quinonero, Beutler, Joe, Landers, Joe, Parish, Joel, Heidecke, Johannes, Schulman, John, Lachman, Jonathan, McKay, Jonathan, Uesato, Jonathan, Ward, Jonathan, Kim, Jong Wook, Huizinga, Joost, Sitkin, Jordan, Kraaijeveld, Jos, Gross, Josh, Kaplan, Josh, Snyder, Josh, Achiam, Joshua, Jiao, Joy, Lee, Joyce, Zhuang, Juntang, Harriman, Justyn, Fricke, Kai, Hayashi, Kai, Singhal, Karan, Shi, Katy, Karthik, Kavin, Wood, Kayla, Rimbach, Kendra, Hsu, Kenny, Nguyen, Kenny, Gu-Lemberg, Keren, Button, Kevin, Liu, Kevin, Howe, Kiel, Muthukumar, Krithika, Luther, Kyle, Ahmad, Lama, Kai, Larry, Itow, Lauren, Workman, Lauren, Pathak, Leher, Chen, Leo, Jing, Li, Guy, Lia, Fedus, Liam, Zhou, Liang, Mamitsuka, Lien, Weng, Lilian, McCallum, Lindsay, Held, Lindsey, Ouyang, Long, Feuvrier, Louis, Zhang, Lu, Kondraciuk, Lukas, Kaiser, Lukasz, Hewitt, Luke, Metz, Luke, Doshi, Lyric, Aflak, Mada, Simens, Maddie, Boyd, Madelaine, Thompson, Madeleine, Dukhan, Marat, Chen, Mark, Gray, Mark, Hudnall, Mark, Zhang, Marvin, Aljubeh, Marwan, Litwin, Mateusz, Zeng, 
Matthew, Johnson, Max, Shetty, Maya, Gupta, Mayank, Shah, Meghan, Yatbaz, Mehmet, Yang, Meng Jia, Zhong, Mengchao, Glaese, Mia, Chen, Mianna, Janner, Michael, Lampe, Michael, Petrov, Michael, Wu, Michael, Wang, Michele, Fradin, Michelle, Pokrass, Michelle, Castro, Miguel, de Castro, Miguel Oom Temudo, Pavlov, Mikhail, Brundage, Miles, Wang, Miles, Khan, Minal, Murati, Mira, Bavarian, Mo, Lin, Molly, Yesildal, Murat, Soto, Nacho, Gimelshein, Natalia, Cone, Natalie, Staudacher, Natalie, Summers, Natalie, LaFontaine, Natan, Chowdhury, Neil, Ryder, Nick, Stathas, Nick, Turley, Nick, Tezak, Nik, Felix, Niko, Kudige, Nithanth, Keskar, Nitish, Deutsch, Noah, Bundick, Noel, Puckett, Nora, Nachum, Ofir, Okelola, Ola, Boiko, Oleg, Murk, Oleg, Jaffe, Oliver, Watkins, Olivia, Godement, Olivier, Campbell-Moore, Owen, Chao, Patrick, McMillan, Paul, Belov, Pavel, Su, Peng, Bak, Peter, Bakkum, Peter, Deng, Peter, Dolan, Peter, Hoeschele, Peter, Welinder, Peter, Tillet, Phil, Pronin, Philip, Tillet, Philippe, Dhariwal, Prafulla, Yuan, Qiming, Dias, Rachel, Lim, Rachel, Arora, Rahul, Troll, Rajan, Lin, Randall, Lopes, Rapha Gontijo, Puri, Raul, Miyara, Reah, Leike, Reimar, Gaubert, Renaud, Zamani, Reza, Wang, Ricky, Donnelly, Rob, Honsby, Rob, Smith, Rocky, Sahai, Rohan, Ramchandani, Rohit, Huet, Romain, Carmichael, Rory, Zellers, Rowan, Chen, Roy, Chen, Ruby, Nigmatullin, Ruslan, Cheu, Ryan, Jain, Saachi, Altman, Sam, Schoenholz, Sam, Toizer, Sam, Miserendino, Samuel, Agarwal, Sandhini, Culver, Sara, Ethersmith, Scott, Gray, Scott, Grove, Sean, Metzger, Sean, Hermani, Shamez, Jain, Shantanu, Zhao, Shengjia, Wu, Sherwin, Jomoto, Shino, Wu, Shirong, Shuaiqi, Xia, Phene, Sonia, Papay, Spencer, Narayanan, Srinivas, Coffey, Steve, Lee, Steve, Hall, Stewart, Balaji, Suchir, Broda, Tal, Stramer, Tal, Xu, Tao, Gogineni, Tarun, Christianson, Taya, Sanders, Ted, Patwardhan, Tejal, Cunninghman, Thomas, Degry, Thomas, Dimson, Thomas, Raoux, Thomas, Shadwell, Thomas, Zheng, Tianhao, Underwood, 
Todd, Markov, Todor, Sherbakov, Toki, Rubin, Tom, Stasi, Tom, Kaftan, Tomer, Heywood, Tristan, Peterson, Troy, Walters, Tyce, Eloundou, Tyna, Qi, Valerie, Moeller, Veit, Monaco, Vinnie, Kuo, Vishal, Fomenko, Vlad, Chang, Wayne, Zheng, Weiyi, Zhou, Wenda, Manassra, Wesam, Sheu, Will, Zaremba, Wojciech, Patil, Yash, Qian, Yilei, Kim, Yongjik, Cheng, Youlong, Zhang, Yu, He, Yuchen, Zhang, Yuchen, Jin, Yujia, Dai, Yunxing, and Malkov, Yury
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computers and Society, Computer Science - Machine Learning, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: GPT-4o is an autoregressive omni model that accepts as input any combination of text, audio, image, and video, and generates any combination of text, audio, and image outputs. It's trained end-to-end across text, vision, and audio, meaning all inputs and outputs are processed by the same neural network. GPT-4o can respond to audio inputs in as little as 232 milliseconds, with an average of 320 milliseconds, which is similar to human response time in conversation. It matches GPT-4 Turbo performance on text in English and code, with significant improvement on text in non-English languages, while also being much faster and 50% cheaper in the API. GPT-4o is especially better at vision and audio understanding compared to existing models. In line with our commitment to building AI safely and consistent with our voluntary commitments to the White House, we are sharing the GPT-4o System Card, which includes our Preparedness Framework evaluations. In this System Card, we provide a detailed look at GPT-4o's capabilities, limitations, and safety evaluations across multiple categories, focusing on speech-to-speech while also evaluating text and image capabilities, and measures we've implemented to ensure the model is safe and aligned. We also include third-party assessments on dangerous capabilities, as well as discussion of potential societal impacts of GPT-4o's text and vision capabilities.
Published: 2024

278. Temporal Convolution-based Hybrid Model Approach with Representation Learning for Real-Time Acoustic Anomaly Detection

Author: Dissanayaka, Sahan, Wickramasinghe, Manjusri, and Marasinghe, Pasindu
Subjects: Computer Science - Sound, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: The early detection of potential failures in industrial machinery components is paramount for ensuring the reliability and safety of operations, thereby preserving Machine Condition Monitoring (MCM). This research addresses this imperative by introducing an innovative approach to Real-Time Acoustic Anomaly Detection. Our method combines semi-supervised temporal convolution with representation learning and a hybrid model strategy with Temporal Convolutional Networks (TCN) to handle various intricate anomaly patterns found in acoustic data effectively. The proposed model demonstrates superior performance compared to established research in the field, underscoring the effectiveness of this approach. Not only do we present quantitative evidence of its superiority, but we also employ visual representations, such as t-SNE plots, to further substantiate the model's efficacy., Comment: 10 pages, 10 figures, ICMLC2024
Published: 2024
Full Text: View/download PDF

279. Arabic Music Classification and Generation using Deep Learning

Author: Elshaarawy, Mohamed, Saeed, Ashrakat, Sheta, Mariam, Said, Abdelrahman, Bakr, Asem, Bahaa, Omar, and Gomaa, Walid
Subjects: Computer Science - Sound, Computer Science - Artificial Intelligence, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: This paper proposes a machine learning approach for classifying classical and new Egyptian music by composer and generating new similar music. The proposed system utilizes a convolutional neural network (CNN) for classification and a CNN autoencoder for generation. The dataset used in this project consists of new and classical Egyptian music pieces composed by different composers. To classify the music by composer, each sample is normalized and transformed into a mel spectrogram. The CNN model is trained on the dataset using the mel spectrograms as input features and the composer labels as output classes. The model achieves 81.4% accuracy in classifying the music by composer, demonstrating the effectiveness of the proposed approach. To generate new music similar to the original pieces, a CNN autoencoder is trained on a similar dataset. The model is trained to encode the mel spectrograms of the original pieces into a lower-dimensional latent space and then decode them back into the original mel spectrogram. The generated music is produced by sampling from the latent space and decoding the samples back into mel spectrograms, which are then transformed into audio. In conclusion, the proposed system provides a promising approach to classifying and generating classical Egyptian music, which can be applied in various musical applications, such as music recommendation systems, music production, and music education.
Published: 2024

280. Mask-Weighted Spatial Likelihood Coding for Speaker-Independent Joint Localization and Mask Estimation

Author: Kienegger, Jakob, Mannanova, Alina, and Gerkmann, Timo
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Machine Learning, Computer Science - Sound
Abstract: Due to their robustness and flexibility, neural-driven beamformers are a popular choice for speech separation in challenging environments with a varying amount of simultaneous speakers alongside noise and reverberation. Time-frequency masks and relative directions of the speakers regarding a fixed spatial grid can be used to estimate the beamformer's parameters. To some degree, speaker-independence is achieved by ensuring a greater amount of spatial partitions than speech sources. In this work, we analyze how to encode both mask and positioning into such a grid to enable joint estimation of both quantities. We propose mask-weighted spatial likelihood coding and show that it achieves considerable performance in both tasks compared to baseline encodings optimized for either localization or mask estimation. In the same setup, we demonstrate superiority for joint estimation of both quantities. Conclusively, we propose a universal approach which can replace an upstream sound source localization system solely by adapting the training framework, making it highly relevant in performance-critical scenarios., Comment: © 2024 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works
Published: 2024

281. CloserMusicDB: A Modern Multipurpose Dataset of High Quality Music

Author: Piekarzewicz, Aleksandra, Sroka, Tomasz, Tym, Aleksander, and Modrzejewski, Mateusz
Subjects: Computer Science - Sound, Computer Science - Artificial Intelligence, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: In this paper, we introduce CloserMusicDB, a collection of full length studio quality tracks annotated by a team of human experts. We describe the selected qualities of our dataset, along with three example tasks possible to perform using this dataset: hook detection, contextual tagging and artist identification. We conduct baseline experiments and provide initial benchmarks for these tasks.
Published: 2024

282. Beyond Correlation: Evaluating Multimedia Quality Models with the Constrained Concordance Index

Author: Ragano, Alessandro, Martinez, Helard Becerra, and Hines, Andrew
Subjects: Computer Science - Multimedia, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing, Electrical Engineering and Systems Science - Image and Video Processing
Abstract: This study investigates the evaluation of multimedia quality models, focusing on the inherent uncertainties in subjective Mean Opinion Score (MOS) ratings due to factors like rater inconsistency and bias. Traditional statistical measures such as Pearson's Correlation Coefficient (PCC), Spearman's Rank Correlation Coefficient (SRCC), and Kendall's Tau (KTAU) often fail to account for these uncertainties, leading to inaccuracies in model performance assessment. We introduce the Constrained Concordance Index (CCI), a novel metric designed to overcome the limitations of existing metrics by considering the statistical significance of MOS differences and excluding comparisons where MOS confidence intervals overlap. Through comprehensive experiments across various domains including speech and image quality assessment, we demonstrate that CCI provides a more robust and accurate evaluation of instrumental quality models, especially in scenarios of low sample sizes, rater group variability, and restriction of range. Our findings suggest that incorporating rater subjectivity and focusing on statistically significant pairs can significantly enhance the evaluation framework for multimedia quality prediction models. This work not only sheds light on the overlooked aspects of subjective rating uncertainties but also proposes a methodological advancement for more reliable and accurate quality model evaluation.
Published: 2024
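
Note on the idea described in the abstract above: a concordance-style index can be restricted to pairs of stimuli whose MOS confidence intervals do not overlap, so that statistically indistinguishable pairs neither reward nor penalize a model. The Python sketch below illustrates only that idea. It is an assumption-laden illustration, not the authors' definition of CCI: the function name, the symmetric confidence intervals, and the example values are all hypothetical.

```python
from itertools import combinations


def constrained_concordance(mos, ci_half_width, predictions):
    """Fraction of usable pairs that the model orders the same way as MOS.

    A pair is usable only if the two MOS confidence intervals
    (mos +/- ci_half_width) do not overlap. Returns None when no pair
    is usable.
    """
    concordant = 0
    usable = 0
    for i, j in combinations(range(len(mos)), 2):
        lo_i, hi_i = mos[i] - ci_half_width[i], mos[i] + ci_half_width[i]
        lo_j, hi_j = mos[j] - ci_half_width[j], mos[j] + ci_half_width[j]
        if hi_i >= lo_j and hi_j >= lo_i:
            continue  # intervals overlap: skip this comparison
        usable += 1
        # Concordant when MOS and prediction differences share a sign.
        if (mos[i] - mos[j]) * (predictions[i] - predictions[j]) > 0:
            concordant += 1
    return concordant / usable if usable else None


# Hypothetical ratings: items 1 and 2 have overlapping intervals, and the
# model happens to order them "wrongly"; that pair is excluded.
mos = [1.0, 2.0, 2.1, 4.0]
ci = [0.2, 0.2, 0.2, 0.2]
pred = [1.5, 3.0, 2.5, 3.5]
index = constrained_concordance(mos, ci, pred)
```

In this toy example the only discordant pair is the one whose intervals overlap, so excluding it changes the score from 5/6 (plain pairwise concordance) to 1.0.
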

283. We Augmented Whisper With kNN and You Won't Believe What Came Next

Author: Nachesa, Maya K. and Niculae, Vlad
Subjects: Computer Science - Computation and Language, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Speech recognition performance varies by language, domain, and speaker characteristics such as accent, and fine-tuning a model on any of these categories may lead to catastrophic forgetting. k-nearest neighbor search (kNN), first proposed for neural sequence decoders for natural language generation (NLG) and machine translation (MT), is a non-parametric method that can instead adapt by building an external datastore that can then be searched during inference time, without training the underlying model. We show that Whisper, a transformer end-to-end speech model, benefits from kNN. We investigate the differences between the speech and text setups. We discuss implications for speaker adaptation, and analyze improvements by gender, accent, and age., Comment: 6 pages incl. appendix, 2 figures, 6 tables
Published: 2024
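
Note on the kNN mechanism mentioned in the abstract above: in the kNN approaches for sequence decoders that the abstract refers to, a datastore maps decoder hidden states (keys) to the tokens that followed them (values), and at inference time the model's next-token distribution is interpolated with a distribution built from the nearest stored neighbors. The Python sketch below shows that general recipe on toy data. It is not the paper's implementation, and the function name, squared-Euclidean distance, temperature, and interpolation weight `lam` are assumptions chosen for illustration.

```python
import math


def knn_interpolate(model_probs, query, datastore, k=2, temperature=1.0, lam=0.5):
    """Mix a model's next-token distribution with a kNN distribution.

    model_probs: list of probabilities, one per token ID.
    query: hidden-state vector for the current decoding step.
    datastore: list of (key_vector, token_id) pairs built offline.
    """
    # Squared Euclidean distance from the query to every stored key.
    scored = [
        (sum((q - x) ** 2 for q, x in zip(query, key)), token)
        for key, token in datastore
    ]
    neighbors = sorted(scored, key=lambda pair: pair[0])[:k]

    # Softmax over negative distances gives neighbor weights.
    weights = [math.exp(-dist / temperature) for dist, _ in neighbors]
    norm = sum(weights)

    knn_probs = [0.0] * len(model_probs)
    for weight, (_, token) in zip(weights, neighbors):
        knn_probs[token] += weight / norm

    # Linear interpolation between the two distributions.
    return [lam * kp + (1.0 - lam) * mp for kp, mp in zip(knn_probs, model_probs)]


# Toy example: the model prefers token 0, but both nearest neighbors
# in the datastore were followed by token 1.
store = [([0.0, 0.0], 1), ([0.0, 1.0], 1), ([5.0, 5.0], 0)]
mixed = knn_interpolate([0.6, 0.3, 0.1], [0.0, 0.0], store)
```

Because the datastore is searched rather than trained on, it can be swapped or extended without changing the underlying model's weights.
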

284. Wavetable Synthesis Using CVAE for Timbre Control Based on Semantic Label

Author: Yutani, Tsugumasa, Yamamoto, Yuya, Nakatani, Shuyo, and Terasawa, Hiroko
Subjects: Computer Science - Sound, Computer Science - Artificial Intelligence, Electrical Engineering and Systems Science - Audio and Speech Processing, Electrical Engineering and Systems Science - Signal Processing
Abstract: Synthesizers are essential in modern music production. However, their complex timbre parameters, often filled with technical terms, require expertise. This research introduces a method of timbre control in wavetable synthesis that is intuitive and sensible and utilizes semantic labels. Using a conditional variational autoencoder (CVAE), users can select a wavetable and define the timbre with labels such as bright, warm, and rich. The CVAE model, featuring convolutional and upsampling layers, effectively captures the wavetable nuances, ensuring real-time performance owing to their processing in the time domain. Experiments demonstrate that this approach allows for real-time, effective control of the timbre of the wavetable using semantic inputs and aims for intuitive timbre control through data-based semantic control., Comment: 6 pages, 4 figures, Accepted at APSIPA ASC 2024
Published: 2024

285. STTATTS: Unified Speech-To-Text And Text-To-Speech Model

Author: Toyin, Hawau Olamide, Li, Hao, and Aldarmaki, Hanan
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Machine Learning, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Speech recognition and speech synthesis models are typically trained separately, each with its own set of learning objectives, training data, and model parameters, resulting in two distinct large networks. We propose a parameter-efficient approach to learning ASR and TTS jointly via a multi-task learning objective and shared parameters. Our evaluation demonstrates that the performance of our multi-task model is comparable to that of individually trained models while significantly saving computational and memory costs (~50% reduction in the total number of parameters required for the two tasks combined). We experiment with English as a resource-rich language, and Arabic as a relatively low-resource language due to shortage of TTS data. Our models are trained with publicly available data, and both the training code and model checkpoints are openly available for further research., Comment: 11 pages, 4 Figures, EMNLP 2024 Findings
Published: 2024

286. Making Social Platforms Accessible: Emotion-Aware Speech Generation with Integrated Text Analysis

Author: De, Suparna, Bostan, Ionut, and Sastry, Nishanth
Subjects: Computer Science - Social and Information Networks, Computer Science - Computation and Language, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Recent studies have outlined the accessibility challenges faced by blind or visually impaired, and less-literate people, in interacting with social networks, in spite of facilitating technologies such as monotone text-to-speech (TTS) screen readers and audio narration of visual elements such as emojis. Emotional speech generation traditionally relies on human input of the expected emotion together with the text to synthesise, with additional challenges around data simplification (causing information loss) and duration inaccuracy, leading to a lack of expressive emotional rendering. In real-life communications, the duration of phonemes can vary since the same sentence might be spoken in a variety of ways depending on the speakers' emotional states or accents (referred to as the one-to-many problem of text-to-speech generation). As a result, an advanced voice synthesis system is required to account for this unpredictability. We propose an end-to-end context-aware Text-to-Speech (TTS) synthesis system that derives the conveyed emotion from text input and synthesises audio that focuses on emotions and speaker features for natural and expressive speech, integrating advanced natural language processing (NLP) and speech synthesis techniques for real-time applications. Our system also showcases competitive inference time performance when benchmarked against the state-of-the-art TTS models, making it suitable for real-time accessibility applications.
Published: 2024

287. MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark

Author: Sakshi, S, Tyagi, Utkarsh, Kumar, Sonal, Seth, Ashish, Selvakumar, Ramaneswaran, Nieto, Oriol, Duraiswami, Ramani, Ghosh, Sreyan, and Manocha, Dinesh
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Sound
Abstract: The ability to comprehend audio--which includes speech, non-speech sounds, and music--is crucial for AI agents to interact effectively with the world. We present MMAU, a novel benchmark designed to evaluate multimodal audio understanding models on tasks requiring expert-level knowledge and complex reasoning. MMAU comprises 10k carefully curated audio clips paired with human-annotated natural language questions and answers spanning speech, environmental sounds, and music. It includes information extraction and reasoning questions, requiring models to demonstrate 27 distinct skills across unique and challenging tasks. Unlike existing benchmarks, MMAU emphasizes advanced perception and reasoning with domain-specific knowledge, challenging models to tackle tasks akin to those faced by experts. We assess 18 open-source and proprietary (Large) Audio-Language Models, demonstrating the significant challenges posed by MMAU. Notably, even the most advanced Gemini Pro v1.5 achieves only 52.97% accuracy, and the state-of-the-art open-source Qwen2-Audio achieves only 52.50%, highlighting considerable room for improvement. We believe MMAU will drive the audio and multimodal research community to develop more advanced audio understanding models capable of solving complex audio tasks., Comment: Project Website: https://sakshi113.github.io/mmau_homepage/
Published: 2024

288. AlignCap: Aligning Speech Emotion Captioning to Human Preferences

Author: Liang, Ziqi, Shi, Haoxiang, and Chen, Hanhui
Subjects: Computer Science - Computation and Language, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Speech Emotion Captioning (SEC) has gradually become an active research task. The emotional content conveyed through human speech is often complex, and classifying it into fixed categories may not be enough to fully capture speech emotions. Describing speech emotions through natural language may be a more effective approach. However, existing SEC methods often produce hallucinations and lose generalization on unseen speech. To overcome these problems, we propose AlignCap, which aligns speech emotion captioning to human preferences based on a large language model (LLM) with two properties: 1) Speech-Text Alignment, which minimizes the divergence between the LLM's response prediction distributions for speech and text inputs using knowledge distillation (KD) Regularization; and 2) Human Preference Alignment, where we design Preference Optimization (PO) Regularization to eliminate factuality and faithfulness hallucinations. We also extract emotional clues as a prompt for enriching fine-grained information under KD-Regularization. Experiments demonstrate that AlignCap presents stronger performance than other state-of-the-art methods on the zero-shot SEC task., Comment: Accepted to EMNLP2024 main conference
Published: 2024

289. A Survey on Speech Large Language Models

Author: Peng, Jing, Wang, Yucheng, Xi, Yu, Li, Xu, Zhang, Xizhuo, and Yu, Kai
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Large Language Models (LLMs) exhibit strong contextual understanding and remarkable multi-task performance. Therefore, researchers have been seeking to integrate LLMs into the field of Spoken Language Understanding (SLU) in the broad sense. Different from the traditional method of cascading LLMs to process text generated by Automatic Speech Recognition (ASR), new efforts have focused on designing architectures centered around Audio Feature Extraction - Multimodal Information Fusion - LLM Inference (Speech LLMs). This approach enables richer audio feature extraction while simultaneously facilitating end-to-end fusion of audio and text modalities, thereby achieving deeper understanding and reasoning from audio data. This paper elucidates the development of Speech LLMs, offering an in-depth analysis of system architectures and training strategies. Through extensive research and a series of targeted experiments, the paper assesses Speech LLMs' advancements in Rich Audio Transcription and their potential for Cross-task Integration within the SLU field. Additionally, it indicates key challenges uncovered through experimentation, such as the Dormancy of LLMs under certain conditions. The paper further delves into the training strategies for Speech LLMs, proposing potential solutions based on these findings, and offering valuable insights and references for future research in this domain, as well as LLM applications in multimodal contexts.
Published: 2024

290. A contrastive-learning approach for auditory attention detection

Author: Bajestan, Seyed Ali Alavi, Pitt, Mark, and Williamson, Donald S.
Subjects: Computer Science - Machine Learning, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Carrying on conversations in multi-sound environments is one of the more challenging tasks, since the sounds overlap across time and frequency, making it difficult to understand a single sound source. One proposed approach to help isolate an attended speech source is through decoding the electroencephalogram (EEG) and identifying the attended audio source using statistical or machine learning techniques. However, the limited amount of data in comparison to other machine learning problems and the distributional shift between different EEG recordings emphasize the need for a self-supervised approach that works with limited data to achieve a more robust solution. In this paper, we propose a method based on self-supervised learning to minimize the difference between the latent representations of an attended speech signal and the corresponding EEG signal. This network is further fine-tuned for the auditory attention classification task. We compare our results with previously published methods and achieve state-of-the-art performance on the validation set.
Published: 2024

291. Gibberish is All You Need for Membership Inference Detection in Contrastive Language-Audio Pretraining

Author: Cheng, Ruoxi, Ding, Yizhong, Cao, Shuirong, Shao, Shitong, and Wang, Zhiqiang
Subjects: Computer Science - Sound, Computer Science - Artificial Intelligence, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Audio can disclose PII, particularly when combined with related text data. Therefore, it is essential to develop tools to detect privacy leakage in Contrastive Language-Audio Pretraining (CLAP). Existing MIAs need audio as input, risking exposure of voiceprint and requiring costly shadow models. We first propose PRMID, a membership inference detector based on the probability ranking given by CLAP, which does not require training shadow models but still requires both audio and text of the individual as input. To address these limitations, we then propose USMID, a textual unimodal speaker-level membership inference detector, querying the target model using only text data. We randomly generate textual gibberish that is clearly not in the training dataset. Then we extract feature vectors from these texts using the CLAP model and train a set of anomaly detectors on them. During inference, the feature vector of each test text is input into the anomaly detector to determine if the speaker is in the training set (anomalous) or not (normal). If available, USMID can further enhance detection by integrating real audio of the tested speaker. Extensive experiments on various CLAP model architectures and datasets demonstrate that USMID outperforms baseline methods using only text data.
Published: 2024

292. Contextual Biasing to Improve Domain-specific Custom Vocabulary Audio Transcription without Explicit Fine-Tuning of Whisper Model

Author: Lall, Vishakha and Liu, Yisi
Subjects: Computer Science - Artificial Intelligence, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: OpenAI's Whisper Automated Speech Recognition model excels in generalizing across diverse datasets and domains. However, this broad adaptability can lead to diminished performance in tasks requiring recognition of specific vocabularies. Addressing this challenge typically involves fine-tuning the model, which demands extensive labeled audio data that is often difficult to acquire and unavailable for specific domains. In this study, we propose a method to enhance transcription accuracy without explicit fine-tuning or altering model parameters, using a relatively small training dataset. Our method leverages contextual biasing to direct the Whisper model's output towards a specific vocabulary by integrating a neural-symbolic prefix tree structure to guide the model's transcription output. To validate our approach, we conducted experiments using a validation dataset comprising maritime data collected within a simulated training environment. A comparison between the original Whisper models of varying parameter sizes and our biased model revealed a notable reduction in transcription word error rate and enhanced performance of downstream applications. Our findings suggest that this methodology holds promise for improving speech-to-text translation performance in domains characterized by limited vocabularies.
Published: 2024

293. Unified Microphone Conversion: Many-to-Many Device Mapping via Feature-wise Linear Modulation

Author: Ryu, Myeonghoon, Oh, Hongseok, Lee, Suji, and Park, Han
Subjects: Computer Science - Sound, Computer Science - Machine Learning, Computer Science - Multimedia, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: In this study, we introduce Unified Microphone Conversion, a unified generative framework to enhance the resilience of sound event classification systems against device variability. Building on the limitations of previous works, we condition the generator network with frequency response information to achieve many-to-many device mapping. This approach overcomes the inherent limitation of CycleGAN, which requires separate models for each device pair. Our framework leverages the strengths of CycleGAN for unpaired training to simulate device characteristics in audio recordings and significantly extends its scalability by integrating frequency response related information via Feature-wise Linear Modulation. The experimental results show that our method outperforms the state-of-the-art method by 2.6% and reduces variability by 0.8% in macro-average F1 score., Comment: Currently under review for ICASSP 2025
Published: 2024

294. Robust and Explainable Depression Identification from Speech Using Vowel-Based Ensemble Learning Approaches

Author: Feng, Kexin and Chaspari, Theodora
Subjects: Computer Science - Machine Learning, Computer Science - Computation and Language, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: This study investigates explainable machine learning algorithms for identifying depression from speech. Grounded in evidence from speech production that depression affects motor control and vowel generation, pre-trained vowel-based embeddings, which integrate semantically meaningful linguistic units, are used. Following that, an ensemble learning approach decomposes the problem into constituent parts characterized by specific depression symptoms and severity levels. Two methods are explored: a "bottom-up" approach with 8 models predicting individual Patient Health Questionnaire-8 (PHQ-8) item scores, and a "top-down" approach using a Mixture of Experts (MoE) with a router module for assessing depression severity. Both methods show performance comparable to state-of-the-art baselines, demonstrating robustness and reduced susceptibility to dataset mean/median values. System explainability benefits are discussed, highlighting their potential to assist clinicians in depression diagnosis and screening., Comment: accepted at the IEEE-EMBS International Conference on Biomedical and Health Informatics (BHI 2024)
Published: 2024

295. Optimizing the role of human evaluation in LLM-based spoken document summarization systems

Author: Kroll, Margaret and Kraus, Kelsey
Subjects: Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: The emergence of powerful LLMs has led to a paradigm shift in abstractive summarization of spoken documents. The properties that make LLMs so valuable for this task -- creativity, ability to produce fluent speech, and ability to abstract information from large corpora -- also present new challenges to evaluating their content. Quick, cost-effective automatic evaluations such as ROUGE and BERTScore offer promise, but do not yet show competitive performance when compared to human evaluations. We draw on methodologies from the social sciences to propose an evaluation paradigm for spoken document summarization explicitly tailored for generative AI content. We provide detailed evaluation criteria and best practices guidelines to ensure robustness in the experimental design, replicability, and trustworthiness of human evaluation studies. We additionally include two case studies that show how these human-in-the-loop evaluation methods have been implemented at a major U.S. technology company.
Published: 2024

296. Vocal Melody Construction for Persian Lyrics Using LSTM Recurrent Neural Networks

Author: Jafari, Farshad, Didehvar, Farzad, and Gheibi, Amin
Subjects: Computer Science - Sound, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: The present paper investigated automatic melody construction with Persian lyrics as input. It was assumed that there is a phonological correlation between the lyric syllables and the melody in a song. A seq2seq neural network was developed to investigate this assumption, trained on parallel syllable and note sequences in Persian songs to suggest a pleasant melody for a new sequence of syllables. More than 100 pieces of Persian music were collected and converted from the printed version to the digital format due to the lack of a dataset on Persian digital music. Finally, 14 new lyrics were given to the model as input, and the suggested melodies were performed and recorded by music experts to evaluate the trained model. The evaluation was conducted using an audio questionnaire, which more than 170 people answered. According to the answers about the pleasantness of melody, the system outputs scored an average of 3.005 out of 5, while the human-made melodies for the same lyrics obtained an average score of 4.078.
Published: 2024

297. ELAICHI: Enhancing Low-resource TTS by Addressing Infrequent and Low-frequency Character Bigrams

Author: Anand, Srija, Varadhan, Praveen Srinivasa, Singal, Mehak, and Khapra, Mitesh M.
Subjects: Computer Science - Computation and Language, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Recent advancements in Text-to-Speech (TTS) technology have led to natural-sounding speech for English, primarily due to the availability of large-scale, high-quality web data. However, many other languages lack access to such resources, relying instead on limited studio-quality data. This scarcity results in synthesized speech that often suffers from intelligibility issues, particularly with low-frequency character bigrams. In this paper, we propose three solutions to address this challenge. First, we leverage high-quality data from linguistically or geographically related languages to improve TTS for the target language. Second, we utilize low-quality Automatic Speech Recognition (ASR) data recorded in non-studio environments, which is refined using denoising and speech enhancement models. Third, we apply knowledge distillation from large-scale models using synthetic data to generate more robust outputs. Our experiments with Hindi demonstrate significant reductions in intelligibility issues, as validated by human evaluators. We propose this methodology as a viable alternative for languages with limited access to high-quality data, enabling them to collectively benefit from shared resources., Comment: 11 pages, 1 figure, 3 tables
Published: 2024

298. Non-intrusive Speech Quality Assessment with Diffusion Models Trained on Clean Speech

Author: de Oliveira, Danilo, Richter, Julius, Lemercier, Jean-Marie, Welker, Simon, and Gerkmann, Timo
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Machine Learning, Computer Science - Sound
Abstract: Diffusion models have found great success in generating high-quality, natural samples of speech, but their potential for density estimation for speech has so far remained largely unexplored. In this work, we leverage an unconditional diffusion model trained only on clean speech for the assessment of speech quality. We show that the quality of a speech utterance can be assessed by estimating the likelihood of a corresponding sample in the terminating Gaussian distribution, obtained via a deterministic noising process. The resulting method is purely unsupervised, trained only on clean speech, and therefore does not rely on annotations. Our diffusion-based approach leverages clean speech priors to assess quality based on how the input relates to the learned distribution of clean data. Our proposed log-likelihoods show promising results, correlating well with intrusive speech quality metrics such as POLQA and SI-SDR.
Published: 2024

299. OmniFlatten: An End-to-end GPT Model for Seamless Voice Conversation

Author: Zhang, Qinglin, Cheng, Luyao, Deng, Chong, Chen, Qian, Wang, Wen, Zheng, Siqi, Liu, Jiaqing, Yu, Hai, and Tan, Chaohong
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Full-duplex spoken dialogue systems significantly advance over traditional turn-based dialogue systems, as they allow simultaneous bidirectional communication, closely mirroring human-human interactions. However, achieving low latency and natural interactions in full-duplex dialogue systems remains a significant challenge, especially considering human conversation dynamics such as interruptions, backchannels, and overlapping speech. In this paper, we introduce a novel End-to-End GPT-based model OmniFlatten for full-duplex conversation, capable of effectively modeling the complex behaviors inherent to natural conversations with low latency. To achieve full-duplex communication capabilities, we propose a multi-stage post-training scheme that progressively adapts a text-based large language model (LLM) backbone into a speech-text dialogue LLM, capable of generating text and speech in real time, without modifying the architecture of the backbone LLM. The training process comprises three stages: modality alignment, half-duplex dialogue learning, and full-duplex dialogue learning. Throughout all training stages, we standardize the data using a flattening operation, which allows us to unify the training methods and the model architecture across different modalities and tasks. Our approach offers a straightforward modeling technique and a promising research direction for developing efficient and natural end-to-end full-duplex spoken dialogue systems. Audio samples of dialogues generated by OmniFlatten can be found at this web site (https://omniflatten.github.io/)., Comment: Work in progress
Published: 2024

300. Regularized autoregressive modeling and its application to audio signal declipping

Author: Mokrý, Ondřej and Rajmic, Pavel
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: Autoregressive (AR) modeling is invaluable in signal processing, in particular in the speech and audio fields. The literature contains attempts to regularize or constrain either the time-domain signal values or the AR coefficients, for various reasons, including the incorporation of prior information or numerical stabilization. Although these attempts are appealing, an encompassing and generic modeling framework is still missing. We propose such a framework and the related optimization problem and algorithm. We discuss the computational demands of the algorithm and explore the effects of various improvements on its convergence speed. In the experimental part, we demonstrate the usefulness of our approach on the audio declipping problem. We compare its performance against the state-of-the-art methods and demonstrate the competitiveness of the proposed method, especially for mildly clipped signals. The evaluation is extended by considering a heuristic algorithm of generalized linear prediction (GLP), a strong competitor which has only been presented as a patent and is new to the scientific community.
Published: 2024
