Author: "HUCK, P" / Topic: computer science - machine learning - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"HUCK, P"' showing total 58 results

Start Over Author "HUCK, P" Topic computer science - machine learning

58 results on '"HUCK, P"'

1. Towards Neural Scaling Laws for Time Series Foundation Models

Author: Yao, Qingren, Yang, Chao-Han Huck, Jiang, Renhe, Liang, Yuxuan, Jin, Ming, and Pan, Shirui
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence
Abstract: Scaling laws offer valuable insights into the design of time series foundation models (TSFMs). However, previous research has largely focused on the scaling laws of TSFMs for in-distribution (ID) data, leaving their out-of-distribution (OOD) scaling behavior and the influence of model architectures less explored. In this work, we examine two common TSFM architectures, encoder-only and decoder-only Transformers, and investigate their scaling behavior on both ID and OOD data. These models are trained and evaluated across varying parameter counts, compute budgets, and dataset sizes. Our experiments reveal that the log-likelihood loss of TSFMs exhibits similar scaling behavior in both OOD and ID settings. We further compare the scaling properties across different architectures, incorporating two state-of-the-art TSFMs as case studies, showing that model architecture plays a significant role in scaling. The encoder-only Transformers demonstrate better scalability than the decoder-only Transformers, while the architectural enhancements in the two advanced TSFMs primarily improve ID performance but reduce OOD scalability. While scaling up TSFMs is expected to drive performance breakthroughs, the lack of a comprehensive understanding of TSFM scaling laws has hindered the development of a robust framework to guide model scaling. We fill this gap in this work by synthesizing our findings and providing practical guidelines for designing and scaling larger TSFMs with enhanced model capabilities.
Published: 2024

2. Post-edits Are Preferences Too

Author: Berger, Nathaniel, Riezler, Stefan, Exel, Miriam, and Huck, Matthias
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: Preference Optimization (PO) techniques are currently one of the state of the art techniques for fine-tuning large language models (LLMs) on pairwise preference feedback from human annotators. However, in machine translation, this sort of feedback can be difficult to solicit. Additionally, Kreutzer et al. (2018) have shown that, for machine translation, pairwise preferences are less reliable than other forms of human feedback, such as 5-point ratings. We examine post-edits to see if they can be a source of reliable human preferences by construction. In PO, a human annotator is shown sequences $s_1$ and $s_2$ and asked for a preference judgment, %$s_1 > s_2$; while for post-editing, editors create $s_1$ and know that it should be better than $s_2$. We attempt to use these implicit preferences for PO and show that it helps the model move towards post-edit-like hypotheses and away from machine translation-like hypotheses. Furthermore, we show that best results are obtained by pre-training the model with supervised fine-tuning (SFT) on post-edits in order to promote post-edit-like hypotheses to the top output ranks., Comment: To appear at the Ninth Conference on Machine Translation (WMT24)
Published: 2024

3. Large Language Model Based Generative Error Correction: A Challenge and Baselines for Speech Recognition, Speaker Tagging, and Emotion Recognition

Author: Yang, Chao-Han Huck, Park, Taejin, Gong, Yuan, Li, Yuanchao, Chen, Zhehuai, Lin, Yen-Ting, Chen, Chen, Hu, Yuchen, Dhawan, Kunal, Żelasko, Piotr, Zhang, Chao, Chen, Yun-Nung, Tsao, Yu, Balam, Jagadeesh, Ginsburg, Boris, Siniscalchi, Sabato Marco, Chng, Eng Siong, Bell, Peter, Lai, Catherine, Watanabe, Shinji, and Stolcke, Andreas
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Machine Learning, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Given recent advances in generative AI technology, a key question is how large language models (LLMs) can enhance acoustic modeling tasks using text decoding results from a frozen, pretrained automatic speech recognition (ASR) model. To explore new capabilities in language modeling for speech processing, we introduce the generative speech transcription error correction (GenSEC) challenge. This challenge comprises three post-ASR language modeling tasks: (i) post-ASR transcription correction, (ii) speaker tagging, and (iii) emotion recognition. These tasks aim to emulate future LLM-based agents handling voice-based interfaces while remaining accessible to a broad audience by utilizing open pretrained language models or agent-based APIs. We also discuss insights from baseline evaluations, as well as lessons learned for designing future evaluations., Comment: IEEE SLT 2024. The initial draft version has been done in December 2023. Post-ASR Text Processing and Understanding Community and LlaMA-7B pre-training correction model: https://huggingface.co/GenSEC-LLM/SLT-Task1-Llama2-7b-HyPo-baseline
Published: 2024

4. Self-Taught Recognizer: Toward Unsupervised Adaptation for Speech Foundation Models

Author: Hu, Yuchen, Chen, Chen, Yang, Chao-Han Huck, Qin, Chengwei, Chen, Pin-Yu, Chng, Eng Siong, and Zhang, Chao
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Machine Learning, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: We propose an unsupervised adaptation framework, Self-TAught Recognizer (STAR), which leverages unlabeled data to enhance the robustness of automatic speech recognition (ASR) systems in diverse target domains, such as noise and accents. STAR is developed for prevalent speech foundation models based on Transformer-related architecture with auto-regressive decoding (e.g., Whisper, Canary). Specifically, we propose a novel indicator that empirically integrates step-wise information during decoding to assess the token-level quality of pseudo labels without ground truth, thereby guiding model updates for effective unsupervised adaptation. Experimental results show that STAR achieves an average of 13.5% relative reduction in word error rate across 14 target domains, and it sometimes even approaches the upper-bound performance of supervised adaptation. Surprisingly, we also observe that STAR prevents the adapted model from the common catastrophic forgetting problem without recalling source-domain data. Furthermore, STAR exhibits high data efficiency that only requires less than one-hour unlabeled data, and seamless generality to alternative large speech models and speech translation tasks. Our code aims to open source to the research communities., Comment: 23 pages, Preprint
Published: 2024

5. GenTranslate: Large Language Models are Generative Multilingual Speech and Machine Translators

Author: Hu, Yuchen, Chen, Chen, Yang, Chao-Han Huck, Li, Ruizhe, Zhang, Dong, Chen, Zhehuai, and Chng, Eng Siong
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Machine Learning, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Recent advances in large language models (LLMs) have stepped forward the development of multilingual speech and machine translation by its reduced representation errors and incorporated external knowledge. However, both translation tasks typically utilize beam search decoding and top-1 hypothesis selection for inference. These techniques struggle to fully exploit the rich information in the diverse N-best hypotheses, making them less optimal for translation tasks that require a single, high-quality output sequence. In this paper, we propose a new generative paradigm for translation tasks, namely "GenTranslate", which builds upon LLMs to generate better results from the diverse translation versions in N-best list. Leveraging the rich linguistic knowledge and strong reasoning abilities of LLMs, our new paradigm can integrate the rich information in N-best candidates to generate a higher-quality translation result. Furthermore, to support LLM finetuning, we build and release a HypoTranslate dataset that contains over 592K hypotheses-translation pairs in 11 languages. Experiments on various speech and machine translation benchmarks (e.g., FLEURS, CoVoST-2, WMT) demonstrate that our GenTranslate significantly outperforms the state-of-the-art model., Comment: 18 pages, Accepted by ACL 2024. This work is open sourced at: https://github.com/YUCHEN005/GenTranslate
Published: 2024

6. Large Language Models are Efficient Learners of Noise-Robust Speech Recognition

Author: Hu, Yuchen, Chen, Chen, Yang, Chao-Han Huck, Li, Ruizhe, Zhang, Chao, Chen, Pin-Yu, and Chng, EnSiong
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Machine Learning, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Recent advances in large language models (LLMs) have promoted generative error correction (GER) for automatic speech recognition (ASR), which leverages the rich linguistic knowledge and powerful reasoning ability of LLMs to improve recognition results. The latest work proposes a GER benchmark with HyPoradise dataset to learn the mapping from ASR N-best hypotheses to ground-truth transcription by efficient LLM finetuning, which shows great effectiveness but lacks specificity on noise-robust ASR. In this work, we extend the benchmark to noisy conditions and investigate if we can teach LLMs to perform denoising for GER just like what robust ASR do}, where one solution is introducing noise information as a conditioner into LLM. However, directly incorporating noise embeddings from audio encoder could harm the LLM tuning due to cross-modality gap. To this end, we propose to extract a language-space noise embedding from the N-best list to represent the noise conditions of source speech, which can promote the denoising process in GER. Furthermore, in order to enhance its representation ability of audio noise, we design a knowledge distillation (KD) approach via mutual information estimation to distill the real noise information in audio embeddings to our language embedding. Experiments on various latest LLMs demonstrate our approach achieves a new breakthrough with up to 53.9% correction improvement in terms of word error rate while with limited training data. Analysis shows that our language-space noise embedding can well represent the noise conditions of source speech, under which off-the-shelf LLMs show strong ability of language-space denoising., Comment: Accepted to ICLR 2024, Spotlight top 5%, 24 pages. This work will be open sourced at: https://github.com/YUCHEN005/RobustGER under MIT license
Published: 2024

7. Investigating Training Strategies and Model Robustness of Low-Rank Adaptation for Language Modeling in Speech Recognition

Author: Yu, Yu, Yang, Chao-Han Huck, Dinh, Tuan, Ryu, Sungho, Kolehmainen, Jari, Ren, Roger, Filimonov, Denis, Shivakumar, Prashanth G., Gandhe, Ankur, Rastow, Ariya, Xu, Jia, Bulyko, Ivan, and Stolcke, Andreas
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Machine Learning, Computer Science - Neural and Evolutionary Computing, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: The use of low-rank adaptation (LoRA) with frozen pretrained language models (PLMs) has become increasing popular as a mainstream, resource-efficient modeling approach for memory-constrained hardware. In this study, we first explore how to enhance model performance by introducing various LoRA training strategies, achieving relative word error rate reductions of 3.50\% on the public Librispeech dataset and of 3.67\% on an internal dataset in the messaging domain. To further characterize the stability of LoRA-based second-pass speech recognition models, we examine robustness against input perturbations. These perturbations are rooted in homophone replacements and a novel metric called N-best Perturbation-based Rescoring Robustness (NPRR), both designed to measure the relative degradation in the performance of rescoring models. Our experimental results indicate that while advanced variants of LoRA, such as dynamic rank-allocated LoRA, lead to performance degradation in $1$-best perturbation, they alleviate the degradation in $N$-best perturbation. This finding is in comparison to fully-tuned models and vanilla LoRA tuning baselines, suggesting that a comprehensive selection is needed when using LoRA-based adaptation for compute-cost savings and robust language modeling.
Published: 2024

8. Multimodal Attention Merging for Improved Speech Recognition and Audio Event Classification

Author: Sundar, Anirudh S., Yang, Chao-Han Huck, Chan, David M., Ghosh, Shalini, Ravichandran, Venkatesh, and Nidadavolu, Phani Sankar
Subjects: Computer Science - Machine Learning, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Training large foundation models using self-supervised objectives on unlabeled data, followed by fine-tuning on downstream tasks, has emerged as a standard procedure. Unfortunately, the efficacy of this approach is often constrained by both limited fine-tuning compute and scarcity in labeled downstream data. We introduce Multimodal Attention Merging (MAM), an attempt that facilitates direct knowledge transfer from attention matrices of models rooted in high resource modalities, text and images, to those in resource-constrained domains, speech and audio, employing a zero-shot paradigm. MAM reduces the relative Word Error Rate (WER) of an Automatic Speech Recognition (ASR) model by up to 6.70%, and relative classification error of an Audio Event Classification (AEC) model by 10.63%. In cases where some data/compute is available, we present Learnable-MAM, a data-driven approach to merging attention matrices, resulting in a further 2.90% relative reduction in WER for ASR and 18.42% relative reduction in AEC compared to fine-tuning., Comment: 5 pages, 1 figure, ICASSP 2024 Workshop on Self-supervision in Audio, Speech and Beyond
Published: 2023

9. Conditional Modeling Based Automatic Video Summarization

Author: Huang, Jia-Hong, Yang, Chao-Han Huck, Chen, Pin-Yu, Chen, Min-Hung, and Worring, Marcel
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Information Retrieval, Computer Science - Machine Learning, Computer Science - Multimedia
Abstract: The aim of video summarization is to shorten videos automatically while retaining the key information necessary to convey the overall story. Video summarization methods mainly rely on visual factors, such as visual consecutiveness and diversity, which may not be sufficient to fully understand the content of the video. There are other non-visual factors, such as interestingness, representativeness, and storyline consistency that should also be considered for generating high-quality video summaries. Current methods do not adequately take into account these non-visual factors, resulting in suboptimal performance. In this work, a new approach to video summarization is proposed based on insights gained from how humans create ground truth video summaries. The method utilizes a conditional modeling perspective and introduces multiple meaningful random variables and joint distributions to characterize the key components of video summarization. Helper distributions are employed to improve the training of the model. A conditional attention module is designed to mitigate potential performance degradation in the presence of multi-modal input. The proposed video summarization method incorporates the above innovative design choices that aim to narrow the gap between human-generated and machine-generated video summaries. Extensive experiments show that the proposed approach outperforms existing methods and achieves state-of-the-art performance on commonly used video summarization datasets., Comment: This work has been submitted to the IEEE for possible publication. arXiv admin note: substantial text overlap with arXiv:2305.00455
Published: 2023

10. Reconstruction of Patient-Specific Confounders in AI-based Radiologic Image Interpretation using Generative Pretraining

Author: Han, Tianyu, Žigutytė, Laura, Huck, Luisa, Huppertz, Marc, Siepmann, Robert, Gandelsman, Yossi, Blüthgen, Christian, Khader, Firas, Kuhl, Christiane, Nebelung, Sven, Kather, Jakob, and Truhn, Daniel
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: Detecting misleading patterns in automated diagnostic assistance systems, such as those powered by Artificial Intelligence, is critical to ensuring their reliability, particularly in healthcare. Current techniques for evaluating deep learning models cannot visualize confounding factors at a diagnostic level. Here, we propose a self-conditioned diffusion model termed DiffChest and train it on a dataset of 515,704 chest radiographs from 194,956 patients from multiple healthcare centers in the United States and Europe. DiffChest explains classifications on a patient-specific level and visualizes the confounding factors that may mislead the model. We found high inter-reader agreement when evaluating DiffChest's capability to identify treatment-related confounders, with Fleiss' Kappa values of 0.8 or higher across most imaging findings. Confounders were accurately captured with 11.1% to 100% prevalence rates. Furthermore, our pretraining process optimized the model to capture the most relevant information from the input radiographs. DiffChest achieved excellent diagnostic accuracy when diagnosing 11 chest conditions, such as pleural effusion and cardiac insufficiency, and at least sufficient diagnostic accuracy for the remaining conditions. Our findings highlight the potential of pretraining based on diffusion models in medical image classification, specifically in providing insights into confounding factors and model robustness.
Published: 2023

11. Generative Speech Recognition Error Correction with Large Language Models and Task-Activating Prompting

Author: Yang, Chao-Han Huck, Gu, Yile, Liu, Yi-Chieh, Ghosh, Shalini, Bulyko, Ivan, and Stolcke, Andreas
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Machine Learning, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: We explore the ability of large language models (LLMs) to act as speech recognition post-processors that perform rescoring and error correction. Our first focus is on instruction prompting to let LLMs perform these task without fine-tuning, for which we evaluate different prompting schemes, both zero- and few-shot in-context learning, and a novel task activation prompting method that combines causal instructions and demonstration to increase its context windows. Next, we show that rescoring only by in-context learning with frozen LLMs achieves results that are competitive with rescoring by domain-tuned LMs, using a pretrained first-pass recognition system and rescoring output on two out-of-domain tasks (ATIS and WSJ). By combining prompting techniques with fine-tuning we achieve error rates below the N-best oracle level, showcasing the generalization power of the LLMs., Comment: Accepted to IEEE Automatic Speech Recognition and Understanding (ASRU) 2023. 8 pages. 2nd version revised from Sep 29th's version
Published: 2023
Full Text: View/download PDF

12. HyPoradise: An Open Baseline for Generative Speech Recognition with Large Language Models

Author: Chen, Chen, Hu, Yuchen, Yang, Chao-Han Huck, Siniscalchi, Sabato Macro, Chen, Pin-Yu, and Chng, Eng Siong
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Machine Learning, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Advancements in deep neural networks have allowed automatic speech recognition (ASR) systems to attain human parity on several publicly available clean speech datasets. However, even state-of-the-art ASR systems experience performance degradation when confronted with adverse conditions, as a well-trained acoustic model is sensitive to variations in the speech domain, e.g., background noise. Intuitively, humans address this issue by relying on their linguistic knowledge: the meaning of ambiguous spoken terms is usually inferred from contextual cues thereby reducing the dependency on the auditory system. Inspired by this observation, we introduce the first open-source benchmark to utilize external large language models (LLMs) for ASR error correction, where N-best decoding hypotheses provide informative elements for true transcription prediction. This approach is a paradigm shift from the traditional language model rescoring strategy that can only select one candidate hypothesis as the output transcription. The proposed benchmark contains a novel dataset, HyPoradise (HP), encompassing more than 334,000 pairs of N-best hypotheses and corresponding accurate transcriptions across prevalent speech domains. Given this dataset, we examine three types of error correction techniques based on LLMs with varying amounts of labeled hypotheses-transcription pairs, which gains a significant word error rate (WER) reduction. Experimental evidence demonstrates the proposed technique achieves a breakthrough by surpassing the upper bound of traditional re-ranking based methods. More surprisingly, LLM with reasonable prompt and its generative capability can even correct those tokens that are missing in N-best list. We make our results publicly accessible for reproducible pipelines with released pre-trained models, thus providing a new evaluation paradigm for ASR error correction with LLMs., Comment: Accepted to NeurIPS 2023, 24 pages. Datasets and Benchmarks Track. Added the first Mandarin and code-switching (zh-cn and en-us) results from the LLM-based generative ASR error correction to Table 8 on Page 21
Published: 2023

13. Low-rank Adaptation of Large Language Model Rescoring for Parameter-Efficient Speech Recognition

Author: Yu, Yu, Yang, Chao-Han Huck, Kolehmainen, Jari, Shivakumar, Prashanth G., Gu, Yile, Ryu, Sungho, Ren, Roger, Luo, Qi, Gourav, Aditya, Chen, I-Fan, Liu, Yi-Chieh, Dinh, Tuan, Gandhe, Ankur, Filimonov, Denis, Ghosh, Shalini, Stolcke, Andreas, Rastow, Ariya, and Bulyko, Ivan
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Machine Learning, Computer Science - Neural and Evolutionary Computing, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: We propose a neural language modeling system based on low-rank adaptation (LoRA) for speech recognition output rescoring. Although pretrained language models (LMs) like BERT have shown superior performance in second-pass rescoring, the high computational cost of scaling up the pretraining stage and adapting the pretrained models to specific domains limit their practical use in rescoring. Here we present a method based on low-rank decomposition to train a rescoring BERT model and adapt it to new domains using only a fraction (0.08%) of the pretrained parameters. These inserted matrices are optimized through a discriminative training objective along with a correlation-based regularization loss. The proposed low-rank adaptation Rescore-BERT (LoRB) architecture is evaluated on LibriSpeech and internal datasets with decreased training times by factors between 5.4 and 3.6., Comment: Accepted to IEEE ASRU 2023. Internal Review Approved. Revised 2nd version with Andreas and Huck. The first version is in Sep 29th. 8 pages
Published: 2023
Full Text: View/download PDF

14. Explainability Techniques for Chemical Language Models

Author: Hödl, Stefan, Robinson, William, Bachrach, Yoram, Huck, Wilhelm, and Kachman, Tal
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Physics - Chemical Physics, Quantitative Biology - Quantitative Methods
Abstract: Explainability techniques are crucial in gaining insights into the reasons behind the predictions of deep learning models, which have not yet been applied to chemical language models. We propose an explainable AI technique that attributes the importance of individual atoms towards the predictions made by these models. Our method backpropagates the relevance information towards the chemical input string and visualizes the importance of individual atoms. We focus on self-attention Transformers operating on molecular string representations and leverage a pretrained encoder for finetuning. We showcase the method by predicting and visualizing solubility in water and organic solvents. We achieve competitive model performance while obtaining interpretable predictions, which we use to inspect the pretrained model.
Published: 2023

15. A Parameter-Efficient Learning Approach to Arabic Dialect Identification with Pre-Trained General-Purpose Speech Model

Author: Radhakrishnan, Srijith, Yang, Chao-Han Huck, Khan, Sumeer Ahmad, Kiani, Narsis A., Gomez-Cabrero, David, and Tegner, Jesper N.
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Machine Learning, Computer Science - Neural and Evolutionary Computing, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: In this work, we explore Parameter-Efficient-Learning (PEL) techniques to repurpose a General-Purpose-Speech (GSM) model for Arabic dialect identification (ADI). Specifically, we investigate different setups to incorporate trainable features into a multi-layer encoder-decoder GSM formulation under frozen pre-trained settings. Our architecture includes residual adapter and model reprogramming (input-prompting). We design a token-level label mapping to condition the GSM for Arabic Dialect Identification (ADI). This is challenging due to the high variation in vocabulary and pronunciation among the numerous regional dialects. We achieve new state-of-the-art accuracy on the ADI-17 dataset by vanilla fine-tuning. We further reduce the training budgets with the PEL method, which performs within 1.86% accuracy to fine-tuning using only 2.5% of (extra) network trainable parameters. Our study demonstrates how to identify Arabic dialects using a small dataset and limited computation with open source code and pre-trained models., Comment: Accepted to Interspeech 2023, 5 pages. Code is available at: https://github.com/Srijith-rkr/KAUST-Whisper-Adapter under MIT license
Published: 2023
Full Text: View/download PDF

16. Differentially Private Adapters for Parameter Efficient Acoustic Modeling

Author: Ho, Chun-Wei, Yang, Chao-Han Huck, and Siniscalchi, Sabato Marco
Subjects: Computer Science - Sound, Computer Science - Cryptography and Security, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: In this work, we devise a parameter-efficient solution to bring differential privacy (DP) guarantees into adaptation of a cross-lingual speech classifier. We investigate a new frozen pre-trained adaptation framework for DP-preserving speech modeling without full model fine-tuning. First, we introduce a noisy teacher-student ensemble into a conventional adaptation scheme leveraging a frozen pre-trained acoustic model and attain superior performance than DP-based stochastic gradient descent (DPSGD). Next, we insert residual adapters (RA) between layers of the frozen pre-trained acoustic model. The RAs reduce training cost and time significantly with a negligible performance drop. Evaluated on the open-access Multilingual Spoken Words (MLSW) dataset, our solution reduces the number of trainable parameters by 97.5% using the RAs with only a 4% performance drop with respect to fine-tuning the cross-lingual speech classifier while preserving DP guarantees., Comment: Accepted to Interspeech 2023. Code will be available at: https://github.com/Chun-wei-Ho/Private-Speech-Adapter. The authors would like to express their gratitude to Prof. Chin-Hui Lee from Georgia Tech for providing helpful insights and suggestions
Published: 2023
Full Text: View/download PDF

17. Classical-to-Quantum Transfer Learning Facilitates Machine Learning with Variational Quantum Circuit

Author: Qi, Jun, Yang, Chao-Han Huck, Chen, Pin-Yu, Hsieh, Min-Hsiu, Zenil, Hector, and Tegner, Jesper
Subjects: Quantum Physics, Computer Science - Machine Learning
Abstract: While Quantum Machine Learning (QML) is an exciting emerging area, the accuracy of the loss function still needs to be improved by the number of available qubits. Here, we reformulate the QML problem such that the approximation error (representation power) does not depend on the number of qubits. We prove that a classical-to-quantum transfer learning architecture using a Variational Quantum Circuit (VQC) improves the representation and generalization (estimation error) capabilities of the VQC model. We derive analytical bounds for the approximation and estimation error. We show that the architecture of classical-to-quantum transfer learning leverages pre-trained classical generative AI models, making it easier to find the optimal parameters for the VQC in the training stage. To validate our theoretical analysis, we perform experiments on single-dot and double-dot binary classification tasks for charge stability diagrams in semiconductor quantum dots, where the related empirical results support our theoretical findings. Our analytical and empirical results demonstrate the effectiveness of classical-to-quantum transfer learning architecture in realistic tasks. This sets the stage for accelerating QML applications beyond the current limits of available qubits., Comment: The paper needs a major revision before it could be submitted to a new journal, and the authors agree that the latest version could not be open to public at the moment
Published: 2023

18. Improved distinct bone segmentation in upper-body CT through multi-resolution networks

Author: Schnider, Eva, Wolleb, Julia, Huck, Antal, Toranelli, Mireille, Rauter, Georg, Müller-Gerbl, Magdalena, and Cattin, Philippe C.
Subjects: Electrical Engineering and Systems Science - Image and Video Processing, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: Purpose: Automated distinct bone segmentation from CT scans is widely used in planning and navigation workflows. U-Net variants are known to provide excellent results in supervised semantic segmentation. However, in distinct bone segmentation from upper body CTs a large field of view and a computationally taxing 3D architecture are required. This leads to low-resolution results lacking detail or localisation errors due to missing spatial context when using high-resolution inputs. Methods: We propose to solve this problem by using end-to-end trainable segmentation networks that combine several 3D U-Nets working at different resolutions. Our approach, which extends and generalizes HookNet and MRN, captures spatial information at a lower resolution and skips the encoded information to the target network, which operates on smaller high-resolution inputs. We evaluated our proposed architecture against single resolution networks and performed an ablation study on information concatenation and the number of context networks. Results: Our proposed best network achieves a median DSC of 0.86 taken over all 125 segmented bone classes and reduces the confusion among similar-looking bones in different locations. These results outperform our previously published 3D U-Net baseline results on the task and distinct-bone segmentation results reported by other groups. Conclusion: The presented multi-resolution 3D U-Nets address current shortcomings in bone segmentation from upper-body CT scans by allowing for capturing a larger field of view while avoiding the cubic growth of the input pixels and intermediate computations that quickly outgrow the computational capacities in 3D. The approach thus improves the accuracy and efficiency of distinct bone segmentation from upper-body CT., Comment: Under submission
Published: 2023

19. From English to More Languages: Parameter-Efficient Model Reprogramming for Cross-Lingual Speech Recognition

Author: Yang, Chao-Han Huck, Li, Bo, Zhang, Yu, Chen, Nanxin, Prabhavalkar, Rohit, Sainath, Tara N., and Strohman, Trevor
Subjects: Computer Science - Sound, Computer Science - Artificial Intelligence, Computer Science - Machine Learning, Computer Science - Neural and Evolutionary Computing, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: In this work, we propose a new parameter-efficient learning framework based on neural model reprogramming for cross-lingual speech recognition, which can \textbf{re-purpose} well-trained English automatic speech recognition (ASR) models to recognize the other languages. We design different auxiliary neural architectures focusing on learnable pre-trained feature enhancement that, for the first time, empowers model reprogramming on ASR. Specifically, we investigate how to select trainable components (i.e., encoder) of a conformer-based RNN-Transducer, as a frozen pre-trained backbone. Experiments on a seven-language multilingual LibriSpeech speech (MLS) task show that model reprogramming only requires 4.2% (11M out of 270M) to 6.8% (45M out of 660M) of its original trainable parameters from a full ASR model to perform competitive results in a range of 11.9% to 8.1% WER averaged across different languages. In addition, we discover different setups to make large-scale pre-trained ASR succeed in both monolingual and multilingual speech recognition. Our methods outperform existing ASR tuning architectures and their extension with self-supervised losses (e.g., w2v-bert) in terms of lower WER and better training efficiency., Comment: Submitted to ICASSP 2023. The project was initiated in May 2022 during a research internship at Google Research
Published: 2023
Full Text: View/download PDF

20. Low-Resource Music Genre Classification with Cross-Modal Neural Model Reprogramming

Author: Hung, Yun-Ning, Yang, Chao-Han Huck, Chen, Pin-Yu, and Lerch, Alexander
Subjects: Computer Science - Sound, Computer Science - Artificial Intelligence, Computer Science - Machine Learning, Computer Science - Neural and Evolutionary Computing, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Transfer learning (TL) approaches have shown promising results when handling tasks with limited training data. However, considerable memory and computational resources are often required for fine-tuning pre-trained neural networks with target domain data. In this work, we introduce a novel method for leveraging pre-trained models for low-resource (music) classification based on the concept of Neural Model Reprogramming (NMR). NMR aims at re-purposing a pre-trained model from a source domain to a target domain by modifying the input of a frozen pre-trained model. In addition to the known, input-independent, reprogramming method, we propose an advanced reprogramming paradigm: Input-dependent NMR, to increase adaptability to complex input data such as musical audio. Experimental results suggest that a neural model pre-trained on large-scale datasets can successfully perform music genre classification by using this reprogramming method. The two proposed Input-dependent NMR TL methods outperform fine-tuning-based TL methods on a small genre classification dataset., Comment: Accepted to IEEE ICASSP 2023. The implementation is available at https://github.com/biboamy/music-repro
Published: 2022

21. A Quantum Kernel Learning Approach to Acoustic Modeling for Spoken Command Recognition

Author: Yang, Chao-Han Huck, Li, Bo, Zhang, Yu, Chen, Nanxin, Sainath, Tara N., Siniscalchi, Sabato Marco, and Lee, Chin-Hui
Subjects: Computer Science - Sound, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Audio and Speech Processing, Quantum Physics
Abstract: We propose a quantum kernel learning (QKL) framework to address the inherent data sparsity issues often encountered in training large-scare acoustic models in low-resource scenarios. We project acoustic features based on classical-to-quantum feature encoding. Different from existing quantum convolution techniques, we utilize QKL with features in the quantum space to design kernel-based classifiers. Experimental results on challenging spoken command recognition tasks for a few low-resource languages, such as Arabic, Georgian, Chuvash, and Lithuanian, show that the proposed QKL-based hybrid approach attains good improvements over existing classical and quantum solutions., Comment: Submitted to ICASSP 2023
Published: 2022
Full Text: View/download PDF

22. Inference and Denoise: Causal Inference-based Neural Speech Enhancement

Author: Hsieh, Tsun-An, Yang, Chao-Han Huck, Chen, Pin-Yu, Siniscalchi, Sabato Marco, and Tsao, Yu
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Artificial Intelligence, Computer Science - Machine Learning, Computer Science - Neural and Evolutionary Computing, Computer Science - Sound
Abstract: This study addresses the speech enhancement (SE) task within the causal inference paradigm by modeling the noise presence as an intervention. Based on the potential outcome framework, the proposed causal inference-based speech enhancement (CISE) separates clean and noisy frames in an intervened noisy speech using a noise detector and assigns both sets of frames to two mask-based enhancement modules (EMs) to perform noise-conditional SE. Specifically, we use the presence of noise as guidance for EM selection during training, and the noise detector selects the enhancement module according to the prediction of the presence of noise for each frame. Moreover, we derived a SE-specific average treatment effect to quantify the causal effect adequately. Experimental evidence demonstrates that CISE outperforms a non-causal mask-based SE approach in the studied settings and has better performance and efficiency than more complex SE models., Comment: Submitted to ICASSP 2023
Published: 2022

23. Certified Robustness of Quantum Classifiers against Adversarial Examples through Quantum Noise

Author: Huang, Jhih-Cing, Tsai, Yu-Lin, Yang, Chao-Han Huck, Su, Cheng-Fang, Yu, Chia-Mu, Chen, Pin-Yu, and Kuo, Sy-Yen
Subjects: Quantum Physics, Computer Science - Machine Learning, Computer Science - Neural and Evolutionary Computing, Electrical Engineering and Systems Science - Signal Processing
Abstract: Recently, quantum classifiers have been found to be vulnerable to adversarial attacks, in which quantum classifiers are deceived by imperceptible noises, leading to misclassification. In this paper, we propose the first theoretical study demonstrating that adding quantum random rotation noise can improve robustness in quantum classifiers against adversarial attacks. We link the definition of differential privacy and show that the quantum classifier trained with the natural presence of additive noise is differentially private. Finally, we derive a certified robustness bound to enable quantum classifiers to defend against adversarial examples, supported by experimental results simulated with noises from IBM's 7-qubits device., Comment: Accepted to IEEE ICASSP 2023
Published: 2022

24. An Ensemble Teacher-Student Learning Approach with Poisson Sub-sampling to Differential Privacy Preserving Speech Recognition

Author: Yang, Chao-Han Huck, Qi, Jun, Siniscalchi, Sabato Marco, and Lee, Chin-Hui
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Artificial Intelligence, Computer Science - Machine Learning, Computer Science - Sound, Electrical Engineering and Systems Science - Signal Processing
Abstract: We propose an ensemble learning framework with Poisson sub-sampling to effectively train a collection of teacher models to issue some differential privacy (DP) guarantee for training data. Through boosting under DP, a student model derived from the training data suffers little model degradation from the models trained with no privacy protection. Our proposed solution leverages upon two mechanisms, namely: (i) a privacy budget amplification via Poisson sub-sampling to train a target prediction model that requires less noise to achieve a same level of privacy budget, and (ii) a combination of the sub-sampling technique and an ensemble teacher-student learning framework that introduces DP-preserving noise at the output of the teacher models and transfers DP-preserving properties via noisy labels. Privacy-preserving student models are then trained with the noisy labels to learn the knowledge with DP-protection from the teacher model ensemble. Experimental evidences on spoken command recognition and continuous speech recognition of Mandarin speech show that our proposed framework greatly outperforms existing DP-preserving algorithms in both speech processing tasks., Comment: Accepted to ISCA, ISCSLP 2022, Singapore. 5 Pages
Published: 2022

25. An Experimental Study on Private Aggregation of Teacher Ensemble Learning for End-to-End Speech Recognition

Author: Yang, Chao-Han Huck, Chen, I-Fan, Stolcke, Andreas, Siniscalchi, Sabato Marco, and Lee, Chin-Hui
Subjects: Computer Science - Sound, Computer Science - Machine Learning, Computer Science - Neural and Evolutionary Computing, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Differential privacy (DP) is one data protection avenue to safeguard user information used for training deep models by imposing noisy distortion on privacy data. Such a noise perturbation often results in a severe performance degradation in automatic speech recognition (ASR) in order to meet a privacy budget $\varepsilon$. Private aggregation of teacher ensemble (PATE) utilizes ensemble probabilities to improve ASR accuracy when dealing with the noise effects controlled by small values of $\varepsilon$. We extend PATE learning to work with dynamic patterns, namely speech utterances, and perform a first experimental demonstration that it prevents acoustic data leakage in ASR training. We evaluate three end-to-end deep models, including LAS, hybrid CTC/attention, and RNN transducer, on the open-source LibriSpeech and TIMIT corpora. PATE learning-enhanced ASR models outperform the benchmark DP-SGD mechanisms, especially under strict DP budgets, giving relative word error rate reductions between 26.2% and 27.5% for an RNN transducer model evaluated with LibriSpeech. We also introduce a DP-preserving ASR solution for pretraining on public speech corpora., Comment: 5 pages. Accepted to IEEE SLT 2022. A first version draft was finished in Aug 2021
Published: 2022

26. Theoretical Error Performance Analysis for Variational Quantum Circuit Based Functional Regression

Author: Qi, Jun, Yang, Chao-Han Huck, Chen, Pin-Yu, and Hsieh, Min-Hsiu
Subjects: Quantum Physics, Computer Science - Artificial Intelligence, Computer Science - Machine Learning, Computer Science - Neural and Evolutionary Computing
Abstract: The noisy intermediate-scale quantum (NISQ) devices enable the implementation of the variational quantum circuit (VQC) for quantum neural networks (QNN). Although the VQC-based QNN has succeeded in many machine learning tasks, the representation and generalization powers of VQC still require further investigation, particularly when the dimensionality of classical inputs is concerned. In this work, we first put forth an end-to-end quantum neural network, TTN-VQC, which consists of a quantum tensor network based on a tensor-train network (TTN) for dimensionality reduction and a VQC for functional regression. Then, we aim at the error performance analysis for the TTN-VQC in terms of representation and generalization powers. We also characterize the optimization properties of TTN-VQC by leveraging the Polyak-Lojasiewicz (PL) condition. Moreover, we conduct the experiments of functional regression on a handwritten digit classification dataset to justify our theoretical analysis., Comment: Preprint version. 16 pages
Published: 2022

27. Treatment Learning Causal Transformer for Noisy Image Classification

Author: Yang, Chao-Han Huck, Hung, I-Te Danny, Liu, Yi-Chieh, and Chen, Pin-Yu
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Image and Video Processing
Abstract: Current top-notch deep learning (DL) based vision models are primarily based on exploring and exploiting the inherent correlations between training data samples and their associated labels. However, a known practical challenge is their degraded performance against "noisy" data, induced by different circumstances such as spurious correlations, irrelevant contexts, domain shift, and adversarial attacks. In this work, we incorporate this binary information of "existence of noise" as treatment into image classification tasks to improve prediction accuracy by jointly estimating their treatment effects. Motivated from causal variational inference, we propose a transformer-based architecture, Treatment Learning Causal Transformer (TLT), that uses a latent generative model to estimate robust feature representations from current observational input for noise image classification. Depending on the estimated noise level (modeled as a binary treatment factor), TLT assigns the corresponding inference network trained by the designed causal loss for prediction. We also create new noisy image datasets incorporating a wide range of noise factors (e.g., object masking, style transfer, and adversarial perturbation) for performance benchmarking. The superior performance of TLT in noisy image classification is further validated by several refutation evaluation metrics. As a by-product, TLT also improves visual salience methods for perceiving noisy images., Comment: Accepted to IEEE WACV 2023. The first version was finished in May 2018
Published: 2022
Full Text: View/download PDF

28. Exploiting Low-Rank Tensor-Train Deep Neural Networks Based on Riemannian Gradient Descent With Illustrations of Speech Processing

Author: Qi, Jun, Yang, Chao-Han Huck, Chen, Pin-Yu, and Tejedor, Javier
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: This work focuses on designing low complexity hybrid tensor networks by considering trade-offs between the model complexity and practical performance. Firstly, we exploit a low-rank tensor-train deep neural network (TT-DNN) to build an end-to-end deep learning pipeline, namely LR-TT-DNN. Secondly, a hybrid model combining LR-TT-DNN with a convolutional neural network (CNN), which is denoted as CNN+(LR-TT-DNN), is set up to boost the performance. Instead of randomly assigning large TT-ranks for TT-DNN, we leverage Riemannian gradient descent to determine a TT-DNN associated with small TT-ranks. Furthermore, CNN+(LR-TT-DNN) consists of convolutional layers at the bottom for feature extraction and several TT layers at the top to solve regression and classification problems. We separately assess the LR-TT-DNN and CNN+(LR-TT-DNN) models on speech enhancement and spoken command recognition tasks. Our empirical evidence demonstrates that the LR-TT-DNN and CNN+(LR-TT-DNN) models with fewer model parameters can outperform the TT-DNN and CNN+(TT-DNN) counterparts., Comment: 10 pages, 10 Figures
Published: 2022

29. Mitigating Closed-model Adversarial Examples with Bayesian Neural Modeling for Enhanced End-to-End Speech Recognition

Author: Yang, Chao-Han Huck, Ahmed, Zeeshan, Gu, Yile, Szurley, Joseph, Ren, Roger, Liu, Linda, Stolcke, Andreas, and Bulyko, Ivan
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Artificial Intelligence, Computer Science - Machine Learning, Computer Science - Neural and Evolutionary Computing, Computer Science - Sound
Abstract: In this work, we aim to enhance the system robustness of end-to-end automatic speech recognition (ASR) against adversarially-noisy speech examples. We focus on a rigorous and empirical "closed-model adversarial robustness" setting (e.g., on-device or cloud applications). The adversarial noise is only generated by closed-model optimization (e.g., evolutionary and zeroth-order estimation) without accessing gradient information of a targeted ASR model directly. We propose an advanced Bayesian neural network (BNN) based adversarial detector, which could model latent distributions against adaptive adversarial perturbation with divergence measurement. We further simulate deployment scenarios of RNN Transducer, Conformer, and wav2vec-2.0 based ASR systems with the proposed adversarial detection system. Leveraging the proposed BNN based detection system, we improve detection rate by +2.77 to +5.42% (relative +3.03 to +6.26%) and reduce the word error rate by 5.02 to 7.47% on LibriSpeech datasets compared to the current model enhancement methods against the adversarial speech examples., Comment: Accepted to ICASSP 2022
Published: 2022

30. A Study of Designing Compact Audio-Visual Wake Word Spotting System Based on Iterative Fine-Tuning in Neural Network Pruning

Author: Zhou, Hengshun, Du, Jun, Yang, Chao-Han Huck, Xiong, Shifu, and Lee, Chin-Hui
Subjects: Computer Science - Sound, Computer Science - Artificial Intelligence, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Audio-only-based wake word spotting (WWS) is challenging under noisy conditions due to environmental interference in signal transmission. In this paper, we investigate on designing a compact audio-visual WWS system by utilizing visual information to alleviate the degradation. Specifically, in order to use visual information, we first encode the detected lips to fixed-size vectors with MobileNet and concatenate them with acoustic features followed by the fusion network for WWS. However, the audio-visual model based on neural networks requires a large footprint and a high computational complexity. To meet the application requirements, we introduce a neural network pruning strategy via the lottery ticket hypothesis in an iterative fine-tuning manner (LTH-IF), to the single-modal and multi-modal models, respectively. Tested on our in-house corpus for audio-visual WWS in a home TV scene, the proposed audio-visual system achieves significant performance improvements over the single-modality (audio-only or video-only) system under different noisy conditions. Moreover, LTH-IF pruning can largely reduce the network parameters and computations with no degradation of WWS performance, leading to a potential product solution for the TV wake-up scenario., Comment: Accepted to ICASSP 2022. H. Zhou et al
Published: 2022

31. Pessimistic Model Selection for Offline Deep Reinforcement Learning

Author: Yang, Chao-Han Huck, Qi, Zhengling, Cui, Yifan, and Chen, Pin-Yu
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Computational Engineering, Finance, and Science, Computer Science - Neural and Evolutionary Computing, Electrical Engineering and Systems Science - Systems and Control
Abstract: Deep Reinforcement Learning (DRL) has demonstrated great potentials in solving sequential decision making problems in many applications. Despite its promising performance, practical gaps exist when deploying DRL in real-world scenarios. One main barrier is the over-fitting issue that leads to poor generalizability of the policy learned by DRL. In particular, for offline DRL with observational data, model selection is a challenging task as there is no ground truth available for performance demonstration, in contrast with the online setting with simulated environments. In this work, we propose a pessimistic model selection (PMS) approach for offline DRL with a theoretical guarantee, which features a provably effective framework for finding the best policy among a set of candidate models. Two refined approaches are also proposed to address the potential bias of DRL model in identifying the optimal policy. Numerical studies demonstrated the superior performance of our approach over existing methods., Comment: Preprint. A non-archival and preliminary venue was presented at NeurIPS 2021 Offline Reinforcement Learning Workshop
Published: 2021

32. A Variational Bayesian Approach to Learning Latent Variables for Acoustic Knowledge Transfer

Author: Hu, Hu, Siniscalchi, Sabato Marco, Yang, Chao-Han Huck, and Lee, Chin-Hui
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Artificial Intelligence, Computer Science - Machine Learning, Computer Science - Neural and Evolutionary Computing, Computer Science - Sound
Abstract: We propose a variational Bayesian (VB) approach to learning distributions of latent variables in deep neural network (DNN) models for cross-domain knowledge transfer, to address acoustic mismatches between training and testing conditions. Instead of carrying out point estimation in conventional maximum a posteriori estimation with a risk of having a curse of dimensionality in estimating a huge number of model parameters, we focus our attention on estimating a manageable number of latent variables of DNNs via a VB inference framework. To accomplish model transfer, knowledge learnt from a source domain is encoded in prior distributions of latent variables and optimally combined, in a Bayesian sense, with a small set of adaptation data from a target domain to approximate the corresponding posterior distributions. Experimental results on device adaptation in acoustic scene classification show that our proposed VB approach can obtain good improvements on target devices, and consistently outperforms 13 state-of-the-art knowledge transfer algorithms., Comment: Accepted to ICASSP 2022. Code is available at https://github.com/MihawkHu/ASC_Knowledge_Transfer
Published: 2021

33. Neural Model Reprogramming with Similarity Based Mapping for Low-Resource Spoken Command Recognition

Author: Yen, Hao, Ku, Pin-Jui, Yang, Chao-Han Huck, Hu, Hu, Siniscalchi, Sabato Marco, Chen, Pin-Yu, and Tsao, Yu
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Artificial Intelligence, Computer Science - Machine Learning, Computer Science - Neural and Evolutionary Computing, Computer Science - Sound
Abstract: In this study, we propose a novel adversarial reprogramming (AR) approach for low-resource spoken command recognition (SCR), and build an AR-SCR system. The AR procedure aims to modify the acoustic signals (from the target domain) to repurpose a pretrained SCR model (from the source domain). To solve the label mismatches between source and target domains, and further improve the stability of AR, we propose a novel similarity-based label mapping technique to align classes. In addition, the transfer learning (TL) technique is combined with the original AR process to improve the model adaptation capability. We evaluate the proposed AR-SCR system on three low-resource SCR datasets, including Arabic, Lithuanian, and dysarthric Mandarin speech. Experimental results show that with a pretrained AM trained on a large-scale English dataset, the proposed AR-SCR system outperforms the current state-of-the-art results on Arabic and Lithuanian speech commands datasets, with only a limited amount of training data., Comment: Accepted to Interspeech 2023. Code is available at: https://github.com/dodohow1011/SpeechAdvReprogram. Selected as Best Student Paper Candidate
Published: 2021
Full Text: View/download PDF

34. QTN-VQC: An End-to-End Learning framework for Quantum Neural Networks

Author: Qi, Jun, Yang, Chao-Han Huck, and Chen, Pin-Yu
Subjects: Quantum Physics, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning, Computer Science - Neural and Evolutionary Computing
Abstract: The advent of noisy intermediate-scale quantum (NISQ) computers raises a crucial challenge to design quantum neural networks for fully quantum learning tasks. To bridge the gap, this work proposes an end-to-end learning framework named QTN-VQC, by introducing a trainable quantum tensor network (QTN) for quantum embedding on a variational quantum circuit (VQC). The architecture of QTN is composed of a parametric tensor-train network for feature extraction and a tensor product encoding for quantum embedding. We highlight the QTN for quantum embedding in terms of two perspectives: (1) we theoretically characterize QTN by analyzing its representation power of input features; (2) QTN enables an end-to-end parametric model pipeline, namely QTN-VQC, from the generation of quantum embedding to the output measurement. Our experiments on the MNIST dataset demonstrate the advantages of QTN for quantum embedding over other quantum embedding approaches., Comment: Preprint. A Non-archival and preliminary venue was presented in NeurIPS 2021, Quantum Tensor Networks in Machine Learning Workshop
Published: 2021

35. A Lottery Ticket Hypothesis Framework for Low-Complexity Device-Robust Neural Acoustic Scene Classification

Author: Yen, Hao, Yang, Chao-Han Huck, Hu, Hu, Siniscalchi, Sabato Marco, Wang, Qing, Wang, Yuyang, Xia, Xianjun, Zhao, Yuanjun, Wu, Yuzhong, Wang, Yannan, Du, Jun, and Lee, Chin-Hui
Subjects: Computer Science - Sound, Computer Science - Machine Learning, Computer Science - Multimedia, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: We propose a novel neural model compression strategy combining data augmentation, knowledge transfer, pruning, and quantization for device-robust acoustic scene classification (ASC). Specifically, we tackle the ASC task in a low-resource environment leveraging a recently proposed advanced neural network pruning mechanism, namely Lottery Ticket Hypothesis (LTH), to find a sub-network neural model associated with a small amount non-zero model parameters. The effectiveness of LTH for low-complexity acoustic modeling is assessed by investigating various data augmentation and compression schemes, and we report an efficient joint framework for low-complexity multi-device ASC, called \emph{Acoustic Lottery}. Acoustic Lottery could compress an ASC model up to $1/10^{4}$ and attain a superior performance (validation accuracy of 79.4% and Log loss of 0.64) compared to its not compressed seed model. All results reported in this work are based on a joint effort of four groups, namely GT-USTC-UKE-Tencent, aiming to address the "Low-Complexity Acoustic Scene Classification (ASC) with Multiple Devices" in the DCASE 2021 Challenge Task 1a., Comment: 5 figures. DCASE 2021. The project started in November 2020. Revised version
Published: 2021

36. Voice2Series: Reprogramming Acoustic Models for Time Series Classification

Author: Yang, Chao-Han Huck, Tsai, Yun-Yun, and Chen, Pin-Yu
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Neural and Evolutionary Computing, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Learning to classify time series with limited data is a practical yet challenging problem. Current methods are primarily based on hand-designed feature extraction rules or domain-specific data augmentation. Motivated by the advances in deep speech processing models and the fact that voice data are univariate temporal signals, in this paper, we propose Voice2Series (V2S), a novel end-to-end approach that reprograms acoustic models for time series classification, through input transformation learning and output label mapping. Leveraging the representation learning power of a large-scale pre-trained speech processing model, on 30 different time series tasks we show that V2S performs competitive results on 19 time series classification tasks. We further provide a theoretical justification of V2S by proving its population risk is upper bounded by the source risk and a Wasserstein distance accounting for feature alignment via reprogramming. Our results offer new and effective means to time series classification., Comment: Updated version with a correction. The full draft was submitted in Jan 2021. The Voice2Series project initially was launched in Sep 2020. Accepted to ICML 2021, 16 Pages
Published: 2021

37. Robust Unsupervised Multi-Object Tracking in Noisy Environments

Author: Yang, C. -H. Huck, Chhabra, Mohit, Liu, Y. -C., Kong, Quan, Yoshinaga, Tomoaki, and Murakami, Tomokazu
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Machine Learning, Computer Science - Multimedia, Computer Science - Neural and Evolutionary Computing
Abstract: Physical processes, camera movement, and unpredictable environmental conditions like the presence of dust can induce noise and artifacts in video feeds. We observe that popular unsupervised MOT methods are dependent on noise-free inputs. We show that the addition of a small amount of artificial random noise causes a sharp degradation in model performance on benchmark metrics. We resolve this problem by introducing a robust unsupervised multi-object tracking (MOT) model: AttU-Net. The proposed single-head attention model helps limit the negative impact of noise by learning visual representations at different segment scales. AttU-Net shows better unsupervised MOT tracking performance over variational inference-based state-of-the-art baselines. We evaluate our method in the MNIST-MOT and the Atari game video benchmark. We also provide two extended video datasets: ``Kuzushiji-MNIST MOT'' which consists of moving Japanese characters and ``Fashion-MNIST MOT'' to validate the effectiveness of the MOT models., Comment: Accepted to IEEE ICIP 2021
Published: 2021
Full Text: View/download PDF

38. PATE-AAE: Incorporating Adversarial Autoencoder into Private Aggregation of Teacher Ensembles for Spoken Command Classification

Author: Yang, Chao-Han Huck, Siniscalchi, Sabato Marco, and Lee, Chin-Hui
Subjects: Computer Science - Sound, Computer Science - Artificial Intelligence, Computer Science - Machine Learning, Computer Science - Neural and Evolutionary Computing, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: We propose using an adversarial autoencoder (AAE) to replace generative adversarial network (GAN) in the private aggregation of teacher ensembles (PATE), a solution for ensuring differential privacy in speech applications. The AAE architecture allows us to obtain good synthetic speech leveraging upon a discriminative training of latent vectors. Such synthetic speech is used to build a privacy-preserving classifier when non-sensitive data is not sufficiently available in the public domain. This classifier follows the PATE scheme that uses an ensemble of noisy outputs to label the synthetic samples and guarantee $\varepsilon$-differential privacy (DP) on its derived classifiers. Our proposed framework thus consists of an AAE-based generator and a PATE-based classifier (PATE-AAE). Evaluated on the Google Speech Commands Dataset Version II, the proposed PATE-AAE improves the average classification accuracy by +$2.11\%$ and +$6.60\%$, respectively, when compared with alternative privacy-preserving solutions, namely PATE-GAN and DP-GAN, while maintaining a strong level of privacy target at $\varepsilon$=0.01 with a fixed $\delta$=10$^{-5}$., Comment: Accepted to Interspeech 2021
Published: 2021
Full Text: View/download PDF

39. Training a Resilient Q-Network against Observational Interference

Author: Yang, Chao-Han Huck, Hung, I-Te Danny, Ouyang, Yi, and Chen, Pin-Yu
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Neural and Evolutionary Computing, Computer Science - Robotics, Electrical Engineering and Systems Science - Systems and Control
Abstract: Deep reinforcement learning (DRL) has demonstrated impressive performance in various gaming simulators and real-world applications. In practice, however, a DRL agent may receive faulty observation by abrupt interferences such as black-out, frozen-screen, and adversarial perturbation. How to design a resilient DRL algorithm against these rare but mission-critical and safety-crucial scenarios is an essential yet challenging task. In this paper, we consider a deep q-network (DQN) framework training with an auxiliary task of observational interferences such as artificial noises. Inspired by causal inference for observational interference, we propose a causal inference based DQN algorithm called causal inference Q-network (CIQ). We evaluate the performance of CIQ in several benchmark DQN environments with different types of interferences as auxiliary labels. Our experimental results show that the proposed CIQ method could achieve higher performance and more resilience against observational interferences., Comment: Accepted to AAAI 2022. 9 pages
Published: 2021

40. Multi-task Language Modeling for Improving Speech Recognition of Rare Words

Author: Yang, Chao-Han Huck, Liu, Linda, Gandhe, Ankur, Gu, Yile, Raju, Anirudh, Filimonov, Denis, and Bulyko, Ivan
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Machine Learning, Computer Science - Neural and Evolutionary Computing, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: End-to-end automatic speech recognition (ASR) systems are increasingly popular due to their relative architectural simplicity and competitive performance. However, even though the average accuracy of these systems may be high, the performance on rare content words often lags behind hybrid ASR systems. To address this problem, second-pass rescoring is often applied leveraging upon language modeling. In this paper, we propose a second-pass system with multi-task learning, utilizing semantic targets (such as intent and slot prediction) to improve speech recognition performance. We show that our rescoring model trained with these additional tasks outperforms the baseline rescoring model, trained with only the language modeling task, by 1.4% on a general test and by 2.6% on a rare word test set in terms of word-error-rate relative (WERR). Our best ASR system with multi-task LM shows 4.6% WERR deduction compared with RNN Transducer only ASR baseline for rare words recognition., Comment: Accepted to IEEE Automatic Speech Recognition and Understanding (ASRU) 2021
Published: 2020

41. A Two-Stage Approach to Device-Robust Acoustic Scene Classification

Author: Hu, Hu, Yang, Chao-Han Huck, Xia, Xianjun, Bai, Xue, Tang, Xin, Wang, Yajian, Niu, Shutong, Chai, Li, Li, Juanjuan, Zhu, Hongning, Bao, Feng, Zhao, Yuanjun, Siniscalchi, Sabato Marco, Wang, Yannan, Du, Jun, and Lee, Chin-Hui
Subjects: Computer Science - Sound, Computer Science - Artificial Intelligence, Computer Science - Machine Learning, Computer Science - Neural and Evolutionary Computing, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: To improve device robustness, a highly desirable key feature of a competitive data-driven acoustic scene classification (ASC) system, a novel two-stage system based on fully convolutional neural networks (CNNs) is proposed. Our two-stage system leverages on an ad-hoc score combination based on two CNN classifiers: (i) the first CNN classifies acoustic inputs into one of three broad classes, and (ii) the second CNN classifies the same inputs into one of ten finer-grained classes. Three different CNN architectures are explored to implement the two-stage classifiers, and a frequency sub-sampling scheme is investigated. Moreover, novel data augmentation schemes for ASC are also investigated. Evaluated on DCASE 2020 Task 1a, our results show that the proposed ASC system attains a state-of-the-art accuracy on the development set, where our best system, a two-stage fusion of CNN ensembles, delivers a 81.9% average accuracy among multi-device test data, and it obtains a significant improvement on unseen devices. Finally, neural saliency analysis with class activation mapping (CAM) gives new insights on the patterns learnt by our models., Comment: Submitted to ICASSP 2021. Code available: https://github.com/MihawkHu/DCASE2020_task1
Published: 2020
Full Text: View/download PDF

42. DeepOpht: Medical Report Generation for Retinal Images via Deep Models and Visual Explanation

Author: Huang, Jia-Hong, Yang, Chao-Han Huck, Liu, Fangyu, Tian, Meng, Liu, Yi-Chieh, Wu, Ting-Wei, Lin, I-Hung, Wang, Kang, Morikawa, Hiromasa, Chang, Hernghua, Tegner, Jesper, and Worring, Marcel
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Machine Learning, Computer Science - Multimedia
Abstract: In this work, we propose an AI-based method that intends to improve the conventional retinal disease treatment procedure and help ophthalmologists increase diagnosis efficiency and accuracy. The proposed method is composed of a deep neural networks-based (DNN-based) module, including a retinal disease identifier and clinical description generator, and a DNN visual explanation module. To train and validate the effectiveness of our DNN-based module, we propose a large-scale retinal disease image dataset. Also, as ground truth, we provide a retinal image dataset manually labeled by ophthalmologists to qualitatively show, the proposed AI-based method is effective. With our experimental results, we show that the proposed method is quantitatively and qualitatively effective. Our method is capable of creating meaningful retinal image descriptions and visual explanations that are clinically relevant., Comment: Accepted to IEEE WACV 2021
Published: 2020

43. Decentralizing Feature Extraction with Quantum Convolutional Neural Network for Automatic Speech Recognition

Author: Yang, Chao-Han Huck, Qi, Jun, Chen, Samuel Yen-Chi, Chen, Pin-Yu, Siniscalchi, Sabato Marco, Ma, Xiaoli, and Lee, Chin-Hui
Subjects: Computer Science - Sound, Computer Science - Machine Learning, Computer Science - Neural and Evolutionary Computing, Electrical Engineering and Systems Science - Audio and Speech Processing, Quantum Physics
Abstract: We propose a novel decentralized feature extraction approach in federated learning to address privacy-preservation issues for speech recognition. It is built upon a quantum convolutional neural network (QCNN) composed of a quantum circuit encoder for feature extraction, and a recurrent neural network (RNN) based end-to-end acoustic model (AM). To enhance model parameter protection in a decentralized architecture, an input speech is first up-streamed to a quantum computing server to extract Mel-spectrogram, and the corresponding convolutional features are encoded using a quantum circuit algorithm with random parameters. The encoded features are then down-streamed to the local RNN model for the final recognition. The proposed decentralized framework takes advantage of the quantum learning progress to secure models and to avoid privacy leakage attacks. Testing on the Google Speech Commands Dataset, the proposed QCNN encoder attains a competitive accuracy of 95.12% in a decentralized model, which is better than the previous architectures using centralized RNN models with convolutional features. We also conduct an in-depth study of different quantum circuit encoder architectures to provide insights into designing QCNN-based feature extractors. Neural saliency analyses demonstrate a correlation between the proposed QCNN features, class activation maps, and input spectrograms. We provide an implementation for future studies., Comment: Accepted to IEEE ICASSP 2021. Code is available: https://github.com/huckiyang/QuantumSpeech-QCNN
Published: 2020
Full Text: View/download PDF

44. Exploring Deep Hybrid Tensor-to-Vector Network Architectures for Regression Based Speech Enhancement

Author: Qi, Jun, Hu, Hu, Wang, Yannan, Yang, Chao-Han Huck, Siniscalchi, Sabato Marco, and Lee, Chin-Hui
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Computation and Language, Computer Science - Machine Learning, Computer Science - Neural and Evolutionary Computing, Computer Science - Sound
Abstract: This paper investigates different trade-offs between the number of model parameters and enhanced speech qualities by employing several deep tensor-to-vector regression models for speech enhancement. We find that a hybrid architecture, namely CNN-TT, is capable of maintaining a good quality performance with a reduced model parameter size. CNN-TT is composed of several convolutional layers at the bottom for feature extraction to improve speech quality and a tensor-train (TT) output layer on the top to reduce model parameters. We first derive a new upper bound on the generalization power of the convolutional neural network (CNN) based vector-to-vector regression models. Then, we provide experimental evidence on the Edinburgh noisy speech corpus to demonstrate that, in single-channel speech enhancement, CNN outperforms DNN at the expense of a small increment of model sizes. Besides, CNN-TT slightly outperforms the CNN counterpart by utilizing only 32\% of the CNN model parameters. Besides, further performance improvement can be attained if the number of CNN-TT parameters is increased to 44\% of the CNN model size. Finally, our experiments of multi-channel speech enhancement on a simulated noisy WSJ0 corpus demonstrate that our proposed hybrid CNN-TT architecture achieves better results than both DNN and CNN models in terms of better-enhanced speech qualities and smaller parameter sizes., Comment: Accepted to InterSpeech 2020
Published: 2020

45. Wavelet Channel Attention Module with a Fusion Network for Single Image Deraining

Author: Yang, Hao-Hsiang, Yang, Chao-Han Huck, and Wang, Yu-Chiang Frank
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning, Computer Science - Neural and Evolutionary Computing, Electrical Engineering and Systems Science - Image and Video Processing
Abstract: Single image deraining is a crucial problem because rain severely degenerates the visibility of images and affects the performance of computer vision tasks like outdoor surveillance systems and intelligent vehicles. In this paper, we propose the new convolutional neural network (CNN) called the wavelet channel attention module with a fusion network. Wavelet transform and the inverse wavelet transform are substituted for down-sampling and up-sampling so feature maps from the wavelet transform and convolutions contain different frequencies and scales. Furthermore, feature maps are integrated by channel attention. Our proposed network learns confidence maps of four sub-band images derived from the wavelet transform of the original images. Finally, the clear image can be well restored via the wavelet reconstruction and fusion of the low-frequency part and high-frequency parts. Several experimental results on synthetic and real images present that the proposed algorithm outperforms state-of-the-art methods., Comment: Accepted to IEEE ICIP 2020
Published: 2020
Full Text: View/download PDF

46. Device-Robust Acoustic Scene Classification Based on Two-Stage Categorization and Data Augmentation

Author: Hu, Hu, Yang, Chao-Han Huck, Xia, Xianjun, Bai, Xue, Tang, Xin, Wang, Yajian, Niu, Shutong, Chai, Li, Li, Juanjuan, Zhu, Hongning, Bao, Feng, Zhao, Yuanjun, Siniscalchi, Sabato Marco, Wang, Yannan, Du, Jun, and Lee, Chin-Hui
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Machine Learning, Computer Science - Sound
Abstract: In this technical report, we present a joint effort of four groups, namely GT, USTC, Tencent, and UKE, to tackle Task 1 - Acoustic Scene Classification (ASC) in the DCASE 2020 Challenge. Task 1 comprises two different sub-tasks: (i) Task 1a focuses on ASC of audio signals recorded with multiple (real and simulated) devices into ten different fine-grained classes, and (ii) Task 1b concerns with classification of data into three higher-level classes using low-complexity solutions. For Task 1a, we propose a novel two-stage ASC system leveraging upon ad-hoc score combination of two convolutional neural networks (CNNs), classifying the acoustic input according to three classes, and then ten classes, respectively. Four different CNN-based architectures are explored to implement the two-stage classifiers, and several data augmentation techniques are also investigated. For Task 1b, we leverage upon a quantization method to reduce the complexity of two of our top-accuracy three-classes CNN-based architectures. On Task 1a development data set, an ASC accuracy of 76.9\% is attained using our best single classifier and data augmentation. An accuracy of 81.9\% is then attained by a final model fusion of our two-stage ASC classifiers. On Task 1b development data set, we achieve an accuracy of 96.7\% with a model size smaller than 500KB. Code is available: https://github.com/MihawkHu/DCASE2020_task1., Comment: Revised Technical Report. Proposed systems attain 2nds in both Task-1a and Task-1b in the official DCASE challenge 2020
Published: 2020

47. Y-net: Multi-scale feature aggregation network with wavelet structure similarity loss function for single image dehazing

Author: Yang, Hao-Hsiang, Yang, Chao-Han Huck, and Tsai, Yi-Chang James
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Graphics, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Image and Video Processing
Abstract: Single image dehazing is the ill-posed two-dimensional signal reconstruction problem. Recently, deep convolutional neural networks (CNN) have been successfully used in many computer vision problems. In this paper, we propose a Y-net that is named for its structure. This network reconstructs clear images by aggregating multi-scale features maps. Additionally, we propose a Wavelet Structure SIMilarity (W-SSIM) loss function in the training step. In the proposed loss function, discrete wavelet transforms are applied repeatedly to divide the image into differently sized patches with different frequencies and scales. The proposed loss function is the accumulation of SSIM loss of various patches with respective ratios. Extensive experimental results demonstrate that the proposed Y-net with the W-SSIM loss function restores high-quality clear images and outperforms state-of-the-art algorithms. Code and models are available at https://github.com/dectrfov/Y-net., Comment: Accepted to IEEE ICASSP 2020
Published: 2020

48. Characterizing Speech Adversarial Examples Using Self-Attention U-Net Enhancement

Author: Yang, Chao-Han Huck, Qi, Jun, Chen, Pin-Yu, Ma, Xiaoli, and Lee, Chin-Hui
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Computation and Language, Computer Science - Cryptography and Security, Computer Science - Machine Learning, Computer Science - Sound
Abstract: Recent studies have highlighted adversarial examples as ubiquitous threats to the deep neural network (DNN) based speech recognition systems. In this work, we present a U-Net based attention model, U-Net$_{At}$, to enhance adversarial speech signals. Specifically, we evaluate the model performance by interpretable speech recognition metrics and discuss the model performance by the augmented adversarial training. Our experiments show that our proposed U-Net$_{At}$ improves the perceptual evaluation of speech quality (PESQ) from 1.13 to 2.78, speech transmission index (STI) from 0.65 to 0.75, short-term objective intelligibility (STOI) from 0.83 to 0.96 on the task of speech enhancement with adversarial speech examples. We conduct experiments on the automatic speech recognition (ASR) task with adversarial audio attacks. We find that (i) temporal features learned by the attention network are capable of enhancing the robustness of DNN based ASR models; (ii) the generalization power of DNN based ASR model could be enhanced by applying adversarial training with an additive adversarial data augmentation. The ASR metric on word-error-rates (WERs) shows that there is an absolute 2.22 $\%$ decrease under gradient-based perturbation, and an absolute 2.03 $\%$ decrease, under evolutionary-optimized perturbation, which suggests that our enhancement models with adversarial training can further secure a resilient ASR system., Comment: The authors have revised some annotations in Table 4 to improve the clarity. The authors thank reading feedbacks from Jonathan Le Roux. The first draft was finished in August 2019. Accepted to IEEE ICASSP 2020
Published: 2020
Full Text: View/download PDF

49. Enhanced Adversarial Strategically-Timed Attacks against Deep Reinforcement Learning

Author: Yang, Chao-Han Huck, Qi, Jun, Chen, Pin-Yu, Ouyang, Yi, Hung, I-Te Danny, Lee, Chin-Hui, and Ma, Xiaoli
Subjects: Computer Science - Machine Learning, Electrical Engineering and Systems Science - Signal Processing, Statistics - Machine Learning
Abstract: Recent deep neural networks based techniques, especially those equipped with the ability of self-adaptation in the system level such as deep reinforcement learning (DRL), are shown to possess many advantages of optimizing robot learning systems (e.g., autonomous navigation and continuous robot arm control.) However, the learning-based systems and the associated models may be threatened by the risks of intentionally adaptive (e.g., noisy sensor confusion) and adversarial perturbations from real-world scenarios. In this paper, we introduce timing-based adversarial strategies against a DRL-based navigation system by jamming in physical noise patterns on the selected time frames. To study the vulnerability of learning-based navigation systems, we propose two adversarial agent models: one refers to online learning; another one is based on evolutionary learning. Besides, three open-source robot learning and navigation control environments are employed to study the vulnerability under adversarial timing attacks. Our experimental results show that the adversarial timing attacks can lead to a significant performance drop, and also suggest the necessity of enhancing the robustness of robot learning systems., Comment: Accepted to IEEE ICASSP 2020
Published: 2020
Full Text: View/download PDF

50. Tensor-to-Vector Regression for Multi-channel Speech Enhancement based on Tensor-Train Network

Author: Qi, Jun, Hu, Hu, Wang, Yannan, Yang, Chao-Han Huck, Siniscalchi, Sabato Marco, and Lee, Chin-Hui
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Computation and Language, Computer Science - Machine Learning, Computer Science - Neural and Evolutionary Computing, Computer Science - Sound
Abstract: We propose a tensor-to-vector regression approach to multi-channel speech enhancement in order to address the issue of input size explosion and hidden-layer size expansion. The key idea is to cast the conventional deep neural network (DNN) based vector-to-vector regression formulation under a tensor-train network (TTN) framework. TTN is a recently emerged solution for compact representation of deep models with fully connected hidden layers. Thus TTN maintains DNN's expressive power yet involves a much smaller amount of trainable parameters. Furthermore, TTN can handle a multi-dimensional tensor input by design, which exactly matches the desired setting in multi-channel speech enhancement. We first provide a theoretical extension from DNN to TTN based regression. Next, we show that TTN can attain speech enhancement quality comparable with that for DNN but with much fewer parameters, e.g., a reduction from 27 million to only 5 million parameters is observed in a single-channel scenario. TTN also improves PESQ over DNN from 2.86 to 2.96 by slightly increasing the number of trainable parameters. Finally, in 8-channel conditions, a PESQ of 3.12 is achieved using 20 million parameters for TTN, whereas a DNN with 68 million parameters can only attain a PESQ of 3.06. Our implementation is available online https://github.com/uwjunqi/Tensor-Train-Neural-Network., Comment: Accepted to ICASSP 2020. Update reproducible code
Published: 2020

Catalog

Books, media, physical & digital resources

See catalog results

Searchworks

Select search scope, currently: Articles Catalog books, media & more in Jio Institute collections Articles journal articles & other e-resources

Search

Search Constraints

Refine your results

Search Limiters

Topic

Publication Year Range

Publication Type

Database

58 results on '"HUCK, P"'

Search Results

Catalog

Select search scope, currently: Articles

Catalog

books, media & more in Jio Institute collections

Articles

journal articles & other e-resources