Joint Turn and Dialogue level User Satisfaction Estimation on Multi-Domain Conversations
- Authors
- Josep Valls-Vargas, Lazaros Polymenakos, Spyros Matsoukas, Aditya Tiwari, and Praveen Kumar Bodigutla
- Subjects
- Computer Science - Machine Learning (cs.LG); Computer Science - Artificial Intelligence (cs.AI); Computer Science - Computation and Language (cs.CL)
- Abstract
Dialogue-level quality estimation is vital for optimizing data-driven dialogue management. Current automated methods to estimate turn- and dialogue-level user satisfaction employ hand-crafted features and rely on complex annotation schemes, which reduce the generalizability of the trained models. We propose a novel user satisfaction estimation approach that minimizes an adaptive multi-task loss function in order to jointly predict turn-level Response Quality labels provided by experts and explicit dialogue-level ratings provided by end users. The proposed BiLSTM-based deep neural network model automatically weighs each turn's contribution towards the estimated dialogue-level rating, implicitly encodes temporal dependencies, and removes the need to hand-craft features. On dialogues sampled from 28 Alexa domains, two dialogue systems, and three user groups, the joint dialogue-level satisfaction estimation model achieved up to an absolute 27% (0.43 → 0.70) and 7% (0.63 → 0.70) improvement in linear correlation performance over baseline deep neural network and benchmark Gradient Boosting regression models, respectively.
- Published
- Findings of EMNLP, 2020
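
For readers who want a concrete picture of the joint model the abstract describes, below is a minimal PyTorch sketch, not the authors' released implementation. The BiLSTM encoder, the attention weights over turns, the per-turn Response Quality head, the dialogue-level rating head, and the adaptive multi-task loss are all illustrative assumptions: the uncertainty-weighted loss follows Kendall et al. (2018) and may differ from the paper's exact adaptive formulation, and all dimensions, class counts, and names are placeholders.

```python
# Hypothetical sketch of a joint turn/dialogue-level satisfaction model.
# All architectural details here are assumptions, not the paper's code.
import torch
import torch.nn as nn


class JointSatisfactionEstimator(nn.Module):
    def __init__(self, turn_feat_dim: int, hidden_dim: int = 64, num_rq_classes: int = 5):
        super().__init__()
        # BiLSTM encodes the sequence of turn representations, implicitly
        # capturing temporal dependencies across the dialogue.
        self.encoder = nn.LSTM(turn_feat_dim, hidden_dim,
                               batch_first=True, bidirectional=True)
        # Turn-level head: predicts expert Response Quality labels per turn.
        self.rq_head = nn.Linear(2 * hidden_dim, num_rq_classes)
        # Attention scores weigh each turn's contribution to the dialogue rating.
        self.attn = nn.Linear(2 * hidden_dim, 1)
        # Dialogue-level head: regresses the explicit end-user rating.
        self.rating_head = nn.Linear(2 * hidden_dim, 1)
        # Learned log-variances for adaptive multi-task loss weighting
        # (uncertainty weighting in the style of Kendall et al., 2018;
        # an assumption, the paper's adaptive loss may differ).
        self.log_var = nn.Parameter(torch.zeros(2))

    def forward(self, turns):                           # turns: (B, T, turn_feat_dim)
        h, _ = self.encoder(turns)                      # (B, T, 2*hidden_dim)
        rq_logits = self.rq_head(h)                     # (B, T, num_rq_classes)
        weights = torch.softmax(self.attn(h), dim=1)    # (B, T, 1), sums to 1 over turns
        context = (weights * h).sum(dim=1)              # (B, 2*hidden_dim)
        rating = self.rating_head(context).squeeze(-1)  # (B,)
        return rq_logits, rating

    def loss(self, rq_logits, rq_labels, rating_pred, rating_true):
        # Joint objective: turn-level classification plus dialogue-level
        # regression, each scaled by a learned task uncertainty.
        ce = nn.functional.cross_entropy(rq_logits.flatten(0, 1), rq_labels.flatten())
        mse = nn.functional.mse_loss(rating_pred, rating_true)
        return (torch.exp(-self.log_var[0]) * ce + self.log_var[0]
                + torch.exp(-self.log_var[1]) * mse + self.log_var[1])


# Example usage with random data: batch of 2 dialogues, 6 turns each,
# 32-dimensional turn features (all sizes are placeholders).
model = JointSatisfactionEstimator(turn_feat_dim=32)
turns = torch.randn(2, 6, 32)
rq_labels = torch.randint(0, 5, (2, 6))
ratings = torch.rand(2) * 4 + 1          # user ratings on an assumed 1-5 scale
rq_logits, rating_pred = model(turns)
loss = model.loss(rq_logits, rq_labels, rating_pred, ratings)
loss.backward()
```

The attention weights make the dialogue-level prediction an explicit weighted sum over turn encodings, which is one plausible way to realize the abstract's claim that the model "automatically weighs each turn's contribution" to the estimated rating.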