Author: "Takaaki Hori" / Topic: computer.software_genre - Searchworks@Jio Institute Digital Library Search Results

1. Overview of the sixth dialog system technology challenge: DSTC6

Author: Y-Lan Boureau, Ryuichiro Higashinaka, Michimasa Inaba, Julien Perez, Chiori Hori, Yuiko Tsunomori, Takaaki Hori, Seokhwan Kim, Koichiro Yoshino, and Tetsuro Takahashi
Subjects: Class (computer programming), Goal orientation, Computer science, media_common.quotation_subject, Natural language generation, 020206 networking & telecommunications, 02 engineering and technology, computer.software_genre, 01 natural sciences, Theoretical Computer Science, Domain (software engineering), Task (project management), Human-Computer Interaction, Human–computer interaction, 0103 physical sciences, 0202 electrical engineering, electronic engineering, information engineering, Conversation, Dialog box, Dialog system, 010301 acoustics, computer, Software, media_common
Abstract: This paper describes the experimental setups and the evaluation results of the sixth Dialog System Technology Challenges (DSTC6) aiming to develop end-to-end dialogue systems. Neural network models have become a recent focus of investigation in dialogue technologies. Previous models required training data to be manually annotated with word meanings and dialogue states, but end-to-end neural network dialogue systems learn to directly output natural-language system responses without needing training data to be manually annotated. Thus, this approach allows us to scale up the size of training data and cover more dialog domains. In addition, dialogue systems require a meta-function to avoid deploying inappropriate responses generated by themselves. To address these issues, DSTC6 consists of three tracks: (1) End-to-End Goal Oriented Dialogue Learning to select system responses, (2) End-to-End Conversation Modeling to generate system responses using Natural Language Generation (NLG), and (3) Dialogue Breakdown Detection. Since each domain has different issues to be addressed to develop dialogue systems, we targeted restaurant retrieval dialogues to fill slot-values in Track 1, customer services on Twitter combining goal-oriented dialogues and ChitChat in Track 2, and human-machine dialogue data for ChitChat in Track 3. DSTC6 had 141 people declaring their interest, and 23 teams submitted their final results. 18 scientific papers were presented in the wrap-up workshop. We find that blending end-to-end trainable models with meaningful prior knowledge performs best for restaurant retrieval in Track 1. Indeed, Hybrid Code Network and Memory Network were the best models for this task. In Track 2, 78.5% of the system responses automatically generated by the best system were rated better than acceptable by humans, which is 89% of the number of human responses rated in the same class. In Track 3, the dialogue breakdown detection technologies performed as well as human agreement on both the English and Japanese datasets.
Published: 2019
Full Text: View/download PDF

2. Adversarial training and decoding strategies for end-to-end neural conversation models

Author: John R. Hershey, Bret Harsham, Wen Wang, Koji Yusuke, Takaaki Hori, and Chiori Hori
Subjects: Computer science, media_common.quotation_subject, 02 engineering and technology, Machine learning, computer.software_genre, 01 natural sciences, Theoretical Computer Science, Task (project management), Adversarial system, End-to-end principle, 0103 physical sciences, 0202 electrical engineering, electronic engineering, information engineering, Conversation, Dialog system, Dialog box, Set (psychology), 010301 acoustics, media_common, business.industry, 020206 networking & telecommunications, Human-Computer Interaction, Artificial intelligence, business, computer, Software, Decoding methods
Abstract: This paper presents adversarial training and decoding methods for neural conversation models that can generate natural responses given dialog contexts. In our prior work, we built several end-to-end conversation systems for the 6th Dialog System Technology Challenges (DSTC6) Twitter help-desk dialog task. These systems included novel extensions of sequence adversarial training, example-based response extraction, and Minimum Bayes-Risk based system combination. In DSTC6, our systems achieved the best performance in most objective measures such as BLEU and METEOR scores and decent performance in a subjective measure based on human rating. In this paper, we provide a complete set of our experiments for DSTC6 and further extend the training and decoding strategies with a stronger focus on improving the subjective measure, where we combine responses of three adversarial models. Experimental results demonstrate that the extended methods improve the human rating score and outperform the best score in DSTC6.
Published: 2019
Full Text: View/download PDF

3. Capturing Multi-Resolution Context by Dilated Self-Attention

Author: Jonathan Le Roux, Niko Moritz, and Takaaki Hori
Subjects: FOS: Computer and information sciences, Signal processing, Computer Science - Machine Learning, Sound (cs.SD), Artificial neural network, Machine translation, Computational complexity theory, Computer science, Speech recognition, Pooling, Context (language use), computer.software_genre, Speech processing, Computer Science - Sound, Machine Learning (cs.LG), Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, Dilation (morphology), computer, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Self-attention has become an important and widely used neural network component that helped to establish new state-of-the-art results for various applications, such as machine translation and automatic speech recognition (ASR). However, the computational complexity of self-attention grows quadratically with the input sequence length. This can be particularly problematic for applications such as ASR, where an input sequence generated from an utterance can be relatively long. In this work, we propose a combination of restricted self-attention and a dilation mechanism, which we refer to as dilated self-attention. The restricted self-attention allows attention to neighboring frames of the query at a high resolution, and the dilation mechanism summarizes distant information to allow attending to it with a lower resolution. Different methods for summarizing distant frames are studied, such as subsampling, mean-pooling, and attention-based pooling. ASR results demonstrate substantial improvements compared to restricted self-attention alone, achieving similar results compared to full-sequence based self-attention with a fraction of the computational costs. Comment: In Proc. ICASSP 2021
Published: 2021
Full Text: View/download PDF
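
Illustration for record 3: the abstract's idea of combining high-resolution local attention with low-resolution summaries of the sequence can be sketched as below. This is only a minimal sketch under stated assumptions, not the authors' implementation: it uses mean-pooling as the summarizer, reuses the raw frames as queries, keys, and values (no learned projections), and summarizes every chunk, including chunks that overlap the local window. The names `mean_pool`, `dilated_self_attention`, `window`, and `chunk` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def mean_pool(frames, chunk):
    """Summarize frames by averaging non-overlapping chunks of `chunk` frames."""
    pooled = []
    for start in range(0, len(frames), chunk):
        block = frames[start:start + chunk]
        dim = len(block[0])
        pooled.append([sum(f[d] for f in block) / len(block) for d in range(dim)])
    return pooled


def dilated_self_attention(frames, window=2, chunk=4):
    """Attend to neighbors within +/- `window` frames at full resolution,
    plus mean-pooled chunk summaries at a lower resolution.

    Assumption: frames serve directly as queries, keys, and values.
    """
    summaries = mean_pool(frames, chunk)
    scale = math.sqrt(len(frames[0]))
    outputs = []
    for t, query in enumerate(frames):
        lo = max(0, t - window)
        hi = min(len(frames), t + window + 1)
        keys = frames[lo:hi] + summaries  # local frames + coarse summaries
        weights = softmax([dot(query, k) / scale for k in keys])
        dim = len(query)
        outputs.append(
            [sum(w * k[d] for w, k in zip(weights, keys)) for d in range(dim)]
        )
    return outputs
```

With T frames, each query here scores at most 2 * window + 1 local frames plus ceil(T / chunk) summaries, rather than all T frames as in full-sequence self-attention.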

4. The 2020 ESPnet update: new features, broadened applications, performance improvements, and future plans

Author: Wen-Chin Huang, Shinji Watanabe, Xuankai Chang, Jing Shi, Hirofumi Inaguma, Naoyuki Kamo, Shigeki Karita, Takaaki Hori, Pengcheng Guo, Yosuke Higuchi, Aswin Shanmugam Subramanian, Wangyou Zhang, Tomoki Hayashi, Florian Boyer, and Chenda Li
Subjects: Beamforming, FOS: Computer and information sciences, Sound (cs.SD), Multimedia, Computer science, media_common.quotation_subject, Speech synthesis, computer.software_genre, Speech processing, Computer Science - Sound, Speech enhancement, Sequence modeling, Audio and Speech Processing (eess.AS), Speech translation, FOS: Electrical engineering, electronic engineering, information engineering, Conversation, computer, Transformer (machine learning model), media_common, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: This paper describes the recent development of ESPnet (https://github.com/espnet/espnet), an end-to-end speech processing toolkit. This project was initiated in December 2017 to mainly deal with end-to-end speech recognition experiments based on sequence-to-sequence modeling. The project has grown rapidly and now covers a wide range of speech processing applications. ESPnet now also includes text-to-speech (TTS), voice conversion (VC), speech translation (ST), and speech enhancement (SE) with support for beamforming, speech separation, denoising, and dereverberation. All applications are trained in an end-to-end manner, thanks to the generic sequence-to-sequence modeling properties, and they can be further integrated and jointly optimized. Also, ESPnet provides reproducible all-in-one recipes for these applications with state-of-the-art performance in various benchmarks by incorporating transformer, advanced data augmentation, and conformer. This project aims to provide up-to-date speech processing experience to the community so that researchers in academia and various industry scales can develop their technologies collaboratively.
Published: 2020
Full Text: View/download PDF

5. A Comparative Study on Transformer vs RNN in Speech Applications

Author: Takenori Yoshimura, Shinji Watanabe, Ryuichi Yamamoto, Wangyou Zhang, Shigeki Karita, Hirofumi Inaguma, Nelson Yalta, Takaaki Hori, Nanxin Chen, Xiaofei Wang, Tomoki Hayashi, Masao Someki, and Ziyan Jiang
Subjects: FOS: Computer and information sciences, Sound (cs.SD), Computer Science - Computation and Language, Machine translation, Computer science, Speech recognition, Speech synthesis, Speech applications, computer.software_genre, Speech processing, Computer Science - Sound, Recurrent neural network, Open source, Audio and Speech Processing (eess.AS), Speech translation, FOS: Electrical engineering, electronic engineering, information engineering, computer, Computation and Language (cs.CL), Transformer (machine learning model), Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Sequence-to-sequence models have been widely used in end-to-end speech processing, for example, automatic speech recognition (ASR), speech translation (ST), and text-to-speech (TTS). This paper focuses on an emergent sequence-to-sequence model called Transformer, which achieves state-of-the-art performance in neural machine translation and other natural language processing applications. We undertook intensive studies in which we experimentally compared and analyzed Transformer and conventional recurrent neural networks (RNN) in a total of 15 ASR, one multilingual ASR, one ST, and two TTS benchmarks. Our experiments revealed various training tips and significant performance benefits obtained with Transformer for each task, including the surprising superiority of Transformer in 13/15 ASR benchmarks in comparison with RNN. We are preparing to release Kaldi-style reproducible recipes using open source and publicly available datasets for all the ASR, ST, and TTS tasks so that the community can build on our exciting outcomes. Comment: Accepted at ASRU 2019
Published: 2019

6. Back-Translation-Style Data Augmentation for End-to-End ASR

Author: Ramón Fernandez Astudillo, Kazuya Takeda, Shinji Watanabe, Tomoki Toda, Yu Zhang, Tomoki Hayashi, and Takaaki Hori
Subjects: FOS: Computer and information sciences, Paired Data, Computer Science - Computation and Language, Machine translation, Computer science, Speech recognition, Feature extraction, 010501 environmental sciences, computer.software_genre, 01 natural sciences, Field (computer science), Data modeling, 030507 speech-language pathology & audiology, 03 medical and health sciences, 0305 other medical science, Hidden Markov model, Encoder, computer, Computation and Language (cs.CL), Decoding methods, 0105 earth and related environmental sciences
Abstract: In this paper we propose a novel data augmentation method for attention-based end-to-end automatic speech recognition (E2E-ASR), utilizing a large amount of text which is not paired with speech signals. Inspired by the back-translation technique proposed in the field of machine translation, we build a neural text-to-encoder model which predicts a sequence of hidden states extracted by a pre-trained E2E-ASR encoder from a sequence of characters. By using hidden states as a target instead of acoustic features, it is possible to achieve faster attention learning and reduce computational cost, thanks to sub-sampling in the E2E-ASR encoder. Unlike acoustic features, the hidden states also avoid the need to model speaker dependencies. After training, the text-to-encoder model generates the hidden states from a large amount of unpaired text, and the E2E-ASR decoder is then retrained using the generated hidden states as additional training data. Experimental evaluation using the LibriSpeech dataset demonstrates that our proposed method improves ASR performance and reduces the number of unknown words without the need for paired data.
Published: 2018
Full Text: View/download PDF

7. Dialog state tracking with attention-based sequence-to-sequence learning

Author: Takaaki Hori, Bret Harsham, Jonathan Le Roux, Shinji Watanabe, Koji Yusuke, Chiori Hori, John R. Hershey, Yi Jing, Takeyuki Aikawa, Zhaocheng Zhu, and Hai Wang
Subjects: BitTorrent tracker, business.industry, Computer science, Speech recognition, Frame (networking), Tracking system, 02 engineering and technology, 010501 environmental sciences, computer.software_genre, 01 natural sciences, Set (abstract data type), 020204 information systems, Test set, 0202 electrical engineering, electronic engineering, information engineering, Artificial intelligence, Sequence learning, Pattern matching, Dialog box, business, computer, Natural language processing, 0105 earth and related environmental sciences
Abstract: We present an advanced dialog state tracking system designed for the 5th Dialog State Tracking Challenge (DSTC5). The main task of DSTC5 is to track the dialog state in a human-human dialog. For each utterance, the tracker emits a frame of slot-value pairs considering the full history of the dialog up to the current turn. Our system includes an encoder-decoder architecture with an attention mechanism to map an input word sequence to a set of semantic labels, i.e., slot-value pairs. This handles the problem of the unknown alignment between the utterances and the labels. By combining the attention-based tracker with rule-based trackers elaborated for English and Chinese, the F-score for the development set improved from 0.475 to 0.507 compared to the rule-only trackers. Moreover, we achieved 0.517 F-score by refining the combination strategy based on the topic and slot level performance of each tracker. In this paper, we also validate the efficacy of each technique and report the test set results submitted to the challenge.
Published: 2016
Full Text: View/download PDF
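
Illustration for record 7: the abstract reports F-scores for frames of slot-value pairs. A minimal sketch of how such a score could be computed for one predicted frame against a reference frame follows. It assumes exact matching of (slot, value) pairs and is not the official DSTC5 scoring script; the function name and the slot names in the usage note are hypothetical.

```python
def frame_f_score(predicted, reference):
    """Return (precision, recall, F-score) for two frames of slot-value pairs.

    Assumption: a pair counts as correct only if both slot and value
    match a reference pair exactly.
    """
    predicted, reference = set(predicted), set(reference)
    correct = len(predicted & reference)
    if correct == 0:
        return 0.0, 0.0, 0.0
    precision = correct / len(predicted)
    recall = correct / len(reference)
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score
```

For example, a predicted frame {("INFO", "Pricerange"), ("PLACE", "Chinatown")} scored against a reference frame {("INFO", "Pricerange"), ("PLACE", "Orchard Road")} has one correct pair out of two on each side, so precision, recall, and F-score are all 0.5.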

8. Automated Structure Discovery and Parameter Tuning of Neural Network Language Model based on Evolution Strategy

Author: Tomohiro Tanaka, Takaaki Hori, Shinji Watanabe, Kevin Duh, Takahiro Shinozaki, and Takafumi Moriya
Subjects: Artificial neural network, Computer science, business.industry, Time delay neural network, Evolutionary algorithm, 02 engineering and technology, 010501 environmental sciences, Machine learning, computer.software_genre, 01 natural sciences, Multi-objective optimization, Recurrent neural network, 0202 electrical engineering, electronic engineering, information engineering, Feedforward neural network, 020201 artificial intelligence & image processing, Artificial intelligence, Language model, Evolution strategy, business, computer, 0105 earth and related environmental sciences
Abstract: Long short-term memory (LSTM) recurrent neural network based language models are known to improve speech recognition performance. However, significant effort is required to optimize network structures and training configurations. In this study, we automate the development process using evolutionary algorithms. In particular, we apply the covariance matrix adaptation-evolution strategy (CMA-ES), which has demonstrated robustness in other black-box hyper-parameter optimization problems. By flexibly allowing optimization of various meta-parameters including layer-wise unit types, our method automatically finds a configuration that gives improved recognition performance. Further, by using a Pareto-based multi-objective CMA-ES, both WER and computational time were reduced jointly: after 10 generations, relative WER and computational time reductions for decoding were 4.1% and 22.7% respectively, compared to an initial baseline system whose WER was 8.7%.
Published: 2016
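
Illustration for record 8: the abstract quotes relative reductions and a Pareto-based selection over two objectives (WER and decoding time). The sketch below shows the arithmetic behind a relative reduction and a plain Pareto-front filter. It is not CMA-ES itself and not the authors' code; the function names are hypothetical, and the candidate list in the usage note is made-up illustrative data, except for the 8.7% baseline and 4.1% reduction taken from the abstract.

```python
def apply_relative_reduction(baseline, reduction_percent):
    """Value obtained after reducing `baseline` by a relative percentage."""
    return baseline * (1.0 - reduction_percent / 100.0)


def dominates(a, b):
    """True if `a` is no worse than `b` in every objective and strictly
    better in at least one (all objectives are minimized)."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def pareto_front(points):
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]
```

A 4.1% relative reduction from a WER of 8.7% gives `apply_relative_reduction(8.7, 4.1)`, roughly 8.34%. For hypothetical (WER, relative time) candidates `[(8.7, 1.0), (8.3, 0.8), (8.5, 0.7), (9.0, 0.9)]`, `pareto_front` keeps `(8.3, 0.8)` and `(8.5, 0.7)`, since each of the other two is worse than `(8.3, 0.8)` in both objectives.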

9. Joint CTC-Attention based End-to-End Speech Recognition using Multi-task Learning

Author: Shinji Watanabe, Takaaki Hori, and Suyoun Kim
Subjects: FOS: Computer and information sciences, Voice activity detection, Computer Science - Computation and Language, Noise measurement, business.industry, Computer science, Speech recognition, Word error rate, Multi-task learning, 020206 networking & telecommunications, 02 engineering and technology, Machine learning, computer.software_genre, 030507 speech-language pathology & audiology, 03 medical and health sciences, Connectionism, Robustness (computer science), 0202 electrical engineering, electronic engineering, information engineering, Artificial intelligence, 0305 other medical science, business, Hidden Markov model, Computation and Language (cs.CL), computer, Decoding methods
Abstract: Recently, there has been an increasing interest in end-to-end speech recognition that directly transcribes speech to text without any predefined alignments. One approach is the attention-based encoder-decoder framework that learns a mapping between variable-length input and output sequences in one step using a purely data-driven method. The attention model has often been shown to improve the performance over another end-to-end approach, the Connectionist Temporal Classification (CTC), mainly because it explicitly uses the history of the target character without any conditional independence assumptions. However, we observed that the attention model performs poorly in noisy conditions and is hard to train in the initial training stage with long input sequences. This is because the attention model is too flexible to predict proper alignments in such cases due to the lack of left-to-right constraints as used in CTC. This paper presents a novel method for end-to-end speech recognition to improve robustness and achieve fast convergence by using a joint CTC-attention model within the multi-task learning framework, thereby mitigating the alignment issue. An experiment on the WSJ and CHiME-4 tasks demonstrates its advantages over both the CTC and attention-based encoder-decoder baselines, showing 5.4-14.6% relative improvements in Character Error Rate (CER).
Published: 2016
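
Illustration for record 9: a multi-task objective of this kind is commonly written as a weighted interpolation of the two losses. The sketch below assumes that form; the abstract itself does not state the formula or any weight, so the function name, the `ctc_weight` parameter, and its default value are hypothetical.

```python
def joint_ctc_attention_loss(ctc_loss, attention_loss, ctc_weight=0.2):
    """Interpolate a CTC loss and an attention (encoder-decoder) loss.

    Assumption: loss = w * L_ctc + (1 - w) * L_att with 0 <= w <= 1.
    w = 0 recovers the attention-only objective; w = 1 recovers CTC only.
    """
    if not 0.0 <= ctc_weight <= 1.0:
        raise ValueError("ctc_weight must be in [0, 1]")
    return ctc_weight * ctc_loss + (1.0 - ctc_weight) * attention_loss
```

In a real system both losses would be computed from the same shared encoder output for each training batch, so the CTC term's left-to-right constraint can regularize the alignments that the attention decoder learns.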

10. Efficient training of discriminative language models by sample selection

Author: Atsushi Nakamura, Takaaki Hori, and Takanobu Oba
Subjects: Sample selection, Linguistics and Language, Computer science, business.industry, Communication, Computation, Machine learning, computer.software_genre, Language and Linguistics, Computer Science Applications, Discriminative model, Modeling and Simulation, Memory footprint, Computer Vision and Pattern Recognition, Artificial intelligence, Language model, Error detection and correction, business, computer, Software, Sentence, Utterance
Abstract: This paper focuses on discriminative language models (DLMs) for large vocabulary speech recognition tasks. To train such models, we usually use a large number of hypotheses generated for each utterance by a speech recognizer, namely an n-best list or a lattice. Since the data size is large, we usually need a high-end machine or a large-scale distributed computation system consisting of many computers for model training. However, it is still unclear whether or not such a large number of sentence hypotheses are necessary. Furthermore, we do not know which kinds of sentences are necessary. In this paper, we show that we can generate a high performance model using small subsets of the n-best lists by choosing samples properly, i.e., we describe a sample selection method for DLMs. Sample selection reduces the memory footprint needed for holding training samples and allows us to train models in a standard machine. Furthermore, it enables us to generate a highly accurate model using various types of features. Specifically, experimental results show that even training using two samples in each list can provide an accurate model with a small memory footprint.
Published: 2012
Full Text: View/download PDF

11. Low-Latency Real-Time Meeting Recognition and Understanding Using Distant Microphones and Omni-Directional Camera

Author: Shoko Araki, Masakiyo Fujimoto, Tomohiro Nakatani, Shinji Watanabe, Takanobu Oba, Takuya Yoshioka, Takaaki Hori, Atsunori Ogawa, Dan Mikami, Junji Yamato, Keisuke Kinoshita, Atsushi Nakamura, and Kazuhiro Otsuka
Subjects: Microphone array, Acoustics and Ultrasonics, Computer science, Speech recognition, Speech processing, Speaker recognition, computer.software_genre, Speaker diarisation, Speech enhancement, Electrical and Electronic Engineering, Latency (engineering), Transcription (software), Audio signal processing, computer
Abstract: This paper presents our real-time meeting analyzer for monitoring conversations in an ongoing group meeting. The goal of the system is to recognize automatically “who is speaking what” in an online manner for meeting assistance. Our system continuously captures the utterances and face poses of each speaker using a microphone array and an omni-directional camera positioned at the center of the meeting table. Through a series of advanced audio processing operations, an overlapping speech signal is enhanced and the components are separated into individual speaker's channels. Then the utterances are sequentially transcribed by our speech recognizer with low latency. In parallel with speech recognition, the activity of each participant (e.g., speaking, laughing, watching someone) and the circumstances of the meeting (e.g., topic, activeness, casualness) are detected and displayed on a browser together with the transcripts. In this paper, we describe our techniques and our attempt to achieve the low-latency monitoring of meetings, and we show our experimental results for real-time meeting transcription.
Published: 2012
Full Text: View/download PDF

12. Topic tracking language model for speech recognition

Author: Tomoharu Iwata, Takaaki Hori, Atsushi Sako, Shinji Watanabe, and Yasuo Ariki
Subjects: Topic model, business.industry, Computer science, Speech recognition, Speech corpus, computer.software_genre, Theoretical Computer Science, Human-Computer Interaction, Cache language model, Speaking style, Tracking (education), Artificial intelligence, Language model, business, Adaptation (computer science), computer, Software, Natural language processing
Abstract: In a real environment, acoustic and language features often vary depending on the speakers, speaking styles and topic changes. To accommodate these changes, speech recognition approaches that include the incremental tracking of changing environments have attracted attention. This paper proposes a topic tracking language model that can adaptively track changes in topics based on current text information and previously estimated topic models in an on-line manner. The proposed model is applied to language model adaptation in speech recognition. We use the MIT OpenCourseWare corpus and Corpus of Spontaneous Japanese in speech recognition experiments, and show the effectiveness of the proposed method.
Published: 2011
Full Text: View/download PDF

13. Improved Sequential Dependency Analysis Integrating Labeling-Based Sentence Boundary Detection

Author: Takanobu Oba, Takaaki Hori, and Atsushi Nakamura
Subjects: Conditional random field, Boundary detection, Sentence boundary disambiguation, Dependency (UML), Computer science, business.industry, Speech recognition, computer.software_genre, Sequential dependency, Artificial Intelligence, Hardware and Architecture, Computer Vision and Pattern Recognition, Artificial intelligence, Electrical and Electronic Engineering, Element (category theory), business, computer, Software, Sentence, Natural language processing
Abstract: A dependency structure interprets modification relationships between words or phrases and is recognized as an important element in semantic information analysis. With the conventional approaches for extracting this dependency structure, it is assumed that the complete sentence is known before the analysis starts. For spontaneous speech data, however, this assumption is not necessarily correct since sentence boundaries are not marked in the data. Although sentence boundaries can be detected before dependency analysis, this cascaded implementation is not suitable for online processing since it delays the responses of the application. To solve these problems, we proposed a sequential dependency analysis (SDA) method for online spontaneous speech processing, which enabled us to analyze incomplete sentences sequentially and detect sentence boundaries simultaneously. In this paper, we propose an improved SDA integrating a labeling-based sentence boundary detection (SntBD) technique based on Conditional Random Fields (CRFs). In the new method, we use CRF for soft decision of sentence boundaries and combine it with SDA to retain its online framework. Since CRF-based SntBD yields better estimates of sentence boundaries, SDA can provide better results in which the dependency structure and sentence boundaries are consistent. Experimental results using spontaneous lecture speech from the Corpus of Spontaneous Japanese show that our improved SDA outperforms the original SDA with SntBD accuracy providing better dependency analysis results.
Published: 2010
Full Text: View/download PDF

14. Sequential dependency analysis for online spontaneous speech processing

Author: Takaaki Hori, Atsushi Nakamura, and Takanobu Oba
Subjects: Linguistics and Language, Phrase, Parsing, Dependency (UML), Computer science, Communication, Speech recognition, Principle of maximum entropy, Semantic analysis (machine learning), Speech processing, computer.software_genre, Language and Linguistics, Edge detection, Computer Science Applications, Modeling and Simulation, Computer Vision and Pattern Recognition, computer, Software, Sentence
Abstract: A dependency structure interprets modification relationships between words and is often recognized as an important element in semantic information analysis. With conventional approaches for extracting this dependency structure, it is assumed that the complete sentence is known before the analysis starts. For spontaneous speech data, however, this assumption is not necessarily correct since sentence boundaries are not marked in the data and it is not easy to detect them correctly. Although sentence boundaries can be detected before dependency analysis, this cascaded implementation is not suitable for online processing since it delays the responses of the application. In this paper, we propose a sequential dependency analysis method for online spontaneous speech processing. The proposed method enables us to analyze incomplete sentences sequentially and detect sentence boundaries simultaneously. The analyzer can be trained using parsed data based on the maximum entropy principle. Experimental results using spontaneous lecture speech from the Corpus of Spontaneous Japanese show that our proposed method achieves online processing with an accuracy equivalent to that of offline processing in which boundary detection and dependency analysis are cascaded.
Published: 2008
Full Text: View/download PDF

15. Efficient WFST-Based One-Pass Decoding With On-The-Fly Hypothesis Rescoring in Extremely Large Vocabulary Continuous Speech Recognition

Author: Takaaki Hori, Atsushi Nakamura, Chiori Hori, and Yasuhiro Minami
Subjects: Vocabulary, Finite-state machine, Acoustics and Ultrasonics, Computer science, business.industry, Computation, media_common.quotation_subject, Speech recognition, Speech coding, Speech processing, computer.software_genre, Viterbi decoder, Search algorithm, Artificial intelligence, Electrical and Electronic Engineering, business, computer, Decoding methods, Natural language processing, media_common
Abstract: This paper proposes a novel one-pass search algorithm with on-the-fly composition of weighted finite-state transducers (WFSTs) for large-vocabulary continuous-speech recognition. In the standard search method with on-the-fly composition, two or more WFSTs are composed during decoding, and a Viterbi search is performed based on the composed search space. With this new method, a Viterbi search is performed based on the first of the two WFSTs. The second WFST is only used to rescore the hypotheses generated during the search. Since this rescoring is very efficient, the total amount of computation required by the new method is almost the same as when using only the first WFST. In a 65k-word vocabulary spontaneous lecture speech transcription task, our proposed method significantly outperformed the standard search method. Furthermore, our method was faster than decoding with a single fully composed and optimized WFST, while using only 38% of the memory required for decoding with the single WFST. Finally, we have achieved high-accuracy one-pass real-time speech recognition with an extremely large vocabulary of 1.8 million words.
Published: 2007
Full Text: View/download PDF

16. Research frontier - Advanced computational models and learning theories for spoken language processing

Author: Erik McDermott, Shinji Watanabe, Atsushi Nakamura, Shigeru Katagiri, and Takaaki Hori
Subjects: Computational model, Vocabulary, business.industry, Computer science, media_common.quotation_subject, Speech recognition, computer.software_genre, Theoretical Computer Science, Discriminative model, Artificial Intelligence, Learning theory, Speech analytics, Artificial intelligence, business, Hidden Markov model, computer, Natural language, Humanoid robot, Natural language processing, media_common
Abstract: Recent developments in research on humanoid robots and interactive agents have highlighted the importance of, and expectations for, automatic speech recognition (ASR) as a means of endowing such an agent with the ability to communicate via speech. This article describes some of the approaches pursued at NTT Communication Science Laboratories (NTT-CSL) for dealing with such challenges in ASR. In particular, we focus on methods for fast search through finite-state machines, Bayesian solutions for modeling and classification of speech, and a discriminative training approach for minimizing errors in large vocabulary continuous speech recognition.
Published: 2006
Full Text: View/download PDF

17. Restructuring output layers of deep neural networks using minimum risk parameter clustering

Author: Jun Suzuki, Yotaro Kubo, Atsushi Nakamura, and Takaaki Hori
Subjects: Restructuring, business.industry, Computer science, Minimum risk, Deep neural networks, Data mining, Artificial intelligence, computer.software_genre, Machine learning, business, Cluster analysis, computer
Published: 2014
Full Text: View/download PDF

18. Evolutionary optimization of long short-term memory neural network language model

Author: Tomohiro Tanaka, Takafumi Moriya, Shinji Watanabe, Takaaki Hori, Kevin Duh, and Takahiro Shinozaki
Subjects: Acoustics and Ultrasonics, Artificial neural network, Process (engineering), business.industry, Time delay neural network, Computer science, Machine learning, computer.software_genre, Recurrent neural network, Arts and Humanities (miscellaneous), Benchmark (computing), Artificial intelligence, Language model, CMA-ES, Evolution strategy, business, computer
Abstract: Recurrent neural network language models (RNN-LMs) have recently been shown to outperform conventional N-gram based language models in various speech recognition tasks. In particular, long short-term memory recurrent neural network language models (LSTM-LMs) give superior performance owing to their ability to better model word history information. However, LSTM-LMs have complex network structures and training configurations, which are meta-parameters that need to be well tuned to achieve state-of-the-art performance. The tuning is usually performed manually by humans, but it is not easy because it requires expert knowledge and intensive effort with many trials. In this study, we apply covariance matrix adaptation evolution strategy (CMA-ES) and automate the tuning process. CMA-ES is one of the most efficient global optimization techniques and has demonstrated superior performance in various benchmark tasks. In the experiments, the meta-parameters subject to the tuning included unit types at e...
Published: 2016
Full Text: View/download PDF

19. Handling uncertain observations in unsupervised topic-mixture language model adaptation

Author: James Glass, Ekapol Chuangsuwanich, Shinji Watanabe, Tomoharu Iwata, and Takaaki Hori
Subjects: Computer science, business.industry, Speech recognition, Extension (predicate logic), Machine learning, computer.software_genre, Latent Dirichlet allocation, symbols.namesake, Variable (computer science), ComputingMethodologies_PATTERNRECOGNITION, symbols, Selection (linguistics), Artificial intelligence, Language model, business, Hidden Markov model, Adaptation (computer science), computer, Interpolation
Abstract: We propose an extension to recent approaches in topic-mixture modeling, such as Latent Dirichlet Allocation and the Topic Tracking Model, for the purpose of unsupervised adaptation in speech recognition. Instead of using the 1-best input given by the speech recognizer, the proposed model takes a confusion network as input to alleviate recognition errors. We incorporate a selection variable which helps reweight the recognition output, thus creating a more accurate latent topic estimate. Compared to adapting based on just one recognition hypothesis, the proposed model shows WER improvements on two different tasks.
Published: 2012
Full Text: View/download PDF

20. Low-latency meeting recognition and understanding using distant microphones

Author: Dan Mikami, Takanobu Oba, Kazuhiro Otsuka, Shinji Watanabe, Atsushi Nakamura, Takaaki Hori, Marc Delcroix, Takuya Yoshioka, Atsunori Ogawa, Masakiyo Fujimoto, Keisuke Kinoshita, Junji Yamato, Tomohiro Nakatani, and Shoko Araki
Subjects: Speech enhancement, Microphone array, Multimedia, Computer science, Meeting analysis, Table (database), Session (computer science), Latency (engineering), Speaker recognition, computer.software_genre, computer
Abstract: In this demonstration, we present our real-time meeting analyzer for group meetings. By using the audio and visual information captured by a microphone array and an omni-directional camera at the center of a table, our system automatically recognizes “who speaks what to whom and when” in an online manner. We will show some demo videos and our meeting browser to present how our system works in a meeting situation. The technical details will also be discussed at the demo session.
Published: 2011
Full Text: View/download PDF

21. Real-time meeting recognition and understanding using distant microphones and omni-directional camera

Author: Keisuke Kinoshita, Atsunori Ogawa, Masakiyo Fujimoto, Junji Yamato, Tomohiro Nakatani, Kazuhiro Otsuka, Atsushi Nakamura, Takanobu Oba, Shinji Watanabe, Dan Mikami, Shoko Araki, Takaaki Hori, and Takuya Yoshioka
Subjects: Speaker diarisation, Speech enhancement, Microphone array, Omnidirectional camera, Computer science, Speech recognition, Noise (video), Transcription (software), Hidden Markov model, Audio signal processing, computer.software_genre, computer
Abstract: This paper presents our newly developed real-time meeting analyzer for monitoring conversations in an ongoing group meeting. The goal of the system is to automatically recognize “who is speaking what” in an online manner for meeting assistance. Our system continuously captures the utterances and the face pose of each speaker using a distant microphone array and an omni-directional camera at the center of the meeting table. Through a series of advanced audio processing operations, an overlapping speech signal is enhanced and the components are separated into individual speaker's channels. Then the utterances are sequentially transcribed by our speech recognizer with low latency. In parallel with speech recognition, the activity of each participant (e.g. speaking, laughing, watching someone) and the situation of the meeting (e.g. topic, activeness, casualness) are detected and displayed on a browser together with the transcripts. In this paper, we describe our techniques and our attempt to achieve the low-latency monitoring of meetings, and we show our experimental results for real-time meeting transcription.
Published: 2010
Full Text: View/download PDF

22. Application of topic tracking model to language model adaptation and meeting analysis

Author: Shinji Watanabe, Yasuo Ariki, Tomoharu Iwata, Atsushi Sako, and Takaaki Hori
Subjects: Topic model, Tracking model, Computer science, business.industry, Speech recognition, computer.software_genre, Data modeling, Meeting analysis, Artificial intelligence, Language model, Hidden Markov model, Adaptation (computer science), business, computer, Natural language processing, Word (computer architecture)
Abstract: In a real environment, acoustic and language features often vary depending on the speakers, speaking styles and topic changes. This paper focuses on changes in the language environment, and applies a topic tracking model to language model adaptation for speech recognition and topic word extraction for meeting analysis. The topic tracking model can adaptively track changes in topics based on current text information and previously estimated topic models in an online manner. The effectiveness of the proposed method is shown experimentally by the improvement in speech recognition performance achieved with the Corpus of Spontaneous Japanese and by providing appropriate topic information in an automatic meeting analyzer.
Published: 2010
Full Text: View/download PDF

23. Round-robin discrimination model for reranking ASR hypotheses

Author: Atsushi Nakamura, Takanobu Oba, and Takaaki Hori
Subjects: business.industry, Computer science, Speech recognition, Artificial intelligence, business, computer.software_genre, computer, Natural language processing
Published: 2010
Full Text: View/download PDF

24. A comparative study on methods of Weighted language model training for reranking lvcsr N-best hypotheses

Author: Atsushi Nakamura, Takaaki Hori, and Takanobu Oba
Subjects: Vocabulary, Boosting (machine learning), Computer science, business.industry, Estimation theory, media_common.quotation_subject, Word error rate, Machine learning, computer.software_genre, Discriminative model, Artificial intelligence, Language model, Hidden Markov model, business, computer, Natural language, media_common
Abstract: This paper focuses on discriminative n-gram language models for a large vocabulary speech recognition task. Specifically we compare three training methods, Reranking Boosting (ReBst), Minimum Error Rate Training (MERT) and the Weighted Global Log-Linear Model (W-GCLM). They have a mechanism for handling sample weights, which are useful for providing an accurate model and work as impact factors of hypotheses for training. W-GCLM is proposed in this paper. We discuss the relationship between the three methods by comparing their loss functions. We also compare them experimentally by reranking N-best hypotheses under several conditions. We show that MERT and W-GCLM are different types of expansion of ReBst and have different respective advantages. Our experimental results reveal that W-GCLM outperforms ReBst and whether MERT or W-GCLM is superior depends on the training and test conditions.
Published: 2010
Full Text: View/download PDF

25. Open-Vocabulary Spoken Utterance Retrieval using Confusion Networks

Author: Takaaki Hori, I.L. Hetherington, Timothy J. Hazen, and James Glass
Subjects: Vocabulary, business.industry, Computer science, Speech recognition, media_common.quotation_subject, InformationSystems_INFORMATIONSTORAGEANDRETRIEVAL, Search engine indexing, computer.software_genre, Speech processing, Robustness (computer science), Phone, medicine, Artificial intelligence, medicine.symptom, business, computer, Natural language processing, Utterance, Confusion, media_common
Abstract: This paper presents a novel approach to open-vocabulary spoken utterance retrieval using confusion networks. If out-of-vocabulary (OOV) words are present in queries and the corpus, word-based indexing will not be sufficient. For this problem, we apply phone confusion networks and combine them with word confusion networks. With this approach, we can generate a more compact index table that enables robust keyword matching compared with typical lattice-based methods. In the retrieval experiments with speech recordings in the MIT lecture corpus, our method using phone confusion networks outperformed lattice-based methods, especially for OOV queries.
Published: 2007
Full Text: View/download PDF
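
Illustrative note on record 25: the abstract above describes building an index table from confusion networks. The following is a minimal sketch of that general idea, not the authors' method. It indexes word slots only and handles single-word queries, whereas the paper combines phone and word confusion networks. The data layout (a list of slots, each mapping a candidate to its posterior) and the names `build_index` and `search` are assumptions made for illustration.

```python
from collections import defaultdict


def build_index(networks):
    """Build an inverted index over confusion-network slots.

    `networks` maps an utterance id to a list of slots; each slot is a dict
    from candidate word to posterior probability (hypothetical toy format).
    """
    index = defaultdict(list)
    for utt_id, network in networks.items():
        for position, slot in enumerate(network):
            for word, posterior in slot.items():
                # Every competing candidate is indexed, not just the 1-best.
                index[word].append((utt_id, position, posterior))
    return index


def search(index, query):
    """Score each utterance by the highest posterior of the query word."""
    scores = {}
    for utt_id, _position, posterior in index.get(query, []):
        scores[utt_id] = max(scores.get(utt_id, 0.0), posterior)
    return scores


networks = {
    "utt1": [{"open": 0.9, "oval": 0.1}, {"vocabulary": 0.7, "vocabularies": 0.3}],
    "utt2": [{"oval": 0.6, "over": 0.4}],
}
index = build_index(networks)
hits = search(index, "oval")
```

Because low-posterior alternatives are indexed too, a query can match an utterance whose 1-best transcript missed the word; ranking by posterior keeps such matches below confident ones.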

26. Sentence boundary detection using sequential dependency analysis combined with CRF-based chunking

Author: Takanobu Oba, Takaaki Hori, and Atsushi Nakamura
Subjects: Boundary detection, Sequential dependency, Computer science, business.industry, Speech recognition, Artificial intelligence, computer.software_genre, business, computer, Natural language processing, Sentence, Chunking (computing)
Published: 2006
Full Text: View/download PDF

27. An Extremely Large Vocabulary Approach to Named Entity Extraction from Speech

Author: Atsushi Nakamura and Takaaki Hori
Subjects: Text corpus, Vocabulary, Computer science, business.industry, Speech recognition, media_common.quotation_subject, InformationSystems_INFORMATIONSTORAGEANDRETRIEVAL, Feature extraction, Construct (python library), Speech processing, computer.software_genre, Lexicon, Language model, Artificial intelligence, business, computer, Natural language, Natural language processing, media_common
Abstract: This paper describes an approach to Named Entity (NE) extraction from speech data, in which an extremely large vocabulary lexicon including all NEs occurring in a large text corpus is used for Automatic Speech Recognition (ASR). Accordingly, NEs appear in the recognition results just as they are. Our approach is implemented by the following steps: (1) run an NE-tagger for a whole text corpus and make an NE-tagged corpus in which each NE is padded with its category, (2) construct a lexicon and a language model for ASR using the tagged corpus where each NE is considered as a regular word, and (3) run the speech recognizer in one pass. Although a very large vocabulary is necessary to ensure a high coverage of NEs, that is no longer a major problem since we recently achieved real-time extremely large vocabulary ASR using a WFST framework. In experiments on NE extraction from spoken queries for an open-domain question-answering system, our approach yielded higher F-measure values than a conventional approach.
Published: 2006
Full Text: View/download PDF
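
Illustrative note on record 27: steps (1) and (2) in the abstract above can be sketched as follows, assuming the NE tagger's output is already available as token spans. The `word/CATEGORY` token format, the underscore joining of multi-word entities, and the function names are hypothetical choices for this sketch. The paper's actual tagger, lexicon format, and language model training are not reproduced here.

```python
from collections import Counter


def merge_entities(tokens, spans):
    """Replace each tagged span with one token that carries its category.

    `spans` holds non-overlapping (start, end, category) triples with `end`
    exclusive; they stand in for the output of an NE tagger.
    """
    merged = []
    position = 0
    for start, end, category in sorted(spans):
        merged.extend(tokens[position:start])
        merged.append("_".join(tokens[start:end]) + "/" + category)
        position = end
    merged.extend(tokens[position:])
    return merged


def build_lexicon(corpus):
    """Count every token, so tagged entities become ordinary vocabulary entries."""
    return Counter(token for sentence in corpus for token in sentence)


tokens = ["flights", "from", "New", "York", "to", "Tokyo"]
spans = [(2, 4, "LOCATION"), (5, 6, "LOCATION")]
tagged = merge_entities(tokens, spans)
lexicon = build_lexicon([tagged])
```

Since each entity is a single vocabulary entry with its category attached, a recognizer built on such a lexicon would emit the entity and its label together in one pass, which is the effect the abstract describes.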

28. Generalized fast on-the-fly composition algorithm for WFST-based speech recognition

Author: Atsushi Nakamura and Takaaki Hori
Subjects: Computer science, On the fly, business.industry, Speech recognition, Artificial intelligence, computer.software_genre, business, computer, Composition (language), Natural language processing
Published: 2005
Full Text: View/download PDF

29. Sequential dependency analysis for spontaneous speech understanding

Author: Takanobu Oba, Atsushi Nakamura, and Takaaki Hori
Subjects: Parsing, business.industry, Computer science, Speech recognition, Principle of maximum entropy, Speech synthesis, computer.software_genre, Speech processing, Sequential dependency, Artificial intelligence, business, computer, Natural language processing, Utterance, Sentence, Spontaneous speech
Abstract: The dependency structure contains primary semantic information for interpreting sentences. In conventional approaches for extracting this dependency structure, it is assumed that the complete sentence is known before analysis starts. Therefore, in spontaneous speech, we must detect sentence boundaries. It is necessary for on-line applications to be able to extract the dependency structure from a partial recognition result of a long utterance, but conventional methods are not designed for analyzing such incomplete sentences. In this paper, we propose a sequential dependency analysis method for spontaneous speech. The proposed method enables us to analyze incomplete sentences sequentially and detects sentence boundaries simultaneously. The analyzer can be trained using parsed data based on the maximum entropy principle. Experimental results using spontaneous lecture speech from the CSJ corpus show that our proposed method significantly outperforms a conventional method for analyzing incomplete sentences and achieves nearly the same accuracy for complete sentences.
Published: 2005
Full Text: View/download PDF

30. Construction of weighted finite state transducers for very wide context-dependent acoustic models

Author: Takaaki Hori and Mike Schuster
Subjects: Finite-state machine, Computer science, Phone, Speech recognition, Memory footprint, Decision tree, Context (language use), Speech synthesis, State (computer science), computer.software_genre, computer, Exponential function
Abstract: A previous paper by the authors described an algorithm for efficient construction of weighted finite state transducers for speech recognition when high-order context-dependent models of order K > 3 (triphones) with tied state observation distributions are used, and showed practical application of the algorithm up to K = 5 (quinphones). In this paper we give additional details of the improved implementation and analyze the algorithm's practical runtime requirements and memory footprint for context-orders up to K = 13 (+/-6 phones context) when building fully cross-word capable WFSTs for large vocabulary speech recognition tasks. We show that for typical systems it is possible to use any practical context-order K ≤ 13 without having to fear an exponential explosion of the search space, since the necessary state ID to phone transducer (resembling a phone-loop observing all possible K-phone constraints) can be built in a few minutes at most. The paper also gives some implementation details of how we efficiently collect context statistics and build phonetic decision trees for very wide context-dependent acoustic models.
Published: 2005
Full Text: View/download PDF

31. Language model adaptation using WFST-based speaking-style translation

Author: Daniel Willett, Yasuhiro Minami, and Takaaki Hori
Subjects: Context model, Vocabulary, business.industry, Computer science, media_common.quotation_subject, Speech recognition, Word error rate, computer.software_genre, Speech translation, Language model, Artificial intelligence, business, Adaptation (computer science), computer, Natural language processing, Natural language, Sentence, media_common
Abstract: This paper describes a new approach to language model adaptation for speech recognition based on the statistical framework of speech translation. The main idea of this approach is to compose a weighted finite-state transducer (WFST) that translates sentence styles from in-domain to out-of-domain. It enables the integration of language models of different styles of speaking or dialects and even of different vocabularies. The WFST is built by combining in-domain and out-of-domain models through the translation, while each model and the translation itself are expressed as WFSTs. We apply this technique to building language models for spontaneous speech recognition using large written-style corpora. We conducted experiments on a 20k-word Japanese spontaneous speech recognition task. With a small in-domain corpus, a 2.9% absolute improvement in word error rate is achieved over the in-domain model.
Published: 2003
Full Text: View/download PDF

32. Deriving disambiguous queries in a spoken interactive ODQA system

Author: Chiori Hori, Takaaki Hori, Sadaoki Furui, Shigeru Katagiri, Eisaku Maeda, and Hideki Isozaki
Subjects: Text corpus, Information retrieval, Computer science, business.industry, Question answering, Artificial intelligence, business, computer.software_genre, computer, Natural language processing
Abstract: Recently, open-domain question answering (ODQA) systems that extract an exact answer from large text corpora based on text input are being intensively investigated. However, the information in the first question input by a user is not usually enough to yield the desired answer. Interactions for collecting additional information to accomplish QA are needed. This paper proposes an interactive approach for spoken interactive ODQA systems. When the reliabilities for answer hypotheses obtained by an ODQA system are low, the system automatically derives disambiguous queries (DQ) that draw out additional information. The additional information based on the DQ should contribute to effectively distinguishing an exact answer and to supplementing information lost to recognition errors. In our spoken interactive ODQA system, SPIQA, spoken questions are recognized by an ASR system, and DQ are automatically generated to disambiguate the transcribed questions. We confirmed the appropriateness of the derived DQ by comparing them with manually prepared ones.
Published: 2003
Full Text: View/download PDF
