77 results for "Takaaki Hori"
Search Results
2. Evaluation of microplate handling accuracy for applying robotic arms in laboratory automation
- Author
-
Yoritaka Harazono, Haruko Shimono, Kikumi Hata, Toutai Mitsuyama, and Takaaki Horinouchi
- Subjects
Laboratory automation ,4-axis arm robot ,Microplate ,Repeat accuracy ,Biotechnology ,TP248.13-248.65 ,Medical technology ,R855-855.5 - Abstract
Inexpensive single-arm robots are widely used in recent laboratory automation solutions. Integrating a single-arm robot as a transfer system into a semi-automatic liquid dispenser that lacks one can serve as an inexpensive alternative to a fully automated liquid handling system. However, there has been no quantitative investigation of the positional accuracy that robot arms require to transfer microplates. In this study, we constructed a platform comprising aluminum frames and digital gauges to facilitate such measurements. We measured the position repeatability of a robot arm equipped with a custom-made finger by repeatedly transferring microplates. Further, the acceptable misalignment of plate transfer was evaluated by adding an artificial offset to the microplate position using this platform. The results of these experiments are expected to serve as benchmarks for selecting robot arms for laboratory automation in biology. Furthermore, all information needed to replicate this device will be made publicly available, allowing many researchers to collaborate and accumulate knowledge and, we hope, contribute to advances in this field.
- Published
- 2024
- Full Text
- View/download PDF
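The repeat-accuracy evaluation described in entry 2 reduces to simple statistics over repeated gauge readings. The following minimal sketch is not the authors' code; the gauge values and the ISO 9283-style repeatability formula are assumptions chosen purely for illustration.

```python
import numpy as np

# Hypothetical digital-gauge readings (mm) from repeated microplate placements.
# Each row is one transfer cycle; columns are the X and Y gauge axes.
readings = np.array([
    [0.012, -0.031],
    [0.015, -0.028],
    [0.009, -0.035],
    [0.014, -0.030],
    [0.011, -0.029],
])

def position_repeatability(samples: np.ndarray) -> float:
    """ISO 9283-style repeatability: mean distance from the barycenter
    plus three standard deviations of that distance."""
    center = samples.mean(axis=0)
    distances = np.linalg.norm(samples - center, axis=1)
    return distances.mean() + 3.0 * distances.std(ddof=1)

print(f"repeatability ~ {position_repeatability(readings):.4f} mm")
```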
3. Regional developers’ community accelerates laboratory automation
- Author
-
Akari Kato, Takaaki Horinouchi, Haruka Ozaki, and Genki N. Kanda
- Subjects
Laboratory automation ,Community ,Japan ,Biotechnology ,TP248.13-248.65 ,Medical technology ,R855-855.5 - Published
- 2024
- Full Text
- View/download PDF
4. Audio-Visual Scene-Aware Dialog and Reasoning using Audio-Visual Transformers with Joint Student-Teacher Learning
- Author
-
Ankit Shah, Shijie Geng, Peng Gao, Anoop Cherian, Takaaki Hori, Tim K. Marks, Jonathan Le Roux, and Chiori Hori
- Subjects
FOS: Computer and information sciences ,Computer Science - Computation and Language ,Computation and Language (cs.CL) - Abstract
In previous work, we have proposed the Audio-Visual Scene-Aware Dialog (AVSD) task, collected an AVSD dataset, developed AVSD technologies, and hosted an AVSD challenge track at both the 7th and 8th Dialog System Technology Challenges (DSTC7, DSTC8). In these challenges, the best-performing systems relied heavily on human-generated descriptions of the video content, which were available in the datasets but would be unavailable in real-world applications. To promote further advancements for real-world applications, we proposed a third AVSD challenge, at DSTC10, with two modifications: 1) the human-created description is unavailable at inference time, and 2) systems must demonstrate temporal reasoning by finding evidence from the video to support each answer. This paper introduces the new task that includes temporal reasoning and our new extension of the AVSD dataset for DSTC10, for which we collected human-generated temporal reasoning data. We also introduce a baseline system built using an AV-transformer, which we released along with the new dataset. Finally, this paper introduces a new system that extends our baseline system with attentional multimodal fusion, joint student-teacher learning (JSTL), and model combination techniques, achieving state-of-the-art performances on the AVSD datasets for DSTC7, DSTC8, and DSTC10. We also propose two temporal reasoning methods for AVSD: one attention-based, and one based on a time-domain region proposal network., https://dstc10.dstc.community/home and https://github.com/dialogtekgeek/AVSD-DSTC10_Official/
- Published
- 2021
5. Dual Causal/Non-Causal Self-Attention for Streaming End-to-End Speech Recognition
- Author
-
Jonathan Le Roux, Takaaki Hori, and Niko Moritz
- Subjects
FOS: Computer and information sciences ,Sound (cs.SD) ,Computer Science - Machine Learning ,Computer science ,Speech recognition ,Self attention ,Context (language use) ,DUAL (cognitive architecture) ,Computer Science - Sound ,Machine Learning (cs.LG) ,End-to-end principle ,Audio and Speech Processing (eess.AS) ,FOS: Electrical engineering, electronic engineering, information engineering ,Word (computer architecture) ,Single layer ,Electrical Engineering and Systems Science - Audio and Speech Processing ,Transformer (machine learning model) - Abstract
Attention-based end-to-end automatic speech recognition (ASR) systems have recently demonstrated state-of-the-art results for numerous tasks. However, the application of self-attention and attention-based encoder-decoder models remains challenging for streaming ASR, where each word must be recognized shortly after it was spoken. In this work, we present the dual causal/non-causal self-attention (DCN) architecture, which, in contrast to restricted self-attention, prevents the overall context from growing beyond the look-ahead of a single layer when used in a deep architecture. DCN is compared to chunk-based and restricted self-attention using streaming transformer and conformer architectures, showing improved ASR performance over restricted self-attention and competitive ASR results compared to chunk-based self-attention, while providing the advantage of frame-synchronous processing. Combined with triggered attention, the proposed streaming end-to-end ASR systems obtained state-of-the-art results on the LibriSpeech, HKUST, and Switchboard ASR tasks., Accepted to Interspeech 2021
- Published
- 2021
- Full Text
- View/download PDF
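The key property of DCN described in entry 5 is that only the causal stream's output is passed to the next layer, so the non-causal look-ahead does not accumulate with depth. The sketch below illustrates the two attention masks in plain NumPy; the wiring and dimensions are simplifying assumptions, not the paper's implementation.

```python
import numpy as np

def masked_attention(q, k, v, mask):
    """Scaled dot-product self-attention with a boolean keep-mask."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def dual_causal_noncausal_layer(x, look_ahead=2):
    """Illustrative DCN-style layer: a causal stream (no future frames) whose
    output is what the next layer consumes, and a non-causal stream with a
    fixed look-ahead used only for the layer's prediction output, so the
    receptive field does not grow with depth."""
    t = np.arange(x.shape[0])
    causal_mask = t[None, :] <= t[:, None]
    noncausal_mask = t[None, :] <= t[:, None] + look_ahead
    causal_out = masked_attention(x, x, x, causal_mask)        # propagated upward
    noncausal_out = masked_attention(x, x, x, noncausal_mask)  # output stream only
    return causal_out, noncausal_out

frames = np.random.randn(10, 8)   # 10 frames, 8-dim features (toy sizes)
hidden_for_next_layer, streaming_output = dual_causal_noncausal_layer(frames)
```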
6. Advanced Long-Context End-to-End Speech Recognition Using Context-Expanded Transformers
- Author
-
Takaaki Hori, Niko Moritz, Chiori Hori, and Jonathan Le Roux
- Subjects
FOS: Computer and information sciences ,Reduction (complexity) ,Computer Science - Computation and Language ,Computer science ,Speech recognition ,Process (computing) ,Word error rate ,Context (language use) ,Computation and Language (cs.CL) ,Word (computer architecture) ,Utterance ,Decoding methods ,Transformer (machine learning model) - Abstract
This paper addresses end-to-end automatic speech recognition (ASR) for long audio recordings such as lectures and conversational speech. Most end-to-end ASR models are designed to recognize independent utterances, but contextual information (e.g., speaker or topic) across multiple utterances is known to be useful for ASR. In our prior work, we proposed a context-expanded Transformer that accepts multiple consecutive utterances at the same time and predicts an output sequence for the last utterance, achieving 5-15% relative error reduction over utterance-based baselines in lecture and conversational ASR benchmarks. Although the results have shown remarkable performance gains, there is still potential to further improve the model architecture and the decoding process. In this paper, we extend our prior work by (1) introducing the Conformer architecture to further improve the accuracy, (2) accelerating the decoding process with a novel activation recycling technique, and (3) enabling streaming decoding with triggered attention. We demonstrate that the extended Transformer provides state-of-the-art end-to-end ASR performance, obtaining a 17.3% character error rate for the HKUST dataset and 12.0%/6.3% word error rates for the Switchboard-300 Eval2000 CallHome/Switchboard test sets. The new decoding method reduces decoding time by more than 50% and further enables streaming ASR with limited accuracy degradation., Submitted to INTERSPEECH 2021
- Published
- 2021
- Full Text
- View/download PDF
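The context expansion in entry 6 amounts to concatenating a few preceding utterances to the current one while decoding only the last utterance. A minimal data-preparation sketch follows; the feature dimension and context size are assumptions made for illustration only.

```python
import numpy as np

def expand_context(utterance_feats, num_context=2):
    """Context-expanded training examples: concatenate up to `num_context`
    preceding utterances with the current one, and remember the frame offset
    where the current (to-be-decoded) utterance starts."""
    dim = utterance_feats[0].shape[1]
    examples = []
    for i, feats in enumerate(utterance_feats):
        history = utterance_feats[max(0, i - num_context):i]
        prefix = np.concatenate(history, axis=0) if history else np.zeros((0, dim))
        examples.append((np.concatenate([prefix, feats], axis=0), prefix.shape[0]))
    return examples

# Toy "lecture": five consecutive utterances of 83-dim features (assumed fbank+pitch).
utts = [np.random.randn(np.random.randint(40, 80), 83) for _ in range(5)]
for expanded, offset in expand_context(utts):
    print(expanded.shape, "decode from frame", offset)
```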
7. Optimizing Latency for Online Video Captioning Using Audio-Visual Transformers
- Author
-
Chiori Hori, Jonathan Le Roux, and Takaaki Hori
- Subjects
Closed captioning ,Computer science ,business.industry ,Event (computing) ,media_common.quotation_subject ,Detector ,ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION ,Latency (audio) ,Computer vision ,Quality (business) ,Artificial intelligence ,CLIPS ,business ,computer ,Natural language ,Transformer (machine learning model) ,computer.programming_language ,media_common - Abstract
Video captioning is an essential technology to understand scenes and describe events in natural language. To apply it to real-time monitoring, a system needs not only to describe events accurately but also to produce the captions as soon as possible. Low-latency captioning is needed to realize such functionality, but this research area for online video captioning has not been pursued yet. This paper proposes a novel approach to optimize each caption's output timing based on a trade-off between latency and caption quality. An audio-visual Transformer is trained to generate ground-truth captions using only a small portion of all video frames, and to mimic outputs of a pre-trained Transformer to which all the frames are given. A CNN-based timing detector is also trained to detect a proper output timing, where the captions generated by the two Transformers become sufficiently close to each other. With the jointly trained Transformer and timing detector, a caption can be generated in the early stages of an event-triggered video clip, as soon as an event happens or when it can be forecasted. Experiments with the ActivityNet Captions dataset show that our approach achieves 94% of the caption quality of the upper bound given by the pre-trained Transformer using the entire video clips, using only 28% of frames from the beginning.
- Published
- 2021
- Full Text
- View/download PDF
8. Simultaneous Resection for Cancer of the Body of the Pancreas and an Isolated Jejunal Metastasis: A Case Report
- Author
-
Tadashi Tsukamoto, Eijiro Edagawa, Takaaki Hori, Ryoji Kaizaki, Satoshi Takatsuka, and Hiroko Fukushima
- Published
- 2020
- Full Text
- View/download PDF
9. Multi-Stream End-to-End Speech Recognition
- Author
-
Takaaki Hori, Sri Harish Mallidi, Xiaofei Wang, Hynek Hermansky, Ruizhi Li, and Shinji Watanabe
- Subjects
FOS: Computer and information sciences ,Sound (cs.SD) ,Acoustics and Ultrasonics ,Microphone ,Computer science ,Speech recognition ,Word error rate ,02 engineering and technology ,010501 environmental sciences ,01 natural sciences ,Computer Science - Sound ,End-to-end principle ,Connectionism ,Audio and Speech Processing (eess.AS) ,Robustness (computer science) ,FOS: Electrical engineering, electronic engineering, information engineering ,0202 electrical engineering, electronic engineering, information engineering ,Computer Science (miscellaneous) ,Electrical and Electronic Engineering ,0105 earth and related environmental sciences ,Computer Science - Computation and Language ,020206 networking & telecommunications ,Computational Mathematics ,Test set ,Computation and Language (cs.CL) ,Encoder ,Decoding methods ,Electrical Engineering and Systems Science - Audio and Speech Processing - Abstract
Attention-based methods and the Connectionist Temporal Classification (CTC) network have been promising research directions for end-to-end (E2E) Automatic Speech Recognition (ASR). The joint CTC/Attention model has achieved great success by utilizing both architectures during multi-task training and joint decoding. In this work, we present a multi-stream framework based on joint CTC/Attention E2E ASR with parallel streams represented by separate encoders aiming to capture diverse information. On top of the regular attention networks, the Hierarchical Attention Network (HAN) is introduced to steer the decoder toward the most informative encoders. A separate CTC network is assigned to each stream to force monotonic alignments. Two representative frameworks are proposed and discussed: the Multi-Encoder Multi-Resolution (MEM-Res) framework and the Multi-Encoder Multi-Array (MEM-Array) framework. In the MEM-Res framework, two heterogeneous encoders with different architectures, temporal resolutions, and separate CTC networks work in parallel to extract complementary information from the same acoustics. Experiments are conducted on Wall Street Journal (WSJ) and CHiME-4, resulting in relative Word Error Rate (WER) reductions of 18.0-32.1% and a best WER of 3.6% on the WSJ eval92 test set. The MEM-Array framework aims at improving far-field ASR robustness using multiple microphone arrays, which are activated by separate encoders. Compared with the best single-array results, the proposed framework achieves relative WER reductions of 3.7% and 9.7% on the AMI and DIRHA multi-array corpora, respectively, which also outperforms conventional fusion strategies., submitted to IEEE TASLP (In review). arXiv admin note: substantial text overlap with arXiv:1811.04897, arXiv:1811.04903
- Published
- 2020
- Full Text
- View/download PDF
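Entry 9's Hierarchical Attention Network can be pictured as a second attention over per-stream context vectors. The sketch below shows that stream-level fusion step in NumPy; the projections and dimensions are made up for illustration and do not mirror the paper's exact parameterization.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def hierarchical_attention(decoder_state, stream_contexts, W_q, W_k):
    """Stream-level (hierarchical) attention: score each encoder stream's
    context vector against the decoder state and return a fused context."""
    q = W_q @ decoder_state                                  # (d,)
    keys = np.stack([W_k @ c for c in stream_contexts])      # (num_streams, d)
    scores = keys @ q / np.sqrt(q.shape[0])
    weights = softmax(scores)
    fused = weights @ np.stack(stream_contexts)
    return fused, weights

d = 16
dec_state = np.random.randn(d)
contexts = [np.random.randn(d), np.random.randn(d)]          # e.g., two encoder streams
W_q, W_k = np.random.randn(d, d), np.random.randn(d, d)
ctx, stream_weights = hierarchical_attention(dec_state, contexts, W_q, W_k)
print("stream weights:", stream_weights)
```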
10. Overview of the sixth dialog system technology challenge: DSTC6
- Author
-
Y-Lan Boureau, Ryuichiro Higashinaka, Michimasa Inaba, Julien Perez, Chiori Hori, Yuiko Tsunomori, Takaaki Hori, Seokhwan Kim, Koichiro Yoshino, and Tetsuro Takahashi
- Subjects
Class (computer programming) ,Goal orientation ,Computer science ,media_common.quotation_subject ,Natural language generation ,020206 networking & telecommunications ,02 engineering and technology ,computer.software_genre ,01 natural sciences ,Theoretical Computer Science ,Domain (software engineering) ,Task (project management) ,Human-Computer Interaction ,Human–computer interaction ,0103 physical sciences ,0202 electrical engineering, electronic engineering, information engineering ,Conversation ,Dialog box ,Dialog system ,010301 acoustics ,computer ,Software ,media_common - Abstract
This paper describes the experimental setups and the evaluation results of the sixth Dialog System Technology Challenges (DSTC6), which aims to develop end-to-end dialogue systems. Neural network models have become a recent focus of investigation in dialogue technologies. Previous models required training data to be manually annotated with word meanings and dialogue states, but end-to-end neural network dialogue systems learn to directly output natural-language system responses without needing training data to be manually annotated. This approach thus allows us to scale up the size of training data and cover more dialog domains. In addition, dialogue systems require a meta-function to avoid deploying inappropriate responses generated by themselves. To address such issues, DSTC6 consists of three tracks: (1) End-to-End Goal Oriented Dialogue Learning, to select system responses; (2) End-to-End Conversation Modeling, to generate system responses using Natural Language Generation (NLG); and (3) Dialogue Breakdown Detection. Since each domain has different issues to be addressed when developing dialogue systems, we targeted restaurant retrieval dialogues with slot-value filling in Track 1, customer service on Twitter combining goal-oriented dialogue and chit-chat in Track 2, and human-machine chit-chat dialogue data in Track 3. DSTC6 had 141 people declaring their interest, and 23 teams submitted their final results. Eighteen scientific papers were presented at the wrap-up workshop. We find that blending end-to-end trainable models with meaningful prior knowledge performs best for restaurant retrieval in Track 1; indeed, Hybrid Code Networks and Memory Networks were the best models for this task. In Track 2, 78.5% of the system responses automatically generated by the best system were rated better than acceptable by humans, reaching 89% of the number of human responses rated in the same class. In Track 3, the dialogue breakdown detection technologies performed as well as human agreement on both the English and Japanese datasets.
- Published
- 2019
- Full Text
- View/download PDF
11. Adversarial training and decoding strategies for end-to-end neural conversation models
- Author
-
John R. Hershey, Bret Harsham, Wen Wang, Koji Yusuke, Takaaki Hori, and Chiori Hori
- Subjects
Computer science ,media_common.quotation_subject ,02 engineering and technology ,Machine learning ,computer.software_genre ,01 natural sciences ,Theoretical Computer Science ,Task (project management) ,Adversarial system ,End-to-end principle ,0103 physical sciences ,0202 electrical engineering, electronic engineering, information engineering ,Conversation ,Dialog system ,Dialog box ,Set (psychology) ,010301 acoustics ,media_common ,business.industry ,020206 networking & telecommunications ,Human-Computer Interaction ,Artificial intelligence ,business ,computer ,Software ,Decoding methods - Abstract
This paper presents adversarial training and decoding methods for neural conversation models that can generate natural responses given dialog contexts. In our prior work, we built several end-to-end conversation systems for the 6th Dialog System Technology Challenges (DSTC6) Twitter help-desk dialog task. These systems included novel extensions of sequence adversarial training, example-based response extraction, and Minimum Bayes-Risk based system combination. In DSTC6, our systems achieved the best performance in most objective measures, such as BLEU and METEOR scores, and decent performance in a subjective measure based on human rating. In this paper, we provide a complete set of our experiments for DSTC6 and further extend the training and decoding strategies, focusing more on improving the subjective measure, where we combine the responses of three adversarial models. Experimental results demonstrate that the extended methods improve the human rating score and outperform the best score in DSTC6.
- Published
- 2019
- Full Text
- View/download PDF
12. Capturing Multi-Resolution Context by Dilated Self-Attention
- Author
-
Jonathan Le Roux, Niko Moritz, and Takaaki Hori
- Subjects
FOS: Computer and information sciences ,Signal processing ,Computer Science - Machine Learning ,Sound (cs.SD) ,Artificial neural network ,Machine translation ,Computational complexity theory ,Computer science ,Speech recognition ,Pooling ,Context (language use) ,computer.software_genre ,Speech processing ,Computer Science - Sound ,Machine Learning (cs.LG) ,Audio and Speech Processing (eess.AS) ,FOS: Electrical engineering, electronic engineering, information engineering ,Dilation (morphology) ,computer ,Electrical Engineering and Systems Science - Audio and Speech Processing - Abstract
Self-attention has become an important and widely used neural network component that helped to establish new state-of-the-art results for various applications, such as machine translation and automatic speech recognition (ASR). However, the computational complexity of self-attention grows quadratically with the input sequence length. This can be particularly problematic for applications such as ASR, where an input sequence generated from an utterance can be relatively long. In this work, we propose a combination of restricted self-attention and a dilation mechanism, which we refer to as dilated self-attention. The restricted self-attention allows attention to neighboring frames of the query at a high resolution, and the dilation mechanism summarizes distant information to allow attending to it with a lower resolution. Different methods for summarizing distant frames are studied, such as subsampling, mean-pooling, and attention-based pooling. ASR results demonstrate substantial improvements compared to restricted self-attention alone, achieving similar results compared to full-sequence based self-attention with a fraction of the computational costs., Comment: In Proc. ICASSP 2021
- Published
- 2021
- Full Text
- View/download PDF
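Dilated self-attention, as described in entry 12, mixes full-resolution attention over a local window with low-resolution attention over summaries of distant frames. The following NumPy sketch uses simple mean-pooling for the summaries and does not exclude pooled blocks that overlap the local window; it is an illustration of the idea, not the paper's implementation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def dilated_self_attention(x, local_width=4, pool_size=8):
    """For each query frame, attend at full resolution to frames within
    +/- local_width, and at low resolution to mean-pooled block summaries of
    the whole sequence (the 'dilation' part, here via mean-pooling)."""
    T, d = x.shape
    num_blocks = int(np.ceil(T / pool_size))
    pooled = np.stack([x[b * pool_size:(b + 1) * pool_size].mean(axis=0)
                       for b in range(num_blocks)])
    out = np.zeros_like(x)
    for t in range(T):
        lo, hi = max(0, t - local_width), min(T, t + local_width + 1)
        keys = np.concatenate([x[lo:hi], pooled], axis=0)
        scores = keys @ x[t] / np.sqrt(d)
        out[t] = softmax(scores) @ keys
    return out

y = dilated_self_attention(np.random.randn(32, 16))   # 32 frames, 16-dim features
```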
13. Momentum Pseudo-Labeling for Semi-Supervised Speech Recognition
- Author
-
Takaaki Hori, Jonathan Le Roux, Niko Moritz, and Yosuke Higuchi
- Subjects
Online model ,Online and offline ,FOS: Computer and information sciences ,Momentum (technical analysis) ,Computer Science - Machine Learning ,Sound (cs.SD) ,Computer science ,Speech recognition ,Process (computing) ,Computer Science - Sound ,Domain (software engineering) ,Machine Learning (cs.LG) ,Connectionism ,Moving average ,Audio and Speech Processing (eess.AS) ,Scalability ,FOS: Electrical engineering, electronic engineering, information engineering ,Electrical Engineering and Systems Science - Audio and Speech Processing - Abstract
Pseudo-labeling (PL) has been shown to be effective in semi-supervised automatic speech recognition (ASR), where a base model is self-trained with pseudo-labels generated from unlabeled data. While PL can be further improved by iteratively updating pseudo-labels as the model evolves, most of the previous approaches involve inefficient retraining of the model or intricate control of the label update. We present momentum pseudo-labeling (MPL), a simple yet effective strategy for semi-supervised ASR. MPL consists of a pair of online and offline models that interact and learn from each other, inspired by the mean teacher method. The online model is trained to predict pseudo-labels generated on the fly by the offline model. The offline model maintains a momentum-based moving average of the online model. MPL is performed in a single training process and the interaction between the two models effectively helps them reinforce each other to improve the ASR performance. We apply MPL to an end-to-end ASR model based on the connectionist temporal classification. The experimental results demonstrate that MPL effectively improves over the base model and is scalable to different semi-supervised scenarios with varying amounts of data or domain mismatch., Comment: Accepted to Interspeech 2021
- Published
- 2021
- Full Text
- View/download PDF
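The core of momentum pseudo-labeling (entry 13) is an exponential-moving-average teacher, as in the mean teacher method. A schematic sketch follows; the parameter dictionaries and the placeholder gradient step stand in for a real ASR model and training loop.

```python
import numpy as np

def mpl_update(online_params, offline_params, momentum=0.999):
    """Momentum pseudo-labeling, schematically: the offline (teacher) model is
    an exponential moving average of the online (student) model's parameters;
    the student is trained on pseudo-labels produced on the fly by the teacher."""
    return {name: momentum * offline_params[name] + (1.0 - momentum) * p
            for name, p in online_params.items()}

# Toy parameter dictionaries standing in for real network weights.
online = {"w": np.random.randn(4, 4), "b": np.zeros(4)}
offline = {k: v.copy() for k, v in online.items()}

for step in range(100):
    # ... decode unlabeled audio with `offline`, train `online` on those labels ...
    online["w"] -= 0.01 * np.random.randn(4, 4)   # placeholder gradient step
    offline = mpl_update(online, offline)
```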
14. Unsupervised Domain Adaptation for Speech Recognition via Uncertainty Driven Self-Training
- Author
-
Niko Moritz, Sameer Khurana, Jonathan Le Roux, and Takaaki Hori
- Subjects
FOS: Computer and information sciences ,Domain adaptation ,Measure (data warehouse) ,Sound (cs.SD) ,Computer Science - Machine Learning ,Computer Science - Computation and Language ,Computer science ,Speech recognition ,Training (meteorology) ,Computer Science - Sound ,Domain (software engineering) ,Machine Learning (cs.LG) ,Audio and Speech Processing (eess.AS) ,FOS: Electrical engineering, electronic engineering, information engineering ,Measurement uncertainty ,Set (psychology) ,Computation and Language (cs.CL) ,Dropout (neural networks) ,Test data ,Electrical Engineering and Systems Science - Audio and Speech Processing - Abstract
The performance of automatic speech recognition (ASR) systems typically degrades significantly when the training and test data domains are mismatched. In this paper, we show that self-training (ST) combined with an uncertainty-based pseudo-label filtering approach can be effectively used for domain adaptation. We propose DUST, a dropout-based uncertainty-driven self-training technique which uses agreement between multiple predictions of an ASR system obtained for different dropout settings to measure the model's uncertainty about its prediction. DUST excludes pseudo-labeled data with high uncertainties from the training, which leads to substantially improved ASR results compared to ST without filtering, and accelerates the training time due to a reduced training data set. Domain adaptation experiments using WSJ as a source domain and TED-LIUM 3 as well as SWITCHBOARD as the target domains show that up to 80% of the performance of a system trained on ground-truth data can be recovered., ICASSP 2021
- Published
- 2020
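DUST (entry 14) filters pseudo-labels by how much hypotheses decoded under different dropout settings disagree with a reference hypothesis. The sketch below uses a word-level edit-distance ratio and a hypothetical threshold; the exact agreement measure and threshold in the paper may differ.

```python
def edit_distance(a, b):
    """Word-level Levenshtein distance (rolling-array implementation)."""
    dp = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, wb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (wa != wb))
    return dp[-1]

def dust_filter(reference, dropout_hyps, threshold=0.3):
    """Keep a pseudo-labeled utterance only if every dropout hypothesis stays
    close to the reference (no-dropout) hypothesis, i.e. the model is confident."""
    ref = reference.split()
    ratios = [edit_distance(ref, h.split()) / max(len(ref), 1) for h in dropout_hyps]
    return max(ratios) <= threshold

ref = "the cat sat on the mat"
hyps = ["the cat sat on the mat", "the cat sat on a mat"]
print(dust_filter(ref, hyps))   # True -> keep this pseudo-label
```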
15. Unsupervised Speaker Adaptation Using Attention-Based Speaker Memory for End-to-End ASR
- Author
-
Leda Sari, Jonathan Le Roux, Niko Moritz, and Takaaki Hori
- Subjects
FOS: Computer and information sciences ,Sound (cs.SD) ,Computer Science - Machine Learning ,Computer science ,Speech recognition ,010501 environmental sciences ,01 natural sciences ,Computer Science - Sound ,Machine Learning (cs.LG) ,030507 speech-language pathology & audiology ,03 medical and health sciences ,Connectionism ,Audio and Speech Processing (eess.AS) ,FOS: Electrical engineering, electronic engineering, information engineering ,Hidden Markov model ,0105 earth and related environmental sciences ,Computer Science - Computation and Language ,Training set ,Artificial neural network ,Embedding ,0305 other medical science ,Joint (audio engineering) ,Computation and Language (cs.CL) ,Encoder ,Word (computer architecture) ,Decoding methods ,Electrical Engineering and Systems Science - Audio and Speech Processing - Abstract
We propose an unsupervised speaker adaptation method inspired by the neural Turing machine for end-to-end (E2E) automatic speech recognition (ASR). The proposed model contains a memory block that holds speaker i-vectors extracted from the training data and reads relevant i-vectors from the memory through an attention mechanism. The resulting memory vector (M-vector) is concatenated to the acoustic features or to the hidden layer activations of an E2E neural network model. The E2E ASR system is based on the joint connectionist temporal classification and attention-based encoder-decoder architecture. M-vector and i-vector results are compared for inserting them at different layers of the encoder neural network using the WSJ and TED-LIUM2 ASR benchmarks. We show that M-vectors, which do not require an auxiliary speaker embedding extraction system at test time, achieve similar word error rates (WERs) compared to i-vectors for single speaker utterances and significantly lower WERs for utterances in which there are speaker changes., To appear in Proc. ICASSP 2020
- Published
- 2020
- Full Text
- View/download PDF
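The M-vector of entry 15 is an attention-weighted read over a memory of training-speaker i-vectors. The following sketch shows that read operation only; the dimensions, projection, and choice of query are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def read_m_vector(query, ivector_memory, W):
    """Attention-based read over a fixed memory of training-speaker i-vectors:
    the current acoustic summary (query) selects a convex combination of
    stored i-vectors, which is then concatenated to the acoustic features."""
    scores = ivector_memory @ (W @ query) / np.sqrt(query.shape[0])
    weights = softmax(scores)
    return weights @ ivector_memory        # the M-vector

num_speakers, ivec_dim, feat_dim = 50, 100, 40
memory = np.random.randn(num_speakers, ivec_dim)      # i-vectors from training data
W = np.random.randn(ivec_dim, feat_dim)
frame_summary = np.random.randn(feat_dim)             # e.g., an encoder hidden state
m_vec = read_m_vector(frame_summary, memory, W)
augmented = np.concatenate([frame_summary, m_vec])    # fed to the E2E ASR encoder
```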
16. A Case of Septic Pulmonary Embolism due to Local Recurrence of Cancer of the Head of the Pancreas
- Author
-
Shingo Togano, Eijiro Edagawa, Shintaro Kodai, Satoshi Takatsuka, Ryoji Kaizaki, Satoshi Shiraishi, Tadashi Tsukamoto, Takaaki Hori, and Akishige Kanazawa
- Subjects
medicine.medical_specialty ,medicine.anatomical_structure ,Head (linguistics) ,business.industry ,medicine ,Septic pulmonary embolism ,Cancer ,Radiology ,Pancreas ,medicine.disease ,business - Published
- 2019
- Full Text
- View/download PDF
17. A Case of Thoracolithiasis Diagnosed by CT and Confirmed on a Chest X-ray Taken 15 Years before
- Author
-
Takaaki Hori, Ryoji Kaizaki, Yoshimi Sugama, Satoshi Takatsuka, Shingo Togano, Eijiro Edagawa, and Tadashi Tsukamoto
- Subjects
business.industry ,X-ray ,Medicine ,business ,Nuclear medicine - Published
- 2018
- Full Text
- View/download PDF
18. Multi-microphone speech recognition integrating beamforming, robust feature extraction, and advanced DNN/RNN backend
- Author
-
Vikramjit Mitra, John R. Hershey, Takaaki Hori, Zhuo Chen, Hakan Erdogan, Shinji Watanabe, and Jonathan Le Roux
- Subjects
Beamforming ,Artificial neural network ,business.industry ,Computer science ,Speech recognition ,Feature extraction ,Word error rate ,020206 networking & telecommunications ,Pattern recognition ,02 engineering and technology ,Theoretical Computer Science ,Human-Computer Interaction ,Speech enhancement ,030507 speech-language pathology & audiology ,03 medical and health sciences ,Recurrent neural network ,0202 electrical engineering, electronic engineering, information engineering ,Mel-frequency cepstrum ,Artificial intelligence ,Language model ,0305 other medical science ,business ,Software - Abstract
This paper gives an in-depth presentation of the multi-microphone speech recognition system we submitted to the 3rd CHiME speech separation and recognition challenge (CHiME-3) and its extension. The proposed system takes advantage of recurrent neural networks (RNNs) throughout the model from the front-end speech enhancement to the language modeling. Three different types of beamforming are used to combine multi-microphone signals to obtain a single higher-quality signal. The beamformed signal is further processed by a single-channel long short-term memory (LSTM) enhancement network, which is used to extract stacked mel-frequency cepstral coefficients (MFCC) features. In addition, the beamformed signal is processed by two proposed noise-robust feature extraction methods. All features are used for decoding in speech recognition systems with deep neural network (DNN) based acoustic models and large-scale RNN language models to achieve high recognition accuracy in noisy environments. Our training methodology includes multi-channel noisy data training and speaker adaptive training, whereas at test time model combination is used to improve generalization. Results on the CHiME-3 benchmark show that the full set of techniques substantially reduced the word error rate (WER). Combining hypotheses from different beamforming and robust-feature systems ultimately achieved 5.05% WER for the real-test data, an 84.7% reduction relative to the baseline of 32.99% WER and a 44.5% reduction from our official CHiME-3 challenge result of 9.1% WER. Furthermore, this final result is better than the best result (5.8% WER) reported in the CHiME-3 challenge.
- Published
- 2017
- Full Text
- View/download PDF
19. The 2020 ESPnet update: new features, broadened applications, performance improvements, and future plans
- Author
-
Wen-Chin Huang, Shinji Watanabe, Xuankai Chang, Jing Shi, Hirofumi Inaguma, Naoyuki Kamo, Shigeki Karita, Takaaki Hori, Pengcheng Guo, Yosuke Higuchi, Aswin Shanmugam Subramanian, Wangyou Zhang, Tomoki Hayashi, Florian Boyer, and Chenda Li
- Subjects
Beamforming ,FOS: Computer and information sciences ,Sound (cs.SD) ,Multimedia ,Computer science ,media_common.quotation_subject ,Speech synthesis ,computer.software_genre ,Speech processing ,Computer Science - Sound ,Speech enhancement ,Sequence modeling ,Audio and Speech Processing (eess.AS) ,Speech translation ,FOS: Electrical engineering, electronic engineering, information engineering ,Conversation ,computer ,Transformer (machine learning model) ,media_common ,Electrical Engineering and Systems Science - Audio and Speech Processing - Abstract
This paper describes the recent development of ESPnet (https://github.com/espnet/espnet), an end-to-end speech processing toolkit. This project was initiated in December 2017, mainly to deal with end-to-end speech recognition experiments based on sequence-to-sequence modeling. The project has grown rapidly and now covers a wide range of speech processing applications. ESPnet now also includes text-to-speech (TTS), voice conversion (VC), speech translation (ST), and speech enhancement (SE) with support for beamforming, speech separation, denoising, and dereverberation. All applications are trained in an end-to-end manner, thanks to generic sequence-to-sequence modeling properties, and they can be further integrated and jointly optimized. ESPnet also provides reproducible all-in-one recipes for these applications with state-of-the-art performance in various benchmarks by incorporating Transformer, advanced data augmentation, and Conformer. This project aims to provide up-to-date speech processing experience to the community so that researchers in academia and industry at various scales can develop their technologies collaboratively.
- Published
- 2020
- Full Text
- View/download PDF
20. Streaming automatic speech recognition with the transformer model
- Author
-
Jonathan Le Roux, Niko Moritz, and Takaaki Hori
- Subjects
FOS: Computer and information sciences ,Computer Science - Machine Learning ,Sound (cs.SD) ,Computer Science - Computation and Language ,Computer science ,Speech recognition ,020206 networking & telecommunications ,Machine Learning (stat.ML) ,02 engineering and technology ,Computer Science - Sound ,Machine Learning (cs.LG) ,Recurrent neural network ,Statistics - Machine Learning ,Audio and Speech Processing (eess.AS) ,0202 electrical engineering, electronic engineering, information engineering ,FOS: Electrical engineering, electronic engineering, information engineering ,020201 artificial intelligence & image processing ,Encoder ,Computation and Language (cs.CL) ,Utterance ,Transformer (machine learning model) ,Test data ,Electrical Engineering and Systems Science - Audio and Speech Processing - Abstract
Encoder-decoder based sequence-to-sequence models have demonstrated state-of-the-art results in end-to-end automatic speech recognition (ASR). Recently, the transformer architecture, which uses self-attention to model temporal context information, has been shown to achieve significantly lower word error rates (WERs) compared to recurrent neural network (RNN) based system architectures. Despite its success, the practical usage is limited to offline ASR tasks, since encoder-decoder architectures typically require an entire speech utterance as input. In this work, we propose a transformer based end-to-end ASR system for streaming ASR, where an output must be generated shortly after each spoken word. To achieve this, we apply time-restricted self-attention for the encoder and triggered attention for the encoder-decoder attention mechanism. Our proposed streaming transformer architecture achieves 2.8% and 7.3% WER for the "clean" and "other" test data of LibriSpeech, which to our knowledge is the best published streaming end-to-end ASR result for this task.
- Published
- 2020
- Full Text
- View/download PDF
21. Semi-Supervised Speech Recognition via Graph-based Temporal Classification
- Author
-
Jonathan Le Roux, Takaaki Hori, and Niko Moritz
- Subjects
FOS: Computer and information sciences ,Computer Science - Machine Learning ,Signal processing ,Sequence ,Sound (cs.SD) ,Computer Science - Computation and Language ,Linear programming ,Computer science ,Speech recognition ,Probabilistic logic ,Computer Science - Sound ,Oracle ,Machine Learning (cs.LG) ,ComputingMethodologies_PATTERNRECOGNITION ,Connectionism ,Audio and Speech Processing (eess.AS) ,FOS: Electrical engineering, electronic engineering, information engineering ,Graph (abstract data type) ,Computation and Language (cs.CL) ,Utterance ,Electrical Engineering and Systems Science - Audio and Speech Processing - Abstract
Semi-supervised learning has demonstrated promising results in automatic speech recognition (ASR) by self-training using a seed ASR model with pseudo-labels generated for unlabeled data. The effectiveness of this approach largely relies on the pseudo-label accuracy, for which typically only the 1-best ASR hypothesis is used. However, alternative ASR hypotheses of an N-best list can provide more accurate labels for an unlabeled speech utterance and also reflect uncertainties of the seed ASR model. In this paper, we propose a generalized form of the connectionist temporal classification (CTC) objective that accepts a graph representation of the training labels. The newly proposed graph-based temporal classification (GTC) objective is applied for self-training with WFST-based supervision, which is generated from an N-best list of pseudo-labels. In this setup, GTC is used to learn not only a temporal alignment, similarly to CTC, but also a label alignment to obtain the optimal pseudo-label sequence from the weighted graph. Results show that this approach can effectively exploit an N-best list of pseudo-labels with associated scores, considerably outperforming standard pseudo-labeling, with ASR results approaching an oracle experiment in which the best hypotheses of the N-best lists are selected manually., Comment: ICASSP 2021
- Published
- 2020
- Full Text
- View/download PDF
22. Semi-Supervised Sequence-to-Sequence ASR Using Unpaired Speech and Text
- Author
-
Takaaki Hori, Murali Karthick Baskar, Ramón Fernandez Astudillo, Shinji Watanabe, Jan Cernocký, and Lukas Burget
- Subjects
Sequence ,End-to-end principle ,Computer science ,Speech recognition ,Code (cryptography) ,Leverage (statistics) - Abstract
Sequence-to-sequence automatic speech recognition (ASR) models require large quantities of data to attain high performance. For this reason, there has been a recent surge in interest for unsupervised and semi-supervised training in such models. This work builds upon recent results showing notable improvements in semi-supervised training using cycle-consistency and related techniques. Such techniques derive training procedures and losses able to leverage unpaired speech and/or text data by combining ASR with Text-to-Speech (TTS) models. In particular, this work proposes a new semi-supervised loss combining an end-to-end differentiable ASR→TTS loss with a TTS→ASR loss. The method is able to leverage both unpaired speech and text data to outperform recently proposed related techniques in terms of %WER. We provide extensive results analyzing the impact of data quantity and speech and text modalities and show consistent gains across WSJ and Librispeech corpora. Our code is provided in ESPnet to reproduce the experiments.
- Published
- 2019
- Full Text
- View/download PDF
23. A Comparative Study on Transformer vs RNN in Speech Applications
- Author
-
Takenori Yoshimura, Shinji Watanabe, Ryuichi Yamamoto, Wangyou Zhang, Shigeki Karita, Hirofumi Inaguma, Nelson Yalta, Takaaki Hori, Nanxin Chen, Xiaofei Wang, Tomoki Hayashi, Masao Someki, and Ziyan Jiang
- Subjects
FOS: Computer and information sciences ,Sound (cs.SD) ,Computer Science - Computation and Language ,Machine translation ,Computer science ,Speech recognition ,Speech synthesis ,Speech applications ,computer.software_genre ,Speech processing ,Computer Science - Sound ,Recurrent neural network ,Open source ,Audio and Speech Processing (eess.AS) ,Speech translation ,FOS: Electrical engineering, electronic engineering, information engineering ,computer ,Computation and Language (cs.CL) ,Transformer (machine learning model) ,Electrical Engineering and Systems Science - Audio and Speech Processing - Abstract
Sequence-to-sequence models have been widely used in end-to-end speech processing, for example, automatic speech recognition (ASR), speech translation (ST), and text-to-speech (TTS). This paper focuses on an emergent sequence-to-sequence model called Transformer, which achieves state-of-the-art performance in neural machine translation and other natural language processing applications. We undertook intensive studies in which we experimentally compared and analyzed Transformer and conventional recurrent neural networks (RNN) in a total of 15 ASR, one multilingual ASR, one ST, and two TTS benchmarks. Our experiments revealed various training tips and significant performance benefits obtained with Transformer for each task, including the surprising superiority of Transformer in 13/15 ASR benchmarks in comparison with RNN. We are preparing to release Kaldi-style reproducible recipes using open source and publicly available datasets for all the ASR, ST, and TTS tasks so that the community can reproduce and build on our results., Accepted at ASRU 2019
- Published
- 2019
24. Cycle-consistency Training for End-to-end Speech Recognition
- Author
-
Jonathan Le Roux, Shinji Watanabe, Takaaki Hori, Ramón Fernandez Astudillo, Tomoki Hayashi, and Yu Zhang
- Subjects
FOS: Computer and information sciences ,Sound (cs.SD) ,Computer Science - Computation and Language ,Computer science ,Speech recognition ,Word error rate ,Pronunciation ,Computer Science - Sound ,Consistency (database systems) ,Transcription (linguistics) ,Audio and Speech Processing (eess.AS) ,FOS: Electrical engineering, electronic engineering, information engineering ,Language model ,Computation and Language (cs.CL) ,Encoder ,Electrical Engineering and Systems Science - Audio and Speech Processing - Abstract
This paper presents a method to train end-to-end automatic speech recognition (ASR) models using unpaired data. Although the end-to-end approach can eliminate the need for expert knowledge such as pronunciation dictionaries to build ASR systems, it still requires a large amount of paired data, i.e., speech utterances and their transcriptions. Cycle-consistency losses have been recently proposed as a way to mitigate the problem of limited paired data. These approaches compose a reverse operation with a given transformation, e.g., text-to-speech (TTS) with ASR, to build a loss that only requires unsupervised data, speech in this example. Applying cycle consistency to ASR models is not trivial since fundamental information, such as speaker traits, are lost in the intermediate text bottleneck. To solve this problem, this work presents a loss that is based on the speech encoder state sequence instead of the raw speech signal. This is achieved by training a Text-To-Encoder model and defining a loss based on the encoder reconstruction error. Experimental results on the LibriSpeech corpus show that the proposed cycle-consistency training reduced the word error rate by 14.7% from an initial model trained with 100-hour paired data, using an additional 360 hours of audio data without transcriptions. We also investigate the use of text-only data mainly for language modeling to further improve the performance in the unpaired data training scenario., Submitted to ICASSP'19
- Published
- 2019
- Full Text
- View/download PDF
25. End-to-end Audio Visual Scene-aware Dialog Using Multimodal Attention-based Video Features
- Author
-
Devi Parikh, Raphael Gontijo Lopes, Anoop Cherian, Gordon Wichern, Abhishek Das, Vincent Cartillier, Dhruv Batra, Jue Wang, Huda Alamri, Tim K. Marks, Takaaki Hori, Chiori Hori, and Irfan Essa
- Subjects
FOS: Computer and information sciences ,Sound (cs.SD) ,Computer Science - Computation and Language ,Computer science ,Computer Vision and Pattern Recognition (cs.CV) ,Feature extraction ,ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION ,Computer Science - Computer Vision and Pattern Recognition ,020206 networking & telecommunications ,02 engineering and technology ,Human behavior ,Computer Science - Sound ,Visualization ,End-to-end principle ,Audio and Speech Processing (eess.AS) ,Human–computer interaction ,FOS: Electrical engineering, electronic engineering, information engineering ,0202 electrical engineering, electronic engineering, information engineering ,Question answering ,020201 artificial intelligence & image processing ,Mel-frequency cepstrum ,Dialog box ,Computation and Language (cs.CL) ,Electrical Engineering and Systems Science - Audio and Speech Processing - Abstract
Dialog systems need to understand dynamic visual scenes in order to have conversations with users about the objects and events around them. Scene-aware dialog systems for real-world applications could be developed by integrating state-of-the-art technologies from multiple research areas, including: end-to-end dialog technologies, which generate system responses using models trained from dialog data; visual question answering (VQA) technologies, which answer questions about images using learned image features; and video description technologies, in which descriptions/captions are generated from videos using multimodal information. We introduce a new dataset of dialogs about videos of human behaviors. Each dialog is a typed conversation that consists of a sequence of 10 question-and-answer (QA) pairs between two Amazon Mechanical Turk (AMT) workers. In total, we collected dialogs on roughly 9,000 videos. Using this new dataset for Audio Visual Scene-aware dialog (AVSD), we trained an end-to-end conversation model that generates responses in a dialog about a video. Our experiments demonstrate that using multimodal features that were developed for multimodal attention-based video description enhances the quality of generated dialog about dynamic scenes (videos). Our dataset, model code and pretrained models will be publicly available for a new Video Scene-Aware Dialog challenge., A prototype system for the Audio Visual Scene-aware Dialog (AVSD) at DSTC7
- Published
- 2019
- Full Text
- View/download PDF
26. Language Model Integration Based on Memory Control for Sequence to Sequence Speech Recognition
- Author
-
Najim Dehak, Hirofumi Inaguma, Takaaki Hori, Murali Karthick Baskar, Shinji Watanabe, Jesús Villalba, and Jaejin Cho
- Subjects
FOS: Computer and information sciences ,Scheme (programming language) ,Sound (cs.SD) ,Computer science ,Speech recognition ,Inference ,Computer Science - Sound ,Set (abstract data type) ,Audio and Speech Processing (eess.AS) ,FOS: Electrical engineering, electronic engineering, information engineering ,Language model ,State (computer science) ,Transfer of learning ,computer ,Decoding methods ,Electrical Engineering and Systems Science - Audio and Speech Processing ,computer.programming_language - Abstract
In this paper, we explore several new schemes to train a seq2seq model to integrate a pre-trained LM. Our proposed fusion methods focus on the memory cell state and the hidden state in the seq2seq decoder long short-term memory (LSTM), and, unlike in prior studies, the memory cell state is updated by the LM. This means the memory retained by the main seq2seq model is adjusted by the external LM. These fusion methods have several variants depending on the architecture of this memory cell update and on the use of the memory cell and hidden states, which directly affect the final label inference. We performed experiments to show the effectiveness of the proposed methods in a monolingual ASR setup on the Librispeech corpus and in a transfer learning setup from a multilingual ASR (MLASR) base model to a low-resourced language. On Librispeech, our best model improved WER by 3.7% and 2.4% relative on test clean and test other, respectively, over the shallow fusion baseline, with multi-level decoding. In transfer learning from an MLASR base model to the IARPA Babel Swahili model, the best scheme improved the transferred model on the eval set by 9.9% and 9.8% relative in CER and WER, respectively, over the 2-stage transfer baseline., 4 pages, 1 figure, 5 tables, submitted to ICASSP 2019
- Published
- 2019
- Full Text
- View/download PDF
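Entry 26 differs from shallow fusion in that the external LM writes into the decoder LSTM's memory cell. The sketch below shows one plausible gated cell update to convey the idea; it is not the paper's exact fusion architecture, and all weights and dimensions are placeholders.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cell_state_fusion(dec_c, dec_h, lm_h, W_g, W_l):
    """Illustrative cell-control fusion: a gate computed from the decoder
    hidden state and the LM hidden state decides how much LM information is
    written into the decoder's memory cell (rather than only its output)."""
    gate = sigmoid(W_g @ np.concatenate([dec_h, lm_h]))   # element-wise gate
    new_c = dec_c + gate * np.tanh(W_l @ lm_h)            # LM writes into the cell
    new_h = np.tanh(new_c)                                # simplified read-out
    return new_c, new_h

d_dec, d_lm = 8, 12
c, h = np.zeros(d_dec), np.zeros(d_dec)
lm_hidden = np.random.randn(d_lm)
W_g = np.random.randn(d_dec, d_dec + d_lm)
W_l = np.random.randn(d_dec, d_lm)
c, h = cell_state_fusion(c, h, lm_hidden, W_g, W_l)
```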
27. SAGAS: Simulated annealing and greedy algorithm scheduler for laboratory automation
- Author
-
Yuya Arai, Ko Takahashi, Takaaki Horinouchi, Koichi Takahashi, and Haruka Ozaki
- Subjects
Scheduling ,Laboratory automation ,Time constraint by mutual boundaries (TCMB) ,Scheduling for laboratory automation in biology (S-LAB) problem ,Simulated annealing (SA) ,Greedy algorithm ,Biotechnology ,TP248.13-248.65 ,Medical technology ,R855-855.5 - Abstract
During laboratory automation of life science experiments, coordinating specialized instruments and human experimenters across various experimental procedures is important to minimize the execution time. In particular, the scheduling of life science experiments requires the consideration of time constraints by mutual boundaries (TCMB) and can be formulated as the "scheduling for laboratory automation in biology" (S-LAB) problem. However, existing scheduling methods for S-LAB problems have difficulty obtaining a feasible solution for large scheduling problems within a time short enough for real-time use. In this study, we proposed a fast schedule-finding method for S-LAB problems, SAGAS (Simulated annealing and greedy algorithm scheduler). SAGAS combines simulated annealing and a greedy algorithm to find a scheduling solution with the shortest possible execution time. We have performed scheduling on real experimental protocols and shown that SAGAS can find feasible or optimal solutions in practicable computation time for various S-LAB problems. Furthermore, the reduced computation time of SAGAS enables us to systematically search for the laboratory automation configuration with the minimum execution time by simulating scheduling for various laboratory configurations. This study provides a convenient scheduling method for life science automation laboratories and presents a new possibility for designing laboratory configurations.
- Published
- 2023
- Full Text
- View/download PDF
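SAGAS (entry 27) couples simulated annealing over an ordering with a greedy dispatcher that evaluates each candidate ordering. The sketch below applies that skeleton to a toy makespan problem; it ignores the TCMB constraints and mixed human/instrument resources of the real S-LAB setting, and the durations are invented.

```python
import math
import random

def greedy_makespan(order, durations, num_machines):
    """Greedy dispatch: send jobs, in the given order, to the machine that
    becomes free first; return the overall finish time (makespan)."""
    free_at = [0.0] * num_machines
    for job in order:
        m = min(range(num_machines), key=lambda i: free_at[i])
        free_at[m] += durations[job]
    return max(free_at)

def anneal_schedule(durations, num_machines, steps=5000, t0=1.0, alpha=0.999):
    """Simulated annealing over job orderings, with the greedy dispatcher
    scoring each ordering; returns the best ordering found and its makespan."""
    order = list(range(len(durations)))
    best = cur = greedy_makespan(order, durations, num_machines)
    best_order, temp = order[:], t0
    for _ in range(steps):
        i, j = random.sample(range(len(order)), 2)
        order[i], order[j] = order[j], order[i]            # propose a swap
        cand = greedy_makespan(order, durations, num_machines)
        if cand <= cur or random.random() < math.exp(-(cand - cur) / temp):
            cur = cand
            if cur < best:
                best, best_order = cur, order[:]
        else:
            order[i], order[j] = order[j], order[i]        # reject: undo swap
        temp *= alpha
    return best_order, best

durations = [3, 7, 2, 8, 4, 6, 5]       # hypothetical step durations (minutes)
print(anneal_schedule(durations, num_machines=2))
```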
28. CNN-based MultiChannel End-to-End Speech Recognition for everyday home environments
- Author
-
Takaaki Hori, Kazuhiro Nakadai, Nelson Yalta, Shinji Watanabe, and Tetsuya Ogata
- Subjects
FOS: Computer and information sciences ,Sound (cs.SD) ,Computer Science - Computation and Language ,Artificial neural network ,Channel (digital image) ,Time delay neural network ,Computer science ,Speech recognition ,Word error rate ,020206 networking & telecommunications ,02 engineering and technology ,White noise ,Convolutional neural network ,Computer Science - Sound ,End-to-end principle ,Audio and Speech Processing (eess.AS) ,0202 electrical engineering, electronic engineering, information engineering ,FOS: Electrical engineering, electronic engineering, information engineering ,020201 artificial intelligence & image processing ,Encoder ,Computation and Language (cs.CL) ,Electrical Engineering and Systems Science - Audio and Speech Processing - Abstract
Casual conversations involving multiple speakers and noises from surrounding devices are common in everyday environments, which degrades the performance of automatic speech recognition systems. These challenging characteristics of environments are the target of the CHiME-5 challenge. By employing a convolutional neural network (CNN)-based multichannel end-to-end speech recognition system, this study attempts to overcome the present difficulties in everyday environments. The system comprises an attention-based encoder-decoder neural network that directly generates text as output from a sound input. The multichannel CNN encoder, which uses residual connections and batch renormalization, is trained with augmented data, including white noise injection. The experimental results show that the word error rate is reduced by 8.5% and 0.6% absolute compared with a single-channel end-to-end system and the best baseline (LF-MMI TDNN) on the CHiME-5 corpus, respectively., 5 pages, 1 figure, EUSIPCO 2019
- Published
- 2018
29. Back-Translation-Style Data Augmentation for End-to-End ASR
- Author
-
Ramón Fernandez Astudillo, Kazuya Takeda, Shinji Watanabe, Tomoki Toda, Yu Zhang, Tomoki Hayashi, and Takaaki Hori
- Subjects
FOS: Computer and information sciences ,Paired Data ,Computer Science - Computation and Language ,Machine translation ,Computer science ,Speech recognition ,Feature extraction ,010501 environmental sciences ,computer.software_genre ,01 natural sciences ,Field (computer science) ,Data modeling ,030507 speech-language pathology & audiology ,03 medical and health sciences ,0305 other medical science ,Hidden Markov model ,Encoder ,computer ,Computation and Language (cs.CL) ,Decoding methods ,0105 earth and related environmental sciences - Abstract
In this paper we propose a novel data augmentation method for attention-based end-to-end automatic speech recognition (E2E-ASR), utilizing a large amount of text that is not paired with speech signals. Inspired by the back-translation technique proposed in the field of machine translation, we build a neural text-to-encoder model which predicts a sequence of hidden states extracted by a pre-trained E2E-ASR encoder from a sequence of characters. By using hidden states as a target instead of acoustic features, it is possible to achieve faster attention learning and reduce the computational cost, thanks to sub-sampling in the E2E-ASR encoder; the use of hidden states also avoids modeling speaker dependencies, unlike acoustic features. After training, the text-to-encoder model generates hidden states from a large amount of unpaired text, and the E2E-ASR decoder is then retrained using the generated hidden states as additional training data. Experimental evaluation using the LibriSpeech dataset demonstrates that our proposed method improves ASR performance and reduces the number of unknown words without the need for paired data.
- Published
- 2018
- Full Text
- View/download PDF
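The back-translation-style augmentation in entry 29 hinges on a text-to-encoder model that turns unpaired text into pseudo encoder-state sequences for decoder retraining. The sketch below replaces that trained network with a random embedding table purely to show the data flow; it is not the proposed model.

```python
import numpy as np

class ToyTextToEncoder:
    """Stand-in for the trained text-to-encoder network: maps a character
    sequence to a sequence of pseudo encoder hidden states (here via a random
    embedding table; the real model is a trained seq2seq network)."""
    def __init__(self, vocab, hidden_dim=8, seed=0):
        rng = np.random.default_rng(seed)
        self.table = {ch: rng.normal(size=hidden_dim) for ch in vocab}

    def __call__(self, text):
        return np.stack([self.table[ch] for ch in text])

vocab = list("abcdefghijklmnopqrstuvwxyz ")
tte = ToyTextToEncoder(vocab)

unpaired_text = ["hello world", "speech recognition"]
# Synthetic (encoder-state sequence, transcript) pairs used to retrain the
# E2E-ASR decoder, alongside the original paired speech data.
augmented = [(tte(t), t) for t in unpaired_text]
print(augmented[0][0].shape)   # (11, 8)
```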
30. Promising Accurate Prefix Boosting for sequence-to-sequence ASR
- Author
-
Murali Karthick Baskar, Takaaki Hori, Martin Karafiat, Lukas Burget, Shinji Watanabe, and Jan Cernocky
- Subjects
FOS: Computer and information sciences ,Computer Science - Machine Learning ,Sound (cs.SD) ,Computer Science - Computation and Language ,Boosting (machine learning) ,Computer science ,Speech recognition ,Word error rate ,020206 networking & telecommunications ,02 engineering and technology ,010501 environmental sciences ,01 natural sciences ,Computer Science - Sound ,Machine Learning (cs.LG) ,Prefix ,Discriminative model ,Audio and Speech Processing (eess.AS) ,0202 electrical engineering, electronic engineering, information engineering ,FOS: Electrical engineering, electronic engineering, information engineering ,Beam search ,Sequence learning ,Computation and Language (cs.CL) ,Electrical Engineering and Systems Science - Audio and Speech Processing ,0105 earth and related environmental sciences - Abstract
In this paper, we present promising accurate prefix boosting (PAPB), a discriminative training technique for attention-based sequence-to-sequence (seq2seq) ASR. PAPB is devised to unify the training and testing scheme in an effective manner. The training procedure involves maximizing the score of each partially correct sequence obtained during beam search compared to other hypotheses. The training objective also includes minimization of the token (character) error rate. PAPB shows its efficacy by achieving 10.8% and 3.8% WER with and without RNNLM, respectively, on the Wall Street Journal dataset.
- Published
- 2018
- Full Text
- View/download PDF
31. Multilingual sequence-to-sequence speech recognition: architecture, transfer learning, and language modeling
- Author
-
Matthew Wiesner, Martin Karafiat, Nelson Yalta, Shinji Watanabe, Takaaki Hori, Sri Harish Mallidi, Murali Karthick Baskar, Ruizhi Li, and Jaejin Cho
- Subjects
FOS: Computer and information sciences ,Computer Science - Machine Learning ,Sound (cs.SD) ,Computer science ,Speech recognition ,02 engineering and technology ,Lexicon ,Computer Science - Sound ,Data modeling ,Machine Learning (cs.LG) ,030507 speech-language pathology & audiology ,03 medical and health sciences ,Audio and Speech Processing (eess.AS) ,0202 electrical engineering, electronic engineering, information engineering ,FOS: Electrical engineering, electronic engineering, information engineering ,Sequence ,Computer Science - Computation and Language ,020206 networking & telecommunications ,Convolution (computer science) ,Recurrent neural network ,Language model ,0305 other medical science ,Transfer of learning ,Computation and Language (cs.CL) ,Decoding methods ,Electrical Engineering and Systems Science - Audio and Speech Processing - Abstract
The sequence-to-sequence (seq2seq) approach for low-resource ASR is a relatively new direction in speech research. The approach benefits from performing model training without using a lexicon or alignments. However, this poses a new problem of requiring more data compared to conventional DNN-HMM systems. In this work, we attempt to use data from 10 BABEL languages to build a multilingual seq2seq model as a prior model, and then port it to 4 other BABEL languages using a transfer learning approach. We also explore different architectures for improving the prior multilingual seq2seq model. The paper also discusses the effect of integrating a recurrent neural network language model (RNNLM) with a seq2seq model during decoding. Experimental results show that the transfer learning approach from the multilingual model yields substantial gains over monolingual models across all 4 BABEL languages. Incorporating an RNNLM also brings significant improvements in terms of %WER, and achieves recognition performance comparable to models trained with twice as much training data.
- Published
- 2018
- Full Text
- View/download PDF
32. Stream attention-based multi-array end-to-end speech recognition
- Author
-
Hynek Hermansky, Xiaofei Wang, Sri Harish Mallidi, Takaaki Hori, Ruizhi Li, and Shinji Watanabe
- Subjects
FOS: Computer and information sciences ,Sound (cs.SD) ,Computer Science - Computation and Language ,Microphone ,Computer science ,Speech recognition ,020206 networking & telecommunications ,02 engineering and technology ,Computer Science - Sound ,030507 speech-language pathology & audiology ,03 medical and health sciences ,End-to-end principle ,Robustness (computer science) ,Audio and Speech Processing (eess.AS) ,0202 electrical engineering, electronic engineering, information engineering ,FOS: Electrical engineering, electronic engineering, information engineering ,0305 other medical science ,Hidden Markov model ,Encoder ,Computation and Language (cs.CL) ,Decoding methods ,Electrical Engineering and Systems Science - Audio and Speech Processing - Abstract
Automatic Speech Recognition (ASR) using multiple microphone arrays has achieved great success in the far-field robustness. Taking advantage of all the information that each array shares and contributes is crucial in this task. Motivated by the advances of joint Connectionist Temporal Classification (CTC)/attention mechanism in the End-to-End (E2E) ASR, a stream attention-based multi-array framework is proposed in this work. Microphone arrays, acting as information streams, are activated by separate encoders and decoded under the instruction of both CTC and attention networks. In terms of attention, a hierarchical structure is adopted. On top of the regular attention networks, stream attention is introduced to steer the decoder toward the most informative encoders. Experiments have been conducted on AMI and DIRHA multi-array corpora using the encoder-decoder architecture. Compared with the best single-array results, the proposed framework has achieved relative Word Error Rates (WERs) reduction of 3.7% and 9.7% in the two datasets, respectively, which is better than conventional strategies as well., Comment: Submitted to ICASSP 2019
- Published
- 2018
- Full Text
- View/download PDF
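As a companion to the abstract above, the following sketch illustrates the hierarchical idea of stream attention: frame-level attention summarizes each array's encoder output, and a second attention assigns per-stream weights conditioned on the decoder state. Shapes, the dot-product scoring, and the fusion rule are simplifying assumptions, not the authors' exact network.

```python
# A minimal, illustrative sketch of hierarchical (stream) attention over
# multiple microphone-array encoders. Shapes and scoring functions are
# simplifying assumptions, not the authors' exact architecture.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def frame_attention(decoder_state, encoder_frames):
    # Regular (frame-level) dot-product attention within one stream.
    scores = encoder_frames @ decoder_state          # (T,)
    weights = softmax(scores)
    return weights @ encoder_frames                  # context vector (D,)

def stream_attention(decoder_state, stream_contexts):
    # Stream-level attention: weight each array's context vector by its
    # relevance to the current decoder state, then fuse.
    scores = np.array([c @ decoder_state for c in stream_contexts])
    weights = softmax(scores)
    fused = sum(w * c for w, c in zip(weights, stream_contexts))
    return fused, weights

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    D, T = 8, 20
    decoder_state = rng.normal(size=D)
    # Two microphone arrays -> two encoder output sequences.
    streams = [rng.normal(size=(T, D)) for _ in range(2)]
    contexts = [frame_attention(decoder_state, s) for s in streams]
    fused, w = stream_attention(decoder_state, contexts)
    print("stream weights:", w, "fused context shape:", fused.shape)
```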
33. ESPnet: End-to-End Speech Processing Toolkit
- Author
-
Jahn Heymann, Nanxin Chen, Nelson Yalta, Matthew Wiesner, Jiro Nishitoba, Shinji Watanabe, Yuya Unno, Shigeki Karita, Tsubasa Ochiai, Takaaki Hori, Adithya Renduchintala, and Tomoki Hayashi
- Subjects
FOS: Computer and information sciences ,Data processing ,Computer Science - Computation and Language ,business.industry ,Computer science ,Speech recognition ,Deep learning ,Feature extraction ,020206 networking & telecommunications ,02 engineering and technology ,Speech processing ,030507 speech-language pathology & audiology ,03 medical and health sciences ,Open source ,Software ,End-to-end principle ,0202 electrical engineering, electronic engineering, information engineering ,Artificial intelligence ,Architecture ,0305 other medical science ,business ,Computation and Language (cs.CL) - Abstract
This paper introduces a new open-source platform for end-to-end speech processing named ESPnet. ESPnet mainly focuses on end-to-end automatic speech recognition (ASR) and adopts widely used dynamic neural network toolkits, Chainer and PyTorch, as its main deep learning engines. ESPnet also follows the Kaldi ASR toolkit style for data processing, feature extraction/format, and recipes to provide a complete setup for speech recognition and other speech processing experiments. This paper explains the overall architecture of the software platform, several important functionalities that differentiate ESPnet from other open-source ASR toolkits, and experimental results on major ASR benchmarks.
- Published
- 2018
- Full Text
- View/download PDF
34. End-to-end Speech Recognition with Word-based RNN Language Models
- Author
-
Jaejin Cho, Takaaki Hori, and Shinji Watanabe
- Subjects
FOS: Computer and information sciences ,Vocabulary ,Computer Science - Computation and Language ,Computer science ,Computer Science - Artificial Intelligence ,Speech recognition ,media_common.quotation_subject ,020206 networking & telecommunications ,02 engineering and technology ,030507 speech-language pathology & audiology ,03 medical and health sciences ,Recurrent neural network ,Artificial Intelligence (cs.AI) ,Test set ,0202 electrical engineering, electronic engineering, information engineering ,Benchmark (computing) ,Language model ,0305 other medical science ,Hidden Markov model ,Computation and Language (cs.CL) ,Word (computer architecture) ,Decoding methods ,media_common - Abstract
This paper investigates the impact of word-based RNN language models (RNN-LMs) on the performance of end-to-end automatic speech recognition (ASR). In our prior work, we proposed a multi-level LM, in which character-based and word-based RNN-LMs are combined in hybrid CTC/attention-based ASR. Although this multi-level approach achieves significant error reduction on the Wall Street Journal (WSJ) task, two different LMs need to be trained and used for decoding, which increases the computational cost and memory usage. In this paper, we further propose a novel word-based RNN-LM that allows decoding with only the word-based LM: it provides look-ahead word probabilities to predict the next characters instead of relying on a character-based LM, leading to accuracy competitive with the multi-level LM at lower computational cost. We demonstrate the efficacy of the word-based RNN-LMs using a larger corpus, LibriSpeech, in addition to the WSJ corpus used in our prior work. Furthermore, we show that the proposed model achieves 5.1% WER on the WSJ Eval’92 test set when the vocabulary size is increased, which is the best WER reported for end-to-end ASR systems on this benchmark. A hedged code sketch of the look-ahead mechanism follows this record.
- Published
- 2018
- Full Text
- View/download PDF
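The look-ahead mechanism described above can be pictured as follows: the word-level LM assigns probabilities to whole words, and the probability of the next character is obtained by accumulating the probability mass of all vocabulary words consistent with the current within-word character prefix. The sketch below uses a toy unigram distribution in place of the RNN-LM; the vocabulary, probabilities, and the `</w>` boundary symbol are illustrative assumptions.

```python
# A minimal, illustrative sketch of look-ahead word probabilities: a
# word-level LM scores the next *character* by summing the probabilities
# of all vocabulary words consistent with the current character prefix.
# A toy unigram distribution stands in for the RNN-LM here.
from collections import defaultdict

WORD_LM = {            # stand-in for p(word | history) from an RNN-LM
    "cat": 0.4,
    "car": 0.3,
    "dog": 0.2,
    "do": 0.1,
}
END = "</w>"           # word-boundary symbol

def lookahead_char_probs(prefix):
    """Distribution over the next character given a within-word prefix."""
    mass = defaultdict(float)
    for word, p in WORD_LM.items():
        if word.startswith(prefix):
            if len(word) == len(prefix):
                mass[END] += p                   # the word can end here
            else:
                mass[word[len(prefix)]] += p     # mass for the next character
    total = sum(mass.values())
    return {c: p / total for c, p in mass.items()} if total else {}

if __name__ == "__main__":
    print(lookahead_char_probs(""))    # first character of a new word
    print(lookahead_char_probs("ca"))  # {'t': ~0.57, 'r': ~0.43}
    print(lookahead_char_probs("do"))  # 'g' vs. end-of-word
```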
35. A Purely End-to-End System for Multi-speaker Speech Recognition
- Author
-
Jonathan Le Roux, John R. Hershey, Shinji Watanabe, Hiroshi Seki, and Takaaki Hori
- Subjects
End to end system ,Sequence ,Training set ,Computer science ,Speech recognition ,020208 electrical & electronic engineering ,Contrast (statistics) ,02 engineering and technology ,Task (project management) ,030507 speech-language pathology & audiology ,03 medical and health sciences ,0202 electrical engineering, electronic engineering, information engineering ,Source separation ,0305 other medical science - Abstract
Recently, there has been growing interest in multi-speaker speech recognition, where the utterances of multiple speakers are recognized from their mixture. Promising techniques have been proposed for this task, but earlier works required additional training data such as isolated source signals or senone alignments for effective learning. In this paper, we propose a new sequence-to-sequence framework that directly decodes multiple label sequences from a single speech sequence by unifying source separation and speech recognition functions in an end-to-end manner. We further propose a new objective function that improves the contrast between the hidden vectors so as to avoid generating similar hypotheses. Experimental results show that the model directly learns a mapping from a speech mixture to multiple label sequences, achieving an 83.1% relative improvement compared to a model trained without the proposed objective. Interestingly, the results are comparable to those produced by previous end-to-end works featuring explicit separation and recognition modules. A hedged code sketch of a permutation-free multi-speaker objective follows this record.
- Published
- 2018
- Full Text
- View/download PDF
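Training a single network to emit several label sequences requires deciding which output stream should be scored against which reference. A common way to do this, sketched below under simplifying assumptions, is a permutation-free objective that takes the minimum total loss over all assignments, here combined with a toy "contrast" penalty on the hidden vectors in the spirit of the objective described above. The pairwise losses, penalty form, and weight are illustrative, not the authors' exact formulation.

```python
# A minimal, illustrative sketch of a permutation-free training objective
# for multi-speaker ASR, plus a hedged "contrast" penalty that discourages
# per-speaker hidden representations from collapsing onto each other.
from itertools import permutations
import numpy as np

def permutation_free_loss(pairwise_loss):
    """pairwise_loss[i][j]: loss of output stream i scored against reference j."""
    n = pairwise_loss.shape[0]
    # Choose the label assignment (permutation) with the lowest total loss.
    return min(
        (sum(pairwise_loss[i, perm[i]] for i in range(n)), perm)
        for perm in permutations(range(n))
    )

def contrast_penalty(hidden_vectors, weight=0.1):
    # Penalize similarity between speaker-wise hidden vectors by rewarding
    # (negative) squared distance, summed over speaker pairs.
    pen = 0.0
    n = len(hidden_vectors)
    for i in range(n):
        for j in range(i + 1, n):
            pen -= np.sum((hidden_vectors[i] - hidden_vectors[j]) ** 2)
    return weight * pen

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pairwise = np.array([[1.0, 5.0],    # output 0 matches reference 0
                         [4.0, 0.5]])   # output 1 matches reference 1
    loss, assignment = permutation_free_loss(pairwise)
    hidden = [rng.normal(size=16), rng.normal(size=16)]
    total = loss + contrast_penalty(hidden)
    print("assignment:", assignment, "loss:", loss, "total:", total)
```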
36. Analysis of Multilingual Sequence-to-Sequence speech recognition systems
- Author
-
Martin Karafiat, Murali Karthick Baskar, Shinji Watanabe, Takaaki Hori, Jan Cernocký, and Matthew Wiesner
- Subjects
FOS: Computer and information sciences ,Sequence ,Computer Science - Machine Learning ,Training set ,Computer Science - Computation and Language ,Computer science ,Speech recognition ,Machine Learning (cs.LG) ,Set (abstract data type) ,Language transfer ,Audio and Speech Processing (eess.AS) ,Component (UML) ,Feature (machine learning) ,FOS: Electrical engineering, electronic engineering, information engineering ,Layer (object-oriented design) ,Hidden Markov model ,Computation and Language (cs.CL) ,Electrical Engineering and Systems Science - Audio and Speech Processing - Abstract
This paper investigates the application of various multilingual approaches developed for conventional hidden Markov model (HMM) systems to sequence-to-sequence (seq2seq) automatic speech recognition (ASR). On a set composed of BABEL data, we first show the effectiveness of multilingual training with stacked bottle-neck (SBN) features. We then explore various architectures and training strategies for multilingual seq2seq models based on CTC-attention networks, including combinations of output-layer, CTC, and/or attention-component re-training. We also investigate the effectiveness of language-transfer learning in a very low-resource scenario in which the target language is not included in the original multilingual training data. Interestingly, we found multilingual features to be superior to multilingual models; this finding suggests that we can efficiently combine the benefits of the HMM system with the seq2seq system through these multilingual feature techniques. A hedged code sketch of the output-layer re-training step follows this record., Comment: arXiv admin note: text overlap with arXiv:1810.03459
- Published
- 2018
- Full Text
- View/download PDF
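One of the transfer strategies compared above is re-training only selected components (for example, the output layer) of a pretrained multilingual seq2seq model for a new target language. The sketch below shows what that might look like in PyTorch under simplifying assumptions; the tiny model, the layer sizes, and the choice to re-train only the output layer are illustrative, not the paper's exact setup.

```python
# A minimal, illustrative sketch of language-transfer for a seq2seq model:
# re-initialize the output layer for the new language's label set and
# fine-tune only the chosen components while the rest stays frozen.
import torch
import torch.nn as nn

class TinySeq2SeqEncoderDecoder(nn.Module):
    def __init__(self, feat_dim=40, hidden=64, n_labels=100):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.decoder = nn.LSTMCell(hidden, hidden)   # stand-in decoder
        self.output = nn.Linear(hidden, n_labels)    # label softmax layer

multilingual = TinySeq2SeqEncoderDecoder(n_labels=100)
# ... assume `multilingual` has been trained on the pooled multilingual data ...

# Port to a new target language with, say, 45 output labels.
target = multilingual
target.output = nn.Linear(64, 45)                    # re-initialized output layer

# Freeze everything except the components selected for re-training
# (here: output layer only; re-training CTC/attention components as well
# is the kind of variation compared in the paper).
for name, param in target.named_parameters():
    param.requires_grad = name.startswith("output")

trainable = [p for p in target.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-3)
print(sum(p.numel() for p in trainable), "trainable parameters")
```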
37. Attention-Based Multimodal Fusion for Video Description
- Author
-
Kazuhiko Sumi, Chiori Hori, Ziming Zhang, Teng-Yok Lee, Bret Harsham, John R. Hershey, Takaaki Hori, and Tim K. Marks
- Subjects
Artificial neural network ,Computer science ,business.industry ,Concatenation ,Feature extraction ,Cognitive neuroscience of visual object recognition ,Pattern recognition ,02 engineering and technology ,030507 speech-language pathology & audiology ,03 medical and health sciences ,Recurrent neural network ,Feature (computer vision) ,0202 electrical engineering, electronic engineering, information engineering ,020201 artificial intelligence & image processing ,Computer vision ,Relevance (information retrieval) ,Artificial intelligence ,0305 other medical science ,business ,Word (computer architecture) - Abstract
Current methods for video description are based on encoder-decoder sentence generation using recurrent neural networks (RNNs). Recent work has demonstrated the advantages of integrating temporal attention mechanisms into these models, in which the decoder network predicts each word in the description by selectively giving more weight to encoded features from specific time frames. Such methods typically use two different types of features: image features (from an object classification model) and motion features (from an action recognition model), combined by naive concatenation in the model input. Because different feature modalities may carry task-relevant information at different times, fusing them by naive concatenation may limit the model's ability to dynamically determine the relevance of each type of feature to different parts of the description. In this paper, we incorporate audio features in addition to the image and motion features. To fuse these three modalities, we introduce a multimodal attention model that can selectively utilize features from different modalities for each word in the output description. Combining our new multimodal attention model with standard temporal attention outperforms state-of-the-art methods on two standard datasets: YouTube2Text and MSR-VTT. A hedged code sketch of the multimodal attention fusion follows this record.
- Published
- 2017
- Full Text
- View/download PDF
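The multimodal attention model described above can be summarized as a two-level scheme: temporal attention summarizes each modality, and a modality-level attention then decides, per output word, how much to rely on image, motion, or audio features. The sketch below illustrates this under simplifying assumptions; the projections, the dot-product scoring, and the feature dimensions are illustrative, not the paper's exact model.

```python
# A minimal, illustrative sketch of multimodal attention for caption
# generation: each modality is summarized by temporal attention, projected
# to a shared space, and weighted by a second, modality-level attention
# conditioned on the decoder state.
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def temporal_attention(decoder_state, features, proj):
    # features: (T, D_mod); proj: (D_mod, D_shared)
    shared = features @ proj                       # (T, D_shared)
    weights = softmax(shared @ decoder_state)      # (T,)
    return weights @ shared                        # modality context (D_shared,)

def multimodal_attention(decoder_state, modality_contexts):
    scores = np.array([c @ decoder_state for c in modality_contexts])
    beta = softmax(scores)                         # modality weights per word
    fused = sum(b * c for b, c in zip(beta, modality_contexts))
    return fused, beta

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d_shared = 16
    decoder_state = rng.normal(size=d_shared)
    dims = {"image": 32, "motion": 24, "audio": 20}   # per-modality feature sizes
    contexts = []
    for name, d in dims.items():
        feats = rng.normal(size=(30, d))              # 30 time steps
        proj = rng.normal(size=(d, d_shared)) / np.sqrt(d)
        contexts.append(temporal_attention(decoder_state, feats, proj))
    fused, beta = multimodal_attention(decoder_state, contexts)
    print("modality weights:", dict(zip(dims, beta.round(3))))
```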
38. Advances in Joint CTC-Attention based End-to-End Speech Recognition with a Deep CNN Encoder and RNN-LM
- Author
-
Shinji Watanabe, Yu Zhang, Takaaki Hori, and William Chan
- Subjects
FOS: Computer and information sciences ,Computer Science - Computation and Language ,Computer science ,Speech recognition ,Process (computing) ,020206 networking & telecommunications ,02 engineering and technology ,Data_CODINGANDINFORMATIONTHEORY ,Convolutional neural network ,030507 speech-language pathology & audiology ,03 medical and health sciences ,End-to-end principle ,Connectionism ,0202 electrical engineering, electronic engineering, information engineering ,Beam search ,Language model ,0305 other medical science ,Joint (audio engineering) ,Encoder ,Computation and Language (cs.CL) - Abstract
We present a state-of-the-art end-to-end Automatic Speech Recognition (ASR) model. We learn to listen and write characters with a joint Connectionist Temporal Classification (CTC) and attention-based encoder-decoder network. The encoder is a deep Convolutional Neural Network (CNN) based on the VGG network. The CTC network sits on top of the encoder and is jointly trained with the attention-based decoder. During the beam search process, we combine the CTC predictions, the attention-based decoder predictions, and a separately trained LSTM language model. We achieve a 5-10% error reduction compared to prior systems on spontaneous Japanese and Chinese speech, and our end-to-end model outperforms traditional hybrid ASR systems. A hedged code sketch of a VGG-style convolutional front-end follows this record., Accepted for INTERSPEECH 2017
- Published
- 2017
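The encoder described above begins with a VGG-style convolutional front-end that reduces the time (and frequency) resolution of the input features before the recurrent layers. The sketch below shows a front-end of that general kind; the number of channels, layers, and the downsampling factor are illustrative assumptions, not the paper's exact configuration.

```python
# A minimal, illustrative sketch of a VGG-style convolutional front-end:
# two blocks of stacked 3x3 convolutions with 2x2 max-pooling reduce the
# time and frequency resolution by a factor of 4 before a recurrent encoder.
import torch
import torch.nn as nn

class VGGFrontEnd(nn.Module):
    def __init__(self, in_channels=1):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2, 2),                      # /2 in time and frequency
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(128, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2, 2),                      # /4 overall
        )

    def forward(self, x):
        # x: (batch, time, freq) log-mel features -> add a channel axis.
        h = self.features(x.unsqueeze(1))            # (B, 128, T/4, F/4)
        b, c, t, f = h.shape
        # Flatten channels and frequency so a recurrent encoder can consume it.
        return h.permute(0, 2, 1, 3).reshape(b, t, c * f)

if __name__ == "__main__":
    frontend = VGGFrontEnd()
    feats = torch.randn(2, 200, 80)                  # 2 utterances, 200 frames, 80 mels
    print(frontend(feats).shape)                     # torch.Size([2, 50, 2560])
```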
39. A Case of Multiple Small Intestinal Ulcers Induced by Adjuvant Therapy for Colon Cancer
- Author
-
Kenji Sugano, Ryoji Kaizaki, Takaaki Hori, Satoshi Takatsuka, Teruyuki Ikehara, and Yushi Fujiwara
- Subjects
Oncology ,medicine.medical_specialty ,Intestinal ulcers ,Colorectal cancer ,business.industry ,Internal medicine ,medicine ,Adjuvant therapy ,medicine.disease ,business ,Gastroenterology - Published
- 2014
- Full Text
- View/download PDF
40. Speech recognition in living rooms: Integrated speech enhancement and recognition system based on spatial, spectral and temporal modeling of sounds
- Author
-
Tomohiro Nakatani, Marc Delcroix, Masakiyo Fujimoto, Takuya Yoshioka, Seongjun Hahm, Mehrez Souden, Yotaro Kubo, Takaaki Hori, Shinji Watanabe, Atsushi Nakamura, Shoko Araki, Keisuke Kinoshita, Takanobu Oba, and Atsunori Ogawa
- Subjects
Voice activity detection ,Channel (digital image) ,Computer science ,Speech recognition ,Acoustic model ,Linear predictive coding ,Speech processing ,Theoretical Computer Science ,Compensation (engineering) ,Human-Computer Interaction ,Speech enhancement ,Noise ,Computer Science::Sound ,Software - Abstract
Research on noise-robust speech recognition has mainly focused on dealing with relatively stationary noise that may differ from the noise conditions in most living environments. In this paper, we introduce a recognition system that can recognize speech in the presence of multiple rapidly time-varying noise sources, as found in a typical family living room. To deal with such severe noise conditions, our recognition system exploits all available information about speech and noise, that is, spatial (directional), spectral, and temporal information. This is realized with a model-based speech enhancement pre-processor, which consists of two complementary elements: a multi-channel speech-noise separation method that exploits spatial and spectral information, followed by a single-channel enhancement algorithm that uses the long-term temporal characteristics of speech obtained from clean speech examples. Moreover, to compensate for any mismatch that may remain between the enhanced speech and the acoustic model, our system employs an adaptation technique that combines conventional maximum likelihood linear regression with dynamic adaptive compensation of the variance of the Gaussians of the acoustic model. Our proposed system approaches human performance levels by greatly improving the audible quality of speech and substantially improving the keyword recognition accuracy.
- Published
- 2013
- Full Text
- View/download PDF
41. Joint CTC/attention decoding for end-to-end speech recognition
- Author
-
Takaaki Hori, John R. Hershey, and Shinji Watanabe
- Subjects
Markov chain ,Computer science ,Speech recognition ,020206 networking & telecommunications ,02 engineering and technology ,Pronunciation ,Mandarin Chinese ,language.human_language ,030507 speech-language pathology & audiology ,03 medical and health sciences ,Tokenization (data security) ,0202 electrical engineering, electronic engineering, information engineering ,language ,0305 other medical science ,Hidden Markov model ,Joint (audio engineering) ,Decoding methods - Abstract
End-to-end automatic speech recognition (ASR) has become a popular alternative to conventional DNN/HMM systems because it avoids the need for linguistic resources such as a pronunciation dictionary, tokenization, and context-dependency trees, leading to a greatly simplified model-building process. There are two major types of end-to-end architectures for ASR: attention-based methods use an attention mechanism to perform alignment between acoustic frames and recognized symbols, while connectionist temporal classification (CTC) uses Markov assumptions to efficiently solve sequential problems by dynamic programming. This paper proposes a joint decoding algorithm for end-to-end ASR with a hybrid CTC/attention architecture, which effectively exploits the advantages of both during decoding. We applied the proposed method to two ASR benchmarks (spontaneous Japanese and Mandarin Chinese), showing performance comparable to conventional state-of-the-art DNN/HMM ASR systems without using linguistic resources. A hedged code sketch of the hybrid score combination follows this record.
- Published
- 2017
- Full Text
- View/download PDF
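Joint CTC/attention decoding ranks hypotheses by a weighted combination of the two scores, lambda * log p_ctc + (1 - lambda) * log p_att. The sketch below shows the idea in its simplest rescoring form with stand-in scoring functions; the scorers, the weight, and the n-best list are illustrative assumptions, not the authors' one-pass decoder.

```python
# A minimal, illustrative sketch of the hybrid scoring idea: candidate
# hypotheses are ranked by lambda * log p_ctc + (1 - lambda) * log p_att,
# shown here in its simple n-best rescoring form with stand-in scorers.

def ctc_logprob(hypothesis):
    # Stand-in for the CTC (prefix) log probability of the hypothesis.
    return -1.5 * len(hypothesis) - 0.1 * sum(hypothesis)

def attention_logprob(hypothesis):
    # Stand-in for the attention decoder log probability.
    return -1.2 * len(hypothesis) - 0.2 * max(hypothesis, default=0)

def joint_score(hypothesis, ctc_weight=0.3):
    return (ctc_weight * ctc_logprob(hypothesis)
            + (1.0 - ctc_weight) * attention_logprob(hypothesis))

if __name__ == "__main__":
    # N-best label sequences produced by an attention-based beam search.
    nbest = [[3, 1, 4], [3, 1], [2, 7, 1, 8]]
    for hyp in sorted(nbest, key=joint_score, reverse=True):
        print(hyp, round(joint_score(hyp), 3))
```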
42. Joint CTC-Attention based End-to-End Speech Recognition using Multi-task Learning
- Author
-
Shinji Watanabe, Takaaki Hori, and Suyoun Kim
- Subjects
FOS: Computer and information sciences ,Voice activity detection ,Computer Science - Computation and Language ,Noise measurement ,business.industry ,Computer science ,Speech recognition ,Word error rate ,Multi-task learning ,020206 networking & telecommunications ,02 engineering and technology ,Machine learning ,computer.software_genre ,030507 speech-language pathology & audiology ,03 medical and health sciences ,Connectionism ,Robustness (computer science) ,0202 electrical engineering, electronic engineering, information engineering ,Artificial intelligence ,0305 other medical science ,business ,Hidden Markov model ,Computation and Language (cs.CL) ,computer ,Decoding methods - Abstract
Recently, there has been increasing interest in end-to-end speech recognition that directly transcribes speech to text without any predefined alignments. One approach is the attention-based encoder-decoder framework, which learns a mapping between variable-length input and output sequences in one step using a purely data-driven method. The attention model has often been shown to improve performance over another end-to-end approach, Connectionist Temporal Classification (CTC), mainly because it explicitly uses the history of the target characters without any conditional independence assumptions. However, we observed that the attention model performs poorly in noisy conditions and is hard to train in the initial stage with long input sequences. This is because the attention model is too flexible to predict proper alignments in such cases, owing to the lack of the left-to-right constraints used in CTC. This paper presents a novel method for end-to-end speech recognition that improves robustness and achieves fast convergence by using a joint CTC-attention model within the multi-task learning framework, thereby mitigating the alignment issue. Experiments on the WSJ and CHiME-4 tasks demonstrate its advantages over both the CTC and attention-based encoder-decoder baselines, showing 5.4-14.6% relative improvements in Character Error Rate (CER). A hedged code sketch of the interpolated multi-task loss follows this record.
- Published
- 2016
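The multi-task objective described above interpolates a CTC loss computed on the encoder outputs with the attention decoder's cross-entropy loss, L = lambda * L_ctc + (1 - lambda) * L_att. The sketch below computes that interpolated loss on dummy tensors; the shapes, the stand-in logits, and the value of lambda are illustrative assumptions rather than the paper's exact configuration.

```python
# A minimal, illustrative sketch of the joint CTC-attention multi-task
# objective: L = lambda * L_ctc + (1 - lambda) * L_attention, computed
# here on random stand-in tensors.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
B, T, U, C = 2, 50, 10, 30      # batch, encoder frames, label length, labels (0 = blank)

# Stand-ins for network outputs on one minibatch.
encoder_logits = torch.randn(T, B, C)        # frame-wise logits for CTC (T, B, C)
decoder_logits = torch.randn(B, U, C)        # step-wise logits from the attention decoder
targets = torch.randint(1, C, (B, U))        # label sequences (0 is reserved for blank)

# CTC branch.
ctc_loss_fn = nn.CTCLoss(blank=0)
log_probs = F.log_softmax(encoder_logits, dim=-1)
input_lengths = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.full((B,), U, dtype=torch.long)
loss_ctc = ctc_loss_fn(log_probs, targets, input_lengths, target_lengths)

# Attention branch: per-step cross-entropy against the same labels.
loss_att = F.cross_entropy(decoder_logits.reshape(B * U, C), targets.reshape(B * U))

# Interpolated multi-task objective.
mtl_weight = 0.2                              # lambda; a typical small CTC weight
loss = mtl_weight * loss_ctc + (1.0 - mtl_weight) * loss_att
print(float(loss_ctc), float(loss_att), float(loss))
```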
43. Small bowel obstruction due to meckel's diverticulum complicated by a true enterolith
- Author
-
Masashi Takemura, Genya Hamano, Katsuyuki Mayumi, Takashi Ikebe, and Takaaki Hori
- Subjects
Bowel obstruction ,medicine.medical_specialty ,Meckel's diverticulum ,Enterolith ,business.industry ,Medicine ,Radiology ,business ,medicine.disease - Published
- 2012
- Full Text
- View/download PDF
44. Model Shrinkage for Discriminative Language Models
- Author
-
Atsushi Nakamura, Takanobu Oba, Takaaki Hori, and Akinori Ito
- Subjects
Computer science ,business.industry ,Speech recognition ,Linear model ,Feature selection ,Pattern recognition ,Discriminative model ,Artificial Intelligence ,Hardware and Architecture ,Computer Vision and Pattern Recognition ,Language model ,Artificial intelligence ,Electrical and Electronic Engineering ,business ,Software ,Shrinkage - Published
- 2012
- Full Text
- View/download PDF
45. Topic tracking language model for speech recognition
- Author
-
Tomoharu Iwata, Takaaki Hori, Atsushi Sako, Shinji Watanabe, and Yasuo Ariki
- Subjects
Topic model ,business.industry ,Computer science ,Speech recognition ,Speech corpus ,computer.software_genre ,Theoretical Computer Science ,Human-Computer Interaction ,Cache language model ,Speaking style ,Tracking (education) ,Artificial intelligence ,Language model ,business ,Adaptation (computer science) ,computer ,Software ,Natural language processing - Abstract
In a real environment, acoustic and language features often vary depending on the speakers, speaking styles and topic changes. To accommodate these changes, speech recognition approaches that include the incremental tracking of changing environments have attracted attention. This paper proposes a topic tracking language model that can adaptively track changes in topics based on current text information and previously estimated topic models in an on-line manner. The proposed model is applied to language model adaptation in speech recognition. We use the MIT OpenCourseWare corpus and Corpus of Spontaneous Japanese in speech recognition experiments, and show the effectiveness of the proposed method.
- Published
- 2011
- Full Text
- View/download PDF
46. A Case of Endocrine Cell Carcinoma (Submucosal Depressed) in the Rectum Resected but Indicating an Extremely Poor Prognosis
- Author
-
Takayoshi Nishioka, Genya Hamano, Katsuyuki Mayumi, Masanobu Terakura, Masashi Takemura, Takashi Ikebe, and Takaaki Hori
- Subjects
Extremely Poor ,medicine.medical_specialty ,medicine.anatomical_structure ,business.industry ,Internal medicine ,medicine ,Carcinoma ,Rectum ,Enteroendocrine cell ,medicine.disease ,business ,Gastroenterology - Published
- 2011
- Full Text
- View/download PDF
47. A Case of Residual Esophageal Necrosis After Lower Esophagectomy for Early Esophageal Cancer
- Author
-
Keiichiro Morimura, Takaaki Hori, Yushi Fujiwara, and Masashi Takemura
- Subjects
medicine.medical_specialty ,business.industry ,Esophagectomy ,Internal medicine ,medicine.medical_treatment ,General surgery ,medicine ,Esophageal cancer ,Esophageal necrosis ,medicine.disease ,business ,Gastroenterology - Published
- 2011
- Full Text
- View/download PDF
48. Effect of leg immersion in mild warm carbonated water on skin and muscle blood flow
- Author
-
Shigehiko Ogoh, Keisuke Ikeda, Takaaki Hori, Yoshiho Muraoka, Niels D. Olesen, Kazuya Suzuki, and Takuro Washio
- Subjects
Male ,Hot Temperature ,near-infrared spectroscopy ,Physiology ,Carbonation ,Vasodilation ,030204 cardiovascular system & hematology ,Doppler ultrasound ,Carbonated water ,Young Adult ,03 medical and health sciences ,Gastrocnemius muscle ,0302 clinical medicine ,Tap water ,Skin Physiological Phenomena ,Physiology (medical) ,Laser-Doppler Flowmetry ,medicine ,Humans ,Muscle, Skeletal ,Skin ,Leg ,Spectroscopy, Near-Infrared ,business.industry ,Chemistry ,Ultrasound ,Blood flow ,Laser Doppler velocimetry ,medicine.disease ,popliteal artery ,Carbonated Water ,Regional Blood Flow ,Anesthesia ,Arterial stiffness ,business ,030217 neurology & neurosurgery - Abstract
Leg immersion in carbonated water improves endothelial-mediated vasodilator function and decreases arterial stiffness, but the mechanism underlying this effect remains poorly defined. We hypothesized that carbonated water immersion increases muscle blood flow. To test this hypothesis, 10 men (age 21 ± 0 years; mean ± SD) underwent lower leg immersion in tap or carbonated water at 38°C. We evaluated gastrocnemius muscle oxyhemoglobin concentration and tissue oxygenation index using near-infrared spectroscopy, skin blood flow by laser Doppler flowmetry, and popliteal artery (PA) blood flow by duplex ultrasound. Immersion in carbonated, but not tap, water elevated PA blood flow (from 38 ± 14 to 83 ± 31 mL/min; P < 0.001) and skin blood flow (by 779 ± 312%, P < 0.001). In contrast, lower leg immersion elevated oxyhemoglobin concentration and tissue oxygenation index with no effect of carbonation (P = 0.529 and P = 0.495). In addition, the change in PA blood flow in response to immersion in carbonated water correlated with that of skin blood flow (P = 0.005) but not with oxyhemoglobin concentration (P = 0.765) or tissue oxygenation index (P = 0.136), while no relation was found for tap water immersion. These findings indicate that water carbonation has a minimal effect on muscle blood flow. Furthermore, PA blood flow increases in response to lower leg immersion in carbonated water likely owing to a large increase in skin blood flow.
- Published
- 2018
- Full Text
- View/download PDF
49. Improved Sequential Dependency Analysis Integrating Labeling-Based Sentence Boundary Detection
- Author
-
Takanobu Oba, Takaaki Hori, and Atsushi Nakamura
- Subjects
Conditional random field ,Boundary detection ,Sentence boundary disambiguation ,Dependency (UML) ,Computer science ,business.industry ,Speech recognition ,computer.software_genre ,Sequential dependency ,Artificial Intelligence ,Hardware and Architecture ,Computer Vision and Pattern Recognition ,Artificial intelligence ,Electrical and Electronic Engineering ,Element (category theory) ,business ,computer ,Software ,Sentence ,Natural language processing - Abstract
A dependency structure interprets modification relationships between words or phrases and is recognized as an important element in semantic information analysis. Conventional approaches for extracting this dependency structure assume that the complete sentence is known before the analysis starts. For spontaneous speech data, however, this assumption is not necessarily correct, since sentence boundaries are not marked in the data. Although sentence boundaries can be detected before dependency analysis, this cascaded implementation is not suitable for online processing since it delays the responses of the application. To solve these problems, we previously proposed a sequential dependency analysis (SDA) method for online spontaneous speech processing, which enabled us to analyze incomplete sentences sequentially and detect sentence boundaries simultaneously. In this paper, we propose an improved SDA integrating a labeling-based sentence boundary detection (SntBD) technique based on Conditional Random Fields (CRFs). In the new method, we use a CRF for soft decisions on sentence boundaries and combine it with SDA to retain its online framework. Since CRF-based SntBD yields better estimates of sentence boundaries, SDA can provide better results in which the dependency structure and sentence boundaries are consistent. Experimental results using spontaneous lecture speech from the Corpus of Spontaneous Japanese show that our improved SDA outperforms the original SDA in SntBD accuracy and provides better dependency analysis results.
- Published
- 2010
- Full Text
- View/download PDF
50. Structural Dependence on Excitation Energy Migration Processes in Artificial Light Harvesting Cyclic Zinc(II) Porphyrin Arrays
- Author
-
Sung Cho, Takaaki Hori, Min Chul Yoon, Atsuhiro Osuka, Pyosang Kim, Dongho Kim, and Naoki Aratani
- Subjects
Molecular Structure ,Metalloporphyrins ,Dimer ,Light-Harvesting Protein Complexes ,chemistry.chemical_element ,Zinc ,Photochemistry ,Porphyrin ,Molecular physics ,Surfaces, Coatings and Films ,chemistry.chemical_compound ,Cross-Linking Reagents ,Energy Transfer ,chemistry ,Biomimetic Materials ,Cyclization ,Femtosecond ,Ultrafast laser spectroscopy ,Materials Chemistry ,Molecule ,Physical and Theoretical Chemistry ,Anisotropy ,Excitation - Abstract
A series of covalently linked cyclic porphyrin arrays CNZ, each consisting of N/2 meso-meso directly linked zinc(II) porphyrin dimer subunits Z2 bridged by 1,3-phenylene spacers, has been prepared by an Ag(I)-promoted oxidative coupling reaction. We have investigated the excitation energy migration processes of CNZ in toluene using femtosecond transient absorption anisotropy decay measurements, taking 2Z2, composed of two Z2 units linked by 1,3-phenylene, as a reference molecule. On the basis of the excitation energy transfer rate determined for 2Z2, we have revealed the excitation energy hopping rates in the cyclic arrays CNZ using a regular polygon model. The number of excitation energy hopping sites N(flat) calculated with the regular polygon model is close to the observed N(expt) value obtained from the transient absorption anisotropy decays for C12Z-C18Z, which have circular and well-ordered structures. On the other hand, a large discrepancy between N(flat) and N(expt) was found for smaller or larger arrays (C10Z, C24Z, and C32Z). In the case of C10Z, the m-phenylene-linked 2Z2 motif, with an interchromophoric angle of 120 degrees, is not well suited to forming a cyclic pentagonal array based on a planar pentagonal structure. This geometrical factor inevitably causes a structural distortion in C10Z, leading to a discrepancy between N(expt) and N(flat). On the contrary, the presence of highly distorted conformers such as figure-eight structures reduces the number of effective hopping sites N(expt) in the large cyclic arrays C24Z and C32Z. Thus, our study demonstrates that not only the number of porphyrin chromophores in the cyclic arrays CNZ but also the overall rigidity and three-dimensional orientation of the molecular architecture are key factors to be considered in the preparation of artificial light-harvesting porphyrin arrays.
- Published
- 2009
- Full Text
- View/download PDF