Author: "Shi, Bowen" / Topic: computer science - computation and language - Searchworks@Jio Institute Digital Library Search Results

1. Learning Fine-Grained Controllability on Speech Generation via Efficient Fine-Tuning

Author: Chien, Chung-Ming, Tjandra, Andros, Vyas, Apoorv, Le, Matt, Shi, Bowen, and Hsu, Wei-Ning
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Computation and Language
Abstract: As the scale of generative models continues to grow, efficient reuse and adaptation of pre-trained models have become crucial considerations. In this work, we propose Voicebox Adapter, a novel approach that integrates fine-grained conditions into a pre-trained Voicebox speech generation model using a cross-attention module. To ensure a smooth integration of newly added modules with pre-trained ones, we explore various efficient fine-tuning approaches. Our experiment shows that the LoRA with bias-tuning configuration yields the best performance, enhancing controllability without compromising speech quality. Across three fine-grained conditional generation tasks, we demonstrate the effectiveness and resource efficiency of Voicebox Adapter. Follow-up experiments further highlight the robustness of Voicebox Adapter across diverse data setups., Comment: Accepted by InterSpeech 2024
Published: 2024

2. XLAVS-R: Cross-Lingual Audio-Visual Speech Representation Learning for Noise-Robust Speech Perception

Author: Han, HyoJung, Anwar, Mohamed, Pino, Juan, Hsu, Wei-Ning, Carpuat, Marine, Shi, Bowen, and Wang, Changhan
Subjects: Computer Science - Sound, Computer Science - Computation and Language, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Speech recognition and translation systems perform poorly on noisy inputs, which are frequent in realistic environments. Augmenting these systems with visual signals has the potential to improve robustness to noise. However, audio-visual (AV) data is only available in limited amounts and for fewer languages than audio-only resources. To address this gap, we present XLAVS-R, a cross-lingual audio-visual speech representation model for noise-robust speech recognition and translation in over 100 languages. It is designed to maximize the benefits of limited multilingual AV pre-training data, by building on top of audio-only multilingual pre-training and simplifying existing pre-training schemes. Extensive evaluation on the MuAViC benchmark shows the strength of XLAVS-R on downstream audio-visual speech recognition and translation tasks, where it outperforms the previous state of the art by up to 18.5% WER and 4.7 BLEU given noisy AV inputs, and enables strong zero-shot audio-visual ability with audio-only fine-tuning., Comment: ACL2024
Published: 2024

3. Towards Privacy-Aware Sign Language Translation at Scale

Author: Rust, Phillip, Shi, Bowen, Wang, Skyler, Camgöz, Necati Cihan, and Maillard, Jean
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: A major impediment to the advancement of sign language translation (SLT) is data scarcity. Much of the sign language data currently available on the web cannot be used for training supervised models due to the lack of aligned captions. Furthermore, scaling SLT using large-scale web-scraped datasets bears privacy risks due to the presence of biometric information, which the responsible development of SLT technologies should account for. In this work, we propose a two-stage framework for privacy-aware SLT at scale that addresses both of these issues. We introduce SSVP-SLT, which leverages self-supervised video pretraining on anonymized and unannotated videos, followed by supervised SLT finetuning on a curated parallel dataset. SSVP-SLT achieves state-of-the-art finetuned and zero-shot gloss-free SLT performance on the How2Sign dataset, outperforming the strongest respective baselines by over 3 BLEU-4. Based on controlled experiments, we further discuss the advantages and limitations of self-supervised pretraining and anonymization via facial obfuscation for SLT., Comment: ACL 2024
Published: 2024

4. Generative Pre-training for Speech with Flow Matching

Author: Liu, Alexander H., Le, Matt, Vyas, Apoorv, Shi, Bowen, Tjandra, Andros, and Hsu, Wei-Ning
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Computation and Language, Computer Science - Machine Learning, Computer Science - Sound
Abstract: Generative models have gained more and more attention in recent years for their remarkable success in tasks that require estimating and sampling data distribution to generate high-fidelity synthetic data. In speech, text-to-speech synthesis and neural vocoders are good examples where generative models have shined. While generative models have been applied to different applications in speech, there exists no general-purpose generative model that models speech directly. In this work, we take a step toward this direction by showing a single pre-trained generative model can be adapted to different downstream tasks with strong performance. Specifically, we pre-trained a generative model, named SpeechFlow, on 60k hours of untranscribed speech with Flow Matching and masked conditions. Experiment results show the pre-trained generative model can be fine-tuned with task-specific data to match or surpass existing expert models on speech enhancement, separation, and synthesis. Our work suggests that a foundational model for generation tasks in speech can be built with generative pre-training., Comment: ICLR 2024
Published: 2023

5. Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning

Author: Yu, Lili, Shi, Bowen, Pasunuru, Ramakanth, Muller, Benjamin, Golovneva, Olga, Wang, Tianlu, Babu, Arun, Tang, Binh, Karrer, Brian, Sheynin, Shelly, Ross, Candace, Polyak, Adam, Howes, Russell, Sharma, Vasu, Xu, Puxin, Tamoyan, Hovhannes, Ashual, Oron, Singer, Uriel, Li, Shang-Wen, Zhang, Susan, James, Richard, Ghosh, Gargi, Taigman, Yaniv, Fazel-Zarandi, Maryam, Celikyilmaz, Asli, Zettlemoyer, Luke, and Aghajanyan, Armen
Subjects: Computer Science - Machine Learning, Computer Science - Computation and Language, Computer Science - Computer Vision and Pattern Recognition
Abstract: We present CM3Leon (pronounced "Chameleon"), a retrieval-augmented, token-based, decoder-only multi-modal language model capable of generating and infilling both text and images. CM3Leon uses the CM3 multi-modal architecture but additionally shows the extreme benefits of scaling up and tuning on more diverse instruction-style data. It is the first multi-modal model trained with a recipe adapted from text-only language models, including a large-scale retrieval-augmented pre-training stage and a second multi-task supervised fine-tuning (SFT) stage. It is also a general-purpose model that can do both text-to-image and image-to-text generation, allowing us to introduce self-contained contrastive decoding methods that produce high-quality outputs. Extensive experiments demonstrate that this recipe is highly effective for multi-modal models. CM3Leon achieves state-of-the-art performance in text-to-image generation with 5x less training compute than comparable methods (zero-shot MS-COCO FID of 4.88). After SFT, CM3Leon can also demonstrate unprecedented levels of controllability in tasks ranging from language-guided image editing to image-controlled generation and segmentation.
Published: 2023

6. Toward American Sign Language Processing in the Real World: Data, Tasks, and Methods

Author: Shi, Bowen
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computation and Language
Abstract: Sign language, which conveys meaning through gestures, is the chief means of communication among deaf people. Recognizing sign language in natural settings presents significant challenges due to factors such as lighting, background clutter, and variations in signer characteristics. In this thesis, I study automatic sign language processing in the wild, using signing videos collected from the Internet. This thesis contributes new datasets, tasks, and methods. Most chapters of this thesis address tasks related to fingerspelling, an important component of sign language that has not been studied widely by prior work. I present three new large-scale ASL datasets in the wild: ChicagoFSWild, ChicagoFSWild+, and OpenASL. Using ChicagoFSWild and ChicagoFSWild+, I address fingerspelling recognition, which consists of transcribing fingerspelling sequences into text. I propose an end-to-end approach based on iterative attention that allows recognition from a raw video without explicit hand detection. I further show that using a Conformer-based network jointly modeling handshape and mouthing can bring performance close to that of humans. Next, I propose two tasks for building real-world fingerspelling-based applications: fingerspelling detection and search. For fingerspelling detection, I introduce a suite of evaluation metrics and a new detection model via multi-task training. To address the problem of searching for fingerspelled keywords in raw sign language videos, I propose a novel method that jointly localizes and matches fingerspelling segments to text. Finally, I describe a benchmark for large-vocabulary open-domain sign language translation based on OpenASL. To address the challenges of sign language translation in realistic settings, I propose a set of techniques including sign search as a pretext task for pre-training and fusion of mouthing and handshape features., Comment: PhD thesis
Published: 2023

7. EXPRESSO: A Benchmark and Analysis of Discrete Expressive Speech Resynthesis

Author: Nguyen, Tu Anh, Hsu, Wei-Ning, D'Avirro, Antony, Shi, Bowen, Gat, Itai, Fazel-Zarandi, Maryam, Remez, Tal, Copet, Jade, Synnaeve, Gabriel, Hassid, Michael, Kreuk, Felix, Adi, Yossi, and Dupoux, Emmanuel
Subjects: Computer Science - Computation and Language, Computer Science - Machine Learning, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Recent work has shown that it is possible to resynthesize high-quality speech based, not on text, but on low bitrate discrete units that have been learned in a self-supervised fashion and can therefore capture expressive aspects of speech that are hard to transcribe (prosody, voice styles, non-verbal vocalization). The adoption of these methods is still limited by the fact that most speech synthesis datasets are read, severely limiting spontaneity and expressivity. Here, we introduce Expresso, a high-quality expressive speech dataset for textless speech synthesis that includes both read speech and improvised dialogues rendered in 26 spontaneous expressive styles. We illustrate the challenges and potentials of this dataset with an expressive resynthesis benchmark where the task is to encode the input in low-bitrate units and resynthesize it in a target voice while preserving content and style. We evaluate resynthesis quality with automatic metrics for different self-supervised discrete encoders, and explore tradeoffs between quality, bitrate and invariance to speaker and style. All the dataset, evaluation metrics and baseline models are open source
Published: 2023

8. Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale

Author: Le, Matthew, Vyas, Apoorv, Shi, Bowen, Karrer, Brian, Sari, Leda, Moritz, Rashel, Williamson, Mary, Manohar, Vimal, Adi, Yossi, Mahadeokar, Jay, and Hsu, Wei-Ning
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Computation and Language, Computer Science - Machine Learning, Computer Science - Sound
Abstract: Large-scale generative models such as GPT and DALL-E have revolutionized the research community. These models not only generate high fidelity outputs, but are also generalists which can solve tasks not explicitly taught. In contrast, speech generative models are still primitive in terms of scale and task generalization. In this paper, we present Voicebox, the most versatile text-guided generative model for speech at scale. Voicebox is a non-autoregressive flow-matching model trained to infill speech, given audio context and text, trained on over 50K hours of speech that are not filtered or enhanced. Similar to GPT, Voicebox can perform many different tasks through in-context learning, but is more flexible as it can also condition on future context. Voicebox can be used for mono or cross-lingual zero-shot text-to-speech synthesis, noise removal, content editing, style conversion, and diverse sample generation. In particular, Voicebox outperforms the state-of-the-art zero-shot TTS model VALL-E on both intelligibility (5.9% vs 1.9% word error rates) and audio similarity (0.580 vs 0.681) while being up to 20 times faster. Audio samples can be found in https://voicebox.metademolab.com., Comment: Accepted to NeurIPS 2023
Published: 2023

9. Prompt to GPT-3: Step-by-Step Thinking Instructions for Humor Generation

Author: Chen, Yuetian, Shi, Bowen, and Si, Mei
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Artificial intelligence has made significant progress in natural language processing, with models like GPT-3 demonstrating impressive capabilities. However, these models still have limitations when it comes to complex tasks that require an understanding of the user, such as mastering human comedy writing strategies. This paper explores humor generation using GPT-3 by modeling human comedy writing theory and leveraging step-by-step thinking instructions. In addition, we explore the role of cognitive distance in creating humor., Comment: 5 pages, 1 figure; ICCC '23 preprint
Published: 2023

10. Scaling Speech Technology to 1,000+ Languages

Author: Pratap, Vineel, Tjandra, Andros, Shi, Bowen, Tomasello, Paden, Babu, Arun, Kundu, Sayani, Elkahky, Ali, Ni, Zhaoheng, Vyas, Apoorv, Fazel-Zarandi, Maryam, Baevski, Alexei, Adi, Yossi, Zhang, Xiaohui, Hsu, Wei-Ning, Conneau, Alexis, and Auli, Michael
Subjects: Computer Science - Computation and Language, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Expanding the language coverage of speech technology has the potential to improve access to information for many more people. However, current speech technology is restricted to about one hundred languages which is a small fraction of the over 7,000 languages spoken around the world. The Massively Multilingual Speech (MMS) project increases the number of supported languages by 10-40x, depending on the task. The main ingredients are a new dataset based on readings of publicly available religious texts and effectively leveraging self-supervised learning. We built pre-trained wav2vec 2.0 models covering 1,406 languages, a single multilingual automatic speech recognition model for 1,107 languages, speech synthesis models for the same number of languages, as well as a language identification model for 4,017 languages. Experiments show that our multilingual speech recognition model more than halves the word error rate of Whisper on 54 languages of the FLEURS benchmark while being trained on a small fraction of the labeled data.
Published: 2023

11. MuAViC: A Multilingual Audio-Visual Corpus for Robust Speech Recognition and Robust Speech-to-Text Translation

Author: Anwar, Mohamed, Shi, Bowen, Goswami, Vedanuj, Hsu, Wei-Ning, Pino, Juan, and Wang, Changhan
Subjects: Computer Science - Computation and Language, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: We introduce MuAViC, a multilingual audio-visual corpus for robust speech recognition and robust speech-to-text translation providing 1200 hours of audio-visual speech in 9 languages. It is fully transcribed and covers 6 English-to-X translation as well as 6 X-to-English translation directions. To the best of our knowledge, this is the first open benchmark for audio-visual speech-to-text translation and the largest open benchmark for multilingual audio-visual speech recognition. Our baseline results show that MuAViC is effective for building noise-robust speech recognition and translation models. We make the corpus available at https://github.com/facebookresearch/muavic.
Published: 2023

12. Visual Story Generation Based on Emotion and Keywords

Author: Chen, Yuetian, Li, Ruohua, Shi, Bowen, Liu, Peiru, and Si, Mei
Subjects: Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: Automated visual story generation aims to produce stories with corresponding illustrations that exhibit coherence, progression, and adherence to characters' emotional development. This work proposes a story generation pipeline to co-create visual stories with the users. The pipeline allows the user to control events and emotions on the generated content. The pipeline includes two parts: narrative and image generation. For narrative generation, the system generates the next sentence using user-specified keywords and emotion labels. For image generation, diffusion models are used to create a visually appealing image corresponding to each generated sentence. Further, object recognition is applied to the generated images to allow objects in these images to be mentioned in future story development., Comment: 8 pages, 8 figures, AIIDE INT 2022
Published: 2023

13. Comparative layer-wise analysis of self-supervised speech models

Author: Pasad, Ankita, Shi, Bowen, and Livescu, Karen
Subjects: Computer Science - Computation and Language, Computer Science - Machine Learning, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Many self-supervised speech models, varying in their pre-training objective, input modality, and pre-training data, have been proposed in the last few years. Despite impressive successes on downstream tasks, we still have a limited understanding of the properties encoded by the models and the differences across models. In this work, we examine the intermediate representations for a variety of recent models. Specifically, we measure acoustic, phonetic, and word-level properties encoded in individual layers, using a lightweight analysis tool based on canonical correlation analysis (CCA). We find that these properties evolve across layers differently depending on the model, and the variations relate to the choice of pre-training objective. We further investigate the utility of our analyses for downstream tasks by comparing the property trends with performance on speech recognition and spoken language understanding tasks. We discover that CCA trends provide reliable guidance to choose layers of interest for downstream tasks and that single-layer performance often matches or improves upon using all layers, suggesting implications for more efficient use of pre-trained models., Comment: Accepted to ICASSP 2023. Code: https://github.com/ankitapasad/layerwise-analysis
Published: 2022

14. u-HuBERT: Unified Mixed-Modal Speech Pretraining And Zero-Shot Transfer to Unlabeled Modality

Author: Hsu, Wei-Ning and Shi, Bowen
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing, Electrical Engineering and Systems Science - Image and Video Processing
Abstract: While audio-visual speech models can yield superior performance and robustness compared to audio-only models, their development and adoption are hindered by the lack of labeled and unlabeled audio-visual data and the cost to deploy one model per modality. In this paper, we present u-HuBERT, a self-supervised pre-training framework that can leverage both multimodal and unimodal speech with a unified masked cluster prediction objective. By utilizing modality dropout during pre-training, we demonstrate that a single fine-tuned model can achieve performance on par or better than the state-of-the-art modality-specific models. Moreover, our model fine-tuned only on audio can perform well with audio-visual and visual speech input, achieving zero-shot modality generalization for multiple speech processing tasks. In particular, our single model yields 1.2%/1.4%/27.2% speech recognition word error rate on LRS3 with audio-visual/audio/visual input. Codes and models are available at https://github.com/facebookresearch/av_hubert, Comment: NeurIPS 2022
Published: 2022

15. Open-Domain Sign Language Translation Learned from Online Video

Author: Shi, Bowen, Brentari, Diane, Shakhnarovich, Greg, and Livescu, Karen
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computation and Language
Abstract: Existing work on sign language translation - that is, translation from sign language videos into sentences in a written language - has focused mainly on (1) data collected in a controlled environment or (2) data in a specific domain, which limits the applicability to real-world settings. In this paper, we introduce OpenASL, a large-scale American Sign Language (ASL) - English dataset collected from online video sites (e.g., YouTube). OpenASL contains 288 hours of ASL videos in multiple domains from over 200 signers and is the largest publicly available ASL translation dataset to date. To tackle the challenges of sign language translation in realistic settings and without glosses, we propose a set of techniques including sign search as a pretext task for pre-training and fusion of mouthing and handshape features. The proposed techniques produce consistent and large improvements in translation quality, over baseline models based on prior work. Our data and code are publicly available at https://github.com/chevalierNoir/OpenASL, Comment: EMNLP 2022
Published: 2022

16. Searching for fingerspelled content in American Sign Language

Author: Shi, Bowen, Brentari, Diane, Shakhnarovich, Greg, and Livescu, Karen
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computation and Language
Abstract: Natural language processing for sign language video - including tasks like recognition, translation, and search - is crucial for making artificial intelligence technologies accessible to deaf individuals, and is gaining research interest in recent years. In this paper, we address the problem of searching for fingerspelled key-words or key phrases in raw sign language videos. This is an important task since significant content in sign language is often conveyed via fingerspelling, and to our knowledge the task has not been studied before. We propose an end-to-end model for this task, FSS-Net, that jointly detects fingerspelling and matches it to a text sequence. Our experiments, done on a large public dataset of ASL fingerspelling in the wild, show the importance of fingerspelling detection as a component of a search and retrieval model. Our model significantly outperforms baseline methods adapted from prior work on related tasks, Comment: ACL 2022
Published: 2022

17. Fingerspelling Detection in American Sign Language

Author: Shi, Bowen, Brentari, Diane, Shakhnarovich, Greg, and Livescu, Karen
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computation and Language
Abstract: Fingerspelling, in which words are signed letter by letter, is an important component of American Sign Language. Most previous work on automatic fingerspelling recognition has assumed that the boundaries of fingerspelling regions in signing videos are known beforehand. In this paper, we consider the task of fingerspelling detection in raw, untrimmed sign language videos. This is an important step towards building real-world fingerspelling recognition systems. We propose a benchmark and a suite of evaluation metrics, some of which reflect the effect of detection on the downstream fingerspelling recognition task. In addition, we propose a new model that learns to detect fingerspelling via multi-task training, incorporating pose estimation and fingerspelling recognition (transcription) along with detection, and compare this model to several alternatives. The model outperforms all alternative approaches across all metrics, establishing a state of the art on the benchmark., Comment: CVPR 2021
Published: 2021

18. Whole-Word Segmental Speech Recognition with Acoustic Word Embeddings

Author: Shi, Bowen, Settle, Shane, and Livescu, Karen
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Computation and Language, Computer Science - Sound
Abstract: Segmental models are sequence prediction models in which scores of hypotheses are based on entire variable-length segments of frames. We consider segmental models for whole-word ("acoustic-to-word") speech recognition, with the feature vectors defined using vector embeddings of segments. Such models are computationally challenging as the number of paths is proportional to the vocabulary size, which can be orders of magnitude larger than when using subword units like phones. We describe an efficient approach for end-to-end whole-word segmental models, with forward-backward and Viterbi decoding performed on a GPU and a simple segment scoring function that reduces space complexity. In addition, we investigate the use of pre-training via jointly trained acoustic word embeddings (AWEs) and acoustically grounded word embeddings (AGWEs) of written word labels. We find that word error rate can be reduced by a large margin by pre-training the acoustic segment representation with AWEs, and additional (smaller) gains can be obtained by pre-training the word prediction layer with AGWEs. Our final models improve over prior A2W models., Comment: SLT 2021
Published: 2020

19. A Cross-Task Analysis of Text Span Representations

Author: Toshniwal, Shubham, Shi, Haoyue, Shi, Bowen, Gao, Lingyu, Livescu, Karen, and Gimpel, Kevin
Subjects: Computer Science - Computation and Language
Abstract: Many natural language processing (NLP) tasks involve reasoning with textual spans, including question answering, entity recognition, and coreference resolution. While extensive research has focused on functional architectures for representing words and sentences, there is less work on representing arbitrary spans of text within sentences. In this paper, we conduct a comprehensive empirical evaluation of six span representation methods using eight pretrained language representation models across six tasks, including two tasks that we introduce. We find that, although some simple span representations are fairly reliable across tasks, in general the optimal span representation varies by task, and can also vary within different facets of individual tasks. We also find that the choice of span representation has a bigger impact with a fixed pretrained encoder than with a fine-tuned encoder., Comment: RepL4NLP 2020
Published: 2020

20. Fingerspelling recognition in the wild with iterative visual attention

Author: Shi, Bowen, Del Rio, Aurora Martinez, Keane, Jonathan, Brentari, Diane, Shakhnarovich, Greg, and Livescu, Karen
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computation and Language
Abstract: Sign language recognition is a challenging gesture sequence recognition problem, characterized by quick and highly coarticulated motion. In this paper we focus on recognition of fingerspelling sequences in American Sign Language (ASL) videos collected in the wild, mainly from YouTube and Deaf social media. Most previous work on sign language recognition has focused on controlled settings where the data is recorded in a studio environment and the number of signers is limited. Our work aims to address the challenges of real-life data, reducing the need for detection or segmentation modules commonly used in this domain. We propose an end-to-end model based on an iterative attention mechanism, without explicit hand detection or segmentation. Our approach dynamically focuses on increasingly high-resolution regions of interest. It outperforms prior work by a large margin. We also introduce a newly collected data set of crowdsourced annotations of fingerspelling in the wild, and show that performance can be further improved with this additional data set., Comment: ICCV 2019
Published: 2019

21. Compression of Acoustic Event Detection Models with Low-rank Matrix Factorization and Quantization Training

Author: Shi, Bowen, Sun, Ming, Kao, Chieh-Chi, Rozgic, Viktor, Matsoukas, Spyros, and Wang, Chao
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Computation and Language, Computer Science - Sound
Abstract: In this paper, we present a compression approach based on the combination of low-rank matrix factorization and quantization training, to reduce complexity for neural network based acoustic event detection (AED) models. Our experimental results show this combined compression approach is very effective. For a three-layer long short-term memory (LSTM) based AED model, the original model size can be reduced to 1% with negligible loss of accuracy. Our approach makes it feasible to deploy AED in resource-constrained applications., Comment: NeurIPS 2018 CDNNRIA workshop
Published: 2019

22. On the Contributions of Visual and Textual Supervision in Low-Resource Semantic Speech Retrieval

Author: Pasad, Ankita, Shi, Bowen, Kamper, Herman, and Livescu, Karen
Subjects: Computer Science - Computation and Language, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Recent work has shown that speech paired with images can be used to learn semantically meaningful speech representations even without any textual supervision. In real-world low-resource settings, however, we often have access to some transcribed speech. We study whether and how visual grounding is useful in the presence of varying amounts of textual supervision. In particular, we consider the task of semantic speech retrieval in a low-resource setting. We use a previously studied data set and task, where models are trained on images with spoken captions and evaluated on human judgments of semantic relevance. We propose a multitask learning approach to leverage both visual and textual modalities, with visual supervision in the form of keyword probabilities from an external tagger. We find that visual grounding is helpful even in the presence of textual supervision, and we analyze this effect over a range of sizes of transcribed data sets. With ~5 hours of transcribed speech, we obtain 23% higher average precision when also using visual supervision.
Published: 2019

23. American Sign Language fingerspelling recognition in the wild

Author: Shi, Bowen, Del Rio, Aurora Martinez, Keane, Jonathan, Michaux, Jonathan, Brentari, Diane, Shakhnarovich, Greg, and Livescu, Karen
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computation and Language
Abstract: We address the problem of American Sign Language fingerspelling recognition in the wild, using videos collected from websites. We introduce the largest data set available so far for the problem of fingerspelling recognition, and the first using naturally occurring video data. Using this data set, we present the first attempt to recognize fingerspelling sequences in this challenging setting. Unlike prior work, our video data is extremely challenging due to low frame rates and visual variability. To tackle the visual challenges, we train a special-purpose signing hand detector using a small subset of our data. Given the hand detector output, a sequence model decodes the hypothesized fingerspelled letter sequence. For the sequence model, we explore attention-based recurrent encoder-decoders and CTC-based approaches. As the first attempt at fingerspelling recognition in the wild, this work is intended to serve as a baseline for future work on sign language recognition in realistic conditions. We find that, as expected, letter error rates are much higher than in previous work on more controlled data, and we analyze the sources of error and effects of model variants., Comment: accepted in SLT 2018
Published: 2018

24. Multitask training with unlabeled data for end-to-end sign language fingerspelling recognition

Author: Shi, Bowen and Livescu, Karen
Subjects: Computer Science - Computation and Language, Computer Science - Computer Vision and Pattern Recognition
Abstract: We address the problem of automatic American Sign Language fingerspelling recognition from video. Prior work has largely relied on frame-level labels, hand-crafted features, or other constraints, and has been hampered by the scarcity of data for this task. We introduce a model for fingerspelling recognition that addresses these issues. The model consists of an auto-encoder-based feature extractor and an attention-based neural encoder-decoder, which are trained jointly. The model receives a sequence of image frames and outputs the fingerspelled word, without relying on any frame-level training labels or hand-crafted features. In addition, the auto-encoder subcomponent makes it possible to leverage unlabeled data to improve the feature learning. The model achieves 11.6% and 4.4% absolute letter accuracy improvement respectively in signer-independent and signer-adapted fingerspelling recognition over previous approaches that required frame-level training labels.
Published: 2017

24 results on '"Shi, Bowen"'
