Author: "Shi, Bowen" - Searchworks@Jio Institute Digital Library Search Results

Search keyword "Shi, Bowen": 809 results in total

1. High Fidelity Text-Guided Music Generation and Editing via Single-Stage Flow Matching

Author: Lan, Gael Le, Shi, Bowen, Ni, Zhaoheng, Srinivasan, Sidd, Kumar, Anurag, Ellis, Brian, Kant, David, Nagaraja, Varun, Chang, Ernie, Hsu, Wei-Ning, Shi, Yangyang, and Chandra, Vikas
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: We introduce a simple and efficient text-controllable high-fidelity music generation and editing model. It operates on sequences of continuous latent representations from a low frame rate 48 kHz stereo variational auto encoder codec that eliminates the information loss drawback of discrete representations. Based on a diffusion transformer architecture trained on a flow-matching objective, the model can generate and edit diverse high quality stereo samples of variable duration, with simple text descriptions. We also explore a new regularized latent inversion method for zero-shot test-time text-guided editing and demonstrate its superior performance over naive denoising diffusion implicit model (DDIM) inversion for a variety of music editing prompts. Evaluations are conducted on both objective and subjective metrics and demonstrate that the proposed model is not only competitive with the evaluated baselines on a standard text-to-music benchmark - quality and efficiency-wise - but also outperforms previous state of the art for music editing when combined with our proposed latent inversion. Samples are available at https://melodyflow.github.io.
Published: 2024

2. Learning Fine-Grained Controllability on Speech Generation via Efficient Fine-Tuning

Author: Chien, Chung-Ming, Tjandra, Andros, Vyas, Apoorv, Le, Matt, Shi, Bowen, and Hsu, Wei-Ning
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Computation and Language
Abstract: As the scale of generative models continues to grow, efficient reuse and adaptation of pre-trained models have become crucial considerations. In this work, we propose Voicebox Adapter, a novel approach that integrates fine-grained conditions into a pre-trained Voicebox speech generation model using a cross-attention module. To ensure a smooth integration of newly added modules with pre-trained ones, we explore various efficient fine-tuning approaches. Our experiment shows that the LoRA with bias-tuning configuration yields the best performance, enhancing controllability without compromising speech quality. Across three fine-grained conditional generation tasks, we demonstrate the effectiveness and resource efficiency of Voicebox Adapter. Follow-up experiments further highlight the robustness of Voicebox Adapter across diverse data setups.
Comment: Accepted by InterSpeech 2024
Published: 2024

3. Strict area law implies commuting parent Hamiltonian

Author: Kim, Isaac H., Lin, Ting-Chun, Ranard, Daniel, and Shi, Bowen
Subjects: Quantum Physics, Condensed Matter - Strongly Correlated Electrons
Abstract: We show that in two spatial dimensions, when a quantum state has entanglement entropy obeying a strict area law, meaning $S(A)=\alpha |\partial A| - \gamma$ for constants $\alpha, \gamma$ independent of lattice region $A$, then it admits a commuting parent Hamiltonian. More generally, we prove that the entanglement bootstrap axioms in 2D imply the existence of a commuting, local parent Hamiltonian with a stable spectral gap. We also extend our proof to states that describe gapped domain walls. Physically, these results imply that the states studied in the entanglement bootstrap program correspond to ground states of some local Hamiltonian, describing a stable phase of matter. Our result also suggests that systems with chiral gapless edge modes cannot obey a strict area law provided they have finite local Hilbert space.
Comment: 19+2 pages, 10 figures
Published: 2024

4. Meshfree finite difference solution of homogeneous Dirichlet problems of the fractional Laplacian

Author: Shen, Jinye, Shi, Bowen, and Huang, Weizhang
Subjects: Mathematics - Numerical Analysis, 65N06, 35R11
Abstract: A so-called grid-overlay finite difference method (GoFD) was proposed recently for the numerical solution of homogeneous Dirichlet boundary value problems of the fractional Laplacian on arbitrary bounded domains. It was shown to have advantages of both finite difference and finite element methods, including its efficient implementation through the fast Fourier transform and ability to work for complex domains and with mesh adaptation. The purpose of this work is to study GoFD in a meshfree setting, a key to which is to construct the data transfer matrix from a given point cloud to a uniform grid. Two approaches are proposed, one based on the moving least squares fitting and the other based on the Delaunay triangulation and piecewise linear interpolation. Numerical results obtained for examples with convex and concave domains and various types of point clouds are presented. They show that both approaches lead to comparable results. Moreover, the resulting meshfree GoFD converges at a similar order as GoFD with unstructured meshes and finite element approximation as the number of points in the cloud increases. Furthermore, numerical results show that the method is robust to random perturbations in the location of the points.
Comment: 18 pages
Published: 2024

5. Conformal geometry from entanglement

Author: Kim, Isaac H., Li, Xiang, Lin, Ting-Chun, McGreevy, John, and Shi, Bowen
Subjects: Quantum Physics, Condensed Matter - Strongly Correlated Electrons, High Energy Physics - Theory
Abstract: In a physical system with conformal symmetry, observables depend on cross-ratios, measures of distance invariant under global conformal transformations (conformal geometry for short). We identify a quantum information-theoretic mechanism by which the conformal geometry emerges at the gapless edge of a 2+1D quantum many-body system with a bulk energy gap. We introduce a novel pair of information-theoretic quantities $(\mathfrak{c}_{\mathrm{tot}}, \eta)$ that can be defined locally on the edge from the wavefunction of the many-body system, without prior knowledge of any distance measure. We posit that, for a topological groundstate, the quantity $\mathfrak{c}_{\mathrm{tot}}$ is stationary under arbitrary variations of the quantum state, and study the logical consequences. We show that stationarity, modulo an entanglement-based assumption about the bulk, implies (i) $\mathfrak{c}_{\mathrm{tot}}$ is a non-negative constant that can be interpreted as the total central charge of the edge theory. (ii) $\eta$ is a cross-ratio, obeying the full set of mathematical consistency rules, which further indicates the existence of a distance measure of the edge with global conformal invariance. Thus, the conformal geometry emerges from a simple assumption on groundstate entanglement. We show that stationarity of $\mathfrak{c}_{\mathrm{tot}}$ is equivalent to a vector fixed-point equation involving $\eta$, making our assumption locally checkable. We also derive similar results for 1+1D systems under a suitable set of assumptions.
Comment: 48+31 pages, 25 figures
Published: 2024

6. Chiral Virasoro algebra from a single wavefunction

Author: Kim, Isaac H., Li, Xiang, Lin, Ting-Chun, McGreevy, John, and Shi, Bowen
Subjects: Quantum Physics, Condensed Matter - Mesoscale and Nanoscale Physics, Condensed Matter - Strongly Correlated Electrons, High Energy Physics - Theory
Abstract: Chiral edges of 2+1D systems can have very robust emergent conformal symmetry. When the edge is purely chiral, the Hilbert space of low-energy edge excitations can form a representation of a single Virasoro algebra. We propose a method to systematically extract the generators of the Virasoro algebra from a single ground state wavefunction, using entanglement bootstrap and an input from the edge conformal field theory. We corroborate our construction by numerically verifying the commutation relations of the generators. We also study the unitary flows generated by these operators, whose properties (such as energy and state overlap) are shown numerically to agree with our analytical predictions.
Comment: 60+20 pages, 28 figures
Published: 2024

7. XLAVS-R: Cross-Lingual Audio-Visual Speech Representation Learning for Noise-Robust Speech Perception

Author: Han, HyoJung, Anwar, Mohamed, Pino, Juan, Hsu, Wei-Ning, Carpuat, Marine, Shi, Bowen, and Wang, Changhan
Subjects: Computer Science - Sound, Computer Science - Computation and Language, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Speech recognition and translation systems perform poorly on noisy inputs, which are frequent in realistic environments. Augmenting these systems with visual signals has the potential to improve robustness to noise. However, audio-visual (AV) data is only available in limited amounts and for fewer languages than audio-only resources. To address this gap, we present XLAVS-R, a cross-lingual audio-visual speech representation model for noise-robust speech recognition and translation in over 100 languages. It is designed to maximize the benefits of limited multilingual AV pre-training data, by building on top of audio-only multilingual pre-training and simplifying existing pre-training schemes. Extensive evaluation on the MuAViC benchmark shows the strength of XLAVS-R on downstream audio-visual speech recognition and translation tasks, where it outperforms the previous state of the art by up to 18.5% WER and 4.7 BLEU given noisy AV inputs, and enables strong zero-shot audio-visual ability with audio-only fine-tuning.
Published: 2024

8. Towards Privacy-Aware Sign Language Translation at Scale

Author: Rust, Phillip, Shi, Bowen, Wang, Skyler, Camgöz, Necati Cihan, and Maillard, Jean
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: A major impediment to the advancement of sign language translation (SLT) is data scarcity. Much of the sign language data currently available on the web cannot be used for training supervised models due to the lack of aligned captions. Furthermore, scaling SLT using large-scale web-scraped datasets bears privacy risks due to the presence of biometric information, which the responsible development of SLT technologies should account for. In this work, we propose a two-stage framework for privacy-aware SLT at scale that addresses both of these issues. We introduce SSVP-SLT, which leverages self-supervised video pretraining on anonymized and unannotated videos, followed by supervised SLT finetuning on a curated parallel dataset. SSVP-SLT achieves state-of-the-art finetuned and zero-shot gloss-free SLT performance on the How2Sign dataset, outperforming the strongest respective baselines by over 3 BLEU-4. Based on controlled experiments, we further discuss the advantages and limitations of self-supervised pretraining and anonymization via facial obfuscation for SLT.
Published: 2024

9. UMG-CLIP: A Unified Multi-Granularity Vision Generalist for Open-World Understanding

Author: Shi, Bowen, Zhao, Peisen, Wang, Zichen, Zhang, Yuhang, Wang, Yaoming, Li, Jin, Dai, Wenrui, Zou, Junni, Xiong, Hongkai, Tian, Qi, and Zhang, Xiaopeng
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Vision-language foundation models, represented by Contrastive language-image pre-training (CLIP), have gained increasing attention for jointly understanding both vision and textual tasks. However, existing approaches primarily focus on training models to match global image representations with textual descriptions, thereby overlooking the critical alignment between local regions and corresponding text tokens. This paper extends CLIP with multi-granularity alignment. Notably, we deliberately construct a new dataset comprising pseudo annotations at various levels of granularities, encompassing image-level, region-level, and pixel-level captions/tags. Accordingly, we develop a unified multi-granularity learning framework, named UMG-CLIP, that simultaneously empowers the model with versatile perception abilities across different levels of detail. Equipped with parameter efficient tuning, UMG-CLIP surpasses current widely used CLIP models and achieves state-of-the-art performance on diverse image understanding benchmarks, including open-world recognition, retrieval, semantic segmentation, and panoptic segmentation tasks. We hope UMG-CLIP can serve as a valuable option for advancing vision-language foundation models.
Comment: The paper is undergoing internal legal review and will be resubmitted once it passes the review
Published: 2024

10. Audiobox: Unified Audio Generation with Natural Language Prompts

Author: Vyas, Apoorv, Shi, Bowen, Le, Matthew, Tjandra, Andros, Wu, Yi-Chiao, Guo, Baishan, Zhang, Jiemin, Zhang, Xinyue, Adkins, Robert, Ngan, William, Wang, Jeff, Cruz, Ivan, Akula, Bapi, Akinyemi, Akinniyi, Ellis, Brian, Moritz, Rashel, Yungster, Yael, Rakotoarison, Alice, Tan, Liang, Summers, Chris, Wood, Carleigh, Lane, Joshua, Williamson, Mary, and Hsu, Wei-Ning
Subjects: Computer Science - Sound, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Audio is an essential part of our life, but creating it often requires expertise and is time-consuming. Research communities have made great progress over the past year advancing the performance of large scale audio generative models for a single modality (speech, sound, or music) through adopting more powerful generative models and scaling data. However, these models lack controllability in several aspects: speech generation models cannot synthesize novel styles based on text description and are limited on domain coverage such as outdoor environments; sound generation models only provide coarse-grained control based on descriptions like "a person speaking" and would only generate mumbling human voices. This paper presents Audiobox, a unified model based on flow-matching that is capable of generating various audio modalities. We design description-based and example-based prompting to enhance controllability and unify speech and sound generation paradigms. We allow transcript, vocal, and other audio styles to be controlled independently when generating speech. To improve model generalization with limited labels, we adapt a self-supervised infilling objective to pre-train on large quantities of unlabeled audio. Audiobox sets new benchmarks on speech and sound generation (0.745 similarity on Librispeech for zero-shot TTS; 0.77 FAD on AudioCaps for text-to-sound) and unlocks new methods for generating audio with novel vocal and acoustic styles. We further integrate Bespoke Solvers, which speeds up generation by over 25 times compared to the default ODE solver for flow-matching, without loss of performance on several tasks. Our demo is available at https://audiobox.metademolab.com/
Published: 2023

11. AiluRus: A Scalable ViT Framework for Dense Prediction

Author: Li, Jin, Wang, Yaoming, Zhang, Xiaopeng, Shi, Bowen, Jiang, Dongsheng, Li, Chenglin, Dai, Wenrui, Xiong, Hongkai, and Tian, Qi
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: Vision transformers (ViTs) have emerged as a prevalent architecture for vision tasks owing to their impressive performance. However, when it comes to handling long token sequences, especially in dense prediction tasks that require high-resolution input, the complexity of ViTs increases significantly. Notably, dense prediction tasks, such as semantic segmentation or object detection, place more emphasis on the contours or shapes of objects, while the texture inside objects is less informative. Motivated by this observation, we propose to apply adaptive resolution for different regions in the image according to their importance. Specifically, at the intermediate layer of the ViT, we utilize a spatial-aware density-based clustering algorithm to select representative tokens from the token sequence. Once the representative tokens are determined, we proceed to merge other tokens into their closest representative token. Consequently, semantically similar tokens are merged together to form low-resolution regions, while semantically irrelevant tokens are preserved independently as high-resolution regions. This strategy effectively reduces the number of tokens, allowing subsequent layers to handle a reduced token sequence and achieve acceleration. We evaluate our proposed method on three different datasets and observe promising performance. For example, the "Segmenter ViT-L" model can be accelerated by 48% FPS without fine-tuning, while maintaining the performance. Additionally, our method can be applied to accelerate fine-tuning as well. Experimental results demonstrate that we can save 52% training time while accelerating 2.46 times FPS with only a 0.09% performance drop. The code is available at https://github.com/caddyless/ailurus/tree/main.
Comment: Accepted by NeurIPS 2023
Published: 2023

12. Traceability and policy suggestions for ozone pollution in heavy industrial city in Northeast China

Author: Shi, Bowen, Liu, Gang, Fu, Jing, Zhai, Shuai, He, Luyan, Li, Ruiqi, and Chen, Weiwei
Published: 2024

13. Prediction of tumor regression grade in far-advanced gastric cancer after preoperative immuno-chemotherapy using dual-energy CT-derived extracellular volume fraction

Author: Chen, Yong, Jiang, Jinling, Yan, Chao, Jiang, Jiang, Shi, Bowen, Xu, Zhihan, Yuan, Fei, Zhang, Huan, and Zhang, Jun
Published: 2024

14. The Role of Iron Metabolism in Sepsis-associated Encephalopathy: a Potential Target

Author: Liu, Yinuo, Hu, Shengnan, Shi, Bowen, Yu, Bodong, Luo, Wei, Peng, Shengliang, and Du, Xiaohong
Published: 2024

15. Generative Pre-training for Speech with Flow Matching

Author: Liu, Alexander H., Le, Matt, Vyas, Apoorv, Shi, Bowen, Tjandra, Andros, and Hsu, Wei-Ning
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Computation and Language, Computer Science - Machine Learning, Computer Science - Sound
Abstract: Generative models have gained more and more attention in recent years for their remarkable success in tasks that require estimating and sampling data distribution to generate high-fidelity synthetic data. In speech, text-to-speech synthesis and neural vocoder are good examples where generative models have shined. While generative models have been applied to different applications in speech, there exists no general-purpose generative model that models speech directly. In this work, we take a step toward this direction by showing a single pre-trained generative model can be adapted to different downstream tasks with strong performance. Specifically, we pre-trained a generative model, named SpeechFlow, on 60k hours of untranscribed speech with Flow Matching and masked conditions. Experiment results show the pre-trained generative model can be fine-tuned with task-specific data to match or surpass existing expert models on speech enhancement, separation, and synthesis. Our work suggests a foundational model for generation tasks in speech can be built with generative pre-training.
Comment: ICLR 2024
Published: 2023

16. Immersed figure-8 annuli and anyons

Author: Shi, Bowen
Subjects: Quantum Physics, Condensed Matter - Strongly Correlated Electrons, High Energy Physics - Theory, Mathematical Physics
Abstract: Immersion (i.e., local embedding) is relevant to the physics of topologically ordered phases through entanglement bootstrap. An annulus can immerse in a disk or a sphere as a "figure-8", which cannot be smoothly deformed to an embedded annulus. We investigate a simple problem: is there an Abelian state on the immersed figure-8 annulus, locally indistinguishable from the ground state of the background physical system? We show that if the answer is affirmative, a strong sense of isomorphism must hold: two homeomorphic immersed regions must have isomorphic information convex sets, even if they cannot smoothly deform to each other on the background physical system. We explain why to care about strong isomorphism in physical systems with anyons and give proof in the context of Abelian anyon theory. We further discuss a connection between immersed annuli and anyon transportation in the presence of topological defects. In appendices, we discuss related problems in broader contexts.
Comment: 15 figures, 20+10 pages
Published: 2023

17. Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning

Author: Yu, Lili, Shi, Bowen, Pasunuru, Ramakanth, Muller, Benjamin, Golovneva, Olga, Wang, Tianlu, Babu, Arun, Tang, Binh, Karrer, Brian, Sheynin, Shelly, Ross, Candace, Polyak, Adam, Howes, Russell, Sharma, Vasu, Xu, Puxin, Tamoyan, Hovhannes, Ashual, Oron, Singer, Uriel, Li, Shang-Wen, Zhang, Susan, James, Richard, Ghosh, Gargi, Taigman, Yaniv, Fazel-Zarandi, Maryam, Celikyilmaz, Asli, Zettlemoyer, Luke, and Aghajanyan, Armen
Subjects: Computer Science - Machine Learning, Computer Science - Computation and Language, Computer Science - Computer Vision and Pattern Recognition
Abstract: We present CM3Leon (pronounced "Chameleon"), a retrieval-augmented, token-based, decoder-only multi-modal language model capable of generating and infilling both text and images. CM3Leon uses the CM3 multi-modal architecture but additionally shows the extreme benefits of scaling up and tuning on more diverse instruction-style data. It is the first multi-modal model trained with a recipe adapted from text-only language models, including a large-scale retrieval-augmented pre-training stage and a second multi-task supervised fine-tuning (SFT) stage. It is also a general-purpose model that can do both text-to-image and image-to-text generation, allowing us to introduce self-contained contrastive decoding methods that produce high-quality outputs. Extensive experiments demonstrate that this recipe is highly effective for multi-modal models. CM3Leon achieves state-of-the-art performance in text-to-image generation with 5x less training compute than comparable methods (zero-shot MS-COCO FID of 4.88). After SFT, CM3Leon can also demonstrate unprecedented levels of controllability in tasks ranging from language-guided image editing to image-controlled generation and segmentation.
Published: 2023

18. Toward American Sign Language Processing in the Real World: Data, Tasks, and Methods

Author: Shi, Bowen
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computation and Language
Abstract: Sign language, which conveys meaning through gestures, is the chief means of communication among deaf people. Recognizing sign language in natural settings presents significant challenges due to factors such as lighting, background clutter, and variations in signer characteristics. In this thesis, I study automatic sign language processing in the wild, using signing videos collected from the Internet. This thesis contributes new datasets, tasks, and methods. Most chapters of this thesis address tasks related to fingerspelling, an important component of sign language that has not yet been widely studied by prior work. I present three new large-scale ASL datasets in the wild: ChicagoFSWild, ChicagoFSWild+, and OpenASL. Using ChicagoFSWild and ChicagoFSWild+, I address fingerspelling recognition, which consists of transcribing fingerspelling sequences into text. I propose an end-to-end approach based on iterative attention that allows recognition from a raw video without explicit hand detection. I further show that using a Conformer-based network jointly modeling handshape and mouthing can bring performance close to that of humans. Next, I propose two tasks for building real-world fingerspelling-based applications: fingerspelling detection and search. For fingerspelling detection, I introduce a suite of evaluation metrics and a new detection model via multi-task training. To address the problem of searching for fingerspelled keywords in raw sign language videos, we propose a novel method that jointly localizes and matches fingerspelling segments to text. Finally, I will describe a benchmark for large-vocabulary open-domain sign language translation based on OpenASL. To address the challenges of sign language translation in realistic settings, we propose a set of techniques including sign search as a pretext task for pre-training and fusion of mouthing and handshape features.
Comment: PhD thesis
Published: 2023

19. EXPRESSO: A Benchmark and Analysis of Discrete Expressive Speech Resynthesis

Author: Nguyen, Tu Anh, Hsu, Wei-Ning, D'Avirro, Antony, Shi, Bowen, Gat, Itai, Fazel-Zarandi, Maryam, Remez, Tal, Copet, Jade, Synnaeve, Gabriel, Hassid, Michael, Kreuk, Felix, Adi, Yossi, and Dupoux, Emmanuel
Subjects: Computer Science - Computation and Language, Computer Science - Machine Learning, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Recent work has shown that it is possible to resynthesize high-quality speech based, not on text, but on low bitrate discrete units that have been learned in a self-supervised fashion and can therefore capture expressive aspects of speech that are hard to transcribe (prosody, voice styles, non-verbal vocalization). The adoption of these methods is still limited by the fact that most speech synthesis datasets are read, severely limiting spontaneity and expressivity. Here, we introduce Expresso, a high-quality expressive speech dataset for textless speech synthesis that includes both read speech and improvised dialogues rendered in 26 spontaneous expressive styles. We illustrate the challenges and potentials of this dataset with an expressive resynthesis benchmark where the task is to encode the input in low-bitrate units and resynthesize it in a target voice while preserving content and style. We evaluate resynthesis quality with automatic metrics for different self-supervised discrete encoders, and explore tradeoffs between quality, bitrate and invariance to speaker and style. The dataset, evaluation metrics, and baseline models are all open source.
Published: 2023

20. A new strategy for groundwater level prediction using a hybrid deep learning model under Ecological Water Replenishment

Author: Jia, Zihao, Zhang, Qin, Shi, Bowen, Xu, Congchao, Liu, Di, Yang, Yihong, Xi, Beidou, and Li, Rui
Published: 2024

21. Thermal-fluid dynamic behavior and its effect on particle distribution and solidification characterization during the laser melting deposition of Ti-based composites

Author: Wang, Leilei, Li, Yifan, Yang, Xingyun, Shi, Bowen, Gao, Zhuanni, and Zhan, Xiaohong
Published: 2024

22. Treatment of paraffin deposition behavior in gas-condensate wells with chemical inhibitors

Author: Shi, Bowen, Hong, Jiajun, Wang, Zhihua, Chang, Zhenbo, and Li, Feng
Published: 2024

23. ActionPrompt: Action-Guided 3D Human Pose Estimation With Text and Pose Prompting

Author: Zheng, Hongwei, Li, Han, Shi, Bowen, Dai, Wenrui, Wan, Botao, Sun, Yu, Guo, Min, and Xiong, Hongkai
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Recent 2D-to-3D human pose estimation (HPE) utilizes temporal consistency across sequences to alleviate the depth ambiguity problem but ignores the action-related prior knowledge hidden in the pose sequence. In this paper, we propose a plug-and-play module named Action Prompt Module (APM) that effectively mines different kinds of action clues for 3D HPE. The highlight is that the mining scheme of APM can be widely adapted to different frameworks and bring consistent benefits. Specifically, we first present a novel Action-related Text Prompt module (ATP) that directly embeds action labels and transfers the rich language information in the label to the pose sequence. Besides, we further introduce an Action-specific Pose Prompt module (APP) to mine the position-aware pose pattern of each action, and exploit the correlation between the mined patterns and input pose sequence for further pose refinement. Experiments show that APM can improve the performance of most video-based 2D-to-3D HPE frameworks by a large margin.
Comment: 6 pages, 4 figures, ICME 2023
Published: 2023

24. Hybrid Distillation: Connecting Masked Autoencoders with Contrastive Learners

Author: Shi, Bowen, Zhang, Xiaopeng, Wang, Yaoming, Li, Jin, Dai, Wenrui, Zou, Junni, Xiong, Hongkai, and Tian, Qi
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Representation learning has been evolving from traditional supervised training to Contrastive Learning (CL) and Masked Image Modeling (MIM). Previous works have demonstrated their pros and cons in specific scenarios, i.e., CL and supervised pre-training excel at capturing longer-range global patterns and enabling better feature discrimination, while MIM can introduce more local and diverse attention across all transformer layers. In this paper, we explore how to obtain a model that combines their strengths. We start by examining previous feature distillation and mask feature reconstruction methods and identify their limitations. We find that their increasing diversity mainly derives from the asymmetric designs, but these designs may in turn compromise the discrimination ability. In order to better obtain both discrimination and diversity, we propose a simple but effective Hybrid Distillation strategy, which utilizes both the supervised/CL teacher and the MIM teacher to jointly guide the student model. Hybrid Distill imitates the token relations of the MIM teacher to alleviate attention collapse, as well as distills the feature maps of the supervised/CL teacher to enable discrimination. Furthermore, a progressive redundant token masking strategy is also utilized to reduce the distilling costs and avoid falling into local optima. Experiment results prove that Hybrid Distill can achieve superior performance on different benchmarks.
Published: 2023

25. Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale

Author: Le, Matthew, Vyas, Apoorv, Shi, Bowen, Karrer, Brian, Sari, Leda, Moritz, Rashel, Williamson, Mary, Manohar, Vimal, Adi, Yossi, Mahadeokar, Jay, and Hsu, Wei-Ning
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Computation and Language, Computer Science - Machine Learning, Computer Science - Sound
Abstract: Large-scale generative models such as GPT and DALL-E have revolutionized the research community. These models not only generate high fidelity outputs, but are also generalists which can solve tasks not explicitly taught. In contrast, speech generative models are still primitive in terms of scale and task generalization. In this paper, we present Voicebox, the most versatile text-guided generative model for speech at scale. Voicebox is a non-autoregressive flow-matching model trained to infill speech, given audio context and text, trained on over 50K hours of speech that are not filtered or enhanced. Similar to GPT, Voicebox can perform many different tasks through in-context learning, but is more flexible as it can also condition on future context. Voicebox can be used for mono or cross-lingual zero-shot text-to-speech synthesis, noise removal, content editing, style conversion, and diverse sample generation. In particular, Voicebox outperforms the state-of-the-art zero-shot TTS model VALL-E on both intelligibility (5.9% vs 1.9% word error rates) and audio similarity (0.580 vs 0.681) while being up to 20 times faster. Audio samples can be found at https://voicebox.metademolab.com.
Comment: Accepted to NeurIPS 2023
Published: 2023

26. Prompt to GPT-3: Step-by-Step Thinking Instructions for Humor Generation

Author: Chen, Yuetian, Shi, Bowen, and Si, Mei
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Artificial intelligence has made significant progress in natural language processing, with models like GPT-3 demonstrating impressive capabilities. However, these models still have limitations when it comes to complex tasks that require an understanding of the user, such as mastering human comedy writing strategies. This paper explores humor generation using GPT-3 by modeling human comedy writing theory and leveraging step-by-step thinking instructions. In addition, we explore the role of cognitive distance in creating humor., Comment: 5 pages, 1 figure; ICCC '23 preprint
Published: 2023

27. Scaling Speech Technology to 1,000+ Languages

Author: Pratap, Vineel, Tjandra, Andros, Shi, Bowen, Tomasello, Paden, Babu, Arun, Kundu, Sayani, Elkahky, Ali, Ni, Zhaoheng, Vyas, Apoorv, Fazel-Zarandi, Maryam, Baevski, Alexei, Adi, Yossi, Zhang, Xiaohui, Hsu, Wei-Ning, Conneau, Alexis, and Auli, Michael
Subjects: Computer Science - Computation and Language, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Expanding the language coverage of speech technology has the potential to improve access to information for many more people. However, current speech technology is restricted to about one hundred languages which is a small fraction of the over 7,000 languages spoken around the world. The Massively Multilingual Speech (MMS) project increases the number of supported languages by 10-40x, depending on the task. The main ingredients are a new dataset based on readings of publicly available religious texts and effectively leveraging self-supervised learning. We built pre-trained wav2vec 2.0 models covering 1,406 languages, a single multilingual automatic speech recognition model for 1,107 languages, speech synthesis models for the same number of languages, as well as a language identification model for 4,017 languages. Experiments show that our multilingual speech recognition model more than halves the word error rate of Whisper on 54 languages of the FLEURS benchmark while being trained on a small fraction of the labeled data.
Published: 2023

28. SEGA: Structural Entropy Guided Anchor View for Graph Contrastive Learning

Author: Wu, Junran, Chen, Xueyuan, Shi, Bowen, Li, Shangzhe, and Xu, Ke
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence
Abstract: In contrastive learning, the choice of "view" controls the information that the representation captures and influences the performance of the model. However, leading graph contrastive learning methods generally produce views via random corruption or learning, which could lead to the loss of essential information and alteration of semantic information. An anchor view that maintains the essential information of input graphs for contrastive learning has been hardly investigated. In this paper, based on the theory of graph information bottleneck, we deduce the definition of this anchor view; put differently, the anchor view with essential information of input graph is supposed to have the minimal structural uncertainty. Furthermore, guided by structural entropy, we implement the anchor view, termed SEGA, for graph contrastive learning. We extensively validate the proposed anchor view on various benchmarks regarding graph classification under unsupervised, semi-supervised, and transfer learning and achieve significant performance boosts compared to the state-of-the-art methods., Comment: ICML'23
Published: 2023

29. Highly-confined and tunable plasmonics based on two-dimensional solid-state defect lattices

Author: Ghorashi, Ali, Rivera, Nicholas, Shi, Bowen, Sundararaman, Ravishankar, Kaxiras, Efthimios, Joannopoulos, John, and Soljacic, Marin
Subjects: Physics - Optics
Abstract: Plasmons, collective excitations of electrons in solids, are associated with strongly confined electromagnetic fields, with wavelengths far below the wavelength of photons in free space. This strong confinement promises the realization of optoelectronic devices that could bridge the size difference between photonic and electronic devices. However, despite decades of research in plasmonics, many applications remain limited by plasmonic losses, thus motivating a search for new engineered plasmonic materials with lower losses. A promising pathway for low-loss plasmonic materials is the engineering of materials with flat and energetically isolated metallic bands, which can strongly limit phonon-assisted optical losses, a major contributor to short plasmonic lifetimes. Such electronic band structures may be created by judiciously introducing an ordered lattice of defects in an insulating host material. Here, we explore this approach, presenting several low-loss, highly-confined, and tunable plasmonic materials based on arrays of carbon substitutions in hexagonal boron nitride (hBN) monolayers. From our first-principles calculations based on density functional theory (DFT), we find plasmonic structures with mid-infrared plasmons featuring very high confinements ($\lambda_{\text{vacuum}}/\lambda_{\text{plasmon}}$ exceeding 2000) and quality factors in excess of 1000. We provide a systematic explanation of how crystal structure, electronic bandwidth, and many-body effects affect the plasmonic dispersions and losses of these materials. The results are thus of relevance to low-loss plasmon engineering in other flat band systems.
Published: 2023

30. Rethinking Visual Prompt Learning as Masked Visual Token Modeling

Author: Liao, Ning, Shi, Bowen, Zhang, Xiaopeng, Cao, Min, Yan, Junchi, and Tian, Qi
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Prompt learning has achieved great success in efficiently exploiting large-scale pre-trained models in natural language processing (NLP). It reformulates the downstream tasks as the generative pre-training ones to achieve consistency, thus improving the performance stably. However, when transferring it to the vision area, current visual prompt learning methods are almost designed on discriminative pre-trained models, and there is also a lack of careful design to unify the forms of pre-training and downstream tasks. To explore prompt learning on the generative pre-trained visual model, as well as keeping the task consistency, we propose Visual Prompt learning as masked visual Token Modeling (VPTM) to transform the downstream visual classification into the pre-trained masked visual token prediction. In addition, we develop the prototypical verbalizer for mapping the predicted visual token with implicit semantics to explicit downstream labels. To our best knowledge, VPTM is the first visual prompt method on the generative pre-trained visual model, which achieves consistency between pre-training and downstream visual classification by task reformulation. Experiments show that VPTM outperforms other visual prompt methods and achieves excellent efficiency. Moreover, the task consistency of VPTM contributes to the robustness against prompt location, prompt length and prototype dimension, and could be deployed uniformly.
Published: 2023

31. MuAViC: A Multilingual Audio-Visual Corpus for Robust Speech Recognition and Robust Speech-to-Text Translation

Author: Anwar, Mohamed, Shi, Bowen, Goswami, Vedanuj, Hsu, Wei-Ning, Pino, Juan, and Wang, Changhan
Subjects: Computer Science - Computation and Language, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: We introduce MuAViC, a multilingual audio-visual corpus for robust speech recognition and robust speech-to-text translation providing 1200 hours of audio-visual speech in 9 languages. It is fully transcribed and covers 6 English-to-X translation as well as 6 X-to-English translation directions. To the best of our knowledge, this is the first open benchmark for audio-visual speech-to-text translation and the largest open benchmark for multilingual audio-visual speech recognition. Our baseline results show that MuAViC is effective for building noise-robust speech recognition and translation models. We make the corpus available at https://github.com/facebookresearch/muavic.
Published: 2023

32. Pose-Oriented Transformer with Uncertainty-Guided Refinement for 2D-to-3D Human Pose Estimation

Author: Li, Han, Shi, Bowen, Dai, Wenrui, Zheng, Hongwei, Wang, Botao, Sun, Yu, Guo, Min, Li, Chenlin, Zou, Junni, and Xiong, Hongkai
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: There has been a recent surge of interest in introducing transformers to 3D human pose estimation (HPE) due to their powerful capabilities in modeling long-term dependencies. However, existing transformer-based methods treat body joints as equally important inputs and ignore the prior knowledge of human skeleton topology in the self-attention mechanism. To tackle this issue, in this paper, we propose a Pose-Oriented Transformer (POT) with uncertainty guided refinement for 3D HPE. Specifically, we first develop a novel pose-oriented self-attention mechanism and distance-related position embedding for POT to explicitly exploit the human skeleton topology. The pose-oriented self-attention mechanism explicitly models the topological interactions between body joints, whereas the distance-related position embedding encodes the distance of joints to the root joint to distinguish groups of joints with different difficulties in regression. Furthermore, we present an Uncertainty-Guided Refinement Network (UGRN) to refine pose predictions from POT, especially for the difficult joints, by considering the estimated uncertainty of each joint with an uncertainty-guided sampling strategy and self-attention mechanism. Extensive experiments demonstrate that our method significantly outperforms the state-of-the-art methods with reduced model parameters on 3D HPE benchmarks such as Human3.6M and MPI-INF-3DHP., Comment: accepted by AAAI2023
Published: 2023

33. Universal lower bound on topological entanglement entropy

Author: Kim, Isaac H., Levin, Michael, Lin, Ting-Chun, Ranard, Daniel, and Shi, Bowen
Subjects: Quantum Physics, Condensed Matter - Strongly Correlated Electrons
Abstract: Entanglement entropies of two-dimensional gapped ground states are expected to satisfy an area law, with a constant correction term known as the topological entanglement entropy (TEE). In many models, the TEE takes a universal value that characterizes the underlying topological phase. However, the TEE is not truly universal: it can differ even for two states related by constant-depth circuits, which are necessarily in the same phase. The difference between the TEE and the value predicted by the anyon theory is often called the spurious topological entanglement entropy. We show that this spurious contribution is always nonnegative, thus the value predicted by the anyon theory provides a universal lower bound. This observation also leads to a definition of TEE that is invariant under constant-depth quantum circuits., Comment: 13 pages, 16 figures
Published: 2023
Full Text: View/download PDF

34. Preparation and Characterization of Hydroxylated Multi-Walled Carbon Nanotube (MWCNTs-OH) Composite Nanofiltration Membrane

Author: Xu, Xufeng, Zheng, Lixiang, Shi, Bowen, Förstner, Ulrich, Series Editor, Rulkens, Wim H., Series Editor, Wen, Fushuan, editor, and Zhu, Jizhong, editor
Published: 2024
Full Text: View/download PDF

35. Remote detectability from entanglement bootstrap I: Kirby's torus trick

Author: Shi, Bowen, Huang, Jin-Long, and McGreevy, John
Subjects: Quantum Physics, Condensed Matter - Strongly Correlated Electrons, High Energy Physics - Theory, Mathematical Physics, Mathematics - Quantum Algebra
Abstract: Remote detectability is often taken as a physical assumption in the study of topologically ordered systems, and it is a central axiom of mathematical frameworks of topological quantum field theories. We show under the entanglement bootstrap approach that remote detectability is a necessary property; that is, we derive it as a theorem. Starting from a single wave function on a topologically-trivial region satisfying the entanglement bootstrap axioms, we can construct states on closed manifolds. The crucial technique is to immerse the punctured manifold into the topologically trivial region and then heal the puncture. This is analogous to Kirby's torus trick. We then analyze a special class of such manifolds, which we call pairing manifolds. For each pairing manifold, which pairs two classes of excitations, we identify an analog of the topological $S$-matrix. This pairing matrix is unitary, which implies remote detectability between two classes of excitations. These matrices are in general not associated with the mapping class group of the manifold. As a by-product, we can count excitation types (e.g., graph excitations in 3+1d). The pairing phenomenon occurs in many physical contexts, including systems in different dimensions, with or without gapped boundaries. We provide a variety of examples to illustrate its scope., Comment: 110+20 pages. Many figures
Published: 2023

36. Visual Story Generation Based on Emotion and Keywords

Author: Chen, Yuetian, Li, Ruohua, Shi, Bowen, Liu, Peiru, and Si, Mei
Subjects: Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: Automated visual story generation aims to produce stories with corresponding illustrations that exhibit coherence, progression, and adherence to characters' emotional development. This work proposes a story generation pipeline to co-create visual stories with the users. The pipeline allows the user to control events and emotions on the generated content. The pipeline includes two parts: narrative and image generation. For narrative generation, the system generates the next sentence using user-specified keywords and emotion labels. For image generation, diffusion models are used to create a visually appealing image corresponding to each generated sentence. Further, object recognition is applied to the generated images to allow objects in these images to be mentioned in future story development., Comment: 8 pages, 8 figures, AIIDE INT 2022
Published: 2023

37. ReVISE: Self-Supervised Speech Resynthesis with Visual Input for Universal and Generalized Speech Enhancement

Author: Hsu, Wei-Ning, Remez, Tal, Shi, Bowen, Donley, Jacob, and Adi, Yossi
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning, Computer Science - Sound
Abstract: Prior works on improving speech quality with visual input typically study each type of auditory distortion separately (e.g., separation, inpainting, video-to-speech) and present tailored algorithms. This paper proposes to unify these subjects and study Generalized Speech Enhancement, where the goal is not to reconstruct the exact reference clean signal, but to focus on improving certain aspects of speech. In particular, this paper concerns intelligibility, quality, and video synchronization. We cast the problem as audio-visual speech resynthesis, which is composed of two steps: pseudo audio-visual speech recognition (P-AVSR) and pseudo text-to-speech synthesis (P-TTS). P-AVSR and P-TTS are connected by discrete units derived from a self-supervised speech model. Moreover, we utilize a self-supervised audio-visual speech model to initialize P-AVSR. The proposed model is coined ReVISE. ReVISE is the first high-quality model for in-the-wild video-to-speech synthesis and achieves superior performance on all LRS3 audio-visual enhancement tasks with a single model. To demonstrate its applicability in the real world, ReVISE is also evaluated on EasyCom, an audio-visual benchmark collected under challenging acoustic conditions with only 1.6 hours of training data. Similarly, ReVISE greatly suppresses noise and improves quality. Project page: https://wnhsu.github.io/ReVISE.
Published: 2022

38. Comparative layer-wise analysis of self-supervised speech models

Author: Pasad, Ankita, Shi, Bowen, and Livescu, Karen
Subjects: Computer Science - Computation and Language, Computer Science - Machine Learning, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Many self-supervised speech models, varying in their pre-training objective, input modality, and pre-training data, have been proposed in the last few years. Despite impressive successes on downstream tasks, we still have a limited understanding of the properties encoded by the models and the differences across models. In this work, we examine the intermediate representations for a variety of recent models. Specifically, we measure acoustic, phonetic, and word-level properties encoded in individual layers, using a lightweight analysis tool based on canonical correlation analysis (CCA). We find that these properties evolve across layers differently depending on the model, and the variations relate to the choice of pre-training objective. We further investigate the utility of our analyses for downstream tasks by comparing the property trends with performance on speech recognition and spoken language understanding tasks. We discover that CCA trends provide reliable guidance to choose layers of interest for downstream tasks and that single-layer performance often matches or improves upon using all layers, suggesting implications for more efficient use of pre-trained models., Comment: Accepted to ICASSP 2023. Code: https://github.com/ankitapasad/layerwise-analysis
Published: 2022

39. Bibliometric Analysis of Advances in mHealth Technology Application in Chronic Disease Management

Author: SHI Bowen, MA Huimin, PAN Yanzhi, MA He, YANG Chen, XIONG Juyang
Subjects: mhealth technology, chronic disease management, telemedicine, telenursing, bibliometrics, Medicine
Abstract: Background The research related to mHealth technology in chronic disease management has developed rapidly in recent years, however, the research trends, hotspots and cutting-edge issues in this field remain unclear. Objective To systematically review the application and development of mHealth technology in chronic disease management and provide reference for future research. Methods Using Web of Science Core Collection and PubMed as the source of literature data, the relevant literature was searched from 1997 to 2022 by CiteSpace 6.1.R3 software on October 18, 2022, restricting the language to English, and excluding conference papers, conference abstracts, online publications, editorials, letters, book chapters, news, and other non-compliant contents. National regions, disciplinary intersections and keywords were analyzed to grasp the current status and hotspots of related research internationally, and the cutting-edge issues and research trends of mHealth technology in chronic disease management were comprehensively analyzed using keywords clustering analysis, keywords bursting analysis and timeline views. Results A total of 7,622 papers were finally included in the study, with a significant growth trend in the volume of publications starting from 2011, in which the United States contributed the most with a total of 2,645 (34.70%). The journals in which the papers were published were mainly in the fields of medicine, psychology and health; and the top five high-frequency keywords were chronic disease (711 times), nursing (695 times), management (544 times), intervention (502 times) and health (448 times). A total of 10 meaningful clusters were formed, which can be categorized into 4 dimensions of research tools, research theories and methods, research objects, and research factors; combining with keywords bursting and timeline view, the hot issues mainly focus on telemedicine, telecare, and digital health.
Conclusion The international research fervor for the application of mHealth technology in chronic disease management has continued, and the field of research has shifted from medicine to health science, with the focus on intervention research on chronic diseases through mHealth technology and the use of digital technology to provide integrated telehealth services for chronic diseases. It is suggested that our scholars should pay attention to the application of mHealth and digital technologies in chronic disease management, find high-quality health services for patients with chronic diseases in China through intervention studies, and provide strategies and suggestions for the high-quality development of chronic disease services and management in China.
Published: 2024
Full Text: View/download PDF

40. Knots and entanglement

Author: Huang, Jin-Long, McGreevy, John, and Shi, Bowen
Abstract: We extend the entanglement bootstrap program to (3+1)-dimensions. We study knotted excitations of (3+1)-dimensional liquid topological orders and exotic fusion processes of loops. As in previous work in (2+1)-dimensions [Ann. Phys. 418, 168164(2020), Phys. Rev. B 103, 115150 (2021)], we define a variety of superselection sectors and fusion spaces from two axioms on the ground state entanglement entropy. In particular, we identify fusion spaces associated with knots. We generalize the information convex set to a new class of regions called immersed regions, promoting various theorems to this new context. Examples from solvable models are provided; for instance, a concrete calculation of knot multiplicity shows that the knot complement of a trefoil knot can store quantum information. We define spiral maps that allow us to understand consistency relations for torus knots as well as spiral fusions of fluxes.
Published: 2023

41. u-HuBERT: Unified Mixed-Modal Speech Pretraining And Zero-Shot Transfer to Unlabeled Modality

Author: Hsu, Wei-Ning and Shi, Bowen
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing, Electrical Engineering and Systems Science - Image and Video Processing
Abstract: While audio-visual speech models can yield superior performance and robustness compared to audio-only models, their development and adoption are hindered by the lack of labeled and unlabeled audio-visual data and the cost to deploy one model per modality. In this paper, we present u-HuBERT, a self-supervised pre-training framework that can leverage both multimodal and unimodal speech with a unified masked cluster prediction objective. By utilizing modality dropout during pre-training, we demonstrate that a single fine-tuned model can achieve performance on par or better than the state-of-the-art modality-specific models. Moreover, our model fine-tuned only on audio can perform well with audio-visual and visual speech input, achieving zero-shot modality generalization for multiple speech processing tasks. In particular, our single model yields 1.2%/1.4%/27.2% speech recognition word error rate on LRS3 with audio-visual/audio/visual input. Codes and models are available at https://github.com/facebookresearch/av_hubert, Comment: NeurIPS 2022
Published: 2022

42. Modular Commutators in Conformal Field Theory

Author: Zou, Yijian, Shi, Bowen, Sorce, Jonathan, Lim, Ian T., and Kim, Isaac H.
Subjects: Condensed Matter - Strongly Correlated Electrons, High Energy Physics - Theory, Quantum Physics
Abstract: The modular commutator is a recently discovered multipartite entanglement measure that quantifies the chirality of the underlying many-body quantum state. In this Letter, we derive a universal expression for the modular commutator in conformal field theories in $1+1$ dimensions and discuss its salient features. We show that the modular commutator depends only on the chiral central charge and the conformal cross ratio. We test this formula for a gapped $(2+1)$-dimensional system with a chiral edge, i.e., the quantum Hall state, and observe excellent agreement with numerical simulations. Furthermore, we propose a geometric dual for the modular commutator in certain preferred states of the AdS/CFT correspondence. For these states, we argue that the modular commutator can be obtained from a set of crossing angles between intersecting Ryu-Takayanagi surfaces., Comment: 6+24 pages, 5+12 figures
Published: 2022
Full Text: View/download PDF

43. Open-Domain Sign Language Translation Learned from Online Video

Author: Shi, Bowen, Brentari, Diane, Shakhnarovich, Greg, and Livescu, Karen
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computation and Language
Abstract: Existing work on sign language translation - that is, translation from sign language videos into sentences in a written language - has focused mainly on (1) data collected in a controlled environment or (2) data in a specific domain, which limits the applicability to real-world settings. In this paper, we introduce OpenASL, a large-scale American Sign Language (ASL) - English dataset collected from online video sites (e.g., YouTube). OpenASL contains 288 hours of ASL videos in multiple domains from over 200 signers and is the largest publicly available ASL translation dataset to date. To tackle the challenges of sign language translation in realistic settings and without glosses, we propose a set of techniques including sign search as a pretext task for pre-training and fusion of mouthing and handshape features. The proposed techniques produce consistent and large improvements in translation quality, over baseline models based on prior work. Our data and code are publicly available at https://github.com/chevalierNoir/OpenASL, Comment: EMNLP 2022
Published: 2022

44. Learning Lip-Based Audio-Visual Speaker Embeddings with AV-HuBERT

Author: Shi, Bowen, Mohamed, Abdelrahman, and Hsu, Wei-Ning
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Sound
Abstract: This paper investigates self-supervised pre-training for audio-visual speaker representation learning where a visual stream showing the speaker's mouth area is used alongside speech as inputs. Our study focuses on the Audio-Visual Hidden Unit BERT (AV-HuBERT) approach, a recently developed general-purpose audio-visual speech pre-training framework. We conducted extensive experiments probing the effectiveness of pre-training and visual modality. Experimental results suggest that AV-HuBERT generalizes decently to speaker related downstream tasks, improving label efficiency by roughly ten fold for both audio-only and audio-visual speaker verification. We also show that incorporating visual information, even just the lip area, greatly improves the performance and noise robustness, reducing EER by 38% in the clean condition and 75% in noisy conditions., Comment: Interspeech 2022
Published: 2022

45. Searching for fingerspelled content in American Sign Language

Author: Shi, Bowen, Brentari, Diane, Shakhnarovich, Greg, and Livescu, Karen
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computation and Language
Abstract: Natural language processing for sign language video - including tasks like recognition, translation, and search - is crucial for making artificial intelligence technologies accessible to deaf individuals, and is gaining research interest in recent years. In this paper, we address the problem of searching for fingerspelled key-words or key phrases in raw sign language videos. This is an important task since significant content in sign language is often conveyed via fingerspelling, and to our knowledge the task has not been studied before. We propose an end-to-end model for this task, FSS-Net, that jointly detects fingerspelling and matches it to a text sequence. Our experiments, done on a large public dataset of ASL fingerspelling in the wild, show the importance of fingerspelling detection as a component of a search and retrieval model. Our model significantly outperforms baseline methods adapted from prior work on related tasks, Comment: ACL 2022
Published: 2022

46. Intestinal microbiota links to allograft stability after lung transplantation: a prospective cohort study

Author: Wu, Junqi, Li, Chongwu, Gao, Peigen, Zhang, Chenhong, Zhang, Pei, Zhang, Lei, Dai, Chenyang, Zhang, Kunpeng, Shi, Bowen, Liu, Mengyang, Zheng, Junmeng, Pan, Bo, Chen, Zhan, Zhang, Chao, Liao, Wanqing, Pan, Weihua, Fang, Wenjie, and Chen, Chang
Published: 2023
Full Text: View/download PDF

47. Identification of the molecular subtypes and construction of risk models in neuroblastoma

Author: He, Enyang, Shi, Bowen, Liu, Ziyu, Chang, Kaili, Zhao, Hailan, Zhao, Wei, and Cui, Hualei
Published: 2023
Full Text: View/download PDF

48. Domain-relevance of influence: characterizing variations in online influence across multiple domains on social media

Author: Shi, Bowen, Xu, Ke, and Zhao, Jichang
Published: 2023
Full Text: View/download PDF

49. Robust Self-Supervised Audio-Visual Speech Recognition

Author: Shi, Bowen, Hsu, Wei-Ning, and Mohamed, Abdelrahman
Subjects: Computer Science - Sound, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Audio-based automatic speech recognition (ASR) degrades significantly in noisy environments and is particularly vulnerable to interfering speech, as the model cannot determine which speaker to transcribe. Audio-visual speech recognition (AVSR) systems improve robustness by complementing the audio stream with the visual information that is invariant to noise and helps the model focus on the desired speaker. However, previous AVSR work focused solely on the supervised learning setup; hence the progress was hindered by the amount of labeled data available. In this work, we present a self-supervised AVSR framework built upon Audio-Visual HuBERT (AV-HuBERT), a state-of-the-art audio-visual speech representation learning model. On the largest available AVSR benchmark dataset LRS3, our approach outperforms prior state-of-the-art by ~50% (28.0% vs. 14.1%) using less than 10% of labeled data (433hr vs. 30hr) in the presence of babble noise, while reducing the WER of an audio-based model by over 75% (25.8% vs. 5.8%) on average., Comment: Interspeech 2022
Published: 2022

50. Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction

Author: Shi, Bowen, Hsu, Wei-Ning, Lakhotia, Kushal, and Mohamed, Abdelrahman
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Sound
Abstract: Video recordings of speech contain correlated audio and visual information, providing a strong signal for speech representation learning from the speaker's lip movements and the produced sound. We introduce Audio-Visual Hidden Unit BERT (AV-HuBERT), a self-supervised representation learning framework for audio-visual speech, which masks multi-stream video input and predicts automatically discovered and iteratively refined multimodal hidden units. AV-HuBERT learns powerful audio-visual speech representation benefiting both lip-reading and automatic speech recognition. On the largest public lip-reading benchmark LRS3 (433 hours), AV-HuBERT achieves 32.5% WER with only 30 hours of labeled data, outperforming the former state-of-the-art approach (33.6%) trained with a thousand times more transcribed video data (31K hours). The lip-reading WER is further reduced to 26.9% when using all 433 hours of labeled data from LRS3 and combined with self-training. Using our audio-visual representation on the same benchmark for audio-only speech recognition leads to a 40% relative WER reduction over the state-of-the-art performance (1.3% vs 2.3%). Our code and models are available at https://github.com/facebookresearch/av_hubert, Comment: ICLR 2022
Published: 2022
