Author: "Zheng, Siqi" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Zheng, Siqi"' showing total 692 results


1. Integrating Audio, Visual, and Semantic Information for Enhanced Multimodal Speaker Diarization

Author: Cheng, Luyao, Wang, Hui, Zheng, Siqi, Chen, Yafeng, Huang, Rongjie, Zhang, Qinglin, Chen, Qian, and Li, Xihao
Subjects: Computer Science - Machine Learning, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Speaker diarization, the process of segmenting an audio stream or transcribed speech content into homogenous partitions based on speaker identity, plays a crucial role in the interpretation and analysis of human speech. Most existing speaker diarization systems rely exclusively on unimodal acoustic information, making the task particularly challenging due to the innate ambiguities of audio signals. Recent studies have made tremendous efforts towards audio-visual or audio-semantic modeling to enhance performance. However, even the incorporation of up to two modalities often falls short in addressing the complexities of spontaneous and unstructured conversations. To exploit more meaningful dialogue patterns, we propose a novel multimodal approach that jointly utilizes audio, visual, and semantic cues to enhance speaker diarization. Our method elegantly formulates the multimodal modeling as a constrained optimization problem. First, we build insights into the visual connections among active speakers and the semantic interactions within spoken content, thereby establishing abundant pairwise constraints. Then we introduce a joint pairwise constraint propagation algorithm to cluster speakers based on these visual and semantic constraints. This integration effectively leverages the complementary strengths of different modalities, refining the affinity estimation between individual speaker embeddings. Extensive experiments conducted on multiple multimodal datasets demonstrate that our approach consistently outperforms state-of-the-art speaker diarization methods.
Published: 2024

2. CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens

Author: Du, Zhihao, Chen, Qian, Zhang, Shiliang, Hu, Kai, Lu, Heng, Yang, Yexin, Hu, Hangrui, Zheng, Siqi, Gu, Yue, Ma, Ziyang, Gao, Zhifu, and Yan, Zhijie
Subjects: Computer Science - Sound, Computer Science - Artificial Intelligence, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Recent years have witnessed large language model (LLM) based text-to-speech (TTS) emerging into the mainstream due to its high naturalness and zero-shot capacity. In this paradigm, speech signals are discretized into token sequences, which are modeled by an LLM with text as prompts and reconstructed by a token-based vocoder to waveforms. Obviously, speech tokens play a critical role in LLM-based TTS models. Current speech tokens are learned in an unsupervised manner and therefore lack explicit semantic information and alignment to the text. In this paper, we propose to represent speech with supervised semantic tokens, which are derived from a multilingual speech recognition model by inserting vector quantization into the encoder. Based on the tokens, we further propose a scalable zero-shot TTS synthesizer, CosyVoice, which consists of an LLM for text-to-token generation and a conditional flow matching model for token-to-speech synthesis. Experimental results show that supervised semantic tokens significantly outperform existing unsupervised tokens in terms of content consistency and speaker similarity for zero-shot voice cloning. Moreover, we find that utilizing large-scale data further improves the synthesis performance, indicating the scalable capacity of CosyVoice. To the best of our knowledge, this is the first attempt to incorporate supervised speech tokens into TTS models.
Comment: Work in progress. arXiv admin note: substantial text overlap with arXiv:2407.04051
Published: 2024

3. FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs

Author: An, Keyu, Chen, Qian, Deng, Chong, Du, Zhihao, Gao, Changfeng, Gao, Zhifu, Gu, Yue, He, Ting, Hu, Hangrui, Hu, Kai, Ji, Shengpeng, Li, Yabin, Li, Zerui, Lu, Heng, Luo, Haoneng, Lv, Xiang, Ma, Bin, Ma, Ziyang, Ni, Chongjia, Song, Changhe, Shi, Jiaqi, Shi, Xian, Wang, Hao, Wang, Wen, Wang, Yuxuan, Xiao, Zhangyu, Yan, Zhijie, Yang, Yexin, Zhang, Bin, Zhang, Qinglin, Zhang, Shiliang, Zhao, Nan, and Zheng, Siqi
Subjects: Computer Science - Sound, Computer Science - Artificial Intelligence, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: This report introduces FunAudioLLM, a model family designed to enhance natural voice interactions between humans and large language models (LLMs). At its core are two innovative models: SenseVoice, which handles multilingual speech recognition, emotion recognition, and audio event detection; and CosyVoice, which facilitates natural speech generation with control over multiple languages, timbre, speaking style, and speaker identity. SenseVoice-Small delivers exceptionally low-latency ASR for 5 languages, and SenseVoice-Large supports high-precision ASR for over 50 languages, while CosyVoice excels in multi-lingual voice generation, zero-shot in-context learning, cross-lingual voice cloning, and instruction-following capabilities. The models related to SenseVoice and CosyVoice have been open-sourced on Modelscope and Huggingface, along with the corresponding training, inference, and fine-tuning code released on GitHub. By integrating these models with LLMs, FunAudioLLM enables applications such as speech-to-speech translation, emotional voice chat, interactive podcasts, and expressive audiobook narration, thereby pushing the boundaries of voice interaction technology. Demos are available at https://fun-audio-llm.github.io, and the code can be accessed at https://github.com/FunAudioLLM.
Comment: Work in progress. Authors are listed in alphabetical order by family name
Published: 2024

4. Accompanied Singing Voice Synthesis with Fully Text-controlled Melody

Author: Li, Ruiqi, Hong, Zhiqing, Wang, Yongqi, Zhang, Lichao, Huang, Rongjie, Zheng, Siqi, and Zhao, Zhou
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Computation and Language, Computer Science - Sound
Abstract: Text-to-song (TTSong) is a music generation task that synthesizes accompanied singing voices. Current TTSong methods, inherited from singing voice synthesis (SVS), require melody-related information that can sometimes be impractical, such as music scores or MIDI sequences. We present MelodyLM, the first TTSong model that generates high-quality song pieces with fully text-controlled melodies, achieving minimal user requirements and maximum control flexibility. MelodyLM explicitly models MIDI as the intermediate melody-related feature and sequentially generates vocal tracks in a language model manner, conditioned on textual and vocal prompts. The accompaniment music is subsequently synthesized by a latent diffusion model with hybrid conditioning for temporal alignment. With minimal requirements, users only need to input lyrics and a reference voice to synthesize a song sample. For full control, users can input textual prompts or even directly input MIDI. Experimental results indicate that MelodyLM achieves superior performance in terms of both objective and subjective metrics. Audio samples are available at https://melodylm666.github.io.
Comment: Work in progress
Published: 2024

5. Intercity Connectivity and Innovation

Author: Liang, Xiaofan, Hidalgo, César A., Balland, Pierre-Alexandre, Zheng, Siqi, and Wang, Jianghao
Subjects: Physics - Physics and Society
Abstract: Urban outputs, from economy to innovation, are known to grow as a power of a city's population. But, since large cities tend to be central in transportation and communication networks, the effects attributed to city size may be confounded with those of intercity connectivity. Here, we map intercity networks for the world's two largest economies (the United States and China) to explore whether a city's position in the networks of communication, human mobility, and scientific collaboration explains variance in a city's patenting activity that is unaccounted for by its population. We find evidence that models incorporating intercity connectivity outperform population-based models and exhibit stronger predictive power for patenting activity, particularly for technologies of more recent vintage (which we expect to be more complex or sophisticated). The effects of intercity connectivity are more robust in China, even after controlling for population, GDP, and education, but not in the United States once adjusted for GDP and education. This divergence suggests distinct urban network dynamics driving innovation in these regions. In China, models with social media and mobility networks explain more heterogeneity in the scaling of innovation, whereas in the United States, scientific collaboration plays a more significant role. These findings support the significance of a city's position within the intercity network in shaping its success in innovative activities.
Comment: Peer-reviewed journal article. An interactive visualization and data are available at: https://github.com/xiaofanliang/intercity_connectivity
Published: 2024

6. Skip-Layer Attention: Bridging Abstract and Detailed Dependencies in Transformers

Author: Chen, Qian, Wang, Wen, Zhang, Qinglin, Zheng, Siqi, Zhang, Shiliang, Deng, Chong, Yu, Hai, Liu, Jiaqing, Ma, Yukun, and Zhang, Chong
Subjects: Computer Science - Computation and Language
Abstract: The Transformer architecture has significantly advanced deep learning, particularly in natural language processing, by effectively managing long-range dependencies. However, as the demand for understanding complex relationships grows, refining the Transformer's architecture becomes critical. This paper introduces Skip-Layer Attention (SLA) to enhance Transformer models by enabling direct attention between non-adjacent layers. This method improves the model's ability to capture dependencies between high-level abstract features and low-level details. By facilitating direct attention between these diverse feature levels, our approach overcomes the limitations of current Transformers, which often rely on suboptimal intra-layer attention. Our implementation extends the Transformer's functionality by enabling queries in a given layer to interact with keys and values from both the current layer and one preceding layer, thus enhancing the diversity of multi-head attention without additional computational burden. Extensive experiments demonstrate that our enhanced Transformer model achieves superior performance in language modeling tasks, highlighting the effectiveness of our skip-layer attention mechanism.
Comment: 7 pages, 1 figure
Published: 2024

7. Self-Distillation Prototypes Network: Learning Robust Speaker Representations without Supervision

Author: Chen, Yafeng, Zheng, Siqi, Wang, Hui, Cheng, Luyao, Chen, Qian, Zhang, Shiliang, and Wang, Wen
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: Training speaker-discriminative and robust speaker verification systems without explicit speaker labels remains a persistent challenge. In this paper, we propose a new self-supervised speaker verification approach, Self-Distillation Prototypes Network (SDPN), which effectively facilitates self-supervised speaker representation learning. SDPN assigns the representation of the augmented views of an utterance to the same prototypes as the representation of the original view, thereby enabling effective knowledge transfer between the views. Originally, due to the lack of negative pairs in the SDPN training process, the network tends to align positive pairs very closely in the embedding space, a phenomenon known as model collapse. To alleviate this problem, we introduce a diversity regularization term to embeddings in SDPN. Comprehensive experiments on the VoxCeleb datasets demonstrate the superiority of SDPN in self-supervised speaker verification. SDPN sets a new state-of-the-art on the VoxCeleb1 speaker verification evaluation benchmark, achieving Equal Error Rates of 1.80%, 1.99%, and 3.62% for the VoxCeleb1-O, VoxCeleb1-E, and VoxCeleb1-H trials, respectively, without using any speaker labels in training.
Comment: We update this paper to an earlier paper
Published: 2024

8. Dense Outflowing Molecular Gas in Massive Star-forming Regions

Author: Xu, Yani, Wang, Junzhi, Liu, Shu, Li, Juan, Li, Yuqiang, Luo, Rui, Ou, Chao, Zheng, Siqi, and Liu, Yijia
Subjects: Astrophysics - Astrophysics of Galaxies
Abstract: Dense outflowing gas, traced by transitions of molecules with large dipole moments, is important for understanding mass loss and feedback of massive star formation. HCN 3-2 and HCO$^+$ 3-2 are good tracers of dense outflowing molecular gas, which is closely related to active star formation. In this study, we present on-the-fly (OTF) mapping observations of HCN 3-2 and HCO$^+$ 3-2 toward a sample of 33 massive star-forming regions using the 10-m Submillimeter Telescope (SMT). With the spatial distribution of line wings of HCO$^+$ 3-2 and HCN 3-2, outflows are detected in 25 sources, resulting in a detection rate of 76%. The optically thin H$^{13}$CN and H$^{13}$CO$^+$ 3-2 lines are used to identify line wings as outflows and estimate core mass. The mass $M_{out}$, momentum $P_{out}$, kinetic energy $E_{K}$, force $F_{out}$, and mass loss rate $\dot M_{out}$ of the outflow, together with the core mass, are obtained for each source. A tight sublinear correlation is found between the mass of the dense molecular outflow and the core mass, with an index of $\sim$0.8 and a correlation coefficient of 0.88.
Comment: 22 pages, 5 figures, 4 tables, accepted in AJ
Published: 2024

9. ERes2NetV2: Boosting Short-Duration Speaker Verification Performance with Computational Efficiency

Author: Chen, Yafeng, Zheng, Siqi, Wang, Hui, Cheng, Luyao, Chen, Qian, Zhang, Shiliang, and Li, Junjie
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Electrical Engineering and Systems Science - Signal Processing
Abstract: Speaker verification systems experience significant performance degradation when tasked with short-duration trial recordings. To address this challenge, a multi-scale feature fusion approach has been proposed to effectively capture speaker characteristics from short utterances. Constrained by the model's size, a robust backbone Enhanced Res2Net (ERes2Net) combining global and local feature fusion demonstrates sub-optimal performance in short-duration speaker verification. To further improve the short-duration feature extraction capability of ERes2Net, we expand the channel dimension within each stage. However, this modification also increases the number of model parameters and computational complexity. To alleviate this problem, we propose an improved ERes2NetV2 by pruning redundant structures, ultimately reducing both the model parameters and its computational cost. A range of experiments conducted on the VoxCeleb datasets exhibits the superiority of ERes2NetV2, which achieves EER of 0.61% for the full-duration trial, 0.98% for the 3s-duration trial, and 1.48% for the 2s-duration trial on VoxCeleb1-O, respectively.
Published: 2024

10. ControlSpeech: Towards Simultaneous Zero-shot Speaker Cloning and Zero-shot Language Style Control With Decoupled Codec

Author: Ji, Shengpeng, Zuo, Jialong, Fang, Minghui, Zheng, Siqi, Chen, Qian, Wang, Wen, Jiang, Ziyue, Huang, Hai, Cheng, Xize, Huang, Rongjie, and Zhao, Zhou
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Machine Learning, Computer Science - Sound
Abstract: In this paper, we present ControlSpeech, a text-to-speech (TTS) system capable of fully cloning the speaker's voice and enabling arbitrary control and adjustment of speaking style, merely based on a few seconds of audio prompt and a simple textual style description prompt. Prior zero-shot TTS models and controllable TTS models either could only mimic the speaker's voice without further control and adjustment capabilities or were unrelated to speaker-specific voice generation. Therefore, ControlSpeech focuses on a more challenging new task: a TTS system with controllable timbre, content, and style at the same time. ControlSpeech takes speech prompts, content prompts, and style prompts as inputs and utilizes bidirectional attention and mask-based parallel decoding to capture corresponding codec representations in a discrete decoupling codec space. Moreover, we discovered the issue of text style controllability in a many-to-many mapping fashion and proposed the Style Mixture Semantic Density (SMSD) model to resolve this problem. The SMSD module, which is based on Gaussian mixture density networks, is designed to enhance the fine-grained partitioning and sampling capabilities of style semantic information and generate speech with more diverse styles. For experiments, we make available a controllable model toolkit called ControlToolkit with a new style-controllable dataset and some replicated baseline models, and we propose new metrics to evaluate both the control capability and the quality of generated audio in ControlSpeech. The relevant ablation studies validate the necessity of each component in ControlSpeech. We hope that ControlSpeech can establish the next foundation paradigm of controllable speech synthesis. The relevant code and demo are available at https://github.com/jishengpeng/ControlSpeech .
Published: 2024

11. AudioLCM: Text-to-Audio Generation with Latent Consistency Models

Author: Liu, Huadai, Huang, Rongjie, Liu, Yang, Cao, Hengyuan, Wang, Jialei, Cheng, Xize, Zheng, Siqi, and Zhao, Zhou
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: Recent advancements in Latent Diffusion Models (LDMs) have propelled them to the forefront of various generative tasks. However, their iterative sampling process poses a significant computational burden, resulting in slow generation speeds and limiting their application in text-to-audio generation deployment. In this work, we introduce AudioLCM, a novel consistency-based model tailored for efficient and high-quality text-to-audio generation. AudioLCM integrates Consistency Models into the generation process, facilitating rapid inference through a mapping from any point at any time step to the trajectory's initial point. To overcome the convergence issue inherent in LDMs with reduced sample iterations, we propose the Guided Latent Consistency Distillation with a multi-step Ordinary Differential Equation (ODE) solver. This innovation shortens the time schedule from thousands to dozens of steps while maintaining sample quality, thereby achieving fast convergence and high-quality generation. Furthermore, to optimize the performance of transformer-based neural network architectures, we integrate the advanced techniques pioneered by LLaMA into the foundational framework of transformers. This architecture supports stable and efficient training, ensuring robust performance in text-to-audio synthesis. Experimental results on text-to-sound generation and text-to-music synthesis tasks demonstrate that AudioLCM needs only 2 iterations to synthesize high-fidelity audios, while it maintains sample quality competitive with state-of-the-art models using hundreds of steps. AudioLCM enables a sampling speed of 333x faster than real-time on a single NVIDIA 4090Ti GPU, making generative models practically applicable to text-to-audio generation deployment. Our extensive preliminary analysis shows that each design in AudioLCM is effective.
Published: 2024

12. 3D-Speaker-Toolkit: An Open Source Toolkit for Multi-modal Speaker Verification and Diarization

Author: Chen, Yafeng, Zheng, Siqi, Wang, Hui, Cheng, Luyao, Zhu, Tinglong, Song, Changhe, Huang, Rongjie, Ma, Ziyang, Chen, Qian, Zhang, Shiliang, and Li, Xihao
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Electrical Engineering and Systems Science - Signal Processing
Abstract: This paper introduces 3D-Speaker-Toolkit, an open source toolkit for multi-modal speaker verification and diarization. It is designed for the needs of academic researchers and industrial practitioners. The 3D-Speaker-Toolkit adeptly leverages the combined strengths of acoustic, semantic, and visual data, seamlessly fusing these modalities to offer robust speaker recognition capabilities. The acoustic module extracts speaker embeddings from acoustic features, employing both fully-supervised and self-supervised learning approaches. The semantic module leverages advanced language models to apprehend the substance and context of spoken language, thereby augmenting the system's proficiency in distinguishing speakers through linguistic patterns. Finally, the visual module applies image processing technologies to scrutinize facial features, which bolsters the precision of speaker diarization in multi-speaker environments. Collectively, these modules empower the 3D-Speaker-Toolkit to attain elevated levels of accuracy and dependability in executing speaker-related tasks, establishing a new benchmark in multi-modal speaker analysis. The 3D-Speaker project also includes a handful of open-sourced state-of-the-art models and a large dataset containing over 10,000 speakers. The toolkit is publicly available at https://github.com/alibaba-damo-academy/3D-Speaker.
Published: 2024

13. Language-Codec: Reducing the Gaps Between Discrete Codec Representation and Speech Language Models

Author: Ji, Shengpeng, Fang, Minghui, Jiang, Ziyue, Zheng, Siqi, Chen, Qian, Huang, Rongjie, Zuo, Jialung, Wang, Shulei, and Zhao, Zhou
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: In recent years, large language models have achieved significant success in generative tasks (e.g., speech cloning and audio generation) related to speech, audio, music, and other signal domains. A crucial element of these models is the discrete acoustic codec, which serves as an intermediate representation replacing the mel-spectrogram. However, there exist several gaps between discrete codecs and downstream speech language models. Specifically, 1) most codec models are trained on only 1,000 hours of data, whereas most speech language models are trained on 60,000 hours; 2) achieving good reconstruction performance requires the utilization of numerous codebooks, which increases the burden on downstream speech language models; 3) the initial channel of the codebooks contains excessive information, making it challenging to directly generate acoustic tokens from weakly supervised signals such as text in downstream tasks. Consequently, leveraging the characteristics of speech language models, we propose Language-Codec. In Language-Codec, we introduce a Mask Channel Residual Vector Quantization (MCRVQ) mechanism along with improved Fourier transform structures and larger training datasets to address the aforementioned gaps. We compare our method with competing audio compression algorithms and observe that it significantly outperforms them across extensive evaluations. Furthermore, we also validate the efficiency of Language-Codec on downstream speech language models. The source code and pre-trained models can be accessed at https://github.com/jishengpeng/languagecodec .
Comment: We release a more powerful checkpoint in Language-Codec v3
Published: 2024

14. An Embarrassingly Simple Approach for LLM with Strong ASR Capacity

Author: Ma, Ziyang, Yang, Guanrou, Yang, Yifan, Gao, Zhifu, Wang, Jiaming, Du, Zhihao, Yu, Fan, Chen, Qian, Zheng, Siqi, Zhang, Shiliang, and Chen, Xie
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Multimedia, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: In this paper, we focus on solving one of the most important tasks in the field of speech processing, i.e., automatic speech recognition (ASR), with speech foundation encoders and large language models (LLM). Recent works have complex designs such as compressing the output temporally for the speech encoder, tackling modal alignment for the projector, and utilizing parameter-efficient fine-tuning for the LLM. We found that delicate designs are not necessary, while an embarrassingly simple composition of an off-the-shelf speech encoder, an LLM, and a linear projector as the only trainable module is competent for the ASR task. To be more specific, we benchmark and explore various combinations of LLMs and speech encoders, leading to the optimal LLM-based ASR system, which we call SLAM-ASR. The proposed SLAM-ASR provides a clean setup and little task-specific design, where only the linear projector is trained. To the best of our knowledge, SLAM-ASR achieves the best performance on the Librispeech benchmark among LLM-based ASR models and even outperforms the latest LLM-based audio-universal model trained on massive pair data. Finally, we explore the capability emergence of LLM-based ASR in the process of modal alignment. We hope that our study can facilitate the research on extending LLM with cross-modality capacity and shed light on the LLM-based ASR community.
Comment: Work in progress; will be open-sourced soon
Published: 2024

15. Exploring the varied effects of occupational and family stress on sleep quality in military personnel on islands: a cross-sectional study in China

Author: Xu, Ying, Guo, Xin, Xu, Shuyu, Zheng, Siqi, Xu, Jingzhou, Xiao, Lei, He, Jingwen, Ren, Pan, Wang, Hao, Cai, Wenpeng, Su, Tong, and Tang, Yunxiang
Published: 2024

16. Co-pyrolysis of poly (methyl methacrylate) and sugarcane bagasse evaluated through multiple kinetic analysis methods

Author: Zhang, Huili, Huang, Zhen, Liu, Ting-ting, and Zheng, Siqi
Published: 2024

17. The effects of the detection stimulus duration on the persistence of the attentional boost effect

Author: Meng, Yingfang, Huang, Fajie, Lin, Huiru, Zheng, Siqi, and Zhao, Xin
Published: 2024

18. The impact of encoding-retrieval perceptual differences on the retrieval-divided attention effect

Author: Lin, Xin, Wu, Mei, Meng, Yingfang, Zheng, Siqi, and Huang, Fajie
Published: 2024

19. Spatial distribution of NH2D in massive star-forming regions

Author: Li, Yuqiang, Wang, Junzhi, Li, Juan, Liu, Shu, Yang, Kai, Zheng, Siqi, and Lu, Zhe
Subjects: Astrophysics - Astrophysics of Galaxies, Astrophysics - Solar and Stellar Astrophysics
Abstract: To understand the relation between NH$_2$D and its physical environment, we mapped ortho-NH$_2$D $1_{11}^s-1_{01}^a$ at 85.9 GHz toward 24 Galactic late-stage massive star-forming regions with the Institut de Radioastronomie Millimétrique (IRAM) 30-m telescope. Ortho-NH$_2$D $1_{11}^s-1_{01}^a$ was detected in 18 of 24 sources. Compared with the distribution of H$^{13}$CN 1-0 as a dense gas tracer and the radio recombination line H42$\alpha$, ortho-NH$_2$D $1_{11}^s-1_{01}^a$ presents complex and diverse spatial distributions in these targets. Eleven of the 18 targets show a different distribution between ortho-NH$_2$D $1_{11}^s-1_{01}^a$ and H$^{13}$CN 1-0, while no significant difference between these two lines can be found in the other 7 sources, mainly due to limited spatial resolution and sensitivity. Moreover, with H42$\alpha$ tracing massive young stellar objects, ortho-NH$_2$D $1_{11}^s-1_{01}^a$ seems to show relatively weak emission near the massive young stellar objects.
Comment: 30 pages, 20 figures, 4 tables. Accepted to MNRAS
Published: 2023

20. Loss Masking Is Not Needed in Decoder-only Transformer for Discrete-token-based ASR

Author: Chen, Qian, Wang, Wen, Zhang, Qinglin, Zheng, Siqi, Zhang, Shiliang, Deng, Chong, Ma, Yukun, Yu, Hai, Liu, Jiaqing, and Zhang, Chong
Subjects: Computer Science - Computation and Language, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Recently, unified speech-text models, such as SpeechGPT, VioLA, and AudioPaLM, have achieved remarkable performance on various speech tasks. These models discretize speech signals into tokens (speech discretization) and use a shared vocabulary for both text and speech tokens. Then they train a single decoder-only Transformer on a mixture of speech tasks. However, these models rely on the Loss Masking strategy for the ASR task, which ignores the dependency among speech tokens. In this paper, we propose to model speech tokens in an autoregressive way, similar to text. We find that applying the conventional cross-entropy loss on input speech tokens does not consistently improve the ASR performance over the Loss Masking approach. To address this issue, we propose a novel approach denoted Smoothed Label Distillation (SLD), which applies a KL divergence loss with smoothed labels on speech tokens. Our experiments show that SLD effectively models speech tokens and outperforms Loss Masking for decoder-only Transformers in ASR tasks with different speech discretization methods. The source code can be found here: https://github.com/alibaba-damo-academy/SpokenNLP/tree/main/sld
Comment: 5 pages, accepted by ICASSP 2024
Published: 2023

21. Tentative detection of cyanoformamide NCCONH2 in space

Author: Li, Juan, Quan, Donghui, Wang, Junzhi, Zhang, Xia, Lu, Xing, Gou, Qian, Gao, Feng, Wu, Yajun, Bergin, Edwin, Li, Shanghuo, Shen, Zhiqiang, Du, Fujun, Li, Meng, Zheng, Siqi, and Zheng, Xingwu
Subjects: Astrophysics - Astrophysics of Galaxies, Astrophysics - Solar and Stellar Astrophysics
Abstract: The peptide-like molecule cyanoformamide (NCCONH2) is the cyano (CN) derivative of formamide (NH2CHO). It is known to play a role in the synthesis of nucleic acid precursors under prebiotic conditions. In this paper, we present a tentative detection of NCCONH2 in the interstellar medium (ISM) with Atacama Large Millimeter/submillimeter Array (ALMA) archive data. Ten unblended lines of NCCONH2 were seen around 3$\sigma$ noise levels toward Sagittarius B2(N1E), a position that is slightly offset from the continuum peak. The column density of NCCONH2 was estimated to be $2.4\times10^{15}$ cm$^{-2}$, and the fractional abundance of NCCONH2 toward Sgr B2(N1E) was $6.9\times10^{-10}$. The abundance ratio between NCCONH2 and NH2CHO is estimated to be ~0.01. We also searched for other peptide-like molecules toward Sgr B2(N1E). The abundances of NH2CHO, CH3NCO, and CH3NHCHO toward Sgr B2(N1E) were about one tenth of those toward Sgr B2(N1S), while the abundance of CH3CONH2 was only one twentieth of that toward Sgr B2(N1S).
Comment: 20 pages, 6 figures, 2 tables, accepted by PASJ
Published: 2023

22. Sulphur isotopes toward Sagittarius B2 extended envelope in the Galactic Center

Author: Li, Qingxu, Li, Juan, Zheng, Siqi, Wang, Junzhi, Gao, Feng, and Wu, Yajun
Subjects: Astrophysics - Astrophysics of Galaxies, Astrophysics - Solar and Stellar Astrophysics
Abstract: Isotopic ratios are good tools for probing stellar nucleosynthesis and chemical evolution. We performed high-sensitivity mapping observations of the J=7-6 rotational transitions of OCS, OC34S, O13CS, and OC33S toward the Galactic Center giant molecular cloud Sagittarius B2 (Sgr B2) with the IRAM 30m telescope. Positions with optically thin and uncontaminated lines are chosen to determine the sulfur isotope ratios. A 32S/34S ratio of $17.1\pm0.9$ was derived from the OCS and OC34S lines, while a 34S/33S ratio of $6.8\pm1.9$ was derived directly from the integrated intensity ratio of OC34S and OC33S. With independent and accurate measurements of the 32S/34S ratio, our results confirm the termination of the decreasing trend of 32S/34S ratios toward the Galactic Center, suggesting a drop in the production of massive stars at the Galactic Center.
Comment: 20 pages, 7 figures, accepted by PASJ
Published: 2023

23. Mapping Observations of Peptide-like molecules around Sagittarius B2

Author: Zheng, Siqi, Li, Juan, Wang, Junzhi, Wang, Yao, Gao, Feng, Quan, Donghui, Du, Fujun, Wu, Yajun, Bergin, Edwin, and Li, Yuqiang
Subjects: Astrophysics - Astrophysics of Galaxies
Abstract: Peptide-like molecules, which have a close connection with the origin of life, have been detected in the universe. Mapping observations of HCONH$_2$ and CH$_3$CONH$_2$, the two simplest peptide-like molecules, are performed toward the Sagittarius B2 (Sgr B2) complex with the IRAM 30m telescope. Seven transitions of HCONH$_2$ and five transitions of CH$_3$CONH$_2$ are used in the analysis. The spatial distributions of excitation temperature and column density of HCONH$_2$ in the molecular envelope of Sgr B2 are obtained from rotation diagrams. Assuming the same excitation temperature as HCONH$_2$, the column densities of CH$_3$CONH$_2$ are also calculated. The results show that the excitation temperature ranges from 6 K to 46 K in the molecular envelope of Sgr B2. The abundance ratio between HCONH$_2$ and CH$_3$CONH$_2$ is calculated to explore the relationship between them, as well as with HNCO, mentioned in our previous research. The abundance ratio of CH$_3$CONH$_2$/HCONH$_2$ varies from 10% to 20%, while that of HCONH$_2$/HNCO ranges from 1.5% to 10%. CH$_3$CONH$_2$ is enhanced with respect to HCONH$_2$ in the northwest region of Sgr B2. One transition of H$^{13}$CONH$_2$ is detected toward 12 positions of Sgr B2, from which a $^{12}$C/$^{13}$C ratio of 28.7 is obtained. A time-dependent chemical model with a short-duration X-ray burst is used to explain the observed abundances of HCONH$_2$ and CH$_3$CONH$_2$, with the best-fitting result at $T_{\rm dust}$ = 53-56 K. More chemical reactions need to be included in the model, since the modeled abundance is lower than the observed one at the observed $T_{\rm dust}$.
Published: 2023

24. LauraGPT: Listen, Attend, Understand, and Regenerate Audio with GPT

Author: Du, Zhihao, Wang, Jiaming, Chen, Qian, Chu, Yunfei, Gao, Zhifu, Li, Zerui, Hu, Kai, Zhou, Xiaohuan, Xu, Jin, Ma, Ziyang, Wang, Wen, Zheng, Siqi, Zhou, Chang, Yan, Zhijie, and Zhang, Shiliang
Subjects: Computer Science - Sound, Computer Science - Artificial Intelligence, Computer Science - Machine Learning, Computer Science - Multimedia, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Generative Pre-trained Transformer (GPT) models have achieved remarkable performance on various natural language processing tasks, and have shown great potential as backbones for audio-and-text large language models (LLMs). Previous mainstream audio-and-text LLMs use discrete audio tokens to represent both input and output audio; however, they suffer from performance degradation on tasks such as automatic speech recognition, speech-to-text translation, and speech enhancement compared with models using continuous speech features. In this paper, we propose LauraGPT, a novel unified audio-and-text GPT-based LLM for audio recognition, understanding, and generation. LauraGPT is a versatile LLM that can process both audio and text inputs and generate outputs in either modality. We propose a novel data representation that combines continuous and discrete features for audio: LauraGPT encodes input audio into continuous representations using an audio encoder and generates output audio from discrete codec codes. We propose a one-step codec vocoder to overcome the prediction challenge caused by the multimodal distribution of codec tokens. We fine-tune LauraGPT using supervised multi-task learning. Extensive experiments show that LauraGPT consistently achieves comparable or superior performance relative to strong baselines on a wide range of audio tasks related to content, semantics, paralinguistics, and audio-signal analysis, such as automatic speech recognition, speech-to-text translation, text-to-speech synthesis, speech enhancement, automated audio captioning, speech emotion recognition, and spoken language understanding., Comment: 10 pages, work in progress
Published: 2023

25. Improving Speaker Diarization using Semantic Information: Joint Pairwise Constraints Propagation

Author: Cheng, Luyao, Zheng, Siqi, Zhang, Qinglin, Wang, Hui, Chen, Yafeng, Chen, Qian, and Zhang, Shiliang
Subjects: Computer Science - Sound, Computer Science - Computation and Language, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Speaker diarization has gained considerable attention within the speech processing research community. Mainstream speaker diarization systems rely primarily on speakers' voice characteristics extracted from acoustic signals and often overlook the potential of semantic information. Considering the fact that speech signals can efficiently convey the content of a speech, it is of interest to fully exploit these semantic cues utilizing language models. In this work we propose a novel approach to effectively leverage semantic information in clustering-based speaker diarization systems. Firstly, we introduce spoken language understanding modules to extract speaker-related semantic information and utilize this information to construct pairwise constraints. Secondly, we present a novel framework to integrate these constraints into the speaker diarization pipeline, enhancing the performance of the entire system. Extensive experiments conducted on the public dataset demonstrate the consistent superiority of our proposed approach over acoustic-only speaker diarization systems.
Published: 2023

26. FunCodec: A Fundamental, Reproducible and Integrable Open-source Toolkit for Neural Speech Codec

Author: Du, Zhihao, Zhang, Shiliang, Hu, Kai, and Zheng, Siqi
Subjects: Computer Science - Sound, Computer Science - Artificial Intelligence, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: This paper presents FunCodec, a fundamental neural speech codec toolkit, which is an extension of the open-source speech processing toolkit FunASR. FunCodec provides reproducible training recipes and inference scripts for the latest neural speech codec models, such as SoundStream and Encodec. Thanks to the unified design with FunASR, FunCodec can be easily integrated into downstream tasks, such as speech recognition. Along with FunCodec, pre-trained models are also provided, which can be used for academic or generalized purposes. Based on the toolkit, we further propose the frequency-domain codec models, FreqCodec, which can achieve comparable speech quality with much lower computation and parameter complexity. Experimental results show that, under the same compression ratio, FunCodec can achieve better reconstruction quality compared with other toolkits and released models. We also demonstrate that the pre-trained models are suitable for downstream tasks, including automatic speech recognition and personalized text-to-speech synthesis. This toolkit is publicly available at https://github.com/alibaba-damo-academy/FunCodec., Comment: 5 pages, 3 figures, submitted to ICASSP 2024
Published: 2023

27. UbiPhysio: Support Daily Functioning, Fitness, and Rehabilitation with Action Understanding and Feedback in Natural Language

Author: Wang, Chongyang, Feng, Yuan, Zhong, Lingxiao, Zhu, Siyi, Zhang, Chi, Zheng, Siqi, Liang, Chen, Wang, Yuntao, He, Chengqi, Yu, Chun, and Shi, Yuanchun
Subjects: Computer Science - Human-Computer Interaction
Abstract: We introduce UbiPhysio, a milestone framework that delivers fine-grained action description and feedback in natural language to support people's daily functioning, fitness, and rehabilitation activities. This expert-like capability assists users in properly executing actions and maintaining engagement in remote fitness and rehabilitation programs. Specifically, the proposed UbiPhysio framework comprises a fine-grained action descriptor and a knowledge retrieval-enhanced feedback module. The action descriptor translates action data, represented by a set of biomechanical movement features we designed based on clinical priors, into textual descriptions of action types and potential movement patterns. Building on physiotherapeutic domain knowledge, the feedback module provides clear and engaging expert feedback. We evaluated UbiPhysio's performance through extensive experiments with data from 104 diverse participants, collected in a home-like setting during 25 types of everyday activities and exercises. We assessed the quality of the language output under different tuning strategies using standard benchmarks. We conducted a user study to gather insights from clinical physiotherapists and potential users about our framework. Our initial tests show promise for deploying UbiPhysio in real-life settings without specialized devices., Comment: Accepted by IMWUT/Ubicomp'24
Published: 2023

28. Self-Distillation Prototypes Network: Learning Robust Speaker Representations without Supervision

Author: Chen, Yafeng, Zheng, Siqi, Wang, Hui, Cheng, Luyao, Chen, Qian, Zhang, Shiliang, and Wang, Wen
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: Training speaker-discriminative and robust speaker verification systems without explicit speaker labels remains a persisting challenge. In this paper, we propose a new self-supervised speaker verification approach, Self-Distillation Prototypes Network (SDPN), which effectively facilitates self-supervised speaker representation learning. SDPN assigns the representation of the augmented views of an utterance to the same prototypes as the representation of the original view, thereby enabling effective knowledge transfer between the views. Originally, due to the lack of negative pairs in the SDPN training process, the network tends to align positive pairs very closely in the embedding space, a phenomenon known as model collapse. To alleviate this problem, we introduce a diversity regularization term to embeddings in SDPN. Comprehensive experiments on the VoxCeleb datasets demonstrate the superiority of SDPN in self-supervised speaker verification. SDPN sets a new state-of-the-art on the VoxCeleb1 speaker verification evaluation benchmark, achieving Equal Error Rate 1.80%, 1.99%, and 3.62% for trial VoxCeleb1-O, VoxCeleb1-E and VoxCeleb1-H respectively, without using any speaker labels in training. Ablation studies show that both proposed learnable prototypes in self-distillation network and diversity regularization contribute to the verification performance., Comment: arXiv admin note: text overlap with arXiv:2211.04168
Published: 2023

29. Improving BERT with Hybrid Pooling Network and Drop Mask

Author: Chen, Qian, Wang, Wen, Zhang, Qinglin, Deng, Chong, Ma, Yukun, and Zheng, Siqi
Subjects: Computer Science - Computation and Language
Abstract: Transformer-based pre-trained language models, such as BERT, achieve great success in various natural language understanding tasks. Prior research found that BERT captures a rich hierarchy of linguistic information at different layers. However, the vanilla BERT uses the same self-attention mechanism for each layer to model the different contextual features. In this paper, we propose a HybridBERT model which combines self-attention and pooling networks to encode different contextual features in each layer. Additionally, we propose a simple DropMask method to address the mismatch between pre-training and fine-tuning caused by excessive use of special mask tokens during Masked Language Modeling pre-training. Experiments show that HybridBERT outperforms BERT in pre-training with lower loss, faster training speed (8% relative), lower memory cost (13% relative), and also in transfer learning with 1.5% relative higher accuracies on downstream tasks. Additionally, DropMask improves accuracies of BERT on downstream tasks across various masking rates., Comment: 7 pages, 2 figures
Published: 2023

30. 3D-Speaker: A Large-Scale Multi-Device, Multi-Distance, and Multi-Dialect Corpus for Speech Representation Disentanglement

Author: Zheng, Siqi, Cheng, Luyao, Chen, Yafeng, Wang, Hui, and Chen, Qian
Subjects: Computer Science - Computation and Language, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Disentangling uncorrelated information in speech utterances is a crucial research topic within the speech community. Different speech-related tasks focus on extracting distinct speech representations while minimizing the effects of other uncorrelated information. We present a large-scale speech corpus to facilitate research on speech representation disentanglement. 3D-Speaker contains over 10,000 speakers, each of whom is simultaneously recorded by multiple Devices located at different Distances, and some speakers speak multiple Dialects. The controlled combinations of multi-dimensional audio data yield a matrix of a diverse blend of speech representation entanglement, thereby motivating intriguing methods to untangle them. The multi-domain nature of 3D-Speaker also makes it a suitable resource for evaluating large universal speech models and for experimenting with methods of out-of-domain learning and self-supervised learning. https://3dspeaker.github.io/
Published: 2023

31. Exploring Speaker-Related Information in Spoken Language Understanding for Better Speaker Diarization

Author: Cheng, Luyao, Zheng, Siqi, Zhang, Qinglin, Wang, Hui, Chen, Yafeng, and Chen, Qian
Subjects: Computer Science - Computation and Language, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Speaker diarization (SD) is a classic task in speech processing and is crucial in multi-party scenarios such as meetings and conversations. Current mainstream speaker diarization approaches consider acoustic information only, which results in performance degradation when encountering adverse acoustic conditions. In this paper, we propose methods to extract speaker-related information from semantic content in multi-party meetings, which, as we will show, can further benefit speaker diarization. We introduce two sub-tasks, Dialogue Detection and Speaker-Turn Detection, in which we effectively extract speaker information from conversational semantics. We also propose a simple yet effective algorithm to jointly model acoustic and semantic information and obtain speaker-identified texts. Experiments on both AISHELL-4 and AliMeeting datasets show that our method achieves consistent improvements over acoustic-only speaker diarization systems., Comment: Accepted to Findings of ACL 2023
Published: 2023

32. An Enhanced Res2Net with Local and Global Feature Fusion for Speaker Verification

Author: Chen, Yafeng, Zheng, Siqi, Wang, Hui, Cheng, Luyao, Chen, Qian, and Qi, Jiajun
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: Effective fusion of multi-scale features is crucial for improving speaker verification performance. Most existing methods aggregate multi-scale features in a layer-wise manner via simple operations, such as summation or concatenation. This paper proposes a novel architecture called Enhanced Res2Net (ERes2Net), which incorporates both local and global feature fusion techniques to improve the performance. The local feature fusion (LFF) fuses the features within one single residual block to extract the local signal. The global feature fusion (GFF) takes acoustic features of different scales as input to aggregate the global signal. To facilitate effective feature fusion in both LFF and GFF, an attentional feature fusion module is employed in the ERes2Net architecture, replacing summation or concatenation operations. A range of experiments conducted on the VoxCeleb datasets demonstrates the superiority of the ERes2Net in speaker verification. Code has been made publicly available at https://github.com/alibaba-damo-academy/3D-Speaker.
Published: 2023

33. Ditto: A Simple and Efficient Approach to Improve Sentence Embeddings

Author: Chen, Qian, Wang, Wen, Zhang, Qinglin, Zheng, Siqi, Deng, Chong, Yu, Hai, Liu, Jiaqing, Ma, Yukun, and Zhang, Chong
Subjects: Computer Science - Computation and Language
Abstract: Prior studies diagnose the anisotropy problem in sentence representations from pre-trained language models, e.g., BERT, without fine-tuning. Our analysis reveals that the sentence embeddings from BERT suffer from a bias towards uninformative words, limiting the performance in semantic textual similarity (STS) tasks. To address this bias, we propose a simple and efficient unsupervised approach, Diagonal Attention Pooling (Ditto), which weights words with model-based importance estimations and computes the weighted average of word representations from pre-trained models as sentence embeddings. Ditto can be easily applied to any pre-trained language model as a postprocessing operation. Compared to prior sentence embedding approaches, Ditto does not add parameters nor requires any learning. Empirical evaluations demonstrate that our proposed Ditto can alleviate the anisotropy problem and improve various pre-trained models on STS tasks., Comment: 8 pages, accepted by EMNLP 2023 short paper, the source code can be found at https://github.com/alibaba-damo-academy/SpokenNLP/tree/main/ditto
Published: 2023

34. $^{18}$O$/^{17}$O abundance ratio toward a sample of massive star forming regions with parallax distances

Author: Ou, Chao, Wang, Junzhi, Zheng, Siqi, Li, Juan, Romano, Donatella, and Zhang, Zhi-Yu
Subjects: Astrophysics - Astrophysics of Galaxies
Abstract: The $^{18}$O$/^{17}$O abundance ratio is, in principle, a powerful tool to estimate the relative contributions of massive stars and low- to intermediate-mass stars to the chemical enrichment of galaxies. We present $^{18}$O$/^{17}$O ratios derived from simultaneous observations of C$^{18}$O and C$^{17}$O 1-0 toward fifty-one massive star forming regions with the Institut de Radioastronomie Millimétrique (IRAM) 30 meter telescope. Simultaneous observations of HC$^{18}$O$^{+}$ 1-0 and HC$^{17}$O$^{+}$ 1-0 with the Yebes 40m telescope toward five sources from this sample were also done to test the consistency of $^{18}$O$/^{17}$O ratios derived from different isotopic pairs. From our improved measurements, resulting in smaller errors than previous work in the literature, we obtain a clear trend of increasing $^{18}$O$/^{17}$O ratio with increasing galactocentric distance (D$_{GC}$), which provides a significant constraint on Galactic chemical evolution (GCE) models. Current GCE models have to be improved in order to explain the observed C$^{18}$O/C$^{17}$O 1-0 gradient., Comment: 15 pages, 5 figures, 1 table, Accepted for publication in MNRAS
Published: 2023
Full Text: View/download PDF

35. Imaging Molecular Outflow in Massive Star-forming Regions with HNCO Lines

Author: Xie, Jinjin, Li, Juan, Wang, Junzhi, Liu, Shu, Yang, Kai, Quan, Donghui, Zheng, Siqi, Li, Yuqiang, Wu, Jingwen, Duan, Yan, and Li, Di
Subjects: Astrophysics - Astrophysics of Galaxies, Astrophysics - Solar and Stellar Astrophysics
Abstract: Protostellar outflows are considered a signpost of star formation. These outflows can cause shocks in the molecular gas and are typically traced by the line wings of certain molecules. HNCO (4-3) has been regarded as a shock tracer because of its high abundance in shocked regions. Here we present the first imaging results of HNCO (4-3) line wings toward nine sources in a sample of twenty-three massive star-forming regions using the IRAM 30 m telescope. We adopt the velocity range of the full width of HC$_{3}$N (10-9) and H$^{13}$CO$^+$ (1-0) emissions as the central emission values, beyond which the emission from HNCO (4-3) is considered to be from line wings. The spatial distributions of the red- and/or blue-lobes of HNCO (4-3) emission nicely associate with those lobes of HCO$^{+}$ (1-0) in most of the sources. High intensity ratios of HNCO (4-3) to HCO$^+$ (1-0) are obtained in the line wings. The derived column density ratios of HNCO to HCO$^+$ are consistent with those previously observed towards massive star-forming regions. These results provide direct evidence that HNCO could trace outflow in massive star-forming regions. This work also implies that the formation of some HNCO molecules is related to shock, either on the grain surface or within the shocked gas., Comment: 18 pages, 4 tables, 4 figures, and accepted for publication in ApJ
Published: 2023
Full Text: View/download PDF

36. CAM++: A Fast and Efficient Network for Speaker Verification Using Context-Aware Masking

Author: Wang, Hui, Zheng, Siqi, Chen, Yafeng, Cheng, Luyao, and Chen, Qian
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Time delay neural network (TDNN) has been proven to be efficient for speaker verification. One of its successful variants, ECAPA-TDNN, achieved state-of-the-art performance at the cost of much higher computational complexity and slower inference speed. This makes it inadequate for scenarios with demanding inference rate and limited computational resources. We are thus interested in finding an architecture that can achieve the performance of ECAPA-TDNN and the efficiency of vanilla TDNN. In this paper, we propose an efficient network based on context-aware masking, namely CAM++, which uses densely connected time delay neural network (D-TDNN) as backbone and adopts a novel multi-granularity pooling to capture contextual information at different levels. Extensive experiments on two public benchmarks, VoxCeleb and CN-Celeb, demonstrate that the proposed architecture outperforms other mainstream speaker verification systems with lower computational cost and faster inference speed.
Published: 2023

37. Entrepreneurship, sustainability, and urban development

Author: Qian, Haifeng, Wu, Jing, and Zheng, Siqi
Published: 2024
Full Text: View/download PDF

38. Positive effects of napping on memory consolidation and resistance against interference

Author: Wang, Yajing, Meng, Yao, Zhang, Ruike, Du, Jing, Huang, Yujia, Zheng, Siqi, Xu, Shuyu, Xiao, Lei, Xu, Jingzhou, Wang, Hao, Su, Tong, and Tang, Yunxiang
Published: 2023
Full Text: View/download PDF

39. DopplerBAS: Binaural Audio Synthesis Addressing Doppler Effect

Author: Liu, Jinglin, Ye, Zhenhui, Chen, Qian, Zheng, Siqi, Wang, Wen, Zhang, Qinglin, and Zhao, Zhou
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Recently, binaural audio synthesis (BAS) has emerged as a promising research field for its applications in augmented and virtual realities. Binaural audio helps users orient themselves and establish immersion by providing the brain with interaural time differences reflecting spatial information. However, existing BAS methods are limited in terms of phase estimation, which is crucial for spatial hearing. In this paper, we propose the DopplerBAS method to explicitly address the Doppler effect of the moving sound source. Specifically, we calculate the radial relative velocity of the moving speaker in spherical coordinates, which further guides the synthesis of binaural audio. This simple method introduces no additional hyper-parameters and does not modify the loss functions, and is plug-and-play: it scales well to different types of backbones. DopplerBAS distinctly improves the representative WarpNet and BinauralGrad backbones in the phase error metric and reaches a new state of the art (SOTA): 0.780 (versus the current SOTA 0.807). Experiments and ablation studies demonstrate the effectiveness of our method., Comment: Accepted to ACL 2023 short paper; key words: binaural audio, stereophonic sound
Published: 2022

40. Contextual Expressive Text-to-Speech

Author: Tu, Jianhong, Cui, Zeyu, Zhou, Xiaohuan, Zheng, Siqi, Hu, Kai, Fan, Ju, and Zhou, Chang
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Computation and Language, Computer Science - Machine Learning, Computer Science - Multimedia
Abstract: The goal of expressive Text-to-speech (TTS) is to synthesize natural speech with desired content, prosody, emotion, or timbre, with high expressiveness. Most previous studies attempt to generate speech from given labels of styles and emotions, which over-simplifies the problem by classifying styles and emotions into a fixed number of pre-defined categories. In this paper, we introduce a new task setting, Contextual TTS (CTTS). The main idea of CTTS is that how a person speaks depends on the particular context she is in, where the context can typically be represented as text. Thus, in the CTTS task, we propose to utilize such context to guide the speech synthesis process instead of relying on explicit labels of styles and emotions. To achieve this task, we construct a synthetic dataset and develop an effective framework. Experiments show that our framework can generate high-quality expressive speech based on the given context both in synthetic datasets and real-world scenarios., Comment: Submitted to ICASSP 2023
Published: 2022

41. Speaker Overlap-aware Neural Diarization for Multi-party Meeting Analysis

Author: Du, Zhihao, Zhang, Shiliang, Zheng, Siqi, and Yan, Zhijie
Subjects: Computer Science - Sound, Computer Science - Multimedia, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Recently, hybrid systems of clustering and neural diarization models have been successfully applied in multi-party meeting analysis. However, current models always treat overlapped speaker diarization as a multi-label classification problem, where speaker dependency and overlaps are not well considered. To overcome the disadvantages, we reformulate overlapped speaker diarization task as a single-label prediction problem via the proposed power set encoding (PSE). Through this formulation, speaker dependency and overlaps can be explicitly modeled. To fully leverage this formulation, we further propose the speaker overlap-aware neural diarization (SOND) model, which consists of a context-independent (CI) scorer to model global speaker discriminability, a context-dependent scorer (CD) to model local discriminability, and a speaker combining network (SCN) to combine and reassign speaker activities. Experimental results show that using the proposed formulation can outperform the state-of-the-art methods based on target speaker voice activity detection, and the performance can be further improved with SOND, resulting in a 6.30% relative diarization error reduction., Comment: Accepted by EMNLP 2022
Published: 2022

42. Pushing the limits of self-supervised speaker verification using regularized distillation framework

Author: Chen, Yafeng, Zheng, Siqi, Wang, Hui, Cheng, Luyao, and Chen, Qian
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: Training robust speaker verification systems without speaker labels has long been a challenging task. Previous studies observed a large performance gap between self-supervised and fully supervised methods. In this paper, we apply a non-contrastive self-supervised learning framework called DIstillation with NO labels (DINO) and propose two regularization terms applied to embeddings in DINO. One regularization term guarantees the diversity of the embeddings, while the other regularization term decorrelates the variables of each embedding. The effectiveness of various data augmentation techniques is explored in both the time and frequency domains. A range of experiments conducted on the VoxCeleb datasets demonstrates the superiority of the regularized DINO framework in speaker verification. Our method achieves state-of-the-art speaker verification performance under a single-stage self-supervised setting on VoxCeleb. Code has been made publicly available at https://github.com/alibaba-damo-academy/3D-Speaker.
Published: 2022

43. A Comparison of Reproducibility Guidelines and Its Implications on Undergraduate Statistical Education

Author: Zheng, Siqi
Subjects: Statistics - Other Statistics
Abstract: In this paper, we replicated a Bayesian educational research project, which explores the association between broadband access and online course enrollment in the US. We summarized key findings from our replication and compared them with the original project. Based on our replication experience, we aim to demonstrate the challenges of research reproduction, even when code and data are shared openly and the quality of the materials on GitHub is high. Moreover, we investigate the implicit presumptions about the researchers' level of knowledge and discuss how such presumptions may add difficulty to the reproduction of scientific research. Finally, we hope this article sheds light on the design of reproducibility criteria and opens up a space to explore what should be taught in undergraduate statistics education.
Published: 2022

44. Deep Representation Decomposition for Rate-Invariant Speaker Verification

Author: Tong, Fuchuan, Zheng, Siqi, Zhou, Haodong, Xie, Xingjia, Hong, Qingyang, and Li, Lin
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: While promising performance for speaker verification has been achieved by deep speaker embeddings, the advantage would reduce in the case of speaking-style variability. Speaking rate mismatch is often observed in practical speaker verification systems, which may actually degrade the system performance. To reduce intra-class discrepancy caused by speaking rate, we propose a deep representation decomposition approach with adversarial learning to learn speaking rate-invariant speaker embeddings. Specifically, adopting an attention block, we decompose the original embedding into an identity-related component and a rate-related component through multi-task training. Additionally, to reduce the latent relationship between the two decomposed components, we further propose a cosine mapping block to train the parameters adversarially to minimize the cosine similarity between the two decomposed components. As a result, identity-related features become robust to speaking rate and then are used for verification. Experiments are conducted on VoxCeleb1 data and HI-MIA data to demonstrate the effectiveness of our proposed approach., Comment: Accepted by Odyssey 2022
Published: 2022

45. PRISM: Pre-trained Indeterminate Speaker Representation Model for Speaker Diarization and Speaker Verification

Author: Zheng, Siqi, Suo, Hongbin, and Chen, Qian
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Speaker embedding has been a fundamental feature for speaker-related tasks such as verification, clustering, and diarization. Traditionally, speaker embeddings are represented as fixed vectors in high-dimensional space. This could lead to biased estimations, especially when handling shorter utterances. In this paper we propose to represent a speaker utterance as a "floating" vector whose state is indeterminate without knowing the context. The state of a speaker representation is jointly determined by itself, other speech from the same speaker, as well as other speakers it is being compared to. The content of the speech also contributes to determining the final state of a speaker representation. We pre-train an indeterminate speaker representation model that estimates the state of an utterance based on the context. The pre-trained model can be fine-tuned for downstream tasks such as speaker verification, speaker clustering, and speaker diarization. Substantial improvements are observed across all downstream tasks., Comment: INTERSPEECH 2022
Published: 2022

46. The universality in urban commuting across and within cities

Author: Dong, Lei, Santi, Paolo, Liu, Yu, Zheng, Siqi, and Ratti, Carlo
Subjects: Physics - Physics and Society, Economics - General Economics
Abstract: Commuting is a key mechanism that governs the dynamics of cities. Despite its importance, very little is known of the properties and mechanisms underlying this crucial urban process. Here, we capitalize on $\sim$ 50 million individuals' smartphone data from 234 Chinese cities to show that urban commuting obeys remarkable regularities. These regularities can be generalized as two laws: (i) the scale-invariance of the average commuting distance across cities, which is a long-awaited validation of Marchetti's constant conjecture, and (ii) a universal inverted U-shape of the commuting distance as a function of the distance from the city centre within cities, indicating that the city centre's attraction is bounded. Motivated by such empirical findings, we develop a simple urban growth model that connects individual-level mobility choices with macroscopic urban spatial structure and faithfully explains both commuting laws. Our results further show that the scale-invariants of human mobility will ultimately lead to the polycentric transition in cities, which could be used to better inform urban development strategies.
Published: 2022

47. Reformulating Speaker Diarization as Community Detection With Emphasis On Topological Structure

Author: Zheng, Siqi and Suo, Hongbin
Subjects: Computer Science - Sound, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Clustering-based speaker diarization has stood firm as one of the major approaches in reality, despite recent development in end-to-end diarization. However, clustering methods have not been explored extensively for speaker diarization. Commonly-used methods such as k-means, spectral clustering, and agglomerative hierarchical clustering only take into account properties such as proximity and relative densities. In this paper we propose to view clustering-based diarization as a community detection problem. By doing so the topological structure is considered. This work has four major contributions. First, it is shown that the Leiden community detection algorithm significantly outperforms the previous methods on the clustering of speaker-segments. Second, we propose to use uniform manifold approximation to reduce dimension while retaining global and local topological structure. Third, a masked filtering approach is introduced to extract "clean" speaker embeddings. Finally, the community structure is applied to an end-to-end post-processing network to obtain diarization results. The final system presents a relative DER reduction of up to 70 percent. The breakdown contribution of each component is analyzed.
Comment: ICASSP 2022
Published: 2022

48. Graph Convolutional Network Based Semi-Supervised Learning on Multi-Speaker Meeting Data

Author: Tong, Fuchuan, Zheng, Siqi, Zhang, Min, Chen, Yafeng, Suo, Hongbin, Hong, Qingyang, and Li, Lin
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: Unsupervised clustering on speakers is becoming increasingly important for its potential uses in semi-supervised learning. In reality, we are often presented with enormous amounts of unlabeled data from multi-party meetings and discussions. An effective unsupervised clustering approach would allow us to significantly increase the amount of training data without additional costs for annotations. Recently, methods based on graph convolutional networks (GCN) have received growing attention for unsupervised clustering, as these methods exploit the connectivity patterns between nodes to improve learning performance. In this work, we present a GCN-based approach for semi-supervised learning. Given a pre-trained embedding extractor, a graph convolutional network is trained on the labeled data and clusters unlabeled data with "pseudo-labels". We present a self-correcting training mechanism that iteratively runs the cluster-train-correct process on pseudo-labels. We show that this proposed approach effectively uses unlabeled data and improves speaker recognition accuracy.
Comment: Accepted by ICASSP 2022
Published: 2022

49. Spatial distribution of HOCN around Sagittarius B2

Author: Zheng, Siqi, Li, Juan, Wang, Junzhi, Gao, Feng, Wu, Yajun, Liu, Shu, and Li, Shanghuo
Subjects: Astrophysics - Astrophysics of Galaxies
Abstract: HOCN and HNCO abundance ratio in molecular gas can tell us the information of their formation mechanism. We performed high-sensitivity mapping observations of HOCN, HNCO, and HNC$^{18}$O lines around Sagittarius B2 (Sgr B2) with IRAM 30m telescope at 3-mm wavelength. HNCO 4$_{04}$-3$_{03}$ and HOCN 4$_{04}$-3$_{03}$ are used to obtain the abundance ratio of HNCO to HOCN. The ratio of HNCO 4$_{04}$-3$_{03}$ to HNC$^{18}$O 4$_{04}$-3$_{03}$ is used to calculate the optical depth of HNCO 4$_{04}$-3$_{03}$. The abundance ratio of HOCN and HNCO is observed to range from 0.4% to 0.7% toward most positions, which agrees well with the gas-grain model. However, the relative abundance of HOCN is observed to be enhanced toward the direction of Sgr B2 (S), with HOCN to HNCO abundance ratio of $\sim$ 0.9%. The reason for that still needs further investigation. Based on the intensity ratio of HNCO and HNC$^{18}$O lines, we updated the isotopic ratio of $^{16}$O/$^{18}$O to be 296 $\pm$ 54 in Sgr B2.
Published: 2022

50. Speaker Embedding-aware Neural Diarization: an Efficient Framework for Overlapping Speech Diarization in Meeting Scenarios

Author: Du, Zhihao, Zhang, Shiliang, Zheng, Siqi, and Yan, Zhijie
Subjects: Computer Science - Sound, Computer Science - Machine Learning, Computer Science - Multimedia, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Overlapping speech diarization has been traditionally treated as a multi-label classification problem. In this paper, we reformulate this task as a single-label prediction problem by encoding multiple binary labels into a single label with the power set, which represents the possible combinations of target speakers. This formulation has two benefits. First, the overlaps of target speakers are explicitly modeled. Second, threshold selection is no longer needed. Through this formulation, we propose the speaker embedding-aware neural diarization (SEND) framework, where a speech encoder, a speaker encoder, two similarity scorers, and a post-processing network are jointly optimized to predict the encoded labels according to the similarities between speech features and speaker embeddings. Experimental results show that SEND has a stable learning process and can be trained on highly overlapped data without extra initialization. More importantly, our method achieves the state-of-the-art performance in real meeting scenarios with fewer model parameters and lower computational complexity.
Comment: Submitted to INTERSPEECH 2022, 5 pages, 2 figures
Published: 2022
