Author: "Xue Wei" / Publication Type: Reports - Searchworks@Jio Institute Digital Library Search Results

Your search for "Xue Wei" returned 144 results.

1. pTSE-T: Presentation Target Speaker Extraction using Unaligned Text Cues

Author: Jiang, Ziyang, Qian, Xinyuan, Lei, Jiahe, Pan, Zexu, Xue, Wei, and Yin, Xu-cheng
Subjects: Computer Science - Sound, Computer Science - Multimedia, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Target speaker extraction (TSE) aims to extract the clean speech of the target speaker from an audio mixture, thus eliminating irrelevant background noise and speech. While prior work has explored various auxiliary cues including pre-recorded speech, visual information (e.g., lip motions and gestures), and spatial information, the acquisition and selection of such strong cues are infeasible in many practical scenarios. Unlike all existing work, in this paper, we condition the TSE algorithm on semantic cues extracted from limited and unaligned text content, such as condensed points from a presentation slide. This method is particularly useful in scenarios like meetings, poster sessions, or lecture presentations, where acquiring other cues in real-time is challenging. To this end, we design two different networks. Specifically, our proposed TPE fuses audio features with content-based semantic cues to facilitate time-frequency mask generation to filter out extraneous noise, while another proposal, namely TSR, employs the contrastive learning technique to associate blindly separated speech signals with semantic cues. The experimental results show the efficacy of our approach in accurately identifying the target speaker by utilizing semantic cues derived from limited and unaligned text, resulting in an SI-SDRi of 12.16 dB, an SDRi of 12.66 dB, a PESQi of 0.830, and an STOIi of 0.150. Dataset and source code will be publicly available. Project demo page: https://slideTSE.github.io/.
Published: 2024
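
Note on the metric: the abstract above reports SI-SDRi, the improvement in scale-invariant signal-to-distortion ratio of the extracted speech over the unprocessed mixture. The sketch below shows how that quantity is commonly computed under the standard SI-SDR definition. It is not the authors' evaluation code; the function names `si_sdr` and `si_sdr_improvement` are hypothetical, and signals are assumed to be equal-length lists of samples.

```python
import math

def si_sdr(estimate, reference):
    """Scale-invariant SDR in dB (standard definition, assumed here)."""
    # Remove the mean so a constant offset does not affect the score.
    est_mean = sum(estimate) / len(estimate)
    ref_mean = sum(reference) / len(reference)
    est = [e - est_mean for e in estimate]
    ref = [r - ref_mean for r in reference]
    # Project the estimate onto the reference to obtain the scaled target.
    scale = sum(e * r for e, r in zip(est, ref)) / sum(r * r for r in ref)
    target = [scale * r for r in ref]
    # Everything that is not the scaled target counts as distortion.
    noise = [e - t for e, t in zip(est, target)]
    target_energy = sum(t * t for t in target)
    noise_energy = sum(n * n for n in noise)
    # Note: a perfect estimate gives zero noise energy and is not handled here.
    return 10.0 * math.log10(target_energy / noise_energy)

def si_sdr_improvement(estimate, mixture, reference):
    """SI-SDRi: gain of the estimate over the raw mixture, in dB."""
    return si_sdr(estimate, reference) - si_sdr(mixture, reference)
```

Because the estimate is projected onto the reference before energies are compared, rescaling the estimate by any nonzero constant leaves the score unchanged, which is the property the "SI" prefix refers to.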

2. EVA: An Embodied World Model for Future Video Anticipation

Author: Chi, Xiaowei, Zhang, Hengyuan, Fan, Chun-Kai, Qi, Xingqun, Zhang, Rongyu, Chen, Anthony, Chan, Chi-Min, Xue, Wei, Luo, Wenhan, Zhang, Shanghang, and Guo, Yike
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Multimedia, Computer Science - Robotics
Abstract: World models integrate raw data from various modalities, such as images and language, to simulate comprehensive interactions in the world, thereby playing crucial roles in fields like mixed reality and robotics. Yet, applying the world model for accurate video prediction is quite challenging due to the complex and dynamic intentions of the various scenes in practice. In this paper, inspired by the human rethinking process, we decompose the complex video prediction into four meta-tasks that enable the world model to handle this issue in a more fine-grained manner. Alongside these tasks, we introduce a new benchmark named Embodied Video Anticipation Benchmark (EVA-Bench) to provide a well-rounded evaluation. EVA-Bench focuses on evaluating the video prediction ability of human and robot actions, presenting significant challenges for both the language model and the generation model. Targeting embodied video prediction, we propose the Embodied Video Anticipator (EVA), a unified framework aiming at video understanding and generation. EVA integrates a video generation model with a visual language model, effectively combining reasoning capabilities with high-quality generation. Moreover, to enhance the generalization of our framework, we tailor-design a multi-stage pretraining paradigm that adaptively ensembles LoRA to produce high-fidelity results. Extensive experiments on EVA-Bench highlight the potential of EVA to significantly improve performance in embodied scenes, paving the way for large-scale pre-trained models in real-world prediction tasks.
Published: 2024

3. FlashAudio: Rectified Flows for Fast and High-Fidelity Text-to-Audio Generation

Author: Liu, Huadai, Wang, Jialei, Huang, Rongjie, Liu, Yang, Lu, Heng, Xue, Wei, and Zhao, Zhou
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: Recent advancements in latent diffusion models (LDMs) have markedly enhanced text-to-audio generation, yet their iterative sampling processes impose substantial computational demands, limiting practical deployment. While recent methods utilizing consistency-based distillation aim to achieve few-step or single-step inference, their one-step performance is constrained by curved trajectories, preventing them from surpassing traditional diffusion models. In this work, we introduce FlashAudio with rectified flows to learn straight flow for fast simulation. To alleviate the inefficient timestep allocation and suboptimal distribution of noise, FlashAudio optimizes the time distribution of rectified flow with Bifocal Samplers and proposes immiscible flow to minimize the total distance of data-noise pairs in a batch via assignment. Furthermore, to address the amplified accumulation error caused by the classifier-free guidance (CFG), we propose Anchored Optimization, which refines the guidance scale by anchoring it to a reference trajectory. Experimental results on text-to-audio generation demonstrate that FlashAudio's one-step generation performance surpasses the diffusion-based models with hundreds of sampling steps on audio quality and enables a sampling speed of 400x faster than real-time on a single NVIDIA 4090Ti GPU.
Published: 2024
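
Note on the pairing step: the abstract above mentions minimizing the total distance of data-noise pairs in a batch via assignment, and learning straight flows. The sketch below illustrates only that general idea on a tiny batch; it is not FlashAudio's implementation. The brute-force search over permutations stands in for an efficient assignment solver, and the names `assign_noise`, `total_distance`, and `interpolate` are hypothetical.

```python
import itertools
import math

def total_distance(data, noise, perm):
    """Sum of Euclidean distances when data[i] is paired with noise[perm[i]]."""
    return sum(math.dist(data[i], noise[j]) for i, j in enumerate(perm))

def assign_noise(data, noise):
    """Return the pairing (as a tuple of noise indices) with minimal total distance.

    Brute force over all permutations: only practical for very small batches.
    """
    indices = range(len(noise))
    return min(itertools.permutations(indices),
               key=lambda perm: total_distance(data, noise, perm))

def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t = 0) to data x1 (t = 1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

# Two data points and two noise samples: the nearer noise sample goes to each data point.
data = [[0.0, 0.0], [10.0, 10.0]]
noise = [[9.0, 9.0], [1.0, 1.0]]
pairing = assign_noise(data, noise)
midpoint = interpolate(noise[pairing[0]], data[0], 0.5)
```

Along a straight path the velocity target `x1 - x0` is constant in `t`, which is what makes a single large integration step plausible once the paths are nearly straight.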

4. Both Ears Wide Open: Towards Language-Driven Spatial Audio Generation

Author: Sun, Peiwen, Cheng, Sitong, Li, Xiangtai, Ye, Zhen, Liu, Huadai, Zhang, Honggang, Xue, Wei, and Guo, Yike
Subjects: Computer Science - Sound, Computer Science - Computer Vision and Pattern Recognition, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Recently, diffusion models have achieved great success in mono-channel audio generation. However, when it comes to stereo audio generation, the soundscapes often have a complex scene of multiple objects and directions. Controlling stereo audio with spatial contexts remains challenging due to high data costs and unstable generative models. To the best of our knowledge, this work represents the first attempt to address these issues. We first construct a large-scale, simulation-based, and GPT-assisted dataset, BEWO-1M, with abundant soundscapes and descriptions even including moving and multiple sources. Beyond text modality, we have also acquired a set of images and rationally paired stereo audios through retrieval to advance multimodal generation. Existing audio generation models tend to generate rather random and indistinct spatial audio. To provide accurate guidance for latent diffusion models, we introduce the SpatialSonic model utilizing spatial-aware encoders and azimuth state matrices to reveal reasonable spatial guidance. By leveraging spatial guidance, our unified model not only achieves the objective of generating immersive and controllable spatial audio from text and image but also enables interactive audio generation during inference. Finally, under fair settings, we conduct subjective and objective evaluations on simulated and real-world data to compare our approach with prevailing methods. The results demonstrate the effectiveness of our method, highlighting its capability to generate spatial audio that adheres to physical rules.
Published: 2024

5. Editing Music with Melody and Text: Using ControlNet for Diffusion Transformer

Author: Hou, Siyuan, Liu, Shansong, Yuan, Ruibin, Xue, Wei, Shan, Ying, Zhao, Mangsuo, and Zhang, Chao
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: Despite the significant progress in controllable music generation and editing, challenges remain in the quality and length of generated music due to the use of Mel-spectrogram representations and UNet-based model structures. To address these limitations, we propose a novel approach using a Diffusion Transformer (DiT) augmented with an additional control branch using ControlNet. This allows for long-form and variable-length music generation and editing controlled by text and melody prompts. For more precise and fine-grained melody control, we introduce a novel top-$k$ constant-Q Transform representation as the melody prompt, reducing ambiguity compared to previous representations (e.g., chroma), particularly for music with multiple tracks or a wide range of pitch values. To effectively balance the control signals from text and melody prompts, we adopt a curriculum learning strategy that progressively masks the melody prompt, resulting in a more stable training process. Experiments have been performed on text-to-music generation and music-style transfer tasks using open-source instrumental recording data. The results demonstrate that by extending StableAudio, a pre-trained text-controlled DiT model, our approach enables superior melody-controlled editing while retaining good text-to-music generation performance. These results outperform a strong MusicGen baseline in terms of both text-based generation and melody preservation for editing. Audio examples can be found at https://stable-audio-control.github.io/web/. Comment: 5 pages, 1 figure
Published: 2024

6. You Know What I'm Saying: Jailbreak Attack via Implicit Reference

Author: Wu, Tianyu, Mei, Lingrui, Yuan, Ruibin, Li, Lujun, Xue, Wei, and Guo, Yike
Subjects: Computer Science - Computation and Language
Abstract: While recent advancements in large language model (LLM) alignment have enabled the effective identification of malicious objectives involving scene nesting and keyword rewriting, our study reveals that these methods remain inadequate at detecting malicious objectives expressed through context within nested harmless objectives. This study identifies a previously overlooked vulnerability, which we term Attack via Implicit Reference (AIR). AIR decomposes a malicious objective into permissible objectives and links them through implicit references within the context. This method employs multiple related harmless objectives to generate malicious content without triggering refusal responses, thereby effectively bypassing existing detection techniques. Our experiments demonstrate AIR's effectiveness across state-of-the-art LLMs, achieving an attack success rate (ASR) exceeding 90% on most models, including GPT-4o, Claude-3.5-Sonnet, and Qwen-2-72B. Notably, we observe an inverse scaling phenomenon, where larger models are more vulnerable to this attack method. These findings underscore the urgent need for defense mechanisms capable of understanding and preventing contextual attacks. Furthermore, we introduce a cross-model attack strategy that leverages less secure models to generate malicious contexts, thereby further increasing the ASR when targeting other models. Our code and jailbreak artifacts can be found at https://github.com/Lucas-TY/llm_Implicit_reference.
Published: 2024

7. PSHuman: Photorealistic Single-view Human Reconstruction using Cross-Scale Diffusion

Author: Li, Peng, Zheng, Wangguandong, Liu, Yuan, Yu, Tao, Li, Yangguang, Qi, Xingqun, Li, Mengfei, Chi, Xiaowei, Xia, Siyu, Xue, Wei, Luo, Wenhan, Liu, Qifeng, and Guo, Yike
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Detailed and photorealistic 3D human modeling is essential for various applications and has seen tremendous progress. However, full-body reconstruction from a monocular RGB image remains challenging due to the ill-posed nature of the problem and sophisticated clothing topology with self-occlusions. In this paper, we propose PSHuman, a novel framework that explicitly reconstructs human meshes utilizing priors from the multiview diffusion model. It is found that directly applying multiview diffusion on single-view human images leads to severe geometric distortions, especially on generated faces. To address this, we propose a cross-scale diffusion that models the joint probability distribution of global full-body shape and local facial characteristics, enabling detailed and identity-preserved novel-view generation without any geometric distortion. Moreover, to enhance cross-view body shape consistency of varied human poses, we condition the generative model on parametric models like SMPL-X, which provide body priors and prevent unnatural views inconsistent with human anatomy. Leveraging the generated multi-view normal and color images, we present SMPLX-initialized explicit human carving to recover realistic textured human meshes efficiently. Extensive experimental results and quantitative evaluations on CAPE and THuman2.1 datasets demonstrate PSHuman's superiority in geometry details, texture fidelity, and generalization capability.
Published: 2024

8. HiPrompt: Tuning-free Higher-Resolution Generation with Hierarchical MLLM Prompts

Author: Liu, Xinyu, He, Yingqing, Guo, Lanqing, Li, Xiang, Jin, Bu, Li, Peng, Li, Yan, Chan, Chi-Min, Chen, Qifeng, Xue, Wei, Luo, Wenhan, Liu, Qifeng, and Guo, Yike
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: The potential for higher-resolution image generation using pretrained diffusion models is immense, yet these models often struggle with object repetition and structural artifacts, especially when scaling to 4K resolution and higher. We find that the problem arises because a single prompt used for generation at multiple scales is insufficiently effective. In response, we propose HiPrompt, a new tuning-free solution that tackles the above problems by introducing hierarchical prompts. The hierarchical prompts offer both global and local guidance. Specifically, the global guidance comes from the user input that describes the overall content, while the local guidance utilizes patch-wise descriptions from MLLMs to elaborately guide the regional structure and texture generation. Furthermore, during the inverse denoising process, the generated noise is decomposed into low- and high-frequency spatial components. These components are conditioned on multiple prompt levels, including detailed patch-wise descriptions and broader image-level prompts, facilitating prompt-guided denoising under hierarchical semantic guidance. It further allows the generation to focus more on local spatial regions and ensures the generated images maintain coherent local and global semantics, structures, and textures with high definition. Extensive experiments demonstrate that HiPrompt outperforms state-of-the-art works in higher-resolution image generation, significantly reducing object repetition and enhancing structural quality. Comment: https://liuxinyv.github.io/HiPrompt/
Published: 2024

9. Codec Does Matter: Exploring the Semantic Shortcoming of Codec for Audio Language Model

Author: Ye, Zhen, Sun, Peiwen, Lei, Jiahe, Lin, Hongzhan, Tan, Xu, Dai, Zheqi, Kong, Qiuqiang, Chen, Jianyi, Pan, Jiahao, Liu, Qifeng, Guo, Yike, and Xue, Wei
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Sound
Abstract: Recent advancements in audio generation have been significantly propelled by the capabilities of Large Language Models (LLMs). The existing research on audio LLM has primarily focused on enhancing the architecture and scale of audio language models, as well as leveraging larger datasets, and generally, acoustic codecs, such as EnCodec, are used for audio tokenization. However, these codecs were originally designed for audio compression, which may lead to suboptimal performance in the context of audio LLM. Our research aims to address the shortcomings of current audio LLM codecs, particularly their challenges in maintaining semantic integrity in generated audio. For instance, existing methods like VALL-E, which condition acoustic token generation on text transcriptions, often suffer from content inaccuracies and elevated word error rates (WER) due to semantic misinterpretations of acoustic tokens, resulting in word skipping and errors. To overcome these issues, we propose a straightforward yet effective approach called X-Codec. X-Codec incorporates semantic features from a pre-trained semantic encoder before the Residual Vector Quantization (RVQ) stage and introduces a semantic reconstruction loss after RVQ. By enhancing the semantic ability of the codec, X-Codec significantly reduces WER in speech synthesis tasks and extends these benefits to non-speech applications, including music and sound generation. Our experiments in text-to-speech, music continuation, and text-to-sound tasks demonstrate that integrating semantic information substantially improves the overall performance of language models in audio generation. Our code and demo are available (Demo: https://x-codec-audio.github.io Code: https://github.com/zhenye234/xcodec)
Published: 2024

10. AgentMonitor: A Plug-and-Play Framework for Predictive and Secure Multi-Agent Systems

Author: Chan, Chi-Min, Yu, Jianxuan, Chen, Weize, Jiang, Chunyang, Liu, Xinyu, Shi, Weijie, Liu, Zhiyuan, Xue, Wei, and Guo, Yike
Subjects: Computer Science - Computation and Language
Abstract: The rapid advancement of large language models (LLMs) has led to the rise of LLM-based agents. Recent research shows that multi-agent systems (MAS), where each agent plays a specific role, can outperform individual LLMs. However, configuring an MAS for a task remains challenging, with performance only observable post-execution. Inspired by scaling laws in LLM development, we investigate whether MAS performance can be predicted beforehand. We introduce AgentMonitor, a framework that integrates at the agent level to capture inputs and outputs, transforming them into statistics for training a regression model to predict task performance. Additionally, it can further apply real-time corrections to address security risks posed by malicious agents, mitigating negative impacts and enhancing MAS security. Experiments demonstrate that an XGBoost model achieves a Spearman correlation of 0.89 in-domain and 0.58 in more challenging scenarios. Furthermore, using AgentMonitor reduces harmful content by 6.2% and increases helpful content by 1.8% on average, enhancing safety and reliability. Code is available at https://github.com/chanchimin/AgentMonitor.
Published: 2024

11. Importance Weighting Can Help Large Language Models Self-Improve

Author: Jiang, Chunyang, Chan, Chi-Min, Xue, Wei, Liu, Qifeng, and Guo, Yike
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Large language models (LLMs) have shown remarkable capability in numerous tasks and applications. However, fine-tuning LLMs using high-quality datasets under external supervision remains prohibitively expensive. In response, LLM self-improvement approaches have been actively developed recently. The typical paradigm of LLM self-improvement involves training an LLM on self-generated data, part of which may be detrimental and should be filtered out due to the unstable data quality. While current work primarily employs filtering strategies based on answer correctness, in this paper, we demonstrate that filtering out samples that are correct but have a high distribution shift extent (DSE) could also benefit the results of self-improvement. Given that the actual sample distribution is usually inaccessible, we propose a new metric called DS weight to approximate DSE, inspired by the Importance Weighting methods. Consequently, we integrate DS weight with self-consistency to comprehensively filter the self-generated samples and fine-tune the language model. Experiments show that with only a tiny valid set (up to 5% of the size of the training set) to compute DS weight, our approach can notably promote the reasoning ability of current LLM self-improvement methods. The resulting performance is on par with methods that rely on external supervision from pre-trained reward models.
Published: 2024

12. NoRA: Nested Low-Rank Adaptation for Efficient Fine-Tuning Large Models

Author: Lin, Cheng, Li, Lujun, Li, Dezhi, Zou, Jie, Xue, Wei, and Guo, Yike
Subjects: Computer Science - Machine Learning
Abstract: In this paper, we introduce Nested Low-Rank Adaptation (NoRA), a novel approach to parameter-efficient fine-tuning that extends the capabilities of Low-Rank Adaptation (LoRA) techniques. Vanilla LoRA overlooks pre-trained weight inheritance and still requires fine-tuning numerous parameters. To address these issues, our NoRA adopts a dual-layer nested structure with Singular Value Decomposition (SVD), effectively leveraging original matrix knowledge while reducing tunable parameters. Specifically, NoRA freezes the outer LoRA weights and utilizes an inner LoRA design, providing enhanced control over model optimization. This approach allows the model to more precisely adapt to specific tasks while maintaining a compact parameter space. Evaluations on tasks including commonsense reasoning with large language models, fine-tuning vision-language models, and subject-driven generation demonstrate NoRA's superiority over LoRA and its variants. Code will be released upon acceptance. Comment: Work in progress, revisions ongoing
Published: 2024

13. STBLLM: Breaking the 1-Bit Barrier with Structured Binary LLMs

Author: Dong, Peijie, Li, Lujun, Zhong, Yuedong, Du, Dayou, Fan, Ruibo, Chen, Yuhan, Tang, Zhenheng, Wang, Qiang, Xue, Wei, Guo, Yike, and Chu, Xiaowen
Subjects: Computer Science - Machine Learning, Computer Science - Computation and Language
Abstract: In this paper, we present the first structural binarization method for LLM compression to less than 1-bit precision. Although LLMs have achieved remarkable performance, their memory-bound nature during the inference stage hinders their adoption on resource-constrained devices. Reducing weights to 1-bit precision through binarization substantially enhances computational efficiency. We observe that some weights in binarized LLMs can be randomly flipped without significant performance degradation, suggesting the potential for further compression. To exploit this, our STBLLM employs an N:M sparsity technique to achieve structural binarization of the weights. Specifically, we introduce a novel Standardized Importance (SI) metric, which considers weight magnitude and input feature norm to more accurately assess weight significance. Then, we propose a layer-wise approach, allowing different layers of the LLM to be sparsified with varying N:M ratios, thereby balancing compression and accuracy. Furthermore, we implement a fine-grained grouping strategy for less important weights, applying distinct quantization schemes to sparse, intermediate, and dense regions. Finally, we design a specialized CUDA kernel to support structural binarization. We conduct extensive experiments on LLaMA-1/2/3, OPT family, and Mistral to evaluate the effectiveness of STBLLM. The results demonstrate that our approach performs better than other compressed binarization LLM methods while significantly reducing memory requirements.
Published: 2024
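
Note on N:M sparsity: the abstract above combines N:M structured sparsity with binarization. The sketch below shows the generic form of both steps on a flat list of weights. It is not STBLLM's method: it ranks weights by plain magnitude rather than the paper's Standardized Importance metric, and the single shared scale factor in `binarize` is an assumption made for illustration. Both function names are hypothetical.

```python
import math

def nm_sparsify(weights, n, m):
    """Keep the n largest-magnitude weights in each group of m consecutive weights."""
    if len(weights) % m != 0:
        raise ValueError("number of weights must be a multiple of m")
    pruned = []
    for start in range(0, len(weights), m):
        group = weights[start:start + m]
        # Indices of the n entries with the largest absolute value in this group.
        keep = set(sorted(range(m), key=lambda i: abs(group[i]), reverse=True)[:n])
        pruned.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return pruned

def binarize(weights):
    """Map each surviving weight to +alpha or -alpha; pruned weights stay at zero.

    alpha is the mean absolute value of the nonzero weights (an assumed choice).
    """
    nonzero = [w for w in weights if w != 0.0]
    alpha = sum(abs(w) for w in nonzero) / len(nonzero) if nonzero else 0.0
    return [math.copysign(alpha, w) if w != 0.0 else 0.0 for w in weights]

# 2:4 sparsity: two of every four consecutive weights survive, then get binarized.
weights = [0.9, -0.1, 0.2, -0.7, 0.05, 0.3, -0.4, 0.01]
sparse = nm_sparsify(weights, n=2, m=4)
binary = binarize(sparse)
```

With 2:4 sparsity half of the weights are zero and the rest carry only a sign, which is how the average storage per weight can fall below one bit once the sparsity pattern is encoded compactly.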

14. Can LLMs 'Reason' in Music? An Evaluation of LLMs' Capability of Music Understanding and Generation

Author: Zhou, Ziya, Wu, Yuhang, Wu, Zhiyue, Zhang, Xinyue, Yuan, Ruibin, Ma, Yinghao, Wang, Lu, Benetos, Emmanouil, Xue, Wei, and Guo, Yike
Subjects: Computer Science - Sound, Computer Science - Computation and Language, Computer Science - Multimedia, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Symbolic music, akin to language, can be encoded in discrete symbols. Recent research has extended the application of large language models (LLMs) such as GPT-4 and Llama2 to the symbolic music domain including understanding and generation. Yet scant research explores the details of how these LLMs perform on advanced music understanding and conditioned generation, especially from the multi-step reasoning perspective, which is a critical aspect in the conditioned, editable, and interactive human-computer co-creation process. This study conducts a thorough investigation of LLMs' capability and limitations in symbolic music processing. We identify that current LLMs exhibit poor performance in song-level multi-step music reasoning, and typically fail to leverage learned music knowledge when addressing complex musical tasks. An analysis of LLMs' responses distinctly highlights their pros and cons. Our findings suggest that advanced musical capability is not intrinsically obtained by LLMs, and future research should focus more on bridging the gap between music knowledge and reasoning, to improve the co-creation experience for musicians. Comment: Accepted by ISMIR2024
Published: 2024

15. MMTrail: A Multimodal Trailer Video Dataset with Language and Music Descriptions

Author: Chi, Xiaowei, Wang, Yatian, Cheng, Aosong, Fang, Pengjun, Tian, Zeyue, He, Yingqing, Liu, Zhaoyang, Qi, Xingqun, Pan, Jiahao, Zhang, Rongyu, Li, Mengfei, Yuan, Ruibin, Jiang, Yanbing, Xue, Wei, Luo, Wenhan, Chen, Qifeng, Zhang, Shanghang, Liu, Qifeng, and Guo, Yike
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Multimedia, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Massive multi-modality datasets play a significant role in facilitating the success of large video-language models. However, current video-language datasets primarily provide text descriptions for visual frames, considering audio to be weakly related information. They usually overlook the potential of inherent audio-visual correlation, leading to monotonous annotation within each modality instead of comprehensive and precise descriptions. Such neglect makes multiple cross-modality studies difficult. To fill this gap, we present MMTrail, a large-scale multi-modality video-language dataset incorporating more than 20M trailer clips with visual captions, and 2M high-quality clips with multimodal captions. Trailers preview full-length video works and integrate context, visual frames, and background music. In particular, the trailer has two main advantages: (1) the topics are diverse, and the content characters are of various types, e.g., film, news, and gaming; (2) the corresponding background music is custom-designed, making it more coherent with the visual context. Upon these insights, we propose a systematic captioning framework, achieving various modality annotations with more than 27.1k hours of trailer videos. Here, to ensure the caption retains music perspective while preserving the authority of visual context, we leverage the advanced LLM to merge all annotations adaptively. In this fashion, our MMTrail dataset potentially paves the path for fine-grained large multimodal-language model training. In experiments, we provide evaluation metrics and benchmark results on our dataset, demonstrating the high quality of our annotation and its effectiveness for model training. Comment: 15 Pages. Dataset report
Published: 2024

16. M-LRM: Multi-view Large Reconstruction Model

Author: Li, Mengfei, Long, Xiaoxiao, Liang, Yixun, Li, Weiyu, Liu, Yuan, Li, Peng, Chi, Xiaowei, Qi, Xingqun, Xue, Wei, Luo, Wenhan, Liu, Qifeng, and Guo, Yike
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Despite recent advancements in the Large Reconstruction Model (LRM) demonstrating impressive results, when extending its input from a single image to multiple images, it exhibits inefficiencies, subpar geometric and texture quality, as well as slower convergence speed than expected. This is because LRM formulates 3D reconstruction as a naive images-to-3D translation problem, ignoring the strong 3D coherence among the input images. In this paper, we propose a Multi-view Large Reconstruction Model (M-LRM) designed to efficiently reconstruct high-quality 3D shapes from multi-views in a 3D-aware manner. Specifically, we introduce a multi-view consistent cross-attention scheme to enable M-LRM to accurately query information from the input images. Moreover, we employ the 3D priors of the input multi-view images to initialize the tri-plane tokens. Compared to LRM, the proposed M-LRM can produce a tri-plane NeRF with $128 \times 128$ resolution and generate 3D shapes of high fidelity. Experimental studies demonstrate that our model achieves a significant performance gain and faster training convergence than LRM. Project page: https://murphylmf.github.io/M-LRM/
Published: 2024

17. VidMuse: A Simple Video-to-Music Generation Framework with Long-Short-Term Modeling

Author: Tian, Zeyue, Liu, Zhaoyang, Yuan, Ruibin, Pan, Jiahao, Liu, Qifeng, Tan, Xu, Chen, Qifeng, Xue, Wei, and Guo, Yike
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning, Computer Science - Multimedia, Computer Science - Sound
Abstract: In this work, we systematically study music generation conditioned solely on the video. First, we present a large-scale dataset comprising 360K video-music pairs, including various genres such as movie trailers, advertisements, and documentaries. Furthermore, we propose VidMuse, a simple framework for generating music aligned with video inputs. VidMuse stands out by producing high-fidelity music that is both acoustically and semantically aligned with the video. By incorporating local and global visual cues, VidMuse enables the creation of musically coherent audio tracks that consistently match the video content through Long-Short-Term modeling. Through extensive experiments, VidMuse outperforms existing models in terms of audio quality, diversity, and audio-visual alignment. The code and datasets will be available at https://github.com/ZeyueT/VidMuse/.
Published: 2024

18. LLMs Meet Multimodal Generation and Editing: A Survey

Author: He, Yingqing, Liu, Zhaoyang, Chen, Jingye, Tian, Zeyue, Liu, Hongyu, Chi, Xiaowei, Liu, Runtao, Yuan, Ruibin, Xing, Yazhou, Wang, Wenhai, Dai, Jifeng, Zhang, Yong, Xue, Wei, Liu, Qifeng, Guo, Yike, and Chen, Qifeng
Subjects: Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Multimedia, Computer Science - Sound
Abstract: With the recent advancement in large language models (LLMs), there is a growing interest in combining LLMs with multimodal learning. Previous surveys of multimodal large language models (MLLMs) mainly focus on multimodal understanding. This survey elaborates on multimodal generation and editing across various domains, comprising image, video, 3D, and audio. Specifically, we summarize the notable advancements with milestone works in these fields and categorize these studies into LLM-based and CLIP/T5-based methods. Then, we summarize the various roles of LLMs in multimodal generation and exhaustively investigate the critical technical components behind these methods and the multimodal datasets utilized in these studies. Additionally, we dig into tool-augmented multimodal agents that can leverage existing generative models for human-computer interaction. Lastly, we discuss the advancements in the generative AI safety field, investigate emerging applications, and discuss future prospects. Our work provides a systematic and insightful overview of multimodal generation and processing, which is expected to advance the development of Artificial Intelligence for Generative Content (AIGC) and world models. A curated list of all related papers can be found at https://github.com/YingqingHe/Awesome-LLMs-meet-Multimodal-Generation. Comment: 52 Pages with 16 Figures, 12 Tables, and 545 References.
Published: 2024

19. CoCoGesture: Toward Coherent Co-speech 3D Gesture Generation in the Wild

Author: Qi, Xingqun, Zhang, Hengyuan, Wang, Yatian, Pan, Jiahao, Liu, Chen, Li, Peng, Chi, Xiaowei, Li, Mengfei, Zhang, Qixun, Xue, Wei, Zhang, Shanghang, Luo, Wenhan, Liu, Qifeng, and Guo, Yike
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Deriving co-speech 3D gestures has seen tremendous progress in virtual avatar animation. Yet, the existing methods often produce stiff and unreasonable gestures with unseen human speech inputs due to the limited 3D speech-gesture data. In this paper, we propose CoCoGesture, a novel framework enabling vivid and diverse gesture synthesis from unseen human speech prompts. Our key insight is built upon the custom-designed pretrain-finetune training paradigm. At the pretraining stage, we aim to formulate a large generalizable gesture diffusion model by learning the abundant postures manifold. Therefore, to alleviate the scarcity of 3D data, we first construct a large-scale co-speech 3D gesture dataset containing more than 40M meshed posture instances across 4.3K speakers, dubbed GES-X. Then, we scale up the large unconditional diffusion model to 1B parameters and pre-train it to be our gesture experts. At the finetune stage, we present the audio ControlNet that incorporates the human voice as condition prompts to guide the gesture generation. Here, we construct the audio ControlNet through a trainable copy of our pre-trained diffusion model. Moreover, we design a novel Mixture-of-Gesture-Experts (MoGE) block to adaptively fuse the audio embedding from the human speech and the gesture features from the pre-trained gesture experts with a routing mechanism. This design ensures that the audio embedding is temporally coordinated with motion features while preserving vivid and diverse gesture generation. Extensive experiments demonstrate that our proposed CoCoGesture outperforms the state-of-the-art methods on zero-shot speech-to-gesture generation. The dataset will be publicly available at: https://mattie-e.github.io/GES-X/
Comment: The dataset will be released as soon as possible
Published: 2024

20. Gravitational Production of Heavy Particles during and after Inflation

Author: Racco, Davide, Verner, Sarunas, and Xue, Wei
Subjects: High Energy Physics - Phenomenology, Astrophysics - Cosmology and Nongalactic Astrophysics
Abstract: We investigate the gravitational production of a scalar field $\chi$ with a mass exceeding the Hubble scale during inflation $m_\chi \gtrsim H_I$, employing both analytical and numerical approaches. We demonstrate that the steepest descent method effectively captures the epochs and yields of gravitational production in a compact and simple analytical framework. These analytical results align with the numerical solutions of the field equation. Our study covers three spacetime backgrounds: de Sitter, power-law inflation, and the Starobinsky inflation model. Within these models, we identify two distinct phases of particle production: during and after inflation. During inflation, we derive an accurate analytic expression for the particle production rate, accounting for a varying Hubble rate. After inflation, the additional burst of particle production depends on the inflaton mass around its minimum. When this mass is smaller than the Hubble scale during inflation, $H_I$, there is no significant extra production. However, if the inflaton mass is larger, post-inflation production becomes the dominant contribution. Furthermore, we explore the implications of gravitationally produced heavy fields for dark matter abundance, assuming their cosmological stability.
Comment: 28 pages + appendices, 13 figures
Published: 2024

21. VAE-Var: Variational-Autoencoder-Enhanced Variational Assimilation

Author: Xiao, Yi, Jia, Qilong, Xue, Wei, and Bai, Lei
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Mathematics - Dynamical Systems, Physics - Atmospheric and Oceanic Physics
Abstract: Data assimilation refers to a set of algorithms designed to compute the optimal estimate of a system's state by refining the prior prediction (known as background states) using observed data. Variational assimilation methods rely on the maximum likelihood approach to formulate a variational cost, with the optimal state estimate derived by minimizing this cost. Although traditional variational methods have achieved great success and have been widely used in many numerical weather prediction centers, they generally assume Gaussian errors in the background states, which limits the accuracy of these algorithms due to the inherent inaccuracies of this assumption. In this paper, we introduce VAE-Var, a novel variational algorithm that leverages a variational autoencoder (VAE) to model a non-Gaussian estimate of the background error distribution. We theoretically derive the variational cost under the VAE estimation and present the general formulation of VAE-Var; we implement VAE-Var on low-dimensional chaotic systems and demonstrate through experimental results that VAE-Var consistently outperforms traditional variational assimilation methods in terms of accuracy across various observational settings.
Published: 2024

22. FastSAG: Towards Fast Non-Autoregressive Singing Accompaniment Generation

Author: Chen, Jianyi, Xue, Wei, Tan, Xu, Ye, Zhen, Liu, Qifeng, and Guo, Yike
Subjects: Computer Science - Sound, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Multimedia, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Singing Accompaniment Generation (SAG), which generates instrumental music to accompany input vocals, is crucial to developing human-AI symbiotic art creation systems. The state-of-the-art method, SingSong, utilizes a multi-stage autoregressive (AR) model for SAG; however, this method is extremely slow because it generates semantic and acoustic tokens recursively, which makes real-time applications impossible. In this paper, we aim to develop a Fast SAG method that can create high-quality and coherent accompaniments. A non-AR diffusion-based framework is developed, which, by carefully designing the conditions inferred from the vocal signals, generates the Mel spectrogram of the target accompaniment directly. With diffusion and Mel spectrogram modeling, the proposed method significantly simplifies the AR token-based SingSong framework and greatly accelerates the generation. We also design semantic projection and prior projection blocks, as well as a set of loss functions, to ensure the generated accompaniment has semantic and rhythm coherence with the vocal signal. Through intensive experimental studies, we demonstrate that the proposed method can generate better samples than SingSong and accelerate the generation by at least 30 times. Audio samples and code are available at https://fastsag.github.io/.
Comment: IJCAI 2024
Published: 2024

23. ComposerX: Multi-Agent Symbolic Music Composition with LLMs

Author: Deng, Qixin, Yang, Qikai, Yuan, Ruibin, Huang, Yipeng, Wang, Yi, Liu, Xubo, Tian, Zeyue, Pan, Jiahao, Zhang, Ge, Lin, Hanfeng, Li, Yizhi, Ma, Yinghao, Fu, Jie, Lin, Chenghua, Benetos, Emmanouil, Wang, Wenwu, Xia, Guangyu, Xue, Wei, and Guo, Yike
Subjects: Computer Science - Sound, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Machine Learning, Computer Science - Multimedia, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Music composition represents the creative side of humanity, and itself is a complex task that requires abilities to understand and generate information with long dependency and harmony constraints. While demonstrating impressive capabilities in STEM subjects, current LLMs easily fail in this task, generating ill-written music even when equipped with modern techniques like In-Context-Learning and Chain-of-Thoughts. To further explore and enhance LLMs' potential in music composition by leveraging their reasoning ability and the large knowledge base in music history and theory, we propose ComposerX, an agent-based symbolic music generation framework. We find that applying a multi-agent approach significantly improves the music composition quality of GPT-4. The results demonstrate that ComposerX is capable of producing coherent polyphonic music compositions with captivating melodies, while adhering to user instructions.
Published: 2024

24. Information Re-Organization Improves Reasoning in Large Language Models

Author: Cheng, Xiaoxia, Tan, Zeqi, Xue, Wei, and Lu, Weiming
Subjects: Computer Science - Computation and Language
Abstract: Improving the reasoning capabilities of large language models (LLMs) has attracted considerable interest. Recent approaches primarily focus on improving the reasoning process to yield a more precise final answer. However, in scenarios involving contextually aware reasoning, these methods neglect the importance of first identifying logical relationships from the context before proceeding with the reasoning. This oversight could lead to a superficial understanding and interaction with the context, potentially undermining the quality and reliability of the reasoning outcomes. In this paper, we propose an information re-organization (InfoRE) method before proceeding with the reasoning to enhance the reasoning ability of LLMs. Our re-organization method involves initially extracting logical relationships from the contextual content, such as documents or paragraphs, and subsequently pruning redundant content to minimize noise. Then, we utilize the re-organized information in the reasoning process. This enables LLMs to deeply understand the contextual content by clearly perceiving these logical relationships, while also ensuring high-quality responses by eliminating potential noise. To demonstrate the effectiveness of our approach in improving the reasoning ability, we conduct experiments using Llama2-70B, GPT-3.5, and GPT-4 on various contextually aware multi-hop reasoning tasks. Using only a zero-shot setting, our method achieves an average absolute improvement of 4% across all tasks, highlighting its potential to improve the reasoning performance of LLMs. Our source code is available at https://github.com/hustcxx/InfoRE.
Comment: 15 pages, 4 figures
Published: 2024

25. FlashSpeech: Efficient Zero-Shot Speech Synthesis

Author: Ye, Zhen, Ju, Zeqian, Liu, Haohe, Tan, Xu, Chen, Jianyi, Lu, Yiwen, Sun, Peiwen, Pan, Jiahao, Bian, Weizhen, He, Shulin, Xue, Wei, Liu, Qifeng, and Guo, Yike
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Machine Learning, Computer Science - Sound
Abstract: Recent progress in large-scale zero-shot speech synthesis has been significantly advanced by language models and diffusion models. However, the generation process of both methods is slow and computationally intensive. Efficient speech synthesis using a lower computing budget to achieve quality on par with previous work remains a significant challenge. In this paper, we present FlashSpeech, a large-scale zero-shot speech synthesis system with approximately 5% of the inference time compared with previous work. FlashSpeech is built on the latent consistency model and applies a novel adversarial consistency training approach that can train from scratch without the need for a pre-trained diffusion model as the teacher. Furthermore, a new prosody generator module enhances the diversity of prosody, making the rhythm of the speech sound more natural. The generation processes of FlashSpeech can be achieved efficiently with one or two sampling steps while maintaining high audio quality and high similarity to the audio prompt for zero-shot speech generation. Our experimental results demonstrate the superior performance of FlashSpeech. Notably, FlashSpeech can be about 20 times faster than other zero-shot speech synthesis systems while maintaining comparable performance in terms of voice quality and similarity. Furthermore, FlashSpeech demonstrates its versatility by efficiently performing tasks like voice conversion, speech editing, and diverse speech sampling. Audio samples can be found at https://flashspeech.github.io/.
Comment: Efficient zero-shot speech synthesis
Published: 2024

26. Kilometer-Level Coupled Modeling Using 40 Million Cores: An Eight-Year Journey of Model Development

Author: Duan, Xiaohui, Li, Yuxuan, Liu, Zhao, Yang, Bin, Zheng, Juepeng, Fu, Haohuan, Zhang, Shaoqing, Xu, Shiming, Gao, Yang, Xue, Wei, Wei, Di, Lv, Xiaojing, Yan, Lifeng, Huang, Haopeng, Lu, Haitian, Wan, Lingfeng, Lin, Haoran, Chang, Qixin, Li, Chenlin, He, Quanjie, Song, Zeyu, Wang, Xuantong, Yu, Yangyang, Fan, Xilong, Qu, Zhaopeng, Xu, Yankun, Guo, Xiuwen, Fei, Yunlong, Wang, Zhaoying, Li, Mingkui, Jiang, Yingjing, Lu, Lv, Su, Liang, Fu, Jiayu, Yu, Peinan, Liu, Weiguo, Wu, Lixin, Wang, Lanning, Liu, Xin, Chen, Dexun, and Yang, Guangwen
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing
Abstract: With current and future leading systems adopting heterogeneous architectures, adapting existing models for heterogeneous supercomputers is urgently needed for improving model resolution and reducing modeling uncertainty. This paper presents our three-week effort on porting a complex earth system model, CESM 2.2, to a 40-million-core Sunway supercomputer. Taking a non-intrusive approach that tries to minimize manual code modifications, our project tries to achieve both improvement of performance and consistency of the model code. By using a hierarchical grid system and an OpenMP-based offloading toolkit, our porting and parallelization effort covers over 80% of the code, and achieves a simulation speed of 340 SDPD (simulated days per day) for 5-km atmosphere, 265 SDPD for 3-km ocean, and 222 SDPD for a coupled model, thus making multi-year or even multi-decadal experiments at such high resolution possible.
Comment: 18 pages, 13 figures
Published: 2024

27. Mixed-Precision Computing in the GRIST Dynamical Core for Weather and Climate Modelling

Author: Chen, Siyuan, Zhang, Yi, Wang, Yiming, Liu, Zhuang, Li, Xiaohan, and Xue, Wei
Subjects: Physics - Atmospheric and Oceanic Physics
Abstract: Atmosphere modelling applications are becoming increasingly memory-bound due to the inconsistent development rates between processor speeds and memory bandwidth. In this study, we mitigate memory bottlenecks and reduce the computational load of the GRIST dynamical core by adopting the mixed-precision computing strategy. Guided by a limited-degree of iterative development principle, we identify the equation terms that are precision insensitive and modify them from double- to single-precision. The results show that most precision-sensitive terms are predominantly linked to pressure-gradient and gravity terms, while most precision-insensitive terms are advective terms. The computational cost is reduced without compromising the solver accuracy. The runtime of the model's hydrostatic solver, non-hydrostatic solver, and tracer transport solver is reduced by 24%, 27%, and 44%, respectively. A series of idealized tests and real-world weather and climate modelling tests has been performed to assess the optimized model performance qualitatively and quantitatively. In particular, in the high-resolution weather forecast simulation, the model sensitivity to the precision level is mainly dominated by the small-scale features, while in long-term climate simulation, the precision-induced sensitivity can form at the large scale.
Comment: 28 pages, 9 figures
Published: 2024

28. MuPT: A Generative Symbolic Music Pretrained Transformer

Author: Qu, Xingwei, Bai, Yuelin, Ma, Yinghao, Zhou, Ziya, Lo, Ka Man, Liu, Jiaheng, Yuan, Ruibin, Min, Lejun, Liu, Xueling, Zhang, Tianyu, Du, Xinrun, Guo, Shuyue, Liang, Yiming, Li, Yizhi, Wu, Shangda, Zhou, Junting, Zheng, Tianyu, Ma, Ziyang, Han, Fengze, Xue, Wei, Xia, Gus, Benetos, Emmanouil, Yue, Xiang, Lin, Chenghua, Tan, Xu, Huang, Stephen W., Fu, Jie, and Zhang, Ge
Subjects: Computer Science - Sound, Computer Science - Artificial Intelligence, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: In this paper, we explore the application of Large Language Models (LLMs) to the pre-training of music. While the prevalent use of MIDI in music modeling is well-established, our findings suggest that LLMs are inherently more compatible with ABC Notation, which aligns more closely with their design and strengths, thereby enhancing the model's performance in musical composition. To address the challenges associated with misaligned measures from different tracks during generation, we propose the development of a Synchronized Multi-Track ABC Notation (SMT-ABC Notation), which aims to preserve coherence across multiple musical tracks. Our contributions include a series of models capable of handling up to 8192 tokens, covering 90% of the symbolic music data in our training set. Furthermore, we explore the implications of the Symbolic Music Scaling Law (SMS Law) on model performance. The results indicate a promising direction for future research in music generation, offering extensive resources for community-led research through our open-source contributions.
Published: 2024

29. RQ-RAG: Learning to Refine Queries for Retrieval Augmented Generation

Author: Chan, Chi-Min, Xu, Chunpu, Yuan, Ruibin, Luo, Hongyin, Xue, Wei, Guo, Yike, and Fu, Jie
Subjects: Computer Science - Computation and Language
Abstract: Large Language Models (LLMs) exhibit remarkable capabilities but are prone to generating inaccurate or hallucinatory responses. This limitation stems from their reliance on vast pretraining datasets, making them susceptible to errors in unseen scenarios. Retrieval-Augmented Generation (RAG) addresses this by incorporating external, relevant documents into the response generation process, thus leveraging non-parametric knowledge alongside LLMs' in-context learning abilities. However, existing RAG implementations primarily focus on initial input for context retrieval, overlooking the nuances of ambiguous or complex queries that necessitate further clarification or decomposition for accurate responses. To this end, we propose learning to Refine Query for Retrieval Augmented Generation (RQ-RAG) in this paper, endeavoring to enhance the model by equipping it with capabilities for explicit rewriting, decomposition, and disambiguation. Our experimental results indicate that our method, when applied to a 7B Llama2 model, surpasses the previous state-of-the-art (SOTA) by an average of 1.9% across three single-hop QA datasets, and also demonstrates enhanced performance in handling complex, multi-hop QA datasets. Our code is available at https://github.com/chanchimin/RQ-RAG.
Published: 2024

30. Towards Automatic Evaluation for LLMs' Clinical Capabilities: Metric, Data, and Algorithm

Author: Liu, Lei, Yang, Xiaoyan, Li, Fangzhou, Chi, Chenfei, Shen, Yue, Lyu, Shiwei, Zhang, Ming, Ma, Xiaowei, Lyu, Xiangguo, Ma, Liya, Zhang, Zhiqiang, Xue, Wei, Huang, Yiran, and Gu, Jinjie
Subjects: Computer Science - Computation and Language
Abstract: Large language models (LLMs) are gaining increasing interest as a way to improve clinical efficiency in medical diagnosis, owing to their unprecedented performance in modelling natural language. To ensure safe and reliable clinical applications, the evaluation of LLMs becomes critical for better mitigating potential risks, e.g., hallucinations. However, current evaluation methods heavily rely on labor-intensive human participation to achieve human-preferred judgements. To overcome this challenge, we propose an automatic evaluation paradigm tailored to assess the LLMs' capabilities in delivering clinical services, e.g., disease diagnosis and treatment. The evaluation paradigm contains three basic elements: metric, data, and algorithm. Specifically, inspired by professional clinical practice pathways, we formulate an LLM-specific clinical pathway (LCP) to define the clinical capabilities that a doctor agent should possess. Then, Standardized Patients (SPs) from medical education are introduced as the guideline for collecting medical data for evaluation, which helps ensure the completeness of the evaluation procedure. Leveraging these steps, we develop a multi-agent framework to simulate the interactive environment between SPs and a doctor agent, which is equipped with a Retrieval-Augmented Evaluation (RAE) to determine whether the behaviors of a doctor agent are in accordance with the LCP. The above paradigm can be extended to any similar clinical scenario to automatically evaluate the LLMs' medical capabilities. Applying this paradigm, we construct an evaluation benchmark in the field of urology, including an LCP, an SP dataset, and an automated RAE. Extensive experiments are conducted to demonstrate the effectiveness of the proposed approach, providing more insights for LLMs' safe and reliable deployment in clinical practice.
Published: 2024

31. ChatMusician: Understanding and Generating Music Intrinsically with LLM

Author: Yuan, Ruibin, Lin, Hanfeng, Wang, Yi, Tian, Zeyue, Wu, Shangda, Shen, Tianhao, Zhang, Ge, Wu, Yuhang, Liu, Cong, Zhou, Ziya, Ma, Ziyang, Xue, Liumeng, Wang, Ziyu, Liu, Qin, Zheng, Tianyu, Li, Yizhi, Ma, Yinghao, Liang, Yiming, Chi, Xiaowei, Liu, Ruibo, Wang, Zili, Li, Pengfei, Wu, Jingcheng, Lin, Chenghua, Liu, Qifeng, Jiang, Tao, Huang, Wenhao, Chen, Wenhu, Benetos, Emmanouil, Fu, Jie, Xia, Gus, Dannenberg, Roger, Xue, Wei, Kang, Shiyin, and Guo, Yike
Subjects: Computer Science - Sound, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Machine Learning, Computer Science - Multimedia, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: While Large Language Models (LLMs) demonstrate impressive capabilities in text generation, we find that their ability has yet to be generalized to music, humanity's creative language. We introduce ChatMusician, an open-source LLM that integrates intrinsic musical abilities. It is based on continual pre-training and finetuning LLaMA2 on a text-compatible music representation, ABC notation, and the music is treated as a second language. ChatMusician can understand and generate music with a pure text tokenizer without any external multi-modal neural structures or tokenizers. Interestingly, endowing musical abilities does not harm language abilities, even achieving a slightly higher MMLU score. Our model is capable of composing well-structured, full-length music, conditioned on texts, chords, melodies, motifs, musical forms, etc., surpassing the GPT-4 baseline. On our meticulously curated college-level music understanding benchmark, MusicTheoryBench, ChatMusician surpasses LLaMA2 and GPT-3.5 in the zero-shot setting by a noticeable margin. Our work reveals that LLMs can be an excellent compressor for music, but there remains significant territory to be conquered. We release our 4B token music-language corpora MusicPile, the collected MusicTheoryBench, code, model and demo on GitHub.
Comment: GitHub: https://shanghaicannon.github.io/ChatMusician/
Published: 2024

32. Ads Recommendation in a Collapsed and Entangled World

Author: Pan, Junwei, Xue, Wei, Wang, Ximei, Yu, Haibin, Liu, Xun, Quan, Shijie, Qiu, Xueming, Liu, Dapeng, Xiao, Lei, and Jiang, Jie
Subjects: Computer Science - Information Retrieval, Computer Science - Machine Learning
Abstract: We present Tencent's ads recommendation system and examine the challenges and practices of learning appropriate recommendation representations. Our study begins by showcasing our approaches to preserving prior knowledge when encoding features of diverse types into embedding representations. We specifically address sequence features, numeric features, and pre-trained embedding features. Subsequently, we delve into two crucial challenges related to feature representation: the dimensional collapse of embeddings and the interest entanglement across different tasks or scenarios. We propose several practical approaches to address these challenges that result in robust and disentangled recommendation representations. We then explore several training techniques to facilitate model optimization, reduce bias, and enhance exploration. Additionally, we introduce three analysis tools that enable us to study feature correlation, dimensional collapse, and interest entanglement. This work builds upon the continuous efforts of Tencent's ads recommendation team over the past decade. It summarizes general design principles and presents a series of readily applicable solutions and analysis tools. The reported performance is based on our online advertising platform, which handles hundreds of billions of requests daily and serves millions of ads to billions of users.
Published: 2024

33. RJUA-MedDQA: A Multimodal Benchmark for Medical Document Question Answering and Clinical Reasoning

Author: Jin, Congyun, Zhang, Ming, Ma, Xiaowei, Yujiao, Li, Wang, Yingbo, Jia, Yabo, Du, Yuliang, Sun, Tao, Wang, Haowen, Fan, Cong, Gu, Jinjie, Chi, Chenfei, Lv, Xiangguo, Li, Fangzhou, Xue, Wei, and Huang, Yiran
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Statistics - Applications
Abstract: Recent advancements in Large Language Models (LLMs) and Large Multi-modal Models (LMMs) have shown potential in various medical applications, such as Intelligent Medical Diagnosis. Although impressive results have been achieved, we find that existing benchmarks do not reflect the complexity of real medical reports and specialized in-depth reasoning capabilities. In this work, we introduce RJUA-MedDQA, a comprehensive benchmark in the field of medical specialization, which poses several challenges: comprehensively interpreting image content across diverse challenging layouts, possessing numerical reasoning ability to identify abnormal indicators, and demonstrating clinical reasoning ability to provide statements of disease diagnosis, status and advice based on medical contexts. We carefully design the data generation pipeline and propose the Efficient Structural Restoration Annotation (ESRA) Method, aimed at restoring textual and tabular content in medical report images. This method substantially enhances annotation efficiency, doubling the productivity of each annotator, and yields a 26.8% improvement in accuracy. We conduct extensive evaluations, including few-shot assessments of 5 LMMs which are capable of solving Chinese medical QA tasks. To further investigate the limitations and potential of current LMMs, we conduct comparative experiments on a set of strong LLMs by using image-text generated by the ESRA method. We report the performance of baselines and offer several observations: (1) The overall performance of existing LMMs is still limited; however, LMMs are more robust to low-quality and diverse-structured images compared to LLMs. (3) Reasoning across context and image content presents significant challenges. We hope this benchmark helps the community make progress on these challenging tasks in multi-modal medical document understanding and facilitates its application in healthcare.
Comment: 15 pages, 13 figures
Published: 2024

34. CoMoSVC: Consistency Model-based Singing Voice Conversion

Author: Lu, Yiwen, Ye, Zhen, Xue, Wei, Tan, Xu, Liu, Qifeng, and Guo, Yike
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Artificial Intelligence, Computer Science - Machine Learning, Computer Science - Sound
Abstract: The diffusion-based Singing Voice Conversion (SVC) methods have achieved remarkable performance, producing natural audio with high similarity to the target timbre. However, the iterative sampling process results in slow inference speed, and acceleration thus becomes crucial. In this paper, we propose CoMoSVC, a consistency model-based SVC method, which aims to achieve both high-quality generation and high-speed sampling. A diffusion-based teacher model is first specially designed for SVC, and a student model is further distilled under self-consistency properties to achieve one-step sampling. Experiments on a single NVIDIA RTX 4090 GPU reveal that although CoMoSVC has a significantly faster inference speed than the state-of-the-art (SOTA) diffusion-based SVC system, it still achieves comparable or superior conversion performance based on both subjective and objective metrics. Audio samples and code are available at https://comosvc.github.io/.
Published: 2024

35. FM-OV3D: Foundation Model-based Cross-modal Knowledge Blending for Open-Vocabulary 3D Detection

Author: Zhang, Dongmei, Li, Chang, Zhang, Ray, Xie, Shenghao, Xue, Wei, Xie, Xiaodong, and Zhang, Shanghang
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: The superior performance of pre-trained foundation models in various visual tasks underscores their potential to enhance the 2D models' open-vocabulary ability. Existing methods explore analogous applications in the 3D space. However, most of them only center around knowledge extraction from singular foundation models, which limits the open-vocabulary ability of 3D models. We hypothesize that leveraging complementary pre-trained knowledge from various foundation models can improve knowledge transfer from 2D pre-trained visual language models to the 3D space. In this work, we propose FM-OV3D, a method of Foundation Model-based Cross-modal Knowledge Blending for Open-Vocabulary 3D Detection, which improves the open-vocabulary localization and recognition abilities of the 3D model by blending knowledge from multiple pre-trained foundation models, achieving true open-vocabulary without facing constraints from original 3D datasets. Specifically, to learn the open-vocabulary 3D localization ability, we adopt the open-vocabulary localization knowledge of the Grounded-Segment-Anything model. For open-vocabulary 3D recognition ability, we leverage the knowledge of generative foundation models, including GPT-3 and Stable Diffusion models, and cross-modal discriminative models like CLIP. The experimental results on two popular benchmarks for open-vocabulary 3D object detection show that our model efficiently learns knowledge from multiple foundation models to enhance the open-vocabulary ability of the 3D model and successfully achieves state-of-the-art performance in open-vocabulary 3D object detection tasks. Code is released at https://github.com/dmzhang0425/FM-OV3D.git.
Comment: Accepted by AAAI 2024. Code will be released at https://github.com/dmzhang0425/FM-OV3D.git
Published: 2023

36. FengWu-4DVar: Coupling the Data-driven Weather Forecasting Model with 4D Variational Assimilation

Author: Xiao, Yi, Bai, Lei, Xue, Wei, Chen, Kang, Han, Tao, and Ouyang, Wanli
Subjects: Physics - Atmospheric and Oceanic Physics, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: Weather forecasting is a crucial yet highly challenging task. With the maturity of Artificial Intelligence (AI), the emergence of data-driven weather forecasting models has opened up a new paradigm for the development of weather forecasting systems. Despite the significant successes that have been achieved (e.g., surpassing advanced traditional physical models for global medium-range forecasting), existing data-driven weather forecasting models still rely on the analysis fields generated by the traditional assimilation and forecasting system, which hampers the significance of data-driven weather forecasting models regarding both computational cost and forecasting accuracy. In this work, we explore the possibility of coupling the data-driven weather forecasting model with data assimilation by integrating the global AI weather forecasting model, FengWu, with one of the most popular assimilation algorithms, Four-Dimensional Variational (4DVar) assimilation, and develop an AI-based cyclic weather forecasting system, FengWu-4DVar. FengWu-4DVar can incorporate observational data into the data-driven weather forecasting model and consider the temporal evolution of atmospheric dynamics to obtain accurate analysis fields for making predictions in a cycling manner without the help of physical models. Owing to the auto-differentiation ability of deep learning models, FengWu-4DVar eliminates the need to develop the cumbersome adjoint model, which is usually required in the traditional implementation of the 4DVar algorithm. Experiments on the simulated observational dataset demonstrate that FengWu-4DVar is capable of generating reasonable analysis fields for making accurate and efficient iterative predictions.
Comment: 15 pages, 8 figures
Published: 2023

37. RJUA-QA: A Comprehensive QA Dataset for Urology

Author: Lyu, Shiwei, Chi, Chenfei, Cai, Hongbo, Shi, Lei, Yang, Xiaoyan, Liu, Lei, Chen, Xiang, Zhao, Deng, Zhang, Zhiqiang, Lyu, Xianguo, Zhang, Ming, Li, Fangzhou, Ma, Xiaowei, Shen, Yue, Gu, Jinjie, Xue, Wei, and Huang, Yiran
Subjects: Computer Science - Computation and Language
Abstract: We introduce RJUA-QA, a novel medical dataset for question answering (QA) and reasoning with clinical evidence, helping to bridge the gap between general large language models (LLMs) and medical-specific LLM applications. RJUA-QA is derived from realistic clinical scenarios and aims to help LLMs generate reliable diagnoses and advice. The dataset contains 2,132 curated Question-Context-Answer pairs, corresponding to about 25,000 diagnostic records and clinical cases. The dataset covers 67 common urological disease categories, where the disease coverage exceeds 97.6% of the population seeking medical services in urology. Each data instance in RJUA-QA comprises: (1) a question mirroring a real patient's inquiry about clinical symptoms and medical conditions, (2) a context including comprehensive expert knowledge, serving as a reference for medical examination and diagnosis, (3) a doctor response offering the diagnostic conclusion and suggested examination guidance, (4) a diagnosed clinical disease as the recommended diagnostic outcome, and (5) clinical advice providing recommendations for medical examination. RJUA-QA is the first medical QA dataset for clinical reasoning over patient inquiries, where expert-level knowledge and experience are required for yielding diagnostic conclusions and medical examination advice. A comprehensive evaluation is conducted to assess the performance of both medical-specific and general LLMs on the RJUA-QA dataset. Our data are publicly available at https://github.com/alipay/RJU_Ant_QA., Comment: An initial version
Published: 2023

38. Weakly-Supervised Emotion Transition Learning for Diverse 3D Co-speech Gesture Generation

Author: Qi, Xingqun, Pan, Jiahao, Li, Peng, Yuan, Ruibin, Chi, Xiaowei, Li, Mengfei, Luo, Wenhan, Xue, Wei, Zhang, Shanghang, Liu, Qifeng, and Guo, Yike
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Generating vivid and emotional 3D co-speech gestures is crucial for virtual avatar animation in human-machine interaction applications. While the existing methods enable generating the gestures to follow a single emotion label, they overlook that long gesture sequence modeling with emotion transition is more practical in real scenes. In addition, the lack of large-scale available datasets with emotional transition speech and corresponding 3D human gestures also limits the addressing of this task. To fulfill this goal, we first incorporate the ChatGPT-4 and an audio inpainting approach to construct the high-fidelity emotion transition human speeches. Considering obtaining the realistic 3D pose annotations corresponding to the dynamically inpainted emotion transition audio is extremely difficult, we propose a novel weakly supervised training strategy to encourage authority gesture transitions. Specifically, to enhance the coordination of transition gestures w.r.t different emotional ones, we model the temporal association representation between two different emotional gesture sequences as style guidance and infuse it into the transition generation. We further devise an emotion mixture mechanism that provides weak supervision based on a learnable mixed emotion label for transition gestures. Last, we present a keyframe sampler to supply effective initial posture cues in long sequences, enabling us to generate diverse gestures. Extensive experiments demonstrate that our method outperforms the state-of-the-art models constructed by adapting single emotion-conditioned counterparts on our newly defined emotion transition task and datasets. Our code and dataset will be released on the project page: https://xingqunqi-lab.github.io/Emo-Transition-Gesture/., Comment: Accepted by CVPR 2024
Published: 2023

39. Gauged Global Strings

Author: Niu, Xuce, Xue, Wei, and Yang, Fengwei
Subjects: High Energy Physics - Phenomenology, Astrophysics - Cosmology and Nongalactic Astrophysics, High Energy Physics - Theory
Abstract: We investigate the string solutions and cosmological implications of the gauge ${\rm U(1)_Z}\,\times$ global ${\rm U(1)_{PQ}}$ model. With two hierarchical symmetry-breaking scales, the model exhibits three distinct string solutions: a conventional global string, a global string with a heavy core, and a gauge string as a bound state of the two global strings. This model reveals rich phenomenological implications in cosmology. During the evolution of the universe, these three types of strings can form a Y-junction configuration. Intriguingly, when incorporating this model with the QCD axion framework, the heavy-core global strings emit more axion particles compared to conventional axion cosmic strings due to their higher tension. This radiation significantly enhances the QCD axion dark matter abundance, thereby opening up the QCD axion mass window. Consequently, axions with masses exceeding $\sim 10^{-5}\, {\rm eV}$ have the potential to constitute the whole dark matter abundance. Furthermore, in contrast to conventional gauge strings, the gauge strings in this model exhibit a distinctive behavior by radiating axions., Comment: 36 pages, 8 figures
Published: 2023

40. Effective Action Approach for Preheating

Author: Xu, Bin and Xue, Wei
Subjects: High Energy Physics - Theory, Astrophysics - Cosmology and Nongalactic Astrophysics, High Energy Physics - Phenomenology
Abstract: We present a semiclassical non-perturbative approach for calculating the preheating process at the end of inflation. Our method involves integrating out the decayed particles within the path integral framework and subsequently determining world-line instanton solutions in the effective action. This enables us to obtain the effective action of the inflaton, with its imaginary part linked to the phenomenon of particle creation driven by coherent inflaton field oscillations. Additionally, we utilize the Bogoliubov transformation to investigate the evolution of particle density within the medium after multiple inflaton oscillations. We apply our approach to various final state particles, including scalar fields, tachyonic fields, and gauge fields. The non-perturbative approach provides analytical results for preheating that are in accord with previous methods., Comment: 22 pages
Published: 2023

41. Continual Learning with Dirichlet Generative-based Rehearsal

Author: Zeng, Min, Xue, Wei, Liu, Qifeng, and Guo, Yike
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: Recent advancements in data-driven task-oriented dialogue systems (ToDs) struggle with incremental learning due to computational constraints and time-consuming issues. Continual Learning (CL) attempts to solve this by avoiding intensive pre-training, but it faces the problem of catastrophic forgetting (CF). While generative-based rehearsal CL methods have made significant strides, generating pseudo samples that accurately reflect the underlying task-specific distribution is still a challenge. In this paper, we present Dirichlet Continual Learning (DCL), a novel generative-based rehearsal strategy for CL. Unlike the traditionally used Gaussian latent variable in the Conditional Variational Autoencoder (CVAE), DCL leverages the flexibility and versatility of the Dirichlet distribution to model the latent prior variable. This enables it to efficiently capture sentence-level features of previous tasks and effectively guide the generation of pseudo samples. In addition, we introduce Jensen-Shannon Knowledge Distillation (JSKD), a robust logit-based knowledge distillation method that enhances knowledge transfer during pseudo sample generation. Our experiments confirm the efficacy of our approach in both intent detection and slot-filling tasks, outperforming state-of-the-art methods.
Published: 2023

42. O2ATH: An OpenMP Offloading Toolkit for the Sunway Heterogeneous Manycore Platform

Author: Lin, Haoran, Yan, Lifeng, Chang, Qixin, Lu, Haitian, Li, Chenlin, He, Quanjie, Song, Zeyu, Duan, Xiaohui, Yin, Zekun, Li, Yuxuan, Liu, Zhao, Xue, Wei, Fu, Haohuan, Gan, Lin, Yang, Guangwen, and Liu, Weiguo
Subjects: Computer Science - Programming Languages, Computer Science - Software Engineering
Abstract: The next generation Sunway supercomputer employs the SW26010pro processor, which features a specialized on-chip heterogeneous architecture. Applications with significant hotspots can benefit from the great computation capacity improvement of Sunway many-core architectures by carefully making intensive manual many-core parallelization efforts. However, some legacy projects with large codebases, such as CESM, ROMS and WRF, contain numerous lines of code and do not have significant hotspots. The cost of manually porting such applications to the Sunway architecture is almost unaffordable. To overcome such a challenge, we have developed a toolkit named O2ATH. O2ATH forwards GNU OpenMP runtime library calls to Sunway's Athread library, which greatly simplifies the parallelization work on the Sunway architecture. O2ATH enables users to write both MPE and CPE code in a single file, and parallelization can be achieved by utilizing OpenMP directives and attributes. In practice, O2ATH has helped us to port two large projects, CESM and ROMS, to the CPEs of the next generation Sunway supercomputers via the OpenMP offload method. In the experiments, kernel speedups range from 3 to 15 times, resulting in 3 to 6 times whole application speedups. Furthermore, O2ATH requires significantly fewer code modifications compared to manually crafting CPE functions. This indicates that O2ATH can greatly enhance development efficiency when porting or optimizing large software projects on Sunway supercomputers., Comment: 15 pages, 6 figures, 5 tables
Published: 2023

43. PUMGPT: A Large Vision-Language Model for Product Understanding

Author: Xue, Wei, Guo, Zongyi, Cui, Baoliang, Xing, Zheng, Zeng, Xiaoyi, Wang, Xiufei, Wu, Shuhui, and Lu, Weiming
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computation and Language
Abstract: E-commerce platforms benefit from accurate product understanding to enhance user experience and operational efficiency. Traditional methods often focus on isolated tasks such as attribute extraction or categorization, posing adaptability issues to evolving tasks and leading to usability challenges with noisy data from the internet. Current Large Vision Language Models (LVLMs) lack domain-specific fine-tuning, thus falling short in precision and instruction following. To address these issues, we introduce PumGPT, the first e-commerce specialized LVLM designed for multi-modal product understanding tasks. We collected and curated a dataset of over one million products from AliExpress, filtering out non-inferable attributes using a universal hallucination detection framework, resulting in 663k high-quality data samples. PumGPT focuses on five essential tasks aimed at enhancing workflows for e-commerce platforms and retailers. We also introduce PumBench, a benchmark to evaluate product understanding across LVLMs. Our experiments show that PumGPT outperforms five other open-source LVLMs and GPT-4V in product understanding tasks. We also conduct extensive analytical experiments to delve deeply into the superiority of PumGPT, demonstrating the necessity for a specialized model in the e-commerce domain.
Published: 2023

44. ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate

Author: Chan, Chi-Min, Chen, Weize, Su, Yusheng, Yu, Jianxuan, Xue, Wei, Zhang, Shanghang, Fu, Jie, and Liu, Zhiyuan
Subjects: Computer Science - Computation and Language
Abstract: Text evaluation has historically posed significant challenges, often demanding substantial labor and time cost. With the emergence of large language models (LLMs), researchers have explored LLMs' potential as alternatives for human evaluation. While these single-agent-based approaches show promise, experimental results suggest that further advancements are needed to bridge the gap between their current effectiveness and human-level evaluation quality. Recognizing that best practices of human evaluation processes often involve multiple human annotators collaborating in the evaluation, we resort to a multi-agent debate framework, moving beyond single-agent prompting strategies. The multi-agent-based approach enables a group of LLMs to synergize with an array of intelligent counterparts, harnessing their distinct capabilities and expertise to enhance efficiency and effectiveness in handling intricate tasks. In this paper, we construct a multi-agent referee team called ChatEval to autonomously discuss and evaluate the quality of generated responses from different models on open-ended questions and traditional natural language generation (NLG) tasks. Our analysis shows that ChatEval transcends mere textual scoring, offering a human-mimicking evaluation process for reliable assessments. Our code is available at https://github.com/chanchimin/ChatEval.
Published: 2023

45. LyricWhiz: Robust Multilingual Zero-shot Lyrics Transcription by Whispering to ChatGPT

Author: Zhuo, Le, Yuan, Ruibin, Pan, Jiahao, Ma, Yinghao, Li, Yizhi, Zhang, Ge, Liu, Si, Dannenberg, Roger, Fu, Jie, Lin, Chenghua, Benetos, Emmanouil, Xue, Wei, and Guo, Yike
Subjects: Computer Science - Computation and Language, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: We introduce LyricWhiz, a robust, multilingual, and zero-shot automatic lyrics transcription method achieving state-of-the-art performance on various lyrics transcription datasets, even in challenging genres such as rock and metal. Our novel, training-free approach utilizes Whisper, a weakly supervised robust speech recognition model, and GPT-4, today's most performant chat-based large language model. In the proposed method, Whisper functions as the "ear" by transcribing the audio, while GPT-4 serves as the "brain," acting as an annotator with a strong performance for contextualized output selection and correction. Our experiments show that LyricWhiz significantly reduces Word Error Rate compared to existing methods in English and can effectively transcribe lyrics across multiple languages. Furthermore, we use LyricWhiz to create the first publicly available, large-scale, multilingual lyrics transcription dataset with a CC-BY-NC-SA copyright license, based on MTG-Jamendo, and offer a human-annotated subset for noise level estimation and evaluation. We anticipate that our proposed method and dataset will advance the development of multilingual lyrics transcription, a challenging and emerging task., Comment: 9 pages, 2 figures, 5 tables, accepted by ISMIR 2023
Published: 2023

46. MARBLE: Music Audio Representation Benchmark for Universal Evaluation

Author: Yuan, Ruibin, Ma, Yinghao, Li, Yizhi, Zhang, Ge, Chen, Xingran, Yin, Hanzhi, Zhuo, Le, Liu, Yiqi, Huang, Jiawen, Tian, Zeyue, Deng, Binyue, Wang, Ningzhi, Lin, Chenghua, Benetos, Emmanouil, Ragni, Anton, Gyenge, Norbert, Dannenberg, Roger, Chen, Wenhu, Xia, Gus, Xue, Wei, Liu, Si, Wang, Shi, Liu, Ruibo, Guo, Yike, and Fu, Jie
Subjects: Computer Science - Sound, Computer Science - Artificial Intelligence, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: In the era of extensive intersection between art and Artificial Intelligence (AI), such as image generation and fiction co-creation, AI for music remains relatively nascent, particularly in music understanding. This is evident in the limited work on deep music representations, the scarcity of large-scale datasets, and the absence of a universal and community-driven benchmark. To address this issue, we introduce the Music Audio Representation Benchmark for universaL Evaluation, termed MARBLE. It aims to provide a benchmark for various Music Information Retrieval (MIR) tasks by defining a comprehensive taxonomy with four hierarchy levels, including acoustic, performance, score, and high-level description. We then establish a unified protocol based on 14 tasks on 8 publicly available datasets, providing a fair and standard assessment of representations of all open-sourced pre-trained models developed on music recordings as baselines. In addition, MARBLE offers an easy-to-use, extendable, and reproducible suite for the community, with a clear statement on copyright issues on datasets. Results suggest recently proposed large-scale pre-trained musical language models perform the best in most tasks, with room for further improvement. The leaderboard and toolkit repository are published at https://marble-bm.shef.ac.uk to promote future music AI research., Comment: camera-ready version for NeurIPS 2023
Published: 2023

47. ViDA: Homeostatic Visual Domain Adapter for Continual Test Time Adaptation

Author: Liu, Jiaming, Yang, Senqiao, Jia, Peidong, Zhang, Renrui, Lu, Ming, Guo, Yandong, Xue, Wei, and Zhang, Shanghang
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Since real-world machine systems run in non-stationary environments, the Continual Test-Time Adaptation (CTTA) task has been proposed to adapt the pre-trained model to continually changing target domains. Existing methods mainly focus on model-based adaptation, which aims to leverage a self-training manner to extract the target domain knowledge. However, pseudo labels can be noisy and the updated model parameters are unreliable under dynamic data distributions, leading to error accumulation and catastrophic forgetting in the continual adaptation process. To tackle these challenges and maintain the model plasticity, we design a Visual Domain Adapter (ViDA) for CTTA, explicitly handling both domain-specific and domain-shared knowledge. Specifically, we first comprehensively explore the different domain representations of the adapters with trainable high-rank or low-rank embedding spaces. Then we inject ViDAs into the pre-trained model, which leverages high-rank and low-rank features to adapt the current domain distribution and maintain the continual domain-shared knowledge, respectively. To exploit the low-rank and high-rank ViDAs more effectively, we further propose a Homeostatic Knowledge Allotment (HKA) strategy, which adaptively combines different knowledge from each ViDA. Extensive experiments conducted on four widely used benchmarks demonstrate that our proposed method achieves state-of-the-art performance in both classification and segmentation CTTA tasks. Note that our method can be regarded as a novel transfer paradigm for large-scale models, delivering promising results in adaptation to continually changing distributions. Project page: https://sites.google.com/view/iclr2024-vida/home., Comment: Accepted by ICLR2024
Published: 2023

48. NAS-FM: Neural Architecture Search for Tunable and Interpretable Sound Synthesis based on Frequency Modulation

Author: Ye, Zhen, Xue, Wei, Tan, Xu, Liu, Qifeng, and Guo, Yike
Subjects: Computer Science - Sound, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Machine Learning, Computer Science - Multimedia, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Developing digital sound synthesizers is crucial to the music industry as it provides a low-cost way to produce high-quality sounds with rich timbres. Existing traditional synthesizers often require substantial expertise to determine the overall framework of a synthesizer and the parameters of submodules. Since expert knowledge is hard to acquire, it hinders the flexibility to quickly design and tune digital synthesizers for diverse sounds. In this paper, we propose "NAS-FM", which adopts neural architecture search (NAS) to build a differentiable frequency modulation (FM) synthesizer. Tunable synthesizers with interpretable controls can be developed automatically from sounds without any prior expert knowledge and manual operating costs. In detail, we train a supernet with a specifically designed search space, including predicting the envelopes of carriers and modulators with different frequency ratios. An evolutionary search algorithm with adaptive oscillator size is then developed to find the optimal relationship between oscillators and the frequency ratio of FM. Extensive experiments on recordings of different instrument sounds show that our algorithm can build a synthesizer fully automatically, achieving better results than handcrafted synthesizers. Audio samples are available at https://nas-fm.github.io/.
Published: 2023

49. Insert or Attach: Taxonomy Completion via Box Embedding

Author: Xue, Wei, Shen, Yongliang, Ren, Wenqi, Guo, Jietian, Pu, Shiliang, and Lu, Weiming
Subjects: Computer Science - Computation and Language
Abstract: Taxonomy completion, enriching existing taxonomies by inserting new concepts as parents or attaching them as children, has gained significant interest. Previous approaches embed concepts as vectors in Euclidean space, which makes it difficult to model asymmetric relations in taxonomy. In addition, they introduce pseudo-leaves to convert attachment cases into insertion cases, leading to an incorrect bias in network learning dominated by numerous pseudo-leaves. To address these issues, our framework, TaxBox, leverages box containment and center closeness to design two specialized geometric scorers within the box embedding space. These scorers are tailored for insertion and attachment operations and can effectively capture intrinsic relationships between concepts by optimizing on a granular box constraint loss. We employ a dynamic ranking loss mechanism to balance the scores from these scorers, allowing adaptive adjustments of insertion and attachment scores. Experiments on four real-world datasets show that TaxBox significantly outperforms previous methods, with average performance boosts of 6.7%, 34.9%, and 51.4% in MRR, Hit@1, and Prec@1, respectively.
Published: 2023

50. CoMoSpeech: One-Step Speech and Singing Voice Synthesis via Consistency Model

Author: Ye, Zhen, Xue, Wei, Tan, Xu, Chen, Jie, Liu, Qifeng, and Guo, Yike
Subjects: Computer Science - Sound, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Machine Learning, Computer Science - Multimedia, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Denoising diffusion probabilistic models (DDPMs) have shown promising performance for speech synthesis. However, a large number of iterative steps are required to achieve high sample quality, which restricts the inference speed. Maintaining sample quality while increasing sampling speed has become a challenging task. In this paper, we propose a "Co"nsistency "Mo"del-based "Speech" synthesis method, CoMoSpeech, which achieves speech synthesis through a single diffusion sampling step while maintaining high audio quality. The consistency constraint is applied to distill a consistency model from a well-designed diffusion-based teacher model, which ultimately yields superior performance in the distilled CoMoSpeech. Our experiments show that, by generating audio recordings in a single sampling step, CoMoSpeech achieves an inference speed more than 150 times faster than real-time on a single NVIDIA A100 GPU, which is comparable to FastSpeech2, making diffusion-sampling based speech synthesis truly practical. Meanwhile, objective and subjective evaluations on text-to-speech and singing voice synthesis show that the proposed teacher models yield the best audio quality, and the one-step sampling based CoMoSpeech achieves the best inference speed with better or comparable audio quality to other conventional multi-step diffusion model baselines. Audio samples are available at https://comospeech.github.io/., Comment: Accepted to ACM MM 2023
Published: 2023
