1,527 results for "Wang, Zhongyuan"
Search Results
2. Intense Monitoring and Bureaucratic Support for an Anti-poverty Campaign in China
- Author
- Zeng, Qingjie, Zuo, Cai, and Wang, Zhongyuan
- Published
- 2022
3. 52B to 1T: Lessons Learned via Tele-FLM Series
- Author
- Li, Xiang, Yao, Yiqun, Jiang, Xin, Fang, Xuezhi, Wang, Chao, Liu, Xinzhang, Wang, Zihan, Zhao, Yu, Wang, Xin, Huang, Yuyao, Song, Shuangyong, Li, Yongxiang, Zhang, Zheng, Zhao, Bo, Sun, Aixin, Wang, Yequan, He, Zhongjiang, Wang, Zhongyuan, Li, Xuelong, and Huang, Tiejun
- Subjects
- Computer Science - Computation and Language, Computer Science - Artificial Intelligence
- Abstract
- Large Language Models (LLMs) represent a significant stride toward Artificial General Intelligence. As scaling laws underscore the potential of increasing model sizes, the academic community has intensified its investigations into LLMs with capacities exceeding 50 billion parameters. This technical report builds on our prior work with Tele-FLM (also known as FLM-2), a publicly available 52-billion-parameter model. We delve into two primary areas: we first discuss our observation of Supervised Fine-tuning (SFT) on Tele-FLM-52B, which supports the "less is more" approach for SFT data construction; second, we demonstrate our experiments and analyses on the best practices for progressively growing a model from 52 billion to 102 billion, and subsequently to 1 trillion parameters. We will open-source a 1T model checkpoint, namely Tele-FLM-1T, to advance further training and research.
- Comment: For the Tele-FLM-52B tech report, see also arXiv:2404.16645
- Published
- 2024
4. GUIDE: A Guideline-Guided Dataset for Instructional Video Comprehension
- Author
- Liang, Jiafeng, Jiang, Shixin, Wang, Zekun, Pan, Haojie, Chen, Zerui, Chu, Zheng, Liu, Ming, Fu, Ruiji, Wang, Zhongyuan, and Qin, Bing
- Subjects
- Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computation and Language
- Abstract
- There are substantial instructional videos on the Internet, which provide tutorials for completing various tasks. Existing instructional video datasets only focus on specific steps at the video level, lacking experiential guidelines at the task level, which can leave beginners struggling to learn new tasks due to the lack of relevant experience. Moreover, specific steps without guidelines are trivial and unsystematic, making it difficult to provide a clear tutorial. To address these problems, we present the GUIDE (Guideline-Guided) dataset, which contains 3.5K videos of 560 instructional tasks in 8 domains related to our daily life. Specifically, we annotate each instructional task with a guideline, representing a common pattern shared by all task-related videos. On this basis, we annotate systematic specific steps, including their associated guideline steps, specific step descriptions, and timestamps. Our proposed benchmark consists of three sub-tasks to evaluate the comprehension ability of models: (1) Step Captioning: models have to generate captions for specific steps from videos. (2) Guideline Summarization: models have to mine the common pattern in task-related videos and summarize a guideline from them. (3) Guideline-Guided Captioning: models have to generate captions for specific steps under the guidance of the guideline. We evaluate a variety of foundation models on GUIDE and perform in-depth analysis. Given the diversity and practicality of GUIDE, we believe that it can serve as a better benchmark for instructional video comprehension.
- Comment: IJCAI 2024
- Published
- 2024
5. Decoding at the Speed of Thought: Harnessing Parallel Decoding of Lexical Units for LLMs
- Author
- Sun, Chenxi, Zhang, Hongzhi, Lin, Zijia, Zhang, Jingyuan, Zhang, Fuzheng, Wang, Zhongyuan, Chen, Bin, Song, Chengru, Zhang, Di, Gai, Kun, and Xiong, Deyi
- Subjects
- Computer Science - Computation and Language, Computer Science - Artificial Intelligence
- Abstract
- Large language models have demonstrated exceptional capability in natural language understanding and generation. However, their generation speed is limited by the inherently sequential nature of their decoding process, posing challenges for real-time applications. This paper introduces Lexical Unit Decoding (LUD), a novel decoding methodology implemented in a data-driven manner, accelerating the decoding process without sacrificing output quality. The core of our approach is the observation that a pre-trained language model can confidently predict multiple contiguous tokens, which form the basis of a lexical unit, and these contiguous tokens can be decoded in parallel. Extensive experiments validate that our method substantially reduces decoding time while maintaining generation quality, i.e., a 33% speed-up on natural language generation with no quality loss, and a 30% speed-up on code generation with a negligible quality loss of 3%. Distinctively, LUD requires no auxiliary models and does not require changes to existing architectures. It can also be integrated with other decoding acceleration methods, thus achieving an even more pronounced inference efficiency boost. We posit that the foundational principles of LUD could define a new decoding paradigm for future language models, enhancing their applicability for a broader spectrum of applications. All code is publicly available at https://github.com/tjunlp-lab/Lexical-Unit-Decoding-LUD-.
- Keywords: Parallel Decoding, Lexical Unit Decoding, Large Language Model
- Comment: Accepted for publication at LREC-COLING 2024
- Published
- 2024
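
The acceptance rule behind LUD (entry 5) can be pictured in a few lines. The following is a minimal sketch of the idea as the abstract states it, not the paper's implementation: the draft_step interface and the 0.9 confidence threshold are assumptions made for the example.

    from typing import Callable, List, Tuple

    def lud_decode(
        draft_step: Callable[[List[int]], List[Tuple[int, float]]],
        prompt: List[int],
        eos_id: int,
        max_len: int = 64,
        threshold: float = 0.9,
    ) -> List[int]:
        """Accept a run of confidently predicted tokens (a "lexical unit") per step.

        `draft_step(tokens)` is a hypothetical interface returning (token_id, prob)
        drafts for the next few positions. The first draft token is always accepted,
        so the worst case degrades to ordinary one-token-at-a-time decoding.
        """
        out = list(prompt)
        while len(out) < max_len:
            drafts = draft_step(out)
            if not drafts:
                break
            for i, (tok, prob) in enumerate(drafts):
                if i > 0 and prob < threshold:
                    break  # the lexical unit ends before the first uncertain token
                out.append(tok)
                if tok == eos_id:
                    return out
        return out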
6. SG-Adapter: Enhancing Text-to-Image Generation with Scene Graph Guidance
- Author
- Shen, Guibao, Wang, Luozhou, Lin, Jiantao, Ge, Wenhang, Zhang, Chaozhe, Tao, Xin, Zhang, Yuan, Wan, Pengfei, Wang, Zhongyuan, Chen, Guangyong, Li, Yijun, and Chen, Ying-Cong
- Subjects
- Computer Science - Computer Vision and Pattern Recognition
- Abstract
- Recent advancements in text-to-image generation have been propelled by the development of diffusion models and multi-modality learning. However, since text is typically represented sequentially in these models, it often falls short in providing accurate contextualization and structural control. As a result, the generated images do not consistently align with human expectations, especially in complex scenarios involving multiple objects and relationships. In this paper, we introduce the Scene Graph Adapter (SG-Adapter), leveraging the structured representation of scene graphs to rectify inaccuracies in the original text embeddings. The SG-Adapter's explicit and non-fully connected graph representation greatly improves the fully connected, transformer-based text representations. This enhancement is particularly notable in maintaining precise correspondence in scenarios involving multiple relationships. To address the challenges posed by low-quality annotated datasets like Visual Genome, we have manually curated a highly clean, multi-relational scene graph-image paired dataset, MultiRels. Furthermore, we design three metrics derived from GPT-4V to effectively and thoroughly measure the correspondence between images and scene graphs. Both qualitative and quantitative results validate the efficacy of our approach in controlling the correspondence in multiple relationships.
- Published
- 2024
7. Learning Multi-dimensional Human Preference for Text-to-Image Generation
- Author
- Zhang, Sixian, Wang, Bohan, Wu, Junqiang, Li, Yan, Gao, Tingting, Zhang, Di, and Wang, Zhongyuan
- Subjects
- Computer Science - Computer Vision and Pattern Recognition
- Abstract
- Current metrics for text-to-image models typically rely on statistical measures which inadequately represent the real preference of humans. Although recent works attempt to learn these preferences via human-annotated images, they reduce the rich tapestry of human preference to a single overall score. However, preference results vary when humans evaluate images along different aspects. Therefore, to learn multi-dimensional human preferences, we propose the Multi-dimensional Preference Score (MPS), the first multi-dimensional preference scoring model for the evaluation of text-to-image models. The MPS introduces a preference condition module upon the CLIP model to learn these diverse preferences. It is trained on our Multi-dimensional Human Preference (MHP) Dataset, which comprises 918,315 human preference choices across four dimensions (i.e., aesthetics, semantic alignment, detail quality, and overall assessment) on 607,541 images. The images are generated by a wide range of the latest text-to-image models. The MPS outperforms existing scoring methods across 3 datasets in 4 dimensions, making it a promising metric for evaluating and improving text-to-image generation.
- Published
- 2024
8. Tele-FLM Technical Report
- Author
- Li, Xiang, Yao, Yiqun, Jiang, Xin, Fang, Xuezhi, Wang, Chao, Liu, Xinzhang, Wang, Zihan, Zhao, Yu, Wang, Xin, Huang, Yuyao, Song, Shuangyong, Li, Yongxiang, Zhang, Zheng, Zhao, Bo, Sun, Aixin, Wang, Yequan, He, Zhongjiang, Wang, Zhongyuan, Li, Xuelong, and Huang, Tiejun
- Subjects
- Computer Science - Computation and Language, Computer Science - Artificial Intelligence
- Abstract
- Large language models (LLMs) have showcased profound capabilities in language understanding and generation, facilitating a wide array of applications. However, there is a notable paucity of detailed, open-sourced methodologies on efficiently scaling LLMs beyond 50 billion parameters with minimal trial-and-error cost and computational resources. In this report, we introduce Tele-FLM (aka FLM-2), a 52B open-sourced multilingual large language model that features a stable, efficient pre-training paradigm and enhanced factual judgment capabilities. Tele-FLM demonstrates superior multilingual language modeling abilities, measured by BPB on textual corpora. Moreover, in both English and Chinese foundation model evaluations, it is comparable to strong open-sourced models that involve larger pre-training FLOPs, such as Llama2-70B and DeepSeek-67B. In addition to the model weights, we share the core designs, engineering practices, and training details, which we expect to benefit both the academic and industrial communities.
- Published
- 2024
9. End-to-end training of Multimodal Model and ranking Model
- Author
- Deng, Xiuqi, Xu, Lu, Li, Xiyao, Yu, Jinkai, Xue, Erpeng, Wang, Zhongyuan, Zhang, Di, Liu, Zhaojie, Zhou, Guorui, Song, Yang, Mou, Na, Jiang, Shen, and Li, Han
- Subjects
- Computer Science - Information Retrieval
- Abstract
- Traditional recommender systems heavily rely on ID features, which often encounter challenges related to cold-start and generalization. Modeling pre-extracted content features can mitigate these issues, but it is still a suboptimal solution due to the discrepancies between training tasks and model parameters. End-to-end training presents a promising solution to these problems, yet most existing works focus mainly on retrieval models, leaving multimodal techniques under-utilized. In this paper, we propose an industrial multimodal recommendation framework named EM3: End-to-end training of Multimodal Model and ranking Model, which sufficiently utilizes multimodal information and allows personalized ranking tasks to directly train the core modules in the multimodal model to obtain more task-oriented content features, without excessive resource consumption. First, we propose Fusion-Q-Former, which consists of transformers and a set of trainable queries, to fuse different modalities and generate fixed-length and robust multimodal embeddings. Second, in our sequential modeling of user content interest, we utilize the Low-Rank Adaptation technique to alleviate the conflict between huge resource consumption and long sequence length. Third, we propose a novel Content-ID-Contrastive learning task to complement the advantages of content and ID by aligning them with each other, obtaining more task-oriented content embeddings and more generalized ID embeddings. In experiments, we implement EM3 on different ranking models in two scenarios, achieving significant improvements in both offline evaluation and online A/B tests, verifying the generalizability of our method. Ablation studies and visualizations are also performed. Furthermore, we conduct experiments on two public datasets to show that our proposed method outperforms state-of-the-art methods.
- Comment: 9 pages, 8 figures
- Published
- 2024
10. Not All Layers of LLMs Are Necessary During Inference
- Author
- Fan, Siqi, Jiang, Xin, Li, Xiang, Meng, Xuying, Han, Peng, Shang, Shuo, Sun, Aixin, Wang, Yequan, and Wang, Zhongyuan
- Subjects
- Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
- Abstract
- Due to the large number of parameters, the inference phase of Large Language Models (LLMs) is resource-intensive. However, not all requests posed to LLMs are equally difficult to handle. Through analysis, we show that for some tasks, LLMs can achieve results comparable to the final output at some intermediate layers. That is, not all layers of LLMs are necessary during inference. If we can predict at which layer the inferred results match the final results (produced by evaluating all layers), we could significantly reduce the inference cost. To this end, we propose a simple yet effective algorithm named AdaInfer to adaptively terminate the inference process for an input instance. AdaInfer relies on easily obtainable statistical features and classic classifiers like SVM. Experiments on well-known LLMs like the Llama2 series and OPT show that AdaInfer achieves an average pruning ratio of 17.8%, and up to 43% on sentiment tasks, with nearly no performance drop (<1%). Because AdaInfer does not alter LLM parameters, LLMs incorporating AdaInfer maintain generalizability across tasks.
- Published
- 2024
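
The early-exit mechanism of AdaInfer (entry 10) is easy to picture in code. The sketch below follows the abstract only: the layer and classifier interfaces, the specific features (top logit and top-1/top-2 gap), and the use of scikit-learn's SVC are illustrative assumptions rather than the paper's exact recipe.

    import numpy as np
    from sklearn.svm import SVC

    def adaptive_infer(layers, lm_head, hidden, stop_clf: SVC):
        """Run transformer blocks one by one; exit once a classifier predicts the
        intermediate output already matches the full-depth output.

        layers: list of callables mapping hidden state -> hidden state (assumed).
        lm_head: callable mapping hidden state -> 1-D vocab logits (assumed).
        stop_clf: an SVM pre-fitted offline on features of intermediate logits,
                  with label 1 meaning "output has converged".
        """
        n = len(layers)
        for i, layer in enumerate(layers):
            hidden = layer(hidden)
            logits = lm_head(hidden)
            top2 = np.sort(logits)[-2:]                      # two largest logits
            feats = np.array([[top2[1], top2[1] - top2[0], (i + 1) / n]])
            if stop_clf.predict(feats)[0] == 1:
                return logits, i + 1                         # remaining layers skipped
        return logits, n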
11. Microsoft Concept Graph: Mining Semantic Concepts for Short Text Understanding
- Author
- Ji, Lei, Wang, Yujing, Shi, Botian, Zhang, Dawei, Wang, Zhongyuan, and Yan, Jun
- Subjects
- Information technology, T58.5-58.64
- Abstract
- Knowledge is important for text-related applications. In this paper, we introduce Microsoft Concept Graph, a knowledge graph engine that provides concept tagging APIs to facilitate the understanding of human languages. Microsoft Concept Graph is built upon Probase, a universal probabilistic taxonomy consisting of instances and concepts mined from the Web. We start by introducing the construction of the knowledge graph through iterative semantic extraction and taxonomy construction procedures, which extract 2.7 million concepts from 1.68 billion Web pages. We then use conceptualization models to represent text in the concept space to empower text-related applications, such as topic search, query recommendation, Web table understanding, and Ads relevance. Since its release in 2016, Microsoft Concept Graph has received more than 100,000 pageviews, 2 million API calls, and 3,000 registered downloads from 50,000 visitors across 64 countries.
- Published
- 2019
12. DVIS++: Improved Decoupled Framework for Universal Video Segmentation
- Author
- Zhang, Tao, Tian, Xingye, Zhou, Yikang, Ji, Shunping, Wang, Xuebo, Tao, Xin, Zhang, Yuan, Wan, Pengfei, Wang, Zhongyuan, and Wu, Yu
- Subjects
- Computer Science - Computer Vision and Pattern Recognition
- Abstract
- We present the Decoupled VIdeo Segmentation (DVIS) framework, a novel approach for the challenging task of universal video segmentation, including video instance segmentation (VIS), video semantic segmentation (VSS), and video panoptic segmentation (VPS). Unlike previous methods that model video segmentation in an end-to-end manner, our approach decouples video segmentation into three cascaded sub-tasks: segmentation, tracking, and refinement. This decoupling design allows for simpler and more effective modeling of the spatio-temporal representations of objects, especially in complex scenes and long videos. Accordingly, we introduce two novel components: the referring tracker and the temporal refiner. These components track objects frame by frame and model spatio-temporal representations based on pre-aligned features. To improve the tracking capability of DVIS, we propose a denoising training strategy and introduce contrastive learning, resulting in a more robust framework named DVIS++. Furthermore, we evaluate DVIS++ in various settings, including open vocabulary and using a frozen pre-trained backbone. By integrating CLIP with DVIS++, we present OV-DVIS++, the first open-vocabulary universal video segmentation framework. We conduct extensive experiments on six mainstream benchmarks, including the VIS, VSS, and VPS datasets. Using a unified architecture, DVIS++ significantly outperforms state-of-the-art specialized methods on these benchmarks in both close- and open-vocabulary settings. Code: https://github.com/zhang-tao-whu/DVIS_Plus
- Published
- 2023
13. KwaiAgents: Generalized Information-seeking Agent System with Large Language Models
- Author
- Pan, Haojie, Zhai, Zepeng, Yuan, Hao, Lv, Yaojia, Fu, Ruiji, Liu, Ming, Wang, Zhongyuan, and Qin, Bing
- Subjects
- Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Machine Learning
- Abstract
- Driven by curiosity, humans have continually sought to explore and understand the world around them, leading to the invention of various tools to satiate this inquisitiveness. Despite not having the capacity to process and memorize vast amounts of information in their brains, humans excel in critical thinking, planning, reflection, and harnessing available tools to interact with and interpret the world, enabling them to find answers efficiently. The recent advancements in large language models (LLMs) suggest that machines might also possess the aforementioned human-like capabilities, allowing them to exhibit powerful abilities even with a constrained parameter count. In this paper, we introduce KwaiAgents, a generalized information-seeking agent system based on LLMs. Within KwaiAgents, we propose an agent system that employs LLMs as its cognitive core, which is capable of understanding a user's query and behavior guidelines, and of referencing external documents. The agent can also update and retrieve information from its internal memory, plan and execute actions using a time-aware search-browse toolkit, and ultimately provide a comprehensive response. We further investigate the system's performance when powered by LLMs less advanced than GPT-4, and introduce the Meta-Agent Tuning (MAT) framework, designed to ensure even an open-sourced 7B or 13B model performs well among many agent systems. We exploit both benchmark and human evaluations to systematically validate these capabilities. Extensive experiments show the superiority of our agent system compared to other autonomous agents and highlight the enhanced generalized agent-abilities of our fine-tuned LLMs.
- Published
- 2023
14. Stable Segment Anything Model
- Author
- Fan, Qi, Tao, Xin, Ke, Lei, Ye, Mingqiao, Zhang, Yuan, Wan, Pengfei, Wang, Zhongyuan, Tai, Yu-Wing, and Tang, Chi-Keung
- Subjects
- Computer Science - Computer Vision and Pattern Recognition
- Abstract
- The Segment Anything Model (SAM) achieves remarkable promptable segmentation given high-quality prompts, which, however, often require good skills to specify. To make SAM robust to casual prompts, this paper presents the first comprehensive analysis of SAM's segmentation stability across a diverse spectrum of prompt qualities, notably imprecise bounding boxes and insufficient points. Our key finding reveals that given such low-quality prompts, SAM's mask decoder tends to activate image features that are biased towards the background or confined to specific object parts. To mitigate this issue, our key idea consists of calibrating solely SAM's mask attention by adjusting the sampling locations and amplitudes of image features, while the original SAM model architecture and weights remain unchanged. Consequently, our deformable sampling plugin (DSP) enables SAM to adaptively shift attention to the prompted target regions in a data-driven manner, facilitated by our effective robust training strategy (RTS). During inference, a dynamic routing plugin (DRP) is proposed that toggles SAM between the deformable and regular grid sampling modes, conditioned on the input prompt quality. Thus, our solution, termed Stable-SAM, offers several advantages: 1) improved segmentation stability across a wide range of prompt qualities, while 2) retaining SAM's powerful promptable segmentation efficiency and generality, with 3) minimal learnable parameters (0.08 M) and fast adaptation (by 1 training epoch). Extensive experiments across multiple datasets validate the effectiveness and advantages of our approach, underscoring Stable-SAM as a more robust solution for segmenting anything. Code will be released upon acceptance: https://github.com/fanq15/Stable-SAM
- Comment: Smaller file size for easy access. Codes will be released upon acceptance. https://github.com/fanq15/Stable-SAM
- Published
- 2023
15. Paragraph-to-Image Generation with Information-Enriched Diffusion Model
- Author
- Wu, Weijia, Li, Zhuang, He, Yefei, Shou, Mike Zheng, Shen, Chunhua, Cheng, Lele, Li, Yan, Gao, Tingting, Zhang, Di, and Wang, Zhongyuan
- Subjects
- Computer Science - Computer Vision and Pattern Recognition
- Abstract
- Text-to-image (T2I) models have recently experienced rapid development, achieving astonishing performance in terms of fidelity and textual alignment capabilities. However, given a long paragraph (up to 512 words), these generation models still struggle to achieve strong alignment and are unable to generate images depicting complex scenes. In this paper, we introduce an information-enriched diffusion model for the paragraph-to-image generation task, termed ParaDiffusion, which delves into the transference of the extensive semantic comprehension capabilities of large language models to the task of image generation. At its core is the use of a large language model (e.g., Llama V2) to encode long-form text, followed by fine-tuning with LoRA to align the text-image feature spaces in the generation task. To facilitate the training of long-text semantic alignment, we also curated a high-quality paragraph-image pair dataset, namely ParaImage. This dataset contains a small amount of high-quality, meticulously annotated data, and a large-scale synthetic dataset with long text descriptions generated using a vision-language model. Experiments demonstrate that ParaDiffusion outperforms state-of-the-art models (SD XL, DeepFloyd IF) on ViLG-300 and ParaPrompts, achieving up to 15% and 45% human voting rate improvements for visual appeal and text faithfulness, respectively. The code and dataset will be released to foster community research on long-text alignment.
- Comment: The project website is at: https://weijiawu.github.io/ParaDiffusionPage/. Code: https://github.com/weijiawu/ParaDiffusion
- Published
- 2023
16. Temporal-Aware Refinement for Video-based Human Pose and Shape Recovery
- Author
- Chen, Ming, Zhou, Yan, Jian, Weihua, Wan, Pengfei, and Wang, Zhongyuan
- Subjects
- Computer Science - Computer Vision and Pattern Recognition
- Abstract
- Though significant progress in human pose and shape recovery from monocular RGB images has been made in recent years, obtaining 3D human motion with high accuracy and temporal consistency from videos remains challenging. Existing video-based methods tend to reconstruct human motion from global image features, which lack detailed representation capability and limit the reconstruction accuracy. In this paper, we propose a Temporal-Aware Refining Network (TAR) to synchronously explore temporal-aware global and local image features for accurate pose and shape recovery. First, a global transformer encoder is introduced to obtain temporal global features from static feature sequences. Second, a bidirectional ConvGRU network takes the sequence of high-resolution feature maps as input, and outputs temporal local feature maps that maintain high resolution and capture the local motion of the human body. Finally, a recurrent refinement module iteratively updates estimated SMPL parameters by leveraging both global and local temporal information to achieve accurate and smooth results. Extensive experiments demonstrate that TAR obtains more accurate results than previous state-of-the-art methods on popular benchmarks, i.e., 3DPW, MPI-INF-3DHP, and Human3.6M.
- Comment: 20 pages, 12 figures
- Published
- 2023
17. Just Ask One More Time! Self-Agreement Improves Reasoning of Language Models in (Almost) All Scenarios
- Author
- Lin, Lei, Fu, Jiayi, Liu, Pengli, Li, Qingyang, Gong, Yan, Wan, Junchen, Zhang, Fuzheng, Wang, Zhongyuan, Zhang, Di, and Gai, Kun
- Subjects
- Computer Science - Computation and Language, Computer Science - Artificial Intelligence
- Abstract
- Although chain-of-thought (CoT) prompting combined with language models has achieved encouraging results on complex reasoning tasks, the naive greedy decoding used in CoT prompting usually causes repetitiveness and local optimality. To address this shortcoming, ensemble optimization tries to obtain multiple reasoning paths and assemble them into a final answer. However, current ensemble-optimization methods either simply employ rule-based post-processing such as self-consistency, or train an additional model based on several task-related human annotations to select the best one among multiple reasoning paths, yet they fail to generalize to realistic settings where the type of input questions or the answer format of reasoning paths is unknown. To avoid their limitations, we propose Self-Agreement, a generalizable ensemble-optimization method that applies in almost all scenarios where the type of input questions and the answer format of reasoning paths may be known or unknown. Self-Agreement first samples from the language model's decoder to generate a diverse set of reasoning paths, and subsequently prompts the language model one more time to determine the optimal answer by selecting the most agreed-upon answer among the sampled reasoning paths. Self-Agreement simultaneously achieves remarkable performance on six public reasoning benchmarks and superior generalization capabilities.
- Comment: Accepted by Findings of ACL 2024
- Published
- 2023
18. Improving Vision-and-Language Reasoning via Spatial Relations Modeling
- Author
- Yang, Cheng, Xu, Rui, Guo, Ye, Huang, Peixiang, Chen, Yiru, Ding, Wenkui, Wang, Zhongyuan, and Zhou, Hong
- Subjects
- Computer Science - Computer Vision and Pattern Recognition
- Abstract
- Visual commonsense reasoning (VCR) is a challenging multi-modal task which requires high-level cognition and commonsense reasoning ability about the real world. In recent years, large-scale pre-training approaches have been developed and have promoted the state-of-the-art performance of VCR. However, existing approaches almost exclusively employ BERT-like objectives to learn multi-modal representations. These objectives, motivated by the text domain, are insufficient for exploiting the complex scenarios of the visual modality. Most importantly, the spatial distribution of the visual objects is basically neglected. To address the above issue, we propose to construct a spatial relation graph based on the given visual scenario. Further, we design two pre-training tasks, named object position regression (OPR) and spatial relation classification (SRC), to learn to reconstruct the spatial relation graph. Quantitative analysis suggests that the proposed method can guide the representations to maintain more spatial context and facilitate attention on the essential visual regions for reasoning. We achieve state-of-the-art results on VCR and on two other vision-and-language reasoning tasks, VQA and NLVR.
- Published
- 2023
19. Unveiling the Hub Genes Involved in Cadmium-Induced Hepatotoxicity
- Author
- Yang, Bing, Wang, Zhongyuan, Wang, Shujuan, and Li, Xiaofeng
- Published
- 2024
20. Global Justice Index Report 2023
- Author
- Gu, Yanfeng, Guo, Sujian, Gan, Yiqing, Qin, Xuan, Qu, Wen, Wang, Zhongyuan, and Zhang, Tiantian
- Published
- 2024
21. Graph Ranking Contrastive Learning: An Extremely Simple yet Efficient Method
- Author
- Hu, Yulan, Ouyang, Sheng, Liu, Jingyu, Chen, Ge, Yang, Zhirui, Wan, Junchen, Zhang, Fuzheng, Wang, Zhongyuan, and Liu, Yong
- Subjects
- Computer Science - Machine Learning, Computer Science - Artificial Intelligence
- Abstract
- Graph contrastive learning (GCL) has emerged as a representative graph self-supervised method, achieving significant success. The currently prevalent optimization objective for GCL is InfoNCE. Typically, it employs augmentation techniques to obtain two views, where a node in one view acts as the anchor, the corresponding node in the other view serves as the positive sample, and all other nodes are regarded as negative samples. The goal is to minimize the distance between the anchor node and positive samples and maximize the distance to negative samples. However, due to the lack of label information during training, InfoNCE inevitably treats samples from the same class as negative samples, leading to the issue of false negative samples. This can impair the learned node representations and subsequently hinder performance in downstream tasks. While numerous methods have been proposed to mitigate the impact of false negatives, they still face various challenges. For instance, while increasing the number of negative samples can dilute the impact of false negatives, it concurrently increases the computational burden. Thus, we propose GraphRank, a simple yet efficient graph contrastive learning method that addresses the problem of false negative samples by redefining the concept of negative samples to a certain extent, thereby avoiding the issue altogether. The effectiveness of GraphRank is empirically validated through experiments on node-, edge-, and graph-level tasks.
- Published
- 2023
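
For readers unfamiliar with the InfoNCE objective that entry 21 discusses, a minimal sketch of its standard two-view GCL form is below. The temperature value and tensor shapes are illustrative; note that GraphRank itself modifies this setup rather than using it as-is.

    import torch
    import torch.nn.functional as F

    def info_nce(z_a: torch.Tensor, z_b: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
        """Standard two-view InfoNCE: node i in view A is the anchor, node i in
        view B the positive, and every other node a negative. Without labels,
        same-class nodes inevitably land among the negatives (false negatives)."""
        z_a = F.normalize(z_a, dim=1)                     # [N, d] embeddings, view A
        z_b = F.normalize(z_b, dim=1)                     # [N, d] embeddings, view B
        logits = z_a @ z_b.t() / tau                      # [N, N] similarity matrix
        targets = torch.arange(z_a.size(0), device=z_a.device)
        return F.cross_entropy(logits, targets)           # positives on the diagonal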
22. KwaiYiiMath: Technical Report
- Author
- Fu, Jiayi, Lin, Lei, Gao, Xiaoyang, Liu, Pengli, Chen, Zhengzong, Yang, Zhirui, Zhang, Shengnan, Zheng, Xue, Li, Yan, Liu, Yuliang, Ye, Xucheng, Liao, Yiqiao, Liao, Chao, Chen, Bin, Song, Chengru, Wan, Junchen, Lin, Zijia, Zhang, Fuzheng, Wang, Zhongyuan, Zhang, Di, and Gai, Kun
- Subjects
- Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
- Abstract
- Recent advancements in large language models (LLMs) have demonstrated remarkable abilities in handling a variety of natural language processing (NLP) downstream tasks, even on mathematical tasks requiring multi-step reasoning. In this report, we introduce KwaiYiiMath, which enhances the mathematical reasoning abilities of KwaiYiiBase by applying Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF) on both English and Chinese mathematical tasks. Meanwhile, we also constructed a small-scale Chinese primary school mathematics test set (named KMath), consisting of 188 examples, to evaluate the correctness of the problem-solving process generated by the models. Empirical studies demonstrate that KwaiYiiMath achieves state-of-the-art (SOTA) performance on GSM8k, CMath, and KMath, respectively, compared with models of similar size.
- Comment: Technical report. arXiv admin note: text overlap with arXiv:2306.16636 by other authors
- Published
- 2023
23. Exploring Sentence Type Effects on the Lombard Effect and Intelligibility Enhancement: A Comparative Study of Natural and Grid Sentences
- Author
- Chen, Hongyang, Yang, Yuhong, Wang, Zhongyuan, Tu, Weiping, Ai, Haojun, and Lin, Song
- Subjects
- Computer Science - Sound, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Audio and Speech Processing
- Abstract
- This study explores how sentence types affect the Lombard effect and intelligibility enhancement, focusing on comparisons between natural and grid sentences. Using the Lombard Chinese-TIMIT (LCT) corpus and the Enhanced MAndarin Lombard Grid (EMALG) corpus, we analyze changes in phonetic and acoustic features across different noise levels. Our results show that grid sentences produce more pronounced Lombard effects than natural sentences. We then develop and test a normal-to-Lombard conversion model, trained separately on the LCT and EMALG corpora. Through subjective and objective evaluations, natural sentences prove superior in maintaining speech quality during intelligibility enhancement. In contrast, grid sentences could provide superior intelligibility due to the more pronounced Lombard effect. This study provides a valuable perspective on enhancing speech communication in noisy environments.
- Published
- 2023
24. Code-Style In-Context Learning for Knowledge-Based Question Answering
- Author
- Nie, Zhijie, Zhang, Richong, Wang, Zhongyuan, and Liu, Xudong
- Subjects
- Computer Science - Computation and Language, Computer Science - Artificial Intelligence
- Abstract
- Current methods for Knowledge-Based Question Answering (KBQA) usually rely on complex training techniques and model frameworks, leading to many limitations in practical applications. Recently, the emergence of In-Context Learning (ICL) capabilities in Large Language Models (LLMs) has provided a simple and training-free semantic parsing paradigm for KBQA: given a small number of questions and their labeled logical forms as demo examples, LLMs can understand the task intent and generate the logical form for a new question. However, current powerful LLMs have little exposure to logical forms during pre-training, resulting in a high format error rate. To solve this problem, we propose a code-style in-context learning method for KBQA, which converts the generation of unfamiliar logical forms into the more familiar code generation process for LLMs. Experimental results on three mainstream datasets show that our method dramatically mitigates the formatting errors in generated logical forms while realizing a new SOTA on WebQSP, GrailQA, and GraphQ under the few-shot setting. The code and supplementary files are released at https://github.com/Arthurizijar/KB-Coder.
- Comment: AAAI 2024 camera-ready
- Published
- 2023
25. Towards Practical Capture of High-Fidelity Relightable Avatars
- Author
- Yang, Haotian, Zheng, Mingwu, Feng, Wanquan, Huang, Haibin, Lai, Yu-Kun, Wan, Pengfei, Wang, Zhongyuan, and Ma, Chongyang
- Subjects
- Computer Science - Computer Vision and Pattern Recognition
- Abstract
- In this paper, we propose a novel framework, Tracking-free Relightable Avatar (TRAvatar), for capturing and reconstructing high-fidelity 3D avatars. Compared to previous methods, TRAvatar works in a more practical and efficient setting. Specifically, TRAvatar is trained with dynamic image sequences captured in a Light Stage under varying lighting conditions, enabling realistic relighting and real-time animation for avatars in diverse scenes. Additionally, TRAvatar allows for tracking-free avatar capture and obviates the need for accurate surface tracking under varying illumination conditions. Our contributions are two-fold: First, we propose a novel network architecture that explicitly builds on and ensures the satisfaction of the linear nature of lighting. Trained on simple group light captures, TRAvatar can predict the appearance in real-time with a single forward pass, achieving high-quality relighting effects under illuminations of arbitrary environment maps. Second, we jointly optimize the facial geometry and relightable appearance from scratch based on image sequences, where the tracking is implicitly learned. This tracking-free approach brings robustness for establishing temporal correspondences between frames under different lighting conditions. Extensive qualitative and quantitative experiments demonstrate that our framework achieves superior performance for photorealistic avatar animation and relighting.
- Comment: Accepted to SIGGRAPH Asia 2023 (Conference); Project page: https://travatar-paper.github.io/
- Published
- 2023
26. Improving Prosody for Cross-Speaker Style Transfer by Semi-Supervised Style Extractor and Hierarchical Modeling in Speech Synthesis
- Author
- Qiang, Chunyu, Yang, Peng, Che, Hao, Zhang, Ying, Wang, Xiaorui, and Wang, Zhongyuan
- Subjects
- Computer Science - Sound, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Electrical Engineering and Systems Science - Audio and Speech Processing
- Abstract
- Cross-speaker style transfer in speech synthesis aims at transferring a style from a source speaker to synthesized speech in a target speaker's timbre. In most previous methods, the synthesized fine-grained prosody features often represent the source speaker's average style, similar to the one-to-many problem (i.e., multiple prosody variations correspond to the same text). In response to this problem, a strength-controlled semi-supervised style extractor is proposed to disentangle the style from content and timbre, improving the representation and interpretability of the global style embedding, which can alleviate the one-to-many mapping and data imbalance problems in prosody prediction. A hierarchical prosody predictor is proposed to improve prosody modeling. We find that better style transfer can be achieved by using the source speaker's prosody features that are easily predicted. Additionally, a speaker-transfer-wise cycle consistency loss is proposed to assist the model in learning unseen style-timbre combinations during the training phase. Experimental results show that the method outperforms the baseline. We provide a website with audio samples.
- Comment: Accepted by ICASSP 2023
- Published
- 2023
27. Nonlinear Aerodynamic Modeling and Analysis on Body of Fixed Canard Dual-Spin Projectiles
- Author
- Zhao, Xinxin, Shi, Jinguang, Wang, Zhongyuan, Zhang, Ning, Wang, Cheng, Chaari, Fakher, Series Editor, Gherardini, Francesco, Series Editor, Ivanov, Vitalii, Series Editor, Haddar, Mohamed, Series Editor, Cavas-Martínez, Francisco, Editorial Board Member, di Mare, Francesca, Editorial Board Member, Kwon, Young W., Editorial Board Member, Trojanowska, Justyna, Editorial Board Member, Xu, Jinyang, Editorial Board Member, Rui, Xiaoting, editor, and Liu, Caishan, editor
- Published
- 2024
28. TalkSee: Interactive Video Retrieval Engine Using Large Language Model
- Author
- Gu, Guihe, Wu, Zhengqian, He, Jiangshan, Song, Lin, Wang, Zhongyuan, Liang, Chao, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Rudinac, Stevan, editor, Hanjalic, Alan, editor, Liem, Cynthia, editor, Worring, Marcel, editor, Jónsson, Björn Þór, editor, Liu, Bei, editor, and Yamakata, Yoko, editor
- Published
- 2024
29. Dynamic mechanical response of hot central plant recycling asphalt pavement considering the rutting deformation of existing structure: pollution reduction and durability promotion
- Author
- Zhan, He, Li, Ning, Tang, Wei, Yu, Xin, and Wang, Zhongyuan
- Published
- 2024
30. Efficient lightweight network for video super-resolution
- Author
- Luo, Laigan, Yi, Benshun, Wang, Zhongyuan, Yi, Peng, and He, Zheng
- Published
- 2024
31. Identification of the Hub Genes Linked to Lead (IV)-Induced Spleen Toxicity Using the Rat Model
- Author
- Yang, Bing, Wang, Zhongyuan, Hu, Zhongze, Wang, Shujuan, Xu, Jingen, and Li, Xiaofeng
- Published
- 2023
32. Style-Label-Free: Cross-Speaker Style Transfer by Quantized VAE and Speaker-wise Normalization in Speech Synthesis
- Author
- Qiang, Chunyu, Yang, Peng, Che, Hao, Wang, Xiaorui, and Wang, Zhongyuan
- Subjects
- Computer Science - Sound, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Electrical Engineering and Systems Science - Audio and Speech Processing
- Abstract
- Cross-speaker style transfer in speech synthesis aims at transferring a style from a source speaker to synthesized speech in a target speaker's timbre. Most previous approaches rely on data with style labels, but manually annotated labels are expensive and not always reliable. In response to this problem, we propose Style-Label-Free, a cross-speaker style transfer method which can realize style transfer from a source speaker to a target speaker without style labels. First, a reference encoder structure based on a quantized variational autoencoder (Q-VAE) and a style bottleneck is designed to extract discrete style representations. Second, a speaker-wise batch normalization layer is proposed to reduce source-speaker leakage. To improve the style extraction ability of the reference encoder, a style-invariant and contrastive data augmentation method is also proposed. Experimental results show that the method outperforms the baseline. We provide a website with audio samples.
- Comment: Published at ISCSLP 2022
- Published
- 2022
33. A Scale-Arbitrary Image Super-Resolution Network Using Frequency-domain Information
- Author
- Fang, Jing, Yu, Yinbo, Wang, Zhongyuan, Ding, Xin, and Hu, Ruimin
- Subjects
- Electrical Engineering and Systems Science - Image and Video Processing, Computer Science - Computer Vision and Pattern Recognition
- Abstract
- Image super-resolution (SR) is a technique to recover lost high-frequency information in low-resolution (LR) images. Spatial-domain information has been widely exploited to implement image SR, so a new trend is to involve frequency-domain information in SR tasks. Besides, image SR is typically application-oriented, and various computer vision tasks call for arbitrary image magnification. Therefore, in this paper, we study image features in the frequency domain to design a novel scale-arbitrary image SR network. First, we statistically analyze LR-HR image pairs of several datasets under different scale factors and find that the high-frequency spectra of different images under different scale factors suffer from different degrees of degradation, but the valid low-frequency spectra tend to be retained within a certain distribution range. Then, based on this finding, we devise an adaptive scale-aware feature division mechanism using deep reinforcement learning, which can accurately and adaptively divide the frequency spectrum into the low-frequency part to be retained and the high-frequency one to be recovered. Finally, we design a scale-aware feature recovery module to capture and fuse multi-level features for reconstructing the high-frequency spectrum at arbitrary scale factors. Extensive experiments on public datasets show the superiority of our method compared with state-of-the-art methods.
- Published
- 2022
34. A Unified Model for Video Understanding and Knowledge Embedding with Heterogeneous Knowledge Graph Dataset
- Author
- Deng, Jiaxin, Shen, Dong, Pan, Haojie, Wu, Xiangyu, Liu, Ximan, Meng, Gaofeng, Yang, Fan, Li, Size, Fu, Ruiji, and Wang, Zhongyuan
- Subjects
- Computer Science - Computer Vision and Pattern Recognition
- Abstract
- Video understanding is an important task on short-video business platforms, with wide application in video recommendation and classification. Most existing video understanding works only focus on the information that appears within the video content, including the video frames, audio, and text. However, introducing common sense knowledge from an external Knowledge Graph (KG) dataset is essential for video understanding when referring to content that is less relevant to the video itself. Owing to the lack of video knowledge graph datasets, work that integrates video understanding and KG is rare. In this paper, we propose a heterogeneous dataset that contains multi-modal video entities and rich common sense relations. The dataset also provides multiple novel video inference tasks, such as the Video-Relation-Tag (VRT) and Video-Relation-Video (VRV) tasks. Furthermore, based on this dataset, we propose an end-to-end model that jointly optimizes the video understanding objective with knowledge graph embedding, which can not only better inject factual knowledge into video understanding but also generate effective multi-modal entity embeddings for the KG. Comprehensive experiments indicate that combining video understanding embedding with factual knowledge benefits content-based video retrieval performance. Moreover, it also helps the model generate better knowledge graph embeddings, which outperform traditional KGE-based methods on the VRT and VRV tasks with at least 42.36% and 17.73% improvement in HITS@10, respectively.
- Comment: Accepted by ICMR 2023
- Published
- 2022
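
To make the joint optimization in entry 34 concrete, here is a schematic sketch of a combined objective: a video-understanding loss plus a knowledge-graph-embedding loss over (head, relation, tail) triples. TransE scoring, margin ranking, and the weight lam are illustrative stand-ins, not necessarily the paper's choices.

    import torch

    def transe_margin_loss(h, r, t, h_neg, t_neg, margin: float = 1.0):
        """Margin ranking over TransE scores ||h + r - t||; *_neg are corrupted
        heads/tails used as negatives. All arguments are [B, d] float tensors."""
        pos = (h + r - t).norm(p=2, dim=1)
        neg = (h_neg + r - t_neg).norm(p=2, dim=1)
        return torch.clamp(margin + pos - neg, min=0).mean()

    def joint_loss(video_loss: torch.Tensor, h, r, t, h_neg, t_neg, lam: float = 0.5):
        # Video-understanding objective and KGE objective trained together, so the
        # multi-modal entity embeddings also have to fit the knowledge graph.
        return video_loss + lam * transe_margin_loss(h, r, t, h_neg, t_neg)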
35. Back-Translation-Style Data Augmentation for Mandarin Chinese Polyphone Disambiguation
- Author
- Qiang, Chunyu, Yang, Peng, Che, Hao, Xiao, Jinba, Wang, Xiaorui, and Wang, Zhongyuan
- Subjects
- Computer Science - Sound, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Electrical Engineering and Systems Science - Audio and Speech Processing
- Abstract
- Conversion of Chinese Grapheme-to-Phoneme (G2P) plays an important role in Mandarin Chinese Text-To-Speech (TTS) systems, where one of the biggest challenges is the task of polyphone disambiguation. Most previous polyphone disambiguation models are trained on manually annotated datasets, and publicly available datasets for polyphone disambiguation are scarce. In this paper, we propose a simple back-translation-style data augmentation method for Mandarin Chinese polyphone disambiguation, utilizing a large amount of unlabeled text data. Inspired by the back-translation technique from the field of machine translation, we build a Grapheme-to-Phoneme (G2P) model to predict the pronunciation of a polyphonic character, and a Phoneme-to-Grapheme (P2G) model to convert the pronunciation back into text. Meanwhile, a window-based matching strategy and a multi-model scoring strategy are proposed to judge the correctness of the pseudo-label. We design a data balance strategy to improve the accuracy of some typical polyphonic characters in the training set with imbalanced distribution or data scarcity. Experimental results show the effectiveness of the proposed back-translation-style data augmentation method.
- Comment: Published at APSIPA ASC 2022
- Published
- 2022
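
The pseudo-labeling loop in entry 35 can be summarized as follows. The g2p/p2g model interfaces, the character-window comparison, and the window size are assumptions made for illustration; the paper's multi-model scoring and data balancing steps are omitted here.

    def pseudo_label(samples, g2p, p2g, window: int = 2):
        """Back-translation-style filtering: keep a G2P pseudo-label only if a P2G
        model can reconstruct the text around the polyphonic character.

        samples: iterable of (text, idx) pairs, idx locating the polyphone.
        g2p: text -> list of per-character pronunciations (assumed interface).
        p2g: pronunciations -> reconstructed text (assumed interface).
        """
        kept = []
        for text, idx in samples:
            phones = g2p(text)                         # predict pronunciations
            recon = p2g(phones)                        # translate back to text
            lo, hi = max(0, idx - window), idx + window + 1
            if recon[lo:hi] == text[lo:hi]:            # window-based matching check
                kept.append((text, idx, phones[idx]))  # accept the pseudo-label
        return kept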
36. Kuaipedia: a Large-scale Multi-modal Short-video Encyclopedia
- Author
- Pan, Haojie, Zhai, Zepeng, Zhang, Yuzhou, Fu, Ruiji, Liu, Ming, Song, Yangqiu, Wang, Zhongyuan, and Qin, Bing
- Subjects
- Computer Science - Information Retrieval, Computer Science - Artificial Intelligence, Computer Science - Computation and Language
- Abstract
- Online encyclopedias, such as Wikipedia, have been well developed and researched in the last two decades. One can find any attributes or other information of a wiki item on a wiki page edited by a community of volunteers. However, traditional text, images, and tables can hardly express some aspects of a wiki item. For example, when we talk about "Shiba Inu", one may care more about "how to feed it" or "how to train it not to protect its food". Currently, short-video platforms have become a hallmark of the online world. Whether on TikTok, Instagram, Kuaishou, or YouTube Shorts, short-video apps have changed how we consume and create content today. Beyond short videos produced for entertainment, we can find more and more authors sharing insightful knowledge widely across all walks of life. These short videos, which we call knowledge videos, can easily express any aspect (e.g., hair or how-to-feed) consumers want to know about an item (e.g., Shiba Inu), and they can be systematically analyzed and organized like an online encyclopedia. In this paper, we propose Kuaipedia, a large-scale multi-modal encyclopedia consisting of items, aspects, and short videos linked to them, which was extracted from billions of videos of Kuaishou (Kwai), a well-known short-video platform in China. We first collected items from multiple sources and mined user-centered aspects from millions of users' queries to build an item-aspect tree. Then we propose a new task called "multi-modal item-aspect linking" as an expansion of "entity linking" to link short videos into item-aspect pairs, and build the whole short-video encyclopedia. Intrinsic evaluations show that our encyclopedia is of large scale and highly accurate. We also conduct sufficient extrinsic experiments to show how Kuaipedia can help fundamental applications such as entity typing and entity linking.
- Published
- 2022
37. RaP: Redundancy-aware Video-language Pre-training for Text-Video Retrieval
- Author
- Wu, Xing, Gao, Chaochen, Lin, Zijia, Wang, Zhongyuan, Han, Jizhong, and Hu, Songlin
- Subjects
- Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
- Abstract
- Video-language pre-training methods have mainly adopted sparse sampling techniques to alleviate the temporal redundancy of videos. Though effective, sparse sampling still suffers from inter-modal redundancy: visual redundancy and textual redundancy. Compared with highly generalized text, sparsely sampled frames usually contain text-independent portions, called visual redundancy. Sparse sampling is also likely to miss important frames corresponding to some text portions, resulting in textual redundancy. Inter-modal redundancy leads to a mismatch of video and text information, hindering the model from better learning the shared semantics across modalities. To alleviate it, we propose Redundancy-aware Video-language Pre-training. We design a redundancy measurement for video patches and text tokens by calculating the cross-modal minimum dissimilarity. Then, we penalize the highly redundant video patches and text tokens through a proposed redundancy-aware contrastive learning. We evaluate our method on four benchmark datasets, MSRVTT, MSVD, DiDeMo, and LSMDC, achieving a significant improvement over the previous state-of-the-art results. Our code is available at https://github.com/caskcsg/VLP/tree/main/RaP.
- Comment: EMNLP 2022
- Published
- 2022
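
The "cross-modal minimum dissimilarity" redundancy measurement in entry 37 reduces to a few tensor operations. This is a minimal sketch under assumed shapes; how RaP folds these scores into its contrastive penalty is not reproduced here.

    import torch
    import torch.nn.functional as F

    def redundancy_scores(patches: torch.Tensor, tokens: torch.Tensor):
        """patches: [P, d] video patch features; tokens: [T, d] text token features.
        A patch far from every token (large minimum dissimilarity) is likely
        text-independent, and symmetrically for text tokens."""
        p = F.normalize(patches, dim=-1)
        t = F.normalize(tokens, dim=-1)
        dis = 1.0 - p @ t.t()                 # [P, T] cosine dissimilarities
        patch_red = dis.min(dim=1).values     # per-patch: distance to closest token
        token_red = dis.min(dim=0).values     # per-token: distance to closest patch
        return patch_red, token_red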
38. Bridging CLIP and StyleGAN through Latent Alignment for Image Editing
- Author
- Zheng, Wanfeng, Li, Qiang, Guo, Xiaoyan, Wan, Pengfei, and Wang, Zhongyuan
- Subjects
- Computer Science - Computer Vision and Pattern Recognition
- Abstract
- Text-driven image manipulation has developed rapidly since the vision-language model CLIP was proposed. Previous work has adopted CLIP to design text-image consistency-based objectives to address this issue. However, these methods require either test-time optimization or image feature cluster analysis for single-mode manipulation direction. In this paper, we achieve inference-time optimization-free diverse manipulation direction mining by bridging CLIP and StyleGAN through Latent Alignment (CSLA). More specifically, our efforts consist of three parts: 1) a data-free training strategy to train latent mappers that bridge the latent spaces of CLIP and StyleGAN; 2) for more precise mapping, temporal relative consistency is proposed to address the knowledge distribution bias problem among different latent spaces; 3) to refine the mapped latent in S space, adaptive style mixing is also proposed. With this mapping scheme, we can achieve GAN inversion, text-to-image generation, and text-driven image manipulation. Qualitative and quantitative comparisons are made to demonstrate the effectiveness of our method.
- Comment: 20 pages, 23 figures
- Published
- 2022
39. InfoCSE: Information-aggregated Contrastive Learning of Sentence Embeddings
- Author
- Wu, Xing, Gao, Chaochen, Lin, Zijia, Han, Jizhong, Wang, Zhongyuan, and Hu, Songlin
- Subjects
- Computer Science - Computation and Language
- Abstract
- Contrastive learning has been extensively studied in sentence embedding learning, which assumes that the embeddings of different views of the same sentence are closer. The constraint brought by this assumption is weak, and a good sentence representation should also be able to reconstruct the original sentence fragments. Therefore, this paper proposes an information-aggregated contrastive learning framework for learning unsupervised sentence embeddings, termed InfoCSE. InfoCSE forces the representation of the [CLS] position to aggregate denser sentence information by introducing an additional masked language model task and a well-designed network. We evaluate the proposed InfoCSE on several benchmark datasets for the semantic textual similarity (STS) task. Experimental results show that InfoCSE outperforms SimCSE by an average Spearman correlation of 2.60% on BERT-base and 1.77% on BERT-large, achieving state-of-the-art results among unsupervised sentence representation learning methods. Our code is available at https://github.com/caskcsg/sentemb/tree/main/InfoCSE.
- Comment: EMNLP 2022
- Published
- 2022
40. TokenFlow: Rethinking Fine-grained Cross-modal Alignment in Vision-Language Retrieval
- Author
- Zou, Xiaohan, Wu, Changqiao, Cheng, Lele, and Wang, Zhongyuan
- Subjects
- Computer Science - Computer Vision and Pattern Recognition
- Abstract
- Most existing methods in vision-language retrieval match two modalities by either comparing their global feature vectors, which misses fine-grained information and lacks interpretability; detecting objects in images or videos and aligning the text with fine-grained features, which relies on complicated model designs; or modeling fine-grained interaction via cross-attention upon visual and textual tokens, which suffers from inferior efficiency. To address these limitations, some recent works simply aggregate token-wise similarities to achieve fine-grained alignment, but they lack intuitive explanations and neglect the relationships between token-level features and global representations with high-level semantics. In this work, we rethink fine-grained cross-modal alignment and devise a new model-agnostic formulation for it. We additionally demystify the recent popular works and subsume them into our scheme. Furthermore, inspired by optimal transport theory, we introduce TokenFlow, an instantiation of the proposed scheme. By modifying only the similarity function, the performance of our method is comparable to the SoTA algorithms with heavy model designs on major video-text retrieval benchmarks. The visualization further indicates that TokenFlow successfully leverages the fine-grained information and achieves better interpretability.
- Published
- 2022
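
As background for entry 40, the token-wise similarity aggregation that the abstract says recent works rely on can be written as a max-mean reduction; TokenFlow itself replaces this reduction with an optimal-transport-inspired similarity function, which the sketch below does not implement.

    import torch
    import torch.nn.functional as F

    def token_max_mean_sim(text_tok: torch.Tensor, vis_tok: torch.Tensor) -> torch.Tensor:
        """text_tok: [T, d] text token features; vis_tok: [V, d] visual token features.
        Each text token takes its best-matching visual token; scores are averaged."""
        t = F.normalize(text_tok, dim=-1)
        v = F.normalize(vis_tok, dim=-1)
        sim = t @ v.t()                          # [T, V] token-to-token cosines
        return sim.max(dim=1).values.mean()      # max over visual, mean over text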
41. ConTextual Masked Auto-Encoder for Dense Passage Retrieval
- Author
- Wu, Xing, Ma, Guangyuan, Lin, Meng, Lin, Zijia, Wang, Zhongyuan, and Hu, Songlin
- Subjects
- Computer Science - Computation and Language, Computer Science - Artificial Intelligence
- Abstract
- Dense passage retrieval aims to retrieve the relevant passages of a query from a large corpus based on dense representations (i.e., vectors) of the query and the passages. Recent studies have explored improving pre-trained language models to boost dense retrieval performance. This paper proposes CoT-MAE (ConTextual Masked Auto-Encoder), a simple yet effective generative pre-training method for dense passage retrieval. CoT-MAE employs an asymmetric encoder-decoder architecture that learns to compress the sentence semantics into a dense vector through self-supervised and context-supervised masked auto-encoding. Precisely, self-supervised masked auto-encoding learns to model the semantics of the tokens inside a text span, and context-supervised masked auto-encoding learns to model the semantical correlation between the text spans. We conduct experiments on large-scale passage retrieval benchmarks and show considerable improvements over strong baselines, demonstrating the high efficiency of CoT-MAE. Our code is available at https://github.com/caskcsg/ir/tree/main/cotmae.
- Comment: Accepted by AAAI 2023
- Published
- 2022
42. Magic ELF: Image Deraining Meets Association Learning and Transformer
- Author
- Jiang, Kui, Wang, Zhongyuan, Chen, Chen, Wang, Zheng, Cui, Laizhong, and Lin, Chia-Wen
- Subjects
- Computer Science - Computer Vision and Pattern Recognition
- Abstract
- Convolutional neural networks (CNNs) and Transformers have achieved great success in multimedia applications. However, little effort has been made to effectively and efficiently harmonize these two architectures for image deraining. This paper aims to unify the two architectures to take advantage of their learning merits for image deraining. In particular, the local connectivity and translation equivariance of CNNs and the global aggregation ability of self-attention (SA) in Transformers are fully exploited for specific local context and global structure representations. Based on the observation that rain distribution reveals the degradation location and degree, we introduce a degradation prior to help background recovery and accordingly present an association refinement deraining scheme. A novel multi-input attention module (MAM) is proposed to associate rain perturbation removal with background recovery. Moreover, we equip our model with effective depth-wise separable convolutions to learn specific feature representations and trade off computational complexity. Extensive experiments show that our proposed method (dubbed ELF) outperforms the state-of-the-art approach (MPRNet) by 0.25 dB on average, while accounting for only 11.7% and 42.1% of its computational cost and parameters, respectively. The source code is available at https://github.com/kuijiang94/Magic-ELF.
- Published
- 2022
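The depth-wise separable convolution mentioned in the ELF abstract factorizes a standard convolution into a per-channel spatial filter plus a 1x1 channel mixer, which is where the parameter and FLOP savings come from. A generic sketch (not the Magic-ELF code, which is linked above):

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """k*k depthwise conv (one filter per channel) + 1x1 pointwise conv."""
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, k, padding=k // 2, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

y = DepthwiseSeparableConv(64, 128)(torch.randn(1, 64, 32, 32))
```

For a k*k kernel this costs roughly k*k*C + C*C' multiply-accumulates per pixel instead of k*k*C*C', consistent with the cost reductions the abstract reports.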
43. Real-time End-to-End Video Text Spotter with Contrastive Representation Learning
- Author
-
Wu, Weijia, Li, Zhuang, Li, Jiahong, Shen, Chunhua, Zhou, Hong, Li, Size, Wang, Zhongyuan, and Luo, Ping
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Video text spotting (VTS) is the task of simultaneously detecting, tracking, and recognizing text in video. Existing video text spotting methods typically develop sophisticated pipelines with multiple models, which is unfriendly to real-time applications. Here we propose a real-time end-to-end video text spotter with Contrastive Representation learning (CoText). Our contributions are three-fold: 1) CoText simultaneously addresses the three tasks (i.e., text detection, tracking, and recognition) in a real-time, end-to-end trainable framework. 2) With contrastive learning, CoText models long-range dependencies and learns temporal information across multiple frames. 3) A simple, lightweight architecture is designed for effective and accurate performance, including GPU-parallel detection post-processing and a CTC-based recognition head with Masked RoI. Extensive experiments show the superiority of our method. In particular, CoText achieves a video text spotting IDF1 of 72.0% at 41.0 FPS on ICDAR2015video, a 10.5% and 32.0 FPS improvement over the previous best method. The code can be found at github.com/weijiawu/CoText., Comment: Merged with arXiv article 2207.08417; the two papers will be withdrawn and a new one created
- Published
- 2022
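A CTC-based recognition head of the kind the CoText abstract mentions can be sketched in a few lines: a linear projection over per-timestep RoI features plus PyTorch's built-in CTC loss. Feature sizes, sequence lengths, and the 37-class alphabet below are assumptions, not CoText's configuration.

```python
import torch
import torch.nn as nn

num_classes = 37  # assumed: 36 alphanumerics + 1 CTC blank
head = nn.Linear(256, num_classes)
ctc = nn.CTCLoss(blank=0, zero_infinity=True)

feats = torch.randn(20, 4, 256)                  # (T, batch, C) RoI text features
log_probs = head(feats).log_softmax(-1)          # (T, batch, num_classes)
targets = torch.randint(1, num_classes, (4, 7))  # ground-truth transcriptions
loss = ctc(log_probs, targets,
           input_lengths=torch.full((4,), 20, dtype=torch.long),
           target_lengths=torch.full((4,), 7, dtype=torch.long))
```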
44. Estimation of non-symmetric and unbounded region of attraction using shifted shape function and R-composition
- Author
-
Li, Dongyang, Ignatyev, Dmitry, Tsourdos, Antonios, and Wang, Zhongyuan
- Subjects
Mathematics - Numerical Analysis - Abstract
A general numerical method using sum-of-squares programming is proposed to address the problem of estimating the region of attraction (ROA) of an asymptotically stable equilibrium point of a nonlinear polynomial system. The method is based on Lyapunov theory, and a shape function is defined to enlarge the provable subset of a local Lyapunov function. In contrast with existing methods whose shape function is centered at the equilibrium point, the proposed method utilizes a shifted shape function (SSF) whose center is shifted iteratively towards the boundary of the newly obtained invariant subset to improve the ROA estimate. A set of shifting centers with corresponding SSFs is generated to produce proven subsets of the exact ROA, and a composition method, namely R-composition, is then employed to express these independent sets compactly as a single, richer-shaped level set. The proposed method, denoted RcomSSF, brings a significant improvement for general ROA estimation problems, especially for non-symmetric or unbounded ROAs, while keeping the computational burden at a reasonable level. Its effectiveness and advantages are demonstrated by several benchmark examples from the literature., Comment: 44 pages, 9 figures, ISA Transactions, 2022
- Published
- 2022
- Full Text
- View/download PDF
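To give a feel for the R-composition step, the snippet below merges two proven ROA subsets {V_i(x) <= 1} into a single function using the classical Rvachev R-disjunction. This is a generic construction from the R-functions literature, offered only as an illustration of how independent level sets can be expressed as one; it is not claimed to be the paper's exact formulation.

```python
import numpy as np

def r_union(f1, f2):
    """Rvachev R-disjunction: nonnegative exactly where f1 >= 0 or f2 >= 0."""
    return f1 + f2 + np.sqrt(f1**2 + f2**2)

# Encode each proven subset {V_i(x) <= 1} as f_i(x) = 1 - V_i(x) >= 0.
# The quadratic V_i below are illustrative Lyapunov-like functions.
def f1(x): return 1.0 - (x[0]**2 + 4.0 * x[1]**2)
def f2(x): return 1.0 - ((x[0] - 1.0)**2 + x[1]**2)

x = np.array([0.8, 0.1])
inside_union = r_union(f1(x), f2(x)) >= 0.0  # True iff x lies in either subset
```

A single composed function of this kind is what allows a union of differently shaped subsets to be reported as one richer-shaped level set.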
45. Deepfake Face Traceability with Disentangling Reversing Network
- Author
-
Ai, Jiaxin, Wang, Zhongyuan, Huang, Baojin, and Han, Zhen
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Deepfake faces not only violate personal identity privacy but also confuse the public and cause serious social harm. Current deepfake detection only distinguishes real from fake and cannot trace the original genuine face corresponding to a fake one; that is, it lacks the ability to trace the source as evidence. Deepfake countermeasure technology for judicial forensics therefore urgently calls for deepfake traceability. This paper pioneers an interesting question in face deepfake analysis: active forensics that establishes not only that an image is fake but also how it was made. Given that deepfake faces do not completely discard the features of the original faces, especially facial expressions and poses, we argue that original faces can be approximately inferred from their deepfake counterparts. Accordingly, we design a disentangling reversing network that decouples the latent-space features of deepfake faces under the supervision of fake-original face pairs to infer the original faces in reverse., Comment: 5 pages, 4 figures
- Published
- 2022
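A toy rendering of the disentangle-then-reverse idea from the abstract above: an encoder splits a fake face's latent code, one part meant to retain the expression/pose cues inherited from the original face, and a decoder maps that part back toward the original, supervised by fake-original pairs. The architecture, latent split, and loss are illustrative assumptions, not the paper's network.

```python
import torch
import torch.nn as nn

class ReversingSketch(nn.Module):
    def __init__(self, d=128):
        super().__init__()
        self.encode = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 2 * d))
        self.decode = nn.Sequential(nn.Linear(d, 3 * 64 * 64),
                                    nn.Unflatten(1, (3, 64, 64)))

    def forward(self, fake_img, original_img):
        z = self.encode(fake_img)
        z_inherited, _ = z.chunk(2, dim=1)  # half of the code: inherited cues
        recon = self.decode(z_inherited)    # speculate the original face
        return nn.functional.mse_loss(recon, original_img)  # pair supervision

loss = ReversingSketch()(torch.randn(2, 3, 64, 64), torch.randn(2, 3, 64, 64))
```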
46. Diagnosing Ensemble Few-Shot Classifiers
- Author
-
Yang, Weikai, Ye, Xi, Zhang, Xingxing, Xiao, Lanxi, Xia, Jiazhi, Wang, Zhongyuan, Zhu, Jun, Pfister, Hanspeter, and Liu, Shixia
- Subjects
Computer Science - Machine Learning - Abstract
The base learners and labeled samples (shots) in an ensemble few-shot classifier greatly affect the model performance. When the performance is not satisfactory, it is usually difficult to understand the underlying causes and make improvements. To tackle this issue, we propose a visual analysis method, FSLDiagnotor. Given a set of base learners and a collection of samples with a few shots, we consider two problems: 1) finding a subset of base learners that predict the sample collection well; and 2) replacing low-quality shots with more representative ones that adequately represent the sample collection. We formulate both problems as sparse subset selection and develop two selection algorithms to recommend appropriate learners and shots, respectively. A matrix visualization and a scatterplot are combined to explain the recommended learners and shots in context and help users adjust them. Based on the adjustments, the algorithm updates the recommendations for another round of improvement. Two case studies demonstrate that FSLDiagnotor helps build a few-shot classifier efficiently, increasing accuracy by 12% and 21%, respectively., Comment: Accepted in IEEE TVCG
- Published
- 2022
- Full Text
- View/download PDF
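The abstract formulates learner and shot recommendation as sparse subset selection; a greedy stand-in for the learner-selection half is sketched below, picking base learners whose majority vote best fits the labeled samples. The greedy rule and binary-vote setup are simplifying assumptions, not FSLDiagnotor's algorithm.

```python
import numpy as np

def greedy_learner_subset(preds, labels, k=3):
    """preds: (n_learners, n_samples) binary predictions; pick up to k learners."""
    chosen = []
    for _ in range(k):
        best, best_acc = None, -1.0
        for i in range(len(preds)):
            if i in chosen:
                continue
            vote = np.round(preds[chosen + [i]].mean(axis=0))  # majority vote
            acc = float((vote == labels).mean())
            if acc > best_acc:
                best, best_acc = i, acc
        chosen.append(best)
    return chosen, best_acc

rng = np.random.default_rng(0)
subset, acc = greedy_learner_subset(rng.integers(0, 2, (8, 50)),
                                    rng.integers(0, 2, 50))
```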
47. Augmentation-Aware Self-Supervision for Data-Efficient GAN Training
- Author
-
Hou, Liang, Cao, Qi, Yuan, Yige, Zhao, Songtao, Ma, Chongyang, Pan, Siyuan, Wan, Pengfei, Wang, Zhongyuan, Shen, Huawei, and Cheng, Xueqi
- Subjects
Computer Science - Machine Learning ,Computer Science - Artificial Intelligence ,Computer Science - Computer Vision and Pattern Recognition - Abstract
Training generative adversarial networks (GANs) with limited data is challenging because the discriminator is prone to overfitting. Previously proposed differentiable augmentation demonstrates improved data efficiency of training GANs. However, the augmentation implicitly introduces undesired invariance to augmentation for the discriminator since it ignores the change of semantics in the label space caused by data transformation, which may limit the representation learning ability of the discriminator and ultimately affect the generative modeling performance of the generator. To mitigate the negative impact of invariance while inheriting the benefits of data augmentation, we propose a novel augmentation-aware self-supervised discriminator that predicts the augmentation parameter of the augmented data. Particularly, the prediction targets of real data and generated data are required to be distinguished since they are different during training. We further encourage the generator to adversarially learn from the self-supervised discriminator by generating augmentation-predictable real and not fake data. This formulation connects the learning objective of the generator and the arithmetic-harmonic mean divergence under certain assumptions. We compare our method with state-of-the-art (SOTA) methods using the class-conditional BigGAN and unconditional StyleGAN2 architectures on data-limited CIFAR-10, CIFAR-100, FFHQ, LSUN-Cat, and five low-shot datasets. Experimental results demonstrate significant improvements of our method over SOTA methods in training data-efficient GANs., Comment: NeurIPS 2023
- Published
- 2022
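The core mechanism in the abstract above, a discriminator head that predicts the augmentation parameter with distinct targets for real and generated data, can be sketched as follows. Rotation by multiples of 90 degrees stands in for the augmentation, and the tiny backbone is purely illustrative.

```python
import torch
import torch.nn as nn

class AugAwareD(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(nn.Flatten(),
                                      nn.Linear(3 * 32 * 32, 128), nn.ReLU())
        self.adv_head = nn.Linear(128, 1)  # usual real/fake score
        self.aug_head = nn.Linear(128, 8)  # 4 rotations x {real, generated}

def aug_prediction_loss(d, imgs, rot_cls, is_real):
    rotated = torch.rot90(imgs, k=rot_cls, dims=(2, 3))  # apply augmentation
    target = rot_cls + (0 if is_real else 4)             # disjoint label spaces
    logits = d.aug_head(d.backbone(rotated))
    t = torch.full((imgs.size(0),), target, dtype=torch.long)
    return nn.functional.cross_entropy(logits, t)

d = AugAwareD()
loss = aug_prediction_loss(d, torch.randn(4, 3, 32, 32), rot_cls=1, is_real=True)
```

The generator's side of the objective would then reward samples whose augmentations the discriminator classifies into the 'real' target group.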
48. ITTR: Unpaired Image-to-Image Translation with Transformers
- Author
-
Zheng, Wanfeng, Li, Qiang, Zhang, Guoxin, Wan, Pengfei, and Wang, Zhongyuan
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Unpaired image-to-image translation is the task of translating an image from a source domain to a target domain without paired training data. By utilizing CNNs to extract local semantics, various techniques have been developed to improve translation performance. However, CNN-based generators lack the ability to capture the long-range dependencies needed to exploit global semantics. Recently, Vision Transformers have been widely investigated for recognition tasks. Though appealing, it is inappropriate to simply transfer a recognition-based vision transformer to image-to-image translation due to the generation difficulty and the computation limitation. In this paper, we propose an effective and efficient architecture for unpaired Image-to-Image Translation with Transformers (ITTR). It has two main designs: 1) a hybrid perception block (HPB) for token mixing from different receptive fields to utilize global semantics; 2) dual pruned self-attention (DPSA) to sharply reduce the computational complexity. Our ITTR outperforms state-of-the-art methods for unpaired image-to-image translation on six benchmark datasets., Comment: 18 pages, 7 figures, 5 tables
- Published
- 2022
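A sketch of pruned self-attention in the spirit of the DPSA the ITTR abstract describes: each query attends only to its top-k scoring keys, so the softmax and value aggregation run over k rather than all N tokens. The dot-product scoring and top-k rule here are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def pruned_attention(q, k, v, keep=64):
    """q: (B, Nq, D); k, v: (B, Nk, D). Attend to top-`keep` keys per query."""
    scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5  # (B, Nq, Nk)
    keep = min(keep, scores.size(-1))
    top, idx = scores.topk(keep, dim=-1)                  # prune low-scoring keys
    attn = F.softmax(top, dim=-1)                         # softmax over survivors
    v_exp = v.unsqueeze(1).expand(-1, q.size(1), -1, -1)  # (B, Nq, Nk, D)
    v_sel = v_exp.gather(2, idx.unsqueeze(-1).expand(-1, -1, -1, v.size(-1)))
    return (attn.unsqueeze(-1) * v_sel).sum(dim=2)        # (B, Nq, D)

out = pruned_attention(torch.randn(2, 256, 64),
                       torch.randn(2, 256, 64),
                       torch.randn(2, 256, 64))
```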
49. Domain Generalization via Shuffled Style Assembly for Face Anti-Spoofing
- Author
-
Wang, Zhuo, Wang, Zezheng, Yu, Zitong, Deng, Weihong, Li, Jiahong, Gao, Tingting, and Wang, Zhongyuan
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
With diverse presentation attacks emerging continually, generalizable face anti-spoofing (FAS) has drawn growing attention. Most existing methods implement domain generalization (DG) on complete representations. However, different image statistics may have unique properties for FAS tasks. In this work, we separate the complete representation into content and style components. A novel Shuffled Style Assembly Network (SSAN) is proposed to extract and reassemble different content and style features into a stylized feature space. Then, to obtain a generalized representation, a contrastive learning strategy is developed to emphasize liveness-related style information while suppressing domain-specific information. Finally, the representations of the correct assemblies are used to distinguish between living and spoofing faces at inference time. Moreover, despite decent performance, there still exists a gap between academia and industry due to differences in data quantity and distribution. Thus, a new large-scale benchmark for FAS is built to further evaluate the performance of algorithms in realistic settings. Both qualitative and quantitative results on existing and proposed benchmarks demonstrate the effectiveness of our methods. The codes will be available at https://github.com/wangzhuo2019/SSAN., Comment: Accepted by CVPR2022
- Published
- 2022
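The shuffle-and-reassemble step in the SSAN abstract can be illustrated with AdaIN-style feature statistics: treat per-channel mean/std as "style", the normalized residual as "content", and recombine each sample's content with another sample's style. Using instance-norm statistics as the style feature is an assumption for illustration.

```python
import torch

def shuffle_style(feats):
    """feats: (B, C, H, W). Reassemble each sample's content with a
    randomly chosen other sample's style statistics."""
    mu = feats.mean(dim=(2, 3), keepdim=True)
    std = feats.std(dim=(2, 3), keepdim=True) + 1e-5
    content = (feats - mu) / std           # style-normalized content
    perm = torch.randperm(feats.size(0))   # shuffle styles within the batch
    return content * std[perm] + mu[perm]  # stylized reassembly

stylized = shuffle_style(torch.randn(8, 64, 16, 16))
```

A contrastive loss, as the abstract describes, would then emphasize liveness-related style information while ignoring the domain-specific statistics that were shuffled away.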
50. Text Smoothing: Enhance Various Data Augmentation Methods on Text Classification Tasks
- Author
-
Wu, Xing, Gao, Chaochen, Lin, Meng, Zang, Liangjun, Wang, Zhongyuan, and Hu, Songlin
- Subjects
Computer Science - Computation and Language - Abstract
Before entering the neural network, a token is generally converted to its one-hot representation, a discrete distribution over the vocabulary. A smoothed representation, the probability distribution over candidate tokens obtained from a pre-trained masked language model, can be seen as a more informative substitute for the one-hot representation. We propose an efficient data augmentation method, termed text smoothing, that converts a sentence from its one-hot representation to a controllable smoothed representation. We evaluate text smoothing on different benchmarks in a low-resource regime. Experimental results show that text smoothing outperforms various mainstream data augmentation methods by a substantial margin. Moreover, text smoothing can be combined with those methods to achieve even better performance., Comment: ACL 2022 Main Conference Accepted
- Published
- 2022
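Text smoothing as described above takes only a few lines with an off-the-shelf masked LM: interpolate each token's one-hot vector with the MLM's predicted distribution, then use the mixture as a soft embedding lookup. The checkpoint and mixing weight lam below are arbitrary choices for illustration, not the paper's settings.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased").eval()

ids = tok("a great movie", return_tensors="pt").input_ids
with torch.no_grad():
    probs = mlm(ids).logits.softmax(-1)       # (1, T, vocab) smoothed dist.

one_hot = torch.nn.functional.one_hot(ids, probs.size(-1)).float()
lam = 0.5                                     # controls the smoothing strength
smoothed = lam * one_hot + (1 - lam) * probs  # controllable smoothed repr.

# Downstream models can consume this as a soft embedding lookup:
soft_embeddings = smoothed @ mlm.get_input_embeddings().weight
```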