Author: "Yang, Jian" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Yang, Jian"' showing total 46,478 results

Start Over Author "Yang, Jian"

46,478 results on '"Yang, Jian"'

1. Token Merging for Training-Free Semantic Binding in Text-to-Image Synthesis

Author: Hu, Taihang, Li, Linxuan, van de Weijer, Joost, Gao, Hongcheng, Khan, Fahad Shahbaz, Yang, Jian, Cheng, Ming-Ming, Wang, Kai, and Wang, Yaxing
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: Although text-to-image (T2I) models exhibit remarkable generation capabilities, they frequently fail to accurately bind semantically related objects or attributes in the input prompts; a challenge termed semantic binding. Previous approaches either involve intensive fine-tuning of the entire T2I model or require users or large language models to specify generation layouts, adding complexity. In this paper, we define semantic binding as the task of associating a given object with its attribute, termed attribute binding, or linking it to other related sub-objects, referred to as object binding. We introduce a novel method called Token Merging (ToMe), which enhances semantic binding by aggregating relevant tokens into a single composite token. This ensures that the object, its attributes and sub-objects all share the same cross-attention map. Additionally, to address potential confusion among main objects with complex textual prompts, we propose end token substitution as a complementary strategy. To further refine our approach in the initial stages of T2I generation, where layouts are determined, we incorporate two auxiliary losses, an entropy loss and a semantic binding loss, to iteratively update the composite token to improve the generation integrity. We conducted extensive experiments to validate the effectiveness of ToMe, comparing it against various existing methods on the T2I-CompBench and our proposed GPT-4o object binding benchmark. Our method is particularly effective in complex scenarios that involve multiple objects and attributes, which previous methods often fail to address. The code will be publicly available at \url{https://github.com/hutaihang/ToMe}., Comment: Accepted by Neurips2024
Published: 2024

2. United Domain Cognition Network for Salient Object Detection in Optical Remote Sensing Images

Author: Sun, Yanguang, Yang, Jian, and Luo, Lei
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Recently, deep learning-based salient object detection (SOD) in optical remote sensing images (ORSIs) have achieved significant breakthroughs. We observe that existing ORSIs-SOD methods consistently center around optimizing pixel features in the spatial domain, progressively distinguishing between backgrounds and objects. However, pixel information represents local attributes, which are often correlated with their surrounding context. Even with strategies expanding the local region, spatial features remain biased towards local characteristics, lacking the ability of global perception. To address this problem, we introduce the Fourier transform that generate global frequency features and achieve an image-size receptive field. To be specific, we propose a novel United Domain Cognition Network (UDCNet) to jointly explore the global-local information in the frequency and spatial domains. Technically, we first design a frequency-spatial domain transformer block that mutually amalgamates the complementary local spatial and global frequency features to strength the capability of initial input features. Furthermore, a dense semantic excavation module is constructed to capture higher-level semantic for guiding the positioning of remote sensing objects. Finally, we devise a dual-branch joint optimization decoder that applies the saliency and edge branches to generate high-quality representations for predicting salient objects. Experimental results demonstrate the superiority of the proposed UDCNet method over 24 state-of-the-art models, through extensive quantitative and qualitative comparisons in three widely-used ORSIs-SOD datasets. The source code is available at: \href{https://github.com/CSYSI/UDCNet}{\color{blue} https://github.com/CSYSI/UDCNet}., Comment: Accepted at TGRS 2024
Published: 2024

3. Region-Aware Text-to-Image Generation via Hard Binding and Soft Refinement

Author: Chen, Zhennan, Li, Yajie, Wang, Haofan, Chen, Zhibo, Jiang, Zhengkai, Li, Jun, Wang, Qian, Yang, Jian, and Tai, Ying
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: In this paper, we present RAG, a Regional-Aware text-to-image Generation method conditioned on regional descriptions for precise layout composition. Regional prompting, or compositional generation, which enables fine-grained spatial control, has gained increasing attention for its practicality in real-world applications. However, previous methods either introduce additional trainable modules, thus only applicable to specific models, or manipulate on score maps within cross-attention layers using attention masks, resulting in limited control strength when the number of regions increases. To handle these limitations, we decouple the multi-region generation into two sub-tasks, the construction of individual region (Regional Hard Binding) that ensures the regional prompt is properly executed, and the overall detail refinement (Regional Soft Refinement) over regions that dismiss the visual boundaries and enhance adjacent interactions. Furthermore, RAG novelly makes repainting feasible, where users can modify specific unsatisfied regions in the last generation while keeping all other regions unchanged, without relying on additional inpainting models. Our approach is tuning-free and applicable to other frameworks as an enhancement to the prompt following property. Quantitative and qualitative experiments demonstrate that RAG achieves superior performance over attribute binding and object relationship than previous tuning-free methods.
Published: 2024

4. Compressive Spectrum Sensing with 1-bit ADCs

Author: Yang, Jian, Song, Zihang, Zhang, Han, and Gao, Yue
Subjects: Electrical Engineering and Systems Science - Signal Processing
Abstract: Efficient wideband spectrum sensing (WSS) is essential for managing spectrum scarcity in wireless communications. However, existing compressed sensing (CS)-based WSS methods require high sampling rates and power consumption, particularly with high-precision analog-to-digital converters (ADCs). Although 1-bit CS with low-precision ADCs can mitigate these demands, most approaches still depend on multi-user cooperation and prior sparsity information, which are often unavailable in WSS scenarios. This paper introduces a non-cooperative WSS method using multicoset sampling with 1-bit ADCs to achieve sub-Nyquist sampling without requiring sparsity knowledge. We analyze the impact of 1-bit quantization on multiband signals, then apply eigenvalue decomposition to isolate the signal subspace from noise, enabling spectrum support estimation without signal reconstruction. This approach provides a power-efficient solution for WSS that eliminates the need for cooperation and prior information.
Published: 2024

5. Domain Generalization for Cross-Receiver Radio Frequency Fingerprint Identification

Author: Zhang, Ying, Li, Qiang, Liu, Hongli, Yang, Liu, and Yang, Jian
Subjects: Electrical Engineering and Systems Science - Signal Processing
Abstract: Radio Frequency Fingerprint Identification (RFFI) technology uniquely identifies emitters by analyzing unique distortions in the transmitted signal caused by non-ideal hardware. Recently, RFFI based on deep learning methods has gained popularity and is seen as a promising way to address the device authentication problem for Internet of Things (IoT) systems. However, in cross-receiver scenarios, where the RFFI model is trained over RF signals from some receivers but deployed at a new receiver, the alteration of receivers' characteristics would lead to data distribution shift and cause significant performance degradation at the new receiver. To address this problem, we first perform a theoretical analysis of the cross-receiver generalization error bound and propose a sufficient condition, named Separable Condition (SC), to minimize the classification error probability on the new receiver. Guided by the SC, a Receiver-Independent Emitter Identification (RIEI)model is devised to decouple the received signals into emitter-related features and receiver-related features and only the emitter-related features are used for identification. Furthermore, by leveraging federated learning, we also develop a FedRIEI model to eliminate the need for centralized collection of raw data from multiple receivers. Experiments on two real-world datasets demonstrate the superiority of our proposed methods over some baseline methods., Comment: Accepted by IEEE Internet of Things Journal
Published: 2024

6. MdEval: Massively Multilingual Code Debugging

Author: Liu, Shukai, Chai, Linzheng, Yang, Jian, Shi, Jiajun, Zhu, He, Wang, Liran, Jin, Ke, Zhang, Wei, Zhu, Hualei, Guo, Shuyue, Sun, Tao, Liu, Jiaheng, Duan, Yunlong, Hao, Yu, Yang, Liqun, Niu, Guanglin, Zhang, Ge, and Li, Zhoujun
Subjects: Computer Science - Computation and Language
Abstract: Code large language models (LLMs) have made significant progress in code debugging by directly generating the correct code based on the buggy code snippet. Programming benchmarks, typically consisting of buggy code snippet and their associated test cases, are used to assess the debugging capabilities of LLMs. However, many existing benchmarks primarily focus on Python and are often limited in terms of language diversity (e.g., DebugBench and DebugEval). To advance the field of multilingual debugging with LLMs, we propose the first massively multilingual debugging benchmark, which includes 3.6K test samples of 18 programming languages and covers the automated program repair (APR) task, the code review (CR) task, and the bug identification (BI) task. Further, we introduce the debugging instruction corpora MDEVAL-INSTRUCT by injecting bugs into the correct multilingual queries and solutions (xDebugGen). Further, a multilingual debugger xDebugCoder trained on MDEVAL-INSTRUCT as a strong baseline specifically to handle the bugs of a wide range of programming languages (e.g. "Missing Mut" in language Rust and "Misused Macro Definition" in language C). Our extensive experiments on MDEVAL reveal a notable performance gap between open-source models and closed-source LLMs (e.g., GPT and Claude series), highlighting huge room for improvement in multilingual code debugging scenarios., Comment: 15 pages
Published: 2024

7. Efficient Non-Exemplar Class-Incremental Learning with Retrospective Feature Synthesis

Author: Bai, Liang, Song, Hong, Lin, Yucong, Fu, Tianyu, Xiao, Deqiang, Ai, Danni, Fan, Jingfan, and Yang, Jian
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Despite the outstanding performance in many individual tasks, deep neural networks suffer from catastrophic forgetting when learning from continuous data streams in real-world scenarios. Current Non-Exemplar Class-Incremental Learning (NECIL) methods mitigate forgetting by storing a single prototype per class, which serves to inject previous information when sequentially learning new classes. However, these stored prototypes or their augmented variants often fail to simultaneously capture spatial distribution diversity and precision needed for representing old classes. Moreover, as the model acquires new knowledge, these prototypes gradually become outdated, making them less effective. To overcome these limitations, we propose a more efficient NECIL method that replaces prototypes with synthesized retrospective features for old classes. Specifically, we model each old class's feature space using a multivariate Gaussian distribution and generate deep representations by sampling from high-likelihood regions. Additionally, we introduce a similarity-based feature compensation mechanism that integrates generated old class features with similar new class features to synthesize robust retrospective representations. These retrospective features are then incorporated into our incremental learning framework to preserve the decision boundaries of previous classes while learning new ones. Extensive experiments on CIFAR-100, TinyImageNet, and ImageNet-Subset demonstrate that our method significantly improves the efficiency of non-exemplar class-incremental learning and achieves state-of-the-art performance., Comment: 13 pages, 9 figures
Published: 2024

8. Novel Object Synthesis via Adaptive Text-Image Harmony

Author: Xiong, Zeren, Zhang, Zedong, Chen, Zikun, Chen, Shuo, Li, Xiang, Sun, Gan, Yang, Jian, and Li, Jun
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: In this paper, we study an object synthesis task that combines an object text with an object image to create a new object image. However, most diffusion models struggle with this task, \textit{i.e.}, often generating an object that predominantly reflects either the text or the image due to an imbalance between their inputs. To address this issue, we propose a simple yet effective method called Adaptive Text-Image Harmony (ATIH) to generate novel and surprising objects. First, we introduce a scale factor and an injection step to balance text and image features in cross-attention and to preserve image information in self-attention during the text-image inversion diffusion process, respectively. Second, to better integrate object text and image, we design a balanced loss function with a noise parameter, ensuring both optimal editability and fidelity of the object image. Third, to adaptively adjust these parameters, we present a novel similarity score function that not only maximizes the similarities between the generated object image and the input text/image but also balances these similarities to harmonize text and image integration. Extensive experiments demonstrate the effectiveness of our approach, showcasing remarkable object creations such as colobus-glass jar. Project page: https://xzr52.github.io/ATIH/., Comment: NeurIPS2024
Published: 2024

9. Grid4D: 4D Decomposed Hash Encoding for High-fidelity Dynamic Gaussian Splatting

Author: Xu, Jiawei, Fan, Zexin, Yang, Jian, and Xie, Jin
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Recently, Gaussian splatting has received more and more attention in the field of static scene rendering. Due to the low computational overhead and inherent flexibility of explicit representations, plane-based explicit methods are popular ways to predict deformations for Gaussian-based dynamic scene rendering models. However, plane-based methods rely on the inappropriate low-rank assumption and excessively decompose the space-time 4D encoding, resulting in overmuch feature overlap and unsatisfactory rendering quality. To tackle these problems, we propose Grid4D, a dynamic scene rendering model based on Gaussian splatting and employing a novel explicit encoding method for the 4D input through the hash encoding. Different from plane-based explicit representations, we decompose the 4D encoding into one spatial and three temporal 3D hash encodings without the low-rank assumption. Additionally, we design a novel attention module that generates the attention scores in a directional range to aggregate the spatial and temporal features. The directional attention enables Grid4D to more accurately fit the diverse deformations across distinct scene components based on the spatial encoded features. Moreover, to mitigate the inherent lack of smoothness in explicit representation methods, we introduce a smooth regularization term that keeps our model from the chaos of deformation prediction. Our experiments demonstrate that Grid4D significantly outperforms the state-of-the-art models in visual quality and rendering speed., Comment: Accepted by NeurIPS 2024
Published: 2024

10. M2rc-Eval: Massively Multilingual Repository-level Code Completion Evaluation

Author: Liu, Jiaheng, Deng, Ken, Liu, Congnan, Yang, Jian, Liu, Shukai, Zhu, He, Zhao, Peng, Chai, Linzheng, Wu, Yanan, Jin, Ke, Zhang, Ge, Wang, Zekun, Zhang, Guoan, Xiang, Bangyu, Su, Wenbo, and Zheng, Bo
Subjects: Computer Science - Computation and Language, Computer Science - Software Engineering
Abstract: Repository-level code completion has drawn great attention in software engineering, and several benchmark datasets have been introduced. However, existing repository-level code completion benchmarks usually focus on a limited number of languages (<5), which cannot evaluate the general code intelligence abilities across different languages for existing code Large Language Models (LLMs). Besides, the existing benchmarks usually report overall average scores of different languages, where the fine-grained abilities in different completion scenarios are ignored. Therefore, to facilitate the research of code LLMs in multilingual scenarios, we propose a massively multilingual repository-level code completion benchmark covering 18 programming languages (called M2RC-EVAL), and two types of fine-grained annotations (i.e., bucket-level and semantic-level) on different completion scenarios are provided, where we obtain these annotations based on the parsed abstract syntax tree. Moreover, we also curate a massively multilingual instruction corpora M2RC- INSTRUCT dataset to improve the repository-level code completion abilities of existing code LLMs. Comprehensive experimental results demonstrate the effectiveness of our M2RC-EVAL and M2RC-INSTRUCT., Comment: 19 pages
Published: 2024

11. AutoKaggle: A Multi-Agent Framework for Autonomous Data Science Competitions

Author: Li, Ziming, Zang, Qianbo, Ma, David, Guo, Jiawei, Zheng, Tuney, Liu, Minghao, Niu, Xinyao, Wang, Yue, Yang, Jian, Liu, Jiaheng, Zhong, Wanjun, Zhou, Wangchunshu, Huang, Wenhao, and Zhang, Ge
Subjects: Computer Science - Artificial Intelligence, Computer Science - Computation and Language
Abstract: Data science tasks involving tabular data present complex challenges that require sophisticated problem-solving approaches. We propose AutoKaggle, a powerful and user-centric framework that assists data scientists in completing daily data pipelines through a collaborative multi-agent system. AutoKaggle implements an iterative development process that combines code execution, debugging, and comprehensive unit testing to ensure code correctness and logic consistency. The framework offers highly customizable workflows, allowing users to intervene at each phase, thus integrating automated intelligence with human expertise. Our universal data science toolkit, comprising validated functions for data cleaning, feature engineering, and modeling, forms the foundation of this solution, enhancing productivity by streamlining common tasks. We selected 8 Kaggle competitions to simulate data processing workflows in real-world application scenarios. Evaluation results demonstrate that AutoKaggle achieves a validation submission rate of 0.85 and a comprehensive score of 0.82 in typical data science pipelines, fully proving its effectiveness and practicality in handling complex data science tasks., Comment: 44 pages, 10 figures
Published: 2024

12. FastPCI: Motion-Structure Guided Fast Point Cloud Frame Interpolation

Author: Zhang, Tianyu, Qian, Guocheng, Xie, Jin, and Yang, Jian
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Point cloud frame interpolation is a challenging task that involves accurate scene flow estimation across frames and maintaining the geometry structure. Prevailing techniques often rely on pre-trained motion estimators or intensive testing-time optimization, resulting in compromised interpolation accuracy or prolonged inference. This work presents FastPCI that introduces Pyramid Convolution-Transformer architecture for point cloud frame interpolation. Our hybrid Convolution-Transformer improves the local and long-range feature learning, while the pyramid network offers multilevel features and reduces the computation. In addition, FastPCI proposes a unique Dual-Direction Motion-Structure block for more accurate scene flow estimation. Our design is motivated by two facts: (1) accurate scene flow preserves 3D structure, and (2) point cloud at the previous timestep should be reconstructable using reverse motion from future timestep. Extensive experiments show that FastPCI significantly outperforms the state-of-the-art PointINet and NeuralPCI with notable gains (e.g. 26.6% and 18.3% reduction in Chamfer Distance in KITTI), while being more than 10x and 600x faster, respectively. Code is available at https://github.com/genuszty/FastPCI, Comment: To appear in ECCV 2024
Published: 2024

13. Aligning CodeLLMs with Direct Preference Optimization

Author: Miao, Yibo, Gao, Bofei, Quan, Shanghaoran, Lin, Junyang, Zan, Daoguang, Liu, Jiaheng, Yang, Jian, Liu, Tianyu, and Deng, Zhijie
Subjects: Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: The last year has witnessed the rapid progress of large language models (LLMs) across diverse domains. Among them, CodeLLMs have garnered particular attention because they can not only assist in completing various programming tasks but also represent the decision-making and logical reasoning capabilities of LLMs. However, current CodeLLMs mainly focus on pre-training and supervised fine-tuning scenarios, leaving the alignment stage, which is important for post-training LLMs, under-explored. This work first identifies that the commonly used PPO algorithm may be suboptimal for the alignment of CodeLLM because the involved reward rules are routinely coarse-grained and potentially flawed. We then advocate addressing this using the DPO algorithm. Based on only preference data pairs, DPO can render the model rank data automatically, giving rise to a fine-grained rewarding pattern more robust than human intervention. We also contribute a pipeline for collecting preference pairs for DPO on CodeLLMs. Studies show that our method significantly improves the performance of existing CodeLLMs on benchmarks such as MBPP and HumanEval.
Published: 2024

14. DCDepth: Progressive Monocular Depth Estimation in Discrete Cosine Domain

Author: Wang, Kun, Yan, Zhiqiang, Fan, Junkai, Zhu, Wanlu, Li, Xiang, Li, Jun, and Yang, Jian
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: In this paper, we introduce DCDepth, a novel framework for the long-standing monocular depth estimation task. Moving beyond conventional pixel-wise depth estimation in the spatial domain, our approach estimates the frequency coefficients of depth patches after transforming them into the discrete cosine domain. This unique formulation allows for the modeling of local depth correlations within each patch. Crucially, the frequency transformation segregates the depth information into various frequency components, with low-frequency components encapsulating the core scene structure and high-frequency components detailing the finer aspects. This decomposition forms the basis of our progressive strategy, which begins with the prediction of low-frequency components to establish a global scene context, followed by successive refinement of local details through the prediction of higher-frequency components. We conduct comprehensive experiments on NYU-Depth-V2, TOFDC, and KITTI datasets, and demonstrate the state-of-the-art performance of DCDepth. Code is available at https://github.com/w2kun/DCDepth., Comment: Accepted by NeurIPS-2024
Published: 2024

15. ERDDCI: Exact Reversible Diffusion via Dual-Chain Inversion for High-Quality Image Editing

Author: Dai, Jimin, Zhang, Yingzhen, Chen, Shuo, Yang, Jian, and Luo, Lei
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Diffusion models (DMs) have been successfully applied to real image editing. These models typically invert images into latent noise vectors used to reconstruct the original images (known as inversion), and then edit them during the inference process. However, recent popular DMs often rely on the assumption of local linearization, where the noise injected during the inversion process is expected to approximate the noise removed during the inference process. While DM efficiently generates images under this assumption, it can also accumulate errors during the diffusion process due to the assumption, ultimately negatively impacting the quality of real image reconstruction and editing. To address this issue, we propose a novel method, referred to as ERDDCI (Exact Reversible Diffusion via Dual-Chain Inversion). ERDDCI uses the new Dual-Chain Inversion (DCI) for joint inference to derive an exact reversible diffusion process. By using DCI, our method effectively avoids the cumbersome optimization process in existing inversion approaches and achieves high-quality image editing. Additionally, to accommodate image operations under high guidance scales, we introduce a dynamic control strategy that enables more refined image reconstruction and editing. Our experiments demonstrate that ERDDCI significantly outperforms state-of-the-art methods in a 50-step diffusion process. It achieves rapid and precise image reconstruction with an SSIM of 0.999 and an LPIPS of 0.001, and also delivers competitive results in image editing.
Published: 2024

16. A Comparative Study on Reasoning Patterns of OpenAI's o1 Model

Author: Wu, Siwei, Peng, Zhongyuan, Du, Xinrun, Zheng, Tuney, Liu, Minghao, Wu, Jialong, Ma, Jiachen, Li, Yizhi, Yang, Jian, Zhou, Wangchunshu, Lin, Qunshu, Zhao, Junbo, Zhang, Zhaoxiang, Huang, Wenhao, Zhang, Ge, Lin, Chenghua, and Liu, J. H.
Subjects: Computer Science - Computation and Language
Abstract: Enabling Large Language Models (LLMs) to handle a wider range of complex tasks (e.g., coding, math) has drawn great attention from many researchers. As LLMs continue to evolve, merely increasing the number of model parameters yields diminishing performance improvements and heavy computational costs. Recently, OpenAI's o1 model has shown that inference strategies (i.e., Test-time Compute methods) can also significantly enhance the reasoning capabilities of LLMs. However, the mechanisms behind these methods are still unexplored. In our work, to investigate the reasoning patterns of o1, we compare o1 with existing Test-time Compute methods (BoN, Step-wise BoN, Agent Workflow, and Self-Refine) by using OpenAI's GPT-4o as a backbone on general reasoning benchmarks in three domains (i.e., math, coding, commonsense reasoning). Specifically, first, our experiments show that the o1 model has achieved the best performance on most datasets. Second, as for the methods of searching diverse responses (e.g., BoN), we find the reward models' capability and the search space both limit the upper boundary of these methods. Third, as for the methods that break the problem into many sub-problems, the Agent Workflow has achieved better performance than Step-wise BoN due to the domain-specific system prompt for planning better reasoning processes. Fourth, it is worth mentioning that we have summarized six reasoning patterns of o1, and provided a detailed analysis on several reasoning benchmarks.
Published: 2024

17. Degradation Oriented and Regularized Network for Blind Depth Super-Resolution

Author: Wang, Zhengxue, Yan, Zhiqiang, Pan, Jinshan, Gao, Guangwei, Zhang, Kai, and Yang, Jian
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Recent RGB-guided depth super-resolution methods have achieved impressive performance under the assumption of fixed and known degradation (e.g., bicubic downsampling). However, in real-world scenarios, captured depth data often suffer from unconventional and unknown degradation due to sensor limitations and complex imaging environments (e.g., low reflective surfaces, varying illumination). Consequently, the performance of these methods significantly declines when real-world degradation deviate from their assumptions. In this paper, we propose the Degradation Oriented and Regularized Network (DORNet), a novel framework designed to adaptively address unknown degradation in real-world scenes through implicit degradation representations. Our approach begins with the development of a self-supervised degradation learning strategy, which models the degradation representations of low-resolution depth data using routing selection-based degradation regularization. To facilitate effective RGB-D fusion, we further introduce a degradation-oriented feature transformation module that selectively propagates RGB content into the depth data based on the learned degradation priors. Extensive experimental results on both real and synthetic datasets demonstrate the superiority of our DORNet in handling unknown degradation, outperforming existing methods. The code is available at https://github.com/yanzq95/DORNet., Comment: 10 pages
Published: 2024

18. MMAR: Towards Lossless Multi-Modal Auto-Regressive Probabilistic Modeling

Author: Yang, Jian, Yin, Dacheng, Zhou, Yizhou, Rao, Fengyun, Zhai, Wei, Cao, Yang, and Zha, Zheng-Jun
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Recent advancements in multi-modal large language models have propelled the development of joint probabilistic models capable of both image understanding and generation. However, we have identified that recent methods inevitably suffer from loss of image information during understanding task, due to either image discretization or diffusion denoising steps. To address this issue, we propose a novel Multi-Modal Auto-Regressive (MMAR) probabilistic modeling framework. Unlike discretization line of method, MMAR takes in continuous-valued image tokens to avoid information loss. Differing from diffusion-based approaches, we disentangle the diffusion process from auto-regressive backbone model by employing a light-weight diffusion head on top each auto-regressed image patch embedding. In this way, when the model transits from image generation to understanding through text generation, the backbone model's hidden representation of the image is not limited to the last denoising step. To successfully train our method, we also propose a theoretically proven technique that addresses the numerical stability issue and a training strategy that balances the generation and understanding task goals. Through extensive evaluations on 18 image understanding benchmarks, MMAR demonstrates much more superior performance than other joint multi-modal models, matching the method that employs pretrained CLIP vision encoder, meanwhile being able to generate high quality images at the same time. We also showed that our method is scalable with larger data and model size.
Published: 2024

19. HybridBooth: Hybrid Prompt Inversion for Efficient Subject-Driven Generation

Author: Guan, Shanyan, Ge, Yanhao, Tai, Ying, Yang, Jian, Li, Wei, and You, Mingyu
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Recent advancements in text-to-image diffusion models have shown remarkable creative capabilities with textual prompts, but generating personalized instances based on specific subjects, known as subject-driven generation, remains challenging. To tackle this issue, we present a new hybrid framework called HybridBooth, which merges the benefits of optimization-based and direct-regression methods. HybridBooth operates in two stages: the Word Embedding Probe, which generates a robust initial word embedding using a fine-tuned encoder, and the Word Embedding Refinement, which further adapts the encoder to specific subject images by optimizing key parameters. This approach allows for effective and fast inversion of visual concepts into textual embedding, even from a single image, while maintaining the model's generalization capabilities., Comment: ECCV 2024, the project page: https://sites.google.com/view/hybridbooth
Published: 2024

20. In-Context Code-Text Learning for Bimodal Software Engineering

Author: Tang, Xunzhu, Wang, Liran, Liu, Yonghui, Chai, Linzheng, Yang, Jian, Li, Zhoujun, Tian, Haoye, Klein, Jacques, and Bissyande, Tegawende F.
Subjects: Computer Science - Software Engineering, Computer Science - Artificial Intelligence
Abstract: Bimodal software analysis initially appeared to be within reach with the advent of large language models. Unfortunately, the complex interplay of natural language text and code in software engineering, presents unique challenges that prevent pretrained models to generalize to a variety of tasks. We postulate that in-context learning for the code-text bimodality is a promising avenue. This paper thus introduces a comprehensive study of in-context code-text learning, focusing on leveraging pretrained CodeLLAMA models. We consider a diverse dataset encompassing 23 software engineering tasks, which we transform in an in-context learning format. To effectively extract informative features, we propose a configurable prompt template. Our proposed pipeline, InCTRL, then unifies prompt learning across various software engineering tasks. Extensive evaluation on the study datasets demonstrates the superiority of INCTRL-models in few-shot performance, surpassing state-of-the-art models including the support model, CodeLLAMA. Typically, we observe that applied to the CodeLLAMA model, INCTRL brings improvements in terms of precision (at least about 12\%) and recall (up to 93.88\%) on various tasks. For example, on the task of program repair, INCTRL improves the BLEU score of CodeLLAMA by 85 points, while for clone detection, INCTRL achieves an improvement of 69 percentage points. Moreover, INCTRL-models offer state-of-the-art performance when using retrieval-augmented generation on individual downstream tasks. Finally, we qualitatively analyze the benefits of INCTRL over CodeLLAMA and open-source all models for broader impact. We make our code and dataset publicly available at: \begin{center} {\url{https://anonymous.4open.science/r/inctrl-B65B}} \end{center}
Published: 2024

21. KOR-Bench: Benchmarking Language Models on Knowledge-Orthogonal Reasoning Tasks

Author: Ma, Kaijing, Du, Xinrun, Wang, Yunran, Zhang, Haoran, Wen, Zhoufutu, Qu, Xingwei, Yang, Jian, Liu, Jiaheng, Liu, Minghao, Yue, Xiang, Huang, Wenhao, and Zhang, Ge
Subjects: Computer Science - Databases
Abstract: In this paper, we introduce Knowledge-Orthogonal Reasoning (KOR), which minimizes the impact of domain-specific knowledge for a more accurate evaluation of models' reasoning abilities in out-of-distribution scenarios. Based on this concept, we propose the Knowledge-Orthogonal Reasoning Benchmark (KOR-Bench), encompassing five task categories: Operation, Logic, Cipher, Puzzle, and Counterfactual. KOR-Bench emphasizes the effectiveness of models in applying new rule descriptions to solve novel rule-driven questions. O1-Preview and O1-Mini achieve accuracies of 72.88% and 70.16%, significantly outperforming Claude-3.5-Sonnet and GPT-4o, which score 58.96% and 58.00%, revealing considerable performance gaps and highlighting KOR-Bench's effectiveness. We conduct thorough analyses to identify bottlenecks in the Cipher task using Stepwise Prompting, discovering that two rounds of Self-Correction yield optimal results. Complex Task Processing evaluates model performance across three integrated tasks, while we also explore the impact of Tricks on the Puzzle task and visualize rule-focused attention to enhance our understanding of model behavior. KOR-Bench aims to enhance reasoning evaluation and support further research in this field.
Published: 2024

22. LaDTalk: Latent Denoising for Synthesizing Talking Head Videos with High Frequency Details

Author: Yang, Jian, Wang, Xukun, Wang, Wentao, Li, Guoming, Fang, Qihang, Yuan, Ruihong, Wang, Tianyang, and Fan, Jason Zhaoxin
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Audio-driven talking head generation is a pivotal area within film-making and Virtual Reality. Although existing methods have made significant strides following the end-to-end paradigm, they still encounter challenges in producing videos with high-frequency details due to their limited expressivity in this domain. This limitation has prompted us to explore an effective post-processing approach to synthesize photo-realistic talking head videos. Specifically, we employ a pretrained Wav2Lip model as our foundation model, leveraging its robust audio-lip alignment capabilities. Drawing on the theory of Lipschitz Continuity, we have theoretically established the noise robustness of Vector Quantised Auto Encoders (VQAEs). Our experiments further demonstrate that the high-frequency texture deficiency of the foundation model can be temporally consistently recovered by the Space-Optimised Vector Quantised Auto Encoder (SOVQAE) we introduced, thereby facilitating the creation of realistic talking head videos. We conduct experiments on both the conventional dataset and the High-Frequency TalKing head (HFTK) dataset that we curated. The results indicate that our method, LaDTalk, achieves new state-of-the-art video quality and out-of-domain lip synchronization performance.
Published: 2024

23. GrokLST: Towards High-Resolution Benchmark and Toolkit for Land Surface Temperature Downscaling

Author: Dai, Qun, Yuan, Chunyang, Dai, Yimian, Li, Yuxuan, Li, Xiang, Ni, Kang, Xu, Jianhui, Shu, Xiangbo, and Yang, Jian
Subjects: Computer Science - Computer Vision and Pattern Recognition, Electrical Engineering and Systems Science - Image and Video Processing
Abstract: Land Surface Temperature (LST) is a critical parameter for environmental studies, but obtaining high-resolution LST data remains challenging due to the spatio-temporal trade-off in satellite remote sensing. Guided LST downscaling has emerged as a solution, but current methods often neglect spatial non-stationarity and lack a open-source ecosystem for deep learning methods. To address these limitations, we propose the Modality-Conditional Large Selective Kernel (MoCoLSK) Networks, a novel architecture that dynamically fuses multi-modal data through modality-conditioned projections. MoCoLSK re-engineers our previous LSKNet to achieve a confluence of dynamic receptive field adjustment and multi-modal feature integration, leading to enhanced LST prediction accuracy. Furthermore, we establish the GrokLST project, a comprehensive open-source ecosystem featuring the GrokLST dataset, a high-resolution benchmark, and the GrokLST toolkit, an open-source PyTorch-based toolkit encapsulating MoCoLSK alongside 40+ state-of-the-art approaches. Extensive experimental results validate MoCoLSK's effectiveness in capturing complex dependencies and subtle variations within multispectral data, outperforming existing methods in LST downscaling. Our code, dataset, and toolkit are available at https://github.com/GrokCV/GrokLST.
Published: 2024

24. HazyDet: Open-source Benchmark for Drone-view Object Detection with Depth-cues in Hazy Scenes

Author: Feng, Changfeng, Chen, Zhenyuan, Kou, Renke, Gao, Guangwei, Wang, Chunping, Li, Xiang, Shu, Xiangbo, Dai, Yimian, Fu, Qiang, and Yang, Jian
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Drone-based object detection in adverse weather conditions is crucial for enhancing drones' environmental perception, yet it remains largely unexplored due to the lack of relevant benchmarks. To bridge this gap, we introduce HazyDet, a large-scale dataset tailored for drone-based object detection in hazy scenes. It encompasses 383,000 real-world instances, collected from both naturally hazy environments and normal scenes with synthetically imposed haze effects to simulate adverse weather conditions. By observing the significant variations in object scale and clarity under different depth and haze conditions, we designed a Depth Conditioned Detector (DeCoDet) to incorporate this prior knowledge. DeCoDet features a Multi-scale Depth-aware Detection Head that seamlessly integrates depth perception, with the resulting depth cues harnessed by a dynamic Depth Condition Kernel module. Furthermore, we propose a Scale Invariant Refurbishment Loss to facilitate the learning of robust depth cues from pseudo-labels. Extensive evaluations on the HazyDet dataset demonstrate the flexibility and effectiveness of our method, yielding significant performance improvements. Our dataset and toolkit are available at https://github.com/GrokCV/HazyDet.
Published: 2024

25. RNG: Relightable Neural Gaussians

Author: Fan, Jiahui, Luan, Fujun, Yang, Jian, Hašan, Miloš, and Wang, Beibei
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Graphics
Abstract: 3D Gaussian Splatting (3DGS) has shown its impressive power in novel view synthesis. However, creating relightable 3D assets, especially for objects with ill-defined shapes (e.g., fur), is still a challenging task. For these scenes, the decomposition between the light, geometry, and material is more ambiguous, as neither the surface constraints nor the analytical shading model hold. To address this issue, we propose RNG, a novel representation of relightable neural Gaussians, enabling the relighting of objects with both hard surfaces or fluffy boundaries. We avoid any assumptions in the shading model but maintain feature vectors, which can be further decoded by an MLP into colors, in each Gaussian point. Following prior work, we utilize a point light to reduce the ambiguity and introduce a shadow-aware condition to the network. We additionally propose a depth refinement network to help the shadow computation under the 3DGS framework, leading to better shadow effects under point lights. Furthermore, to avoid the blurriness brought by the alpha-blending in 3DGS, we design a hybrid forward-deferred optimization strategy. As a result, we achieve about $20\times$ faster in training and about $600\times$ faster in rendering than prior work based on neural radiance fields, with $60$ frames per second on an RTX4090.
Published: 2024

26. Cascade Prompt Learning for Vision-Language Model Adaptation

Author: Wu, Ge, Zhang, Xin, Li, Zheng, Chen, Zhaowei, Liang, Jiajun, Yang, Jian, and Li, Xiang
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Prompt learning has surfaced as an effective approach to enhance the performance of Vision-Language Models (VLMs) like CLIP when applied to downstream tasks. However, current learnable prompt tokens are primarily used for the single phase of adapting to tasks (i.e., adapting prompt), easily leading to overfitting risks. In this work, we propose a novel Cascade Prompt Learning CasPL framework to enable prompt learning to serve both generic and specific expertise (i.e., boosting and adapting prompt) simultaneously. Specifically, CasPL is a new learning paradigm comprising two distinct phases of learnable prompts: the first boosting prompt is crafted to extract domain-general knowledge from a senior larger CLIP teacher model by aligning their predicted logits using extensive unlabeled domain images. The second adapting prompt is then cascaded with the frozen first set to fine-tune the downstream tasks, following the approaches employed in prior research. In this manner, CasPL can effectively capture both domain-general and task-specific representations into explicitly different gradual groups of prompts, thus potentially alleviating overfitting issues in the target domain. It's worth noting that CasPL serves as a plug-and-play module that can seamlessly integrate into any existing prompt learning approach. CasPL achieves a significantly better balance between performance and inference speed, which is especially beneficial for deploying smaller VLM models in resource-constrained environments. Compared to the previous state-of-the-art method PromptSRC, CasPL shows an average improvement of 1.85% for base classes, 3.44% for novel classes, and 2.72% for the harmonic mean over 11 image classification datasets. Code is publicly available at: https://github.com/megvii-research/CasPL., Comment: ECCV2024
Published: 2024

27. HelloBench: Evaluating Long Text Generation Capabilities of Large Language Models

Author: Que, Haoran, Duan, Feiyu, He, Liqun, Mou, Yutao, Zhou, Wangchunshu, Liu, Jiaheng, Rong, Wenge, Wang, Zekun Moore, Yang, Jian, Zhang, Ge, Peng, Junran, Zhang, Zhaoxiang, Zhang, Songyang, and Chen, Kai
Subjects: Computer Science - Computation and Language
Abstract: In recent years, Large Language Models (LLMs) have demonstrated remarkable capabilities in various tasks (e.g., long-context understanding), and many benchmarks have been proposed. However, we observe that long text generation capabilities are not well investigated. Therefore, we introduce the Hierarchical Long Text Generation Benchmark (HelloBench), a comprehensive, in-the-wild, and open-ended benchmark to evaluate LLMs' performance in generating long text. Based on Bloom's Taxonomy, HelloBench categorizes long text generation tasks into five subtasks: open-ended QA, summarization, chat, text completion, and heuristic text generation. Besides, we propose Hierarchical Long Text Evaluation (HelloEval), a human-aligned evaluation method that significantly reduces the time and effort required for human evaluation while maintaining a high correlation with human evaluation. We have conducted extensive experiments across around 30 mainstream LLMs and observed that the current LLMs lack long text generation capabilities. Specifically, first, regardless of whether the instructions include explicit or implicit length constraints, we observe that most LLMs cannot generate text that is longer than 4000 words. Second, we observe that while some LLMs can generate longer text, many issues exist (e.g., severe repetition and quality degradation). Third, to demonstrate the effectiveness of HelloEval, we compare HelloEval with traditional metrics (e.g., ROUGE, BLEU, etc.) and LLM-as-a-Judge methods, which show that HelloEval has the highest correlation with human evaluation. We release our code in https://github.com/Quehry/HelloBench.
Published: 2024

28. OmniBench: Towards The Future of Universal Omni-Language Models

Author: Li, Yizhi, Zhang, Ge, Ma, Yinghao, Yuan, Ruibin, Zhu, Kang, Guo, Hangyu, Liang, Yiming, Liu, Jiaheng, Wang, Zekun, Yang, Jian, Wu, Siwei, Qu, Xingwei, Shi, Jinjie, Zhang, Xinyue, Yang, Zhenzhu, Wang, Xiangzhou, Zhang, Zhaoxiang, Liu, Zachary, Benetos, Emmanouil, Huang, Wenhao, and Lin, Chenghua
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Computer Vision and Pattern Recognition
Abstract: Recent advancements in multimodal large language models (MLLMs) have aimed to integrate and interpret data across diverse modalities. However, the capacity of these models to concurrently process and reason about multiple modalities remains inadequately explored, partly due to the lack of comprehensive modality-wise benchmarks. We introduce OmniBench, a novel benchmark designed to rigorously evaluate models' ability to recognize, interpret, and reason across visual, acoustic, and textual inputs simultaneously. We define models capable of such tri-modal processing as omni-language models (OLMs). OmniBench is distinguished by high-quality human annotations, ensuring that accurate responses require integrated understanding and reasoning across all three modalities. Our main findings reveal that: i) most OLMs exhibit critical limitations in instruction-following and reasoning capabilities within tri-modal contexts; and ii) most baselines models perform poorly (below 50\% accuracy) even when provided with alternative textual representations of images or/and audio. These results suggest that the ability to construct a consistent context from text, image, and audio is often overlooked in existing MLLM training paradigms. To address this gap, we curate an instruction tuning dataset of 84.5K training samples, OmniInstruct, for training OLMs to adapt to multimodal contexts. We advocate for future research to focus on developing more robust tri-modal integration techniques and training strategies to enhance OLM performance across diverse modalities. The codes and live leaderboard could be found at https://m-a-p.ai/OmniBench.
Published: 2024

29. Constructions and Applications of Irreducible Representations of Spin-Space Groups

Author: Song, Ziyin, Yang, A. Z., Jiang, Yi, Fang, Zhong, Yang, Jian, Fang, Chen, Weng, Hongming, and Liu, Zheng-Xin
Subjects: Condensed Matter - Materials Science
Abstract: Spin-space groups (SSGs), including the traditional space groups (SGs) and magnetic space groups (MSGs) as subsets, describe the complete symmetries of magnetic materials with weak spin-orbit coupling (SOC). In the present work, we systematically study the irreducible representations (irreps) of SSGs by focusing on the projective irreps of the little co-group $L(k)$ of any momentum point $\pmb k$. We analysis the factor systems of $L(k)$, and then reduce the projective regular representation of $L(k)$ into direct sum of irreps using the Hamiltonian approach. Especially, for collinear SSGs which contain continuous spin rotation operations, we adopt discrete subgroups to effectively capture their characteristics. Furthermore, we apply the representation theory of SSGs to study the band structure of electrons and magnons in magnetic materials. After identifying the SSG symmetry group, we extract relevant irreps and determine the $k\cdot p$ models. As an example, we illustrate how our approach works for the material \ch{Mn3Sn}. Degeneracies facilitated by SSG symmetry are observed, underscoring the effectiveness of application in material analysis. The SSG recognition and representation code is uploaded to GitHub, the information of irreps of all SSGs is also available in the online Database. Our work provides a practical toolkit for exploring the intricate symmetries of magnetic materials and paves the way for future advances in materials science.
Published: 2024

30. Qwen2.5-Coder Technical Report

Author: Hui, Binyuan, Yang, Jian, Cui, Zeyu, Yang, Jiaxi, Liu, Dayiheng, Zhang, Lei, Liu, Tianyu, Zhang, Jiajun, Yu, Bowen, Lu, Keming, Dang, Kai, Fan, Yang, Zhang, Yichang, Yang, An, Men, Rui, Huang, Fei, Zheng, Bo, Miao, Yibo, Quan, Shanghaoran, Feng, Yunlong, Ren, Xingzhang, Ren, Xuancheng, Zhou, Jingren, and Lin, Junyang
Subjects: Computer Science - Computation and Language
Abstract: In this report, we introduce the Qwen2.5-Coder series, a significant upgrade from its predecessor, CodeQwen1.5. This series includes six models: Qwen2.5-Coder-(0.5B/1.5B/3B/7B/14B/32B). As a code-specific model, Qwen2.5-Coder is built upon the Qwen2.5 architecture and continues pretrained on a vast corpus of over 5.5 trillion tokens. Through meticulous data cleaning, scalable synthetic data generation, and balanced data mixing, Qwen2.5-Coder demonstrates impressive code generation capabilities while retaining general and math skills. These models have been evaluated on a wide range of code-related tasks, achieving state-of-the-art (SOTA) performance across more than 10 benchmarks, including code generation, completion, reasoning, and repair, consistently outperforming larger models of the same model size. We believe that the release of the Qwen2.5-Coder series will advance research in code intelligence and, with its permissive licensing, support wider adoption by developers in real-world applications.
Published: 2024

31. SoccerNet 2024 Challenges Results

Author: Cioppa, Anthony, Giancola, Silvio, Somers, Vladimir, Joos, Victor, Magera, Floriane, Held, Jan, Ghasemzadeh, Seyed Abolfazl, Zhou, Xin, Seweryn, Karolina, Kowalczyk, Mateusz, Mróz, Zuzanna, Łukasik, Szymon, Hałoń, Michał, Mkhallati, Hassan, Deliège, Adrien, Hinojosa, Carlos, Sanchez, Karen, Mansourian, Amir M., Miralles, Pierre, Barnich, Olivier, De Vleeschouwer, Christophe, Alahi, Alexandre, Ghanem, Bernard, Van Droogenbroeck, Marc, Gorski, Adam, Clapés, Albert, Boiarov, Andrei, Afanasiev, Anton, Xarles, Artur, Scott, Atom, Lim, ByoungKwon, Yeung, Calvin, Gonzalez, Cristian, Rüfenacht, Dominic, Pacilio, Enzo, Deuser, Fabian, Altawijri, Faisal Sami, Cachón, Francisco, Kim, HanKyul, Wang, Haobo, Choe, Hyeonmin, Kim, Hyunwoo J, Kim, Il-Min, Kang, Jae-Mo, Tursunboev, Jamshid, Yang, Jian, Hong, Jihwan, Lee, Jimin, Zhang, Jing, Lee, Junseok, Zhang, Kexin, Habel, Konrad, Jiao, Licheng, Li, Linyi, Gutiérrez-Pérez, Marc, Ortega, Marcelo, Li, Menglong, Lopatto, Milosz, Kasatkin, Nikita, Nemtsev, Nikolay, Oswald, Norbert, Udin, Oleg, Kononov, Pavel, Geng, Pei, Alotaibi, Saad Ghazai, Kim, Sehyung, Ulasen, Sergei, Escalera, Sergio, Zhang, Shanshan, Yang, Shuyuan, Moon, Sunghwan, Moeslund, Thomas B., Shandyba, Vasyl, Golovkin, Vladimir, Dai, Wei, Chung, WonTaek, Liu, Xinyu, Zhu, Yongqiang, Kim, Youngseo, Li, Yuan, Yang, Yuting, Xiao, Yuxuan, Cheng, Zehua, and Li, Zhihao
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: The SoccerNet 2024 challenges represent the fourth annual video understanding challenges organized by the SoccerNet team. These challenges aim to advance research across multiple themes in football, including broadcast video understanding, field understanding, and player understanding. This year, the challenges encompass four vision-based tasks. (1) Ball Action Spotting, focusing on precisely localizing when and which soccer actions related to the ball occur, (2) Dense Video Captioning, focusing on describing the broadcast with natural language and anchored timestamps, (3) Multi-View Foul Recognition, a novel task focusing on analyzing multiple viewpoints of a potential foul incident to classify whether a foul occurred and assess its severity, (4) Game State Reconstruction, another novel task focusing on reconstructing the game state from broadcast videos onto a 2D top-view map of the field. Detailed information about the tasks, challenges, and leaderboards can be found at https://www.soccer-net.org, with baselines and development kits available at https://github.com/SoccerNet., Comment: 7 pages, 1 figure
Published: 2024

32. IW-Bench: Evaluating Large Multimodal Models for Converting Image-to-Web

Author: Guo, Hongcheng, Zhang, Wei, Chen, Junhao, Gu, Yaonan, Yang, Jian, Du, Junjia, Hui, Binyuan, Liu, Tianyu, Ma, Jianxin, Zhou, Chang, and Li, Zhoujun
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Computer Vision and Pattern Recognition
Abstract: Recently advancements in large multimodal models have led to significant strides in image comprehension capabilities. Despite these advancements, there is a lack of the robust benchmark specifically for assessing the Image-to-Web conversion proficiency of these large models. Primarily, it is essential to ensure the integrity of the web elements generated. These elements comprise visible and invisible categories. Previous evaluation methods (e.g., BLEU) are notably susceptible to significant alterations due to the presence of invisible elements in Web. Furthermore, it is crucial to measure the layout information of web pages, referring to the positional relationships between elements, which is overlooked by previous work. To address challenges, we have curated and aligned a benchmark of images and corresponding web codes (IW-Bench). Specifically, we propose the Element Accuracy, which tests the completeness of the elements by parsing the Document Object Model (DOM) tree. Layout Accuracy is also proposed to analyze the positional relationships of elements by converting DOM tree into a common subsequence. Besides, we design a five-hop multimodal Chain-of-Thought Prompting for better performance, which contains five hop: 1) SoM prompt injection. 2) Inferring Elements. 3) Inferring Layout. 4) Inferring Web code. 5) Reflection. Our benchmark comprises 1200 pairs of images and web codes with varying levels of difficulty. We have conducted extensive experiments on existing large multimodal models, offering insights into their performance and areas for improvement in image-to-web domain.
Published: 2024

33. GLCONet: Learning Multi-source Perception Representation for Camouflaged Object Detection

Author: Sun, Yanguang, Xuan, Hanyu, Yang, Jian, and Luo, Lei
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Recently, biological perception has been a powerful tool for handling the camouflaged object detection (COD) task. However, most existing methods are heavily dependent on the local spatial information of diverse scales from convolutional operations to optimize initial features. A commonly neglected point in these methods is the long-range dependencies between feature pixels from different scale spaces that can help the model build a global structure of the object, inducing a more precise image representation. In this paper, we propose a novel Global-Local Collaborative Optimization Network, called GLCONet. Technically, we first design a collaborative optimization strategy from the perspective of multi-source perception to simultaneously model the local details and global long-range relationships, which can provide features with abundant discriminative information to boost the accuracy in detecting camouflaged objects. Furthermore, we introduce an adjacent reverse decoder that contains cross-layer aggregation and reverse optimization to integrate complementary information from different levels for generating high-quality representations. Extensive experiments demonstrate that the proposed GLCONet method with different backbones can effectively activate potentially significant pixels in an image, outperforming twenty state-of-the-art methods on three public COD datasets. The source code is available at: \https://github.com/CSYSI/GLCONet., Comment: Accepted at TNNLS 2024
Published: 2024

34. wgatools: an ultrafast toolkit for manipulating whole genome alignments

Author: Wei, Wenjie, Gui, Songtao, Yang, Jian, Garrison, Erik, Yan, Jianbing, and Liu, Hai-Jun
Subjects: Quantitative Biology - Genomics
Abstract: Summary: With the rapid development of long-read sequencing technologies, the era of individual complete genomes is approaching. We have developed wgatools, a cross-platform, ultrafast toolkit that supports a range of whole genome alignment (WGA) formats, offering practical tools for conversion, processing, statistical evaluation, and visualization of alignments, thereby facilitating population-level genome analysis and advancing functional and evolutionary genomics. Availability and Implementation: wgatools supports diverse formats and can process, filter, and statistically evaluate alignments, perform alignment-based variant calling, and visualize alignments both locally and genome-wide. Built with Rust for efficiency and safe memory usage, it ensures fast performance and can handle large datasets consisting of hundreds of genomes. wgatools is published as free software under the MIT open-source license, and its source code is freely available at https://github.com/wjwei-handsome/wgatools. Contact: weiwenjie@westlake.edu.cn (W.W.) or liuhaijun@yzwlab.cn (H.-J.L.).
Published: 2024

35. Deep Height Decoupling for Precise Vision-based 3D Occupancy Prediction

Author: Wu, Yuan, Yan, Zhiqiang, Wang, Zhengxue, Li, Xiang, Hui, Le, and Yang, Jian
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: The task of vision-based 3D occupancy prediction aims to reconstruct 3D geometry and estimate its semantic classes from 2D color images, where the 2D-to-3D view transformation is an indispensable step. Most previous methods conduct forward projection, such as BEVPooling and VoxelPooling, both of which map the 2D image features into 3D grids. However, the current grid representing features within a certain height range usually introduces many confusing features that belong to other height ranges. To address this challenge, we present Deep Height Decoupling (DHD), a novel framework that incorporates explicit height prior to filter out the confusing features. Specifically, DHD first predicts height maps via explicit supervision. Based on the height distribution statistics, DHD designs Mask Guided Height Sampling (MGHS) to adaptively decouple the height map into multiple binary masks. MGHS projects the 2D image features into multiple subspaces, where each grid contains features within reasonable height ranges. Finally, a Synergistic Feature Aggregation (SFA) module is deployed to enhance the feature representation through channel and spatial affinities, enabling further occupancy refinement. On the popular Occ3D-nuScenes benchmark, our method achieves state-of-the-art performance even with minimal input frames. Code is available at https://github.com/yanzq95/DHD.
Published: 2024

36. LIME: Less Is More for MLLM Evaluation

Author: Zhu, King, Zang, Qianbo, Jia, Shian, Wu, Siwei, Fang, Feiteng, Li, Yizhi, Gavin, Shawn, Zheng, Tuney, Guo, Jiawei, Li, Bo, Wu, Haoning, Qu, Xingwei, Yang, Jian, Liu, Zachary, Yue, Xiang, Liu, J. H., Lin, Chenghua, Yang, Min, Ni, Shiwen, Huang, Wenhao, and Zhang, Ge
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: Multimodal Large Language Models (MLLMs) are evaluated on various benchmarks, such as image captioning, visual question answering, and reasoning. However, many of these benchmarks include overly simple or uninformative samples, complicating the effective distinction of different MLLMs' performance. Furthermore, evaluating models across numerous benchmarks incurs a significant computational burden. To address these issues, we propose LIME (Less Is More for MLLM Evaluation), a refined and efficient benchmark curated through a semi-automated pipeline. This pipeline filters out uninformative samples and eliminates answer leakage by focusing on tasks that necessitate image-based understanding. Our experiments indicate that LIME reduces the number of samples by 76% and evaluation time by 77%, while also providing a more effective means of distinguishing the capabilities of different models. Notably, we find that traditional automatic metrics, such as CIDEr, are inadequate for assessing MLLMs' captioning performance; excluding the caption task score yields a more accurate reflection of overall model performance. All code and data are available at https://github.com/kangreen0210/LIME.
Published: 2024

37. Towards a Unified View of Preference Learning for Large Language Models: A Survey

Author: Gao, Bofei, Song, Feifan, Miao, Yibo, Cai, Zefan, Yang, Zhe, Chen, Liang, Hu, Helan, Xu, Runxin, Dong, Qingxiu, Zheng, Ce, Quan, Shanghaoran, Xiao, Wen, Zhang, Ge, Zan, Daoguang, Lu, Keming, Yu, Bowen, Liu, Dayiheng, Cui, Zeyu, Yang, Jian, Sha, Lei, Wang, Houfeng, Sui, Zhifang, Wang, Peiyi, Liu, Tianyu, and Chang, Baobao
Subjects: Computer Science - Computation and Language
Abstract: Large Language Models (LLMs) exhibit remarkably powerful capabilities. One of the crucial factors to achieve success is aligning the LLM's output with human preferences. This alignment process often requires only a small amount of data to efficiently enhance the LLM's performance. While effective, research in this area spans multiple domains, and the methods involved are relatively complex to understand. The relationships between different methods have been under-explored, limiting the development of the preference alignment. In light of this, we break down the existing popular alignment strategies into different components and provide a unified framework to study the current alignment strategies, thereby establishing connections among them. In this survey, we decompose all the strategies in preference learning into four components: model, data, feedback, and algorithm. This unified view offers an in-depth understanding of existing alignment algorithms and also opens up possibilities to synergize the strengths of different strategies. Furthermore, we present detailed working examples of prevalent existing algorithms to facilitate a comprehensive understanding for the readers. Finally, based on our unified perspective, we explore the challenges and future research directions for aligning large language models with human preferences., Comment: 23 pages, 6 figures
Published: 2024

38. FuzzCoder: Byte-level Fuzzing Test via Large Language Model

Author: Yang, Liqun, Yang, Jian, Wei, Chaoren, Niu, Guanglin, Zhang, Ge, Wang, Yunli, ChaI, Linzheng, Xia, Wanxu, Guo, Hongcheng, Zhang, Shun, Liu, Jiaheng, Yin, Yuwei, Peng, Junran, Ma, Jiaxin, Sun, Liang, and Li, Zhoujun
Subjects: Computer Science - Computation and Language
Abstract: Fuzzing is an important dynamic program analysis technique designed for finding vulnerabilities in complex software. Fuzzing involves presenting a target program with crafted malicious input to cause crashes, buffer overflows, memory errors, and exceptions. Crafting malicious inputs in an efficient manner is a difficult open problem and the best approaches often apply uniform random mutations to pre-existing valid inputs. In this work, we propose to adopt fine-tuned large language models (FuzzCoder) to learn patterns in the input files from successful attacks to guide future fuzzing explorations. Specifically, we develop a framework to leverage the code LLMs to guide the mutation process of inputs in fuzzing. The mutation process is formulated as the sequence-to-sequence modeling, where LLM receives a sequence of bytes and then outputs the mutated byte sequence. FuzzCoder is fine-tuned on the created instruction dataset (Fuzz-Instruct), where the successful fuzzing history is collected from the heuristic fuzzing tool. FuzzCoder can predict mutation locations and strategies locations in input files to trigger abnormal behaviors of the program. Experimental results show that FuzzCoder based on AFL (American Fuzzy Lop) gain significant improvements in terms of effective proportion of mutation (EPM) and number of crashes (NC) for various input formats including ELF, JPG, MP3, and XML., Comment: 11 pages
Published: 2024

39. Frequency-Spatial Entanglement Learning for Camouflaged Object Detection

Author: Sun, Yanguang, Xu, Chunyan, Yang, Jian, Xuan, Hanyu, and Luo, Lei
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Camouflaged object detection has attracted a lot of attention in computer vision. The main challenge lies in the high degree of similarity between camouflaged objects and their surroundings in the spatial domain, making identification difficult. Existing methods attempt to reduce the impact of pixel similarity by maximizing the distinguishing ability of spatial features with complicated design, but often ignore the sensitivity and locality of features in the spatial domain, leading to sub-optimal results. In this paper, we propose a new approach to address this issue by jointly exploring the representation in the frequency and spatial domains, introducing the Frequency-Spatial Entanglement Learning (FSEL) method. This method consists of a series of well-designed Entanglement Transformer Blocks (ETB) for representation learning, a Joint Domain Perception Module for semantic enhancement, and a Dual-domain Reverse Parser for feature integration in the frequency and spatial domains. Specifically, the ETB utilizes frequency self-attention to effectively characterize the relationship between different frequency bands, while the entanglement feed-forward network facilitates information interaction between features of different domains through entanglement learning. Our extensive experiments demonstrate the superiority of our FSEL over 21 state-of-the-art methods, through comprehensive quantitative and qualitative comparisons in three widely-used datasets. The source code is available at: https://github.com/CSYSI/FSEL., Comment: Accepted at ECCV 2024
Published: 2024

40. Description of a new species of the genus Oreonectes (Teleostei, Cypriniformes, Nemacheilidae) from Guangxi, China

Author: Zhong, Jia-Hong, Yang, Jian, Mo, Hao-Lin, Chen, Weicai, and Pensoft Publishers
Subjects: Morphology, new species, Phylogeny, Stream fish, taxonomy
Published: 2024

41. Attention-Guided Multi-scale Interaction Network for Face Super-Resolution

Author: Wan, Xujie, Li, Wenjie, Gao, Guangwei, Lu, Huimin, Yang, Jian, and Lin, Chia-Wen
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Recently, CNN and Transformer hybrid networks demonstrated excellent performance in face super-resolution (FSR) tasks. Since numerous features at different scales in hybrid networks, how to fuse these multi-scale features and promote their complementarity is crucial for enhancing FSR. However, existing hybrid network-based FSR methods ignore this, only simply combining the Transformer and CNN. To address this issue, we propose an attention-guided Multi-scale interaction network (AMINet), which contains local and global feature interactions as well as encoder-decoder phases feature interactions. Specifically, we propose a Local and Global Feature Interaction Module (LGFI) to promote fusions of global features and different receptive fields' local features extracted by our Residual Depth Feature Extraction Module (RDFE). Additionally, we propose a Selective Kernel Attention Fusion Module (SKAF) to adaptively select fusions of different features within LGFI and encoder-decoder phases. Our above design allows the free flow of multi-scale features from within modules and between encoder and decoder, which can promote the complementarity of different scale features to enhance FSR. Comprehensive experiments confirm that our method consistently performs well with less computational consumption and faster inference., Comment: 12 pages, 8 figures, 8 tables
Published: 2024

42. Relaxed Rotational Equivariance via $G$-Biases in Vision

Author: Wu, Zhiqiang, Sun, Licheng, Liu, Yingjie, Yang, Jian, Dong, Hanlin, Lin, Shing-Ho J., Tang, Xuan, Mi, Jinpeng, Jin, Bo, and Wei, Xian
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: Group Equivariant Convolution (GConv) can effectively handle rotational symmetry data. They assume uniform and strict rotational symmetry across all features, as the transformations under the specific group. However, real-world data rarely conforms to strict rotational symmetry commonly referred to as Rotational Symmetry-Breaking in the system or dataset, making GConv unable to adapt effectively to this phenomenon. Motivated by this, we propose a simple but highly effective method to address this problem, which utilizes a set of learnable biases called the $G$-Biases under the group order to break strict group constraints and achieve \textbf{R}elaxed \textbf{R}otational \textbf{E}quivarant \textbf{Conv}olution (RREConv). We conduct extensive experiments to validate Relaxed Rotational Equivariance on rotational symmetry groups $\mathcal{C}_n$ (e.g. $\mathcal{C}_2$, $\mathcal{C}_4$, and $\mathcal{C}_6$ groups). Further experiments demonstrate that our proposed RREConv-based methods achieve excellent performance, compared to existing GConv-based methods in classification and detection tasks on natural image datasets.
Published: 2024

43. Correlation Effects in a Simplified Bilayer Two-Orbital Hubbard Model at Half Filling

Author: Yang, Jian-Jian, Yao, Dao-Xin, and Wu, Han-Qing
Subjects: Condensed Matter - Strongly Correlated Electrons
Abstract: Motivated by the discovery of high-temperature superconductivity in bilayer nickelate La$_3$Ni$_2$O$_7$ under pressure, we investigate the ground-state phase diagram and correlation effects using determinant quantum Monte Carlo simulations in a simplified bilayer two-orbital Hubbard model at half filling. Our results reveal the emergence of a nonmagnetic weakly insulating phase at weak on-site Hubbard interactions, transitioning to an antiferromagnetic Mott insulating phase as the interaction strength exceeds a critical value $U/t_1^x\approx 4.15$. This phase transition is consistent with the 3D O(3) Heisenberg universality class. Additionally, we analyze dynamical properties such as the single-particle spectral function and dynamic spin structure factor. The pronounced inter-layer correlation of $d_{3z^2-r^2}$ orbitals results in a downward trend and an extended flatness in the $\gamma$ band, mirroring the angle-resolved photoemission spectroscopy findings under ambient pressure. Our numerical results provide important clues for understanding the strong correlation effects in La$_3$Ni$_2$O$_7$., Comment: 9 pages, 11 figures
Published: 2024

44. SBDet: A Symmetry-Breaking Object Detector via Relaxed Rotation-Equivariance

Author: Wu, Zhiqiang, Liu, Yingjie, Dong, Hanlin, Tang, Xuan, Yang, Jian, Jin, Bo, Chen, Mingsong, and Wei, Xian
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: Introducing Group Equivariant Convolution (GConv) empowers models to explore symmetries hidden in visual data, improving their performance. However, in real-world scenarios, objects or scenes often exhibit perturbations of a symmetric system, specifically a deviation from a symmetric architecture, which can be characterized by a non-trivial action of a symmetry group, known as Symmetry-Breaking. Traditional GConv methods are limited by the strict operation rules in the group space, only ensuring features remain strictly equivariant under limited group transformations, making it difficult to adapt to Symmetry-Breaking or non-rigid transformations. Motivated by this, we introduce a novel Relaxed Rotation GConv (R2GConv) with our defined Relaxed Rotation-Equivariant group $\mathbf{R}_4$. Furthermore, we propose a Relaxed Rotation-Equivariant Network (R2Net) as the backbone and further develop the Symmetry-Breaking Object Detector (SBDet) for 2D object detection built upon it. Experiments demonstrate the effectiveness of our proposed R2GConv in natural image classification tasks, and SBDet achieves excellent performance in object detection tasks with improved generalization capabilities and robustness.
Published: 2024

45. Flatten: Video Action Recognition is an Image Classification task

Author: Chen, Junlin, Xu, Chengcheng, Xu, Yangfan, Yang, Jian, Li, Jun, and Shi, Zhiping
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: In recent years, video action recognition, as a fundamental task in the field of video understanding, has been deeply explored by numerous researchers.Most traditional video action recognition methods typically involve converting videos into three-dimensional data that encapsulates both spatial and temporal information, subsequently leveraging prevalent image understanding models to model and analyze these data. However,these methods have significant drawbacks. Firstly, when delving into video action recognition tasks, image understanding models often need to be adapted accordingly in terms of model architecture and preprocessing for these spatiotemporal tasks; Secondly, dealing with high-dimensional data often poses greater challenges and incurs higher time costs compared to its lower-dimensional counterparts.To bridge the gap between image-understanding and video-understanding tasks while simplifying the complexity of video comprehension, we introduce a novel video representation architecture, Flatten, which serves as a plug-and-play module that can be seamlessly integrated into any image-understanding network for efficient and effective 3D temporal data modeling.Specifically, by applying specific flattening operations (e.g., row-major transform), 3D spatiotemporal data is transformed into 2D spatial information, and then ordinary image understanding models are used to capture temporal dynamic and spatial semantic information, which in turn accomplishes effective and efficient video action recognition. Extensive experiments on commonly used datasets (Kinetics-400, Something-Something v2, and HMDB-51) and three classical image classification models (Uniformer, SwinV2, and ResNet), have demonstrated that embedding Flatten provides a significant performance improvements over original model., Comment: 13pages, 6figures
Published: 2024

46. TableBench: A Comprehensive and Complex Benchmark for Table Question Answering

Author: Wu, Xianjie, Yang, Jian, Chai, Linzheng, Zhang, Ge, Liu, Jiaheng, Du, Xinrun, Liang, Di, Shu, Daixin, Cheng, Xianfu, Sun, Tianzhen, Niu, Guanglin, Li, Tongliang, and Li, Zhoujun
Subjects: Computer Science - Computation and Language
Abstract: Recent advancements in Large Language Models (LLMs) have markedly enhanced the interpretation and processing of tabular data, introducing previously unimaginable capabilities. Despite these achievements, LLMs still encounter significant challenges when applied in industrial scenarios, particularly due to the increased complexity of reasoning required with real-world tabular data, underscoring a notable disparity between academic benchmarks and practical applications. To address this discrepancy, we conduct a detailed investigation into the application of tabular data in industrial scenarios and propose a comprehensive and complex benchmark TableBench, including 18 fields within four major categories of table question answering (TableQA) capabilities. Furthermore, we introduce TableLLM, trained on our meticulously constructed training set TableInstruct, achieving comparable performance with GPT-3.5. Massive experiments conducted on TableBench indicate that both open-source and proprietary LLMs still have significant room for improvement to meet real-world demands, where the most advanced model, GPT-4, achieves only a modest score compared to humans., Comment: 12 pages
Published: 2024

47. Barbie: Text to Barbie-Style 3D Avatars

Author: Sun, Xiaokun, Zhang, Zhenyu, Tai, Ying, Wang, Qian, Tang, Hao, Yi, Zili, and Yang, Jian
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Recent advances in text-guided 3D avatar generation have made substantial progress by distilling knowledge from diffusion models. Despite the plausible generated appearance, existing methods cannot achieve fine-grained disentanglement or high-fidelity modeling between inner body and outfit. In this paper, we propose Barbie, a novel framework for generating 3D avatars that can be dressed in diverse and high-quality Barbie-like garments and accessories. Instead of relying on a holistic model, Barbie achieves fine-grained disentanglement on avatars by semantic-aligned separated models for human body and outfits. These disentangled 3D representations are then optimized by different expert models to guarantee the domain-specific fidelity. To balance geometry diversity and reasonableness, we propose a series of losses for template-preserving and human-prior evolving. The final avatar is enhanced by unified texture refinement for superior texture consistency. Extensive experiments demonstrate that Barbie outperforms existing methods in both dressed human and outfit generation, supporting flexible apparel combination and animation. The code will be released for research purposes. Our project page is: https://xiaokunsun.github.io/Barbie.github.io/., Comment: 9 pages, 7 figures, Project page: https://xiaokunsun.github.io/Barbie.github.io/
Published: 2024

48. AquilaMoE: Efficient Training for MoE Models with Scale-Up and Scale-Out Strategies

Author: Zhang, Bo-Wen, Wang, Liangdong, Yuan, Ye, Li, Jijie, Gu, Shuhao, Zhao, Mengdi, Wu, Xinya, Liu, Guang, Wu, Chengwei, Zhao, Hanyu, Du, Li, Ju, Yiming, Ma, Quanyue, Ao, Yulong, Zhao, Yingli, Zhu, Songhe, Cao, Zhou, Liang, Dong, Lin, Yonghua, Zhang, Ming, Wang, Shunfei, Zhou, Yanxin, Ye, Min, Chen, Xuekai, Yu, Xinyang, Huang, Xiangjun, and Yang, Jian
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: In recent years, with the rapid application of large language models across various fields, the scale of these models has gradually increased, and the resources required for their pre-training have grown exponentially. Training an LLM from scratch will cost a lot of computation resources while scaling up from a smaller model is a more efficient approach and has thus attracted significant attention. In this paper, we present AquilaMoE, a cutting-edge bilingual 8*16B Mixture of Experts (MoE) language model that has 8 experts with 16 billion parameters each and is developed using an innovative training methodology called EfficientScale. This approach optimizes performance while minimizing data requirements through a two-stage process. The first stage, termed Scale-Up, initializes the larger model with weights from a pre-trained smaller model, enabling substantial knowledge transfer and continuous pretraining with significantly less data. The second stage, Scale-Out, uses a pre-trained dense model to initialize the MoE experts, further enhancing knowledge transfer and performance. Extensive validation experiments on 1.8B and 7B models compared various initialization schemes, achieving models that maintain and reduce loss during continuous pretraining. Utilizing the optimal scheme, we successfully trained a 16B model and subsequently the 8*16B AquilaMoE model, demonstrating significant improvements in performance and training efficiency.
Published: 2024

49. From Words to Worth: Newborn Article Impact Prediction with LLM

Author: Zhao, Penghai, Xing, Qinghua, Dou, Kairan, Tian, Jinyu, Tai, Ying, Yang, Jian, Cheng, Ming-Ming, and Li, Xiang
Subjects: Computer Science - Computation and Language
Abstract: As the academic landscape expands, the challenge of efficiently identifying potentially high-impact articles among the vast number of newly published works becomes critical. This paper introduces a promising approach, leveraging the capabilities of fine-tuned LLMs to predict the future impact of newborn articles solely based on titles and abstracts. Moving beyond traditional methods heavily reliant on external information, the proposed method discerns the shared semantic features of highly impactful papers from a large collection of title-abstract and potential impact pairs. These semantic features are further utilized to regress an improved metric, TNCSI_SP, which has been endowed with value, field, and time normalization properties. Additionally, a comprehensive dataset has been constructed and released for fine-tuning the LLM, containing over 12,000 entries with corresponding titles, abstracts, and TNCSI_SP. The quantitative results, with an NDCG@20 of 0.901, demonstrate that the proposed approach achieves state-of-the-art performance in predicting the impact of newborn articles when compared to competitive counterparts. Finally, we demonstrate a real-world application for predicting the impact of newborn journal articles to demonstrate its noteworthy practical value. Overall, our findings challenge existing paradigms and propose a shift towards a more content-focused prediction of academic impact, offering new insights for assessing newborn article impact., Comment: 7 pages for main sections, plus 3 additional pages for appendices. Code, dataset are released at https://sway.cloud.microsoft/KOH09sPR21Ubojbc
Published: 2024

50. Pick of the Bunch: Detecting Infrared Small Targets Beyond Hit-Miss Trade-Offs via Selective Rank-Aware Attention

Author: Dai, Yimian, Pan, Peiwen, Qian, Yulei, Li, Yuxuan, Li, Xiang, Yang, Jian, and Wang, Huan
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Infrared small target detection faces the inherent challenge of precisely localizing dim targets amidst complex background clutter. Traditional approaches struggle to balance detection precision and false alarm rates. To break this dilemma, we propose SeRankDet, a deep network that achieves high accuracy beyond the conventional hit-miss trade-off, by following the ``Pick of the Bunch'' principle. At its core lies our Selective Rank-Aware Attention (SeRank) module, employing a non-linear Top-K selection process that preserves the most salient responses, preventing target signal dilution while maintaining constant complexity. Furthermore, we replace the static concatenation typical in U-Net structures with our Large Selective Feature Fusion (LSFF) module, a dynamic fusion strategy that empowers SeRankDet with adaptive feature integration, enhancing its ability to discriminate true targets from false alarms. The network's discernment is further refined by our Dilated Difference Convolution (DDC) module, which merges differential convolution aimed at amplifying subtle target characteristics with dilated convolution to expand the receptive field, thereby substantially improving target-background separation. Despite its lightweight architecture, the proposed SeRankDet sets new benchmarks in state-of-the-art performance across multiple public datasets. The code is available at https://github.com/GrokCV/SeRankDet., Comment: IEEE TGRS 2024
Published: 2024

Catalog

Books, media, physical & digital resources

See catalog results

Searchworks

Select search scope, currently: Articles Catalog books, media & more in Jio Institute collections Articles journal articles & other e-resources

Search

Search Constraints

Refine your results

Search Limiters

Topic

Publication Year Range

Language

Category

Publication Type

Journal

Region

Database

Publisher

46,478 results on '"Yang, Jian"'

Search Results

Catalog

Select search scope, currently: Articles

Catalog

books, media & more in Jio Institute collections

Articles

journal articles & other e-resources