Author: "Han, Jianhua" - Searchworks@Jio Institute Digital Library Search Results

Your search for "Han, Jianhua" returned 619 results

1. HiRes-LLaVA: Restoring Fragmentation Input in High-Resolution Large Vision-Language Models

Author: Huang, Runhui, Ding, Xinpeng, Wang, Chunwei, Han, Jianhua, Liu, Yulong, Zhao, Hengshuang, Xu, Hang, Hou, Lu, Zhang, Wei, and Liang, Xiaodan
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: High-resolution inputs enable Large Vision-Language Models (LVLMs) to discern finer visual details, enhancing their comprehension capabilities. To reduce the training and computation costs caused by high-resolution input, one promising direction is to use sliding windows to slice the input into uniform patches, each matching the input size of the well-trained vision encoder. Although efficient, this slicing strategy leads to the fragmentation of original input, i.e., the continuity of contextual information and spatial geometry is lost across patches, adversely affecting performance in cross-patch context perception and position-specific tasks. To overcome these shortcomings, we introduce HiRes-LLaVA, a novel framework designed to efficiently process any size of high-resolution input without altering the original contextual and geometric information. HiRes-LLaVA comprises two innovative components: (i) a SliceRestore adapter that reconstructs sliced patches into their original form, efficiently extracting both global and local features via down-up-sampling and convolution layers, and (ii) a Self-Mining Sampler that compresses the vision tokens based on themselves, preserving the original context and positional information while reducing training overhead. To assess the ability to handle context fragmentation, we construct a new benchmark, EntityGrid-QA, consisting of edge-related and position-related tasks. Our comprehensive experiments demonstrate the superiority of HiRes-LLaVA on both existing public benchmarks and on EntityGrid-QA, particularly on document-oriented tasks, establishing new standards for handling high-resolution inputs.
Published: 2024
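
Note: the sliding-window slicing and the resulting fragmentation described in the abstract above can be pictured with a minimal sketch. This is not the authors' implementation. The function names, the non-overlapping stride, the padding of the image up to a multiple of the patch size, and the 336-pixel patch size are all assumptions chosen for illustration.

```python
import math
from typing import List, Tuple

# (top, left, bottom, right), half-open pixel coordinates
Box = Tuple[int, int, int, int]


def slice_into_patches(height: int, width: int, patch: int) -> List[Box]:
    """Tile an image with non-overlapping patch x patch windows.

    Assumption: the image is conceptually padded up to a multiple of
    `patch`, so every window has the same size (the encoder input size).
    """
    if height <= 0 or width <= 0 or patch <= 0:
        raise ValueError("height, width and patch must be positive")
    rows = math.ceil(height / patch)
    cols = math.ceil(width / patch)
    return [
        (r * patch, c * patch, (r + 1) * patch, (c + 1) * patch)
        for r in range(rows)
        for c in range(cols)
    ]


def patches_touching(boxes: List[Box], obj: Box) -> List[Box]:
    """Return the patches that overlap an object's bounding box.

    More than one result means the object is fragmented across patches.
    """
    ot, ol, ob, o_r = obj
    return [
        (t, l, b, r)
        for (t, l, b, r) in boxes
        if t < ob and ot < b and l < o_r and ol < r
    ]


if __name__ == "__main__":
    # Hypothetical 672 x 1008 image and a 336-pixel encoder input size.
    boxes = slice_into_patches(672, 1008, 336)
    print(len(boxes))  # 6 patches (2 rows x 3 columns)
    # An object straddling a patch corner is split across 4 patches.
    print(len(patches_touching(boxes, (300, 300, 400, 400))))  # 4
```

In this toy setting, an object that crosses a patch boundary is seen only in pieces by a per-patch encoder, which is the loss of cross-patch continuity the abstract refers to.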

2. HumanRefiner: Benchmarking Abnormal Human Generation and Refining with Coarse-to-fine Pose-Reversible Guidance

Author: Fang, Guian, Yan, Wenbiao, Guo, Yuanfan, Han, Jianhua, Jiang, Zutao, Xu, Hang, Liao, Shengcai, and Liang, Xiaodan
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Text-to-image diffusion models have significantly advanced in conditional image generation. However, these models usually struggle with accurately rendering images featuring humans, resulting in distorted limbs and other anomalies. This issue primarily stems from the insufficient recognition and evaluation of limb qualities in diffusion models. To address this issue, we introduce AbHuman, the first large-scale synthesized human benchmark focusing on anatomical anomalies. This benchmark consists of 56K synthesized human images, each annotated with detailed, bounding-box level labels identifying 147K human anomalies in 18 different categories. Based on this, the recognition of human anomalies can be established, which in turn enhances image generation through traditional techniques such as negative prompting and guidance. To further boost the improvement, we propose HumanRefiner, a novel plug-and-play approach for the coarse-to-fine refinement of human anomalies in text-to-image generation. Specifically, HumanRefiner utilizes a self-diagnostic procedure to detect and correct issues related to both coarse-grained abnormal human poses and fine-grained anomaly levels, facilitating pose-reversible diffusion generation. Experimental results on the AbHuman benchmark demonstrate that HumanRefiner significantly reduces generative discrepancies, achieving a 2.9x improvement in limb quality compared to the state-of-the-art open-source generator SDXL and a 1.4x improvement over DALL-E 3 in human evaluations. Our data and code are available at https://github.com/Enderfga/HumanRefiner.
Comment: Accepted by ECCV2024
Published: 2024

3. DetCLIPv3: Towards Versatile Generative Open-vocabulary Object Detection

Author: Yao, Lewei, Pi, Renjie, Han, Jianhua, Liang, Xiaodan, Xu, Hang, Zhang, Wei, Li, Zhenguo, and Xu, Dan
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Existing open-vocabulary object detectors typically require a predefined set of categories from users, significantly confining their application scenarios. In this paper, we introduce DetCLIPv3, a high-performing detector that excels not only at open-vocabulary object detection but also at generating hierarchical labels for detected objects. DetCLIPv3 is characterized by three core designs: 1. Versatile model architecture: we derive a robust open-set detection framework which is further empowered with generation ability via the integration of a caption head. 2. High information density data: we develop an auto-annotation pipeline leveraging a visual large language model to refine captions for large-scale image-text pairs, providing rich, multi-granular object labels to enhance the training. 3. Efficient training strategy: we employ a pre-training stage with low-resolution inputs that enables the object captioner to efficiently learn a broad spectrum of visual concepts from extensive image-text paired data. This is followed by a fine-tuning stage that leverages a small number of high-resolution samples to further enhance detection performance. With these effective designs, DetCLIPv3 demonstrates superior open-vocabulary detection performance, e.g., our Swin-T backbone model achieves a notable 47.0 zero-shot fixed AP on the LVIS minival benchmark, outperforming GLIPv2, GroundingDINO, and DetCLIPv2 by 18.0/19.6/6.6 AP, respectively. DetCLIPv3 also achieves a state-of-the-art 19.7 AP on the dense captioning task on the VG dataset, showcasing its strong generative capability.
Comment: Accepted to CVPR2024
Published: 2024

4. LayerDiff: Exploring Text-guided Multi-layered Composable Image Synthesis via Layer-Collaborative Diffusion Model

Author: Huang, Runhui, Cai, Kaixin, Han, Jianhua, Liang, Xiaodan, Pei, Renjing, Lu, Guansong, Xu, Songcen, Zhang, Wei, and Xu, Hang
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Despite the success of generating high-quality images given any text prompts by diffusion-based generative models, prior works directly generate the entire images, but cannot provide object-wise manipulation capability. To support wider real applications like professional graphic design and digital artistry, images are frequently created and manipulated in multiple layers to offer greater flexibility and control. Therefore in this paper, we propose a layer-collaborative diffusion model, named LayerDiff, specifically designed for text-guided, multi-layered, composable image synthesis. The composable image consists of a background layer, a set of foreground layers, and associated mask layers for each foreground element. To enable this, LayerDiff introduces a layer-based generation paradigm incorporating multiple layer-collaborative attention modules to capture inter-layer patterns. Specifically, an inter-layer attention module is designed to encourage information exchange and learning between layers, while a text-guided intra-layer attention module incorporates layer-specific prompts to direct the specific-content generation for each layer. A layer-specific prompt-enhanced module better captures detailed textual cues from the global prompt. Additionally, a self-mask guidance sampling strategy further unleashes the model's ability to generate multi-layered images. We also present a pipeline that integrates existing perceptual and generative models to produce a large dataset of high-quality, text-prompted, multi-layered images. Extensive experiments demonstrate that our LayerDiff model can generate high-quality multi-layered images with performance comparable to conventional whole-image generation methods. Moreover, LayerDiff enables a broader range of controllable generative applications, including layer-specific image editing and style transfer.
Published: 2024

5. NavCoT: Boosting LLM-Based Vision-and-Language Navigation via Learning Disentangled Reasoning

Author: Lin, Bingqian, Nie, Yunshuang, Wei, Ziming, Chen, Jiaqi, Ma, Shikui, Han, Jianhua, Xu, Hang, Chang, Xiaojun, and Liang, Xiaodan
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Robotics
Abstract: Vision-and-Language Navigation (VLN), as a crucial research problem of Embodied AI, requires an embodied agent to navigate through complex 3D environments following natural language instructions. Recent research has highlighted the promising capacity of large language models (LLMs) in VLN by improving navigational reasoning accuracy and interpretability. However, their predominant use in an offline manner usually suffers from a substantial domain gap between the VLN task and the LLM training corpus. This paper introduces a novel strategy called Navigational Chain-of-Thought (NavCoT), where we perform parameter-efficient in-domain training to enable self-guided navigational decisions, leading to a significant mitigation of the domain gap in a cost-effective manner. Specifically, at each timestep, the LLM is prompted to forecast the navigational chain-of-thought by: 1) acting as a world model to imagine the next observation according to the instruction, 2) selecting the candidate observation that best aligns with the imagination, and 3) determining the action based on the reasoning from the prior steps. Through constructing formalized labels for training, the LLM can learn to generate desired and reasonable chain-of-thought outputs for improving the action decision. Experimental results across various training settings and popular VLN benchmarks (e.g., Room-to-Room (R2R), Room-across-Room (RxR), Room-for-Room (R4R)) show the significant superiority of NavCoT over the direct action prediction variants. Through simple parameter-efficient finetuning, our NavCoT outperforms a recent GPT4-based approach with ~7% relative improvement on the R2R dataset. We believe that NavCoT will help unlock more task-adaptive and scalable LLM-based embodied agents, which are helpful for developing real-world robotics applications. Code is available at https://github.com/expectorlin/NavCoT.
Published: 2024

6. From Summary to Action: Enhancing Large Language Models for Complex Tasks with Open World APIs

Author: Liu, Yulong, Yuan, Yunlong, Wang, Chunwei, Han, Jianhua, Ma, Yongqiang, Zhang, Li, Zheng, Nanning, and Xu, Hang
Subjects: Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Computer Vision and Pattern Recognition
Abstract: The distinction between humans and animals lies in the unique ability of humans to use and create tools. Tools empower humans to overcome physiological limitations, fostering the creation of magnificent civilizations. Similarly, enabling foundational models like Large Language Models (LLMs) with the capacity to learn external tool usage may serve as a pivotal step toward realizing artificial general intelligence. Previous studies in this field have predominantly pursued two distinct approaches to augment the tool invocation capabilities of LLMs. The first approach emphasizes the construction of relevant datasets for model fine-tuning. The second approach, in contrast, aims to fully exploit the inherent reasoning abilities of LLMs through in-context learning strategies. In this work, we introduce a novel tool invocation pipeline designed to control massive real-world APIs. This pipeline mirrors the human task-solving process, addressing complicated real-life user queries. At each step, we guide LLMs to summarize the achieved results and determine the next course of action. We term this pipeline "from Summary to action", Sum2Act for short. Empirical evaluations of our Sum2Act pipeline on the ToolBench benchmark show significant performance improvements, outperforming established methods like ReAct and DFSDT. This highlights Sum2Act's effectiveness in enhancing LLMs for complex real-world tasks.
Published: 2024

7. Task-customized Masked AutoEncoder via Mixture of Cluster-conditional Experts

Author: Liu, Zhili, Chen, Kai, Han, Jianhua, Hong, Lanqing, Xu, Hang, Li, Zhenguo, and Kwok, James T.
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: Masked Autoencoder (MAE) is a prevailing self-supervised learning method that achieves promising results in model pre-training. However, when the various downstream tasks have data distributions different from the pre-training data, the semantically irrelevant pre-training information might result in negative transfer, impeding MAE's scalability. To address this issue, we propose a novel MAE-based pre-training paradigm, Mixture of Cluster-conditional Experts (MoCE), which can be trained once but provides customized pre-training models for diverse downstream tasks. Different from the mixture of experts (MoE), our MoCE trains each expert only with semantically relevant images by using cluster-conditional gates. Thus, each downstream task can be allocated to its customized model pre-trained with data most similar to the downstream data. Experiments on a collection of 11 downstream tasks show that MoCE outperforms the vanilla MAE by 2.45% on average. It also obtains new state-of-the-art self-supervised learning results on detection and segmentation.
Comment: Accepted by ICLR 2023
Published: 2024

8. PanGu-Draw: Advancing Resource-Efficient Text-to-Image Synthesis with Time-Decoupled Training and Reusable Coop-Diffusion

Author: Lu, Guansong, Guo, Yuanfan, Han, Jianhua, Niu, Minzhe, Zeng, Yihan, Xu, Songcen, Huang, Zeyi, Zhong, Zhao, Zhang, Wei, and Xu, Hang
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: Current large-scale diffusion models represent a giant leap forward in conditional image synthesis, capable of interpreting diverse cues like text, human poses, and edges. However, their reliance on substantial computational resources and extensive data collection remains a bottleneck. On the other hand, the integration of existing diffusion models, each specialized for different controls and operating in unique latent spaces, poses a challenge due to incompatible image resolutions and latent space embedding structures, hindering their joint use. Addressing these constraints, we present "PanGu-Draw", a novel latent diffusion model designed for resource-efficient text-to-image synthesis that adeptly accommodates multiple control signals. We first propose a resource-efficient Time-Decoupling Training Strategy, which splits the monolithic text-to-image model into structure and texture generators. Each generator is trained using a regimen that maximizes data utilization and computational efficiency, cutting data preparation by 48% and reducing training resources by 51%. Secondly, we introduce "Coop-Diffusion", an algorithm that enables the cooperative use of various pre-trained diffusion models with different latent spaces and predefined resolutions within a unified denoising process. This allows for multi-control image synthesis at arbitrary resolutions without the necessity for additional data or retraining. Empirical validations of PanGu-Draw show its exceptional prowess in text-to-image and multi-control image generation, suggesting a promising direction for future model training efficiencies and generation versatility. The largest 5B T2I PanGu-Draw model is released on the Ascend platform. Project page: https://pangu-draw.github.io
Comment: 16 pages, 16 figures
Published: 2023

9. G-LLaVA: Solving Geometric Problem with Multi-Modal Large Language Model

Author: Gao, Jiahui, Pi, Renjie, Zhang, Jipeng, Ye, Jiacheng, Zhong, Wanjun, Wang, Yufei, Hong, Lanqing, Han, Jianhua, Xu, Hang, Li, Zhenguo, and Kong, Lingpeng
Subjects: Computer Science - Computation and Language
Abstract: Large language models (LLMs) have shown remarkable proficiency in human-level reasoning and generation capabilities, which encourages extensive research on their application in mathematical problem solving. However, current work has been largely focused on text-based mathematical problems, with limited investigation in problems involving geometric information. Addressing this gap, we aim to enable LLMs to solve geometric problems by understanding image input. We first analyze the limitations of current Multimodal Large Language Models (MLLMs) in this area: they struggle to accurately comprehend basic geometric elements and their relationships. To overcome these challenges, we take advantage of the unique characteristics of geometric problems (such as unique geometric logical form, and geometric scalability) and the capacity of the textual LLMs to build an enriched multimodal geometry dataset based on existing data. The augmented dataset, Geo170K, contains more than 170K geometric image-caption and question-answer pairs. Utilizing our constructed Geo170K dataset, we develop G-LLaVA, which demonstrates exceptional performance in solving geometric problems, significantly outperforming GPT-4-V on the MathVista benchmark with only 7B parameters.
Comment: 10 pages
Published: 2023

10. Reason2Drive: Towards Interpretable and Chain-based Reasoning for Autonomous Driving

Author: Nie, Ming, Peng, Renyuan, Wang, Chunwei, Cai, Xinyue, Han, Jianhua, Xu, Hang, and Zhang, Li
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Large vision-language models (VLMs) have garnered increasing interest in autonomous driving areas, due to their advanced capabilities in complex reasoning tasks essential for highly autonomous vehicle behavior. Despite their potential, research in autonomous systems is hindered by the lack of datasets with annotated reasoning chains that explain the decision-making processes in driving. To bridge this gap, we present Reason2Drive, a benchmark dataset with over 600K video-text pairs, aimed at facilitating the study of interpretable reasoning in complex driving environments. We distinctly characterize the autonomous driving process as a sequential combination of perception, prediction, and reasoning steps, and the question-answer pairs are automatically collected from a diverse range of open-source outdoor driving datasets, including nuScenes, Waymo and ONCE. Moreover, we introduce a novel aggregated evaluation metric to assess chain-based reasoning performance in autonomous systems, addressing the semantic ambiguities of existing metrics such as BLEU and CIDEr. Based on the proposed benchmark, we conduct experiments to assess various existing VLMs, revealing insights into their reasoning capabilities. Additionally, we develop an efficient approach to empower VLMs to leverage object-level perceptual elements in both feature extraction and prediction, further enhancing their reasoning accuracy. The code and dataset will be released.
Published: 2023

11. Fluorine-free, corrosion-resistant aluminum surfaces with nickel hydroxide and stearic acid superhydrophobic coatings

Author: Han, Jianhua, Li, Yi, Zhou, Yuqin, Chen, Yingyu, Wang, Zengyi, Li, Yixuan, and Wang, Boyou
Published: 2024
Full Text: View/download PDF

12. Gaining Wisdom from Setbacks: Aligning Large Language Models via Mistake Analysis

Author: Chen, Kai, Wang, Chunwei, Yang, Kuo, Han, Jianhua, Hong, Lanqing, Mi, Fei, Xu, Hang, Liu, Zhengying, Huang, Wenyong, Li, Zhenguo, Yeung, Dit-Yan, Shang, Lifeng, Jiang, Xin, and Liu, Qun
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: The rapid development of large language models (LLMs) has not only provided numerous opportunities but also presented significant challenges. This becomes particularly evident when LLMs generate harmful or toxic content, either unintentionally or because of intentional inducement. Existing alignment methods usually direct LLMs toward the favorable outcomes by utilizing human-annotated, flawless instruction-response pairs. Conversely, this study proposes a novel alignment technique based on mistake analysis, which deliberately exposes LLMs to erroneous content to learn the reasons for mistakes and how to avoid them. In this case, mistakes are repurposed into valuable data for alignment, effectively helping to avoid the production of erroneous responses. Without external models or human annotations, our method leverages a model's intrinsic ability to discern undesirable mistakes and improves the safety of its generated responses. Experimental results reveal that our method outperforms existing alignment approaches in enhancing model safety while maintaining the overall utility.
Comment: Accepted by ICLR 2024
Published: 2023

13. Implicit Concept Removal of Diffusion Models

Author: Liu, Zhili, Chen, Kai, Zhang, Yifan, Han, Jianhua, Hong, Lanqing, Xu, Hang, Li, Zhenguo, Yeung, Dit-Yan, and Kwok, James
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Text-to-image (T2I) diffusion models often inadvertently generate unwanted concepts such as watermarks and unsafe images. These concepts, termed "implicit concepts", could be unintentionally learned during training and then be generated uncontrollably during inference. Existing removal methods still struggle to eliminate implicit concepts, primarily due to their dependency on the model's ability to recognize concepts it actually cannot discern. To address this, we utilize the intrinsic geometric characteristics of implicit concepts and present Geom-Erasing, a novel concept removal method based on geometric-driven control. Specifically, once an unwanted implicit concept is identified, we integrate the existence and geometric information of the concept into the text prompts with the help of an accessible classifier or detector model. Subsequently, the model is optimized to identify and disentangle this information, which is then adopted as negative prompts during generation. Moreover, we introduce the Implicit Concept Dataset (ICD), a novel image-text dataset imbued with three typical implicit concepts (i.e., QR codes, watermarks, and text), reflecting real-life situations where implicit concepts are easily injected. Geom-Erasing effectively mitigates the generation of implicit concepts, achieving state-of-the-art results on the Inappropriate Image Prompts (I2P) and our challenging Implicit Concept Dataset (ICD) benchmarks.
Published: 2023

14. HiLM-D: Towards High-Resolution Understanding in Multimodal Large Language Models for Autonomous Driving

Author: Ding, Xinpeng, Han, Jianhua, Xu, Hang, Zhang, Wei, and Li, Xiaomeng
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Autonomous driving systems generally employ separate models for different tasks, resulting in intricate designs. For the first time, we leverage singular multimodal large language models (MLLMs) to consolidate multiple autonomous driving tasks from videos, i.e., the Risk Object Localization and Intention and Suggestion Prediction (ROLISP) task. ROLISP uses natural language to simultaneously identify and interpret risk objects, understand ego-vehicle intentions, and provide motion suggestions, eliminating the necessity for task-specific architectures. However, lacking high-resolution (HR) information, existing MLLMs often miss small objects (e.g., traffic cones) and overly focus on salient ones (e.g., large trucks) when applied to ROLISP. We propose HiLM-D (Towards High-Resolution Understanding in MLLMs for Autonomous Driving), an efficient method to incorporate HR information into MLLMs for the ROLISP task. Specifically, HiLM-D integrates two branches: (i) the low-resolution reasoning branch, which can be any MLLM, processes low-resolution videos to caption risk objects and discern ego-vehicle intentions/suggestions; (ii) the high-resolution perception branch (HR-PB), which is central to HiLM-D, ingests HR images to enhance detection by capturing vision-specific HR feature maps and prioritizing all potential risks over merely salient objects. Our HR-PB serves as a plug-and-play module, seamlessly fitting into current MLLMs. Experiments on the ROLISP benchmark reveal HiLM-D's notable advantage over leading MLLMs, with improvements of 4.8% in BLEU-4 for captioning and 17.2% in mIoU for detection.
Published: 2023

15. Any-Size-Diffusion: Toward Efficient Text-Driven Synthesis for Any-Size HD Images

Author: Zheng, Qingping, Guo, Yuanfan, Deng, Jiankang, Han, Jianhua, Li, Ying, Xu, Songcen, and Xu, Hang
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Stable diffusion, a generative model used in text-to-image synthesis, frequently encounters resolution-induced composition problems when generating images of varying sizes. This issue primarily stems from the model being trained on pairs of single-scale images and their corresponding text descriptions. Moreover, direct training on images of unlimited sizes is unfeasible, as it would require an immense number of text-image pairs and entail substantial computational expenses. To overcome these challenges, we propose a two-stage pipeline named Any-Size-Diffusion (ASD), designed to efficiently generate well-composed images of any size, while minimizing the need for high-memory GPU resources. Specifically, the initial stage, dubbed Any Ratio Adaptability Diffusion (ARAD), leverages a selected set of images with a restricted range of ratios to optimize the text-conditional diffusion model, thereby improving its ability to adjust composition to accommodate diverse image sizes. To support the creation of images at any desired size, we further introduce a technique called Fast Seamless Tiled Diffusion (FSTD) at the subsequent stage. This method allows for the rapid enlargement of the ASD output to any high-resolution size, avoiding seaming artifacts or memory overloads. Experimental results on the LAION-COCO and MM-CelebA-HQ benchmarks demonstrate that ASD can produce well-structured images of arbitrary sizes, cutting down the inference time by 2x compared to the traditional tiled algorithm.
Published: 2023

16. GrowCLIP: Data-aware Automatic Model Growing for Large-scale Contrastive Language-Image Pre-training

Author: Deng, Xinchi, Shi, Han, Huang, Runhui, Li, Changlin, Xu, Hang, Han, Jianhua, Kwok, James, Zhao, Shen, Zhang, Wei, and Liang, Xiaodan
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: Cross-modal pre-training has shown impressive performance on a wide range of downstream tasks, benefiting from massive image-text pairs collected from the Internet. In practice, online data are growing constantly, highlighting the importance of the ability of a pre-trained model to learn from data that is continuously growing. Existing works on cross-modal pre-training mainly focus on training a network with fixed architecture. However, it is impractical to limit the model capacity when considering the continuously growing nature of pre-training data in real-world applications. On the other hand, it is important to utilize the knowledge in the current model to obtain efficient training and better performance. To address the above issues, in this paper, we propose GrowCLIP, a data-driven automatic model growing algorithm for contrastive language-image pre-training with continuous image-text pairs as input. Specifically, we adopt a dynamic growth space and seek out the optimal architecture at each growth step to adapt to online learning scenarios. A shared encoder is also proposed in our growth space to enhance the degree of cross-modal fusion. Besides, we explore the effect of growth in different dimensions, which could provide future references for the design of cross-modal model architecture. Finally, we employ parameter inheriting with momentum (PIM) to maintain the previous knowledge and address the issue of the local minimum dilemma. Compared with the existing methods, GrowCLIP improves average top-1 accuracy by 2.3% on zero-shot image classification across 9 downstream tasks. As for zero-shot image retrieval, GrowCLIP improves top-1 image-to-text recall by 1.2% on the Flickr30K dataset.
Comment: Accepted by ICCV2023
Published: 2023

17. DiffDis: Empowering Generative Diffusion Model with Cross-Modal Discrimination Capability

Author: Huang, Runhui, Han, Jianhua, Lu, Guansong, Liang, Xiaodan, Zeng, Yihan, Zhang, Wei, and Xu, Hang
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Recently, large-scale diffusion models, e.g., Stable diffusion and DallE2, have shown remarkable results on image synthesis. On the other hand, large-scale cross-modal pre-trained models (e.g., CLIP, ALIGN, and FILIP) are competent for various downstream tasks by learning to align vision and language embeddings. In this paper, we explore the possibility of jointly modeling generation and discrimination. Specifically, we propose DiffDis to unify the cross-modal generative and discriminative pretraining into one single framework under the diffusion process. DiffDis first formulates the image-text discriminative problem as a generative diffusion process of the text embedding from the text encoder conditioned on the image. Then, we propose a novel dual-stream network architecture, which fuses the noisy text embedding with the knowledge of latent images from different scales for image-text discriminative learning. Moreover, the generative and discriminative tasks can efficiently share the image-branch network structure in the multi-modality model. Benefiting from diffusion-based unified training, DiffDis achieves both better generation ability and cross-modal semantic alignment in one architecture. Experimental results show that DiffDis outperforms single-task models on both the image generation and the image-text discriminative tasks, e.g., 1.65% improvement on average accuracy of zero-shot classification over 12 datasets and 2.42 improvement on FID of zero-shot image synthesis.
Comment: ICCV2023
Published: 2023

18. The Sequence of Heating and Loading Affects Shear Properties of Granite Fractures Under High Temperature

Author: Meng, Fanzhen, Han, Jianhua, Li, Zhiyuan, Wang, Feili, Yue, Zhufeng, Cai, Qijin, Cui, Guanghao, and Zhou, Hui
Published: 2024
Full Text: View/download PDF

19. CorNav: Autonomous Agent with Self-Corrected Planning for Zero-Shot Vision-and-Language Navigation

Author: Liang, Xiwen, Ma, Liang, Guo, Shanshan, Han, Jianhua, Xu, Hang, Ma, Shikui, and Liang, Xiaodan
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language
Abstract: Understanding and following natural language instructions while navigating through complex, real-world environments poses a significant challenge for general-purpose robots. These environments often include obstacles and pedestrians, making it essential for autonomous agents to possess the capability of self-corrected planning to adjust their actions based on feedback from the surroundings. However, the majority of existing vision-and-language navigation (VLN) methods primarily operate in less realistic simulator settings and do not incorporate environmental feedback into their decision-making processes. To address this gap, we introduce a novel zero-shot framework called CorNav, utilizing a large language model for decision-making and comprising two key components: 1) incorporating environmental feedback for refining future plans and adjusting its actions, and 2) multiple domain experts for parsing instructions, scene understanding, and refining predicted actions. In addition to the framework, we develop a 3D simulator that renders realistic scenarios using Unreal Engine 5. To evaluate the effectiveness and generalization of navigation agents in a zero-shot multi-task setting, we create a benchmark called NavBench. Extensive experiments demonstrate that CorNav consistently outperforms all baselines by a significant margin across all tasks. On average, CorNav achieves a success rate of 28.1%, surpassing the best baseline's performance of 20.5%.
Comment: 22 pages
Published: 2023

20. Boosting Text-to-Image Diffusion Models with Fine-Grained Semantic Rewards

Author: Fang, Guian, Jiang, Zutao, Han, Jianhua, Lu, Guansong, Xu, Hang, and Liang, Xiaodan
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: Recent advances in text-to-image diffusion models have achieved remarkable success in generating high-quality, realistic images from given text prompts. However, previous methods fail to perform accurate modality alignment between text concepts and generated images due to the lack of fine-level semantic guidance that successfully diagnoses the modality discrepancy. In this paper, we propose FineRewards to improve the alignment between text and images in text-to-image diffusion models by introducing two new fine-grained semantic rewards: the caption reward and the Semantic Segment Anything (SAM) reward. From the global semantic view, the caption reward generates a corresponding detailed caption that depicts all important contents in the synthetic image via a BLIP-2 model and then calculates the reward score by measuring the similarity between the generated caption and the given prompt. From the local semantic view, the SAM reward segments the generated images into local parts with category labels, and scores the segmented parts by measuring the likelihood of each category appearing in the prompted scene via a large language model, i.e., Vicuna-7B. Additionally, we adopt an assemble reward-ranked learning strategy to enable the integration of multiple reward functions to jointly guide the model training. Adaptation results for text-to-image models on the MS-COCO benchmark show that the proposed semantic reward outperforms other baseline reward functions by a considerable margin on both visual quality and semantic similarity with the input prompt. Moreover, by adopting the assemble reward-ranked learning strategy, we demonstrate that model performance improves further when the proposed semantic reward is unified with the current image rewards during adaptation.
Published: 2023

21. DetGPT: Detect What You Need via Reasoning

Author: Pi, Renjie, Gao, Jiahui, Diao, Shizhe, Pan, Rui, Dong, Hanze, Zhang, Jipeng, Yao, Lewei, Han, Jianhua, Xu, Hang, Kong, Lingpeng, and Zhang, Tong
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: In recent years, the field of computer vision has seen significant advancements thanks to the development of large language models (LLMs). These models have enabled more effective and sophisticated interactions between humans and machines, paving the way for novel techniques that blur the lines between human and machine intelligence. In this paper, we introduce a new paradigm for object detection that we call reasoning-based object detection. Unlike conventional object detection methods that rely on specific object names, our approach enables users to interact with the system using natural language instructions, allowing for a higher level of interactivity. Our proposed method, called DetGPT, leverages state-of-the-art multi-modal models and open-vocabulary object detectors to perform reasoning within the context of the user's instructions and the visual scene. This enables DetGPT to automatically locate the object of interest based on the user's expressed desires, even if the object is not explicitly mentioned. For instance, if a user expresses a desire for a cold beverage, DetGPT can analyze the image, identify a fridge, and use its knowledge of typical fridge contents to locate the beverage. This flexibility makes our system applicable across a wide range of fields, from robotics and automation to autonomous driving. Overall, our proposed paradigm and DetGPT demonstrate the potential for more sophisticated and intuitive interactions between humans and machines. We hope that our proposed paradigm and approach will provide inspiration to the community and open the door to more interactive and versatile object detection systems. Our project page is available at detgpt.github.io.
Published: 2023

22. DetCLIPv2: Scalable Open-Vocabulary Object Detection Pre-training via Word-Region Alignment

Author: Yao, Lewei, Han, Jianhua, Liang, Xiaodan, Xu, Dan, Zhang, Wei, Li, Zhenguo, and Xu, Hang
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: This paper presents DetCLIPv2, an efficient and scalable training framework that incorporates large-scale image-text pairs to achieve open-vocabulary object detection (OVD). Unlike previous OVD frameworks that typically rely on a pre-trained vision-language model (e.g., CLIP) or exploit image-text pairs via a pseudo labeling process, DetCLIPv2 directly learns the fine-grained word-region alignment from massive image-text pairs in an end-to-end manner. To accomplish this, we employ a maximum word-region similarity between region proposals and textual words to guide the contrastive objective. To enable the model to gain localization capability while learning broad concepts, DetCLIPv2 is trained with a hybrid supervision from detection, grounding and image-text pair data under a unified data formulation. By jointly training with an alternating scheme and adopting low-resolution input for image-text pairs, DetCLIPv2 exploits image-text pair data efficiently and effectively: DetCLIPv2 utilizes 13× more image-text pairs than DetCLIP with a similar training time and improves performance. With 13M image-text pairs for pre-training, DetCLIPv2 demonstrates superior open-vocabulary detection performance, e.g., DetCLIPv2 with Swin-T backbone achieves 40.4% zero-shot AP on the LVIS benchmark, which outperforms previous works GLIP/GLIPv2/DetCLIP by 14.4/11.4/4.5% AP, respectively, and even beats its fully-supervised counterpart by a large margin.
Comment: Accepted to CVPR 2023
Published: 2023

23. CLIP$^2$: Contrastive Language-Image-Point Pretraining from Real-World Point Cloud Data

Author: Zeng, Yihan, Jiang, Chenhan, Mao, Jiageng, Han, Jianhua, Ye, Chaoqiang, Huang, Qingqiu, Yeung, Dit-Yan, Yang, Zhen, Liang, Xiaodan, and Xu, Hang
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Contrastive Language-Image Pre-training, benefiting from large-scale unlabeled text-image pairs, has demonstrated great performance in open-world vision understanding tasks. However, due to the limited Text-3D data pairs, adapting the success of 2D Vision-Language Models (VLM) to the 3D space remains an open problem. Existing works that leverage VLM for 3D understanding generally resort to constructing intermediate 2D representations for the 3D data, but at the cost of losing 3D geometry information. To take a step toward open-world 3D vision understanding, we propose Contrastive Language-Image-Point Cloud Pretraining (CLIP$^2$) to directly learn the transferable 3D point cloud representation in realistic scenarios with a novel proxy alignment mechanism. Specifically, we exploit naturally existing correspondences in 2D and 3D scenarios, and build well-aligned and instance-based text-image-point proxies from those complex scenarios. On top of that, we propose a cross-modal contrastive objective to learn semantic and instance-level aligned point cloud representation. Experimental results on both indoor and outdoor scenarios show that our learned 3D representation has great transfer ability in downstream tasks, including zero-shot and few-shot 3D recognition, which boosts the state-of-the-art methods by large margins. Furthermore, we provide analyses of the capability of different representations in real scenarios and present the optional ensemble scheme.
Comment: To appear at CVPR 2023
Published: 2023

24. Towards Universal Vision-language Omni-supervised Segmentation

Author: Dong, Bowen, Gu, Jiaxi, Han, Jianhua, Xu, Hang, and Zuo, Wangmeng
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Existing open-world universal segmentation approaches usually leverage CLIP and pre-computed proposal masks to treat open-world segmentation tasks as proposal classification. However, 1) these works cannot handle universal segmentation in an end-to-end manner, and 2) the limited scale of panoptic datasets restricts the open-world segmentation ability on things classes. In this paper, we present Vision-Language Omni-Supervised Segmentation (VLOSS). VLOSS starts from a Mask2Former universal segmentation framework with CLIP text encoder. To improve the open-world segmentation ability, we incorporate omni-supervised data (i.e., panoptic segmentation data, object detection data, and image-text pairs data) into training, thus enriching the open-world segmentation ability and achieving better segmentation accuracy. To improve the training efficiency and fully release the power of omni-supervised data, we propose several advanced techniques, i.e., FPN-style encoder, switchable training technique, and positive classification loss. Benefiting from the end-to-end training manner with proposed techniques, VLOSS can be applied to various open-world segmentation tasks without further adaptation. Experimental results on different open-world panoptic and instance segmentation benchmarks demonstrate the effectiveness of VLOSS. Notably, with fewer parameters, our VLOSS with Swin-Tiny backbone surpasses MaskCLIP by ~2% in terms of mask AP on LVIS v1 dataset.
Published: 2023

25. CapDet: Unifying Dense Captioning and Open-World Detection Pretraining

Author: Long, Yanxin, Wen, Youpeng, Han, Jianhua, Xu, Hang, Ren, Pengzhen, Zhang, Wei, Zhao, Shen, and Liang, Xiaodan
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Benefiting from large-scale vision-language pre-training on image-text pairs, open-world detection methods have shown superior generalization ability under the zero-shot or few-shot detection settings. However, a pre-defined category space is still required during the inference stage of existing methods and only the objects belonging to that space will be predicted. To introduce a "real" open-world detector, in this paper, we propose a novel method named CapDet to either predict under a given category list or directly generate the category of predicted bounding boxes. Specifically, we unify the open-world detection and dense caption tasks into a single yet effective framework by introducing an additional dense captioning head to generate the region-grounded captions. Besides, adding the captioning task will in turn benefit the generalization of detection performance since the captioning dataset covers more concepts. Experiment results show that by unifying the dense caption task, our CapDet has obtained significant performance improvements (e.g., +2.1% mAP on LVIS rare classes) over the baseline method on LVIS (1203 classes). Besides, our CapDet also achieves state-of-the-art performance on dense captioning tasks, e.g., 15.44% mAP on VG V1.2 and 13.98% on the VG-COCO dataset.
Comment: Accepted by CVPR 2023
Published: 2023

26. Visual Exemplar Driven Task-Prompting for Unified Perception in Autonomous Driving

Author: Liang, Xiwen, Niu, Minzhe, Han, Jianhua, Xu, Hang, Xu, Chunjing, and Liang, Xiaodan
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Multi-task learning has emerged as a powerful paradigm to solve a range of tasks simultaneously with good efficiency in both computation resources and inference time. However, these algorithms are designed for different tasks mostly not within the scope of autonomous driving, thus making it hard to compare multi-task methods in autonomous driving. Aiming to enable the comprehensive evaluation of present multi-task learning methods in autonomous driving, we extensively investigate the performance of popular multi-task methods on the large-scale driving dataset, which covers four common perception tasks, i.e., object detection, semantic segmentation, drivable area segmentation, and lane detection. We provide an in-depth analysis of current multi-task learning methods under different common settings and find out that the existing methods make progress but there is still a large performance gap compared with single-task baselines. To alleviate this dilemma in autonomous driving, we present an effective multi-task framework, VE-Prompt, which introduces visual exemplars via task-specific prompting to guide the model toward learning high-quality task-specific representations. Specifically, we generate visual exemplars based on bounding boxes and color-based markers, which provide accurate visual appearances of target categories and further mitigate the performance gap. Furthermore, we bridge transformer-based encoders and convolutional layers for efficient and accurate unified perception in autonomous driving. Comprehensive experimental results on the diverse self-driving dataset BDD100K show that the VE-Prompt improves the multi-task baseline and further surpasses single-task models.
Comment: Accepted at CVPR 2023
Published: 2023

27. Frictional Sliding Behaviour of Rough Fracture in Granite Under True Triaxial Loading with Implications for Fault Reactivation

Author: Meng, Fanzhen, Yue, Zhufeng, Li, Muzi, Han, Jianhua, Cai, Qijin, Wang, Wei, Hu, Dawei, and Zhang, Chuanqing
Published: 2024
Full Text: View/download PDF

28. NLIP: Noise-robust Language-Image Pre-training

Author: Huang, Runhui, Long, Yanxin, Han, Jianhua, Xu, Hang, Liang, Xiwen, Xu, Chunjing, and Liang, Xiaodan
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Large-scale cross-modal pre-training paradigms have recently shown ubiquitous success on a wide range of downstream tasks, e.g., zero-shot classification, retrieval and image captioning. However, their successes highly rely on the scale and quality of web-crawled data that naturally contain incomplete and noisy information (e.g., wrong or irrelevant content). Existing works either design manual rules to clean data or generate pseudo-targets as auxiliary signals for reducing noise impact, which do not explicitly tackle both the incorrect and incomplete challenges simultaneously. In this paper, to automatically mitigate the impact of noise by solely mining over existing data, we propose a principled Noise-robust Language-Image Pre-training framework (NLIP) to stabilize pre-training via two schemes: noise-harmonization and noise-completion. First, in noise-harmonization scheme, NLIP estimates the noise probability of each pair according to the memorization effect of cross-modal transformers, then adopts noise-adaptive regularization to harmonize the cross-modal alignments with varying degrees. Second, in noise-completion scheme, to enrich the missing object information of text, NLIP injects a concept-conditioned cross-modal decoder to obtain semantic-consistent synthetic captions to complete noisy ones, which uses the retrieved visual concepts (i.e., objects' names) for the corresponding image to guide captioning generation. By collaboratively optimizing noise-harmonization and noise-completion schemes, our NLIP can alleviate the common noise effects during image-text pre-training in a more efficient way. Extensive experiments show the significant performance improvements of our NLIP using only 26M data over existing pre-trained models (e.g., CLIP, FILIP and BLIP) on 12 zero-shot classification datasets, MSCOCO image captioning and zero-shot image-text retrieval tasks.
Comment: AAAI 2023
Published: 2022

29. Fine-grained Visual-Text Prompt-Driven Self-Training for Open-Vocabulary Object Detection

Author: Long, Yanxin, Han, Jianhua, Huang, Runhui, Hang, Xu, Zhu, Yi, Xu, Chunjing, and Liang, Xiaodan
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Inspired by the success of vision-language methods (VLMs) in zero-shot classification, recent works attempt to extend this line of work into object detection by leveraging the localization ability of pre-trained VLMs and generating pseudo labels for unseen classes in a self-training manner. However, since the current VLMs are usually pre-trained by aligning sentence embedding with global image embedding, the direct use of them lacks fine-grained alignment for object instances, which is the core of detection. In this paper, we propose a simple but effective fine-grained Visual-Text Prompt-driven self-training paradigm for Open-Vocabulary Detection (VTP-OVD) that introduces a fine-grained visual-text prompt adapting stage to enhance the current self-training paradigm with a more powerful fine-grained alignment. During the adapting stage, we enable VLM to obtain fine-grained alignment by using learnable text prompts to resolve an auxiliary dense pixel-wise prediction task. Furthermore, we propose a visual prompt module to provide the prior task information (i.e., the categories that need to be predicted) for the vision branch to better adapt the pre-trained VLM to the downstream tasks. Experiments show that our method achieves state-of-the-art performance for open-vocabulary object detection, e.g., 31.5% mAP on unseen classes of COCO.
Published: 2022
Full Text: View/download PDF

30. Generative Negative Text Replay for Continual Vision-Language Pretraining

Author: Yan, Shipeng, Hong, Lanqing, Xu, Hang, Han, Jianhua, Tuytelaars, Tinne, Li, Zhenguo, and He, Xuming
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Vision-language pre-training (VLP) has attracted increasing attention recently. With a large amount of image-text pairs, VLP models trained with contrastive loss have achieved impressive performance in various tasks, especially the zero-shot generalization on downstream datasets. In practical applications, however, massive data are usually collected in a streaming fashion, requiring VLP models to continuously integrate novel knowledge from incoming data and retain learned knowledge. In this work, we focus on learning a VLP model with sequential chunks of image-text pair data. To tackle the catastrophic forgetting issue in this multi-modal continual learning setting, we first introduce pseudo text replay that generates hard negative texts conditioned on the training images in memory, which not only better preserves learned knowledge but also improves the diversity of negative samples in the contrastive loss. Moreover, we propose multi-modal knowledge distillation between images and texts to align the instance-wise prediction between old and new models. We incrementally pre-train our model on both the instance and class incremental splits of the Conceptual Caption dataset, and evaluate the model on zero-shot image classification and image-text retrieval tasks. Our method consistently outperforms the existing baselines with a large margin, which demonstrates its superiority. Notably, we realize an average performance boost of 4.60% on image-classification downstream datasets for the class incremental split.
Comment: ECCV 2022
Published: 2022

31. DetCLIP: Dictionary-Enriched Visual-Concept Paralleled Pre-training for Open-world Detection

Author: Yao, Lewei, Han, Jianhua, Wen, Youpeng, Liang, Xiaodan, Xu, Dan, Zhang, Wei, Li, Zhenguo, Xu, Chunjing, and Xu, Hang
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Open-world object detection, as a more general and challenging goal, aims to recognize and localize objects described by arbitrary category names. The recent work GLIP formulates this problem as a grounding problem by concatenating all category names of detection datasets into sentences, which leads to inefficient interaction between category names. This paper presents DetCLIP, a paralleled visual-concept pre-training method for open-world detection by resorting to knowledge enrichment from a designed concept dictionary. To achieve better learning efficiency, we propose a novel paralleled concept formulation that extracts concepts separately to better utilize heterogeneous datasets (i.e., detection, grounding, and image-text pairs) for training. We further design a concept dictionary (with descriptions) from various online sources and detection datasets to provide prior knowledge for each concept. By enriching the concepts with their descriptions, we explicitly build the relationships among various concepts to facilitate the open-domain learning. The proposed concept dictionary is further used to provide sufficient negative concepts for the construction of the word-region alignment loss, and to complete labels for objects with missing descriptions in captions of image-text pair data. The proposed framework demonstrates strong zero-shot detection performances, e.g., on the LVIS dataset, our DetCLIP-T outperforms GLIP-T by 9.9% mAP and obtains a 13.5% improvement on rare categories compared to the fully-supervised model with the same backbone as ours.
Comment: Accepted to NeurIPS 2022
Published: 2022

32. Effective Adaptation in Multi-Task Co-Training for Unified Autonomous Driving

Author: Liang, Xiwen, Wu, Yangxin, Han, Jianhua, Xu, Hang, Xu, Chunjing, and Liang, Xiaodan
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Aiming towards a holistic understanding of multiple downstream tasks simultaneously, there is a need for extracting features with better transferability. Though many recent self-supervised pre-training methods have achieved impressive performance on various vision tasks under the prevailing pretrain-finetune paradigm, their generalization capacity to multi-task learning scenarios is yet to be explored. In this paper, we extensively investigate the transfer performance of various types of self-supervised methods, e.g., MoCo and SimCLR, on three downstream tasks, including semantic segmentation, drivable area segmentation, and traffic object detection, on the large-scale driving dataset BDD100K. We surprisingly find that their performances are sub-optimal or even lag far behind the single-task baseline, which may be due to the distinctions of training objectives and architectural design inherent in the pretrain-finetune paradigm. To overcome this dilemma as well as avoid redesigning the resource-intensive pre-training stage, we propose a simple yet effective pretrain-adapt-finetune paradigm for general multi-task training, where the off-the-shelf pretrained models can be effectively adapted without increasing the training overhead. During the adapt stage, we utilize learnable multi-scale adapters to dynamically adjust the pretrained model weights supervised by multi-task objectives while leaving the pretrained knowledge untouched. Furthermore, we regard the vision-language pre-training model CLIP as a strong complement to the pretrain-adapt-finetune paradigm and propose a novel adapter named LV-Adapter, which incorporates language priors in the multi-task model via task-specific prompting and alignment between visual and textual features.
Comment: Accepted at NeurIPS 2022
Published: 2022

33. Open-world Semantic Segmentation via Contrasting and Clustering Vision-Language Embedding

Author: Liu, Quande, Wen, Youpeng, Han, Jianhua, Xu, Chunjing, Xu, Hang, and Liang, Xiaodan
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: To bridge the gap between supervised semantic segmentation and real-world applications that require one model to recognize arbitrary new concepts, zero-shot segmentation has recently attracted a lot of attention by exploring the relationships between unseen and seen object categories, yet it requires large amounts of densely-annotated data with diverse base classes. In this paper, we propose a new open-world semantic segmentation pipeline that makes the first attempt to learn to segment semantic objects of various open-world categories without any efforts on dense annotations, by purely exploiting the image-caption data that naturally exist on the Internet. Our method, Vision-language-driven Semantic Segmentation (ViL-Seg), employs an image and a text encoder to generate visual and text embeddings for the image-caption data, with two core components that endow its segmentation ability: First, the image encoder is jointly trained with a vision-based contrasting and a cross-modal contrasting, which encourage the visual embeddings to preserve both fine-grained semantics and high-level category information that are crucial for the segmentation task. Furthermore, an online clustering head is devised over the image encoder, which dynamically segments the visual embeddings into distinct semantic groups such that they can be classified by comparing with various text embeddings to complete our segmentation pipeline. Experiments show that without using any data with dense annotations, our method can directly segment objects of arbitrary categories, outperforming zero-shot segmentation methods that require data labeling on three benchmark datasets.
Comment: Accepted to ECCV 2022 (revise acknowledgement)
Published: 2022

34. Task-Customized Self-Supervised Pre-training with Scalable Dynamic Routing

Author: Liu, Zhili, Han, Jianhua, Hong, Lanqing, Xu, Hang, Chen, Kai, Xu, Chunjing, and Li, Zhenguo
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Self-supervised learning (SSL), especially contrastive methods, has attracted attention recently as it learns effective transferable representations without semantic annotations. A common practice for self-supervised pre-training is to use as much data as possible. For a specific downstream task, however, involving irrelevant data in pre-training may degrade the downstream performance, as observed in our extensive experiments. On the other hand, for existing SSL methods, it is burdensome and infeasible to use different downstream-task-customized datasets in pre-training for different tasks. To address this issue, we propose a novel SSL paradigm called Scalable Dynamic Routing (SDR), which can be trained once and deployed efficiently to different downstream tasks with task-customized pre-trained models. Specifically, we construct the SDRnet with various sub-nets and train each sub-net with only one subset of the data by data-aware progressive training. When a downstream task arrives, we route among all the pre-trained sub-nets to get the best along with its corresponding weights. Experimental results show that our SDR can train 256 sub-nets on ImageNet simultaneously, which provides better transfer performance than a unified model trained on the full ImageNet, achieving state-of-the-art (SOTA) averaged accuracy over 11 downstream classification tasks and AP on PASCAL VOC detection task.
Published: 2022

35. ONCE-3DLanes: Building Monocular 3D Lane Detection

Author: Yan, Fan, Nie, Ming, Cai, Xinyue, Han, Jianhua, Xu, Hang, Yang, Zhen, Ye, Chaoqiang, Fu, Yanwei, Mi, Michael Bi, and Zhang, Li
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: We present ONCE-3DLanes, a real-world autonomous driving dataset with lane layout annotation in 3D space. Conventional 2D lane detection from a monocular image yields poor performance in the subsequent planning and control tasks in autonomous driving in the case of uneven roads. Predicting the 3D lane layout is thus necessary and enables effective and safe driving. However, existing 3D lane detection datasets are either unpublished or synthesized from a simulated environment, severely hampering the development of this field. In this paper, we take steps towards addressing these issues. By exploiting the explicit relationship between point clouds and image pixels, a dataset annotation pipeline is designed to automatically generate high-quality 3D lane locations from 2D lane annotations in 211K road scenes. In addition, we present an extrinsic-free, anchor-free method, called SALAD, regressing the 3D coordinates of lanes in image view without converting the feature map into the bird's-eye view (BEV). To facilitate future research on 3D lane detection, we benchmark the dataset and provide a novel evaluation metric, performing extensive experiments of both existing approaches and our proposed method. The aim of our work is to revive interest in 3D lane detection in a real-world scenario. We believe our work can lead to the expected and unexpected innovations in both academia and industry.
Comment: CVPR 2022. Project page at https://once-3dlanes.github.io
Published: 2022

36. Laneformer: Object-aware Row-Column Transformers for Lane Detection

Author: Han, Jianhua, Deng, Xiajun, Cai, Xinyue, Yang, Zhen, Xu, Hang, Xu, Chunjing, and Liang, Xiaodan
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: We present Laneformer, a conceptually simple yet powerful transformer-based architecture tailored for lane detection, a long-standing research topic for visual perception in autonomous driving. The dominant paradigms rely on purely CNN-based architectures which often fail in incorporating relations of long-range lane points and global contexts induced by surrounding objects (e.g., pedestrians, vehicles). Inspired by recent advances of the transformer encoder-decoder architecture in various vision tasks, we move forward to design a new end-to-end Laneformer architecture that adapts the conventional transformers to better capture the shape and semantic characteristics of lanes, with minimal overhead in latency. First, coupling with deformable pixel-wise self-attention in the encoder, Laneformer presents two new row and column self-attention operations to efficiently mine point context along with the lane shapes. Second, motivated by the observation that the appearing objects would affect the decision of predicting lane segments, Laneformer further includes the detected object instances as extra inputs of multi-head attention blocks in the encoder and decoder to facilitate the lane point detection by sensing semantic contexts. Specifically, the bounding box locations of objects are added into Key module to provide interaction with each pixel and query while the ROI-aligned features are inserted into Value module. Extensive experiments demonstrate our Laneformer achieves state-of-the-art performances on CULane benchmark, in terms of 77.1% F1 score. We hope our simple and effective Laneformer will serve as a strong baseline for future research in self-attention models for lane detection.
Comment: AAAI 2022
Published: 2022

37. CODA: A Real-World Road Corner Case Dataset for Object Detection in Autonomous Driving

Author: Li, Kaican, Chen, Kai, Wang, Haoyu, Hong, Lanqing, Ye, Chaoqiang, Han, Jianhua, Chen, Yukuai, Zhang, Wei, Xu, Chunjing, Yeung, Dit-Yan, Liang, Xiaodan, Li, Zhenguo, and Xu, Hang
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning, Computer Science - Robotics
Abstract: Contemporary deep-learning object detection methods for autonomous driving usually assume prefixed categories of common traffic participants, such as pedestrians and cars. Most existing detectors are unable to detect uncommon objects and corner cases (e.g., a dog crossing a street), which may lead to severe accidents in some situations, making the timeline for the real-world application of reliable autonomous driving uncertain. One main reason that impedes the development of truly reliable self-driving systems is the lack of public datasets for evaluating the performance of object detectors on corner cases. Hence, we introduce a challenging dataset named CODA that exposes this critical problem of vision-based detectors. The dataset consists of 1500 carefully selected real-world driving scenes, each containing four object-level corner cases (on average), spanning more than 30 object categories. On CODA, the performance of standard object detectors trained on large-scale autonomous driving datasets significantly drops to no more than 12.8% in mAR. Moreover, we experiment with the state-of-the-art open-world object detector and find that it also fails to reliably identify the novel objects in CODA, suggesting that a robust perception system for autonomous driving is probably still far from reach. We expect our CODA dataset to facilitate further research in reliable detection for real-world autonomous driving. Our dataset will be released at https://coda-dataset.github.io.
Comment: ECCV 2022
Published: 2022

38. Assisted reproductive technology induces different secondary sex ratio: parental and embryonic impacts

Author: Zhao, Jiansen, Shen, Haoran, Zhu, Qijiong, Liu, Jiong, Han, Jianhua, Yi, Ruiting, Li, Junxing, Lin, Yanshan, Liu, Tao, and Zhong, Xinqi
Published: 2023
Full Text: View/download PDF

39. Hexanary blends: a strategy towards thermally stable organic photovoltaics

Author: Paleti, Sri Harish Kumar, Hultmark, Sandra, Han, Jianhua, Wen, Yuanfan, Xu, Han, Chen, Si, Järsvall, Emmy, Jalan, Ishita, Villalva, Diego Rosas, Sharma, Anirudh, Khan, Jafar I., Moons, Ellen, Li, Ruipeng, Yu, Liyang, Gorenflot, Julien, Laquai, Frédéric, Müller, Christian, and Baran, Derya
Published: 2023
Full Text: View/download PDF

40. GAP43-dependent mitochondria transfer from astrocytes enhances glioblastoma tumorigenicity

Author: Watson, Dionysios C., Bayik, Defne, Storevik, Simon, Moreino, Shannon Sherwin, Sprowls, Samuel A., Han, Jianhua, Augustsson, Mina Thue, Lauko, Adam, Sravya, Palavalasa, Røsland, Gro Vatne, Troike, Katie, Tronstad, Karl Johan, Wang, Sabrina, Sarnow, Katharina, Kay, Kristen, Lunavat, Taral R., Silver, Daniel J., Dayal, Sahil, Joseph, Justin Vareecal, Mulkearns-Hubert, Erin, Ystaas, Lars Andreas Rømo, Deshpande, Gauravi, Guyon, Joris, Zhou, Yadi, Magaut, Capucine R., Seder, Juliana, Neises, Laura, Williford, Sarah E., Meiser, Johannes, Scott, Andrew J., Sajjakulnukit, Peter, Mears, Jason A., Bjerkvig, Rolf, Chakraborty, Abhishek, Daubon, Thomas, Cheng, Feixiong, Lyssiotis, Costas A., Wahl, Daniel R., Hjelmeland, Anita B., Hossain, Jubayer A., Miletic, Hrvoje, and Lathia, Justin D.
Published: 2023
Full Text: View/download PDF

41. SODA10M: A Large-Scale 2D Self/Semi-Supervised Object Detection Dataset for Autonomous Driving

Author: Han, Jianhua, Liang, Xiwen, Xu, Hang, Chen, Kai, Hong, Lanqing, Mao, Jiageng, Ye, Chaoqiang, Zhang, Wei, Li, Zhenguo, Liang, Xiaodan, and Xu, Chunjing
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Aiming at facilitating a real-world, ever-evolving and scalable autonomous driving system, we present a large-scale dataset for standardizing the evaluation of different self-supervised and semi-supervised approaches by learning from raw data, which is the first and largest dataset to date. Existing autonomous driving systems heavily rely on 'perfect' visual perception models (e.g., detection) trained using extensive annotated data to ensure safety. However, it is unrealistic to elaborately label instances of all scenarios and circumstances (e.g., night, extreme weather, cities) when deploying a robust autonomous driving system. Motivated by recent advances of self-supervised and semi-supervised learning, a promising direction is to learn a robust detection model by collaboratively exploiting large-scale unlabeled data and few labeled data. Existing datasets either provide only a small amount of data or cover limited domains with full annotation, hindering the exploration of large-scale pre-trained models. Here, we release a Large-Scale 2D Self/semi-supervised Object Detection dataset for Autonomous driving, named SODA10M, containing 10 million unlabeled images and 20K images labeled with 6 representative object categories. To improve diversity, the images are collected within 27833 driving hours under different weather conditions, periods and location scenes of 32 different cities. We provide extensive experiments and deep analyses of existing popular self/semi-supervised approaches, and give some interesting findings in the autonomous driving scope. Experiments show that SODA10M can serve as a promising pre-training dataset for different self-supervised learning methods, which gives superior performance when fine-tuning on different downstream tasks (e.g., detection, semantic/instance segmentation) in the autonomous driving domain. More information is available at https://soda-2d.github.io.
Comment: NeurIPS 2021 Datasets and Benchmarks Track
Published: 2021

42. Structural Innovation and Theoretical Optimization of Modern Cultural Industry Tax Administration Based on Lasso Regression Algorithm

Author: Han Yu and Han Jianhua
Subjects: lasso regression algorithm, indicator variables, normality test, tax model, modern cultural industry, 03d32, Mathematics, QA1-939
Abstract: China’s modern cultural industry faces difficulties in tax collection and high pressure on expenditure. To address these problems, this paper constructs a modern cultural industry tax model based on the Lasso regression algorithm to promote the development of the modern cultural industry. To further study the relationship between economic factors and the tax revenue of the modern culture industry, relevant data on China’s tax revenue between 2000 and 2021 are selected, and 12 economic factors affecting the tax revenue of the modern culture industry are screened by combining the least squares method, the Lasso regression algorithm and a linear regression model. After the screening of indicator variables, according to linear regression theory, the relationship between the key economic factors screened out by the Lasso regression algorithm and the fiscal revenues is estimated by fitting and tested for normality, and the modern culture industry tax revenue model based on the Lasso regression algorithm is constructed. Based on the tax revenue of the modern cultural industry between 2000 and 2021, the Lasso regression algorithm is used to analyze the modern cultural industry tax revenue. Examples are used to analyze the results. The results show that the tax revenue of the culture industry in 2020 is obtained as 208,988,838.42 billion yuan, and the tax revenue of the culture industry in a province in 2021 is 221,794,675 billion yuan, with a growth rate of 6.13%, which indicates that the Lasso regression algorithm is able to extract the information contained in the tax revenue of the modern culture industry well, and the coefficients of the parameter part are also in line with the actual tax situation. Through the innovation of tax management structures and theoretical optimization, this study aims to promote the healthy and rapid development of modern cultural industries.
Published: 2024
Full Text: View/download PDF

43. A smart electroplating approach to fabricate mechanically robust and fluorine-free Ni-W alloys based superhydrophobic coating on Al alloy

Author: Han, Jianhua, Wang, Zengyi, Zhi, Antong, Li, Yi, Zhao, Shuang, Yan, Huiyu, and Han, Qi
Published: 2023
Full Text: View/download PDF

44. High-performing organic electronics using terpene green solvents from renewable feedstocks

Author: Corzo, Daniel, Rosas-Villalva, Diego, C, Amruth, Tostado-Blázquez, Guillermo, Alexandre, Emily Bezerra, Hernandez, Luis Huerta, Han, Jianhua, Xu, Han, Babics, Maxime, De Wolf, Stefaan, and Baran, Derya
Published: 2023
Full Text: View/download PDF

45. Inhibition of a new AXL isoform, AXL3, induces apoptosis of mantle cell lymphoma cells

Author: Gelebart, Pascal, Eriksen Gjerstad, May, Benjaminsen, Susanne, Han, Jianhua, Karlsen, Ida, Safont, Mireia Mayoral, Leitch, Calum, Fandalyuk, Zinayida, Popa, Mihaela, Helgeland, Lars, Papp, Bela, Baran-Marszak, Fanny, and McCormack, Emmet
Published: 2023
Full Text: View/download PDF

46. Dissecting the structure-stability relationship of Y-series electron acceptors for real-world solar cell applications

Author: Xu, Han, Han, Jianhua, Chen, Si, Liu, Ye, Huerta Hernandez, Luis, Bertrandie, Jules, Babics, Maxime, Alam, Shahidul, Villalva, Diego Rosas, Paleti, Sri Harish Kumar, Gorenflot, Julien, Herok, Christoph, Ramos, Nicolas, Troughton, Joel, Sharma, Anirudh, Marder, Todd B., Engels, Bernd, Martin, Jaime, De Wolf, Stefaan, Laquai, Frédéric, and Baran, Derya
Published: 2023
Full Text: View/download PDF

47. Robust superhydrophobic film on aluminum alloy prepared with TiO2/SiO2-silane composite film for efficient self-cleaning, anti-corrosion and anti-icing

Author: Han, Jianhua, Liu, Enhong, Zhou, Yuqin, Zhao, Shuang, Yan, Huiyu, Hu, Chenxi, Kang, Jianhai, Han, Qi, and Su, Yuyu
Published: 2023
Full Text: View/download PDF

48. High-efficiency ternary sequential solution deposition structure organic solar cells with two polymer donors

Author: Wang, Junjie, Wen, Shuguang, Hu, Jian, Han, Jianhua, Yang, Chunpeng, Li, Jianfeng, Bao, Xichang, and Yan, Shouke
Published: 2023
Full Text: View/download PDF

49. Driving Factors, Co-occurrence Networks, and Metabolic Profiles of Soil Bacterial Communities Within the Root Proximity of Amaranthus palmeri

Author: Zhang, Mei, Li, Xueying, Shi, Cong, Qiu, Zhenlu, Han, Jianhua, Wang, Kefan, Zheng, Pufan, and Shi, Fuchen
Published: 2022
Full Text: View/download PDF

50. Boosts charge utilization and enables high performance organic solar cells by marco- and micro- synergistic method

Author: Zhang, Shuai, Bi, Fuzhen, Han, Jianhua, Shang, Chenyu, Kang, Xiao, and Bao, Xichang
Published: 2022
Full Text: View/download PDF
