Author: "Kang, Bingyi" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Kang, Bingyi"' showing total 257 results

Start Over Author "Kang, Bingyi"

257 results on '"Kang, Bingyi"'

1. Video Depth Anything: Consistent Depth Estimation for Super-Long Videos

Author: Chen, Sili, Guo, Hengkai, Zhu, Shengnan, Zhang, Feihu, Huang, Zilong, Feng, Jiashi, and Kang, Bingyi
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: Depth Anything has achieved remarkable success in monocular depth estimation with strong generalization ability. However, it suffers from temporal inconsistency in videos, hindering its practical applications. Various methods have been proposed to alleviate this issue by leveraging video generation models or introducing priors from optical flow and camera poses. Nonetheless, these methods are only applicable to short videos (< 10 seconds) and require a trade-off between quality and computational efficiency. We propose Video Depth Anything for high-quality, consistent depth estimation in super-long videos (over several minutes) without sacrificing efficiency. We base our model on Depth Anything V2 and replace its head with an efficient spatial-temporal head. We design a straightforward yet effective temporal consistency loss by constraining the temporal depth gradient, eliminating the need for additional geometric priors. The model is trained on a joint dataset of video depth and unlabeled images, similar to Depth Anything V2. Moreover, a novel key-frame-based strategy is developed for long video inference. Experiments show that our model can be applied to arbitrarily long videos without compromising quality, consistency, or generalization ability. Comprehensive evaluations on multiple video benchmarks demonstrate that our approach sets a new state-of-the-art in zero-shot video depth estimation. We offer models of different scales to support a range of scenarios, with our smallest model capable of real-time performance at 30 FPS., Comment: Project page: https://videodepthanything.github.io/
Published: 2025

2. VideoWorld: Exploring Knowledge Learning from Unlabeled Videos

Author: Ren, Zhongwei, Wei, Yunchao, Guo, Xun, Zhao, Yao, Kang, Bingyi, Feng, Jiashi, and Jin, Xiaojie
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: This work explores whether a deep generative model can learn complex knowledge solely from visual input, in contrast to the prevalent focus on text-based models like large language models (LLMs). We develop VideoWorld, an auto-regressive video generation model trained on unlabeled video data, and test its knowledge acquisition abilities in video-based Go and robotic control tasks. Our experiments reveal two key findings: (1) video-only training provides sufficient information for learning knowledge, including rules, reasoning and planning capabilities, and (2) the representation of visual change is crucial for knowledge acquisition. To improve both the efficiency and efficacy of this process, we introduce the Latent Dynamics Model (LDM) as a key component of VideoWorld. Remarkably, VideoWorld reaches a 5-dan professional level in the Video-GoBench with just a 300-million-parameter model, without relying on search algorithms or reward mechanisms typical in reinforcement learning. In robotic tasks, VideoWorld effectively learns diverse control operations and generalizes across environments, approaching the performance of oracle models in CALVIN and RLBench. This study opens new avenues for knowledge acquisition from visual data, with all code, data, and models open-sourced for further research., Comment: Code and models are released at: https://maverickren.github.io/VideoWorld.github.io/
Published: 2025

3. Towards Generalist Robot Policies: What Matters in Building Vision-Language-Action Models

Author: Li, Xinghang, Li, Peiyan, Liu, Minghuan, Wang, Dong, Liu, Jirong, Kang, Bingyi, Ma, Xiao, Kong, Tao, Zhang, Hanbo, and Liu, Huaping
Subjects: Computer Science - Robotics, Computer Science - Computer Vision and Pattern Recognition
Abstract: Foundation Vision Language Models (VLMs) exhibit strong capabilities in multi-modal representation learning, comprehension, and reasoning. By injecting action components into the VLMs, Vision-Language-Action Models (VLAs) can be naturally formed and also show promising performance. Existing work has demonstrated the effectiveness and generalization of VLAs in multiple scenarios and tasks. Nevertheless, the transfer from VLMs to VLAs is not trivial since existing VLAs differ in their backbones, action-prediction formulations, data distributions, and training recipes. This leads to a missing piece for a systematic understanding of the design choices of VLAs. In this work, we disclose the key factors that significantly influence the performance of VLA and focus on answering three essential design choices: which backbone to select, how to formulate the VLA architectures, and when to add cross-embodiment data. The obtained results convince us firmly to explain why we need VLA and develop a new family of VLAs, RoboVLMs, which require very few manual designs and achieve a new state-of-the-art performance in three simulation tasks and real-world experiments. Through our extensive experiments, which include over 8 VLM backbones, 4 policy architectures, and over 600 distinct designed experiments, we provide a detailed guidebook for the future design of VLAs. In addition to the study, the highly flexible RoboVLMs framework, which supports easy integrations of new VLMs and free combinations of various design choices, is made public to facilitate future research. We open-source all details, including codes, models, datasets, and toolkits, along with detailed training and evaluation recipes at: robovlms.github.io., Comment: Project page: robovlms.github.io. Added limitations and future works. Fix categorization
Published: 2024

4. Prompting Depth Anything for 4K Resolution Accurate Metric Depth Estimation

Author: Lin, Haotong, Peng, Sida, Chen, Jingxiao, Peng, Songyou, Sun, Jiaming, Liu, Minghuan, Bao, Hujun, Feng, Jiashi, Zhou, Xiaowei, and Kang, Bingyi
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Prompts play a critical role in unleashing the power of language and vision foundation models for specific tasks. For the first time, we introduce prompting into depth foundation models, creating a new paradigm for metric depth estimation termed Prompt Depth Anything. Specifically, we use a low-cost LiDAR as the prompt to guide the Depth Anything model for accurate metric depth output, achieving up to 4K resolution. Our approach centers on a concise prompt fusion design that integrates the LiDAR at multiple scales within the depth decoder. To address training challenges posed by limited datasets containing both LiDAR depth and precise GT depth, we propose a scalable data pipeline that includes synthetic data LiDAR simulation and real data pseudo GT depth generation. Our approach sets new state-of-the-arts on the ARKitScenes and ScanNet++ datasets and benefits downstream applications, including 3D reconstruction and generalized robotic grasping., Comment: Project page: https://PromptDA.github.io/
Published: 2024

5. Image Understanding Makes for A Good Tokenizer for Image Generation

Author: Wang, Luting, Zhao, Yang, Zhang, Zijian, Feng, Jiashi, Liu, Si, and Kang, Bingyi
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Abstract Modern image generation (IG) models have been shown to capture rich semantics valuable for image understanding (IU) tasks. However, the potential of IU models to improve IG performance remains uncharted. We address this issue using a token-based IG framework, which relies on effective tokenizers to project images into token sequences. Currently, pixel reconstruction (e.g., VQGAN) dominates the training objective for image tokenizers. In contrast, our approach adopts the feature reconstruction objective, where tokenizers are trained by distilling knowledge from pretrained IU encoders. Comprehensive comparisons indicate that tokenizers with strong IU capabilities achieve superior IG performance across a variety of metrics, datasets, tasks, and proposal networks. Notably, VQ-KD CLIP achieves $4.10$ FID on ImageNet-1k (IN-1k). Visualization suggests that the superiority of VQ-KD can be partly attributed to the rich semantics within the VQ-KD codebook. We further introduce a straightforward pipeline to directly transform IU encoders into tokenizers, demonstrating exceptional effectiveness for IG tasks. These discoveries may energize further exploration into image tokenizer research and inspire the community to reassess the relationship between IU and IG. The code is released at https://github.com/magic-research/vector_quantization., Comment: Accepted by NeurIPS 2024
Published: 2024

6. Classification Done Right for Vision-Language Pre-Training

Author: Huang, Zilong, Ye, Qinghao, Kang, Bingyi, Feng, Jiashi, and Fan, Haoqi
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: We introduce SuperClass, a super simple classification method for vision-language pre-training on image-text data. Unlike its contrastive counterpart CLIP who contrast with a text encoder, SuperClass directly utilizes tokenized raw text as supervised classification labels, without the need for additional text filtering or selection. Due to the absence of the text encoding as contrastive target, SuperClass does not require a text encoder and does not need to maintain a large batch size as CLIP does. SuperClass demonstrated superior performance on various downstream tasks, including classic computer vision benchmarks and vision language downstream tasks. We further explored the scaling behavior of SuperClass on model size, training length, or data size, and reported encouraging results and comparisons to CLIP. https://github.com/x-cls/superclass, Comment: NeurIPS 2024
Published: 2024

7. How Far is Video Generation from World Model: A Physical Law Perspective

Author: Kang, Bingyi, Yue, Yang, Lu, Rui, Lin, Zhijie, Zhao, Yang, Wang, Kaixin, Huang, Gao, and Feng, Jiashi
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: OpenAI's Sora highlights the potential of video generation for developing world models that adhere to fundamental physical laws. However, the ability of video generation models to discover such laws purely from visual data without human priors can be questioned. A world model learning the true law should give predictions robust to nuances and correctly extrapolate on unseen scenarios. In this work, we evaluate across three key scenarios: in-distribution, out-of-distribution, and combinatorial generalization. We developed a 2D simulation testbed for object movement and collisions to generate videos deterministically governed by one or more classical mechanics laws. This provides an unlimited supply of data for large-scale experimentation and enables quantitative evaluation of whether the generated videos adhere to physical laws. We trained diffusion-based video generation models to predict object movements based on initial frames. Our scaling experiments show perfect generalization within the distribution, measurable scaling behavior for combinatorial generalization, but failure in out-of-distribution scenarios. Further experiments reveal two key insights about the generalization mechanisms of these models: (1) the models fail to abstract general physical rules and instead exhibit "case-based" generalization behavior, i.e., mimicking the closest training example; (2) when generalizing to new cases, models are observed to prioritize different factors when referencing training data: color > size > velocity > shape. Our study suggests that scaling alone is insufficient for video generation models to uncover fundamental physical laws, despite its role in Sora's broader success. See our project page at https://phyworld.github.io, Comment: preprint
Published: 2024

8. DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution

Author: Yue, Yang, Wang, Yulin, Kang, Bingyi, Han, Yizeng, Wang, Shenzhi, Song, Shiji, Feng, Jiashi, and Huang, Gao
Subjects: Computer Science - Robotics, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: MLLMs have demonstrated remarkable comprehension and reasoning capabilities with complex language and visual data. These advances have spurred the vision of establishing a generalist robotic MLLM proficient in understanding complex human instructions and accomplishing various embodied tasks. However, developing MLLMs for real-world robots is challenging due to the typically limited computation and memory capacities available on robotic platforms. In contrast, the inference of MLLMs involves storing billions of parameters and performing tremendous computation, imposing significant hardware demands. In our paper, we propose a Dynamic Early-Exit Framework for Robotic Vision-Language-Action Model (DeeR-VLA, or simply DeeR) that automatically adjusts the size of the activated MLLM based on each situation at hand. The approach leverages a multi-exit architecture in MLLMs, which allows the model to terminate processing once a proper size of the model has been activated for a specific situation, thus avoiding further redundant computation. Additionally, we develop novel algorithms that establish early-termination criteria for DeeR, conditioned on predefined demands such as average computational cost (i.e., power consumption), as well as peak computational consumption (i.e., latency) and GPU memory usage. These enhancements ensure that DeeR operates efficiently under varying resource constraints while maintaining competitive performance. On the CALVIN robot manipulation benchmark, DeeR demonstrates significant reductions in computational costs of LLM by 5.2-6.5x and GPU memory of LLM by 2-6x without compromising performance. Code and checkpoints are available at https://github.com/yueyang130/DeeR-VLA., Comment: 25 pages, 6 figures, NeurIPS 2024
Published: 2024

9. Loong: Generating Minute-level Long Videos with Autoregressive Language Models

Author: Wang, Yuqing, Xiong, Tianwei, Zhou, Daquan, Lin, Zhijie, Zhao, Yang, Kang, Bingyi, Feng, Jiashi, and Liu, Xihui
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: It is desirable but challenging to generate content-rich long videos in the scale of minutes. Autoregressive large language models (LLMs) have achieved great success in generating coherent and long sequences of tokens in the domain of natural language processing, while the exploration of autoregressive LLMs for video generation is limited to generating short videos of several seconds. In this work, we conduct a deep analysis of the challenges that prevent autoregressive LLM-based video generators from generating long videos. Based on the observations and analysis, we propose Loong, a new autoregressive LLM-based video generator that can generate minute-long videos. Specifically, we model the text tokens and video tokens as a unified sequence for autoregressive LLMs and train the model from scratch. We propose progressive short-to-long training with a loss re-weighting scheme to mitigate the loss imbalance problem for long video training. We further investigate inference strategies, including video token re-encoding and sampling strategies, to diminish error accumulation during inference. Our proposed Loong can be trained on 10-second videos and be extended to generate minute-level long videos conditioned on text prompts, as demonstrated by the results. More samples are available at: https://epiphqny.github.io/Loong-video., Comment: Project page: https://epiphqny.github.io/Loong-video/
Published: 2024

10. Depth Anything V2

Author: Yang, Lihe, Kang, Bingyi, Huang, Zilong, Zhao, Zhen, Xu, Xiaogang, Feng, Jiashi, and Zhao, Hengshuang
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: This work presents Depth Anything V2. Without pursuing fancy techniques, we aim to reveal crucial findings to pave the way towards building a powerful monocular depth estimation model. Notably, compared with V1, this version produces much finer and more robust depth predictions through three key practices: 1) replacing all labeled real images with synthetic images, 2) scaling up the capacity of our teacher model, and 3) teaching student models via the bridge of large-scale pseudo-labeled real images. Compared with the latest models built on Stable Diffusion, our models are significantly more efficient (more than 10x faster) and more accurate. We offer models of different scales (ranging from 25M to 1.3B params) to support extensive scenarios. Benefiting from their strong generalization capability, we fine-tune them with metric depth labels to obtain our metric depth models. In addition to our models, considering the limited diversity and frequent noise in current test sets, we construct a versatile evaluation benchmark with precise annotations and diverse scenes to facilitate future research., Comment: Accepted by NeurIPS 2024. Project page: https://depth-anything-v2.github.io
Published: 2024

11. Improving Token-Based World Models with Parallel Observation Prediction

Author: Cohen, Lior, Wang, Kaixin, Kang, Bingyi, and Mannor, Shie
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence
Abstract: Motivated by the success of Transformers when applied to sequences of discrete symbols, token-based world models (TBWMs) were recently proposed as sample-efficient methods. In TBWMs, the world model consumes agent experience as a language-like sequence of tokens, where each observation constitutes a sub-sequence. However, during imagination, the sequential token-by-token generation of next observations results in a severe bottleneck, leading to long training times, poor GPU utilization, and limited representations. To resolve this bottleneck, we devise a novel Parallel Observation Prediction (POP) mechanism. POP augments a Retentive Network (RetNet) with a novel forward mode tailored to our reinforcement learning setting. We incorporate POP in a novel TBWM agent named REM (Retentive Environment Model), showcasing a 15.4x faster imagination compared to prior TBWMs. REM attains superhuman performance on 12 out of 26 games of the Atari 100K benchmark, while training in less than 12 hours. Our code is available at \url{https://github.com/leor-c/REM}.
Published: 2024

12. Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data

Author: Yang, Lihe, Kang, Bingyi, Huang, Zilong, Xu, Xiaogang, Feng, Jiashi, and Zhao, Hengshuang
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: This work presents Depth Anything, a highly practical solution for robust monocular depth estimation. Without pursuing novel technical modules, we aim to build a simple yet powerful foundation model dealing with any images under any circumstances. To this end, we scale up the dataset by designing a data engine to collect and automatically annotate large-scale unlabeled data (~62M), which significantly enlarges the data coverage and thus is able to reduce the generalization error. We investigate two simple yet effective strategies that make data scaling-up promising. First, a more challenging optimization target is created by leveraging data augmentation tools. It compels the model to actively seek extra visual knowledge and acquire robust representations. Second, an auxiliary supervision is developed to enforce the model to inherit rich semantic priors from pre-trained encoders. We evaluate its zero-shot capabilities extensively, including six public datasets and randomly captured photos. It demonstrates impressive generalization ability. Further, through fine-tuning it with metric depth information from NYUv2 and KITTI, new SOTAs are set. Our better depth model also results in a better depth-conditioned ControlNet. Our models are released at https://github.com/LiheYoung/Depth-Anything., Comment: Accepted by CVPR 2024. Project page: https://depth-anything.github.io
Published: 2024

13. Harnessing Diffusion Models for Visual Perception with Meta Prompts

Author: Wan, Qiang, Huang, Zilong, Kang, Bingyi, Feng, Jiashi, and Zhang, Li
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: The issue of generative pretraining for vision models has persisted as a long-standing conundrum. At present, the text-to-image (T2I) diffusion model demonstrates remarkable proficiency in generating high-definition images matching textual inputs, a feat made possible through its pre-training on large-scale image-text pairs. This leads to a natural inquiry: can diffusion models be utilized to tackle visual perception tasks? In this paper, we propose a simple yet effective scheme to harness a diffusion model for visual perception tasks. Our key insight is to introduce learnable embeddings (meta prompts) to the pre-trained diffusion models to extract proper features for perception. The effect of meta prompts are two-fold. First, as a direct replacement of the text embeddings in the T2I models, it can activate task-relevant features during feature extraction. Second, it will be used to re-arrange the extracted features to ensures that the model focuses on the most pertinent features for the task on hand. Additionally, we design a recurrent refinement training strategy that fully leverages the property of diffusion models, thereby yielding stronger visual features. Extensive experiments across various benchmarks validate the effectiveness of our approach. Our approach achieves new performance records in depth estimation tasks on NYU depth V2 and KITTI, and in semantic segmentation task on CityScapes. Concurrently, the proposed method attains results comparable to the current state-of-the-art in semantic segmentation on ADE20K and pose estimation on COCO datasets, further exemplifying its robustness and versatility.
Published: 2023

14. FreeMask: Synthetic Images with Dense Annotations Make Stronger Segmentation Models

Author: Yang, Lihe, Xu, Xiaogang, Kang, Bingyi, Shi, Yinghuan, and Zhao, Hengshuang
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Semantic segmentation has witnessed tremendous progress due to the proposal of various advanced network architectures. However, they are extremely hungry for delicate annotations to train, and the acquisition is laborious and unaffordable. Therefore, we present FreeMask in this work, which resorts to synthetic images from generative models to ease the burden of both data collection and annotation procedures. Concretely, we first synthesize abundant training images conditioned on the semantic masks provided by realistic datasets. This yields extra well-aligned image-mask training pairs for semantic segmentation models. We surprisingly observe that, solely trained with synthetic images, we already achieve comparable performance with real ones (e.g., 48.3 vs. 48.5 mIoU on ADE20K, and 49.3 vs. 50.5 on COCO-Stuff). Then, we investigate the role of synthetic images by joint training with real images, or pre-training for real images. Meantime, we design a robust filtering principle to suppress incorrectly synthesized regions. In addition, we propose to inequally treat different semantic masks to prioritize those harder ones and sample more corresponding synthetic images for them. As a result, either jointly trained or pre-trained with our filtered and re-sampled synthesized images, segmentation models can be greatly enhanced, e.g., from 48.7 to 52.0 on ADE20K. Code is available at https://github.com/LiheYoung/FreeMask., Comment: Accepted by NeurIPS 2023
Published: 2023

15. Understanding, Predicting and Better Resolving Q-Value Divergence in Offline-RL

Author: Yue, Yang, Lu, Rui, Kang, Bingyi, Song, Shiji, and Huang, Gao
Subjects: Computer Science - Machine Learning
Abstract: The divergence of the Q-value estimation has been a prominent issue in offline RL, where the agent has no access to real dynamics. Traditional beliefs attribute this instability to querying out-of-distribution actions when bootstrapping value targets. Though this issue can be alleviated with policy constraints or conservative Q estimation, a theoretical understanding of the underlying mechanism causing the divergence has been absent. In this work, we aim to thoroughly comprehend this mechanism and attain an improved solution. We first identify a fundamental pattern, self-excitation, as the primary cause of Q-value estimation divergence in offline RL. Then, we propose a novel Self-Excite Eigenvalue Measure (SEEM) metric based on Neural Tangent Kernel (NTK) to measure the evolving property of Q-network at training, which provides an intriguing explanation of the emergence of divergence. For the first time, our theory can reliably decide whether the training will diverge at an early stage, and even predict the order of the growth for the estimated Q-value, the model's norm, and the crashing step when an SGD optimizer is used. The experiments demonstrate perfect alignment with this theoretic analysis. Building on our insights, we propose to resolve divergence from a novel perspective, namely improving the model's architecture for better extrapolating behavior. Through extensive empirical studies, we identify LayerNorm as a good solution to effectively avoid divergence without introducing detrimental bias, leading to superior performance. Experimental results prove that it can still work in some most challenging settings, i.e. using only 1 transitions of the dataset, where all previous methods fail. Moreover, it can be easily plugged into modern offline RL methods and achieve SOTA results on many challenging tasks. We also give unique insights into its effectiveness., Comment: 31 pages, 20 figures
Published: 2023

16. BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs

Author: Zhao, Yang, Lin, Zhijie, Zhou, Daquan, Huang, Zilong, Feng, Jiashi, and Kang, Bingyi
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: LLMs have demonstrated remarkable abilities at interacting with humans through language, especially with the usage of instruction-following data. Recent advancements in LLMs, such as MiniGPT-4, LLaVA, and X-LLM, further enlarge their abilities by incorporating multi-modal inputs, including image, video, and speech. Despite their effectiveness at generating precise and detailed language understanding of the given modality signal, these LLMs give up the ability to ground specific parts of inputs, thus only constructing a coarse-grained mapping. However, explicit and informative correspondence between text and other modalities will not only improve the user experience but also help to expand the application scenario of multi-modal LLMs. Therefore, we propose BuboGPT, a multi-modal LLM with visual grounding that can perform cross-modal interaction between vision, audio and language, providing fine-grained understanding of visual objects and other given modalities. As a result, BuboGPT is able to point out the specific location of an object in the image, when it is generating response or description for that object. Our contributions are two-fold: 1) An off-the-shelf visual grounding module based on SAM that extracts entities in a sentence and find corresponding masks in the image. 2) A two-stage training scheme and instruction dataset to endow joint text-image-audio understanding. Our experiments show that BuboGPT achieves impressive multi-modality understanding and visual grounding abilities during the interaction with human. It performs consistently well when provided by arbitrary modality combinations (either aligned or unaligned). Our code, model and dataset are available at https://bubo-gpt.github.io .
Published: 2023

17. Decoupled Prioritized Resampling for Offline RL

Author: Yue, Yang, Kang, Bingyi, Ma, Xiao, Yang, Qisen, Huang, Gao, Song, Shiji, and Yan, Shuicheng
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence
Abstract: Offline reinforcement learning (RL) is challenged by the distributional shift problem. To address this problem, existing works mainly focus on designing sophisticated policy constraints between the learned policy and the behavior policy. However, these constraints are applied equally to well-performing and inferior actions through uniform sampling, which might negatively affect the learned policy. To alleviate this issue, we propose Offline Prioritized Experience Replay (OPER), featuring a class of priority functions designed to prioritize highly-rewarding transitions, making them more frequently visited during training. Through theoretical analysis, we show that this class of priority functions induce an improved behavior policy, and when constrained to this improved policy, a policy-constrained offline RL algorithm is likely to yield a better solution. We develop two practical strategies to obtain priority weights by estimating advantages based on a fitted value network (OPER-A) or utilizing trajectory returns (OPER-R) for quick computation. OPER is a plug-and-play component for offline RL algorithms. As case studies, we evaluate OPER on five different algorithms, including BC, TD3+BC, Onestep RL, CQL, and IQL. Extensive experiments demonstrate that both OPER-A and OPER-R significantly improve the performance for all baseline methods. Codes and priority weights are availiable at https://github.com/sail-sg/OPER., Comment: published on IEEE TNNLS
Published: 2023

18. Improving and Benchmarking Offline Reinforcement Learning Algorithms

Author: Kang, Bingyi, Ma, Xiao, Wang, Yirui, Yue, Yang, and Yan, Shuicheng
Subjects: Computer Science - Machine Learning
Abstract: Recently, Offline Reinforcement Learning (RL) has achieved remarkable progress with the emergence of various algorithms and datasets. However, these methods usually focus on algorithmic advancements, ignoring that many low-level implementation choices considerably influence or even drive the final performance. As a result, it becomes hard to attribute the progress in Offline RL as these choices are not sufficiently discussed and aligned in the literature. In addition, papers focusing on a dataset (e.g., D4RL) often ignore algorithms proposed on another dataset (e.g., RL Unplugged), causing isolation among the algorithms, which might slow down the overall progress. Therefore, this work aims to bridge the gaps caused by low-level choices and datasets. To this end, we empirically investigate 20 implementation choices using three representative algorithms (i.e., CQL, CRR, and IQL) and present a guidebook for choosing implementations. Following the guidebook, we find two variants CRR+ and CQL+ , achieving new state-of-the-art on D4RL. Moreover, we benchmark eight popular offline RL algorithms across datasets under unified training and evaluation framework. The findings are inspiring: the success of a learning paradigm severely depends on the data distribution, and some previous conclusions are biased by the dataset used. Our code is available at https://github.com/sail-sg/offbench., Comment: preprint
Published: 2023

19. Efficient Diffusion Policies for Offline Reinforcement Learning

Author: Kang, Bingyi, Ma, Xiao, Du, Chao, Pang, Tianyu, and Yan, Shuicheng
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence
Abstract: Offline reinforcement learning (RL) aims to learn optimal policies from offline datasets, where the parameterization of policies is crucial but often overlooked. Recently, Diffsuion-QL significantly boosts the performance of offline RL by representing a policy with a diffusion model, whose success relies on a parametrized Markov Chain with hundreds of steps for sampling. However, Diffusion-QL suffers from two critical limitations. 1) It is computationally inefficient to forward and backward through the whole Markov chain during training. 2) It is incompatible with maximum likelihood-based RL algorithms (e.g., policy gradient methods) as the likelihood of diffusion models is intractable. Therefore, we propose efficient diffusion policy (EDP) to overcome these two challenges. EDP approximately constructs actions from corrupted ones at training to avoid running the sampling chain. We conduct extensive experiments on the D4RL benchmark. The results show that EDP can reduce the diffusion policy training time from 5 days to 5 hours on gym-locomotion tasks. Moreover, we show that EDP is compatible with various offline RL algorithms (TD3, CRR, and IQL) and achieves new state-of-the-art on D4RL by large margins over previous methods. Our code is available at https://github.com/sail-sg/edp., Comment: Accepted by NeurIPS 2023
Published: 2023

20. MADiff: Offline Multi-agent Learning with Diffusion Models

Author: Zhu, Zhengbang, Liu, Minghuan, Mao, Liyuan, Kang, Bingyi, Xu, Minkai, Yu, Yong, Ermon, Stefano, and Zhang, Weinan
Subjects: Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: Offline reinforcement learning (RL) aims to learn policies from pre-existing datasets without further interactions, making it a challenging task. Q-learning algorithms struggle with extrapolation errors in offline settings, while supervised learning methods are constrained by model expressiveness. Recently, diffusion models (DMs) have shown promise in overcoming these limitations in single-agent learning, but their application in multi-agent scenarios remains unclear. Generating trajectories for each agent with independent DMs may impede coordination, while concatenating all agents' information can lead to low sample efficiency. Accordingly, we propose MADiff, which is realized with an attention-based diffusion model to model the complex coordination among behaviors of multiple agents. To our knowledge, MADiff is the first diffusion-based multi-agent learning framework, functioning as both a decentralized policy and a centralized controller. During decentralized executions, MADiff simultaneously performs teammate modeling, and the centralized controller can also be applied in multi-agent trajectory predictions. Our experiments demonstrate that MADiff outperforms baseline algorithms across various multi-agent learning tasks, highlighting its effectiveness in modeling complex multi-agent interactions. Our code is available at https://github.com/zbzhu99/madiff., Comment: 30 pages, 10 figures, 9 tables. Published at NeurIPS 2024
Published: 2023

21. Bag of Tricks for Training Data Extraction from Language Models

Author: Yu, Weichen, Pang, Tianyu, Liu, Qian, Du, Chao, Kang, Bingyi, Huang, Yan, Lin, Min, and Yan, Shuicheng
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Cryptography and Security, Computer Science - Machine Learning
Abstract: With the advance of language models, privacy protection is receiving more attention. Training data extraction is therefore of great importance, as it can serve as a potential tool to assess privacy leakage. However, due to the difficulty of this task, most of the existing methods are proof-of-concept and still not effective enough. In this paper, we investigate and benchmark tricks for improving training data extraction using a publicly available dataset. Because most existing extraction methods use a pipeline of generating-then-ranking, i.e., generating text candidates as potential training data and then ranking them based on specific criteria, our research focuses on the tricks for both text generation (e.g., sampling strategy) and text ranking (e.g., token-level criteria). The experimental results show that several previously overlooked tricks can be crucial to the success of training data extraction. Based on the GPT-Neo 1.3B evaluation results, our proposed tricks outperform the baseline by a large margin in most cases, providing a much stronger baseline for future research. The code is available at https://github.com/weichen-yu/LM-Extraction., Comment: ICML 2023
Published: 2023

22. Boosting Offline Reinforcement Learning via Data Rebalancing

Author: Yue, Yang, Kang, Bingyi, Ma, Xiao, Xu, Zhongwen, Huang, Gao, and Yan, Shuicheng
Subjects: Computer Science - Machine Learning
Abstract: Offline reinforcement learning (RL) is challenged by the distributional shift between learning policies and datasets. To address this problem, existing works mainly focus on designing sophisticated algorithms to explicitly or implicitly constrain the learned policy to be close to the behavior policy. The constraint applies not only to well-performing actions but also to inferior ones, which limits the performance upper bound of the learned policy. Instead of aligning the densities of two distributions, aligning the supports gives a relaxed constraint while still being able to avoid out-of-distribution actions. Therefore, we propose a simple yet effective method to boost offline RL algorithms based on the observation that resampling a dataset keeps the distribution support unchanged. More specifically, we construct a better behavior policy by resampling each transition in an old dataset according to its episodic return. We dub our method ReD (Return-based Data Rebalance), which can be implemented with less than 10 lines of code change and adds negligible running time. Extensive experiments demonstrate that ReD is effective at boosting offline RL performance and orthogonal to decoupling strategies in long-tailed classification. New state-of-the-arts are achieved on the D4RL benchmark., Comment: 8 pages, 2 figures
Published: 2022

23. Mutual Information Regularized Offline Reinforcement Learning

Author: Ma, Xiao, Kang, Bingyi, Xu, Zhongwen, Lin, Min, and Yan, Shuicheng
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence
Abstract: The major challenge of offline RL is the distribution shift that appears when out-of-distribution actions are queried, which makes the policy improvement direction biased by extrapolation errors. Most existing methods address this problem by penalizing the policy or value for deviating from the behavior policy during policy improvement or evaluation. In this work, we propose a novel MISA framework to approach offline RL from the perspective of Mutual Information between States and Actions in the dataset by directly constraining the policy improvement direction. MISA constructs lower bounds of mutual information parameterized by the policy and Q-values. We show that optimizing this lower bound is equivalent to maximizing the likelihood of a one-step improved policy on the offline dataset. Hence, we constrain the policy improvement direction to lie in the data manifold. The resulting algorithm simultaneously augments the policy evaluation and improvement by adding mutual information regularizations. MISA is a general framework that unifies conservative Q-learning (CQL) and behavior regularization methods (e.g., TD3+BC) as special cases. We introduce 3 different variants of MISA, and empirically demonstrate that tighter mutual information lower bound gives better offline RL performance. In addition, our extensive experiments show MISA significantly outperforms a wide range of baselines on various tasks of the D4RL benchmark,e.g., achieving 742.9 total points on gym-locomotion tasks. Our code is available at https://github.com/sail-sg/MISA., Comment: NeurIPS 2023
Published: 2022

24. Value-Consistent Representation Learning for Data-Efficient Reinforcement Learning

Author: Yue, Yang, Kang, Bingyi, Xu, Zhongwen, Huang, Gao, and Yan, Shuicheng
Subjects: Computer Science - Machine Learning
Abstract: Deep reinforcement learning (RL) algorithms suffer severe performance degradation when the interaction data is scarce, which limits their real-world application. Recently, visual representation learning has been shown to be effective and promising for boosting sample efficiency in RL. These methods usually rely on contrastive learning and data augmentation to train a transition model for state prediction, which is different from how the model is used in RL--performing value-based planning. Accordingly, the learned representation by these visual methods may be good for recognition but not optimal for estimating state value and solving the decision problem. To address this issue, we propose a novel method, called value-consistent representation learning (VCR), to learn representations that are directly related to decision-making. More specifically, VCR trains a model to predict the future state (also referred to as the ''imagined state'') based on the current one and a sequence of actions. Instead of aligning this imagined state with a real state returned by the environment, VCR applies a $Q$-value head on both states and obtains two distributions of action values. Then a distance is computed and minimized to force the imagined state to produce a similar action value prediction as that by the real state. We develop two implementations of the above idea for the discrete and continuous action spaces respectively. We conduct experiments on Atari 100K and DeepMind Control Suite benchmarks to validate their effectiveness for improving sample efficiency. It has been demonstrated that our methods achieve new state-of-the-art performance for search-free RL algorithms., Comment: 13 pages, 4 figures
Published: 2022

25. Deceptive evidence detection of belief functions based on reinforcement learning in partial label environment

Author: Chang, Yuhang, Pan, Junhao, Zhao, Xuan, and Kang, Bingyi
Published: 2024
Full Text: View/download PDF

26. A transformation-based Z-number distance measure using belief functions and ERP method

Author: Zhang, Yunjia, Tang, Yaxian, Xu, Yongzhuo, Zhang, Huimin, and Kang, Bingyi
Published: 2025
Full Text: View/download PDF

27. A new approximate belief mutual information derived from the developed fuzzy mutual information considering belief negation

Author: Zhao, Xuan, Tang, Yaxian, and Kang, Bingyi
Published: 2024
Full Text: View/download PDF

28. Evaluate the competence of fuzzy preference using statistical conflict in the frame of DS evidence theory

Author: Zhang, Huimin, Chang, Yuhang, Zhang, Yunjia, and Kang, Bingyi
Published: 2024
Full Text: View/download PDF

29. Deep Long-Tailed Learning: A Survey

Author: Zhang, Yifan, Kang, Bingyi, Hooi, Bryan, Yan, Shuicheng, and Feng, Jiashi
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Deep long-tailed learning, one of the most challenging problems in visual recognition, aims to train well-performing deep models from a large number of images that follow a long-tailed class distribution. In the last decade, deep learning has emerged as a powerful recognition model for learning high-quality image representations and has led to remarkable breakthroughs in generic visual recognition. However, long-tailed class imbalance, a common problem in practical visual recognition tasks, often limits the practicality of deep network based recognition models in real-world applications, since they can be easily biased towards dominant classes and perform poorly on tail classes. To address this problem, a large number of studies have been conducted in recent years, making promising progress in the field of deep long-tailed learning. Considering the rapid evolution of this field, this paper aims to provide a comprehensive survey on recent advances in deep long-tailed learning. To be specific, we group existing deep long-tailed learning studies into three main categories (i.e., class re-balancing, information augmentation and module improvement), and review these methods following this taxonomy in detail. Afterward, we empirically analyze several state-of-the-art methods by evaluating to what extent they address the issue of class imbalance via a newly proposed evaluation metric, i.e., relative accuracy. We conclude the survey by highlighting important applications of deep long-tailed learning and identifying several promising directions for future research., Comment: Published in IEEE Transactions on Pattern Analysis and Machine Intelligence
Published: 2021

30. A novel belief Rényi divergence based on belief and plausibility function and its applications in multi-source data fusion

Author: Jin, Xiaofei, Chang, Yuhang, Zhang, Huimin, Kang, Bingyi, and Zhang, Jianfeng
Published: 2024
Full Text: View/download PDF

31. Evaluating the reliability and relative weight of the evidence using approximate evidential mutual information

Author: Zhao, Xuan, Zhang, Mengyi, Xiao, Zilong, and Kang, Bingyi
Published: 2024
Full Text: View/download PDF

32. Refiner: Refining Self-attention for Vision Transformers

Author: Zhou, Daquan, Shi, Yujun, Kang, Bingyi, Yu, Weihao, Jiang, Zihang, Li, Yuan, Jin, Xiaojie, Hou, Qibin, and Feng, Jiashi
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Vision Transformers (ViTs) have shown competitive accuracy in image classification tasks compared with CNNs. Yet, they generally require much more data for model pre-training. Most of recent works thus are dedicated to designing more complex architectures or training methods to address the data-efficiency issue of ViTs. However, few of them explore improving the self-attention mechanism, a key factor distinguishing ViTs from CNNs. Different from existing works, we introduce a conceptually simple scheme, called refiner, to directly refine the self-attention maps of ViTs. Specifically, refiner explores attention expansion that projects the multi-head attention maps to a higher-dimensional space to promote their diversity. Further, refiner applies convolutions to augment local patterns of the attention maps, which we show is equivalent to a distributed local attention features are aggregated locally with learnable kernels and then globally aggregated with self-attention. Extensive experiments demonstrate that refiner works surprisingly well. Significantly, it enables ViTs to achieve 86% top-1 classification accuracy on ImageNet with only 81M parameters.
Published: 2021

33. DeepViT: Towards Deeper Vision Transformer

Author: Zhou, Daquan, Kang, Bingyi, Jin, Xiaojie, Yang, Linjie, Lian, Xiaochen, Jiang, Zihang, Hou, Qibin, and Feng, Jiashi
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Vision transformers (ViTs) have been successfully applied in image classification tasks recently. In this paper, we show that, unlike convolution neural networks (CNNs)that can be improved by stacking more convolutional layers, the performance of ViTs saturate fast when scaled to be deeper. More specifically, we empirically observe that such scaling difficulty is caused by the attention collapse issue: as the transformer goes deeper, the attention maps gradually become similar and even much the same after certain layers. In other words, the feature maps tend to be identical in the top layers of deep ViT models. This fact demonstrates that in deeper layers of ViTs, the self-attention mechanism fails to learn effective concepts for representation learning and hinders the model from getting expected performance gain. Based on above observation, we propose a simple yet effective method, named Re-attention, to re-generate the attention maps to increase their diversity at different layers with negligible computation and memory cost. The pro-posed method makes it feasible to train deeper ViT models with consistent performance improvements via minor modification to existing ViT models. Notably, when training a deep ViT model with 32 transformer blocks, the Top-1 classification accuracy can be improved by 1.6% on ImageNet. Code is publicly available at https://github.com/zhoudaquan/dvit_repo.
Published: 2021

34. Deceptive evidence detection in information fusion of belief functions based on reinforcement learning

Author: Kang, Bingyi and Zhao, Chunjiang
Published: 2024
Full Text: View/download PDF

35. Improving Generalization in Reinforcement Learning with Mixture Regularization

Author: Wang, Kaixin, Kang, Bingyi, Shao, Jie, and Feng, Jiashi
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Statistics - Machine Learning
Abstract: Deep reinforcement learning (RL) agents trained in a limited set of environments tend to suffer overfitting and fail to generalize to unseen testing environments. To improve their generalizability, data augmentation approaches (e.g. cutout and random convolution) are previously explored to increase the data diversity. However, we find these approaches only locally perturb the observations regardless of the training environments, showing limited effectiveness on enhancing the data diversity and the generalization performance. In this work, we introduce a simple approach, named mixreg, which trains agents on a mixture of observations from different training environments and imposes linearity constraints on the observation interpolations and the supervision (e.g. associated reward) interpolations. Mixreg increases the data diversity more effectively and helps learn smoother policies. We verify its effectiveness on improving generalization by conducting extensive experiments on the large-scale Procgen benchmark. Results show mixreg outperforms the well-established baselines on unseen testing environments by a large margin. Mixreg is simple, effective and general. It can be applied to both policy-based and value-based RL algorithms. Code is available at https://github.com/kaixin96/mixreg ., Comment: NeurIPS 2020
Published: 2020

36. Few-shot Classification via Adaptive Attention

Author: Jiang, Zihang, Kang, Bingyi, Zhou, Kuangqi, and Feng, Jiashi
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Training a neural network model that can quickly adapt to a new task is highly desirable yet challenging for few-shot learning problems. Recent few-shot learning methods mostly concentrate on developing various meta-learning strategies from two aspects, namely optimizing an initial model or learning a distance metric. In this work, we propose a novel few-shot learning method via optimizing and fast adapting the query sample representation based on very few reference samples. To be specific, we devise a simple and efficient meta-reweighting strategy to adapt the sample representations and generate soft attention to refine the representation such that the relevant features from the query and support samples can be extracted for a better few-shot classification. Such an adaptive attention model is also able to explain what the classification model is looking for as the evidence for classification to some extent. As demonstrated experimentally, the proposed model achieves state-of-the-art classification results on various benchmark few-shot classification and fine-grained recognition datasets., Comment: 11 pages, 4 figures
Published: 2020

37. The Devil is in Classification: A Simple Framework for Long-tail Object Detection and Instance Segmentation

Author: Wang, Tao, Li, Yu, Kang, Bingyi, Li, Junnan, Liew, Junhao, Tang, Sheng, Hoi, Steven, and Feng, Jiashi
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Most existing object instance detection and segmentation models only work well on fairly balanced benchmarks where per-category training sample numbers are comparable, such as COCO. They tend to suffer performance drop on realistic datasets that are usually long-tailed. This work aims to study and address such open challenges. Specifically, we systematically investigate performance drop of the state-of-the-art two-stage instance segmentation model Mask R-CNN on the recent long-tail LVIS dataset, and unveil that a major cause is the inaccurate classification of object proposals. Based on such an observation, we first consider various techniques for improving long-tail classification performance which indeed enhance instance segmentation results. We then propose a simple calibration framework to more effectively alleviate classification head bias with a bi-level class balanced sampling approach. Without bells and whistles, it significantly boosts the performance of instance segmentation for tail classes on the recent LVIS dataset and our sampled COCO-LT dataset. Our analysis provides useful insights for solving long-tail instance detection and segmentation problems, and the straightforward \emph{SimCal} method can serve as a simple but strong baseline. With the method we have won the 2019 LVIS challenge. Codes and models are available at https://github.com/twangnh/SimCal., Comment: LVIS 2019 challenge winner, performance significantly improved after challenge submission, accepted at ECCV 2020
Published: 2020

38. Overcoming Classifier Imbalance for Long-tail Object Detection with Balanced Group Softmax

Author: Li, Yu, Wang, Tao, Kang, Bingyi, Tang, Sheng, Wang, Chunfeng, Li, Jintao, and Feng, Jiashi
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning, Statistics - Machine Learning
Abstract: Solving long-tail large vocabulary object detection with deep learning based models is a challenging and demanding task, which is however under-explored.In this work, we provide the first systematic analysis on the underperformance of state-of-the-art models in front of long-tail distribution. We find existing detection methods are unable to model few-shot classes when the dataset is extremely skewed, which can result in classifier imbalance in terms of parameter magnitude. Directly adapting long-tail classification models to detection frameworks can not solve this problem due to the intrinsic difference between detection and classification.In this work, we propose a novel balanced group softmax (BAGS) module for balancing the classifiers within the detection frameworks through group-wise training. It implicitly modulates the training process for the head and tail classes and ensures they are both sufficiently trained, without requiring any extra sampling for the instances from the tail classes.Extensive experiments on the very recent long-tail large vocabulary object recognition benchmark LVIS show that our proposed BAGS significantly improves the performance of detectors with various backbones and frameworks on both object detection and instance segmentation. It beats all state-of-the-art methods transferred from long-tail image classification and establishes new state-of-the-art.Code is available at https://github.com/FishYuLi/BalancedGroupSoftmax., Comment: CVPR 2020 (Oral). Code is available at https://github.com/FishYuLi/BalancedGroupSoftmax
Published: 2020

39. Evaluate the reliability of information sources using the non-parametric plausibility ReliefF algorithm for multi-source information fusion

Author: Zhang, Mengyi, Cui, Huizi, Tian, Xu, Kang, Bingyi, and Huang, Lvwen
Published: 2023
Full Text: View/download PDF

40. Classification Calibration for Long-tail Instance Segmentation

Author: Wang, Tao, Li, Yu, Kang, Bingyi, Li, Junnan, Liew, Jun Hao, Tang, Sheng, Hoi, Steven, and Feng, Jiashi
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Remarkable progress has been made in object instance detection and segmentation in recent years. However, existing state-of-the-art methods are mostly evaluated with fairly balanced and class-limited benchmarks, such as Microsoft COCO dataset [8]. In this report, we investigate the performance drop phenomenon of state-of-the-art two-stage instance segmentation models when processing extreme long-tail training data based on the LVIS [5] dataset, and find a major cause is the inaccurate classification of object proposals. Based on this observation, we propose to calibrate the prediction of classification head to improve recognition performance for the tail classes. Without much additional cost and modification of the detection model architecture, our calibration method improves the performance of the baseline by a large margin on the tail classes. Codes will be available. Importantly, after the submission, we find significant improvement can be further achieved by modifying the calibration head, which we will update later., Comment: This report presents our winning solution to LVIS 2019 challenge
Published: 2019

41. Exploring Simple and Transferable Recognition-Aware Image Processing

Author: Liu, Zhuang, Wang, Hung-Ju, Zhou, Tinghui, Shen, Zhiqiang, Kang, Bingyi, Shelhamer, Evan, and Darrell, Trevor
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: Recent progress in image recognition has stimulated the deployment of vision systems at an unprecedented scale. As a result, visual data are now often consumed not only by humans but also by machines. Existing image processing methods only optimize for better human perception, yet the resulting images may not be accurately recognized by machines. This can be undesirable, e.g., the images can be improperly handled by search engines or recommendation systems. In this work, we examine simple approaches to improve machine recognition of processed images: optimizing the recognition loss directly on the image processing network or through an intermediate input transformation model. Interestingly, the processing model's ability to enhance recognition quality can transfer when evaluated on models of different architectures, recognized categories, tasks and training datasets. This makes the methods applicable even when we do not have the knowledge of future recognition models, e.g., when uploading processed images to the Internet. We conduct experiments on multiple image processing tasks paired with ImageNet classification and PASCAL VOC detection as recognition tasks. With these simple yet effective methods, substantial accuracy gain can be achieved with strong transferability and minimal image quality loss. Through a user study we further show that the accuracy gain can transfer to a black-box cloud model. Finally, we try to explain this transferability phenomenon by demonstrating the similarities of different models' decision boundaries. Code is available at https://github.com/liuzhuang13/Transferable_RA ., Comment: IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)
Published: 2019

42. Decoupling Representation and Classifier for Long-Tailed Recognition

Author: Kang, Bingyi, Xie, Saining, Rohrbach, Marcus, Yan, Zhicheng, Gordo, Albert, Feng, Jiashi, and Kalantidis, Yannis
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: The long-tail distribution of the visual world poses great challenges for deep learning based classification models on how to handle the class imbalance problem. Existing solutions usually involve class-balancing strategies, e.g., by loss re-weighting, data re-sampling, or transfer learning from head- to tail-classes, but most of them adhere to the scheme of jointly learning representations and classifiers. In this work, we decouple the learning procedure into representation learning and classification, and systematically explore how different balancing strategies affect them for long-tailed recognition. The findings are surprising: (1) data imbalance might not be an issue in learning high-quality representations; (2) with representations learned with the simplest instance-balanced (natural) sampling, it is also possible to achieve strong long-tailed recognition ability by adjusting only the classifier. We conduct extensive experiments and set new state-of-the-art performance on common long-tailed benchmarks like ImageNet-LT, Places-LT and iNaturalist, showing that it is possible to outperform carefully designed losses, sampling strategies, even complex modules with memory, by using a straightforward approach that decouples representation and classification. Our code is available at https://github.com/facebookresearch/classifier-balancing.
Published: 2019

43. Regularization Matters in Policy Optimization

Author: Liu, Zhuang, Li, Xuanlin, Kang, Bingyi, and Darrell, Trevor
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Statistics - Machine Learning
Abstract: Deep Reinforcement Learning (Deep RL) has been receiving increasingly more attention thanks to its encouraging performance on a variety of control tasks. Yet, conventional regularization techniques in training neural networks (e.g., $L_2$ regularization, dropout) have been largely ignored in RL methods, possibly because agents are typically trained and evaluated in the same environment, and because the deep RL community focuses more on high-level algorithm designs. In this work, we present the first comprehensive study of regularization techniques with multiple policy optimization algorithms on continuous control tasks. Interestingly, we find conventional regularization techniques on the policy networks can often bring large improvement, especially on harder tasks. Our findings are shown to be robust against training hyperparameter variations. We also compare these techniques with the more widely used entropy regularization. In addition, we study regularizing different components and find that only regularizing the policy network is typically the best. We further analyze why regularization may help generalization in RL from four perspectives - sample complexity, reward distribution, weight norm, and noise robustness. We hope our study provides guidance for future practices in regularizing policy optimization algorithms. Our code is available at https://github.com/xuanlinli17/iclr2021_rlreg ., Comment: Published at ICLR 2021; please cite this paper's ICLR 2021 version at https://github.com/xuanlinli17/iclr2021_rlreg#citation or the arXiv version from "Export Bibtex Citation" on the right, instead of the "2019 OpenReview" version in Google Scholar. Thanks!
Published: 2019

44. A multi-criteria group decision-making method based on OWA aggregation operator and Z-numbers

Author: Cheng, Ruolan, Zhu, Ruonan, Tian, Ye, Kang, Bingyi, and Zhang, Jianfeng
Published: 2023
Full Text: View/download PDF

45. BGC: Belief gravitational clustering approach and its application in the counter-deception of belief functions

Author: Cui, Huizi, Zhang, Huaqing, Chang, Yuhang, and Kang, Bingyi
Published: 2023
Full Text: View/download PDF

46. Determine the number of unknown targets in the open world from the perspective of bidirectional analysis using Gap statistic and Isolation forest

Author: Cui, Huizi, Chang, Yuhang, Zhang, Huaqing, Mi, Xiangjun, and Kang, Bingyi
Published: 2023
Full Text: View/download PDF

47. The aggregation of Z-numbers based on overlap functions and grouping functions and its application on group decision-making

Author: Qi, Gongao, Li, Juanrui, Kang, Bingyi, and Yang, Bin
Published: 2023
Full Text: View/download PDF

48. A Novel Z-TOPSIS Method Based on Improved Distance Measure of Z-Numbers

Author: Cheng, Ruolan, Zhang, Jianfeng, and Kang, Bingyi
Published: 2022
Full Text: View/download PDF

49. Similarity R-C3D for Few-shot Temporal Activity Detection

Author: Xu, Huijuan, Kang, Bingyi, Sun, Ximeng, Feng, Jiashi, Saenko, Kate, and Darrell, Trevor
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Many activities of interest are rare events, with only a few labeled examples available. Therefore models for temporal activity detection which are able to learn from a few examples are desirable. In this paper, we present a conceptually simple and general yet novel framework for few-shot temporal activity detection which detects the start and end time of the few-shot input activities in an untrimmed video. Our model is end-to-end trainable and can benefit from more few-shot examples. At test time, each proposal is assigned the label of the few-shot activity class corresponding to the maximum similarity score. Our Similarity R-C3D method outperforms previous work on three large-scale benchmarks for temporal activity detection (THUMOS14, ActivityNet1.2, and ActivityNet1.3 datasets) in the few-shot setting. Our code will be made available.
Published: 2018

50. Few-shot Object Detection via Feature Reweighting

Author: Kang, Bingyi, Liu, Zhuang, Wang, Xin, Yu, Fisher, Feng, Jiashi, and Darrell, Trevor
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Conventional training of a deep CNN based object detector demands a large number of bounding box annotations, which may be unavailable for rare categories. In this work we develop a few-shot object detector that can learn to detect novel objects from only a few annotated examples. Our proposed model leverages fully labeled base classes and quickly adapts to novel classes, using a meta feature learner and a reweighting module within a one-stage detection architecture. The feature learner extracts meta features that are generalizable to detect novel object classes, using training data from base classes with sufficient samples. The reweighting module transforms a few support examples from the novel classes to a global vector that indicates the importance or relevance of meta features for detecting the corresponding objects. These two modules, together with a detection prediction module, are trained end-to-end based on an episodic few-shot learning scheme and a carefully designed loss function. Through extensive experiments we demonstrate that our model outperforms well-established baselines by a large margin for few-shot object detection, on multiple datasets and settings. We also present analysis on various aspects of our proposed model, aiming to provide some inspiration for future few-shot detection works., Comment: ICCV 2019
Published: 2018

Catalog

Books, media, physical & digital resources

See catalog results

Searchworks

Select search scope, currently: Articles Catalog books, media & more in Jio Institute collections Articles journal articles & other e-resources

Search

Search Constraints

Refine your results

Search Limiters

Topic

Publication Year Range

Language

Publication Type

Journal

Database

Publisher

257 results on '"Kang, Bingyi"'

Search Results

Catalog

Select search scope, currently: Articles

Catalog

books, media & more in Jio Institute collections

Articles

journal articles & other e-resources