Author: "Kira, Zsolt" / Publication Year Range: Last 50 years - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Kira, Zsolt"' showing total 320 results

Start Over Author "Kira, Zsolt" Publication Year Range Last 50 years

320 results on '"Kira, Zsolt"'

1. From Multimodal LLMs to Generalist Embodied Agents: Methods and Lessons

Author: Szot, Andrew, Mazoure, Bogdan, Attia, Omar, Timofeev, Aleksei, Agrawal, Harsh, Hjelm, Devon, Gan, Zhe, Kira, Zsolt, and Toshev, Alexander
Subjects: Computer Science - Machine Learning
Abstract: We examine the capability of Multimodal Large Language Models (MLLMs) to tackle diverse domains that extend beyond the traditional language and vision tasks these models are typically trained on. Specifically, our focus lies in areas such as Embodied AI, Games, UI Control, and Planning. To this end, we introduce a process of adapting an MLLM to a Generalist Embodied Agent (GEA). GEA is a single unified model capable of grounding itself across these varied domains through a multi-embodiment action tokenizer. GEA is trained with supervised learning on a large dataset of embodied experiences and with online RL in interactive simulators. We explore the data and algorithmic choices necessary to develop such a model. Our findings reveal the importance of training with cross-domain data and online RL for building generalist agents. The final GEA model achieves strong generalization performance to unseen tasks across diverse benchmarks compared to other generalist models and benchmark-specific approaches.
Published: 2024

2. Grounding Descriptions in Images informs Zero-Shot Visual Recognition

Author: Halbe, Shaunak, Tian, Junjiao, Joseph, K J, Smith, James Seale, Stevo, Katherine, Balasubramanian, Vineeth N, and Kira, Zsolt
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: Vision-language models (VLMs) like CLIP have been cherished for their ability to perform zero-shot visual recognition on open-vocabulary concepts. This is achieved by selecting the object category whose textual representation bears the highest similarity with the query image. While successful in some domains, this method struggles with identifying fine-grained entities as well as generalizing to unseen concepts that are not captured by the training distribution. Recent works attempt to mitigate these challenges by integrating category descriptions at test time, albeit yielding modest improvements. We attribute these limited gains to a fundamental misalignment between image and description representations, which is rooted in the pretraining structure of CLIP. In this paper, we propose GRAIN, a new pretraining strategy aimed at aligning representations at both fine and coarse levels simultaneously. Our approach learns to jointly ground textual descriptions in image regions along with aligning overarching captions with global image representations. To drive this pre-training, we leverage frozen Multimodal Large Language Models (MLLMs) to derive large-scale synthetic annotations. We demonstrate the enhanced zero-shot performance of our model compared to current state-of-the art methods across 11 diverse image classification datasets. Additionally, we introduce Products-2023, a newly curated, manually labeled dataset featuring novel concepts, and showcase our model's ability to recognize these concepts by benchmarking on it. Significant improvements achieved by our model on other downstream tasks like retrieval further highlight the superior quality of representations learned by our approach. Code available at https://github.com/shaunak27/grain-clip .
Published: 2024

3. Adversarial Attacks Using Differentiable Rendering: A Survey

Author: Hull, Matthew, Zhang, Chao, Kira, Zsolt, and Chau, Duen Horng
Subjects: Computer Science - Machine Learning, Computer Science - Cryptography and Security, Computer Science - Computer Vision and Pattern Recognition
Abstract: Differentiable rendering methods have emerged as a promising means for generating photo-realistic and physically plausible adversarial attacks by manipulating 3D objects and scenes that can deceive deep neural networks (DNNs). Recently, differentiable rendering capabilities have evolved significantly into a diverse landscape of libraries, such as Mitsuba, PyTorch3D, and methods like Neural Radiance Fields and 3D Gaussian Splatting for solving inverse rendering problems that share conceptually similar properties commonly used to attack DNNs, such as back-propagation and optimization. However, the adversarial machine learning research community has not yet fully explored or understood such capabilities for generating attacks. Some key reasons are that researchers often have different attack goals, such as misclassification or misdetection, and use different tasks to accomplish these goals by manipulating different representation in a scene, such as the mesh or texture of an object. This survey adopts a task-oriented unifying framework that systematically summarizes common tasks, such as manipulating textures, altering illumination, and modifying 3D meshes to exploit vulnerabilities in DNNs. Our framework enables easy comparison of existing works, reveals research gaps and spotlights exciting future research directions in this rapidly evolving field. Through focusing on how these tasks enable attacks on various DNNs such as image classification, facial recognition, object detection, optical flow and depth estimation, our survey helps researchers and practitioners better understand the vulnerabilities of computer vision systems against photorealistic adversarial attacks that could threaten real-world applications.
Published: 2024

4. Rethinking Weight Decay for Robust Fine-Tuning of Foundation Models

Author: Tian, Junjiao, Huang, Chengyue, and Kira, Zsolt
Subjects: Computer Science - Machine Learning, Computer Science - Computation and Language, Computer Science - Computer Vision and Pattern Recognition
Abstract: Modern optimizers such as AdamW, equipped with momentum and adaptive learning rate, are designed to escape local minima and explore the vast parameter space. This exploration is beneficial for finding good loss basins when training from scratch. It is not necessarily ideal when resuming from a powerful foundation model because it can lead to large deviations from the pre-trained initialization and, consequently, worse robustness and generalization. At the same time, strong regularization on all parameters can lead to under-fitting. We hypothesize that selectively regularizing the parameter space is the key to fitting and retraining the pre-trained knowledge. This paper proposes a new weight decay technique, Selective Projection Decay (SPD), that selectively imposes a strong penalty on certain layers while allowing others to change freely. Intuitively, SPD expands and contracts the parameter search space for layers with consistent and inconsistent loss reduction, respectively. Experimentally, when equipped with SPD, Adam consistently provides better in-distribution generalization and out-of-distribution robustness performance on multiple popular vision and language benchmarks. Code available at~\url{https://github.com/GT-RIPL/Selective-Projection-Decay.git}, Comment: Accepted to Neurips 2024
Published: 2024

5. Neural Fields in Robotics: A Survey

Author: Irshad, Muhammad Zubair, Comi, Mauro, Lin, Yen-Chen, Heppert, Nick, Valada, Abhinav, Ambrus, Rares, Kira, Zsolt, and Tremblay, Jonathan
Subjects: Computer Science - Robotics, Computer Science - Artificial Intelligence, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: Neural Fields have emerged as a transformative approach for 3D scene representation in computer vision and robotics, enabling accurate inference of geometry, 3D semantics, and dynamics from posed 2D data. Leveraging differentiable rendering, Neural Fields encompass both continuous implicit and explicit neural representations enabling high-fidelity 3D reconstruction, integration of multi-modal sensor data, and generation of novel viewpoints. This survey explores their applications in robotics, emphasizing their potential to enhance perception, planning, and control. Their compactness, memory efficiency, and differentiability, along with seamless integration with foundation and generative models, make them ideal for real-time applications, improving robot adaptability and decision-making. This paper provides a thorough review of Neural Fields in robotics, categorizing applications across various domains and evaluating their strengths and limitations, based on over 200 papers. First, we present four key Neural Fields frameworks: Occupancy Networks, Signed Distance Fields, Neural Radiance Fields, and Gaussian Splatting. Second, we detail Neural Fields' applications in five major robotics domains: pose estimation, manipulation, navigation, physics, and autonomous driving, highlighting key works and discussing takeaways and open challenges. Finally, we outline the current limitations of Neural Fields in robotics and propose promising directions for future research. Project page: https://robonerf.github.io, Comment: 20 pages, 20 figures. Project Page: https://robonerf.github.io
Published: 2024

6. ReLIC: A Recipe for 64k Steps of In-Context Reinforcement Learning for Embodied AI

Author: Elawady, Ahmad, Chhablani, Gunjan, Ramrakhya, Ram, Yadav, Karmesh, Batra, Dhruv, Kira, Zsolt, and Szot, Andrew
Subjects: Computer Science - Machine Learning
Abstract: Intelligent embodied agents need to quickly adapt to new scenarios by integrating long histories of experience into decision-making. For instance, a robot in an unfamiliar house initially wouldn't know the locations of objects needed for tasks and might perform inefficiently. However, as it gathers more experience, it should learn the layout of its environment and remember where objects are, allowing it to complete new tasks more efficiently. To enable such rapid adaptation to new tasks, we present ReLIC, a new approach for in-context reinforcement learning (RL) for embodied agents. With ReLIC, agents are capable of adapting to new environments using 64,000 steps of in-context experience with full attention while being trained through self-generated experience via RL. We achieve this by proposing a novel policy update scheme for on-policy RL called "partial updates'' as well as a Sink-KV mechanism that enables effective utilization of a long observation history for embodied agents. Our method outperforms a variety of meta-RL baselines in adapting to unseen houses in an embodied multi-object navigation task. In addition, we find that ReLIC is capable of few-shot imitation learning despite never being trained with expert demonstrations. We also provide a comprehensive analysis of ReLIC, highlighting that the combination of large-scale RL training, the proposed partial updates scheme, and the Sink-KV are essential for effective in-context learning. The code for ReLIC and all our experiments is at https://github.com/aielawady/relic
Published: 2024

7. Towards Open-World Mobile Manipulation in Homes: Lessons from the Neurips 2023 HomeRobot Open Vocabulary Mobile Manipulation Challenge

Author: Yenamandra, Sriram, Ramachandran, Arun, Khanna, Mukul, Yadav, Karmesh, Vakil, Jay, Melnik, Andrew, Büttner, Michael, Harz, Leon, Brown, Lyon, Nandi, Gora Chand, PS, Arjun, Yadav, Gaurav Kumar, Kala, Rahul, Haschke, Robert, Luo, Yang, Zhu, Jinxin, Han, Yansen, Lu, Bingyi, Gu, Xuan, Liu, Qinyuan, Zhao, Yaping, Ye, Qiting, Dou, Chenxiao, Chua, Yansong, Kuzma, Volodymyr, Humennyy, Vladyslav, Partsey, Ruslan, Francis, Jonathan, Chaplot, Devendra Singh, Chhablani, Gunjan, Clegg, Alexander, Gervet, Theophile, Jain, Vidhi, Ramrakhya, Ram, Szot, Andrew, Wang, Austin, Yang, Tsung-Yen, Edsinger, Aaron, Kemp, Charlie, Shah, Binit, Kira, Zsolt, Batra, Dhruv, Mottaghi, Roozbeh, Bisk, Yonatan, and Paxton, Chris
Subjects: Computer Science - Robotics, Computer Science - Computer Vision and Pattern Recognition
Abstract: In order to develop robots that can effectively serve as versatile and capable home assistants, it is crucial for them to reliably perceive and interact with a wide variety of objects across diverse environments. To this end, we proposed Open Vocabulary Mobile Manipulation as a key benchmark task for robotics: finding any object in a novel environment and placing it on any receptacle surface within that environment. We organized a NeurIPS 2023 competition featuring both simulation and real-world components to evaluate solutions to this task. Our baselines on the most challenging version of this task, using real perception in simulation, achieved only an 0.8% success rate; by the end of the competition, the best participants achieved an 10.8\% success rate, a 13x improvement. We observed that the most successful teams employed a variety of methods, yet two common threads emerged among the best solutions: enhancing error detection and recovery, and improving the integration of perception with decision-making processes. In this paper, we detail the results and methodologies used, both in simulation and real-world settings. We discuss the lessons learned and their implications for future research. Additionally, we compare performance in real and simulated environments, emphasizing the necessity for robust generalization to novel settings.
Published: 2024

8. Reinforcement Learning via Auxiliary Task Distillation

Author: Harish, Abhinav Narayan, Heck, Larry, Hanna, Josiah P., Kira, Zsolt, and Szot, Andrew
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Robotics
Abstract: We present Reinforcement Learning via Auxiliary Task Distillation (AuxDistill), a new method that enables reinforcement learning (RL) to perform long-horizon robot control problems by distilling behaviors from auxiliary RL tasks. AuxDistill achieves this by concurrently carrying out multi-task RL with auxiliary tasks, which are easier to learn and relevant to the main task. A weighted distillation loss transfers behaviors from these auxiliary tasks to solve the main task. We demonstrate that AuxDistill can learn a pixels-to-actions policy for a challenging multi-stage embodied object rearrangement task from the environment reward without demonstrations, a learning curriculum, or pre-trained skills. AuxDistill achieves $2.3 \times$ higher success than the previous state-of-the-art baseline in the Habitat Object Rearrangement benchmark and outperforms methods that use pre-trained skills and expert demonstrations.
Published: 2024

9. ICE-G: Image Conditional Editing of 3D Gaussian Splats

Author: Jaganathan, Vishnu, Huang, Hannah Hanyun, Irshad, Muhammad Zubair, Jampani, Varun, Raj, Amit, and Kira, Zsolt
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: Recently many techniques have emerged to create high quality 3D assets and scenes. When it comes to editing of these objects, however, existing approaches are either slow, compromise on quality, or do not provide enough customization. We introduce a novel approach to quickly edit a 3D model from a single reference view. Our technique first segments the edit image, and then matches semantically corresponding regions across chosen segmented dataset views using DINO features. A color or texture change from a particular region of the edit image can then be applied to other views automatically in a semantically sensible manner. These edited views act as an updated dataset to further train and re-style the 3D scene. The end-result is therefore an edited 3D model. Our framework enables a wide variety of editing tasks such as manual local edits, correspondence based style transfer from any example image, and a combination of different styles from multiple example images. We use Gaussian Splats as our primary 3D representation due to their speed and ease of local editing, but our technique works for other methods such as NeRFs as well. We show through multiple examples that our method produces higher quality results while offering fine-grained control of editing. Project page: ice-gaussian.github.io, Comment: Accepted to CVPR AI4CC Workshop 2024. Project page: https://ice-gaussian.github.io
Published: 2024

10. Grounding Multimodal Large Language Models in Actions

Author: Szot, Andrew, Mazoure, Bogdan, Agrawal, Harsh, Hjelm, Devon, Kira, Zsolt, and Toshev, Alexander
Subjects: Computer Science - Machine Learning
Abstract: Multimodal Large Language Models (MLLMs) have demonstrated a wide range of capabilities across many domains, including Embodied AI. In this work, we study how to best ground a MLLM into different embodiments and their associated action spaces, with the goal of leveraging the multimodal world knowledge of the MLLM. We first generalize a number of methods through a unified architecture and the lens of action space adaptors. For continuous actions, we show that a learned tokenization allows for sufficient modeling precision, yielding the best performance on downstream tasks. For discrete actions, we demonstrate that semantically aligning these actions with the native output token space of the MLLM leads to the strongest performance. We arrive at these lessons via a thorough study of seven action space adapters on five different environments, encompassing over 114 embodied tasks.
Published: 2024

11. Pre-trained Text-to-Image Diffusion Models Are Versatile Representation Learners for Control

Author: Gupta, Gunshi, Yadav, Karmesh, Gal, Yarin, Batra, Dhruv, Kira, Zsolt, Lu, Cong, and Rudner, Tim G. J.
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Machine Learning, Computer Science - Robotics, Statistics - Machine Learning
Abstract: Embodied AI agents require a fine-grained understanding of the physical world mediated through visual and language inputs. Such capabilities are difficult to learn solely from task-specific data. This has led to the emergence of pre-trained vision-language models as a tool for transferring representations learned from internet-scale data to downstream tasks and new domains. However, commonly used contrastively trained representations such as in CLIP have been shown to fail at enabling embodied agents to gain a sufficiently fine-grained scene understanding -- a capability vital for control. To address this shortcoming, we consider representations from pre-trained text-to-image diffusion models, which are explicitly optimized to generate images from text prompts and as such, contain text-conditioned representations that reflect highly fine-grained visuo-spatial information. Using pre-trained text-to-image diffusion models, we construct Stable Control Representations which allow learning downstream control policies that generalize to complex, open-ended environments. We show that policies learned using Stable Control Representations are competitive with state-of-the-art representation learning approaches across a broad range of simulated control settings, encompassing challenging manipulation and navigation tasks. Most notably, we show that Stable Control Representations enable learning policies that exhibit state-of-the-art performance on OVMM, a difficult open-vocabulary navigation benchmark.
Published: 2024

12. Adaptive Memory Replay for Continual Learning

Author: Smith, James Seale, Valkov, Lazar, Halbe, Shaunak, Gutta, Vyshnavi, Feris, Rogerio, Kira, Zsolt, and Karlinsky, Leonid
Subjects: Computer Science - Machine Learning, Computer Science - Computation and Language, Computer Science - Computer Vision and Pattern Recognition
Abstract: Foundation Models (FMs) have become the hallmark of modern AI, however, these models are trained on massive data, leading to financially expensive training. Updating FMs as new data becomes available is important, however, can lead to `catastrophic forgetting', where models underperform on tasks related to data sub-populations observed too long ago. This continual learning (CL) phenomenon has been extensively studied, but primarily in a setting where only a small amount of past data can be stored. We advocate for the paradigm where memory is abundant, allowing us to keep all previous data, but computational resources are limited. In this setting, traditional replay-based CL approaches are outperformed by a simple baseline which replays past data selected uniformly at random, indicating that this setting necessitates a new approach. We address this by introducing a framework of adaptive memory replay for continual learning, where sampling of past data is phrased as a multi-armed bandit problem. We utilize Bolzmann sampling to derive a method which dynamically selects past data for training conditioned on the current task, assuming full data access and emphasizing training efficiency. Through extensive evaluations on both vision and language pre-training tasks, we demonstrate the effectiveness of our approach, which maintains high performance while reducing forgetting by up to 10% at no training efficiency cost., Comment: CVPR-W 2024 (Spotlight)
Published: 2024

13. GOAT-Bench: A Benchmark for Multi-Modal Lifelong Navigation

Author: Khanna, Mukul, Ramrakhya, Ram, Chhablani, Gunjan, Yenamandra, Sriram, Gervet, Theophile, Chang, Matthew, Kira, Zsolt, Chaplot, Devendra Singh, Batra, Dhruv, and Mottaghi, Roozbeh
Subjects: Computer Science - Artificial Intelligence, Computer Science - Robotics
Abstract: The Embodied AI community has made significant strides in visual navigation tasks, exploring targets from 3D coordinates, objects, language descriptions, and images. However, these navigation models often handle only a single input modality as the target. With the progress achieved so far, it is time to move towards universal navigation models capable of handling various goal types, enabling more effective user interaction with robots. To facilitate this goal, we propose GOAT-Bench, a benchmark for the universal navigation task referred to as GO to AnyThing (GOAT). In this task, the agent is directed to navigate to a sequence of targets specified by the category name, language description, or image in an open-vocabulary fashion. We benchmark monolithic RL and modular methods on the GOAT task, analyzing their performance across modalities, the role of explicit and implicit scene memories, their robustness to noise in goal specifications, and the impact of memory in lifelong scenarios.
Published: 2024

14. NeRF-MAE: Masked AutoEncoders for Self-Supervised 3D Representation Learning for Neural Radiance Fields

Author: Irshad, Muhammad Zubair, Zakharov, Sergey, Guizilini, Vitor, Gaidon, Adrien, Kira, Zsolt, and Ambrus, Rares
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: Neural fields excel in computer vision and robotics due to their ability to understand the 3D visual world such as inferring semantics, geometry, and dynamics. Given the capabilities of neural fields in densely representing a 3D scene from 2D images, we ask the question: Can we scale their self-supervised pretraining, specifically using masked autoencoders, to generate effective 3D representations from posed RGB images. Owing to the astounding success of extending transformers to novel data modalities, we employ standard 3D Vision Transformers to suit the unique formulation of NeRFs. We leverage NeRF's volumetric grid as a dense input to the transformer, contrasting it with other 3D representations such as pointclouds where the information density can be uneven, and the representation is irregular. Due to the difficulty of applying masked autoencoders to an implicit representation, such as NeRF, we opt for extracting an explicit representation that canonicalizes scenes across domains by employing the camera trajectory for sampling. Our goal is made possible by masking random patches from NeRF's radiance and density grid and employing a standard 3D Swin Transformer to reconstruct the masked patches. In doing so, the model can learn the semantic and spatial structure of complete scenes. We pretrain this representation at scale on our proposed curated posed-RGB data, totaling over 1.8 million images. Once pretrained, the encoder is used for effective 3D transfer learning. Our novel self-supervised pretraining for NeRFs, NeRF-MAE, scales remarkably well and improves performance on various challenging 3D tasks. Utilizing unlabeled posed 2D data for pretraining, NeRF-MAE significantly outperforms self-supervised 3D pretraining and NeRF scene understanding baselines on Front3D and ScanNet datasets with an absolute performance improvement of over 20% AP50 and 8% AP25 for 3D object detection., Comment: Accepted to ECCV 2024. Project Page: https://nerf-mae.github.io/
Published: 2024

15. N-QR: Natural Quick Response Codes for Multi-Robot Instance Correspondence

Author: Glaser, Nathaniel Moore, Ravi, Rajashree, and Kira, Zsolt
Subjects: Computer Science - Robotics
Abstract: Image correspondence serves as the backbone for many tasks in robotics, such as visual fusion, localization, and mapping. However, existing correspondence methods do not scale to large multi-robot systems, and they struggle when image features are weak, ambiguous, or evolving. In response, we propose Natural Quick Response codes, or N-QR, which enables rapid and reliable correspondence between large-scale teams of heterogeneous robots. Our method works like a QR code, using keypoint-based alignment, rapid encoding, and error correction via ensembles of image patches of natural patterns. We deploy our algorithm in a production-scale robotic farm, where groups of growing plants must be matched across many robots. We demonstrate superior performance compared to several baselines, obtaining a retrieval accuracy of 88.2%. Our method generalizes to a farm with 100 robots, achieving a 12.5x reduction in bandwidth and a 20.5x speedup. We leverage our method to correspond 700k plants and confirm a link between a robotic seeding policy and germination., Comment: IEEE International Conference on Robotics and Automation (ICRA), 2024
Published: 2024

16. Toward General-Purpose Robots via Foundation Models: A Survey and Meta-Analysis

Author: Hu, Yafei, Xie, Quanting, Jain, Vidhi, Francis, Jonathan, Patrikar, Jay, Keetha, Nikhil, Kim, Seungchan, Xie, Yaqi, Zhang, Tianyi, Fang, Hao-Shu, Zhao, Shibo, Omidshafiei, Shayegan, Kim, Dong-Ki, Agha-mohammadi, Ali-akbar, Sycara, Katia, Johnson-Roberson, Matthew, Batra, Dhruv, Wang, Xiaolong, Scherer, Sebastian, Wang, Chen, Kira, Zsolt, Xia, Fei, and Bisk, Yonatan
Subjects: Computer Science - Robotics, Computer Science - Artificial Intelligence, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: Building general-purpose robots that operate seamlessly in any environment, with any object, and utilizing various skills to complete diverse tasks has been a long-standing goal in Artificial Intelligence. However, as a community, we have been constraining most robotic systems by designing them for specific tasks, training them on specific datasets, and deploying them within specific environments. These systems require extensively-labeled data and task-specific models. When deployed in real-world scenarios, such systems face several generalization issues and struggle to remain robust to distribution shifts. Motivated by the impressive open-set performance and content generation capabilities of web-scale, large-capacity pre-trained models (i.e., foundation models) in research fields such as Natural Language Processing (NLP) and Computer Vision (CV), we devote this survey to exploring (i) how these existing foundation models from NLP and CV can be applied to the field of general-purpose robotics, and also exploring (ii) what a robotics-specific foundation model would look like. We begin by providing a generalized formulation of how foundation models are used in robotics, and the fundamental barriers to making generalist robots universally applicable. Next, we establish a taxonomy to discuss current work exploring ways to leverage existing foundation models for robotics and develop ones catered to robotics. Finally, we discuss key challenges and promising future directions in using foundation models for enabling general-purpose robotic systems. We encourage readers to view our living GitHub repository 2 of resources, including papers reviewed in this survey, as well as related projects and repositories for developing foundation models for robotics.
Published: 2023

17. Continual Diffusion with STAMINA: STack-And-Mask INcremental Adapters

Author: Smith, James Seale, Hsu, Yen-Chang, Kira, Zsolt, Shen, Yilin, and Jin, Hongxia
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: Recent work has demonstrated a remarkable ability to customize text-to-image diffusion models to multiple, fine-grained concepts in a sequential (i.e., continual) manner while only providing a few example images for each concept. This setting is known as continual diffusion. Here, we ask the question: Can we scale these methods to longer concept sequences without forgetting? Although prior work mitigates the forgetting of previously learned concepts, we show that its capacity to learn new tasks reaches saturation over longer sequences. We address this challenge by introducing a novel method, STack-And-Mask INcremental Adapters (STAMINA), which is composed of low-ranked attention-masked adapters and customized MLP tokens. STAMINA is designed to enhance the robust fine-tuning properties of LoRA for sequential concept learning via learnable hard-attention masks parameterized with low rank MLPs, enabling precise, scalable learning via sparse adaptation. Notably, all introduced trainable parameters can be folded back into the model after training, inducing no additional inference parameter costs. We show that STAMINA outperforms the prior SOTA for the setting of text-to-image continual customization on a 50-concept benchmark composed of landmarks and human faces, with no stored replay data. Additionally, we extended our method to the setting of continual learning for image classification, demonstrating that our gains also translate to state-of-the-art performance in this standard benchmark., Comment: CVPR-W 2024
Published: 2023

18. DAMEX: Dataset-aware Mixture-of-Experts for visual understanding of mixture-of-datasets

Author: Jain, Yash, Behl, Harkirat, Kira, Zsolt, and Vineet, Vibhav
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: Construction of a universal detector poses a crucial question: How can we most effectively train a model on a large mixture of datasets? The answer lies in learning dataset-specific features and ensembling their knowledge but do all this in a single model. Previous methods achieve this by having separate detection heads on a common backbone but that results in a significant increase in parameters. In this work, we present Mixture-of-Experts as a solution, highlighting that MoEs are much more than a scalability tool. We propose Dataset-Aware Mixture-of-Experts, DAMEX where we train the experts to become an `expert' of a dataset by learning to route each dataset tokens to its mapped expert. Experiments on Universal Object-Detection Benchmark show that we outperform the existing state-of-the-art by average +10.2 AP score and improve over our non-MoE baseline by average +2.0 AP score. We also observe consistent gains while mixing datasets with (1) limited availability, (2) disparate domains and (3) divergent label sets. Further, we qualitatively show that DAMEX is robust against expert representation collapse., Comment: https://github.com/jinga-lala/DAMEX
Published: 2023

19. Fast Trainable Projection for Robust Fine-Tuning

Author: Tian, Junjiao, Liu, Yen-Cheng, Smith, James Seale, and Kira, Zsolt
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Robust fine-tuning aims to achieve competitive in-distribution (ID) performance while maintaining the out-of-distribution (OOD) robustness of a pre-trained model when transferring it to a downstream task. Recently, projected gradient descent has been successfully used in robust fine-tuning by constraining the deviation from the initialization of the fine-tuned model explicitly through projection. However, algorithmically, two limitations prevent this method from being adopted more widely, scalability and efficiency. In this paper, we propose a new projection-based fine-tuning algorithm, Fast Trainable Projection (FTP) for computationally efficient learning of per-layer projection constraints, resulting in an average $35\%$ speedup on our benchmarks compared to prior works. FTP can be combined with existing optimizers such as AdamW, and be used in a plug-and-play fashion. Finally, we show that FTP is a special instance of hyper-optimizers that tune the hyper-parameters of optimizers in a learnable manner through nested differentiation. Empirically, we show superior robustness on OOD datasets, including domain shifts and natural corruptions, across four different vision tasks with five different pre-trained models. Additionally, we demonstrate that FTP is broadly applicable and beneficial to other learning scenarios such as low-label and continual learning settings thanks to its easy adaptability. The code will be available at https://github.com/GT-RIPL/FTP.git., Comment: Accepted to NeurIPS 2023
Published: 2023

20. Habitat 3.0: A Co-Habitat for Humans, Avatars and Robots

Author: Puig, Xavier, Undersander, Eric, Szot, Andrew, Cote, Mikael Dallaire, Yang, Tsung-Yen, Partsey, Ruslan, Desai, Ruta, Clegg, Alexander William, Hlavac, Michal, Min, So Yeon, Vondruš, Vladimír, Gervet, Theophile, Berges, Vincent-Pierre, Turner, John M., Maksymets, Oleksandr, Kira, Zsolt, Kalakrishnan, Mrinal, Malik, Jitendra, Chaplot, Devendra Singh, Jain, Unnat, Batra, Dhruv, Rai, Akshara, and Mottaghi, Roozbeh
Subjects: Computer Science - Human-Computer Interaction, Computer Science - Artificial Intelligence, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Graphics, Computer Science - Multiagent Systems, Computer Science - Robotics
Abstract: We present Habitat 3.0: a simulation platform for studying collaborative human-robot tasks in home environments. Habitat 3.0 offers contributions across three dimensions: (1) Accurate humanoid simulation: addressing challenges in modeling complex deformable bodies and diversity in appearance and motion, all while ensuring high simulation speed. (2) Human-in-the-loop infrastructure: enabling real human interaction with simulated robots via mouse/keyboard or a VR interface, facilitating evaluation of robot policies with human input. (3) Collaborative tasks: studying two collaborative tasks, Social Navigation and Social Rearrangement. Social Navigation investigates a robot's ability to locate and follow humanoid avatars in unseen environments, whereas Social Rearrangement addresses collaboration between a humanoid and robot while rearranging a scene. These contributions allow us to study end-to-end learned and heuristic baselines for human-robot collaboration in-depth, as well as evaluate them with humans in the loop. Our experiments demonstrate that learned robot policies lead to efficient task completion when collaborating with unseen humanoid agents and human partners that might exhibit behaviors that the robot has not seen before. Additionally, we observe emergent behaviors during collaborative task execution, such as the robot yielding space when obstructing a humanoid agent, thereby allowing the effective completion of the task by the humanoid agent. Furthermore, our experiments using the human-in-the-loop tool demonstrate that our automated evaluation with humanoids can provide an indication of the relative ordering of different policies when evaluated with real human collaborators. Habitat 3.0 unlocks interesting new features in simulators for Embodied AI, and we hope it paves the way for a new frontier of embodied human-AI interaction capabilities., Comment: Project page: http://aihabitat.org/habitat3
Published: 2023

21. FSD: Fast Self-Supervised Single RGB-D to Categorical 3D Objects

Author: Lunayach, Mayank, Zakharov, Sergey, Chen, Dian, Ambrus, Rares, Kira, Zsolt, and Irshad, Muhammad Zubair
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Robotics
Abstract: In this work, we address the challenging task of 3D object recognition without the reliance on real-world 3D labeled data. Our goal is to predict the 3D shape, size, and 6D pose of objects within a single RGB-D image, operating at the category level and eliminating the need for CAD models during inference. While existing self-supervised methods have made strides in this field, they often suffer from inefficiencies arising from non-end-to-end processing, reliance on separate models for different object categories, and slow surface extraction during the training of implicit reconstruction models; thus hindering both the speed and real-world applicability of the 3D recognition process. Our proposed method leverages a multi-stage training pipeline, designed to efficiently transfer synthetic performance to the real-world domain. This approach is achieved through a combination of 2D and 3D supervised losses during the synthetic domain training, followed by the incorporation of 2D supervised and 3D self-supervised losses on real-world data in two additional learning stages. By adopting this comprehensive strategy, our method successfully overcomes the aforementioned limitations and outperforms existing self-supervised 6D pose and size estimation baselines on the NOCS test-set with a 16.4% absolute improvement in mAP for 6D pose estimation while running in near real-time at 5 Hz., Comment: Project page: https://fsd6d.github.io
Published: 2023

22. Memory in Plain Sight: Surveying the Uncanny Resemblances of Associative Memories and Diffusion Models

Author: Hoover, Benjamin, Strobelt, Hendrik, Krotov, Dmitry, Hoffman, Judy, Kira, Zsolt, and Chau, Duen Horng
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Mathematics - Dynamical Systems
Abstract: The generative process of Diffusion Models (DMs) has recently set state-of-the-art on many AI generation benchmarks. Though the generative process is traditionally understood as an "iterative denoiser", there is no universally accepted language to describe it. We introduce a novel perspective to describe DMs using the mathematical language of memory retrieval from the field of energy-based Associative Memories (AMs), making efforts to keep our presentation approachable to newcomers to both of these fields. Unifying these two fields provides insight that DMs can be seen as a particular kind of AM where Lyapunov stability guarantees are bypassed by intelligently engineering the dynamics (i.e., the noise and step size schedules) of the denoising process. Finally, we present a growing body of evidence that records DMs exhibiting empirical behavior we would expect from AMs, and conclude by discussing research opportunities that are revealed by understanding DMs as a form of energy-based memory., Comment: 15 pages, 4 figures
Published: 2023

23. LatentDR: Improving Model Generalization Through Sample-Aware Latent Degradation and Restoration

Author: Liu, Ran, Khose, Sahil, Xiao, Jingyun, Sathidevi, Lakshmi, Ramnath, Keerthan, Kira, Zsolt, and Dyer, Eva L.
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: Despite significant advances in deep learning, models often struggle to generalize well to new, unseen domains, especially when training data is limited. To address this challenge, we propose a novel approach for distribution-aware latent augmentation that leverages the relationships across samples to guide the augmentation procedure. Our approach first degrades the samples stochastically in the latent space, mapping them to augmented labels, and then restores the samples from their corrupted versions during training. This process confuses the classifier in the degradation step and restores the overall class distribution of the original samples, promoting diverse intra-class/cross-domain variability. We extensively evaluate our approach on a diverse set of datasets and tasks, including domain generalization benchmarks and medical imaging datasets with strong domain shift, where we show our approach achieves significant improvements over existing methods for latent space augmentation. We further show that our method can be flexibly adapted to long-tail recognition tasks, demonstrating its versatility in building more generalizable models. Code is available at https://github.com/nerdslab/LatentDR.
Published: 2023

24. NeO 360: Neural Fields for Sparse View Synthesis of Outdoor Scenes

Author: Irshad, Muhammad Zubair, Zakharov, Sergey, Liu, Katherine, Guizilini, Vitor, Kollar, Thomas, Gaidon, Adrien, Kira, Zsolt, and Ambrus, Rares
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: Recent implicit neural representations have shown great results for novel view synthesis. However, existing methods require expensive per-scene optimization from many views hence limiting their application to real-world unbounded urban settings where the objects of interest or backgrounds are observed from very few views. To mitigate this challenge, we introduce a new approach called NeO 360, Neural fields for sparse view synthesis of outdoor scenes. NeO 360 is a generalizable method that reconstructs 360{\deg} scenes from a single or a few posed RGB images. The essence of our approach is in capturing the distribution of complex real-world outdoor 3D scenes and using a hybrid image-conditional triplanar representation that can be queried from any world point. Our representation combines the best of both voxel-based and bird's-eye-view (BEV) representations and is more effective and expressive than each. NeO 360's representation allows us to learn from a large collection of unbounded 3D scenes while offering generalizability to new views and novel scenes from as few as a single image during inference. We demonstrate our approach on the proposed challenging 360{\deg} unbounded dataset, called NeRDS 360, and show that NeO 360 outperforms state-of-the-art generalizable methods for novel view synthesis while also offering editing and composition capabilities. Project page: https://zubair-irshad.github.io/projects/neo360.html, Comment: Accepted to International Conference on Computer Vision (ICCV), 2023. Project page: https://zubair-irshad.github.io/projects/neo360.html
Published: 2023

25. Diffuse, Attend, and Segment: Unsupervised Zero-Shot Segmentation using Stable Diffusion

Author: Tian, Junjiao, Aggarwal, Lavisha, Colaco, Andrea, Kira, Zsolt, and Gonzalez-Franco, Mar
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Producing quality segmentation masks for images is a fundamental problem in computer vision. Recent research has explored large-scale supervised training to enable zero-shot segmentation on virtually any image style and unsupervised training to enable segmentation without dense annotations. However, constructing a model capable of segmenting anything in a zero-shot manner without any annotations is still challenging. In this paper, we propose to utilize the self-attention layers in stable diffusion models to achieve this goal because the pre-trained stable diffusion model has learned inherent concepts of objects within its attention layers. Specifically, we introduce a simple yet effective iterative merging process based on measuring KL divergence among attention maps to merge them into valid segmentation masks. The proposed method does not require any training or language dependency to extract quality segmentation for any images. On COCO-Stuff-27, our method surpasses the prior unsupervised zero-shot SOTA method by an absolute 26% in pixel accuracy and 17% in mean IoU. The project page is at \url{https://sites.google.com/view/diffseg/home}., Comment: Accepted to CVPR2024
Published: 2023

26. HomeRobot: Open-Vocabulary Mobile Manipulation

Author: Yenamandra, Sriram, Ramachandran, Arun, Yadav, Karmesh, Wang, Austin, Khanna, Mukul, Gervet, Theophile, Yang, Tsung-Yen, Jain, Vidhi, Clegg, Alexander William, Turner, John, Kira, Zsolt, Savva, Manolis, Chang, Angel, Chaplot, Devendra Singh, Batra, Dhruv, Mottaghi, Roozbeh, Bisk, Yonatan, and Paxton, Chris
Subjects: Computer Science - Robotics, Computer Science - Artificial Intelligence, Computer Science - Computer Vision and Pattern Recognition
Abstract: HomeRobot (noun): An affordable compliant robot that navigates homes and manipulates a wide range of objects in order to complete everyday tasks. Open-Vocabulary Mobile Manipulation (OVMM) is the problem of picking any object in any unseen environment, and placing it in a commanded location. This is a foundational challenge for robots to be useful assistants in human environments, because it involves tackling sub-problems from across robotics: perception, language understanding, navigation, and manipulation are all essential to OVMM. In addition, integration of the solutions to these sub-problems poses its own substantial challenges. To drive research in this area, we introduce the HomeRobot OVMM benchmark, where an agent navigates household environments to grasp novel objects and place them on target receptacles. HomeRobot has two components: a simulation component, which uses a large and diverse curated object set in new, high-quality multi-room home environments; and a real-world component, providing a software stack for the low-cost Hello Robot Stretch to encourage replication of real-world experiments across labs. We implement both reinforcement learning and heuristic (model-based) baselines and show evidence of sim-to-real transfer. Our baselines achieve a 20% success rate in the real world; our experiments identify ways future research work improve performance. See videos on our website: https://ovmm.github.io/., Comment: 37 pages, 22 figures, 8 tables
Published: 2023

27. Continual Adaptation of Vision Transformers for Federated Learning

Author: Halbe, Shaunak, Smith, James Seale, Tian, Junjiao, and Kira, Zsolt
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: In this paper, we focus on the important yet understudied problem of Continual Federated Learning (CFL), where a server communicates with a set of clients to incrementally learn new concepts over time without sharing or storing any data. The complexity of this problem is compounded by challenges from both the Continual and Federated Learning perspectives. Specifically, models trained in a CFL setup suffer from catastrophic forgetting which is exacerbated by data heterogeneity across clients. Existing attempts at this problem tend to impose large overheads on clients and communication channels or require access to stored data which renders them unsuitable for real-world use due to privacy. In this paper, we attempt to tackle forgetting and heterogeneity while minimizing overhead costs and without requiring access to any stored data. We study this problem in the context of Vision Transformers and explore parameter-efficient approaches to adapt to dynamic distributions while minimizing forgetting. We achieve this by leveraging a prompting based approach (such that only prompts and classifier heads have to be communicated) and proposing a novel and lightweight generation and distillation scheme to consolidate client models at the server. We formulate this problem for image classification and establish strong baselines for comparison, conduct experiments on CIFAR-100 as well as challenging, large-scale datasets like ImageNet-R and DomainNet. Our approach outperforms both existing methods and our own baselines by as much as 7% while significantly reducing communication and client-level computation costs. Code available at https://github.com/shaunak27/hepco-fed., Comment: Transactions on Machine Learning Research (TMLR) 2024
Published: 2023

28. Adaptive Coordination in Social Embodied Rearrangement

Author: Szot, Andrew, Jain, Unnat, Batra, Dhruv, Kira, Zsolt, Desai, Ruta, and Rai, Akshara
Subjects: Computer Science - Machine Learning, Computer Science - Multiagent Systems, Computer Science - Robotics
Abstract: We present the task of "Social Rearrangement", consisting of cooperative everyday tasks like setting up the dinner table, tidying a house or unpacking groceries in a simulated multi-agent environment. In Social Rearrangement, two robots coordinate to complete a long-horizon task, using onboard sensing and egocentric observations, and no privileged information about the environment. We study zero-shot coordination (ZSC) in this task, where an agent collaborates with a new partner, emulating a scenario where a robot collaborates with a new human partner. Prior ZSC approaches struggle to generalize in our complex and visually rich setting, and on further analysis, we find that they fail to generate diverse coordination behaviors at training time. To counter this, we propose Behavior Diversity Play (BDP), a novel ZSC approach that encourages diversity through a discriminability objective. Our results demonstrate that BDP learns adaptive agents that can tackle visual coordination, and zero-shot generalize to new partners in unseen environments, achieving 35% higher success and 32% higher efficiency compared to baselines.
Published: 2023

29. HAAV: Hierarchical Aggregation of Augmented Views for Image Captioning

Author: Kuo, Chia-Wen and Kira, Zsolt
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: A great deal of progress has been made in image captioning, driven by research into how to encode the image using pre-trained models. This includes visual encodings (e.g. image grid features or detected objects) and more recently textual encodings (e.g. image tags or text descriptions of image regions). As more advanced encodings are available and incorporated, it is natural to ask: how to efficiently and effectively leverage the heterogeneous set of encodings? In this paper, we propose to regard the encodings as augmented views of the input image. The image captioning model encodes each view independently with a shared encoder efficiently, and a contrastive loss is incorporated across the encoded views in a novel way to improve their representation quality and the model's data efficiency. Our proposed hierarchical decoder then adaptively weighs the encoded views according to their effectiveness for caption generation by first aggregating within each view at the token level, and then across views at the view level. We demonstrate significant performance improvements of +5.6% CIDEr on MS-COCO and +12.9% CIDEr on Flickr30k compared to state of the arts, and conduct rigorous analyses to demonstrate the importance of each part of our design., Comment: Paper accepted in CVPR-23; Project page and code available here: https://sites.google.com/view/chiawen-kuo/home/haav
Published: 2023

30. Training Energy-Based Normalizing Flow with Score-Matching Objectives

Author: Chao, Chen-Hao, Sun, Wei-Fang, Hsu, Yen-Chang, Kira, Zsolt, and Lee, Chun-Yi
Subjects: Computer Science - Machine Learning, Statistics - Machine Learning
Abstract: In this paper, we establish a connection between the parameterization of flow-based and energy-based generative models, and present a new flow-based modeling approach called energy-based normalizing flow (EBFlow). We demonstrate that by optimizing EBFlow with score-matching objectives, the computation of Jacobian determinants for linear transformations can be entirely bypassed. This feature enables the use of arbitrary linear layers in the construction of flow-based models without increasing the computational time complexity of each training iteration from $O(D^2L)$ to $O(D^3L)$ for an $L$-layered model that accepts $D$-dimensional inputs. This makes the training of EBFlow more efficient than the commonly-adopted maximum likelihood training method. In addition to the reduction in runtime, we enhance the training stability and empirical performance of EBFlow through a number of techniques developed based on our analysis of the score-matching methods. The experimental results demonstrate that our approach achieves a significant speedup compared to maximum likelihood estimation while outperforming prior methods with a noticeable margin in terms of negative log-likelihood (NLL)., Comment: Published at NeurIPS 2023. Code: https://github.com/chen-hao-chao/ebflow
Published: 2023

31. CLIP-GCD: Simple Language Guided Generalized Category Discovery

Author: Ouldnoughi, Rabah, Kuo, Chia-Wen, and Kira, Zsolt
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Generalized Category Discovery (GCD) requires a model to both classify known categories and cluster unknown categories in unlabeled data. Prior methods leveraged self-supervised pre-training combined with supervised fine-tuning on the labeled data, followed by simple clustering methods. In this paper, we posit that such methods are still prone to poor performance on out-of-distribution categories, and do not leverage a key ingredient: Semantic relationships between object categories. We therefore propose to leverage multi-modal (vision and language) models, in two complementary ways. First, we establish a strong baseline by replacing uni-modal features with CLIP, inspired by its zero-shot performance. Second, we propose a novel retrieval-based mechanism that leverages CLIP's aligned vision-language representations by mining text descriptions from a text corpus for the labeled and unlabeled set. We specifically use the alignment between CLIP's visual encoding of the image and textual encoding of the corpus to retrieve top-k relevant pieces of text and incorporate their embeddings to perform joint image+text semi-supervised clustering. We perform rigorous experimentation and ablations (including on where to retrieve from, how much to retrieve, and how to combine information), and validate our results on several datasets including out-of-distribution domains, demonstrating state-of-art results.
Published: 2023

32. We Need to Talk: Identifying and Overcoming Communication-Critical Scenarios for Self-Driving

Author: Glaser, Nathaniel Moore and Kira, Zsolt
Subjects: Computer Science - Robotics, Computer Science - Multiagent Systems
Abstract: In this work, we consider the task of collision-free trajectory planning for connected self-driving vehicles. We specifically consider communication-critical situations--situations where single-agent systems have blindspots that require multi-agent collaboration. To identify such situations, we propose a method which (1) simulates multi-agent perspectives from real self-driving datasets, (2) finds scenarios that are challenging for isolated agents, and (3) augments scenarios with adversarial obstructions. To overcome these challenges, we propose to extend costmap-based trajectory evaluation to a distributed multi-agent setting. We demonstrate that our bandwidth-efficient, uncertainty-aware method reduces collision rates by up to 62.5% compared to single agent baselines., Comment: Submitted to ICRA 2023 Workshop on Collaborative Perception
Published: 2023

33. Missing Modality Robustness in Semi-Supervised Multi-Modal Semantic Segmentation

Author: Maheshwari, Harsh, Liu, Yen-Cheng, and Kira, Zsolt
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: Using multiple spatial modalities has been proven helpful in improving semantic segmentation performance. However, there are several real-world challenges that have yet to be addressed: (a) improving label efficiency and (b) enhancing robustness in realistic scenarios where modalities are missing at the test time. To address these challenges, we first propose a simple yet efficient multi-modal fusion mechanism Linear Fusion, that performs better than the state-of-the-art multi-modal models even with limited supervision. Second, we propose M3L: Multi-modal Teacher for Masked Modality Learning, a semi-supervised framework that not only improves the multi-modal performance but also makes the model robust to the realistic missing modality scenario using unlabeled data. We create the first benchmark for semi-supervised multi-modal semantic segmentation and also report the robustness to missing modalities. Our proposal shows an absolute improvement of up to 10% on robust mIoU above the most competitive baselines. Our code is available at https://github.com/harshm121/M3L
Published: 2023

34. Continual Diffusion: Continual Customization of Text-to-Image Diffusion with C-LoRA

Author: Smith, James Seale, Hsu, Yen-Chang, Zhang, Lingyu, Hua, Ting, Kira, Zsolt, Shen, Yilin, and Jin, Hongxia
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: Recent works demonstrate a remarkable ability to customize text-to-image diffusion models while only providing a few example images. What happens if you try to customize such models using multiple, fine-grained concepts in a sequential (i.e., continual) manner? In our work, we show that recent state-of-the-art customization of text-to-image models suffer from catastrophic forgetting when new concepts arrive sequentially. Specifically, when adding a new concept, the ability to generate high quality images of past, similar concepts degrade. To circumvent this forgetting, we propose a new method, C-LoRA, composed of a continually self-regularized low-rank adaptation in cross attention layers of the popular Stable Diffusion model. Furthermore, we use customization prompts which do not include the word of the customized object (i.e., "person" for a human face dataset) and are initialized as completely random embeddings. Importantly, our method induces only marginal additional parameter costs and requires no storage of user data for replay. We show that C-LoRA not only outperforms several baselines for our proposed setting of text-to-image continual customization, which we refer to as Continual Diffusion, but that we achieve a new state-of-the-art in the well-established rehearsal-free continual learning setting for image classification. The high achieving performance of C-LoRA in two separate domains positions it as a compelling solution for a wide range of applications, and we believe it has significant potential for practical impact. Project page: https://jamessealesmith.github.io/continual-diffusion/, Comment: Transactions on Machine Learning Research (TMLR) 2024
Published: 2023

35. BC-IRL: Learning Generalizable Reward Functions from Demonstrations

Author: Szot, Andrew, Zhang, Amy, Batra, Dhruv, Kira, Zsolt, and Meier, Franziska
Subjects: Computer Science - Machine Learning
Abstract: How well do reward functions learned with inverse reinforcement learning (IRL) generalize? We illustrate that state-of-the-art IRL algorithms, which maximize a maximum-entropy objective, learn rewards that overfit to the demonstrations. Such rewards struggle to provide meaningful rewards for states not covered by the demonstrations, a major detriment when using the reward to learn policies in new situations. We introduce BC-IRL a new inverse reinforcement learning method that learns reward functions that generalize better when compared to maximum-entropy IRL approaches. In contrast to the MaxEnt framework, which learns to maximize rewards around demonstrations, BC-IRL updates reward parameters such that the policy trained with the new reward matches the expert demonstrations better. We show that BC-IRL learns rewards that generalize better on an illustrative simple task and two continuous robotic control tasks, achieving over twice the success rate of baselines in challenging generalization settings.
Published: 2023

36. Trainable Projected Gradient Method for Robust Fine-tuning

Author: Tian, Junjiao, Dai, Xiaoliang, Ma, Chih-Yao, He, Zecheng, Liu, Yen-Cheng, and Kira, Zsolt
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: Recent studies on transfer learning have shown that selectively fine-tuning a subset of layers or customizing different learning rates for each layer can greatly improve robustness to out-of-distribution (OOD) data and retain generalization capability in the pre-trained models. However, most of these methods employ manually crafted heuristics or expensive hyper-parameter searches, which prevent them from scaling up to large datasets and neural networks. To solve this problem, we propose Trainable Projected Gradient Method (TPGM) to automatically learn the constraint imposed for each layer for a fine-grained fine-tuning regularization. This is motivated by formulating fine-tuning as a bi-level constrained optimization problem. Specifically, TPGM maintains a set of projection radii, i.e., distance constraints between the fine-tuned model and the pre-trained model, for each layer, and enforces them through weight projections. To learn the constraints, we propose a bi-level optimization to automatically learn the best set of projection radii in an end-to-end manner. Theoretically, we show that the bi-level optimization formulation could explain the regularization capability of TPGM. Empirically, with little hyper-parameter search cost, TPGM outperforms existing fine-tuning methods in OOD performance while matching the best in-distribution (ID) performance. For example, when fine-tuned on DomainNet-Real and ImageNet, compared to vanilla fine-tuning, TPGM shows $22\%$ and $10\%$ relative OOD improvement respectively on their sketch counterparts. Code is available at \url{https://github.com/PotatoTian/TPGM}., Comment: Accepted to CVPR2023
Published: 2023

37. OVRL-V2: A simple state-of-art baseline for ImageNav and ObjectNav

Author: Yadav, Karmesh, Majumdar, Arjun, Ramrakhya, Ram, Yokoyama, Naoki, Baevski, Alexei, Kira, Zsolt, Maksymets, Oleksandr, and Batra, Dhruv
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: We present a single neural network architecture composed of task-agnostic components (ViTs, convolutions, and LSTMs) that achieves state-of-art results on both the ImageNav ("go to location in ") and ObjectNav ("find a chair") tasks without any task-specific modules like object detection, segmentation, mapping, or planning modules. Such general-purpose methods offer advantages of simplicity in design, positive scaling with available compute, and versatile applicability to multiple tasks. Our work builds upon the recent success of self-supervised learning (SSL) for pre-training vision transformers (ViT). However, while the training recipes for convolutional networks are mature and robust, the recipes for ViTs are contingent and brittle, and in the case of ViTs for visual navigation, yet to be fully discovered. Specifically, we find that vanilla ViTs do not outperform ResNets on visual navigation. We propose the use of a compression layer operating over ViT patch representations to preserve spatial information along with policy training improvements. These improvements allow us to demonstrate positive scaling laws for the first time in visual navigation tasks. Consequently, our model advances state-of-the-art performance on ImageNav from 54.2% to 82.0% success and performs competitively against concurrent state-of-art on ObjectNav with success rate of 64.0% vs. 65.0%. Overall, this work does not present a fundamentally new approach, but rather recommendations for training a general-purpose architecture that achieves state-of-art performance today and could serve as a strong baseline for future methods., Comment: 15 pages, 7 figures, 9 tables
Published: 2023

38. Communication-Critical Planning via Multi-Agent Trajectory Exchange

Author: Glaser, Nathaniel Moore and Kira, Zsolt
Subjects: Computer Science - Robotics, Computer Science - Computer Vision and Pattern Recognition
Abstract: This paper addresses the task of joint multi-agent perception and planning, especially as it relates to the real-world challenge of collision-free navigation for connected self-driving vehicles. For this task, several communication-enabled vehicles must navigate through a busy intersection while avoiding collisions with each other and with obstacles. To this end, this paper proposes a learnable costmap-based planning mechanism, given raw perceptual data, that is (1) distributed, (2) uncertainty-aware, and (3) bandwidth-efficient. Our method produces a costmap and uncertainty-aware entropy map to sort and fuse candidate trajectories as evaluated across multiple-agents. The proposed method demonstrates several favorable performance trends on a suite of open-source overhead datasets as well as within a novel communication-critical simulator. It produces accurate semantic occupancy forecasts as an intermediate perception output, attaining a 72.5% average pixel-wise classification accuracy. By selecting the top trajectory, the multi-agent method scales well with the number of agents, reducing the hard collision rate by up to 57% with eight agents compared to the single-agent version., Comment: Accepted to ICRA 2023
Published: 2023

39. System Design for an Integrated Lifelong Reinforcement Learning Agent for Real-Time Strategy Games

Author: Sur, Indranil, Daniels, Zachary, Rahman, Abrar, Faber, Kamil, Gallardo, Gianmarco J., Hayes, Tyler L., Taylor, Cameron E., Gurbuz, Mustafa Burak, Smith, James, Joshi, Sahana, Japkowicz, Nathalie, Baron, Michael, Kira, Zsolt, Kanan, Christopher, Corizzo, Roberto, Divakaran, Ajay, Piacentino, Michael, Hostetler, Jesse, and Raghavan, Aswin
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence
Abstract: As Artificial and Robotic Systems are increasingly deployed and relied upon for real-world applications, it is important that they exhibit the ability to continually learn and adapt in dynamically-changing environments, becoming Lifelong Learning Machines. Continual/lifelong learning (LL) involves minimizing catastrophic forgetting of old tasks while maximizing a model's capability to learn new tasks. This paper addresses the challenging lifelong reinforcement learning (L2RL) setting. Pushing the state-of-the-art forward in L2RL and making L2RL useful for practical applications requires more than developing individual L2RL algorithms; it requires making progress at the systems-level, especially research into the non-trivial problem of how to integrate multiple L2RL algorithms into a common framework. In this paper, we introduce the Lifelong Reinforcement Learning Components Framework (L2RLCF), which standardizes L2RL systems and assimilates different continual learning components (each addressing different aspects of the lifelong learning problem) into a unified system. As an instantiation of L2RLCF, we develop a standard API allowing easy integration of novel lifelong learning components. We describe a case study that demonstrates how multiple independently-developed LL components can be integrated into a single realized system. We also introduce an evaluation environment in order to measure the effect of combining various system components. Our evaluation environment employs different LL scenarios (sequences of tasks) consisting of Starcraft-2 minigames and allows for the fair, comprehensive, and quantitative comparison of different combinations of components within a challenging common evaluation environment., Comment: The Second International Conference on AIML Systems, October 12--15, 2022, Bangalore, India
Published: 2022
Full Text: View/download PDF

40. CODA-Prompt: COntinual Decomposed Attention-based Prompting for Rehearsal-Free Continual Learning

Author: Smith, James Seale, Karlinsky, Leonid, Gutta, Vyshnavi, Cascante-Bonilla, Paola, Kim, Donghyun, Arbelle, Assaf, Panda, Rameswar, Feris, Rogerio, and Kira, Zsolt
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: Computer vision models suffer from a phenomenon known as catastrophic forgetting when learning novel concepts from continuously shifting training data. Typical solutions for this continual learning problem require extensive rehearsal of previously seen data, which increases memory costs and may violate data privacy. Recently, the emergence of large-scale pre-trained vision transformer models has enabled prompting approaches as an alternative to data-rehearsal. These approaches rely on a key-query mechanism to generate prompts and have been found to be highly resistant to catastrophic forgetting in the well-established rehearsal-free continual learning setting. However, the key mechanism of these methods is not trained end-to-end with the task sequence. Our experiments show that this leads to a reduction in their plasticity, hence sacrificing new task accuracy, and inability to benefit from expanded parameter capacity. We instead propose to learn a set of prompt components which are assembled with input-conditioned weights to produce input-conditioned prompts, resulting in a novel attention-based end-to-end key-query scheme. Our experiments show that we outperform the current SOTA method DualPrompt on established benchmarks by as much as 4.5% in average final accuracy. We also outperform the state of art by as much as 4.4% accuracy on a continual learning benchmark which contains both class-incremental and domain-incremental task shifts, corresponding to many practical settings. Our code is available at https://github.com/GT-RIPL/CODA-Prompt, Comment: Accepted by the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2023)
Published: 2022

41. Structure-Encoding Auxiliary Tasks for Improved Visual Representation in Vision-and-Language Navigation

Author: Kuo, Chia-Wen, Ma, Chih-Yao, Hoffman, Judy, and Kira, Zsolt
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: In Vision-and-Language Navigation (VLN), researchers typically take an image encoder pre-trained on ImageNet without fine-tuning on the environments that the agent will be trained or tested on. However, the distribution shift between the training images from ImageNet and the views in the navigation environments may render the ImageNet pre-trained image encoder suboptimal. Therefore, in this paper, we design a set of structure-encoding auxiliary tasks (SEA) that leverage the data in the navigation environments to pre-train and improve the image encoder. Specifically, we design and customize (1) 3D jigsaw, (2) traversability prediction, and (3) instance classification to pre-train the image encoder. Through rigorous ablations, our SEA pre-trained features are shown to better encode structural information of the scenes, which ImageNet pre-trained features fail to properly encode but is crucial for the target navigation task. The SEA pre-trained features can be easily plugged into existing VLN agents without any tuning. For example, on Test-Unseen environments, the VLN agents combined with our SEA pre-trained features achieve absolute success rate improvement of 12% for Speaker-Follower, 5% for Env-Dropout, and 4% for AuxRN.
Published: 2022

42. ConStruct-VL: Data-Free Continual Structured VL Concepts Learning

Author: Smith, James Seale, Cascante-Bonilla, Paola, Arbelle, Assaf, Kim, Donghyun, Panda, Rameswar, Cox, David, Yang, Diyi, Kira, Zsolt, Feris, Rogerio, and Karlinsky, Leonid
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Computer Vision and Pattern Recognition
Abstract: Recently, large-scale pre-trained Vision-and-Language (VL) foundation models have demonstrated remarkable capabilities in many zero-shot downstream tasks, achieving competitive results for recognizing objects defined by as little as short text prompts. However, it has also been shown that VL models are still brittle in Structured VL Concept (SVLC) reasoning, such as the ability to recognize object attributes, states, and inter-object relations. This leads to reasoning mistakes, which need to be corrected as they occur by teaching VL models the missing SVLC skills; often this must be done using private data where the issue was found, which naturally leads to a data-free continual (no task-id) VL learning setting. In this work, we introduce the first Continual Data-Free Structured VL Concepts Learning (ConStruct-VL) benchmark and show it is challenging for many existing data-free CL strategies. We, therefore, propose a data-free method comprised of a new approach of Adversarial Pseudo-Replay (APR) which generates adversarial reminders of past tasks from past task models. To use this method efficiently, we also propose a continual parameter-efficient Layered-LoRA (LaLo) neural architecture allowing no-memory-cost access to all past models at train time. We show this approach outperforms all data-free methods by as much as ~7% while even matching some levels of experience-replay (prohibitive for applications where data-privacy must be preserved). Our code is publicly available at https://github.com/jamessealesmith/ConStruct-VL, Comment: Accepted by the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2023)
Published: 2022

43. Polyhistor: Parameter-Efficient Multi-Task Adaptation for Dense Vision Tasks

Author: Liu, Yen-Cheng, Ma, Chih-Yao, Tian, Junjiao, He, Zijian, and Kira, Zsolt
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Adapting large-scale pretrained models to various downstream tasks via fine-tuning is a standard method in machine learning. Recently, parameter-efficient fine-tuning methods show promise in adapting a pretrained model to different tasks while training only a few parameters. Despite their success, most existing methods are proposed in Natural Language Processing tasks with language Transformers, and adaptation to Computer Vision tasks with Vision Transformers remains under-explored, especially for dense vision tasks. Further, in multi-task settings, individually fine-tuning and storing separate models for different tasks is inefficient. In this work, we provide an extensive multi-task parameter-efficient benchmark and examine existing parameter-efficient fine-tuning NLP methods for vision tasks. Our results on four different dense vision tasks showed that existing methods cannot be efficiently integrated due to the hierarchical nature of the Hierarchical Vision Transformers. To overcome this issue, we propose Polyhistor and Polyhistor-Lite, consisting of Decomposed HyperNetworks and Layer-wise Scaling Kernels, to share information across different tasks with a few trainable parameters. This leads to favorable performance improvements against existing parameter-efficient methods while using fewer trainable parameters. Specifically, Polyhistor achieves competitive accuracy compared to the state-of-the-art while only using ~10% of their trainable parameters. Furthermore, our methods show larger performance gains when large networks and more pretraining data are used., Comment: Accepted to NeurIPS 2022; Project Page is at https://ycliu93.github.io/projects/polyhistor.html
Published: 2022

44. FedFOR: Stateless Heterogeneous Federated Learning with First-Order Regularization

Author: Tian, Junjiao, Smith, James Seale, and Kira, Zsolt
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Computer Vision and Pattern Recognition
Abstract: Federated Learning (FL) seeks to distribute model training across local clients without collecting data in a centralized data-center, hence removing data-privacy concerns. A major challenge for FL is data heterogeneity (where each client's data distribution can differ) as it can lead to weight divergence among local clients and slow global convergence. The current SOTA FL methods designed for data heterogeneity typically impose regularization to limit the impact of non-IID data and are stateful algorithms, i.e., they maintain local statistics over time. While effective, these approaches can only be used for a special case of FL involving only a small number of reliable clients. For the more typical applications of FL where the number of clients is large (e.g., edge-device and mobile applications), these methods cannot be applied, motivating the need for a stateless approach to heterogeneous FL which can be used for any number of clients. We derive a first-order gradient regularization to penalize inconsistent local updates due to local data heterogeneity. Specifically, to mitigate weight divergence, we introduce a first-order approximation of the global data distribution into local objectives, which intuitively penalizes updates in the opposite direction of the global update. The end result is a stateless FL algorithm that achieves 1) significantly faster convergence (i.e., fewer communication rounds) and 2) higher overall converged performance than SOTA methods under non-IID data distribution. Importantly, our approach does not impose unrealistic limits on the client size, enabling learning from a large number of clients as is typical in most FL applications.
Published: 2022

45. On the Surprising Effectiveness of Transformers in Low-Labeled Video Recognition

Author: Rahman, Farrukh, Mubarek, Ömer, and Kira, Zsolt
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning, I.2.10
Abstract: Recently vision transformers have been shown to be competitive with convolution-based methods (CNNs) broadly across multiple vision tasks. The less restrictive inductive bias of transformers endows greater representational capacity in comparison with CNNs. However, in the image classification setting this flexibility comes with a trade-off with respect to sample efficiency, where transformers require ImageNet-scale training. This notion has carried over to video where transformers have not yet been explored for video classification in the low-labeled or semi-supervised settings. Our work empirically explores the low data regime for video classification and discovers that, surprisingly, transformers perform extremely well in the low-labeled video setting compared to CNNs. We specifically evaluate video vision transformers across two contrasting video datasets (Kinetics-400 and SomethingSomething-V2) and perform thorough analysis and ablation studies to explain this observation using the predominant features of video transformer architectures. We even show that using just the labeled data, transformers significantly outperform complex semi-supervised CNN methods that leverage large-scale unlabeled data as well. Our experiments inform our recommendation that semi-supervised learning video work should consider the use of video transformers in the future., Comment: Accepted to NeurIPS 2022 Workshop on Vision Transformers: Theory and applications (VTTA)
Published: 2022

46. Open-Set Semi-Supervised Object Detection

Author: Liu, Yen-Cheng, Ma, Chih-Yao, Dai, Xiaoliang, Tian, Junjiao, Vajda, Peter, He, Zijian, and Kira, Zsolt
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: Recent developments for Semi-Supervised Object Detection (SSOD) have shown the promise of leveraging unlabeled data to improve an object detector. However, thus far these methods have assumed that the unlabeled data does not contain out-of-distribution (OOD) classes, which is unrealistic with larger-scale unlabeled datasets. In this paper, we consider a more practical yet challenging problem, Open-Set Semi-Supervised Object Detection (OSSOD). We first find the existing SSOD method obtains a lower performance gain in open-set conditions, and this is caused by the semantic expansion, where the distracting OOD objects are mispredicted as in-distribution pseudo-labels for the semi-supervised training. To address this problem, we consider online and offline OOD detection modules, which are integrated with SSOD methods. With the extensive studies, we found that leveraging an offline OOD detector based on a self-supervised vision transformer performs favorably against online OOD detectors due to its robustness to the interference of pseudo-labeling. In the experiment, our proposed framework effectively addresses the semantic expansion issue and shows consistent improvements on many OSSOD benchmarks, including large-scale COCO-OpenImages. We also verify the effectiveness of our framework under different OSSOD conditions, including varying numbers of in-distribution classes, different degrees of supervision, and different combinations of unlabeled sets., Comment: Project Page is at https://ycliu93.github.io/projects/ossod.html
Published: 2022

47. ShAPO: Implicit Representations for Multi-Object Shape, Appearance, and Pose Optimization

Author: Irshad, Muhammad Zubair, Zakharov, Sergey, Ambrus, Rares, Kollar, Thomas, Kira, Zsolt, and Gaidon, Adrien
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning, Computer Science - Robotics
Abstract: Our method studies the complex task of object-centric 3D understanding from a single RGB-D observation. As it is an ill-posed problem, existing methods suffer from low performance for both 3D shape and 6D pose and size estimation in complex multi-object scenarios with occlusions. We present ShAPO, a method for joint multi-object detection, 3D textured reconstruction, 6D object pose and size estimation. Key to ShAPO is a single-shot pipeline to regress shape, appearance and pose latent codes along with the masks of each object instance, which is then further refined in a sparse-to-dense fashion. A novel disentangled shape and appearance database of priors is first learned to embed objects in their respective shape and appearance space. We also propose a novel, octree-based differentiable optimization step, allowing us to further improve object shape, pose and appearance simultaneously under the learned latent space, in an analysis-by-synthesis fashion. Our novel joint implicit textured object representation allows us to accurately identify and reconstruct novel unseen objects without having access to their 3D meshes. Through extensive experiments, we show that our method, trained on simulated indoor scenes, accurately regresses the shape, appearance and pose of novel objects in the real-world with minimal fine-tuning. Our method significantly out-performs all baselines on the NOCS dataset with an 8% absolute improvement in mAP for 6D pose estimation. Project page: https://zubair-irshad.github.io/projects/ShAPO.html, Comment: Accepted to European Conference on Computer Vision (ECCV), 2022
Published: 2022

48. Unbiased Teacher v2: Semi-supervised Object Detection for Anchor-free and Anchor-based Detectors

Author: Liu, Yen-Cheng, Ma, Chih-Yao, and Kira, Zsolt
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: With the recent development of Semi-Supervised Object Detection (SS-OD) techniques, object detectors can be improved by using a limited amount of labeled data and abundant unlabeled data. However, there are still two challenges that are not addressed: (1) there is no prior SS-OD work on anchor-free detectors, and (2) prior works are ineffective when pseudo-labeling bounding box regression. In this paper, we present Unbiased Teacher v2, which shows the generalization of SS-OD method to anchor-free detectors and also introduces Listen2Student mechanism for the unsupervised regression loss. Specifically, we first present a study examining the effectiveness of existing SS-OD methods on anchor-free detectors and find that they achieve much lower performance improvements under the semi-supervised setting. We also observe that box selection with centerness and the localization-based labeling used in anchor-free detectors cannot work well under the semi-supervised setting. On the other hand, our Listen2Student mechanism explicitly prevents misleading pseudo-labels in the training of bounding box regression; we specifically develop a novel pseudo-labeling selection mechanism based on the Teacher and Student's relative uncertainties. This idea contributes to favorable improvement in the regression branch in the semi-supervised setting. Our method, which works for both anchor-free and anchor-based methods, consistently performs favorably against the state-of-the-art methods in VOC, COCO-standard, and COCO-additional., Comment: Project Page is at http://ycliu93.github.io/projects/unbiasedteacher2.html
Published: 2022

49. Lifelong Wandering: A realistic few-shot online continual learning setting

Author: Lunayach, Mayank, Smith, James, and Kira, Zsolt
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: Online few-shot learning describes a setting where models are trained and evaluated on a stream of data while learning emerging classes. While prior work in this setting has achieved very promising performance on instance classification when learning from data-streams composed of a single indoor environment, we propose to extend this setting to consider object classification on a series of several indoor environments, which is likely to occur in applications such as robotics. Importantly, our setting, which we refer to as online few-shot continual learning, injects the well-studied issue of catastrophic forgetting into the few-shot online learning paradigm. In this work, we benchmark several existing methods and adapted baselines within our setting, and show there exists a trade-off between catastrophic forgetting and online performance. Our findings motivate the need for future work in this setting, which can achieve better online performance without catastrophic forgetting., Comment: CVPR 2022 Workshop on Continual Learning
Published: 2022

50. Beyond a Pre-Trained Object Detector: Cross-Modal Textual and Visual Context for Image Captioning

Author: Kuo, Chia-Wen and Kira, Zsolt
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: Significant progress has been made on visual captioning, largely relying on pre-trained features and later fixed object detectors that serve as rich inputs to auto-regressive models. A key limitation of such methods, however, is that the output of the model is conditioned only on the object detector's outputs. The assumption that such outputs can represent all necessary information is unrealistic, especially when the detector is transferred across datasets. In this work, we reason about the graphical model induced by this assumption, and propose to add an auxiliary input to represent missing information such as object relationships. We specifically propose to mine attributes and relationships from the Visual Genome dataset and condition the captioning model on them. Crucially, we propose (and show to be important) the use of a multi-modal pre-trained model (CLIP) to retrieve such contextual descriptions. Further, object detector models are frozen and do not have sufficient richness to allow the captioning model to properly ground them. As a result, we propose to condition both the detector and description outputs on the image, and show qualitatively and quantitatively that this can improve grounding. We validate our method on image captioning, perform thorough analyses of each component and importance of the pre-trained multi-modal model, and demonstrate significant improvements over the current state of the art, specifically +7.5% in CIDEr and +1.3% in BLEU-4 metrics., Comment: paper accepted in CVPR 2022
Published: 2022

Catalog

Books, media, physical & digital resources

See catalog results

Searchworks

Select search scope, currently: Articles Catalog books, media & more in Jio Institute collections Articles journal articles & other e-resources

Search

Search Constraints

Refine your results

Search Limiters

Topic

Publication Year Range

Language

Publication Type

Journal

Database

Publisher

320 results on '"Kira, Zsolt"'

Search Results

Catalog

Select search scope, currently: Articles

Catalog

books, media & more in Jio Institute collections

Articles

journal articles & other e-resources