672 results for "Lazebnik, Svetlana"
Search Results
2. Shadows Don't Lie and Lines Can't Bend! Generative Models don't know Projective Geometry...for now
- Author
-
Sarkar, Ayush, Mai, Hanlin, Mahapatra, Amitabh, Lazebnik, Svetlana, Forsyth, D. A., and Bhattad, Anand
- Subjects
Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Graphics, Computer Science - Machine Learning - Abstract
Generative models can produce impressively realistic images. This paper demonstrates that generated images have geometric features different from those of real images. We build a set of collections of generated images, prequalified to fool simple, signal-based classifiers into believing they are real. We then show that prequalified generated images can be identified reliably by classifiers that only look at geometric properties. We use three such classifiers. All three classifiers are denied access to image pixels, and look only at derived geometric features. The first classifier looks at the perspective field of the image, the second looks at lines detected in the image, and the third looks at relations between detected objects and shadows. Our procedure detects generated images more reliably than SOTA local signal based detectors, for images from a number of distinct generators. Saliency maps suggest that the classifiers can identify geometric problems reliably. We conclude that current generators cannot reliably reproduce geometric properties of real images., Comment: Project Page: https://projective-geometry.github.io | First three authors contributed equally
- Published
- 2023
3. Street TryOn: Learning In-the-Wild Virtual Try-On from Unpaired Person Images
- Author
-
Cui, Aiyu, Mahajan, Jay, Shah, Viraj, Gomathinayagam, Preeti, Liu, Chang, and Lazebnik, Svetlana
- Subjects
Computer Science - Computer Vision and Pattern Recognition, Computer Science - Graphics - Abstract
Most virtual try-on research is motivated to serve the fashion business by generating images to demonstrate garments on studio models at a lower cost. However, virtual try-on should be a broader application that also allows customers to visualize garments on themselves using their own casual photos, known as in-the-wild try-on. Unfortunately, the existing methods, which achieve plausible results for studio try-on settings, perform poorly in the in-the-wild context. This is because these methods often require paired images (garment images paired with images of people wearing the same garment) for training. While such paired data is easy to collect from shopping websites for studio settings, it is difficult to obtain for in-the-wild scenes. In this work, we fill the gap by (1) introducing a StreetTryOn benchmark to support in-the-wild virtual try-on applications and (2) proposing a novel method to learn virtual try-on from a set of in-the-wild person images directly without requiring paired data. We tackle the unique challenges, including warping garments to more diverse human poses and rendering more complex backgrounds faithfully, by a novel DensePose warping correction method combined with diffusion-based conditional inpainting. Our experiments show competitive performance for standard studio try-on tasks and SOTA performance for street try-on and cross-domain try-on tasks., Comment: The abstract and intro have been updated; some typos and PDF rendering errors have been fixed in this version.
- Published
- 2023
4. ZipLoRA: Any Subject in Any Style by Effectively Merging LoRAs
- Author
-
Shah, Viraj, Ruiz, Nataniel, Cole, Forrester, Lu, Erika, Lazebnik, Svetlana, Li, Yuanzhen, and Jampani, Varun
- Subjects
Computer Science - Computer Vision and Pattern Recognition, Computer Science - Graphics, Computer Science - Machine Learning - Abstract
Methods for finetuning generative models for concept-driven personalization generally achieve strong results for subject-driven or style-driven generation. Recently, low-rank adaptations (LoRA) have been proposed as a parameter-efficient way of achieving concept-driven personalization. While recent work explores the combination of separate LoRAs to achieve joint generation of learned styles and subjects, existing techniques do not reliably address the problem; they often compromise either subject fidelity or style fidelity. We propose ZipLoRA, a method to cheaply and effectively merge independently trained style and subject LoRAs in order to achieve generation of any user-provided subject in any user-provided style. Experiments on a wide range of subject and style combinations show that ZipLoRA can generate compelling results with meaningful improvements over baselines in subject and style fidelity while preserving the ability to recontextualize. Project page: https://ziplora.github.io, Comment: Project page: https://ziplora.github.io
- Published
- 2023
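To make the merging idea in entry 4 concrete, the NumPy sketch below combines two independently trained low-rank updates into a single weight delta using per-column mixing coefficients. The coefficient scheme, shapes, and names are illustrative assumptions; the abstract does not specify ZipLoRA's exact merging objective or how its coefficients are learned.

import numpy as np

def lora_delta(A, B):
    # Low-rank update: delta W = B @ A, with A of shape (r, d_in) and B of shape (d_out, r).
    return B @ A

def merge_loras(A_style, B_style, A_subj, B_subj, m_style, m_subj):
    # Hypothetical merge: scale each LoRA's weight delta column-wise by mixing
    # coefficients and sum the results into one delta for the frozen base weight.
    d_style = lora_delta(A_style, B_style) * m_style   # broadcast over columns
    d_subj = lora_delta(A_subj, B_subj) * m_subj
    return d_style + d_subj

rng = np.random.default_rng(0)
d_out, d_in, r = 8, 8, 2
A_s, B_s = rng.normal(size=(r, d_in)), rng.normal(size=(d_out, r))   # "style" LoRA
A_c, B_c = rng.normal(size=(r, d_in)), rng.normal(size=(d_out, r))   # "subject" LoRA
m_s = rng.uniform(size=(1, d_in))   # per-column coefficients (illustrative, not learned here)
m_c = 1.0 - m_s
merged = merge_loras(A_s, B_s, A_c, B_c, m_s, m_c)
print(merged.shape)  # (8, 8): a single delta added to the frozen base weight at inference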
5. JoIN: Joint GANs Inversion for Intrinsic Image Decomposition
- Author
-
Shah, Viraj, Lazebnik, Svetlana, and Philip, Julien
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
In this work, we propose to solve ill-posed inverse imaging problems using a bank of Generative Adversarial Networks (GANs) as a prior and apply our method to the case of Intrinsic Image Decomposition for faces and materials. Our method builds on the demonstrated success of GANs to capture complex image distributions. At the core of our approach is the idea that the latent space of a GAN is a well-suited optimization domain to solve inverse problems. Given an input image, we propose to jointly invert the latent codes of a set of GANs and combine their outputs to reproduce the input. Contrary to most GAN inversion methods, which are limited to inverting only a single GAN, we demonstrate that it is possible to maintain distribution priors while inverting several GANs jointly. We show that our approach is modular, allowing various forward imaging models, and that it can successfully decompose both synthetic and real images., Comment: Project webpage is available at https://virajshah.com/join
- Published
- 2023
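The abstract of entry 5 describes optimizing the latent codes of several GANs jointly so that a forward imaging model applied to their outputs reproduces the input. Below is a minimal PyTorch sketch of such a loop, with tiny MLPs standing in for pretrained generators and image ≈ albedo × shading assumed as the forward model; all sizes, learning rates, and names are illustrative, not the paper's implementation.

import torch

# Stand-ins for pretrained generators (assumption: any differentiable decoder works for the sketch).
G_albedo = torch.nn.Sequential(torch.nn.Linear(16, 64), torch.nn.Tanh(), torch.nn.Linear(64, 3 * 8 * 8))
G_shading = torch.nn.Sequential(torch.nn.Linear(16, 64), torch.nn.Tanh(), torch.nn.Linear(64, 3 * 8 * 8))
for p in list(G_albedo.parameters()) + list(G_shading.parameters()):
    p.requires_grad_(False)                    # generators stay frozen; only latents are optimized

target = torch.rand(3 * 8 * 8)                 # the input image to invert (flattened)
z_a = torch.zeros(16, requires_grad=True)      # latent code for the albedo generator
z_s = torch.zeros(16, requires_grad=True)      # latent code for the shading generator
opt = torch.optim.Adam([z_a, z_s], lr=0.05)

for step in range(200):
    opt.zero_grad()
    # Assumed forward imaging model: image = albedo * shading, elementwise.
    recon = torch.sigmoid(G_albedo(z_a)) * torch.sigmoid(G_shading(z_s))
    loss = torch.nn.functional.mse_loss(recon, target)
    loss.backward()
    opt.step()
print(float(loss))                             # reconstruction error after joint inversion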
6. One-Shot Stylization for Full-Body Human Images
- Author
-
Cui, Aiyu and Lazebnik, Svetlana
- Subjects
Computer Science - Computer Vision and Pattern Recognition, Computer Science - Graphics - Abstract
The goal of human stylization is to transfer full-body human photos to a style specified by a single art character reference image. Although previous work has succeeded in example-based stylization of faces and generic scenes, full-body human stylization is a more complex domain. This work addresses several unique challenges of stylizing full-body human images. We propose a method for one-shot fine-tuning of a pose-guided human generator to preserve the "content" (garments, face, hair, pose) of the input photo and the "style" of the artistic reference. Since body shape deformation is an essential component of an art character's style, we incorporate a novel skeleton deformation module to reshape the pose of the input person and modify the DiOr pose-guided person generator to be more robust to the rescaled poses falling outside the distribution of the realistic poses that the generator is originally trained on. Several human studies verify the effectiveness of our approach.
- Published
- 2023
7. In Memoriam: Xiaoou Tang
- Author
-
Matsushita, Yasuyuki, Lazebnik, Svetlana, and Matas, Jiri
- Published
- 2024
- Full Text
- View/download PDF
8. Robust Online Video Instance Segmentation with Track Queries
- Author
-
Zhan, Zitong, McKee, Daniel, and Lazebnik, Svetlana
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Recently, transformer-based methods have achieved impressive results on Video Instance Segmentation (VIS). However, most of these top-performing methods run in an offline manner by processing the entire video clip at once to predict instance mask volumes. This makes them incapable of handling the long videos that appear in challenging new video instance segmentation datasets like UVO and OVIS. We propose a fully online transformer-based video instance segmentation model that performs comparably to top offline methods on the YouTube-VIS 2019 benchmark and considerably outperforms them on UVO and OVIS. This method, called Robust Online Video Segmentation (ROVIS), augments the Mask2Former image instance segmentation model with track queries, a lightweight mechanism for carrying track information from frame to frame, originally introduced by the TrackFormer method for multi-object tracking. We show that, when combined with a strong enough image segmentation architecture, track queries can exhibit impressive accuracy while not being constrained to short videos.
- Published
- 2022
9. MultiStyleGAN: Multiple One-shot Image Stylizations using a Single GAN
- Author
-
Shah, Viraj, Sarkar, Ayush, Anitha, Sudharsan Krishnakumar, and Lazebnik, Svetlana
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Image stylization aims at applying a reference style to arbitrary input images. A common scenario is one-shot stylization, where only one example is available for each reference style. Recent approaches for one-shot stylization such as JoJoGAN fine-tune a pre-trained StyleGAN2 generator on a single style reference image. However, such methods cannot generate multiple stylizations without fine-tuning a new model for each style separately. In this work, we present a MultiStyleGAN method that is capable of producing multiple different stylizations at once by fine-tuning a single generator. The key component of our method is a learnable transformation module called Style Transformation Network. It takes latent codes as input, and learns linear mappings to different regions of the latent space to produce distinct codes for each style, resulting in a multistyle space. Our model inherently mitigates overfitting since it is trained on multiple styles, hence improving the quality of stylizations. Our method can learn upwards of $120$ image stylizations at once, bringing $8\times$ to $60\times$ improvement in training time over recent competing methods. We support our results through user studies and quantitative results that indicate meaningful improvements over existing methods., Comment: Project webpage available at https://virajshah.com/multistyle
- Published
- 2022
10. Transfer of Representations to Video Label Propagation: Implementation Factors Matter
- Author
-
McKee, Daniel, Zhan, Zitong, Shuai, Bing, Modolo, Davide, Tighe, Joseph, and Lazebnik, Svetlana
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
This work studies feature representations for dense label propagation in video, with a focus on recently proposed methods that learn video correspondence using self-supervised signals such as colorization or temporal cycle consistency. In the literature, these methods have been evaluated with an array of inconsistent settings, making it difficult to discern trends or compare performance fairly. Starting with a unified formulation of the label propagation algorithm that encompasses most existing variations, we systematically study the impact of important implementation factors in feature extraction and label propagation. Along the way, we report the accuracies of properly tuned supervised and unsupervised still image baselines, which are higher than those found in previous works. We also demonstrate that augmenting video-based correspondence cues with still-image-based ones can further improve performance. We then attempt a fair comparison of recent video-based methods on the DAVIS benchmark, showing convergence of best methods to performance levels near our strong ImageNet baseline, despite the usage of a variety of specialized video-based losses and training particulars. Additional comparisons on JHMDB and VIP datasets confirm the similar performance of current methods. We hope that this study will help to improve evaluation practices and better inform future research directions in temporal correspondence.
- Published
- 2022
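Entry 10 refers to a unified label propagation formulation. The NumPy sketch below shows one generic version of it: compute feature affinities between target and reference pixels, keep the top-k references, and propagate labels as a softmax-weighted average. The temperature, top-k value, and function names are assumptions, and real implementations additionally restrict references to a spatial window.

import numpy as np

def propagate_labels(ref_feats, ref_labels, tgt_feats, temperature=0.07, topk=5):
    """Propagate per-pixel labels from reference frames to a target frame.

    ref_feats: (N_ref, D) L2-normalized reference features
    ref_labels: (N_ref, C) one-hot or soft reference labels
    tgt_feats: (N_tgt, D) L2-normalized target features
    Returns (N_tgt, C) soft labels for the target pixels.
    """
    sims = tgt_feats @ ref_feats.T                      # (N_tgt, N_ref) affinities
    idx = np.argsort(-sims, axis=1)[:, :topk]           # keep the top-k references per target pixel
    out = np.zeros((tgt_feats.shape[0], ref_labels.shape[1]))
    for i, nbrs in enumerate(idx):
        w = np.exp(sims[i, nbrs] / temperature)         # softmax weights over the kept affinities
        w /= w.sum()
        out[i] = w @ ref_labels[nbrs]
    return out

def unit(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

rng = np.random.default_rng(0)
ref, tgt = unit(rng.normal(size=(100, 32))), unit(rng.normal(size=(10, 32)))
labels = np.eye(4)[rng.integers(0, 4, size=100)]
print(propagate_labels(ref, labels, tgt).shape)          # (10, 4)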
11. Interpretation of Emergent Communication in Heterogeneous Collaborative Embodied Agents
- Author
-
Patel, Shivansh, Wani, Saim, Jain, Unnat, Schwing, Alexander, Lazebnik, Svetlana, Savva, Manolis, and Chang, Angel X.
- Subjects
Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Machine Learning, Computer Science - Multiagent Systems - Abstract
Communication between embodied AI agents has received increasing attention in recent years. Despite its use, it is still unclear whether the learned communication is interpretable and grounded in perception. To study the grounding of emergent forms of communication, we first introduce the collaborative multi-object navigation task CoMON. In this task, an oracle agent has detailed environment information in the form of a map. It communicates with a navigator agent that perceives the environment visually and is tasked to find a sequence of goals. To succeed at the task, effective communication is essential. CoMON hence serves as a basis to study different communication mechanisms between heterogeneous agents, that is, agents with different capabilities and roles. We study two common communication mechanisms and analyze their communication patterns through an egocentric and spatial lens. We show that the emergent communication can be grounded to the agent observations and the spatial structure of the 3D environment. Video summary: https://youtu.be/kLv2rxO9t0g, Comment: Project page: https://shivanshpatel35.github.io/comon/ ; the first three authors contributed equally
- Published
- 2021
12. Multi-Object Tracking with Hallucinated and Unlabeled Videos
- Author
-
McKee, Daniel, Shuai, Bing, Berneshawi, Andrew, Wang, Manchen, Modolo, Davide, Lazebnik, Svetlana, and Tighe, Joseph
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
In this paper, we explore learning end-to-end deep neural trackers without tracking annotations. This is important as large-scale training data is essential for training deep neural trackers while tracking annotations are expensive to acquire. In place of tracking annotations, we first hallucinate videos from images with bounding box annotations using zoom-in/out motion transformations to obtain free tracking labels. We add video simulation augmentations to create a diverse tracking dataset, albeit with simple motion. Next, to tackle harder tracking cases, we mine hard examples across an unlabeled pool of real videos with a tracker trained on our hallucinated video data. For hard example mining, we propose an optimization-based connecting process to first identify and then rectify hard examples from the pool of unlabeled videos. Finally, we train our tracker jointly on hallucinated data and mined hard video examples. Our weakly supervised tracker achieves state-of-the-art performance on the MOT17 and TAO-person datasets. On MOT17, we further demonstrate that the combination of our self-generated data and the existing manually-annotated data leads to additional improvements.
- Published
- 2021
13. Dressing in Order: Recurrent Person Image Generation for Pose Transfer, Virtual Try-on and Outfit Editing
- Author
-
Cui, Aiyu, McKee, Daniel, and Lazebnik, Svetlana
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
We propose a flexible person generation framework called Dressing in Order (DiOr), which supports 2D pose transfer, virtual try-on, and several fashion editing tasks. The key to DiOr is a novel recurrent generation pipeline to sequentially put garments on a person, so that trying on the same garments in different orders will result in different looks. Our system can produce dressing effects not achievable by existing work, including different interactions of garments (e.g., wearing a top tucked into the bottom or over it), as well as layering of multiple garments of the same type (e.g., jacket over shirt over t-shirt). DiOr explicitly encodes the shape and texture of each garment, enabling these elements to be edited separately. Joint training on pose transfer and inpainting helps with detail preservation and coherence of generated garments. Extensive evaluations show that DiOr outperforms other recent methods like ADGAN in terms of output quality, and handles a wide range of editing functions for which there is no direct supervision., Comment: ICCV 2021
- Published
- 2021
14. GridToPix: Training Embodied Agents with Minimal Supervision
- Author
-
Jain, Unnat, Liu, Iou-Jen, Lazebnik, Svetlana, Kembhavi, Aniruddha, Weihs, Luca, and Schwing, Alexander
- Subjects
Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Machine Learning, Computer Science - Multiagent Systems - Abstract
While deep reinforcement learning (RL) promises freedom from hand-labeled data, great successes, especially for Embodied AI, require significant work to create supervision via carefully shaped rewards. Indeed, without shaped rewards, i.e., with only terminal rewards, present-day Embodied AI results degrade significantly across Embodied AI problems from single-agent Habitat-based PointGoal Navigation (SPL drops from 55 to 0) and two-agent AI2-THOR-based Furniture Moving (success drops from 58% to 1%) to three-agent Google Football-based 3 vs. 1 with Keeper (game score drops from 0.6 to 0.1). As training from shaped rewards doesn't scale to more realistic tasks, the community needs to improve the success of training with terminal rewards. For this we propose GridToPix: 1) train agents with terminal rewards in gridworlds that generically mirror Embodied AI environments, i.e., they are independent of the task; 2) distill the learned policy into agents that reside in complex visual worlds. Despite learning from only terminal rewards with identical models and RL algorithms, GridToPix significantly improves results across tasks: from PointGoal Navigation (SPL improves from 0 to 64) and Furniture Moving (success improves from 1% to 25%) to football gameplay (game score improves from 0.1 to 0.6). GridToPix even helps to improve the results of shaped reward training., Comment: Project page: https://unnat.github.io/gridtopix/ ; last two authors contributed equally
- Published
- 2021
15. Bridging the Imitation Gap by Adaptive Insubordination
- Author
-
Weihs, Luca, Jain, Unnat, Liu, Iou-Jen, Salvador, Jordi, Lazebnik, Svetlana, Kembhavi, Aniruddha, and Schwing, Alexander
- Subjects
Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Computer Vision and Pattern Recognition, Statistics - Machine Learning - Abstract
In practice, imitation learning is preferred over pure reinforcement learning whenever it is possible to design a teaching agent to provide expert supervision. However, we show that when the teaching agent makes decisions with access to privileged information that is unavailable to the student, this information is marginalized during imitation learning, resulting in an "imitation gap" and, potentially, poor results. Prior work bridges this gap via a progression from imitation learning to reinforcement learning. While often successful, gradual progression fails for tasks that require frequent switches between exploration and memorization. To better address these tasks and alleviate the imitation gap we propose 'Adaptive Insubordination' (ADVISOR). ADVISOR dynamically weights imitation and reward-based reinforcement learning losses during training, enabling on-the-fly switching between imitation and exploration. On a suite of challenging tasks set within gridworlds, multi-agent particle environments, and high-fidelity 3D simulators, we show that on-the-fly switching with ADVISOR outperforms pure imitation, pure reinforcement learning, as well as their sequential and parallel combinations., Comment: NeurIPS'21 version. The first two authors contributed equally. Project page: https://unnat.github.io/advisor/
- Published
- 2020
16. A Cordial Sync: Going Beyond Marginal Policies for Multi-Agent Embodied Tasks
- Author
-
Jain, Unnat, Weihs, Luca, Kolve, Eric, Farhadi, Ali, Lazebnik, Svetlana, Kembhavi, Aniruddha, and Schwing, Alexander
- Subjects
Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Machine Learning, Computer Science - Multiagent Systems - Abstract
Autonomous agents must learn to collaborate. It is not scalable to develop a new centralized agent every time a task's difficulty outpaces a single agent's abilities. While multi-agent collaboration research has flourished in gridworld-like environments, relatively little work has considered visually rich domains. Addressing this, we introduce the novel task FurnMove in which agents work together to move a piece of furniture through a living room to a goal. Unlike existing tasks, FurnMove requires agents to coordinate at every timestep. We identify two challenges when training agents to complete FurnMove: existing decentralized action sampling procedures do not permit expressive joint action policies and, in tasks requiring close coordination, the number of failed actions dominates successful actions. To confront these challenges we introduce SYNC-policies (synchronize your actions coherently) and CORDIAL (coordination loss). Using SYNC-policies and CORDIAL, our agents achieve a 58% completion rate on FurnMove, an impressive absolute gain of 25 percentage points over competitive decentralized baselines. Our dataset, code, and pretrained models are available at https://unnat.github.io/cordial-sync ., Comment: Accepted to ECCV 2020 (spotlight); Project page: https://unnat.github.io/cordial-sync
- Published
- 2020
17. Memory-Efficient Incremental Learning Through Feature Adaptation
- Author
-
Iscen, Ahmet, Zhang, Jeffrey, Lazebnik, Svetlana, and Schmid, Cordelia
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
We introduce an approach for incremental learning that preserves feature descriptors of training images from previously learned classes, instead of the images themselves, unlike most existing work. Keeping the much lower-dimensional feature embeddings of images reduces the memory footprint significantly. We assume that the model is updated incrementally for new classes as new data becomes available sequentially. This requires adapting the previously stored feature vectors to the updated feature space without having access to the corresponding original training images. Feature adaptation is learned with a multi-layer perceptron, which is trained on feature pairs corresponding to the outputs of the original and updated network on a training image. We validate experimentally that such a transformation generalizes well to the features of the previous set of classes, and maps features to a discriminative subspace in the feature space. As a result, the classifier is optimized jointly over new and old classes without requiring old class images. Experimental results show that our method achieves state-of-the-art classification accuracy in incremental learning benchmarks, while having at least an order of magnitude lower memory footprint compared to image-preserving strategies.
- Published
- 2020
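Entry 17's feature adaptation can be pictured as a small regression problem: train an MLP to map stored old-space descriptors to the updated network's feature space. The PyTorch sketch below uses random placeholder features and an assumed two-layer architecture, not the paper's exact setup.

import torch

old_feats = torch.randn(256, 64)              # placeholder stored descriptors (old feature space)
new_feats = old_feats @ torch.randn(64, 64)   # placeholder targets (updated feature space)

adapter = torch.nn.Sequential(torch.nn.Linear(64, 128), torch.nn.ReLU(), torch.nn.Linear(128, 64))
opt = torch.optim.Adam(adapter.parameters(), lr=1e-3)
for epoch in range(100):
    opt.zero_grad()
    # Regress updated-network features from the stored old-network features of the same images.
    loss = torch.nn.functional.mse_loss(adapter(old_feats), new_feats)
    loss.backward()
    opt.step()
# The trained adapter maps stored old-class descriptors into the new feature space,
# so the classifier can be optimized over old and new classes without keeping old images.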
18. Contextual Translation Embedding for Visual Relationship Detection and Scene Graph Generation
- Author
-
Hung, Zih-Siou, Mallya, Arun, and Lazebnik, Svetlana
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Relations amongst entities play a central role in image understanding. Due to the complexity of modeling (subject, predicate, object) relation triplets, it is crucial to develop a method that can not only recognize seen relations, but also generalize to unseen cases. Inspired by a previously proposed visual translation embedding model, or VTransE, we propose a context-augmented translation embedding model that can capture both common and rare relations. The previous VTransE model maps entities and predicates into a low-dimensional embedding vector space where the predicate is interpreted as a translation vector between the embedded features of the bounding box regions of the subject and the object. Our model additionally incorporates the contextual information captured by the bounding box of the union of the subject and the object, and learns the embeddings guided by the constraint predicate $\approx$ union (subject, object) $-$ subject $-$ object. In a comprehensive evaluation on multiple challenging benchmarks, our approach outperforms previous translation-based models and comes close to or exceeds the state of the art across a range of settings, from small-scale to large-scale datasets, from common to previously unseen relations. It also achieves promising results for the recently introduced task of scene graph generation.
- Published
- 2019
- Full Text
- View/download PDF
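Written out, the embedding constraints mentioned in the abstract of entry 18 are as follows (notation chosen here for illustration):

\[
\text{VTransE:}\quad \mathbf{s} + \mathbf{p} \approx \mathbf{o} \quad\Longleftrightarrow\quad \mathbf{p} \approx \mathbf{o} - \mathbf{s},
\qquad
\text{context-augmented model:}\quad \mathbf{p} \approx \mathbf{u}(s,o) - \mathbf{s} - \mathbf{o},
\]

where $\mathbf{s}$ and $\mathbf{o}$ are the embedded features of the subject and object boxes, $\mathbf{u}(s,o)$ is the embedded feature of their union box, and $\mathbf{p}$ is the predicate embedding.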
19. Two Body Problem: Collaborative Visual Task Completion
- Author
-
Jain, Unnat, Weihs, Luca, Kolve, Eric, Rastegari, Mohammad, Lazebnik, Svetlana, Farhadi, Ali, Schwing, Alexander, and Kembhavi, Aniruddha
- Subjects
Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Multiagent Systems - Abstract
Collaboration is a necessary skill to perform tasks that are beyond one agent's capabilities. Addressed extensively in both conventional and modern AI, multi-agent collaboration has often been studied in the context of simple grid worlds. We argue that there are inherently visual aspects to collaboration which should be studied in visually rich environments. A key element in collaboration is communication that can be either explicit, through messages, or implicit, through perception of the other agents and the visual world. Learning to collaborate in a visual environment entails learning (1) to perform the task, (2) when and what to communicate, and (3) how to act based on these communications and the perception of the visual world. In this paper we study the problem of learning to collaborate directly from pixels in AI2-THOR and demonstrate the benefits of explicit and implicit modes of communication to perform visual tasks. Refer to our project page for more details: https://prior.allenai.org/projects/two-body-problem, Comment: Accepted to CVPR 2019
- Published
- 2019
20. Revisiting Image-Language Networks for Open-ended Phrase Detection
- Author
-
Plummer, Bryan A., Shih, Kevin J., Li, Yichen, Xu, Ke, Lazebnik, Svetlana, Sclaroff, Stan, and Saenko, Kate
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Most existing work that grounds natural language phrases in images starts with the assumption that the phrase in question is relevant to the image. In this paper we address a more realistic version of the natural language grounding task where we must both identify whether the phrase is relevant to an image and localize the phrase. This can also be viewed as a generalization of object detection to an open-ended vocabulary, introducing elements of few- and zero-shot detection. We propose an approach for this task that extends Faster R-CNN to relate image regions and phrases. By carefully initializing the classification layers of our network using canonical correlation analysis (CCA), we encourage a solution that is more discerning when reasoning between similar phrases, resulting in over double the performance compared to a naive adaptation on three popular phrase grounding datasets, Flickr30K Entities, ReferIt Game, and Visual Genome, with test-time phrase vocabulary sizes of 5K, 32K, and 159K, respectively., Comment: Accepted to TPAMI
- Published
- 2018
- Full Text
- View/download PDF
21. Out of the Box: Reasoning with Graph Convolution Nets for Factual Visual Question Answering
- Author
-
Narasimhan, Medhini, Lazebnik, Svetlana, and Schwing, Alexander G.
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Accurately answering a question about a given image requires combining observations with general knowledge. While this is effortless for humans, reasoning with general knowledge remains an algorithmic challenge. To advance research in this direction a novel `fact-based' visual question answering (FVQA) task has been introduced recently along with a large set of curated facts which link two entities, i.e., two possible answers, via a relation. Given a question-image pair, deep network techniques have been employed to successively reduce the large set of facts until one of the two entities of the final remaining fact is predicted as the answer. We observe that a successive process which considers one fact at a time to form a local decision is sub-optimal. Instead, we develop an entity graph and use a graph convolutional network to `reason' about the correct answer by jointly considering all entities. We show on the challenging FVQA dataset that this leads to an improvement in accuracy of around 7% compared to the state of the art., Comment: Accepted to NIPS 2018
- Published
- 2018
22. Two can play this Game: Visual Dialog with Discriminative Question Generation and Answering
- Author
-
Jain, Unnat, Lazebnik, Svetlana, and Schwing, Alexander
- Subjects
Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computation and Language - Abstract
Human conversation is a complex mechanism with subtle nuances. It is hence an ambitious goal to develop artificial intelligence agents that can participate fluently in a conversation. While we are still far from achieving this goal, recent progress in visual question answering, image captioning, and visual question generation shows that dialog systems may be realizable in the not too distant future. To this end, a novel dataset was introduced recently and encouraging results were demonstrated, particularly for question answering. In this paper, we demonstrate a simple symmetric discriminative baseline, that can be applied to both predicting an answer as well as predicting a question. We show that this method performs on par with the state of the art, even memory net based methods. In addition, for the first time on the visual dialog dataset, we assess the performance of a system asking questions, and demonstrate how visual dialog can be generated from discriminative question generation and question answering., Comment: Accepted to CVPR 2018
- Published
- 2018
23. Piggyback: Adapting a Single Network to Multiple Tasks by Learning to Mask Weights
- Author
-
Mallya, Arun, Davis, Dillon, and Lazebnik, Svetlana
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
This work presents a method for adapting a single, fixed deep neural network to multiple tasks without affecting performance on already learned tasks. By building upon ideas from network quantization and pruning, we learn binary masks that piggyback on an existing network, or are applied to unmodified weights of that network to provide good performance on a new task. These masks are learned in an end-to-end differentiable fashion, and incur a low overhead of 1 bit per network parameter, per task. Even though the underlying network is fixed, the ability to mask individual weights allows for the learning of a large number of filters. We show performance comparable to dedicated fine-tuned networks for a variety of classification tasks, including those with large domain shifts from the initial task (ImageNet), and a variety of network architectures. Unlike prior work, we do not suffer from catastrophic forgetting or competition between tasks, and our performance is agnostic to task ordering. Code available at https://github.com/arunmallya/piggyback.
- Published
- 2018
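One common way to realize the "learn binary masks end-to-end" idea summarized in entry 23 is to keep real-valued mask scores, threshold them in the forward pass, and pass gradients straight through to the scores. The PyTorch sketch below illustrates that mechanism on a single linear layer; the threshold, initialization, and class name are assumptions rather than the paper's exact recipe.

import torch

class MaskedLinear(torch.nn.Module):
    # Backbone weights stay frozen; only the per-task mask scores are learned.
    def __init__(self, frozen_weight, threshold=0.005):
        super().__init__()
        self.register_buffer("weight", frozen_weight)                      # frozen backbone weights
        self.scores = torch.nn.Parameter(0.01 * torch.ones_like(frozen_weight))
        self.threshold = threshold

    def forward(self, x):
        hard = (self.scores > self.threshold).float()                      # binary mask used at test time
        mask = hard + self.scores - self.scores.detach()                   # straight-through estimator
        return torch.nn.functional.linear(x, self.weight * mask)

layer = MaskedLinear(torch.randn(10, 20))
out = layer(torch.randn(4, 20))
out.sum().backward()
print(layer.scores.grad.shape)  # gradients reach the mask scores, not the frozen weights

The straight-through step is what keeps the per-task overhead at one learned score during training and, as the abstract notes, one bit per network parameter at deployment.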
24. Conditional Image-Text Embedding Networks
- Author
-
Plummer, Bryan A., Kordas, Paige, Kiapour, M. Hadi, Zheng, Shuai, Piramuthu, Robinson, and Lazebnik, Svetlana
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
This paper presents an approach for grounding phrases in images which jointly learns multiple text-conditioned embeddings in a single end-to-end model. In order to differentiate text phrases into semantically distinct subspaces, we propose a concept weight branch that automatically assigns phrases to embeddings, whereas prior works predefine such assignments. Our proposed solution simplifies the representation requirements for individual embeddings and allows the underrepresented concepts to take advantage of the shared representations before feeding them into concept-specific layers. Comprehensive experiments verify the effectiveness of our approach across three phrase grounding datasets, Flickr30K Entities, ReferIt Game, and Visual Genome, where we obtain a (resp.) 4%, 3%, and 4% improvement in grounding performance over a strong region-phrase embedding baseline., Comment: ECCV 2018 accepted paper
- Published
- 2017
25. Diverse and Accurate Image Description Using a Variational Auto-Encoder with an Additive Gaussian Encoding Space
- Author
-
Wang, Liwei, Schwing, Alexander G., and Lazebnik, Svetlana
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
This paper explores image caption generation using conditional variational auto-encoders (CVAEs). Standard CVAEs with a fixed Gaussian prior yield descriptions with too little variability. Instead, we propose two models that explicitly structure the latent space around $K$ components corresponding to different types of image content, and combine components to create priors for images that contain multiple types of content simultaneously (e.g., several kinds of objects). Our first model uses a Gaussian Mixture model (GMM) prior, while the second one defines a novel Additive Gaussian (AG) prior that linearly combines component means. We show that both models produce captions that are more diverse and more accurate than a strong LSTM baseline or a "vanilla" CVAE with a fixed Gaussian prior, with AG-CVAE showing particular promise.
- Published
- 2017
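In equations, the two priors contrasted in entry 25 can be written as follows (component weights $c_k$, means $\boldsymbol{\mu}_k$, and isotropic variances are notation assumed here for illustration):

\[
\text{GMM prior:}\quad p(\mathbf{z}\mid \mathbf{c}) = \sum_{k=1}^{K} c_k\, \mathcal{N}\!\left(\mathbf{z};\, \boldsymbol{\mu}_k,\, \sigma_k^{2}\mathbf{I}\right),
\qquad
\text{AG prior:}\quad p(\mathbf{z}\mid \mathbf{c}) = \mathcal{N}\!\left(\mathbf{z};\, \sum_{k=1}^{K} c_k \boldsymbol{\mu}_k,\, \sigma^{2}\mathbf{I}\right),
\]

so the AG prior collapses the mixture into a single Gaussian whose mean linearly combines the means of the $K$ content components present in the image.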
26. PackNet: Adding Multiple Tasks to a Single Network by Iterative Pruning
- Author
-
Mallya, Arun and Lazebnik, Svetlana
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
This paper presents a method for adding multiple tasks to a single deep neural network while avoiding catastrophic forgetting. Inspired by network pruning techniques, we exploit redundancies in large deep networks to free up parameters that can then be employed to learn new tasks. By performing iterative pruning and network re-training, we are able to sequentially "pack" multiple tasks into a single network while ensuring minimal drop in performance and minimal storage overhead. Unlike prior work that uses proxy losses to maintain accuracy on older tasks, we always optimize for the task at hand. We perform extensive experiments on a variety of network architectures and large-scale datasets, and observe much better robustness against catastrophic forgetting than prior work. In particular, we are able to add three fine-grained classification tasks to a single ImageNet-trained VGG-16 network and achieve accuracies close to those of separately trained networks for each task. Code available at https://github.com/arunmallya/packnet
- Published
- 2017
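Entry 26's iterative pruning can be sketched as: after training a task, release the smallest-magnitude fraction of the weights that task owns so the freed slots can be retrained for the next task. The PyTorch snippet below shows one such pruning step; the pruning fraction and bookkeeping are illustrative assumptions, and the real method also re-trains the kept weights and records a per-task ownership mask.

import torch

def prune_for_next_task(weight, free_mask, prune_fraction=0.5):
    """One PackNet-style pruning step (fractions and the magnitude criterion are illustrative).
    Among weights owned by the current task (free_mask == True), zero out the
    smallest-magnitude fraction and release them for future tasks."""
    owned = weight[free_mask]
    k = int(prune_fraction * owned.numel())
    cutoff = owned.abs().kthvalue(k).values if k > 0 else owned.abs().min()
    released = free_mask & (weight.abs() <= cutoff)
    weight = weight * (~released).float()          # zero out released weights
    return weight, released                        # released slots become trainable for the next task

W = torch.randn(6, 6)
W_pruned, released = prune_for_next_task(W, torch.ones_like(W, dtype=torch.bool))
print(int(released.sum()), "weights freed for the next task")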
27. Learning Two-Branch Neural Networks for Image-Text Matching Tasks
- Author
-
Wang, Liwei, Li, Yin, Huang, Jing, and Lazebnik, Svetlana
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Image-language matching tasks have recently attracted a lot of attention in the computer vision field. These tasks include image-sentence matching, i.e., given an image query, retrieving relevant sentences and vice versa, and region-phrase matching or visual grounding, i.e., matching a phrase to relevant regions. This paper investigates two-branch neural networks for learning the similarity between these two data modalities. We propose two network structures that produce different output representations. The first one, referred to as an embedding network, learns an explicit shared latent embedding space with a maximum-margin ranking loss and novel neighborhood constraints. Compared to standard triplet sampling, we perform improved neighborhood sampling that takes neighborhood information into consideration while constructing mini-batches. The second network structure, referred to as a similarity network, fuses the two branches via element-wise product and is trained with regression loss to directly predict a similarity score. Extensive experiments show that our networks achieve high accuracies for phrase localization on the Flickr30K Entities dataset and for bi-directional image-sentence retrieval on Flickr30K and MSCOCO datasets., Comment: accepted version in TPAMI 2018
- Published
- 2017
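As a concrete picture of the embedding-network variant in entry 27, the PyTorch sketch below projects image and text features into a shared unit-norm space and applies a max-margin ranking loss (only one ranking direction is shown; the paper also uses neighborhood constraints and a second, similarity-network variant). Layer sizes and the margin value are assumptions.

import torch
import torch.nn.functional as F

img_branch = torch.nn.Sequential(torch.nn.Linear(2048, 512), torch.nn.ReLU(), torch.nn.Linear(512, 256))
txt_branch = torch.nn.Sequential(torch.nn.Linear(300, 512), torch.nn.ReLU(), torch.nn.Linear(512, 256))

def embed(branch, x):
    return F.normalize(branch(x), dim=-1)          # project into a shared unit-norm space

def ranking_loss(img, pos_txt, neg_txt, margin=0.2):
    # Matched image-text pairs should score higher than mismatched ones by at least `margin`.
    pos = (img * pos_txt).sum(-1)
    neg = (img * neg_txt).sum(-1)
    return F.relu(margin - pos + neg).mean()

x_img = embed(img_branch, torch.randn(8, 2048))
x_pos = embed(txt_branch, torch.randn(8, 300))
x_neg = embed(txt_branch, torch.randn(8, 300))
print(float(ranking_loss(x_img, x_pos, x_neg)))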
28. Recurrent Models for Situation Recognition
- Author
-
Mallya, Arun and Lazebnik, Svetlana
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
This work proposes Recurrent Neural Network (RNN) models to predict structured 'image situations' -- actions and noun entities fulfilling semantic roles related to the action. In contrast to prior work relying on Conditional Random Fields (CRFs), we use a specialized action prediction network followed by an RNN for noun prediction. Our system obtains state-of-the-art accuracy on the challenging recent imSitu dataset, beating CRF-based models, including ones trained with additional data. Further, we show that specialized features learned from situation prediction can be transferred to the task of image captioning to more accurately describe human-object interactions., Comment: To appear at ICCV 2017
- Published
- 2017
29. Phrase Localization and Visual Relationship Detection with Comprehensive Image-Language Cues
- Author
-
Plummer, Bryan A., Mallya, Arun, Cervantes, Christopher M., Hockenmaier, Julia, and Lazebnik, Svetlana
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
This paper presents a framework for localization or grounding of phrases in images using a large collection of linguistic and visual cues. We model the appearance, size, and position of entity bounding boxes, adjectives that contain attribute information, and spatial relationships between pairs of entities connected by verbs or prepositions. Special attention is given to relationships between people and clothing or body part mentions, as they are useful for distinguishing individuals. We automatically learn weights for combining these cues and at test time, perform joint inference over all phrases in a caption. The resulting system produces state of the art performance on phrase localization on the Flickr30k Entities dataset and visual relationship detection on the Stanford VRD dataset., Comment: IEEE ICCV 2017 accepted paper
- Published
- 2016
30. Combining Multiple Cues for Visual Madlibs Question Answering
- Author
-
Tommasi, Tatiana, Mallya, Arun, Plummer, Bryan, Lazebnik, Svetlana, Berg, Alexander C., and Berg, Tamara L.
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
This paper presents an approach for answering fill-in-the-blank multiple choice questions from the Visual Madlibs dataset. Instead of generic and commonly used representations trained on the ImageNet classification task, our approach employs a combination of networks trained for specialized tasks such as scene recognition, person activity classification, and attribute prediction. We also present a method for localizing phrases from candidate answers in order to provide spatial support for feature extraction. We map each of these features, together with candidate answers, to a joint embedding space through normalized canonical correlation analysis (nCCA). Finally, we solve an optimization problem to learn to combine scores from nCCA models trained on multiple cues to select the best answer. Extensive experimental results show a significant improvement over the previous state of the art and confirm that answering questions from a wide range of types benefits from examining a variety of image cues and carefully choosing the spatial support for feature extraction., Comment: submitted to IJCV -- under review
- Published
- 2016
31. Solving Visual Madlibs with Multiple Cues
- Author
-
Tommasi, Tatiana, Mallya, Arun, Plummer, Bryan, Lazebnik, Svetlana, Berg, Alexander C., and Berg, Tamara L.
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
This paper focuses on answering fill-in-the-blank style multiple choice questions from the Visual Madlibs dataset. Previous approaches to Visual Question Answering (VQA) have mainly used generic image features from networks trained on the ImageNet dataset, despite the wide scope of questions. In contrast, our approach employs features derived from networks trained for specialized tasks of scene classification, person activity prediction, and person and object attribute prediction. We also present a method for selecting sub-regions of an image that are relevant for evaluating the appropriateness of a putative answer. Visual features are computed both from the whole image and from local regions, while sentences are mapped to a common space using a simple normalized canonical correlation analysis (CCA) model. Our results show a significant improvement over the previous state of the art, and indicate that answering different question types benefits from examining a variety of image cues and carefully choosing informative image sub-regions., Comment: accepted at BMVC 2016
- Published
- 2016
32. Learning Models for Actions and Person-Object Interactions with Transfer to Question Answering
- Author
-
Mallya, Arun and Lazebnik, Svetlana
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
This paper proposes deep convolutional network models that utilize local and global context to make human activity label predictions in still images, achieving state-of-the-art performance on two recent datasets with hundreds of labels each. We use multiple instance learning to handle the lack of supervision on the level of individual person instances, and weighted loss to handle unbalanced training data. Further, we show how specialized features trained on these datasets can be used to improve accuracy on the Visual Question Answering (VQA) task, in the form of multiple choice fill-in-the-blank questions (Visual Madlibs). Specifically, we tackle two types of questions on person activity and person-object relationship and show improvements over generic features trained on the ImageNet classification task.
- Published
- 2016
33. Adaptive Object Detection Using Adjacency and Zoom Prediction
- Author
-
Lu, Yongxi, Javidi, Tara, and Lazebnik, Svetlana
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
State-of-the-art object detection systems rely on an accurate set of region proposals. Several recent methods use a neural network architecture to hypothesize promising object locations. While these approaches are computationally efficient, they rely on fixed image regions as anchors for predictions. In this paper we propose to use a search strategy that adaptively directs computational resources to sub-regions likely to contain objects. Compared to methods based on fixed anchor locations, our approach naturally adapts to cases where object instances are sparse and small. Our approach is comparable in terms of accuracy to the state-of-the-art Faster R-CNN approach while using two orders of magnitude fewer anchors on average. Code is publicly available., Comment: Accepted to CVPR 2016
- Published
- 2015
34. Learning Deep Structure-Preserving Image-Text Embeddings
- Author
-
Wang, Liwei, Li, Yin, and Lazebnik, Svetlana
- Subjects
Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computation and Language, Computer Science - Learning - Abstract
This paper proposes a method for learning joint embeddings of images and text using a two-branch neural network with multiple layers of linear projections followed by nonlinearities. The network is trained using a large margin objective that combines cross-view ranking constraints with within-view neighborhood structure preservation constraints inspired by metric learning literature. Extensive experiments show that our approach gains significant improvements in accuracy for image-to-text and text-to-image retrieval. Our method achieves new state-of-the-art results on the Flickr30K and MSCOCO image-sentence datasets and shows promise on the new task of phrase localization on the Flickr30K Entities dataset.
- Published
- 2015
35. Active Object Localization with Deep Reinforcement Learning
- Author
-
Caicedo, Juan C. and Lazebnik, Svetlana
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
We present an active detection model for localizing objects in scenes. The model is class-specific and allows an agent to focus attention on candidate regions for identifying the correct location of a target object. This agent learns to deform a bounding box using simple transformation actions, with the goal of determining the most specific location of target objects following top-down reasoning. The proposed localization agent is trained using deep reinforcement learning, and evaluated on the Pascal VOC 2007 dataset. We show that agents guided by the proposed model are able to localize a single instance of an object after analyzing only between 11 and 25 regions in an image, and obtain the best detection results among systems that do not use object proposals for object localization., Comment: IEEE ICCV 2015
- Published
- 2015
36. Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models
- Author
-
Plummer, Bryan A., Wang, Liwei, Cervantes, Chris M., Caicedo, Juan C., Hockenmaier, Julia, and Lazebnik, Svetlana
- Subjects
Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computation and Language - Abstract
The Flickr30k dataset has become a standard benchmark for sentence-based image description. This paper presents Flickr30k Entities, which augments the 158k captions from Flickr30k with 244k coreference chains, linking mentions of the same entities across different captions for the same image, and associating them with 276k manually annotated bounding boxes. Such annotations are essential for continued progress in automatic image description and grounded language understanding. They enable us to define a new benchmark for localization of textual entity mentions in an image. We present a strong baseline for this task that combines an image-text embedding, detectors for common objects, a color classifier, and a bias towards selecting larger objects. While our baseline rivals in accuracy more complex state-of-the-art models, we show that its gains cannot be easily parlayed into improvements on such tasks as image-sentence retrieval, thus underlining the limitations of current methods and the need for further research.
- Published
- 2015
37. Training Deeper Convolutional Networks with Deep Supervision
- Author
-
Wang, Liwei, Lee, Chen-Yu, Tu, Zhuowen, and Lazebnik, Svetlana
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
One of the most promising ways of improving the performance of deep convolutional neural networks is by increasing the number of convolutional layers. However, adding layers makes training more difficult and computationally expensive. In order to train deeper networks, we propose to add auxiliary supervision branches after certain intermediate layers during training. We formulate a simple rule of thumb to determine where these branches should be added. The resulting deeply supervised structure makes the training much easier and also produces better classification results on ImageNet and the recently released, larger MIT Places dataset.
- Published
- 2015
38. Multi-scale Orderless Pooling of Deep Convolutional Activation Features
- Author
-
Gong, Yunchao, Wang, Liwei, Guo, Ruiqi, and Lazebnik, Svetlana
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Deep convolutional neural networks (CNN) have shown their promise as a universal representation for recognition. However, global CNN activations lack geometric invariance, which limits their robustness for classification and matching of highly variable scenes. To improve the invariance of CNN activations without degrading their discriminative power, this paper presents a simple but effective scheme called multi-scale orderless pooling (MOP-CNN). This scheme extracts CNN activations for local patches at multiple scale levels, performs orderless VLAD pooling of these activations at each level separately, and concatenates the result. The resulting MOP-CNN representation can be used as a generic feature for either supervised or unsupervised recognition tasks, from image classification to instance-level retrieval; it consistently outperforms global CNN activations without requiring any joint training of prediction layers for a particular target dataset. In absolute terms, it achieves state-of-the-art results on the challenging SUN397 and MIT Indoor Scenes classification datasets, and competitive results on ILSVRC2012/2013 classification and INRIA Holidays retrieval datasets.
- Published
- 2014
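The pooling scheme of entry 38 can be sketched as: extract activations for patches at several scales, VLAD-pool them orderlessly per scale, and concatenate. The NumPy sketch below follows that recipe with a random stand-in for the CNN and random codebooks; the real method uses dense patches, PCA-reduced activations, and k-means codebooks learned per level.

import numpy as np

def vlad(descriptors, codebook):
    # Orderless VLAD pooling: sum residuals of descriptors to their nearest codeword, then L2-normalize.
    assign = np.argmin(((descriptors[:, None, :] - codebook[None]) ** 2).sum(-1), axis=1)
    v = np.zeros_like(codebook)
    for k in range(codebook.shape[0]):
        if np.any(assign == k):
            v[k] = (descriptors[assign == k] - codebook[k]).sum(0)
    v = v.flatten()
    return v / (np.linalg.norm(v) + 1e-12)

def mop_cnn(patches_per_scale, cnn_features, codebooks):
    # Pool activations orderlessly at each scale, then concatenate across scales.
    return np.concatenate([vlad(np.stack([cnn_features(p) for p in patches]), codebooks[s])
                           for s, patches in patches_per_scale.items()])

rng = np.random.default_rng(0)
fake_cnn = lambda patch: rng.normal(size=16)            # stand-in for a CNN activation extractor
patches = {1: [None], 2: [None] * 4, 3: [None] * 16}    # whole image, 2x2 grid, 4x4 grid (placeholders)
books = {s: rng.normal(size=(8, 16)) for s in patches}
print(mop_cnn(patches, fake_cnn, books).shape)          # (3 * 8 * 16,) = (384,)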
39. Conditional Image-Text Embedding Networks
- Author
-
Plummer, Bryan A., Kordas, Paige, Kiapour, M. Hadi, Zheng, Shuai, Piramuthu, Robinson, Lazebnik, Svetlana, Hutchison, David, Series Editor, Kanade, Takeo, Series Editor, Kittler, Josef, Series Editor, Kleinberg, Jon M., Series Editor, Mattern, Friedemann, Series Editor, Mitchell, John C., Series Editor, Naor, Moni, Series Editor, Pandu Rangan, C., Series Editor, Steffen, Bernhard, Series Editor, Terzopoulos, Demetri, Series Editor, Tygar, Doug, Series Editor, Weikum, Gerhard, Series Editor, Ferrari, Vittorio, editor, Hebert, Martial, editor, Sminchisescu, Cristian, editor, and Weiss, Yair, editor
- Published
- 2018
- Full Text
- View/download PDF
40. A Multi-View Embedding Space for Modeling Internet Images, Tags, and their Semantics
- Author
-
Gong, Yunchao, Ke, Qifa, Isard, Michael, and Lazebnik, Svetlana
- Subjects
Computer Science - Computer Vision and Pattern Recognition, Computer Science - Information Retrieval, Computer Science - Learning, Computer Science - Multimedia - Abstract
This paper investigates the problem of modeling Internet images and associated text or tags for tasks such as image-to-image search, tag-to-image search, and image-to-tag search (image annotation). We start with canonical correlation analysis (CCA), a popular and successful approach for mapping visual and textual features to the same latent space, and incorporate a third view capturing high-level image semantics, represented either by a single category or multiple non-mutually-exclusive concepts. We present two ways to train the three-view embedding: supervised, with the third view coming from ground-truth labels or search keywords; and unsupervised, with semantic themes automatically obtained by clustering the tags. To ensure high accuracy for retrieval tasks while keeping the learning process scalable, we combine multiple strong visual features and use explicit nonlinear kernel mappings to efficiently approximate kernel CCA. To perform retrieval, we use a specially designed similarity function in the embedded space, which substantially outperforms the Euclidean distance. The resulting system produces compelling qualitative results and outperforms a number of two-view baselines on retrieval tasks on three large-scale Internet image datasets., Comment: To Appear: International Journal of Computer Vision
- Published
- 2012
41. A recursive procedure for density estimation on the binary hypercube
- Author
-
Raginsky, Maxim, Silva, Jorge, Lazebnik, Svetlana, and Willett, Rebecca
- Subjects
Mathematics - Statistics Theory, Statistics - Machine Learning, 62G07 (Primary) 62G20, 62C20 (Secondary) - Abstract
This paper describes a recursive estimation procedure for multivariate binary densities (probability distributions of vectors of Bernoulli random variables) using orthogonal expansions. For $d$ covariates, there are $2^d$ basis coefficients to estimate, which renders conventional approaches computationally prohibitive when $d$ is large. However, for a wide class of densities that satisfy a certain sparsity condition, our estimator runs in probabilistic polynomial time and adapts to the unknown sparsity of the underlying density in two key ways: (1) it attains near-minimax mean-squared error for moderate sample sizes, and (2) the computational complexity is lower for sparser densities. Our method also allows for flexible control of the trade-off between mean-squared error and computational complexity., Comment: revision submitted to Electronic Journal of Statistics
- Published
- 2011
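For orientation, the standard orthogonal-expansion setup behind entry 41 looks like this (the parity/Walsh basis and plug-in estimator shown here are the textbook construction, not necessarily the paper's exact recursive, sparsity-adaptive procedure). For $x \in \{0,1\}^d$, one orthonormal basis is

\[
\phi_S(x) = 2^{-d/2}\,(-1)^{\sum_{i \in S} x_i}, \qquad S \subseteq \{1,\dots,d\},
\]

so a density $f$ on the hypercube expands as $f = \sum_S \theta_S \phi_S$ with $2^d$ coefficients $\theta_S = \sum_x f(x)\phi_S(x) = \mathbb{E}[\phi_S(X)]$, each estimated from samples $X_1,\dots,X_n$ by the plug-in estimator

\[
\hat{\theta}_S = \frac{1}{n}\sum_{j=1}^{n} \phi_S(X_j).
\]

The sparsity condition in the abstract controls how many of the $2^d$ coefficients actually need to be estimated, which is what makes the recursion tractable for large $d$.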
42. A Cordial Sync: Going Beyond Marginal Policies for Multi-agent Embodied Tasks
- Author
-
Jain, Unnat, Weihs, Luca, Kolve, Eric, Farhadi, Ali, Lazebnik, Svetlana, Kembhavi, Aniruddha, and Schwing, Alexander
- Published
- 2020
- Full Text
- View/download PDF
43. Memory-Efficient Incremental Learning Through Feature Adaptation
- Author
-
Iscen, Ahmet, Zhang, Jeffrey, Lazebnik, Svetlana, and Schmid, Cordelia
- Published
- 2020
- Full Text
- View/download PDF
44. Combining Multiple Cues for Visual Madlibs Question Answering
- Author
-
Tommasi, Tatiana, Mallya, Arun, Plummer, Bryan, Lazebnik, Svetlana, Berg, Alexander C., and Berg, Tamara L.
- Published
- 2019
- Full Text
- View/download PDF
45. Object Class Recognition (Categorization)
- Author
-
Lazebnik, Svetlana and Ikeuchi, Katsushi, editor
- Published
- 2014
- Full Text
- View/download PDF
46. Improving Image-Sentence Embeddings Using Large Weakly Annotated Photo Collections
- Author
-
Gong, Yunchao, Wang, Liwei, Hodosh, Micah, Hockenmaier, Julia, Lazebnik, Svetlana, Hutchison, David, Series editor, Kanade, Takeo, Series editor, Kittler, Josef, Series editor, Kleinberg, Jon M., Series editor, Kobsa, Alfred, Series editor, Mattern, Friedemann, Series editor, Mitchell, John C., Series editor, Naor, Moni, Series editor, Nierstrasz, Oscar, Series editor, Pandu Rangan, C., Series editor, Steffen, Bernhard, Series editor, Terzopoulos, Demetri, Series editor, Tygar, Doug, Series editor, Weikum, Gerhard, Series editor, Fleet, David, editor, Pajdla, Tomas, editor, Schiele, Bernt, editor, and Tuytelaars, Tinne, editor
- Published
- 2014
- Full Text
- View/download PDF
47. Piggyback: Adapting a Single Network to Multiple Tasks by Learning to Mask Weights
- Author
-
Mallya, Arun, Davis, Dillon, and Lazebnik, Svetlana
- Published
- 2018
- Full Text
- View/download PDF
48. Conditional Image-Text Embedding Networks
- Author
-
Plummer, Bryan A., Kordas, Paige, Kiapour, M. Hadi, Zheng, Shuai, Piramuthu, Robinson, and Lazebnik, Svetlana
- Published
- 2018
- Full Text
- View/download PDF
49. Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models
- Author
-
Plummer, Bryan A., Wang, Liwei, Cervantes, Chris M., Caicedo, Juan C., Hockenmaier, Julia, and Lazebnik, Svetlana
- Published
- 2017
- Full Text
- View/download PDF
50. Building Rome on a Cloudless Day
- Author
-
Frahm, Jan-Michael, Fite-Georgel, Pierre, Gallup, David, Johnson, Tim, Raguram, Rahul, Wu, Changchang, Jen, Yi-Hung, Dunn, Enrique, Clipp, Brian, Lazebnik, Svetlana, Pollefeys, Marc, Hutchison, David, Kanade, Takeo, Kittler, Josef, Kleinberg, Jon M., Mattern, Friedemann, Mitchell, John C., Naor, Moni, Nierstrasz, Oscar, Pandu Rangan, C., Steffen, Bernhard, Sudan, Madhu, Terzopoulos, Demetri, Tygar, Doug, Vardi, Moshe Y., Weikum, Gerhard, Daniilidis, Kostas, editor, Maragos, Petros, editor, and Paragios, Nikos, editor
- Published
- 2010
- Full Text
- View/download PDF