Author: "Cherian, Anoop" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Cherian, Anoop"' showing total 253 results

1. Temporally Grounding Instructional Diagrams in Unconstrained Videos

Author: Zhang, Jiahao, Zhang, Frederic Z., Rodriguez, Cristian, Ben-Shabat, Yizhak, Cherian, Anoop, and Gould, Stephen
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: We study the challenging problem of simultaneously localizing a sequence of queries in the form of instructional diagrams in a video. This requires understanding not only the individual queries but also their interrelationships. However, most existing methods focus on grounding one query at a time, ignoring the inherent structures among queries such as the general mutual exclusiveness and the temporal order. Consequently, the predicted timespans of different step diagrams may overlap considerably or violate the temporal order, thus harming the accuracy. In this paper, we tackle this issue by simultaneously grounding a sequence of step diagrams. Specifically, we propose composite queries, constructed by exhaustively pairing up the visual content features of the step diagrams and a fixed number of learnable positional embeddings. Our insight is that self-attention lets composite queries carrying different content features suppress each other, reducing timespan overlaps in predictions, while the cross-attention corrects the temporal misalignment via content and position joint guidance. We demonstrate the effectiveness of our approach on the IAW dataset for grounding step diagrams and the YouCook2 benchmark for grounding natural language queries, significantly outperforming existing methods while simultaneously grounding multiple queries.
Published: 2024

2. Disentangled Acoustic Fields For Multimodal Physical Scene Understanding

Author: Yin, Jie, Luo, Andrew, Du, Yilun, Cherian, Anoop, Marks, Tim K., Roux, Jonathan Le, and Gan, Chuang
Subjects: Computer Science - Robotics, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: We study the problem of multimodal physical scene understanding, where an embodied agent needs to find fallen objects by inferring object properties, direction, and distance of an impact sound source. Previous works adopt feed-forward neural networks to directly regress the variables from sound, leading to poor generalization and domain adaptation issues. In this paper, we illustrate that learning a disentangled model of acoustic formation, referred to as disentangled acoustic field (DAF), to capture the sound generation and propagation process, enables the embodied agent to construct a spatial uncertainty map over where the objects may have fallen. We demonstrate that our analysis-by-synthesis framework can jointly infer sound properties by explicitly decomposing and factorizing the latent space of the disentangled model. We further show that the spatial uncertainty map can significantly improve the success rate for the localization of fallen objects by proposing multiple plausible exploration locations.
Published: 2024

3. Evaluating Large Vision-and-Language Models on Children's Mathematical Olympiads

Author: Cherian, Anoop, Peng, Kuan-Chuan, Lohit, Suhas, Matthiesen, Joanna, Smith, Kevin, and Tenenbaum, Joshua B.
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Computer Vision and Pattern Recognition
Abstract: Recent years have seen significant progress in the general-purpose problem solving abilities of large vision and language models (LVLMs), such as ChatGPT, Gemini, etc.; some of these breakthroughs even seem to enable AI models to outperform human abilities in varied tasks that demand higher-order cognitive skills. Are the current large AI models indeed capable of generalized problem solving as humans do? A systematic analysis of AI capabilities for joint vision and text reasoning, however, is missing in the current scientific literature. In this paper, we make an effort towards filling this gap, by evaluating state-of-the-art LVLMs on their mathematical and algorithmic reasoning abilities using visuo-linguistic problems from children's Olympiads. Specifically, we consider problems from the Mathematical Kangaroo (MK) Olympiad, which is a popular international competition targeted at children from grades 1-12, that tests children's deeper mathematical abilities using puzzles that are appropriately gauged to their age and skills. Using the puzzles from MK, we created a dataset, dubbed SMART-840, consisting of 840 problems from years 2020-2024. With our dataset, we analyze LVLMs' power on mathematical reasoning; their responses on our puzzles offer a direct way to compare against that of children. Our results show that modern LVLMs do demonstrate increasingly powerful reasoning skills in solving problems for higher grades, but lack the foundations to correctly answer problems designed for younger children. Further analysis shows that there is no significant correlation between the reasoning capabilities of AI models and that of young children, and their capabilities appear to be based on a different type of reasoning than the cumulative knowledge that underlies children's mathematics and logic skills.
Published: 2024

4. TI2V-Zero: Zero-Shot Image Conditioning for Text-to-Video Diffusion Models

Author: Ni, Haomiao, Egger, Bernhard, Lohit, Suhas, Cherian, Anoop, Wang, Ye, Koike-Akino, Toshiaki, Huang, Sharon X., and Marks, Tim K.
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Text-conditioned image-to-video generation (TI2V) aims to synthesize a realistic video starting from a given image (e.g., a woman's photo) and a text description (e.g., "a woman is drinking water."). Existing TI2V frameworks often require costly training on video-text datasets and specific model designs for text and image conditioning. In this paper, we propose TI2V-Zero, a zero-shot, tuning-free method that empowers a pretrained text-to-video (T2V) diffusion model to be conditioned on a provided image, enabling TI2V generation without any optimization, fine-tuning, or introducing external modules. Our approach leverages a pretrained T2V diffusion foundation model as the generative prior. To guide video generation with the additional image input, we propose a "repeat-and-slide" strategy that modulates the reverse denoising process, allowing the frozen diffusion model to synthesize a video frame-by-frame starting from the provided image. To ensure temporal continuity, we employ a DDPM inversion strategy to initialize Gaussian noise for each newly synthesized frame and a resampling technique to help preserve visual details. We conduct comprehensive experiments on both domain-specific and open-domain datasets, where TI2V-Zero consistently outperforms a recent open-domain TI2V model. Furthermore, we show that TI2V-Zero can seamlessly extend to other tasks such as video infilling and prediction when provided with more images. Its autoregressive design also supports long video generation.
Comment: CVPR 2024
Published: 2024

5. Multi-level Reasoning for Robotic Assembly: From Sequence Inference to Contact Selection

Author: Zhu, Xinghao, Jha, Devesh K., Romeres, Diego, Sun, Lingfeng, Tomizuka, Masayoshi, and Cherian, Anoop
Subjects: Computer Science - Robotics, Computer Science - Artificial Intelligence, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: Automating the assembly of objects from their parts is a complex problem with innumerable applications in manufacturing, maintenance, and recycling. Unlike existing research, which is limited to target segmentation, pose regression, or using fixed target blueprints, our work presents a holistic multi-level framework for part assembly planning consisting of part assembly sequence inference, part motion planning, and robot contact optimization. We present the Part Assembly Sequence Transformer (PAST) -- a sequence-to-sequence neural network -- to infer assembly sequences recursively from a target blueprint. We then use a motion planner and optimization to generate part movements and contacts. To train PAST, we introduce D4PAS, a large-scale Dataset for Part Assembly Sequences consisting of physically valid sequences for industrial objects. Experimental results show that our approach generalizes better than prior methods while needing significantly less computational time for inference.
Comment: Supplementary video is available at https://www.youtube.com/watch?v=XNYkWSHkAaU&ab_channel=MitsubishiElectricResearchLabs%28MERL%29
Published: 2023

6. Steered Diffusion: A Generalized Framework for Plug-and-Play Conditional Image Synthesis

Author: Nair, Nithin Gopalakrishnan, Cherian, Anoop, Lohit, Suhas, Wang, Ye, Koike-Akino, Toshiaki, Patel, Vishal M., and Marks, Tim K.
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: Conditional generative models typically demand large annotated training sets to achieve high-quality synthesis. As a result, there has been significant interest in designing models that perform plug-and-play generation, i.e., to use a predefined or pretrained model, which is not explicitly trained on the generative task, to guide the generative process (e.g., using language). However, such guidance is typically useful only towards synthesizing high-level semantics rather than editing fine-grained details as in image-to-image translation tasks. To this end, and capitalizing on the powerful fine-grained generative control offered by the recent diffusion-based generative models, we introduce Steered Diffusion, a generalized framework for photorealistic zero-shot conditional image generation using a diffusion model trained for unconditional generation. The key idea is to steer the image generation of the diffusion model at inference time via designing a loss using a pre-trained inverse model that characterizes the conditional task. This loss modulates the sampling trajectory of the diffusion process. Our framework allows for easy incorporation of multiple conditions during inference. We present experiments using steered diffusion on several tasks including inpainting, colorization, text-guided semantic editing, and image super-resolution. Our results demonstrate clear qualitative and quantitative improvements over state-of-the-art diffusion-based plug-and-play models while adding negligible additional computational cost.
Comment: Accepted at ICCV 2023
Published: 2023

7. Pixel-Grounded Prototypical Part Networks

Author: Carmichael, Zachariah, Lohit, Suhas, Cherian, Anoop, Jones, Michael, and Scheirer, Walter
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Prototypical part neural networks (ProtoPartNNs), namely PROTOPNET and its derivatives, are an intrinsically interpretable approach to machine learning. Their prototype learning scheme enables intuitive explanations of the form, this (prototype) looks like that (testing image patch). But, does this actually look like that? In this work, we delve into why object part localization and associated heat maps in past work are misleading. Rather than localizing to object parts, existing ProtoPartNNs localize to the entire image, contrary to generated explanatory visualizations. We argue that detraction from these underlying issues is due to the alluring nature of visualizations and an over-reliance on intuition. To alleviate these issues, we devise new receptive field-based architectural constraints for meaningful localization and a principled pixel space mapping for ProtoPartNNs. To improve interpretability, we propose additional architectural improvements, including a simplified classification head. We also make additional corrections to PROTOPNET and its derivatives, such as the use of a validation set, rather than a test set, to evaluate generalization during training. Our approach, PIXPNET (Pixel-grounded Prototypical part Network), is the only ProtoPartNN that truly learns and localizes to prototypical object parts. We demonstrate that PIXPNET achieves quantifiably improved interpretability without sacrificing accuracy.
Comment: 21 pages
Published: 2023

8. CAVEN: An Embodied Conversational Agent for Efficient Audio-Visual Navigation in Noisy Environments

Author: Liu, Xiulong, Paul, Sudipta, Chatterjee, Moitreya, and Cherian, Anoop
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: Audio-visual navigation of an agent towards locating an audio goal is a challenging task, especially when the audio is sporadic or the environment is noisy. In this paper, we present CAVEN, a Conversation-based Audio-Visual Embodied Navigation framework in which the agent may interact with a human/oracle for solving the task of navigating to an audio goal. Specifically, CAVEN is modeled as a budget-aware partially observable semi-Markov decision process that implicitly learns the uncertainty in the audio-based navigation policy to decide when and how the agent may interact with the oracle. Our CAVEN agent can engage in fully-bidirectional natural language conversations by producing relevant questions and interpreting free-form, potentially noisy responses from the oracle based on the audio-visual context. To enable such a capability, CAVEN is equipped with: (i) a trajectory forecasting network that is grounded in audio-visual cues to produce a potential trajectory to the estimated goal, and (ii) a natural language based question generation and reasoning network to pose an interactive question to the oracle or interpret the oracle's response to produce navigation instructions. To train the interactive modules, we present a large scale dataset: AVN-Instruct, based on the Landmark-RxR dataset. To substantiate the usefulness of conversations, we present experiments on the benchmark audio-goal task using the SoundSpaces simulator under various noisy settings. Our results reveal that our fully-conversational approach leads to nearly an order-of-magnitude improvement in success rate, especially in localizing new sound sources and against methods that only use uni-directional interaction.
Comment: Accepted at AAAI 2024
Published: 2023

9. HaLP: Hallucinating Latent Positives for Skeleton-based Self-Supervised Learning of Actions

Author: Shah, Anshul, Roy, Aniket, Shah, Ketul, Mishra, Shlok Kumar, Jacobs, David, Cherian, Anoop, and Chellappa, Rama
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Supervised learning of skeleton sequence encoders for action recognition has received significant attention in recent times. However, learning such encoders without labels continues to be a challenging problem. While prior works have shown promising results by applying contrastive learning to pose sequences, the quality of the learned representations is often observed to be closely tied to data augmentations that are used to craft the positives. However, augmenting pose sequences is a difficult task as the geometric constraints among the skeleton joints need to be enforced to make the augmentations realistic for that action. In this work, we propose a new contrastive learning approach to train models for skeleton-based action recognition without labels. Our key contribution is a simple module, HaLP - to Hallucinate Latent Positives for contrastive learning. Specifically, HaLP explores the latent space of poses in suitable directions to generate new positives. To this end, we present a novel optimization formulation to solve for the synthetic positives with an explicit control on their hardness. We propose approximations to the objective, making them solvable in closed form with minimal overhead. We show via experiments that using these generated positives within a standard contrastive learning framework leads to consistent improvements across benchmarks such as NTU-60, NTU-120, and PKU-II on tasks like linear evaluation, transfer learning, and kNN evaluation. Our code will be made available at https://github.com/anshulbshah/HaLP.
Comment: To be presented at CVPR 2023
Published: 2023

10. Aligning Step-by-Step Instructional Diagrams to Video Demonstrations

Author: Zhang, Jiahao, Cherian, Anoop, Liu, Yanbin, Ben-Shabat, Yizhak, Rodriguez, Cristian, and Gould, Stephen
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Multimodal alignment facilitates the retrieval of instances from one modality when queried using another. In this paper, we consider a novel setting where such an alignment is between (i) instruction steps that are depicted as assembly diagrams (commonly seen in Ikea assembly manuals) and (ii) video segments from in-the-wild videos; these videos comprise an enactment of the assembly actions in the real world. To learn this alignment, we introduce a novel supervised contrastive learning method that learns to align videos with the subtle details in the assembly diagrams, guided by a set of novel losses. To study this problem and demonstrate the effectiveness of our method, we introduce a novel dataset: IAW, for Ikea assembly in the wild, consisting of 183 hours of videos from diverse furniture assembly collections and nearly 8,300 illustrations from their associated instruction manuals, annotated for their ground truth alignments. We define two tasks on this dataset: first, nearest neighbor retrieval between video segments and illustrations, and, second, alignment of instruction steps and the segments for each video. Extensive experiments on IAW demonstrate superior performances of our approach against alternatives.
Comment: Project website: https://academic.davidz.cn/en/publication/zhang-cvpr-2023/
Published: 2023

11. Are Deep Neural Networks SMARTer than Second Graders?

Author: Cherian, Anoop, Peng, Kuan-Chuan, Lohit, Suhas, Smith, Kevin A., and Tenenbaum, Joshua B.
Subjects: Computer Science - Artificial Intelligence, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: Recent times have witnessed an increasing number of applications of deep neural networks towards solving tasks that require superior cognitive abilities, e.g., playing Go, generating art, ChatGPT, etc. Such dramatic progress raises the question: how generalizable are neural networks in solving problems that demand broad skills? To answer this question, we propose SMART: a Simple Multimodal Algorithmic Reasoning Task and the associated SMART-101 dataset, for evaluating the abstraction, deduction, and generalization abilities of neural networks in solving visuo-linguistic puzzles designed specifically for children in the 6-8 age group. Our dataset consists of 101 unique puzzles; each puzzle comprises a picture and a question, and their solution needs a mix of several elementary skills, including arithmetic, algebra, and spatial reasoning, among others. To scale our dataset towards training deep neural networks, we programmatically generate entirely new instances for each puzzle, while retaining their solution algorithm. To benchmark performances on SMART-101, we propose a vision and language meta-learning model using varied state-of-the-art backbones. Our experiments reveal that while powerful deep models offer reasonable performances on puzzles in a supervised setting, they are not better than random accuracy when analyzed for generalization. We also evaluate the recent ChatGPT and other large language models on a subset of SMART-101 and find that while these models show convincing reasoning abilities, the answers are often incorrect.
Comment: Extended version of CVPR 2023 paper. For the SMART-101 dataset, see http://smartdataset.github.io/smart101
Published: 2022

12. Learning Audio-Visual Dynamics Using Scene Graphs for Audio Source Separation

Author: Chatterjee, Moitreya, Ahuja, Narendra, and Cherian, Anoop
Subjects: Computer Science - Sound, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: There exists an unequivocal distinction between the sound produced by a static source and that produced by a moving one, especially when the source moves towards or away from the microphone. In this paper, we propose to use this connection between audio and visual dynamics for solving two challenging tasks simultaneously, namely: (i) separating audio sources from a mixture using visual cues, and (ii) predicting the 3D visual motion of a sounding source using its separated audio. Towards this end, we present Audio Separator and Motion Predictor (ASMP) -- a deep learning framework that leverages the 3D structure of the scene and the motion of sound sources for better audio source separation. At the heart of ASMP is a 2.5D scene graph capturing various objects in the video and their pseudo-3D spatial proximities. This graph is constructed by registering together 2.5D monocular depth predictions from the 2D video frames and associating the 2.5D scene regions with the outputs of an object detector applied on those frames. The ASMP task is then mathematically modeled as the joint problem of: (i) recursively segmenting the 2.5D scene graph into several sub-graphs, each associated with a constituent sound in the input audio mixture (which is then separated) and (ii) predicting the 3D motions of the corresponding sound sources from the separated audio. To empirically evaluate ASMP, we present experiments on two challenging audio-visual datasets, viz. Audio Separation in the Wild (ASIW) and Audio Visual Event (AVE). Our results demonstrate that ASMP achieves a clear improvement in source separation quality, outperforming prior works on both datasets, while also estimating the direction of motion of the sound sources better than other methods.
Comment: Accepted at NeurIPS 2022
Published: 2022

13. H-SAUR: Hypothesize, Simulate, Act, Update, and Repeat for Understanding Object Articulations from Interactions

Author: Ota, Kei, Tung, Hsiao-Yu, Smith, Kevin A., Cherian, Anoop, Marks, Tim K., Sullivan, Alan, Kanezaki, Asako, and Tenenbaum, Joshua B.
Subjects: Computer Science - Robotics, Computer Science - Artificial Intelligence, Computer Science - Computer Vision and Pattern Recognition
Abstract: The world is filled with articulated objects that are difficult to determine how to use from vision alone, e.g., a door might open inwards or outwards. Humans handle these objects with strategic trial-and-error: first pushing a door then pulling if that doesn't work. We enable these capabilities in autonomous agents by proposing "Hypothesize, Simulate, Act, Update, and Repeat" (H-SAUR), a probabilistic generative framework that simultaneously generates a distribution of hypotheses about how objects articulate given input observations, captures certainty over hypotheses over time, and infers plausible actions for exploration and goal-conditioned manipulation. We compare our model with existing work in manipulating objects after a handful of exploration actions, on the PartNet-Mobility dataset. We further propose a novel PuzzleBoxes benchmark that contains locked boxes that require multiple steps to solve. We show that the proposed model significantly outperforms the current state-of-the-art articulated object manipulation framework, despite using zero training data. We further improve the test-time efficiency of H-SAUR by integrating a learned prior from learning-based vision models.
Published: 2022

14. AVLEN: Audio-Visual-Language Embodied Navigation in 3D Environments

Author: Paul, Sudipta, Roy-Chowdhury, Amit K., and Cherian, Anoop
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: Recent years have seen embodied visual navigation advance in two distinct directions: (i) in equipping the AI agent to follow natural language instructions, and (ii) in making the navigable world multimodal, e.g., audio-visual navigation. However, the real world is not only multimodal, but also often complex, and thus in spite of these advances, agents still need to understand the uncertainty in their actions and seek instructions to navigate. To this end, we present AVLEN -- an interactive agent for Audio-Visual-Language Embodied Navigation. Similar to audio-visual navigation tasks, the goal of our embodied agent is to localize an audio event via navigating the 3D visual world; however, the agent may also seek help from a human (oracle), where the assistance is provided in free-form natural language. To realize these abilities, AVLEN uses a multimodal hierarchical reinforcement learning backbone that learns: (a) high-level policies to choose either audio-cues for navigation or to query the oracle, and (b) lower-level policies to select navigation actions based on its audio-visual and language inputs. The policies are trained via rewarding for the success on the navigation task while minimizing the number of queries to the oracle. To empirically evaluate AVLEN, we present experiments on the SoundSpaces framework for semantic audio-visual navigation tasks. Our results show that equipping the agent to ask for help leads to a clear improvement in performance, especially in challenging cases, e.g., when the sound is unheard during training or in the presence of distractor sounds.
Comment: Accepted at NeurIPS 2022
Published: 2022

15. (2.5+1)D Spatio-Temporal Scene Graphs for Video Question Answering

Author: Cherian, Anoop, Hori, Chiori, Marks, Tim K., and Roux, Jonathan Le
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: Spatio-temporal scene-graph approaches to video-based reasoning tasks, such as video question-answering (QA), typically construct such graphs for every video frame. These approaches often ignore the fact that videos are essentially sequences of 2D "views" of events happening in a 3D space, and that the semantics of the 3D scene can thus be carried over from frame to frame. Leveraging this insight, we propose a (2.5+1)D scene graph representation to better capture the spatio-temporal information flows inside the videos. Specifically, we first create a 2.5D (pseudo-3D) scene graph by transforming every 2D frame to have an inferred 3D structure using an off-the-shelf 2D-to-3D transformation module, following which we register the video frames into a shared (2.5+1)D spatio-temporal space and ground each 2D scene graph within it. Such a (2.5+1)D graph is then segregated into a static sub-graph and a dynamic sub-graph, corresponding to whether the objects within them usually move in the world. The nodes in the dynamic graph are enriched with motion features capturing their interactions with other graph nodes. Next, for the video QA task, we present a novel transformer-based reasoning pipeline that embeds the (2.5+1)D graph into a spatio-temporal hierarchical latent space, where the sub-graphs and their interactions are captured at varied granularity. To demonstrate the effectiveness of our approach, we present experiments on the NExT-QA and AVSD-QA datasets. Our results show that our proposed (2.5+1)D representation leads to faster training and inference, while our hierarchical model showcases superior performance on the video QA task versus the state of the art.
Comment: Accepted at AAAI 2022 (Oral)
Published: 2022

16. Max-Margin Contrastive Learning

Author: Shah, Anshul, Sra, Suvrit, Chellappa, Rama, and Cherian, Anoop
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Computer Vision and Pattern Recognition
Abstract: Standard contrastive learning approaches usually require a large number of negatives for effective unsupervised learning and often exhibit slow convergence. We suspect this behavior is due to the suboptimal selection of negatives used for offering contrast to the positives. We counter this difficulty by taking inspiration from support vector machines (SVMs) to present max-margin contrastive learning (MMCL). Our approach selects negatives as the sparse support vectors obtained via a quadratic optimization problem, and contrastiveness is enforced by maximizing the decision margin. As SVM optimization can be computationally demanding, especially in an end-to-end setting, we present simplifications that alleviate the computational burden. We validate our approach on standard vision benchmark datasets, demonstrating better performance in unsupervised representation learning over state-of-the-art, while having better empirical convergence properties.
Comment: Accepted at AAAI 2022
Published: 2021

17. MOST-GAN: 3D Morphable StyleGAN for Disentangled Face Image Manipulation

Author: Medin, Safa C., Egger, Bernhard, Cherian, Anoop, Wang, Ye, Tenenbaum, Joshua B., Liu, Xiaoming, and Marks, Tim K.
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Graphics, Computer Science - Machine Learning, I.2.10
Abstract: Recent advances in generative adversarial networks (GANs) have led to remarkable achievements in face image synthesis. While methods that use style-based GANs can generate strikingly photorealistic face images, it is often difficult to control the characteristics of the generated faces in a meaningful and disentangled way. Prior approaches aim to achieve such semantic control and disentanglement within the latent space of a previously trained GAN. In contrast, we propose a framework that a priori models physical attributes of the face such as 3D shape, albedo, pose, and lighting explicitly, thus providing disentanglement by design. Our method, MOST-GAN, integrates the expressive power and photorealism of style-based GANs with the physical disentanglement and flexibility of nonlinear 3D morphable models, which we couple with a state-of-the-art 2D hair manipulation network. MOST-GAN achieves photorealistic manipulation of portrait images with fully disentangled 3D control over their physical attributes, enabling extreme manipulation of lighting, facial expression, and pose variations up to full profile view.
Published: 2021

18. Audio-Visual Scene-Aware Dialog and Reasoning using Audio-Visual Transformers with Joint Student-Teacher Learning

Author: Shah, Ankit P., Geng, Shijie, Gao, Peng, Cherian, Anoop, Hori, Takaaki, Marks, Tim K., Roux, Jonathan Le, and Hori, Chiori
Subjects: Computer Science - Computation and Language
Abstract: In previous work, we have proposed the Audio-Visual Scene-Aware Dialog (AVSD) task, collected an AVSD dataset, developed AVSD technologies, and hosted an AVSD challenge track at both the 7th and 8th Dialog System Technology Challenges (DSTC7, DSTC8). In these challenges, the best-performing systems relied heavily on human-generated descriptions of the video content, which were available in the datasets but would be unavailable in real-world applications. To promote further advancements for real-world applications, we proposed a third AVSD challenge, at DSTC10, with two modifications: 1) the human-created description is unavailable at inference time, and 2) systems must demonstrate temporal reasoning by finding evidence from the video to support each answer. This paper introduces the new task that includes temporal reasoning and our new extension of the AVSD dataset for DSTC10, for which we collected human-generated temporal reasoning data. We also introduce a baseline system built using an AV-transformer, which we released along with the new dataset. Finally, this paper introduces a new system that extends our baseline system with attentional multimodal fusion, joint student-teacher learning (JSTL), and model combination techniques, achieving state-of-the-art performances on the AVSD datasets for DSTC7, DSTC8, and DSTC10. We also propose two temporal reasoning methods for AVSD: one attention-based, and one based on a time-domain region proposal network.
Comment: https://dstc10.dstc.community/home and https://github.com/dialogtekgeek/AVSD-DSTC10_Official/
Published: 2021

19. A Hierarchical Variational Neural Uncertainty Model for Stochastic Video Prediction

Author: Chatterjee, Moitreya, Ahuja, Narendra, and Cherian, Anoop
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: Predicting the future frames of a video is a challenging task, in part due to the underlying stochastic real-world phenomena. Prior approaches to this task typically estimate a latent prior characterizing this stochasticity, but do not account for the predictive uncertainty of the (deep learning) model. Such approaches often derive the training signal from the mean-squared error (MSE) between the generated frame and the ground truth, which can lead to sub-optimal training, especially when the predictive uncertainty is high. To address this, we introduce Neural Uncertainty Quantifier (NUQ) - a stochastic quantification of the model's predictive uncertainty, and use it to weigh the MSE loss. We propose a hierarchical, variational framework to derive NUQ in a principled manner using a deep, Bayesian graphical model. Our experiments on four benchmark stochastic video prediction datasets show that our proposed framework trains more effectively than the state-of-the-art models (especially when the training sets are small), while demonstrating better video generation quality and diversity against several evaluation metrics., Comment: Accepted at ICCV 2021 (Oral)
Published: 2021

20. Visual Scene Graphs for Audio Source Separation

Author: Chatterjee, Moitreya, Roux, Jonathan Le, Ahuja, Narendra, and Cherian, Anoop
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Machine Learning, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: State-of-the-art approaches for visually-guided audio source separation typically assume sources that have characteristic sounds, such as musical instruments. These approaches often ignore the visual context of these sound sources or avoid modeling object interactions that may be useful to better characterize the sources, especially when the same object class may produce varied sounds from distinct interactions. To address this challenging problem, we propose Audio Visual Scene Graph Segmenter (AVSGS), a novel deep learning model that embeds the visual structure of the scene as a graph and segments this graph into subgraphs, each subgraph being associated with a unique sound obtained by co-segmenting the audio spectrogram. At its core, AVSGS uses a recursive neural network that emits mutually-orthogonal sub-graph embeddings of the visual graph using multi-head attention. These embeddings are used for conditioning an audio encoder-decoder towards source separation. Our pipeline is trained end-to-end via a self-supervised task consisting of separating audio sources using the visual graph from artificially mixed sounds. In this paper, we also introduce an "in the wild" video dataset for sound source separation that contains multiple non-musical sources, which we call Audio Separation in the Wild (ASIW). This dataset is adapted from the AudioCaps dataset, and provides a challenging, natural, and daily-life setting for source separation. Thorough experiments on the proposed ASIW and the standard MUSIC datasets demonstrate state-of-the-art sound separation performance of our method against recent prior approaches., Comment: Accepted at ICCV 2021
Published: 2021

21. InSeGAN: A Generative Approach to Segmenting Identical Instances in Depth Images

Author: Cherian, Anoop, Pais, Goncalo Dias, Jain, Siddarth, Marks, Tim K., and Sullivan, Alan
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Machine Learning, Computer Science - Robotics
Abstract: In this paper, we present InSeGAN, an unsupervised 3D generative adversarial network (GAN) for segmenting (nearly) identical instances of rigid objects in depth images. Using an analysis-by-synthesis approach, we design a novel GAN architecture to synthesize a multiple-instance depth image with independent control over each instance. InSeGAN takes in a set of code vectors (e.g., random noise vectors), each encoding the 3D pose of an object that is represented by a learned implicit object template. The generator has two distinct modules. The first module, the instance feature generator, uses each encoded pose to transform the implicit template into a feature map representation of each object instance. The second module, the depth image renderer, aggregates all of the single-instance feature maps output by the first module and generates a multiple-instance depth image. A discriminator distinguishes the generated multiple-instance depth images from the distribution of true depth images. To use our model for instance segmentation, we propose an instance pose encoder that learns to take in a generated depth image and reproduce the pose code vectors for all of the object instances. To evaluate our approach, we introduce a new synthetic dataset, "Insta-10", consisting of 100,000 depth images, each with 5 instances of an object from one of 10 classes. Our experiments on Insta-10, as well as on real-world noisy depth images, show that InSeGAN achieves state-of-the-art performance, often outperforming prior methods by large margins., Comment: Accepted at ICCV 2021. Code & data @ https://www.merl.com/research/license/InSeGAN
Published: 2021

22. Generalized One-Class Learning Using Pairs of Complementary Classifiers

Author: Cherian, Anoop and Wang, Jue
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: One-class learning is the classic problem of fitting a model to the data for which annotations are available only for a single class. In this paper, we explore novel objectives for one-class learning, which we collectively refer to as Generalized One-class Discriminative Subspaces (GODS). Our key idea is to learn a pair of complementary classifiers to flexibly bound the one-class data distribution, where the data belongs to the positive half-space of one of the classifiers in the complementary pair and to the negative half-space of the other. To avoid redundancy while allowing non-linearity in the classifier decision surfaces, we propose to design each classifier as an orthonormal frame and seek to learn these frames via jointly optimizing for two conflicting objectives, namely: i) to minimize the distance between the two frames, and ii) to maximize the margin between the frames and the data. The learned orthonormal frames will thus characterize a piecewise linear decision surface that allows for efficient inference, while our objectives seek to bound the data within a minimal volume that maximizes the decision margin, thereby robustly capturing the data distribution. We explore several variants of our formulation under different constraints on the constituent classifiers, including kernelized feature maps. We demonstrate the empirical benefits of our approach via experiments on data from several applications in computer vision, such as anomaly detection in video sequences, human poses, and human activities. We also explore the generality and effectiveness of GODS for non-vision tasks via experiments on several UCI datasets, demonstrating state-of-the-art results., Comment: Accepted at Trans. PAMI. arXiv admin note: text overlap with arXiv:1908.05884
Published: 2021

23. Learning Log-Determinant Divergences for Positive Definite Matrices

Author: Cherian, Anoop, Stanitsas, Panagiotis, Wang, Jue, Harandi, Mehrtash, Morellas, Vassilios, and Papanikolopoulos, Nikolaos
Subjects: Computer Science - Machine Learning, Computer Science - Computer Vision and Pattern Recognition
Abstract: Representations in the form of Symmetric Positive Definite (SPD) matrices have been popularized in a variety of visual learning applications due to their demonstrated ability to capture rich second-order statistics of visual data. There exist several similarity measures for comparing SPD matrices with documented benefits. However, selecting an appropriate measure for a given problem remains a challenge and in most cases, is the result of a trial-and-error process. In this paper, we propose to learn similarity measures in a data-driven manner. To this end, we capitalize on the αβ-log-det divergence, which is a meta-divergence parametrized by scalars α and β, subsuming a wide family of popular information divergences on SPD matrices for distinct and discrete values of these parameters. Our key idea is to cast these parameters in a continuum and learn them from data. We systematically extend this idea to learn vector-valued parameters, thereby increasing the expressiveness of the underlying non-linear measure. We conjoin the divergence learning problem with several standard tasks in machine learning, including supervised discriminative dictionary learning and unsupervised SPD matrix clustering. We present Riemannian gradient descent schemes for optimizing our formulations efficiently, and show the usefulness of our method on eight standard computer vision tasks., Comment: Accepted at Trans. PAMI (extended version of ICCV 2017 paper). arXiv admin note: substantial text overlap with arXiv:1708.01741
Published: 2021

24. Tensor Representations for Action Recognition

Author: Koniusz, Piotr, Wang, Lei, and Cherian, Anoop
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Human actions in video sequences are characterized by the complex interplay between spatial features and their temporal dynamics. In this paper, we propose novel tensor representations for compactly capturing such higher-order relationships between visual features for the task of action recognition. We propose two tensor-based feature representations, viz. (i) sequence compatibility kernel (SCK) and (ii) dynamics compatibility kernel (DCK). SCK builds on the spatio-temporal correlations between features, whereas DCK explicitly models the action dynamics of a sequence. We also explore a generalization of SCK, coined SCK(+), that operates on subsequences to capture the local-global interplay of correlations, and which can incorporate multi-modal inputs, e.g., skeleton 3D body-joints and per-frame classifier scores obtained from deep learning models trained on videos. We introduce a linearization of these kernels that leads to compact and fast descriptors. We provide experiments on (i) 3D skeleton action sequences, (ii) fine-grained video sequences, and (iii) standard non-fine-grained videos. As our final representations are tensors that capture higher-order relationships of features, they relate to co-occurrences for robust fine-grained recognition. We use higher-order tensors and so-called Eigenvalue Power Normalization (EPN), which have long been speculated to perform spectral detection of higher-order occurrences, thus detecting fine-grained relationships of features rather than merely counting features in action sequences. We prove that a tensor of order r, built from Z* dimensional features, coupled with EPN indeed detects if at least one higher-order occurrence is 'projected' into one of its binom(Z*,r) subspaces of dim. r represented by the tensor, thus forming a Tensor Power Normalization metric endowed with binom(Z*,r) such 'detectors'., Comment: Published with TPAMI, 2020. arXiv admin note: text overlap with arXiv:1604.00239
Published: 2020

25. First-Order Optimization Inspired from Finite-Time Convergent Flows

Author: Zhang, Siqi, Benosman, Mouhacine, Romero, Orlando, and Cherian, Anoop
Subjects: Computer Science - Machine Learning, Electrical Engineering and Systems Science - Systems and Control, 68T07
Abstract: In this paper, we investigate the performance of two first-order optimization algorithms, obtained from forward Euler discretization of finite-time optimization flows. These flows are the rescaled-gradient flow (RGF) and the signed-gradient flow (SGF), and consist of non-Lipschitz or discontinuous dynamical systems that converge locally in finite time to the minima of gradient-dominated functions. We propose an Euler discretization for these first-order finite-time flows, and provide convergence guarantees, in the deterministic and the stochastic setting. We then apply the proposed algorithms to academic examples, as well as to deep neural network training, where we empirically test their performance on the SVHN dataset. Our results show that our schemes converge faster than standard optimization alternatives.
Published: 2020

26. Sound2Sight: Generating Visual Dynamics from Sound and Context

Author: Cherian, Anoop, Chatterjee, Moitreya, and Ahuja, Narendra
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Learning associations across modalities is critical for robust multimodal reasoning, especially when a modality may be missing during inference. In this paper, we study this problem in the context of audio-conditioned visual synthesis -- a task that is important, for example, in occlusion reasoning. Specifically, our goal is to generate future video frames and their motion dynamics conditioned on audio and a few past frames. To tackle this problem, we present Sound2Sight, a deep variational framework that is trained to learn a per-frame stochastic prior conditioned on a joint embedding of audio and past frames. This embedding is learned via a multi-head attention-based audio-visual transformer encoder. The learned prior is then sampled to further condition a video forecasting module to generate future frames. The stochastic prior allows the model to sample multiple plausible futures that are consistent with the provided audio and the past context. Moreover, to improve the quality and coherence of the generated frames, we propose a multimodal discriminator that differentiates between a synthesized and a real audio-visual clip. We empirically evaluate our approach, vis-à-vis closely-related prior methods, on two new datasets, viz. (i) Multimodal Stochastic Moving MNIST with a Surprise Obstacle, (ii) Youtube Paintings; as well as on the existing Audio-Set Drums dataset. Our extensive experiments demonstrate that Sound2Sight significantly outperforms the state of the art in the generated video quality, while also producing diverse video content., Comment: Accepted at ECCV 2020
Published: 2020

27. Representation Learning via Adversarially-Contrastive Optimal Transport

Author: Cherian, Anoop and Aeron, Shuchin
Subjects: Computer Science - Machine Learning, Computer Science - Computer Vision and Pattern Recognition, Statistics - Machine Learning
Abstract: In this paper, we study the problem of learning compact (low-dimensional) representations for sequential data that captures its implicit spatio-temporal cues. To maximize extraction of such informative cues from the data, we set the problem within the context of contrastive representation learning and to that end propose a novel objective via optimal transport. Specifically, our formulation seeks a low-dimensional subspace representation of the data that jointly (i) maximizes the distance of the data (embedded in this subspace) from an adversarial data distribution under the optimal transport, a.k.a. the Wasserstein distance, (ii) captures the temporal order, and (iii) minimizes the data distortion. To generate the adversarial distribution, we propose a novel framework connecting Wasserstein GANs with a classifier, allowing a principled mechanism for producing good negative distributions for contrastive learning, which is currently a challenging problem. Our full objective is cast as a subspace learning problem on the Grassmann manifold and solved via Riemannian optimization. To empirically study our formulation, we provide experiments on the task of human action recognition in video sequences. Our results demonstrate competitive performance against challenging baselines., Comment: Accepted at ICML 2020
Published: 2020

28. Dynamic Graph Representation Learning for Video Dialog via Multi-Modal Shuffled Transformers

Author: Geng, Shijie, Gao, Peng, Chatterjee, Moitreya, Hori, Chiori, Roux, Jonathan Le, Zhang, Yongfeng, Li, Hongsheng, and Cherian, Anoop
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computation and Language
Abstract: Given an input video, its associated audio, and a brief caption, the audio-visual scene aware dialog (AVSD) task requires an agent to engage in a question-answer dialog with a human about the audio-visual content. This task thus poses a challenging multi-modal representation learning and reasoning scenario, advances in which could influence several human-machine interaction applications. To solve this task, we introduce a semantics-controlled multi-modal shuffled Transformer reasoning framework, consisting of a sequence of Transformer modules, each taking a modality as input and producing representations conditioned on the input question. Our proposed Transformer variant uses a shuffling scheme on its multi-head outputs, demonstrating better regularization. To encode fine-grained visual information, we present a novel dynamic scene graph representation learning pipeline that consists of an intra-frame reasoning layer producing spatio-semantic graph representations for every frame, and an inter-frame aggregation module capturing temporal cues. Our entire pipeline is trained end-to-end. We present experiments on the benchmark AVSD dataset, both on answer generation and selection tasks. Our results demonstrate state-of-the-art performances on all evaluation metrics., Comment: Accepted at AAAI 2021
Published: 2020

29. Dense Non-Rigid Structure from Motion: A Manifold Viewpoint

Author: Kumar, Suryansh, Van Gool, Luc, de Oliveira, Carlos E. P., Cherian, Anoop, Dai, Yuchao, and Li, Hongdong
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computational Geometry
Abstract: The Non-Rigid Structure-from-Motion (NRSfM) problem aims to recover the 3D geometry of a deforming object from its 2D feature correspondences across multiple frames. Classical approaches to this problem assume a small number of feature points and ignore the local non-linearities of the shape deformation, and therefore struggle to reliably model non-linear deformations. Furthermore, available dense NRSfM algorithms are often hindered by scalability, computation, and noisy measurements, and are restricted to modeling just global deformation. In this paper, we propose algorithms that overcome these limitations of previous methods and, at the same time, recover a reliable dense 3D structure of a non-rigid object with higher accuracy. Assuming that a deforming shape is composed of a union of local linear subspaces and spans a global low-rank space over multiple frames enables us to efficiently model complex non-rigid deformations. To that end, each local linear subspace is represented using Grassmannians, and the global 3D shape across multiple frames is represented using a low-rank representation. We show that our approach significantly improves accuracy, scalability, and robustness against noise. Also, our representation naturally allows for a simultaneous reconstruction and clustering framework, which in general is observed to be more suitable for NRSfM problems. Our method currently achieves leading performance on the standard benchmark datasets., Comment: A comprehensive version that combines our CVPR 2018 and CVPR 2019 work (Still under development and refinement, Initial Version). 13 Figures, 1 Table. arXiv admin note: text overlap with arXiv:1902.01077
Published: 2020

30. Inferring Temporal Compositions of Actions Using Probabilistic Automata

Author: Cruz, Rodrigo Santa, Cherian, Anoop, Fernando, Basura, Campbell, Dylan, and Gould, Stephen
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: This paper presents a framework to recognize temporal compositions of atomic actions in videos. Specifically, we propose to express temporal compositions of actions as semantic regular expressions and derive an inference framework using probabilistic automata to recognize complex actions as satisfying these expressions on the input video features. Our approach is different from existing works that either predict long-range complex activities as unordered sets of atomic actions, or retrieve videos using natural language sentences. Instead, the proposed approach allows recognizing complex fine-grained activities using only pretrained action classifiers, without requiring any additional data, annotations or neural network training. To evaluate the potential of our approach, we provide experiments on synthetic datasets and challenging real action recognition datasets, such as MultiTHUMOS and Charades. We conclude that the proposed approach can extend state-of-the-art primitive action classifiers to vastly more complex activities without large performance degradation., Comment: Accepted in Workshop on Compositionality in Computer Vision at CVPR, 2020
Published: 2020

31. LUVLi Face Alignment: Estimating Landmarks' Location, Uncertainty, and Visibility Likelihood

Author: Kumar, Abhinav, Marks, Tim K., Mou, Wenxuan, Wang, Ye, Jones, Michael, Cherian, Anoop, Koike-Akino, Toshiaki, Liu, Xiaoming, and Feng, Chen
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Image and Video Processing
Abstract: Modern face alignment methods have become quite accurate at predicting the locations of facial landmarks, but they do not typically estimate the uncertainty of their predicted locations nor predict whether landmarks are visible. In this paper, we present a novel framework for jointly predicting landmark locations, associated uncertainties of these predicted locations, and landmark visibilities. We model these as mixed random variables and estimate them using a deep network trained with our proposed Location, Uncertainty, and Visibility Likelihood (LUVLi) loss. In addition, we release an entirely new labeling of a large face alignment dataset with over 19,000 face images in a full range of head poses. Each face is manually labeled with the ground-truth locations of 68 landmarks, with the additional information of whether each landmark is unoccluded, self-occluded (due to extreme head poses), or externally occluded. Not only does our joint estimation yield accurate estimates of the uncertainty of predicted landmark locations, but it also yields state-of-the-art estimates for the landmark locations themselves on multiple standard face alignment datasets. Our method's estimates of the uncertainty of predicted landmark locations could be used to automatically identify input images on which face alignment fails, which can be critical for downstream tasks., Comment: Accepted to CVPR 2020
Published: 2020

32. Spatio-Temporal Ranked-Attention Networks for Video Captioning

Author: Cherian, Anoop, Wang, Jue, Hori, Chiori, and Marks, Tim K.
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Generating video descriptions automatically is a challenging task that involves a complex interplay between spatio-temporal visual features and language models. Given that videos consist of spatial (frame-level) features and their temporal evolutions, an effective captioning model should be able to attend to these different cues selectively. To this end, we propose a Spatio-Temporal and Temporo-Spatial (STaTS) attention model which, conditioned on the language state, hierarchically combines spatial and temporal attention to videos in two different orders: (i) a spatio-temporal (ST) sub-model, which first attends to regions that have temporal evolution, then temporally pools the features from these regions; and (ii) a temporo-spatial (TS) sub-model, which first decides a single frame to attend to, then applies spatial attention within that frame. We propose a novel LSTM-based temporal ranking function, which we call ranked attention, for the ST model to capture action dynamics. Our entire framework is trained end-to-end. We provide experiments on two benchmark datasets: MSVD and MSR-VTT. Our results demonstrate the synergy between the ST and TS modules, outperforming recent state-of-the-art methods.
Published: 2020

33. The Eighth Dialog System Technology Challenge

Author: Kim, Seokhwan, Galley, Michel, Gunasekara, Chulaka, Lee, Sungjin, Atkinson, Adam, Peng, Baolin, Schulz, Hannes, Gao, Jianfeng, Li, Jinchao, Adada, Mahmoud, Huang, Minlie, Lastras, Luis, Kummerfeld, Jonathan K., Lasecki, Walter S., Hori, Chiori, Cherian, Anoop, Marks, Tim K., Rastogi, Abhinav, Zang, Xiaoxue, Sunkara, Srinivas, and Gupta, Raghav
Subjects: Computer Science - Computation and Language
Abstract: This paper introduces the Eighth Dialog System Technology Challenge. In line with recent challenges, the eighth edition focuses on applying end-to-end dialog technologies in a pragmatic way for multi-domain task-completion, noetic response selection, audio visual scene-aware dialog, and schema-guided dialog state tracking tasks. This paper describes the task definition, provided datasets, and evaluation set-up for each track. We also summarize the results of the submitted systems to highlight the overall trends of the state-of-the-art technologies for the tasks., Comment: Submitted to NeurIPS 2019 3rd Conversational AI Workshop
Published: 2019

34. Discriminative Video Representation Learning Using Support Vector Classifiers

Author: Wang, Jue and Cherian, Anoop
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Most popular deep models for action recognition in videos generate independent predictions for short clips, which are then pooled heuristically to assign an action label to the full video segment. As not all frames may characterize the underlying action---many are common across multiple actions---pooling schemes that impose equal importance on all frames might be unfavorable. In an attempt to tackle this problem, we propose discriminative pooling, based on the notion that among the deep features generated on all short clips, there is at least one that characterizes the action. To identify these useful features, we resort to a negative bag consisting of features that are known to be irrelevant, for example, they are sampled either from datasets that are unrelated to our actions of interest or are CNN features produced via random noise as input. With the features from the video as a positive bag and the irrelevant features as the negative bag, we cast an objective to learn a (nonlinear) hyperplane that separates the unknown useful features from the rest in a multiple instance learning formulation within a support vector machine setup. We use the parameters of this separating hyperplane as a descriptor for the full video segment. Since these parameters are directly related to the support vectors in a max-margin framework, they can be treated as a weighted average pooling of the features from the bags, with zero weights given to non-support vectors. Our pooling scheme is end-to-end trainable within a deep learning framework. We report results from experiments on eight computer vision benchmark datasets spanning a variety of video-related tasks and demonstrate state-of-the-art performance across these tasks., Comment: arXiv admin note: substantial text overlap with arXiv:1803.10628
Published: 2019

35. GODS: Generalized One-class Discriminative Subspaces for Anomaly Detection

Author: Wang, Jue and Cherian, Anoop
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: One-class learning is the classic problem of fitting a model to data for which annotations are available only for a single class. In this paper, we propose a novel objective for one-class learning. Our key idea is to use a pair of orthonormal frames -- as subspaces -- to "sandwich" the labeled data via optimizing for two objectives jointly: i) to minimize the distance between the origins of the two subspaces, and ii) to maximize the margin between the hyperplanes and the data, with each subspace requiring the data to lie in its positive or negative orthant, respectively. Our proposed objective, however, leads to a non-convex optimization problem, for which we resort to Riemannian optimization schemes and derive an efficient conjugate gradient scheme on the Stiefel manifold. To study the effectiveness of our scheme, we propose a new dataset, Dash-Cam-Pose, consisting of clips with skeleton poses of humans seated in a car, the task being to classify the clips as normal or abnormal; the latter is when any human pose is out-of-position with regard to, say, an airbag deployment. Our experiments on the proposed Dash-Cam-Pose dataset, as well as several other standard anomaly/novelty detection benchmarks, demonstrate the benefits of our scheme, achieving state-of-the-art one-class accuracy., Comment: Accepted by ICCV 2019, 8 pages
Published: 2019

36. Game Theoretic Optimization via Gradient-based Nikaido-Isoda Function

Author: Raghunathan, Arvind U., Cherian, Anoop, and Jha, Devesh K.
Subjects: Computer Science - Machine Learning, Computer Science - Computer Vision and Pattern Recognition, Mathematics - Optimization and Control, Statistics - Machine Learning
Abstract: Computing the Nash equilibrium (NE) of multi-player games has witnessed renewed interest due to recent advances in generative adversarial networks. However, computing an equilibrium efficiently is challenging. To this end, we introduce the Gradient-based Nikaido-Isoda (GNI) function, which (i) serves as a merit function, vanishing only at the first-order stationary points of each player's optimization problem, and (ii) provides error bounds to a stationary Nash point. Gradient descent is shown to converge sublinearly to a first-order stationary point of the GNI function. For the particular case of bilinear min-max games and multi-player quadratic games, the GNI function is convex. Hence, the application of gradient descent in this case yields linear convergence to an NE (when one exists). In our numerical experiments, we observe that the GNI formulation always converges to the first-order stationary point of each player's optimization problem., Comment: Accepted at International Conference on Machine Learning (ICML), 2019
Published: 2019

37. Audio-Visual Scene-Aware Dialog

Author: Alamri, Huda, Cartillier, Vincent, Das, Abhishek, Wang, Jue, Cherian, Anoop, Essa, Irfan, Batra, Dhruv, Marks, Tim K., Hori, Chiori, Anderson, Peter, Lee, Stefan, and Parikh, Devi
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: We introduce the task of scene-aware dialog. Our goal is to generate a complete and natural response to a question about a scene, given video and audio of the scene and the history of previous turns in the dialog. To answer successfully, agents must ground concepts from the question in the video while leveraging contextual cues from the dialog history. To benchmark this task, we introduce the Audio Visual Scene-Aware Dialog (AVSD) Dataset. For each of more than 11,000 videos of human actions from the Charades dataset, our dataset contains a dialog about the video, plus a final summary of the video by one of the dialog participants. We train several baseline systems for this task and evaluate the performance of the trained models using both qualitative and quantitative metrics. Our results indicate that models must utilize all the available inputs (video, audio, question, and dialog history) to perform best on this dataset.
Published: 2019

38. Searching by parts: Towards fine-grained image retrieval respecting species correlation

Author: Pang, Cheng, Cherian, Anoop, Lan, Rushi, Luo, Xiaonan, and Yao, Hongxun
Published: 2023
Full Text: View/download PDF

39. WI-FI based Indoor Monitoring Enhanced by Multimodal Fusion

Author: Hori, Chiori, Wang, Pu, Rahman, Mahbub, Vaca-Rubio, Cristian, Khurana, Sameer, Cherian, Anoop, and Le Roux, Jonathan
Published: 2024
Full Text: View/download PDF

40. Contrastive Video Representation Learning via Adversarial Perturbations

Author: Wang, Jue and Cherian, Anoop
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Adversarial perturbations are noise-like patterns that can subtly change the data, while failing an otherwise accurate classifier. In this paper, we propose to use such perturbations within a novel contrastive learning setup to build negative samples, which are then used to produce improved video representations. To this end, given a well-trained deep model for per-frame video recognition, we first generate adversarial noise adapted to this model. Positive and negative bags are produced using the original data features from the full video sequence and their perturbed counterparts, respectively. Unlike the classic contrastive learning methods, we develop a binary classification problem that learns a set of discriminative hyperplanes -- as a subspace -- that will separate the two bags from each other. This subspace is then used as a descriptor for the video, dubbed discriminative subspace pooling. As the perturbed features belong to data classes that are likely to be confused with the original features, the discriminative subspace will characterize parts of the feature space that are more representative of the original data, and thus may provide robust video representations. To learn such descriptors, we formulate a subspace learning objective on the Stiefel manifold and resort to Riemannian optimization methods for solving it efficiently. We provide experiments on several video datasets and demonstrate state-of-the-art results.
Comment: Revised version of ECCV 2018 Paper: Learning Discriminative Video Representations Using Adversarial Perturbations
Published: 2018

41. Sem-GAN: Semantically-Consistent Image-to-Image Translation

Author: Cherian, Anoop and Sullivan, Alan
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Unpaired image-to-image translation is the problem of mapping an image in the source domain to one in the target domain, without requiring corresponding image pairs. To ensure the translated images are realistically plausible, recent works, such as Cycle-GAN, demand this mapping to be invertible. While this requirement demonstrates promising results when the domains are unimodal, its performance is unpredictable in a multi-modal scenario such as in an image segmentation task. This is because invertibility does not necessarily enforce semantic correctness. To this end, we present a semantically-consistent GAN framework, dubbed Sem-GAN, in which the semantics are defined by the class identities of image segments in the source domain as produced by a semantic segmentation algorithm. Our proposed framework includes consistency constraints on the translation task that, together with the GAN loss and the cycle-constraints, enforce that the images when translated will inherit the appearances of the target domain, while (approximately) maintaining their identities from the source domain. We present experiments on several image-to-image translation tasks and demonstrate that Sem-GAN improves the quality of the translated images significantly, sometimes by more than 20% on the FCN score. Further, we show that semantic segmentation models, trained with synthetic images translated via Sem-GAN, lead to significantly better segmentation results than other variants.
Published: 2018

42. End-to-End Audio Visual Scene-Aware Dialog using Multimodal Attention-Based Video Features

Author: Hori, Chiori, Alamri, Huda, Wang, Jue, Wichern, Gordon, Hori, Takaaki, Cherian, Anoop, Marks, Tim K., Cartillier, Vincent, Lopes, Raphael Gontijo, Das, Abhishek, Essa, Irfan, Batra, Dhruv, and Parikh, Devi
Subjects: Computer Science - Computation and Language, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Dialog systems need to understand dynamic visual scenes in order to have conversations with users about the objects and events around them. Scene-aware dialog systems for real-world applications could be developed by integrating state-of-the-art technologies from multiple research areas, including: end-to-end dialog technologies, which generate system responses using models trained from dialog data; visual question answering (VQA) technologies, which answer questions about images using learned image features; and video description technologies, in which descriptions/captions are generated from videos using multimodal information. We introduce a new dataset of dialogs about videos of human behaviors. Each dialog is a typed conversation that consists of a sequence of 10 question-and-answer (QA) pairs between two Amazon Mechanical Turk (AMT) workers. In total, we collected dialogs on roughly 9,000 videos. Using this new dataset for Audio Visual Scene-aware dialog (AVSD), we trained an end-to-end conversation model that generates responses in a dialog about a video. Our experiments demonstrate that using multimodal features that were developed for multimodal attention-based video description enhances the quality of generated dialog about dynamic scenes (videos). Our dataset, model code and pretrained models will be publicly available for a new Video Scene-Aware Dialog challenge.
Comment: A prototype system for the Audio Visual Scene-aware Dialog (AVSD) at DSTC7
Published: 2018

43. Audio Visual Scene-Aware Dialog (AVSD) Challenge at DSTC7

Author: Alamri, Huda, Cartillier, Vincent, Lopes, Raphael Gontijo, Das, Abhishek, Wang, Jue, Essa, Irfan, Batra, Dhruv, Parikh, Devi, Cherian, Anoop, Marks, Tim K., and Hori, Chiori
Subjects: Computer Science - Computation and Language, Computer Science - Computer Vision and Pattern Recognition
Abstract: Scene-aware dialog systems will be able to have conversations with users about the objects and events around them. Progress on such systems can be made by integrating state-of-the-art technologies from multiple research areas including end-to-end dialog systems, visual dialog, and video description. We introduce the Audio Visual Scene Aware Dialog (AVSD) challenge and dataset. In this challenge, which is one track of the 7th Dialog System Technology Challenges (DSTC7) workshop, the task is to build a system that generates responses in a dialog about an input video.
Published: 2018

44. Non-Linear Temporal Subspace Representations for Activity Recognition

Author: Cherian, Anoop, Sra, Suvrit, Gould, Stephen, and Hartley, Richard
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Representations that can compactly and effectively capture the temporal evolution of semantic content are important to computer vision and machine learning algorithms that operate on multi-variate time-series data. We investigate such representations motivated by the task of human action recognition. Here each data instance is encoded by a multivariate feature (such as via a deep CNN) where action dynamics are characterized by their variations in time. As these features are often non-linear, we propose a novel pooling method, kernelized rank pooling, that represents a given sequence compactly as the pre-image of the parameters of a hyperplane in a reproducing kernel Hilbert space, projections of data onto which capture their temporal order. We develop this idea further and show that such a pooling scheme can be cast as an order-constrained kernelized PCA objective. We then propose to use the parameters of a kernelized low-rank feature subspace as the representation of the sequences. We cast our formulation as an optimization problem on generalized Grassmann manifolds and then solve it efficiently using Riemannian optimization techniques. We present experiments on several action recognition datasets using diverse feature modalities and demonstrate state-of-the-art results.
Comment: Accepted at the IEEE International Conference on Computer Vision and Pattern Recognition, CVPR, 2018. arXiv admin note: substantial text overlap with arXiv:1705.08583
Published: 2018

45. Video Representation Learning Using Discriminative Pooling

Author: Wang, Jue, Cherian, Anoop, Porikli, Fatih, and Gould, Stephen
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Popular deep models for action recognition in videos generate independent predictions for short clips, which are then pooled heuristically to assign an action label to the full video segment. As not all frames may characterize the underlying action (indeed, many are common across multiple actions), pooling schemes that impose equal importance on all frames might be unfavorable. In an attempt to tackle this problem, we propose discriminative pooling, based on the notion that among the deep features generated on all short clips, there is at least one that characterizes the action. To this end, we learn a (nonlinear) hyperplane that separates this unknown, yet discriminative, feature from the rest. Applying multiple instance learning in a large-margin setup, we use the parameters of this separating hyperplane as a descriptor for the full video segment. Since these parameters are directly related to the support vectors in a max-margin framework, they serve as robust representations for pooling of the features. We formulate a joint objective and an efficient solver that learns these hyperplanes per video and the corresponding action classifiers over the hyperplanes. Our pooling scheme is end-to-end trainable within a deep framework. We report results from experiments on three benchmark datasets spanning a variety of challenges and demonstrate state-of-the-art performance across these tasks.
Comment: 8 pages, 7 figures, Accepted in CVPR2018. arXiv admin note: substantial text overlap with arXiv:1704.01716
Published: 2018

46. Scalable Dense Non-rigid Structure-from-Motion: A Grassmannian Perspective

Author: Kumar, Suryansh, Cherian, Anoop, Dai, Yuchao, and Li, Hongdong
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: This paper addresses the task of dense non-rigid structure-from-motion (NRSfM) using multiple images. State-of-the-art methods for this problem are often hindered by scalability, expensive computations, and noisy measurements. Further, recent methods for NRSfM usually either assume a small number of sparse feature points or ignore local non-linearities of shape deformations, and thus cannot reliably model complex non-rigid deformations. To address these issues, in this paper, we propose a new approach for dense NRSfM by modeling the problem on a Grassmann manifold. Specifically, we assume the complex non-rigid deformations lie on a union of local linear subspaces both spatially and temporally. This naturally allows for a compact representation of the complex non-rigid deformation over frames. We provide experimental results on several synthetic and real benchmark datasets. The procured results clearly demonstrate that our method, apart from being scalable and more accurate than state-of-the-art methods, is also more robust to noise and generalizes to highly non-linear deformations.
Comment: 10 pages, 7 figures, 4 tables. Accepted for publication in Conference on Computer Vision and Pattern Recognition (CVPR), 2018, typos fixed and acknowledgement added
Published: 2018

47. Neural Algebra of Classifiers

Author: Cruz, Rodrigo Santa, Fernando, Basura, Cherian, Anoop, and Gould, Stephen
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Learning
Abstract: The world is fundamentally compositional, so it is natural to think of visual recognition as the recognition of basic visual primitives that are composed according to well-defined rules. This strategy allows us to recognize unseen complex concepts from simple visual primitives. However, the current trend in visual recognition follows a data-greedy approach where huge amounts of data are required to learn models for any desired visual concept. In this paper, we build on the compositionality principle and develop an "algebra" to compose classifiers for complex visual concepts. To this end, we learn neural network modules to perform boolean algebra operations on simple visual classifiers. Since these modules form a complete functional set, a classifier for any complex visual concept defined as a boolean expression of primitives can be obtained by recursively applying the learned modules, even if we do not have a single training sample. As our experiments show, using such a framework, we can compose classifiers for complex visual concepts outperforming standard baselines on two well-known visual recognition benchmarks. Finally, we present a qualitative analysis of our method and its properties.
Comment: 2018 IEEE Winter Conference on Applications of Computer Vision (WACV)
Published: 2018

48. Human Action Forecasting by Learning Task Grammars

Author: Han, Tengda, Wang, Jue, Cherian, Anoop, and Gould, Stephen
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: For effective human-robot interaction, it is important that a robotic assistant can forecast the next action a human will consider in a given task. Unfortunately, real-world tasks are often very long, complex, and repetitive; as a result forecasting is not trivial. In this paper, we propose a novel deep recurrent architecture that takes as input features from a two-stream Residual action recognition framework, and learns to estimate the progress of human activities from video sequences -- this surrogate progress estimation task implicitly learns a temporal task grammar with respect to which activities can be localized and forecasted. To learn the task grammar, we propose a stacked LSTM based multi-granularity progress estimation framework that uses a novel cumulative Euclidean loss as objective. To demonstrate the effectiveness of our proposed architecture, we showcase experiments on two challenging robotic assistive tasks, namely (i) assembling an Ikea table from its constituents, and (ii) changing the tires of a car. Our results demonstrate that learning task grammars offers highly discriminative cues improving the forecasting accuracy by more than 9% over the baseline two-stream forecasting model, while also outperforming other competitive schemes.
Published: 2017

49. Learning Discriminative Alpha-Beta-divergence for Positive Definite Matrices (Extended Version)

Author: Cherian, Anoop, Stanitsas, Panagiotis, Harandi, Mehrtash, Morellas, Vassilios, and Papanikolopoulos, Nikolaos
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Symmetric positive definite (SPD) matrices are useful for capturing second-order statistics of visual data. To compare two SPD matrices, several measures are available, such as the affine-invariant Riemannian metric, Jeffreys divergence, Jensen-Bregman logdet divergence, etc.; however, their behaviors may be application dependent, raising the need for manual selection to achieve the best possible performance. Further, and as a result of their overwhelming complexity for large-scale problems, computing pairwise similarities by clever embedding of SPD matrices is often preferred to direct use of the aforementioned measures. In this paper, we propose a discriminative metric learning framework, Information Divergence and Dictionary Learning (IDDL), that not only learns application-specific measures on SPD matrices automatically, but also embeds them as vectors using a learned dictionary. To learn the similarity measures (which could potentially be distinct for every dictionary atom), we use the recently introduced alpha-beta-logdet divergence, which is known to unify the measures listed above. We propose a novel IDDL objective that learns the parameters of the divergence and the dictionary atoms jointly in a discriminative setup and is solved efficiently using Riemannian optimization. We showcase extensive experiments on eight computer vision datasets, demonstrating state-of-the-art performances.
Comment: Accepted at the International Conference on Computer Vision (ICCV)
Published: 2017

50. Human Pose Forecasting via Deep Markov Models

Author: Toyer, Sam, Cherian, Anoop, Han, Tengda, and Gould, Stephen
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Learning
Abstract: Human pose forecasting is an important problem in computer vision with applications to human-robot interaction, visual surveillance, and autonomous driving. Usually, forecasting algorithms use 3D skeleton sequences and are trained to forecast for a few milliseconds into the future. Long-range forecasting is challenging due to the difficulty of estimating how long a person continues an activity. To this end, our contributions are threefold: (i) we propose a generative framework for poses using variational autoencoders based on Deep Markov Models (DMMs); (ii) we evaluate our pose forecasts using a pose-based action classifier, which we argue better reflects the subjective quality of pose forecasts than distance in coordinate space; (iii) last, for evaluation of the new model, we introduce a 480,000-frame video dataset called Ikea Furniture Assembly (Ikea FA), which depicts humans repeatedly assembling and disassembling furniture. We demonstrate promising results for our approach on both Ikea FA and the existing NTU RGB+D dataset.
Comment: Accepted to DICTA'17
Published: 2017
