Author: "Sikka, Karan" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Sikka, Karan"' showing total 144 results

1. Pelican: Correcting Hallucination in Vision-LLMs via Claim Decomposition and Program of Thought Verification

Author: Sahu, Pritish, Sikka, Karan, and Divakaran, Ajay
Subjects: Computer Science - Computation and Language
Abstract: Large Visual Language Models (LVLMs) struggle with hallucinations in visual instruction following task(s), limiting their trustworthiness and real-world applicability. We propose Pelican -- a novel framework designed to detect and mitigate hallucinations through claim verification. Pelican first decomposes the visual claim into a chain of sub-claims based on first-order predicates. These sub-claims consist of (predicate, question) pairs and can be conceptualized as nodes of a computational graph. We then use Program-of-Thought prompting to generate Python code for answering these questions through flexible composition of external tools. Pelican improves over prior work by introducing (1) intermediate variables for precise grounding of object instances, and (2) shared computation for answering the sub-question to enable adaptive corrections and inconsistency identification. We finally use the reasoning abilities of an LLM to verify the correctness of the claim by considering the consistency and confidence of the (question, answer) pairs from each sub-claim. Our experiments reveal a drop in hallucination rate by $\sim$8%-32% across various baseline LVLMs and a 27% drop compared to approaches proposed for hallucination mitigation on MMHal-Bench. Results on two other benchmarks further corroborate our results.
Published: 2024

2. A Video is Worth 10,000 Words: Training and Benchmarking with Diverse Captions for Better Long Video Retrieval

Author: Gwilliam, Matthew, Cogswell, Michael, Ye, Meng, Sikka, Karan, Shrivastava, Abhinav, and Divakaran, Ajay
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computation and Language
Abstract: Existing long video retrieval systems are trained and tested in the paragraph-to-video retrieval regime, where every long video is described by a single long paragraph. This neglects the richness and variety of possible valid descriptions of a video, which could be described in moment-by-moment detail, or in a single phrase summary, or anything in between. To provide a more thorough evaluation of the capabilities of long video retrieval systems, we propose a pipeline that leverages state-of-the-art large language models to carefully generate a diverse set of synthetic captions for long videos. We validate this pipeline's fidelity via rigorous human inspection. We then benchmark a representative set of video language models on these synthetic captions using a few long video datasets, showing that they struggle with the transformed data, especially the shortest captions. We also propose a lightweight fine-tuning method, where we use a contrastive loss to learn a hierarchical embedding loss based on the differing levels of information among the various captions. Our method improves performance both on the downstream paragraph-to-video retrieval task (+1.1% R@1 on ActivityNet), as well as for the various long video retrieval metrics we compute using our synthetic data (+3.6% R@1 for short descriptions on ActivityNet). For data access and other details, please refer to our project website at https://mgwillia.github.io/10k-words., Comment: 13 pages, 15 tables, 5 figures
Published: 2023

3. DRESS: Instructing Large Vision-Language Models to Align and Interact with Humans via Natural Language Feedback

Author: Chen, Yangyi, Sikka, Karan, Cogswell, Michael, Ji, Heng, and Divakaran, Ajay
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: We present DRESS, a large vision language model (LVLM) that innovatively exploits Natural Language feedback (NLF) from Large Language Models to enhance its alignment and interactions by addressing two key limitations in the state-of-the-art LVLMs. First, prior LVLMs generally rely only on the instruction finetuning stage to enhance alignment with human preferences. Without incorporating extra feedback, they are still prone to generate unhelpful, hallucinated, or harmful responses. Second, while the visual instruction tuning data is generally structured in a multi-turn dialogue format, the connections and dependencies among consecutive conversational turns are weak. This reduces the capacity for effective multi-turn interactions. To tackle these, we propose a novel categorization of the NLF into two key types: critique and refinement. The critique NLF identifies the strengths and weaknesses of the responses and is used to align the LVLMs with human preferences. The refinement NLF offers concrete suggestions for improvement and is adopted to improve the interaction ability of the LVLMs, which focuses on LVLMs' ability to refine responses by incorporating feedback in multi-turn interactions. To address the non-differentiable nature of NLF, we generalize conditional reinforcement learning for training. Our experimental results demonstrate that DRESS can generate more helpful (9.76%), honest (11.52%), and harmless (21.03%) responses, and more effectively learn from feedback during multi-turn interactions compared to SOTA LVLMs., Comment: CVPR 2024. The feedback datasets are released at: https://huggingface.co/datasets/YangyiYY/LVLM_NLF
Published: 2023

4. Demonstrations Are All You Need: Advancing Offensive Content Paraphrasing using In-Context Learning

Author: Som, Anirudh, Sikka, Karan, Gent, Helen, Divakaran, Ajay, Kathol, Andreas, and Vergyri, Dimitra
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Paraphrasing of offensive content is a better alternative to content removal and helps improve civility in a communication environment. Supervised paraphrasers, however, rely heavily on large quantities of labelled data to help preserve meaning and intent. They also often retain a large portion of the offensiveness of the original content, which raises questions on their overall usability. In this paper we aim to assist practitioners in developing usable paraphrasers by exploring In-Context Learning (ICL) with large language models (LLMs), i.e., using a limited number of input-label demonstration pairs to guide the model in generating desired outputs for specific queries. Our study focuses on key factors such as the number and order of demonstrations, exclusion of prompt instruction, and reduction in measured toxicity. We perform principled evaluation on three datasets, including our proposed Context-Aware Polite Paraphrase (CAPP) dataset, comprising dialogue-style rude utterances, polite paraphrases, and additional dialogue context. We evaluate our approach using four closed source and one open source LLM. Our results reveal that ICL is comparable to supervised methods in generation quality, while being qualitatively better by 25% on human evaluation and attaining lower toxicity by 76%. Also, ICL-based paraphrasers only show a slight reduction in performance even with just 10% training data., Comment: Accepted in Association for Computational Linguistics (ACL) 2024 Findings
Published: 2023

5. Measuring and Improving Chain-of-Thought Reasoning in Vision-Language Models

Author: Chen, Yangyi, Sikka, Karan, Cogswell, Michael, Ji, Heng, and Divakaran, Ajay
Subjects: Computer Science - Computation and Language, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: Vision-language models (VLMs) have recently demonstrated strong efficacy as visual assistants that can parse natural queries about the visual content and generate human-like outputs. In this work, we explore the ability of these models to demonstrate human-like reasoning based on the perceived information. To address a crucial concern regarding the extent to which their reasoning capabilities are fully consistent and grounded, we also measure the reasoning consistency of these models. We achieve this by proposing a chain-of-thought (CoT) based consistency measure. However, such an evaluation requires a benchmark that encompasses both high-level inference and detailed reasoning chains, which is costly. We tackle this challenge by proposing an LLM-Human-in-the-Loop pipeline, which notably reduces cost while simultaneously ensuring the generation of a high-quality dataset. Based on this pipeline and the existing coarse-grained annotated dataset, we build the CURE benchmark to measure both the zero-shot reasoning performance and consistency of VLMs. We evaluate existing state-of-the-art VLMs, and find that even the best-performing model is unable to demonstrate strong visual reasoning capabilities and consistency, indicating that substantial efforts are required to enable VLMs to perform visual reasoning as systematically and consistently as humans. As an early step, we propose a two-stage training framework aimed at improving both the reasoning performance and consistency of VLMs. The first stage involves employing supervised fine-tuning of VLMs using step-by-step reasoning samples automatically generated by LLMs. In the second stage, we further augment the training process by incorporating feedback provided by LLMs to produce reasoning chains that are highly consistent and grounded. We empirically highlight the effectiveness of our framework in both reasoning performance and consistency., Comment: NAACL 2024 Main Conference. The data is released at https://github.com/Yangyi-Chen/CoTConsistency
Published: 2023

6. SayNav: Grounding Large Language Models for Dynamic Planning to Navigation in New Environments

Author: Rajvanshi, Abhinav, Sikka, Karan, Lin, Xiao, Lee, Bhoram, Chiu, Han-Pang, and Velasquez, Alvaro
Subjects: Computer Science - Robotics, Computer Science - Artificial Intelligence
Abstract: Semantic reasoning and dynamic planning capabilities are crucial for an autonomous agent to perform complex navigation tasks in unknown environments. Succeeding in these tasks requires a large amount of the common-sense knowledge that humans possess. We present SayNav, a new approach that leverages human knowledge from Large Language Models (LLMs) for efficient generalization to complex navigation tasks in unknown large-scale environments. SayNav uses a novel grounding mechanism that incrementally builds a 3D scene graph of the explored environment as input to LLMs, for generating feasible and contextually appropriate high-level plans for navigation. The LLM-generated plan is then executed by a pre-trained low-level planner that treats each planned step as a short-distance point-goal navigation sub-task. SayNav dynamically generates step-by-step instructions during navigation and continuously refines future steps based on newly perceived information. We evaluate SayNav on the multi-object navigation (MultiON) task, which requires the agent to utilize a massive amount of human knowledge to efficiently search for multiple different objects in an unknown environment. We also introduce a benchmark dataset for the MultiON task employing the ProcTHOR framework, which provides large photo-realistic indoor environments with a variety of objects. SayNav achieves state-of-the-art results and even outperforms an oracle based baseline with strong ground-truth assumptions by more than 8% in terms of success rate, highlighting its ability to generate dynamic plans for successfully locating objects in large-scale new environments. The code, benchmark dataset and demonstration videos are accessible at https://www.sri.com/ics/computer-vision/saynav.
Published: 2023

7. TIJO: Trigger Inversion with Joint Optimization for Defending Multimodal Backdoored Models

Author: Sur, Indranil, Sikka, Karan, Walmer, Matthew, Koneripalli, Kaushik, Roy, Anirban, Lin, Xiao, Divakaran, Ajay, and Jha, Susmit
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: We present a Multimodal Backdoor Defense technique TIJO (Trigger Inversion using Joint Optimization). Recent work arXiv:2112.07668 has demonstrated successful backdoor attacks on multimodal models for the Visual Question Answering task. Their dual-key backdoor trigger is split across two modalities (image and text), such that the backdoor is activated if and only if the trigger is present in both modalities. We propose TIJO that defends against dual-key attacks through a joint optimization that reverse-engineers the trigger in both the image and text modalities. This joint optimization is challenging in multimodal models due to the disconnected nature of the visual pipeline which consists of an offline feature extractor, whose output is then fused with the text using a fusion module. The key insight enabling the joint optimization in TIJO is that the trigger inversion needs to be carried out in the object detection box feature space as opposed to the pixel space. We demonstrate the effectiveness of our method on the TrojVQA benchmark, where TIJO improves upon the state-of-the-art unimodal methods from an AUC of 0.6 to 0.92 on multimodal dual-key backdoors. Furthermore, our method also improves upon the unimodal baselines on unimodal backdoors. We present ablation studies and qualitative results to provide insights into our algorithm such as the critical importance of overlaying the inverted feature triggers on all visual features during trigger inversion. The prototype implementation of TIJO is available at https://github.com/SRI-CSL/TIJO., Comment: Published as conference paper at ICCV 2023. 13 pages, 6 figures, 7 tables
Published: 2023

8. Predicting Information Pathways Across Online Communities

Author: Jin, Yiqiao, Lee, Yeon-Chang, Sharma, Kartik, Ye, Meng, Sikka, Karan, Divakaran, Ajay, and Kumar, Srijan
Subjects: Computer Science - Social and Information Networks, Computer Science - Computers and Society, J.4
Abstract: The problem of community-level information pathway prediction (CLIPP) aims at predicting the transmission trajectory of content across online communities. A successful solution to CLIPP holds significance as it facilitates the distribution of valuable information to a larger audience and prevents the proliferation of misinformation. Notably, solving CLIPP is non-trivial as inter-community relationships and influence are unknown, information spread is multi-modal, and new content and new communities appear over time. In this work, we address CLIPP by collecting large-scale, multi-modal datasets to examine the diffusion of online YouTube videos on Reddit. We analyze these datasets to construct community influence graphs (CIGs) and develop a novel dynamic graph framework, INPAC (Information Pathway Across Online Communities), which incorporates CIGs to capture the temporal variability and multi-modal nature of video propagation across communities. Experimental results in both warm-start and cold-start scenarios show that INPAC outperforms seven baselines in CLIPP., Comment: In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD'23)
Published: 2023

9. Multilingual Content Moderation: A Case Study on Reddit

Author: Ye, Meng, Sikka, Karan, Atwell, Katherine, Hassan, Sabit, Divakaran, Ajay, and Alikhani, Malihe
Subjects: Computer Science - Computation and Language
Abstract: Content moderation is the process of flagging content based on pre-defined platform rules. There has been a growing need for AI moderators to safeguard users as well as protect the mental health of human moderators from traumatic content. While prior works have focused on identifying hateful/offensive language, they are not adequate for meeting the challenges of content moderation since 1) moderation decisions are based on violation of rules, which subsumes detection of offensive speech, and 2) such rules often differ across communities which entails an adaptive solution. We propose to study the challenges of content moderation by introducing a multilingual dataset of 1.8 Million Reddit comments spanning 56 subreddits in English, German, Spanish and French. We perform extensive experimental analysis to highlight the underlying challenges and suggest related research problems such as cross-lingual transfer, learning under label noise (human biases), transfer of moderation models, and predicting the violated rule. Our dataset and analysis can help better prepare for the challenges and opportunities of auto moderation.
Published: 2023

10. Dual-Key Multimodal Backdoors for Visual Question Answering

Author: Walmer, Matthew, Sikka, Karan, Sur, Indranil, Shrivastava, Abhinav, and Jha, Susmit
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computation and Language
Abstract: The success of deep learning has enabled advances in multimodal tasks that require non-trivial fusion of multiple input domains. Although multimodal models have shown potential in many problems, their increased complexity makes them more vulnerable to attacks. A Backdoor (or Trojan) attack is a class of security vulnerability wherein an attacker embeds a malicious secret behavior into a network (e.g. targeted misclassification) that is activated when an attacker-specified trigger is added to an input. In this work, we show that multimodal networks are vulnerable to a novel type of attack that we refer to as Dual-Key Multimodal Backdoors. This attack exploits the complex fusion mechanisms used by state-of-the-art networks to embed backdoors that are both effective and stealthy. Instead of using a single trigger, the proposed attack embeds a trigger in each of the input modalities and activates the malicious behavior only when both the triggers are present. We present an extensive study of multimodal backdoors on the Visual Question Answering (VQA) task with multiple architectures and visual feature backbones. A major challenge in embedding backdoors in VQA models is that most models use visual features extracted from a fixed pretrained object detector. This is challenging for the attacker as the detector can distort or ignore the visual trigger entirely, which leads to models where backdoors are over-reliant on the language trigger. We tackle this problem by proposing a visual trigger optimization strategy designed for pretrained object detectors. Through this method, we create Dual-Key Backdoors with over a 98% attack success rate while only poisoning 1% of the training data. Finally, we release TrojVQA, a large collection of clean and trojan VQA models to enable research in defending against multimodal backdoors., Comment: Published as conference paper at CVPR 2022. 22 pages, 11 figures, 12 tables
Published: 2021

11. Challenges in Procedural Multimodal Machine Comprehension: A Novel Way To Benchmark

Author: Sahu, Pritish, Sikka, Karan, and Divakaran, Ajay
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computation and Language
Abstract: We focus on Multimodal Machine Reading Comprehension (M3C) where a model is expected to answer questions based on a given passage (or context), and the context and the questions can be in different modalities. Previous works such as RecipeQA have proposed datasets and cloze-style tasks for evaluation. However, we identify three critical biases stemming from the question-answer generation process and memorization capabilities of large deep models. These biases make it easier for a model to overfit by relying on spurious correlations or naive data patterns. We propose a systematic framework to address these biases through three Control-Knobs that enable us to generate a test bed of datasets of progressive difficulty levels. We believe that our benchmark (referred to as Meta-RecipeQA) will provide, for the first time, a fine grained estimate of a model's generalization capabilities. We also propose a general M3C model that is used to realize several prior SOTA models and motivate a novel hierarchical transformer based reasoning network (HTRN). We perform a detailed evaluation of these models with different language and visual features on our benchmark. We observe a consistent improvement with HTRN over SOTA (~18% in Visual Cloze task and ~13% in average over all the tasks). We also observe a drop in performance across all the models when testing on RecipeQA and proposed Meta-RecipeQA (e.g. 83.6% versus 67.1% for HTRN), which shows that the proposed dataset is relatively less biased. We conclude by highlighting the impact of the control knobs with some quantitative results.
Published: 2021

12. Towards Solving Multimodal Comprehension

Author: Sahu, Pritish, Sikka, Karan, and Divakaran, Ajay
Subjects: Computer Science - Computation and Language
Abstract: This paper targets the problem of procedural multimodal machine comprehension (M3C). This task requires an AI to comprehend given steps of multimodal instructions and then answer questions. Compared to vanilla machine comprehension tasks where an AI is required only to understand a textual input, procedural M3C is more challenging as the AI needs to comprehend both the temporal and causal factors along with multimodal inputs. Recently Yagcioglu et al. [35] introduced the RecipeQA dataset to evaluate M3C. Our first contribution is the introduction of two new M3C datasets, WoodworkQA and DecorationQA, with 16K and 10K instructional procedures, respectively. We then evaluate M3C using a textual cloze style question-answering task and highlight an inherent bias in the question answer generation method from [35] that enables a naive baseline to cheat by learning from only answer choices. This naive baseline performs similarly to a popular method used in question answering, Impatient Reader [6], which uses attention over both the context and the query. We hypothesized that this naturally occurring bias present in the dataset affects even the best performing model. We verify our proposed hypothesis and propose an algorithm capable of modifying the given dataset to remove the bias elements. Finally, we report our performance on the debiased dataset with several strong baselines. We observe that the performance of all methods falls by a margin of 8% - 16% after correcting for the bias. We hope these datasets and the analysis will provide valuable benchmarks and encourage further research in this area.
Published: 2021

13. MISA: Online Defense of Trojaned Models using Misattributions

Author: Kiourti, Panagiota, Li, Wenchao, Roy, Anirban, Sikka, Karan, and Jha, Susmit
Subjects: Computer Science - Cryptography and Security, Computer Science - Computer Vision and Pattern Recognition, Statistics - Machine Learning
Abstract: Recent studies have shown that neural networks are vulnerable to Trojan attacks, where a network is trained to respond to specially crafted trigger patterns in the inputs in specific and potentially malicious ways. This paper proposes MISA, a new online approach to detect Trojan triggers for neural networks at inference time. Our approach is based on a novel notion called misattributions, which captures the anomalous manifestation of a Trojan activation in the feature space. Given an input image and the corresponding output prediction, our algorithm first computes the model's attribution on different features. It then statistically analyzes these attributions to ascertain the presence of a Trojan trigger. Across a set of benchmarks, we show that our method can effectively detect Trojan triggers for a wide variety of trigger patterns, including several recent ones for which there are no known defenses. Our method achieves 96% AUC for detecting images that include a Trojan trigger without any assumptions on the trigger pattern.
Published: 2021

14. Detecting Trojaned DNNs Using Counterfactual Attributions

Author: Sikka, Karan, Sur, Indranil, Jha, Susmit, Roy, Anirban, and Divakaran, Ajay
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Information Theory
Abstract: We target the problem of detecting Trojans or backdoors in DNNs. Such models behave normally with typical inputs but produce specific incorrect predictions for inputs poisoned with a Trojan trigger. Our approach is based on a novel observation that the trigger behavior depends on a few ghost neurons that activate on trigger pattern and exhibit abnormally higher relative attribution for wrong decisions when activated. Further, these trigger neurons are also active on normal inputs of the target class. Thus, we use counterfactual attributions to localize these ghost neurons from clean inputs and then incrementally excite them to observe changes in the model's accuracy. We use this information for Trojan detection by using a deep set encoder that enables invariance to the number of model classes, architecture, etc. Our approach is implemented in the TrinityAI tool that exploits the synergies between trustworthiness, resilience, and interpretability challenges in deep learning. We evaluate our approach on benchmarks with high diversity in model architectures, triggers, etc. We show consistent gains (+10%) over state-of-the-art methods that rely on the susceptibility of the DNN to specific adversarial attacks, which in turn requires strong assumptions on the nature of the Trojan attack.
Published: 2020

15. Zero-Shot Learning with Knowledge Enhanced Visual Semantic Embeddings

Author: Sikka, Karan, Huang, Jihua, Silberfarb, Andrew, Nayak, Prateeth, Rohrer, Luke, Sahu, Pritish, Byrnes, John, Divakaran, Ajay, and Rohwer, Richard
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: We improve zero-shot learning (ZSL) by incorporating common-sense knowledge in DNNs. We propose Common-Sense based Neuro-Symbolic Loss (CSNL) that formulates prior knowledge as novel neuro-symbolic loss functions that regularize visual-semantic embedding. CSNL forces visual features in the VSE to obey common-sense rules relating to hypernyms and attributes. We introduce two key novelties for improved learning: (1) enforcement of rules for a group instead of a single concept to take into account class-wise relationships, and (2) confidence margins inside logical operators that enable implicit curriculum learning and prevent premature overfitting. We evaluate the advantages of incorporating each knowledge source and show consistent gains over prior state-of-art methods in both conventional and generalized ZSL e.g. 11.5%, 5.5%, and 11.6% improvements on AWA2, CUB, and Kinetics respectively.
Published: 2020

16. RGB2LIDAR: Towards Solving Large-Scale Cross-Modal Visual Localization

Author: Mithun, Niluthpol Chowdhury, Sikka, Karan, Chiu, Han-Pang, Samarasekera, Supun, and Kumar, Rakesh
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Multimedia
Abstract: We study an important, yet largely unexplored problem of large-scale cross-modal visual localization by matching ground RGB images to a geo-referenced aerial LIDAR 3D point cloud (rendered as depth images). Prior works were demonstrated on small datasets and did not lend themselves to scaling up for large-scale applications. To enable large-scale evaluation, we introduce a new dataset containing over 550K pairs (covering 143 km^2 area) of RGB and aerial LIDAR depth images. We propose a novel joint embedding based method that effectively combines the appearance and semantic cues from both modalities to handle drastic cross-modal variations. Experiments on the proposed dataset show that our model achieves a strong result of a median rank of 5 in matching across a large test set of 50K location pairs collected from a 14km^2 area. This represents a significant advancement over prior works in performance and scale. We conclude with qualitative results to highlight the challenging nature of this task and the benefits of the proposed model. Our work provides a foundation for further research in cross-modal visual localization., Comment: ACM Multimedia 2020
Published: 2020

17. Deep Adaptive Semantic Logic (DASL): Compiling Declarative Knowledge into Deep Neural Networks

Author: Sikka, Karan, Silberfarb, Andrew, Byrnes, John, Sur, Indranil, Chow, Ed, Divakaran, Ajay, and Rohwer, Richard
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: We introduce Deep Adaptive Semantic Logic (DASL), a novel framework for automating the generation of deep neural networks that incorporates user-provided formal knowledge to improve learning from data. We provide formal semantics that demonstrate that our knowledge representation captures all of first order logic and that finite sampling from infinite domains converges to correct truth values. DASL's representation improves on prior neural-symbolic work by avoiding vanishing gradients, allowing deeper logical structure, and enabling richer interactions between the knowledge and learning components. We illustrate DASL through a toy problem in which we add structure to an image classification problem and demonstrate that knowledge of that structure reduces data requirements by a factor of $1000$. We then evaluate DASL on a visual relationship detection task and demonstrate that the addition of commonsense knowledge improves performance by $10.7\%$ in a data scarce setting.
Published: 2020

18. Sunny and Dark Outside?! Improving Answer Consistency in VQA through Entailed Question Generation

Author: Ray, Arijit, Sikka, Karan, Divakaran, Ajay, Lee, Stefan, and Burachas, Giedrius
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: While models for Visual Question Answering (VQA) have steadily improved over the years, interacting with one quickly reveals that these models lack consistency. For instance, if a model answers "red" to "What color is the balloon?", it might answer "no" if asked, "Is the balloon red?". These responses violate simple notions of entailment and raise questions about how effectively VQA models ground language. In this work, we introduce a dataset, ConVQA, and metrics that enable quantitative evaluation of consistency in VQA. For a given observable fact in an image (e.g. the balloon's color), we generate a set of logically consistent question-answer (QA) pairs (e.g. Is the balloon red?) and also collect a human-annotated set of common-sense based consistent QA pairs (e.g. Is the balloon the same color as tomato sauce?). Further, we propose a consistency-improving data augmentation module, a Consistency Teacher Module (CTM). CTM automatically generates entailed (or similar-intent) questions for a source QA pair and fine-tunes the VQA model if the VQA's answer to the entailed question is consistent with the source QA pair. We demonstrate that our CTM-based training improves the consistency of VQA models on the ConVQA datasets and is a strong baseline for further research., Comment: 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP 2019)
Published: 2019

19. FoodX-251: A Dataset for Fine-grained Food Classification

Author: Kaur, Parneet, Sikka, Karan, Wang, Weijun, Belongie, Serge, and Divakaran, Ajay
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Food classification is a challenging problem due to the large number of categories, high visual similarity between different foods, as well as the lack of datasets for training state-of-the-art deep models. Solving this problem will require advances in both computer vision models as well as datasets for evaluating these models. In this paper we focus on the second aspect and introduce FoodX-251, a dataset of 251 fine-grained food categories with 158k images collected from the web. We use 118k images as a training set and provide human verified labels for 40k images that can be used for validation and testing. In this work, we outline the procedure of creating this dataset and provide relevant baselines with deep learning models. The FoodX-251 dataset has been used for organizing iFood-2019 challenge in the Fine-Grained Visual Categorization workshop (FGVC6 at CVPR 2019) and is available for download., Comment: Published at Fine-Grained Visual Categorization Workshop, CVPR19
Published: 2019

20. Deep Unified Multimodal Embeddings for Understanding both Content and Users in Social Media Networks

Author: Sikka, Karan, Van Bramer, Lucas, and Divakaran, Ajay
Subjects: Computer Science - Information Retrieval, Computer Science - Computation and Language, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Social and Information Networks
Abstract: There has been an explosion of multimodal content generated on social media networks in the last few years, which has necessitated a deeper understanding of social media content and user behavior. We present a novel content-independent content-user-reaction model for social multimedia content analysis. Compared to prior works that generally tackle semantic content understanding and user behavior modeling in isolation, we propose a generalized solution to these problems within a unified framework. We embed users, images and text drawn from open social media in a common multimodal geometric space, using a novel loss function designed to cope with distant and disparate modalities, and thereby enable seamless three-way retrieval. Our model not only outperforms unimodal embedding based methods on cross-modal retrieval tasks but also shows improvements stemming from jointly solving the two tasks on Twitter data. We also show that the user embeddings learned within our joint multimodal embedding model are better at predicting user interests compared to those learned with unimodal content on Instagram data. Our framework thus goes beyond the prior practice of using explicit leader-follower link information to establish affiliations by extracting implicit content-centric affiliations from isolated users. We provide qualitative results to show that the user clusters emerging from learned embeddings have consistent semantics and the ability of our model to discover fine-grained semantics from noisy and unstructured data. Our work reveals that social multimodal content is inherently multimodal and possesses a consistent structure because in social networks meaning is created through interactions between users and content., Comment: Preprint submitted to IJCV
Published: 2019

21. Integrating Text and Image: Determining Multimodal Document Intent in Instagram Posts

Author: Kruk, Julia, Lubin, Jonah, Sikka, Karan, Lin, Xiao, Jurafsky, Dan, and Divakaran, Ajay
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Computing author intent from multimodal data like Instagram posts requires modeling a complex relationship between text and image. For example, a caption might evoke an ironic contrast with the image, so neither caption nor image is a mere transcript of the other. Instead they combine -- via what has been called meaning multiplication -- to create a new meaning that has a more complex relation to the literal meanings of text and image. Here we introduce a multimodal dataset of 1299 Instagram posts labeled for three orthogonal taxonomies: the authorial intent behind the image-caption pair, the contextual relationship between the literal meanings of the image and caption, and the semiotic relationship between the signified meanings of the image and caption. We build a baseline deep multimodal classifier to validate the taxonomy, showing that employing both text and image improves intent detection by 9.6% compared to using only the image modality, demonstrating the commonality of non-intersective meaning multiplication. The gain with multimodality is greatest when the image and caption diverge semiotically. Our dataset offers a new resource for the study of the rich meanings that result from pairing text and image., Comment: Accepted at EMNLP'2019; Added dataset link
Published: 2019

22. Align2Ground: Weakly Supervised Phrase Grounding Guided by Image-Caption Alignment

Author: Datta, Samyak, Sikka, Karan, Roy, Anirban, Ahuja, Karuna, Parikh, Devi, and Divakaran, Ajay
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: We address the problem of grounding free-form textual phrases by using weak supervision from image-caption pairs. We propose a novel end-to-end model that uses caption-to-image retrieval as a 'downstream' task to guide the process of phrase localization. Our method, as a first step, infers the latent correspondences between regions-of-interest (RoIs) and phrases in the caption and creates a discriminative image representation using these matched RoIs. In a subsequent step, this (learned) representation is aligned with the caption. Our key contribution lies in building this 'caption-conditioned' image encoding which tightly couples both the tasks and allows the weak supervision to effectively guide visual grounding. We provide an extensive empirical and qualitative analysis to investigate the different components of our proposed model and compare it with competitive baselines. For phrase localization, we report an improvement of 4.9% (absolute) over the prior state-of-the-art on the VisualGenome dataset. We also report results that are on par with the state-of-the-art on the downstream caption-to-image retrieval task on COCO and Flickr30k datasets., Comment: v2 contains phrase localization results on Flickr30k Entities. Accepted for publication at ICCV 2019
Published: 2019

23. Semantically-Aware Attentive Neural Embeddings for Image-based Visual Localization

Author: Seymour, Zachary, Sikka, Karan, Chiu, Han-Pang, Samarasekera, Supun, and Kumar, Rakesh
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: We present an approach that combines appearance and semantic information for 2D image-based localization (2D-VL) across large perceptual changes and time lags. Compared to appearance features, the semantic layout of a scene is generally more invariant to appearance variations. We use this intuition and propose a novel end-to-end deep attention-based framework that utilizes multimodal cues to generate robust embeddings for 2D-VL. The proposed attention module predicts a shared channel attention and modality-specific spatial attentions to guide the embeddings to focus on more reliable image regions. We evaluate our model against state-of-the-art (SOTA) methods on three challenging localization datasets. We report an average (absolute) improvement of 19% over current SOTA for 2D-VL. Furthermore, we present an extensive study demonstrating the contribution of each component of our model, showing 8-15% and 4% improvement from adding semantic information and our proposed attention module. We finally show the predicted attention maps to offer useful insights into our model., Comment: Appearing in BMVC 2019
Published: 2018

24. Understanding Visual Ads by Aligning Symbols and Objects using Co-Attention

Author: Ahuja, Karuna, Sikka, Karan, Roy, Anirban, and Divakaran, Ajay
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: We tackle the problem of understanding visual ads where given an ad image, our goal is to rank appropriate human generated statements describing the purpose of the ad. This problem is generally addressed by jointly embedding images and candidate statements to establish correspondence. Decoding a visual ad requires inference of both semantic and symbolic nuances referenced in an image and prior methods may fail to capture such associations especially with weakly annotated symbols. In order to create better embeddings, we leverage an attention mechanism to associate image proposals with symbols and thus effectively aggregate information from aligned multimodal representations. We propose a multihop co-attention mechanism that iteratively refines the attention map to ensure accurate attention estimation. Our attention based embedding model is learned end-to-end guided by a max-margin loss function. We show that our model outperforms other baselines on the benchmark Ad dataset and also show qualitative results to highlight the advantages of using multihop co-attention., Comment: Accepted at CVPR 2018 workshop- Towards Automatic Understanding of Visual Advertisements
Published: 2018

25. Zero-Shot Object Detection

Author: Bansal, Ankan, Sikka, Karan, Sharma, Gaurav, Chellappa, Rama, and Divakaran, Ajay
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: We introduce and tackle the problem of zero-shot object detection (ZSD), which aims to detect object classes which are not observed during training. We work with a challenging set of object classes, not restricting ourselves to similar and/or fine-grained categories as in prior works on zero-shot classification. We present a principled approach by first adapting visual-semantic embeddings for ZSD. We then discuss the problems associated with selecting a background class and motivate two background-aware approaches for learning robust detectors. One of these models uses a fixed background class and the other is based on iterative latent assignments. We also outline the challenge associated with using a limited number of training classes and propose a solution based on dense sampling of the semantic label space using auxiliary data with a large number of categories. We propose novel splits of two standard detection datasets - MSCOCO and VisualGenome, and present extensive empirical results in both the traditional and generalized zero-shot settings to highlight the benefits of the proposed methods. We provide useful insights into the algorithm and conclude by posing some open questions to encourage further research., Comment: 17 pages. ECCV 2018
Published: 2018

26. Combining Weakly and Webly Supervised Learning for Classifying Food Images

Author: Kaur, Parneet, Sikka, Karan, and Divakaran, Ajay
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Food classification from images is a fine-grained classification problem. Manual curation of food images is prohibitive in cost, time and scalability. On the other hand, web data is available freely but contains noise. In this paper, we address the problem of classifying food images with minimal data curation. We also tackle a key problem with food images from the web, where they often have multiple co-occurring food types but are weakly labeled with a single label. We first demonstrate that by sequentially adding a few manually curated samples to a larger uncurated dataset from two web sources, the top-1 classification accuracy increases from 50.3% to 72.8%. To tackle the issue of weak labels, we augment the deep model with Weakly Supervised learning (WSL) that results in an increase in performance to 76.2%. Finally, we show some qualitative results to provide insights into the performance improvements using the proposed ideas.
Published: 2017

27. AdaScan: Adaptive Scan Pooling in Deep Convolutional Neural Networks for Human Action Recognition in Videos

Author: Kar, Amlan, Rai, Nishant, Sikka, Karan, and Sharma, Gaurav
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: We propose a novel method for temporally pooling frames in a video for the task of human action recognition. The method is motivated by the observation that there are only a small number of frames which, together, contain sufficient information to discriminate an action class present in a video, from the rest. The proposed method learns to pool such discriminative and informative frames, while discarding a majority of the non-informative frames in a single temporal scan of the video. Our algorithm does so by continuously predicting the discriminative importance of each video frame and subsequently pooling them in a deep learning framework. We show the effectiveness of our proposed pooling method on standard benchmarks where it consistently improves on baseline pooling methods, with both RGB and optical flow based Convolutional networks. Further, in combination with complementary video representations, we show results that are competitive with respect to the state-of-the-art results on two challenging and publicly available benchmark datasets., Comment: CVPR 2017 Camera Ready Version
Published: 2016

28. Discriminatively Trained Latent Ordinal Model for Video Classification

Author: Sikka, Karan and Sharma, Gaurav
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: We study the problem of video classification for facial analysis and human action recognition. We propose a novel weakly supervised learning method that models the video as a sequence of automatically mined, discriminative sub-events (e.g. onset and offset phase for "smile", running and jumping for "highjump"). The proposed model is inspired by the recent works on Multiple Instance Learning and latent SVM/HCRF -- it extends such frameworks to model the ordinal aspect in the videos, approximately. We obtain consistent improvements over relevant competitive baselines on four challenging and publicly available video based facial analysis datasets for prediction of expression, clinical pain and intent in dyadic conversations and on three challenging human action datasets. We also validate the method with qualitative results and show that they largely support the intuitions behind the method., Comment: Paper accepted in IEEE TPAMI. arXiv admin note: substantial text overlap with arXiv:1604.01500
Published: 2016

29. LOMo: Latent Ordinal Model for Facial Analysis in Videos

Author: Sikka, Karan, Sharma, Gaurav, and Bartlett, Marian
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: We study the problem of facial analysis in videos. We propose a novel weakly supervised learning method that models the video event (expression, pain etc.) as a sequence of automatically mined, discriminative sub-events (e.g. onset and offset phase for smile, brow lower and cheek raise for pain). The proposed model is inspired by the recent works on Multiple Instance Learning and latent SVM/HCRF -- it extends such frameworks to model the ordinal or temporal aspect in the videos, approximately. We obtain consistent improvements over relevant competitive baselines on four challenging and publicly available video based facial analysis datasets for prediction of expression, clinical pain and intent in dyadic conversations. In combination with complementary features, we report state-of-the-art results on these datasets., Comment: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
Published: 2016

30. Deep Active Object Recognition by Joint Label and Action Prediction

Author: Malmir, Mohsen, Sikka, Karan, Forster, Deborah, Fasel, Ian, Movellan, Javier R., and Cottrell, Garrison W.
Subjects: Computer Science - Artificial Intelligence
Abstract: An active object recognition system has the advantage of being able to act in the environment to capture images that are more suited for training and that lead to better performance at test time. In this paper, we propose a deep convolutional neural network for active object recognition that simultaneously predicts the object label, and selects the next action to perform on the object with the aim of improving recognition performance. We treat active object recognition as a reinforcement learning problem and derive the cost function to train the network for joint prediction of the object label and the action. A generative model of object similarities based on the Dirichlet distribution is proposed and embedded in the network for encoding the state of the system. The training is carried out by simultaneously minimizing the label and action prediction errors using gradient descent. We empirically show that the proposed network is able to predict both the object label and the actions on GERMS, a dataset for active object recognition. We compare the test label prediction accuracy of the proposed model with Dirichlet and Naive Bayes state encoding. The results of experiments suggest that the proposed model equipped with Dirichlet state encoding is superior in performance, and selects images that lead to better training and higher accuracy of label prediction at test time.
Published: 2015

31. Deep active object recognition by joint label and action prediction

Author: Malmir, Mohsen, Sikka, Karan, Forster, Deborah, Fasel, Ian, Movellan, Javier R, and Cottrell, Garrison W
Subjects: Active object recognition, Deep learning, Q-learning
Abstract: An active object recognition system has the advantage of acting in the environment to capture images that are more suited for training and lead to better performance at test time. In this paper, we utilize deep convolutional neural networks for active object recognition by simultaneously predicting the object label and the next action to be performed on the object with the aim of improving recognition performance. We treat active object recognition as a reinforcement learning problem and derive the cost function to train the network for joint prediction of the object label and the action. A generative model of object similarities based on the Dirichlet distribution is proposed and embedded in the network for encoding the state of the system. The training is carried out by simultaneously minimizing the label and action prediction errors using gradient descent. We empirically show that the proposed network is able to predict both the object label and the actions on GERMS, a dataset for active object recognition. We compare the test label prediction accuracy of the proposed model with Dirichlet and Naive Bayes state encoding. The results of experiments suggest that the proposed model equipped with Dirichlet state encoding is superior in performance, and selects images that lead to better training and higher accuracy of label prediction at test time.
Published: 2017

32. Pseudo vs. True Defect Classification in Printed Circuits Boards using Wavelet Features

Author: Sikka, Sahil, Sikka, Karan, Bhuyan, M. K., and Iwahori, Yuji
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: In recent years, Printed Circuit Boards (PCBs) have become the backbone of a large number of consumer electronic devices, leading to a surge in their production. This has made it imperative to employ automatic inspection systems to identify manufacturing defects in PCBs before they are installed in the respective systems. An important task in this regard is the classification of defects as either true or pseudo defects, which decides if the PCB is to be re-manufactured or not. This work proposes a novel approach to detect the most common defects in PCBs. The problem has been approached by employing highly discriminative features based on multi-scale wavelet transform, which are further boosted by using a kernelized version of the support vector machine (SVM). A real world printed circuit board dataset has been used for quantitative analysis. Experimental results demonstrated the efficacy of the proposed method., Comment: 6 pages, 8 figures
Published: 2013

33. Predicting Information Pathways Across Online Communities

Author: Jin, Yiqiao, primary, Lee, Yeon-Chang, additional, Sharma, Kartik, additional, Ye, Meng, additional, Sikka, Karan, additional, Divakaran, Ajay, additional, and Kumar, Srijan, additional
Published: 2023
Full Text: View/download PDF

34. Detecting Trojaned DNNs Using Counterfactual Attributions

Author: Sikka, Karan, primary, Sur, Indranil, additional, Roy, Anirban, additional, Divakaran, Ajay, additional, and Jha, Susmit, additional
Published: 2023
Full Text: View/download PDF

35. Zero-Shot Object Detection

Author: Bansal, Ankan, primary, Sikka, Karan, additional, Sharma, Gaurav, additional, Chellappa, Rama, additional, and Divakaran, Ajay, additional
Published: 2018
Full Text: View/download PDF

36. Exploring Bag of Words Architectures in the Facial Expression Domain

Author: Sikka, Karan, Wu, Tingfan, Susskind, Josh, Bartlett, Marian, Hutchison, David, editor, Kanade, Takeo, editor, Kittler, Josef, editor, Kleinberg, Jon M., editor, Mattern, Friedemann, editor, Mitchell, John C., editor, Naor, Moni, editor, Nierstrasz, Oscar, editor, Pandu Rangan, C., editor, Steffen, Bernhard, editor, Sudan, Madhu, editor, Terzopoulos, Demetri, editor, Tygar, Doug, editor, Vardi, Moshe Y., editor, Weikum, Gerhard, editor, Fusiello, Andrea, editor, Murino, Vittorio, editor, and Cucchiara, Rita, editor
Published: 2012
Full Text: View/download PDF

37. Classification and weakly supervised pain localization using multiple segment representation

Author: Sikka, Karan, Dhall, Abhinav, and Bartlett, Marian Stewart
Published: 2014
Full Text: View/download PDF

38. Dual-Key Multimodal Backdoors for Visual Question Answering

Author: Walmer, Matthew, Sikka, Karan, Sur, Indranil, Shrivastava, Abhinav, and Jha, Susmit
Subjects: FOS: Computer and information sciences, Software_OPERATINGSYSTEMS, Computer Science - Computation and Language, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, Computation and Language (cs.CL)
Abstract: The success of deep learning has enabled advances in multimodal tasks that require non-trivial fusion of multiple input domains. Although multimodal models have shown potential in many problems, their increased complexity makes them more vulnerable to attacks. A Backdoor (or Trojan) attack is a class of security vulnerability wherein an attacker embeds a malicious secret behavior into a network (e.g. targeted misclassification) that is activated when an attacker-specified trigger is added to an input. In this work, we show that multimodal networks are vulnerable to a novel type of attack that we refer to as Dual-Key Multimodal Backdoors. This attack exploits the complex fusion mechanisms used by state-of-the-art networks to embed backdoors that are both effective and stealthy. Instead of using a single trigger, the proposed attack embeds a trigger in each of the input modalities and activates the malicious behavior only when both the triggers are present. We present an extensive study of multimodal backdoors on the Visual Question Answering (VQA) task with multiple architectures and visual feature backbones. A major challenge in embedding backdoors in VQA models is that most models use visual features extracted from a fixed pretrained object detector. This is challenging for the attacker as the detector can distort or ignore the visual trigger entirely, which leads to models where backdoors are over-reliant on the language trigger. We tackle this problem by proposing a visual trigger optimization strategy designed for pretrained object detectors. Through this method, we create Dual-Key Backdoors with over a 98% attack success rate while only poisoning 1% of the training data. Finally, we release TrojVQA, a large collection of clean and trojan VQA models to enable research in defending against multimodal backdoors., Published as conference paper at CVPR 2022. 22 pages, 11 figures, 12 tables
Published: 2022

39. MUWS’22: 1st International Workshop on Multimodal Understanding for the Web and Social Media

Author: Hakimov, Sherzod, primary, Cheema, Gullal Singh, additional, Kastner, Marc A., additional, Shah, Rajiv Ratn, additional, and Sikka, Karan, additional
Published: 2022
Full Text: View/download PDF

40. Challenges in Procedural Multimodal Machine Comprehension: A Novel Way To Benchmark

Author: Sahu, Pritish, Sikka, Karan, and Divakaran, Ajay
Subjects: FOS: Computer and information sciences, Computer Science - Computation and Language, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, Computation and Language (cs.CL)
Abstract: We focus on Multimodal Machine Reading Comprehension (M3C) where a model is expected to answer questions based on a given passage (or context), and the context and the questions can be in different modalities. Previous works such as RecipeQA have proposed datasets and cloze-style tasks for evaluation. However, we identify three critical biases stemming from the question-answer generation process and memorization capabilities of large deep models. These biases make it easier for a model to overfit by relying on spurious correlations or naive data patterns. We propose a systematic framework to address these biases through three Control-Knobs that enable us to generate a test bed of datasets of progressive difficulty levels. We believe that our benchmark (referred to as Meta-RecipeQA) will provide, for the first time, a fine grained estimate of a model's generalization capabilities. We also propose a general M3C model that is used to realize several prior SOTA models and motivate a novel hierarchical transformer based reasoning network (HTRN). We perform a detailed evaluation of these models with different language and visual features on our benchmark. We observe a consistent improvement with HTRN over SOTA (~18% in Visual Cloze task and ~13% in average over all the tasks). We also observe a drop in performance across all the models when testing on RecipeQA and proposed Meta-RecipeQA (e.g. 83.6% versus 67.1% for HTRN), which shows that the proposed dataset is relatively less biased. We conclude by highlighting the impact of the control knobs with some quantitative results.
Published: 2022

41. Latent Dynamic Space-Time Volumes for Predicting Human Facial Behavior in Videos

Author: Sikka, Karan
Subjects: Electrical engineering, Computer science, Robotics, Computer Vision, Facial behavior analysis, Machine Learning, Supervised Learning, Video classification
Abstract: Enabling machines to understand non-verbal facial behavior from visual data is crucial for building smart interactive systems. This thesis focusses on human behavior analysis in videos. Previous state-of-the-art methods generally employed global temporal pooling approaches that (i) assume the presence of a single uniform event spanning the sequence, and (ii) discard temporal ordering by squashing all information along the temporal dimension. In this dissertation we focus on two specific modeling challenges unaddressed by previous approaches. The first issue is training with weak labels that only provide video-level annotations and are much cheaper to obtain than fine (frame-level) annotations. The second concerns modeling temporal dynamics during prediction, as facial expressions are dynamic actions with sub-events. We tackle these issues by proposing methods based on Weakly Supervised Latent Variable Models (WSLVM) and evaluate them on real-world spontaneous expressions. We begin by addressing these challenges by combining the Multiple Instance Learning (MIL) framework and Multiple Segment representation (MS-MIL). MS-MIL can simultaneously classify and localize target behavior in videos despite training with weak annotations. However, this method lacks the capability to explicitly model multiple latent concepts or global temporal order. We address this issue in the next chapter by explicitly modeling temporal orderings by learning an exemplar Hidden Markov Model for each sequence. This algorithm models dependencies between segments but is limited in its modeling capacity due to the use of generative modeling. Chapter 4 extends MIL to learn multiple discriminative concepts in a novel formulation for joint clustering and classification. This algorithm shows consistent performance improvement but does not capture temporal structure.
We finally present a unified learning framework that combines the strengths of the previously proposed algorithms in that it (i) addresses weakly labeled data (ii) learns multiple discriminative concepts, and (iii) models the temporal ordering structure of the concepts. This method is a novel WSLVM that models a video as a sequence of automatically mined, multiple discriminative sub-events with a loose temporal structure. We show both qualitative and quantitative results highlighting improvements over state-of-the-art algorithms by jointly addressing weak labels and temporal dynamics.
Published: 2016

42. MISA: Online Defense of Trojaned Models using Misattributions

Author: Kiourti, Panagiota, primary, Li, Wenchao, additional, Roy, Anirban, additional, Sikka, Karan, additional, and Jha, Susmit, additional
Published: 2021
Full Text: View/download PDF

43. A fully automated algorithm under modified FCM framework for improved brain MR image segmentation

Author: Sikka, Karan, Sinha, Nitesh, Singh, Pankaj K., and Mishra, Amit K.
Published: 2009
Full Text: View/download PDF

44. Exploring Bag of Words Architectures in the Facial Expression Domain

Author: Sikka, Karan, primary, Wu, Tingfan, additional, Susskind, Josh, additional, and Bartlett, Marian, additional
Published: 2012
Full Text: View/download PDF

45. RGB2LIDAR

Author: Mithun, Niluthpol Chowdhury, primary, Sikka, Karan, additional, Chiu, Han-Pang, additional, Samarasekera, Supun, additional, and Kumar, Rakesh, additional
Published: 2020
Full Text: View/download PDF

46. Learning User Preferences from Social Multimedia Analysis and Overview of the iFood2019Challenge

Author: Sikka, Karan, primary
Published: 2019
Full Text: View/download PDF

47. Align2Ground: Weakly Supervised Phrase Grounding Guided by Image-Caption Alignment

Author: Datta, Samyak, primary, Sikka, Karan, additional, Roy, Anirban, additional, Ahuja, Karuna, additional, Parikh, Devi, additional, and Divakaran, Ajay, additional
Published: 2019
Full Text: View/download PDF

48. Integrating Text and Image: Determining Multimodal Document Intent in Instagram Posts

Author: Kruk, Julia, primary, Lubin, Jonah, additional, Sikka, Karan, additional, Lin, Xiao, additional, Jurafsky, Dan, additional, and Divakaran, Ajay, additional
Published: 2019
Full Text: View/download PDF

49. Sunny and Dark Outside?! Improving Answer Consistency in VQA through Entailed Question Generation

Author: Ray, Arijit, primary, Sikka, Karan, additional, Divakaran, Ajay, additional, Lee, Stefan, additional, and Burachas, Giedrius, additional
Published: 2019
Full Text: View/download PDF

50. Discriminatively Trained Latent Ordinal Model for Video Classification

Author: Sikka, Karan, primary and Sharma, Gaurav, additional
Published: 2018
Full Text: View/download PDF
