Author: "Cascante-Bonilla, Paola" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Cascante-Bonilla, Paola"' showing total 21 results

1. Natural Language Inference Improves Compositionality in Vision-Language Models

Author: Cascante-Bonilla, Paola, Hou, Yu, Cao, Yang Trista, Daumé III, Hal, and Rudinger, Rachel
Subjects: Computer Science - Computation and Language, Computer Science - Computer Vision and Pattern Recognition
Abstract: Compositional reasoning in Vision-Language Models (VLMs) remains challenging as these models often struggle to relate objects, attributes, and spatial relationships. Recent methods aim to address these limitations by relying on the semantics of the textual description, using Large Language Models (LLMs) to break them down into subsets of questions and answers. However, these methods primarily operate on the surface level, failing to incorporate deeper lexical understanding while introducing incorrect assumptions generated by the LLM. In response to these issues, we present Caption Expansion with Contradictions and Entailments (CECE), a principled approach that leverages Natural Language Inference (NLI) to generate entailments and contradictions from a given premise. CECE produces lexically diverse sentences while maintaining their core meaning. Through extensive experiments, we show that CECE enhances interpretability and reduces overreliance on biased or superficial features. By balancing CECE along the original premise, we achieve significant improvements over previous methods without requiring additional fine-tuning, producing state-of-the-art results on benchmarks that score agreement with human judgments for image-text alignment, and achieving an increase in performance on Winoground of +19.2% (group score) and +12.9% on EqBen (group score) over the best prior work (finetuned with targeted data).
Comment: Project page: https://cece-vlm.github.io/
Published: 2024

2. PropTest: Automatic Property Testing for Improved Visual Programming

Author: Koo, Jaywon, Yang, Ziyan, Cascante-Bonilla, Paola, Ray, Baishakhi, and Ordonez, Vicente
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Visual Programming has recently emerged as an alternative to end-to-end black-box visual reasoning models. This type of method leverages Large Language Models (LLMs) to generate the source code for an executable computer program that solves a given problem. This strategy has the advantage of offering an interpretable reasoning path and does not require finetuning a model with task-specific data. We propose PropTest, a general strategy that improves visual programming by further using an LLM to generate code that tests for visual properties in an initial round of proposed solutions. Our method generates tests for data-type consistency, output syntax, and semantic properties. PropTest achieves comparable results to state-of-the-art methods while using publicly available LLMs. This is demonstrated across different benchmarks on visual question answering and referring expression comprehension. Particularly, PropTest improves ViperGPT by obtaining 46.1% accuracy (+6.0%) on GQA using Llama3-8B and 59.5% (+8.1%) on RefCOCO+ using CodeLlama-34B.
Comment: Project Page: https://jaywonkoo17.github.io/PropTest/
Published: 2024

3. Learning from Models and Data for Visual Grounding

Author: He, Ruozhen, Cascante-Bonilla, Paola, Yang, Ziyan, Berg, Alexander C., and Ordonez, Vicente
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: We introduce SynGround, a novel framework that combines data-driven learning and knowledge transfer from various large-scale pretrained models to enhance the visual grounding capabilities of a pretrained vision-and-language model. The knowledge transfer from the models initiates the generation of image descriptions through an image description generator. These descriptions serve dual purposes: they act as prompts for synthesizing images through a text-to-image generator, and as queries for synthesizing text, from which phrases are extracted using a large language model. Finally, we leverage an open-vocabulary object detector to generate synthetic bounding boxes for the synthetic images and texts. We finetune a pretrained vision-and-language model on this dataset by optimizing a mask-attention consistency objective that aligns region annotations with gradient-based model explanations. The resulting model improves the grounding capabilities of an off-the-shelf vision-and-language model. Particularly, SynGround improves the pointing game accuracy of ALBEF on the Flickr30k dataset from 79.38% to 87.26%, and on RefCOCO+ Test A from 69.35% to 79.06% and on RefCOCO+ Test B from 53.77% to 63.67%.
Comment: Project Page: https://catherine-r-he.github.io/SynGround/
Published: 2024

4. Grounding Language Models for Visual Entity Recognition

Author: Xiao, Zilin, Gong, Ming, Cascante-Bonilla, Paola, Zhang, Xingyao, Wu, Jie, and Ordonez, Vicente
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computation and Language
Abstract: We introduce AutoVER, an Autoregressive model for Visual Entity Recognition. Our model extends an autoregressive Multi-modal Large Language Model by employing retrieval augmented constrained generation. It mitigates low performance on out-of-domain entities while excelling in queries that require visually-situated reasoning. Our method learns to distinguish similar entities within a vast label space by contrastively training on hard negative pairs in parallel with a sequence-to-sequence objective without an external retriever. During inference, a list of retrieved candidate answers explicitly guides language generation by removing invalid decoding paths. The proposed method achieves significant improvements across different dataset splits in the recently proposed Oven-Wiki benchmark. Accuracy on the Entity seen split rises from 32.7% to 61.5%. It also demonstrates superior performance on the unseen and query splits by a substantial double-digit margin.
Comment: ECCV 2024
Published: 2024

5. Improved Visual Grounding through Self-Consistent Explanations

Author: He, Ruozhen, Cascante-Bonilla, Paola, Yang, Ziyan, Berg, Alexander C., and Ordonez, Vicente
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: Vision-and-language models trained to match images with text can be combined with visual explanation methods to point to the locations of specific objects in an image. Our work shows that the localization ("grounding") abilities of these models can be further improved by finetuning for self-consistent visual explanations. We propose a strategy for augmenting existing text-image datasets with paraphrases using a large language model, and SelfEQ, a weakly-supervised strategy on visual explanation maps for paraphrases that encourages self-consistency. Specifically, for an input textual phrase, we attempt to generate a paraphrase and finetune the model so that the phrase and paraphrase map to the same region in the image. We posit that this both expands the vocabulary that the model is able to handle, and improves the quality of the object locations highlighted by gradient-based visual explanation methods (e.g. GradCAM). We demonstrate that SelfEQ improves performance on Flickr30k, ReferIt, and RefCOCO+ over a strong baseline method and several prior works. Particularly, comparing to other methods that do not use any type of box annotations, we obtain 84.07% on Flickr30k (an absolute improvement of 4.69%), 67.40% on ReferIt (an absolute improvement of 7.68%), and 75.10%, 55.49% on RefCOCO+ test sets A and B respectively (an absolute improvement of 3.74% on average).
Comment: Project Page: https://catherine-r-he.github.io/SelfEQ/
Published: 2023

6. Dense and Aligned Captions (DAC) Promote Compositional Reasoning in VL Models

Author: Doveh, Sivan, Arbelle, Assaf, Harary, Sivan, Herzig, Roei, Kim, Donghyun, Cascante-Bonilla, Paola, Alfassy, Amit, Panda, Rameswar, Giryes, Raja, Feris, Rogerio, Ullman, Shimon, and Karlinsky, Leonid
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Vision and Language (VL) models offer an effective method for aligning representation spaces of images and text, leading to numerous applications such as cross-modal retrieval, visual question answering, captioning, and more. However, the aligned image-text spaces learned by all the popular VL models are still suffering from the so-called 'object bias' - their representations behave as 'bags of nouns', mostly ignoring or downsizing the attributes, relations, and states of objects described/appearing in texts/images. Although some great attempts at fixing these 'compositional reasoning' issues were proposed in the recent literature, the problem is still far from being solved. In this paper, we uncover two factors limiting the VL models' compositional reasoning performance. These two factors are properties of the paired VL dataset used for finetuning and pre-training the VL model: (i) the caption quality, or in other words 'image-alignment', of the texts; and (ii) the 'density' of the captions in the sense of mentioning all the details appearing on the image. We propose a fine-tuning approach for automatically treating these factors leveraging a standard VL dataset (CC3M). Applied to CLIP, we demonstrate its significant compositional reasoning performance increase of up to ~27% over the base model, up to ~20% over the strongest baseline, and by 6.7% on average.
Published: 2023

7. Going Beyond Nouns With Vision & Language Models Using Synthetic Data

Author: Cascante-Bonilla, Paola, Shehada, Khaled, Smith, James Seale, Doveh, Sivan, Kim, Donghyun, Panda, Rameswar, Varol, Gül, Oliva, Aude, Ordonez, Vicente, Feris, Rogerio, and Karlinsky, Leonid
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computation and Language
Abstract: Large-scale pre-trained Vision & Language (VL) models have shown remarkable performance in many applications, enabling replacing a fixed set of supported classes with zero-shot open vocabulary reasoning over (almost arbitrary) natural language prompts. However, recent works have uncovered a fundamental weakness of these models. For example, their difficulty to understand Visual Language Concepts (VLC) that go 'beyond nouns' such as the meaning of non-object words (e.g., attributes, actions, relations, states, etc.), or difficulty in performing compositional reasoning such as understanding the significance of the order of the words in a sentence. In this work, we investigate to which extent purely synthetic data could be leveraged to teach these models to overcome such shortcomings without compromising their zero-shot capabilities. We contribute Synthetic Visual Concepts (SyViC) - a million-scale synthetic dataset and data generation codebase allowing to generate additional suitable data to improve VLC understanding and compositional reasoning of VL models. Additionally, we propose a general VL finetuning strategy for effectively leveraging SyViC towards achieving these improvements. Our extensive experiments and ablations on VL-Checklist, Winoground, and ARO benchmarks demonstrate that it is possible to adapt strong pre-trained VL models with synthetic data significantly enhancing their VLC understanding (e.g. by 9.9% on ARO and 4.3% on VL-Checklist) with under 1% drop in their zero-shot accuracy.
Comment: Accepted to ICCV 2023. Project page: https://synthetic-vic.github.io/
Published: 2023

8. CODA-Prompt: COntinual Decomposed Attention-based Prompting for Rehearsal-Free Continual Learning

Author: Smith, James Seale, Karlinsky, Leonid, Gutta, Vyshnavi, Cascante-Bonilla, Paola, Kim, Donghyun, Arbelle, Assaf, Panda, Rameswar, Feris, Rogerio, and Kira, Zsolt
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: Computer vision models suffer from a phenomenon known as catastrophic forgetting when learning novel concepts from continuously shifting training data. Typical solutions for this continual learning problem require extensive rehearsal of previously seen data, which increases memory costs and may violate data privacy. Recently, the emergence of large-scale pre-trained vision transformer models has enabled prompting approaches as an alternative to data-rehearsal. These approaches rely on a key-query mechanism to generate prompts and have been found to be highly resistant to catastrophic forgetting in the well-established rehearsal-free continual learning setting. However, the key mechanism of these methods is not trained end-to-end with the task sequence. Our experiments show that this leads to a reduction in their plasticity, hence sacrificing new task accuracy, and inability to benefit from expanded parameter capacity. We instead propose to learn a set of prompt components which are assembled with input-conditioned weights to produce input-conditioned prompts, resulting in a novel attention-based end-to-end key-query scheme. Our experiments show that we outperform the current SOTA method DualPrompt on established benchmarks by as much as 4.5% in average final accuracy. We also outperform the state of art by as much as 4.4% accuracy on a continual learning benchmark which contains both class-incremental and domain-incremental task shifts, corresponding to many practical settings. Our code is available at https://github.com/GT-RIPL/CODA-Prompt
Comment: Accepted by the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2023)
Published: 2022

9. On the Transferability of Visual Features in Generalized Zero-Shot Learning

Author: Cascante-Bonilla, Paola, Karlinsky, Leonid, Smith, James Seale, Qi, Yanjun, and Ordonez, Vicente
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: Generalized Zero-Shot Learning (GZSL) aims to train a classifier that can generalize to unseen classes, using a set of attributes as auxiliary information, and the visual features extracted from a pre-trained convolutional neural network. While recent GZSL methods have explored various techniques to leverage the capacity of these features, there has been an extensive growth of representation learning techniques that remain under-explored. In this work, we investigate the utility of different GZSL methods when using different feature extractors, and examine how these models' pre-training objectives, datasets, and architecture design affect their feature representation ability. Our results indicate that 1) methods using generative components for GZSL provide more advantages when using recent feature extractors; 2) feature extractors pre-trained using self-supervised learning objectives and knowledge distillation provide better feature representations, increasing up to 15% performance when used with recent GZSL techniques; 3) specific feature extractors pre-trained with larger datasets do not necessarily boost the performance of GZSL methods. In addition, we investigate how GZSL methods fare against CLIP, a more recent multi-modal pre-trained model with strong zero-shot performance. We found that GZSL tasks still benefit from generative-based GZSL methods along with CLIP's internet-scale pre-training to achieve state-of-the-art performance in fine-grained datasets. We release a modular framework for analyzing representation learning issues in GZSL here: https://github.com/uvavision/TV-GZSL
Published: 2022

10. ConStruct-VL: Data-Free Continual Structured VL Concepts Learning

Author: Smith, James Seale, Cascante-Bonilla, Paola, Arbelle, Assaf, Kim, Donghyun, Panda, Rameswar, Cox, David, Yang, Diyi, Kira, Zsolt, Feris, Rogerio, and Karlinsky, Leonid
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Computer Vision and Pattern Recognition
Abstract: Recently, large-scale pre-trained Vision-and-Language (VL) foundation models have demonstrated remarkable capabilities in many zero-shot downstream tasks, achieving competitive results for recognizing objects defined by as little as short text prompts. However, it has also been shown that VL models are still brittle in Structured VL Concept (SVLC) reasoning, such as the ability to recognize object attributes, states, and inter-object relations. This leads to reasoning mistakes, which need to be corrected as they occur by teaching VL models the missing SVLC skills; often this must be done using private data where the issue was found, which naturally leads to a data-free continual (no task-id) VL learning setting. In this work, we introduce the first Continual Data-Free Structured VL Concepts Learning (ConStruct-VL) benchmark and show it is challenging for many existing data-free CL strategies. We, therefore, propose a data-free method comprised of a new approach of Adversarial Pseudo-Replay (APR) which generates adversarial reminders of past tasks from past task models. To use this method efficiently, we also propose a continual parameter-efficient Layered-LoRA (LaLo) neural architecture allowing no-memory-cost access to all past models at train time. We show this approach outperforms all data-free methods by as much as ~7% while even matching some levels of experience-replay (prohibitive for applications where data-privacy must be preserved). Our code is publicly available at https://github.com/jamessealesmith/ConStruct-VL
Comment: Accepted by the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2023)
Published: 2022

11. SimVQA: Exploring Simulated Environments for Visual Question Answering

Author: Cascante-Bonilla, Paola, Wu, Hui, Wang, Letao, Feris, Rogerio, and Ordonez, Vicente
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Existing work on VQA explores data augmentation to achieve better generalization by perturbing the images in the dataset or modifying the existing questions and answers. While these methods exhibit good performance, the diversity of the questions and answers are constrained by the available image set. In this work we explore using synthetic computer-generated data to fully control the visual and language space, allowing us to provide more diverse scenarios. We quantify the effect of synthetic data in real-world VQA benchmarks and to which extent it produces results that generalize to real data. By exploiting 3D and physics simulation platforms, we provide a pipeline to generate synthetic data to expand and replace type-specific questions and answers without risking the exposure of sensitive or personal data that might be present in real images. We offer a comprehensive analysis while expanding existing hyper-realistic datasets to be used for VQA. We also propose Feature Swapping (F-SWAP) -- where we randomly switch object-level features during training to make a VQA model more domain invariant. We show that F-SWAP is effective for enhancing a currently existing VQA dataset of real images without compromising on the accuracy to answer existing questions in the dataset.
Comment: Accepted to CVPR 2022. Camera-Ready version. Project page: https://simvqa.github.io/
Published: 2022

12. Evolving Image Compositions for Feature Representation Learning

Author: Cascante-Bonilla, Paola, Sekhon, Arshdeep, Qi, Yanjun, and Ordonez, Vicente
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning, Computer Science - Neural and Evolutionary Computing
Abstract: Convolutional neural networks for visual recognition require large amounts of training samples and usually benefit from data augmentation. This paper proposes PatchMix, a data augmentation method that creates new samples by composing patches from pairs of images in a grid-like pattern. These new samples are assigned label scores that are proportional to the number of patches borrowed from each image. We then add a set of additional losses at the patch-level to regularize and to encourage good representations at both the patch and image levels. A ResNet-50 model trained on ImageNet using PatchMix exhibits superior transfer learning capabilities across a wide array of benchmarks. Although PatchMix can rely on random pairings and random grid-like patterns for mixing, we explore evolutionary search as a guiding strategy to jointly discover optimal grid-like patterns and image pairings. For this purpose, we conceive a fitness function that bypasses the need to re-train a model to evaluate each possible choice. In this way, PatchMix outperforms a base model on CIFAR-10 (+1.91), CIFAR-100 (+5.31), Tiny Imagenet (+3.52), and ImageNet (+1.16).
Comment: Accepted to BMVC 2021. Camera-Ready version. Project page: https://paolacascante.com/patchmix/index.html
Published: 2021
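
Illustration (not part of the catalog record): the patch-mixing and proportional-label idea described in the abstract of item 12 can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `patch_mix`, the 0.5 sampling probability per grid cell, and the one-hot label format are all hypothetical choices made here, and the patch-level losses and evolutionary search mentioned in the abstract are not covered.

```python
import numpy as np

def patch_mix(img_a, img_b, label_a, label_b, grid=4, rng=None):
    """Mix two same-sized images cell by cell on a grid x grid layout.

    Hypothetical helper: each grid cell is taken from img_b with
    probability 0.5 (an assumption), otherwise from img_a. The mixed label
    weights each input label by the fraction of cells borrowed from the
    corresponding image.
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w = img_a.shape[:2]
    assert img_a.shape == img_b.shape, "images must have the same shape"
    assert h % grid == 0 and w % grid == 0, "grid must divide image size"
    ph, pw = h // grid, w // grid

    # Boolean grid: True means the cell is borrowed from img_b.
    take_b = rng.random((grid, grid)) < 0.5

    # Expand the cell-level mask to pixel resolution.
    mask = np.repeat(np.repeat(take_b, ph, axis=0), pw, axis=1)
    if img_a.ndim == 3:
        mask = mask[..., None]  # broadcast over the channel axis

    mixed = np.where(mask, img_b, img_a)

    # Label scores proportional to the number of cells from each image.
    frac_b = take_b.mean()
    mixed_label = (1.0 - frac_b) * label_a + frac_b * label_b
    return mixed, mixed_label, take_b


# Usage: mix an all-zeros image with an all-ones image.
rng = np.random.default_rng(0)
a = np.zeros((8, 8, 3))
b = np.ones((8, 8, 3))
la = np.array([1.0, 0.0])
lb = np.array([0.0, 1.0])
mixed, label, take_b = patch_mix(a, b, la, lb, grid=4, rng=rng)
```

Because every cell has the same size, the share of pixels taken from the second image equals the share of grid cells taken from it, so the mixed label sums to 1 whenever the input labels do.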

13. Curriculum Labeling: Revisiting Pseudo-Labeling for Semi-Supervised Learning

Author: Cascante-Bonilla, Paola, Tan, Fuwen, Qi, Yanjun, and Ordonez, Vicente
Subjects: Computer Science - Machine Learning, Computer Science - Computer Vision and Pattern Recognition, Statistics - Machine Learning
Abstract: In this paper we revisit the idea of pseudo-labeling in the context of semi-supervised learning where a learning algorithm has access to a small set of labeled samples and a large set of unlabeled samples. Pseudo-labeling works by applying pseudo-labels to samples in the unlabeled set by using a model trained on the combination of the labeled samples and any previously pseudo-labeled samples, and iteratively repeating this process in a self-training cycle. Current methods seem to have abandoned this approach in favor of consistency regularization methods that train models under a combination of different styles of self-supervised losses on the unlabeled samples and standard supervised losses on the labeled samples. We empirically demonstrate that pseudo-labeling can in fact be competitive with the state-of-the-art, while being more resilient to out-of-distribution samples in the unlabeled set. We identify two key factors that allow pseudo-labeling to achieve such remarkable results (1) applying curriculum learning principles and (2) avoiding concept drift by restarting model parameters before each self-training cycle. We obtain 94.91% accuracy on CIFAR-10 using only 4,000 labeled samples, and 68.87% top-1 accuracy on Imagenet-ILSVRC using only 10% of the labeled samples. The code is available at https://github.com/uvavision/Curriculum-Labeling
Comment: In the 35th AAAI Conference on Artificial Intelligence. AAAI 2021
Published: 2020

14. Drill-down: Interactive Retrieval of Complex Scenes using Natural Language Queries

Author: Tan, Fuwen, Cascante-Bonilla, Paola, Guo, Xiaoxiao, Wu, Hui, Feng, Song, and Ordonez, Vicente
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: This paper explores the task of interactive image retrieval using natural language queries, where a user progressively provides input queries to refine a set of retrieval results. Moreover, our work explores this problem in the context of complex image scenes containing multiple objects. We propose Drill-down, an effective framework for encoding multiple queries with an efficient compact state representation that significantly extends current methods for single-round image retrieval. We show that using multiple rounds of natural language queries as input can be surprisingly effective to find arbitrarily specific images of complex scenes. Furthermore, we find that existing image datasets with textual captions can provide a surprisingly effective form of weak supervision for this task. We compare our method with existing sequential encoding and embedding networks, demonstrating superior performance on two proposed benchmarks: automatic image retrieval on a simulated scenario that uses region captions as queries, and interactive image retrieval using real queries from human evaluators.
Comment: 14 pages, 9 figures, NeurIPS 2019
Published: 2019

15. Moviescope: Large-scale Analysis of Movies using Multiple Modalities

Author: Cascante-Bonilla, Paola, Sitaraman, Kalpathy, Luo, Mengjia, and Ordonez, Vicente
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Film media is a rich form of artistic expression. Unlike photography, and short videos, movies contain a storyline that is deliberately complex and intricate in order to engage its audience. In this paper we present a large scale study comparing the effectiveness of visual, audio, text, and metadata-based features for predicting high-level information about movies such as their genre or estimated budget. We demonstrate the usefulness of content-based methods in this domain in contrast to human-based and metadata-based predictions in the era of deep learning. Additionally, we provide a comprehensive study of temporal feature aggregation methods for representing video and text and find that simple pooling operations are effective in this domain. We also show to what extent different modalities are complementary to each other. To this end, we also introduce Moviescope, a new large-scale dataset of 5,000 movies with corresponding movie trailers (video + audio), movie posters (images), movie plots (text), and metadata.
Published: 2019

16. Chat-crowd: A Dialog-based Platform for Visual Layout Composition

Author: Cascante-Bonilla, Paola, Yin, Xuwang, Ordonez, Vicente, and Feng, Song
Subjects: Computer Science - Computation and Language, Computer Science - Human-Computer Interaction
Abstract: In this paper we introduce Chat-crowd, an interactive environment for visual layout composition via conversational interactions. Chat-crowd supports multiple agents with two conversational roles: agents who play the role of a designer are in charge of placing objects in an editable canvas according to instructions or commands issued by agents with a director role. The system can be integrated with crowdsourcing platforms for both synchronous and asynchronous data collection and is equipped with comprehensive quality controls on the performance of both types of agents. We expect that this system will be useful to build multimodal goal-oriented dialog tasks that require spatial and geometric reasoning.
Published: 2018

17. ConStruct-VL: Data-Free Continual Structured VL Concepts Learning*

Author: Smith, James Seale, Cascante-Bonilla, Paola, Arbelle, Assaf, Kim, Donghyun, Panda, Rameswar, Cox, David, Yang, Diyi, Kira, Zsolt, Feris, Rogerio, and Karlinsky, Leonid
Published: 2023

18. CODA-Prompt: COntinual Decomposed Attention-Based Prompting for Rehearsal-Free Continual Learning

Author: Smith, James Seale, Karlinsky, Leonid, Gutta, Vyshnavi, Cascante-Bonilla, Paola, Kim, Donghyun, Arbelle, Assaf, Panda, Rameswar, Feris, Rogerio, and Kira, Zsolt
Published: 2023

19. Curriculum Labeling: Revisiting Pseudo-Labeling for Semi-Supervised Learning

Author: Cascante-Bonilla, Paola, Tan, Fuwen, Qi, Yanjun, and Ordonez, Vicente
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Statistics - Machine Learning, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, Machine Learning (stat.ML), General Medicine, Machine Learning (cs.LG)
Abstract: In this paper we revisit the idea of pseudo-labeling in the context of semi-supervised learning where a learning algorithm has access to a small set of labeled samples and a large set of unlabeled samples. Pseudo-labeling works by applying pseudo-labels to samples in the unlabeled set by using a model trained on the combination of the labeled samples and any previously pseudo-labeled samples, and iteratively repeating this process in a self-training cycle. Current methods seem to have abandoned this approach in favor of consistency regularization methods that train models under a combination of different styles of self-supervised losses on the unlabeled samples and standard supervised losses on the labeled samples. We empirically demonstrate that pseudo-labeling can in fact be competitive with the state-of-the-art, while being more resilient to out-of-distribution samples in the unlabeled set. We identify two key factors that allow pseudo-labeling to achieve such remarkable results (1) applying curriculum learning principles and (2) avoiding concept drift by restarting model parameters before each self-training cycle. We obtain 94.91% accuracy on CIFAR-10 using only 4,000 labeled samples, and 68.87% top-1 accuracy on Imagenet-ILSVRC using only 10% of the labeled samples. The code is available at https://github.com/uvavision/Curriculum-Labeling
Comment: In the 35th AAAI Conference on Artificial Intelligence. AAAI 2021
Published: 2021

20. Sim VQA: Exploring Simulated Environments for Visual Question Answering

Author: Cascante-Bonilla, Paola, Wu, Hui, Wang, Letao, Feris, Rogerio, and Ordonez, Vicente
Published: 2022

21. Chat-crowd: A Dialog-based Platform for Visual Layout Composition

Author: Cascante-Bonilla, Paola, Yin, Xuwang, Ordonez, Vicente, and Feng, Song
Published: 2019
