Author: "Cornia, A." - Searchworks@Jio Institute Digital Library Search Results

Your search for "Cornia, A." returned 3,015 results

1. Personalizing Multimodal Large Language Models for Image Captioning: An Experimental Analysis

Author: Bucciarelli, Davide, Moratelli, Nicholas, Cornia, Marcella, Baraldi, Lorenzo, and Cucchiara, Rita
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Multimedia
Abstract: The task of image captioning demands an algorithm to generate natural language descriptions of visual inputs. Recent advancements have seen a convergence between image captioning research and the development of Large Language Models (LLMs) and Multimodal LLMs -- like GPT-4V and Gemini -- which extend the capabilities of text-only LLMs to multiple modalities. This paper investigates whether Multimodal LLMs can supplant traditional image captioning networks by evaluating their performance on various image description benchmarks. We explore both the zero-shot capabilities of these models and their adaptability to different semantic domains through fine-tuning methods, including prompt learning, prefix tuning, and low-rank adaptation. Our results demonstrate that while Multimodal LLMs achieve impressive zero-shot performance, fine-tuning for specific domains while maintaining their generalization capabilities intact remains challenging. We discuss the implications of these findings for future research in image captioning and the development of more adaptable Multimodal LLMs.
Comment: ECCV 2024 Workshop on Green Foundation Models
Published: 2024

2. Talking to DINO: Bridging Self-Supervised Vision Backbones with Language for Open-Vocabulary Segmentation

Author: Barsellotti, Luca, Bianchi, Lorenzo, Messina, Nicola, Carrara, Fabio, Cornia, Marcella, Baraldi, Lorenzo, Falchi, Fabrizio, and Cucchiara, Rita
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language
Abstract: Open-Vocabulary Segmentation (OVS) aims at segmenting images from free-form textual concepts without predefined training classes. While existing vision-language models such as CLIP can generate segmentation masks by leveraging coarse spatial information from Vision Transformers, they face challenges in spatial localization due to their global alignment of image and text features. Conversely, self-supervised visual models like DINO excel in fine-grained visual encoding but lack integration with language. To bridge this gap, we present Talk2DINO, a novel hybrid approach that combines the spatial accuracy of DINOv2 with the language understanding of CLIP. Our approach aligns the textual embeddings of CLIP to the patch-level features of DINOv2 through a learned mapping function without the need to fine-tune the underlying backbones. At training time, we exploit the attention maps of DINOv2 to selectively align local visual patches with textual embeddings. We show that the powerful semantic and localization abilities of Talk2DINO can enhance the segmentation process, resulting in more natural and less noisy segmentations, and that our approach can also effectively distinguish foreground objects from the background. Experimental results demonstrate that Talk2DINO achieves state-of-the-art performance across several unsupervised OVS benchmarks. Source code and models are publicly available at: https://lorebianchi98.github.io/Talk2DINO/.
Published: 2024

3. Augmenting Multimodal LLMs with Self-Reflective Tokens for Knowledge-based Visual Question Answering

Author: Cocchi, Federico, Moratelli, Nicholas, Cornia, Marcella, Baraldi, Lorenzo, and Cucchiara, Rita
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Multimedia
Abstract: Multimodal LLMs (MLLMs) are the natural extension of large language models to handle multimodal inputs, combining text and image data. They have recently garnered attention due to their capability to address complex tasks involving both modalities. However, their effectiveness is limited to the knowledge acquired during training, which restricts their practical utility. In this work, we introduce a novel method to enhance the adaptability of MLLMs by integrating external knowledge sources. Our proposed model, Reflective LLaVA (ReflectiVA), utilizes reflective tokens to dynamically determine the need for external knowledge and predict the relevance of information retrieved from an external database. Tokens are trained following a two-stage two-model training recipe. This ultimately enables the MLLM to manage external knowledge while preserving fluency and performance on tasks where external knowledge is not needed. Through our experiments, we demonstrate the efficacy of ReflectiVA for knowledge-based visual question answering, highlighting its superior performance compared to existing methods. Source code and trained models are publicly available at https://github.com/aimagelab/ReflectiVA.
Published: 2024

4. TPP-Gaze: Modelling Gaze Dynamics in Space and Time with Neural Temporal Point Processes

Author: D'Amelio, Alessandro, Cartella, Giuseppe, Cuculo, Vittorio, Lucchi, Manuele, Cornia, Marcella, Cucchiara, Rita, and Boccignone, Giuseppe
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: Attention guides our gaze to fixate the proper location of the scene and holds it in that location for the deserved amount of time given current processing demands, before shifting to the next one. As such, gaze deployment crucially is a temporal process. Existing computational models have made significant strides in predicting spatial aspects of observer's visual scanpaths (where to look), while often putting on the background the temporal facet of attention dynamics (when). In this paper we present TPP-Gaze, a novel and principled approach to model scanpath dynamics based on Neural Temporal Point Process (TPP), that jointly learns the temporal dynamics of fixations position and duration, integrating deep learning methodologies with point process theory. We conduct extensive experiments across five publicly available datasets. Our results show the overall superior performance of the proposed model compared to state-of-the-art approaches. Source code and trained models are publicly available at: https://github.com/phuselab/tppgaze.
Comment: Accepted at WACV 2025
Published: 2024

5. Personalized Instance-based Navigation Toward User-Specific Objects in Realistic Environments

Author: Barsellotti, Luca, Bigazzi, Roberto, Cornia, Marcella, Baraldi, Lorenzo, and Cucchiara, Rita
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Robotics
Abstract: In the last years, the research interest in visual navigation towards objects in indoor environments has grown significantly. This growth can be attributed to the recent availability of large navigation datasets in photo-realistic simulated environments, like Gibson and Matterport3D. However, the navigation tasks supported by these datasets are often restricted to the objects present in the environment at acquisition time. Also, they fail to account for the realistic scenario in which the target object is a user-specific instance that can be easily confused with similar objects and may be found in multiple locations within the environment. To address these limitations, we propose a new task denominated Personalized Instance-based Navigation (PIN), in which an embodied agent is tasked with locating and reaching a specific personal object by distinguishing it among multiple instances of the same category. The task is accompanied by PInNED, a dedicated new dataset composed of photo-realistic scenes augmented with additional 3D objects. In each episode, the target object is presented to the agent using two modalities: a set of visual reference images on a neutral background and manually annotated textual descriptions. Through comprehensive evaluations and analyses, we showcase the challenges of the PIN task as well as the performance and shortcomings of currently available methods designed for object-driven navigation, considering modular and end-to-end agents.
Comment: NeurIPS 2024 Datasets and Benchmarks Track. Project page: https://aimagelab.github.io/pin/
Published: 2024

6. Positive-Augmented Contrastive Learning for Vision-and-Language Evaluation and Training

Author: Sarto, Sara, Moratelli, Nicholas, Cornia, Marcella, Baraldi, Lorenzo, and Cucchiara, Rita
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Multimedia
Abstract: Despite significant advancements in caption generation, existing evaluation metrics often fail to capture the full quality or fine-grained details of captions. This is mainly due to their reliance on non-specific human-written references or noisy pre-training data. Still, finding an effective metric is crucial not only for captions evaluation but also for the generation phase. Metrics can indeed play a key role in the fine-tuning stage of captioning models, ultimately enhancing the quality of the generated captions. In this paper, we propose PAC-S++, a learnable metric that leverages the CLIP model, pre-trained on both web-collected and cleaned data and regularized through additional pairs of generated visual and textual positive samples. Exploiting this stronger and curated pre-training, we also apply PAC-S++ as a reward in the Self-Critical Sequence Training (SCST) stage typically employed to fine-tune captioning models. Extensive experiments on different image and video datasets highlight the effectiveness of PAC-S++ compared to popular metrics for the task, including its sensitivity to object hallucinations. Furthermore, we show that integrating PAC-S++ into the fine-tuning stage of a captioning model results in semantically richer captions with fewer repetitions and grammatical errors. Evaluations on out-of-domain benchmarks further demonstrate the efficacy of our fine-tuning approach in enhancing model capabilities. Source code and trained models are publicly available at: https://github.com/aimagelab/pacscore.
Published: 2024

7. Fluent and Accurate Image Captioning with a Self-Trained Reward Model

Author: Moratelli, Nicholas, Cornia, Marcella, Baraldi, Lorenzo, and Cucchiara, Rita
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Fine-tuning image captioning models with hand-crafted rewards like the CIDEr metric has been a classical strategy for promoting caption quality at the sequence level. This approach, however, is known to limit descriptiveness and semantic richness and tends to drive the model towards the style of ground-truth sentences, thus losing detail and specificity. On the contrary, recent attempts to employ image-text models like CLIP as reward have led to grammatically incorrect and repetitive captions. In this paper, we propose Self-Cap, a captioning approach that relies on a learnable reward model based on self-generated negatives that can discriminate captions based on their consistency with the image. Specifically, our discriminator is a fine-tuned contrastive image-text model trained to promote caption correctness while avoiding the aberrations that typically happen when training with a CLIP-based reward. To this end, our discriminator directly incorporates negative samples from a frozen captioner, which significantly improves the quality and richness of the generated captions but also reduces the fine-tuning time in comparison to using the CIDEr score as the sole metric for optimization. Experimental results demonstrate the effectiveness of our training strategy on both standard and zero-shot image captioning datasets.
Comment: ICPR 2024
Published: 2024

8. Revisiting Image Captioning Training Paradigm via Direct CLIP-based Optimization

Author: Moratelli, Nicholas, Caffagni, Davide, Cornia, Marcella, Baraldi, Lorenzo, and Cucchiara, Rita
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Multimedia
Abstract: The conventional training approach for image captioning involves pre-training a network using teacher forcing and subsequent fine-tuning with Self-Critical Sequence Training to maximize hand-crafted captioning metrics. However, when attempting to optimize modern and higher-quality metrics like CLIP-Score and PAC-Score, this training method often encounters instability and fails to acquire the genuine descriptive capabilities needed to produce fluent and informative captions. In this paper, we propose a new training paradigm termed Direct CLIP-Based Optimization (DiCO). Our approach jointly learns and optimizes a reward model that is distilled from a learnable captioning evaluator with high human correlation. This is done by solving a weighted classification problem directly inside the captioner. At the same time, DiCO prevents divergence from the original model, ensuring that fluency is maintained. DiCO not only exhibits improved stability and enhanced quality in the generated captions but also aligns more closely with human preferences compared to existing methods, especially in modern metrics. Additionally, it maintains competitive performance in traditional metrics. Our source code and trained models are publicly available at https://github.com/aimagelab/DiCO.
Comment: BMVC 2024
Published: 2024

9. Contrasting Deepfakes Diffusion via Contrastive Learning and Global-Local Similarities

Author: Baraldi, Lorenzo, Cocchi, Federico, Cornia, Marcella, Nicolosi, Alessandro, and Cucchiara, Rita
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Multimedia
Abstract: Discerning between authentic content and that generated by advanced AI methods has become increasingly challenging. While previous research primarily addresses the detection of fake faces, the identification of generated natural images has only recently surfaced. This prompted the recent exploration of solutions that employ foundation vision-and-language models, like CLIP. However, the CLIP embedding space is optimized for global image-to-text alignment and is not inherently designed for deepfake detection, neglecting the potential benefits of tailored training and local image features. In this study, we propose CoDE (Contrastive Deepfake Embeddings), a novel embedding space specifically designed for deepfake detection. CoDE is trained via contrastive learning by additionally enforcing global-local similarities. To sustain the training of our model, we generate a comprehensive dataset that focuses on images generated by diffusion models and encompasses a collection of 9.2 million images produced by using four different generators. Experimental results demonstrate that CoDE achieves state-of-the-art accuracy on the newly collected dataset, while also showing excellent generalization capabilities to unseen image generators. Our source code, trained models, and collected dataset are publicly available at: https://github.com/aimagelab/CoDE.
Comment: ECCV 2024
Published: 2024

10. BRIDGE: Bridging Gaps in Image Captioning Evaluation with Stronger Visual Cues

Author: Sarto, Sara, Cornia, Marcella, Baraldi, Lorenzo, and Cucchiara, Rita
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Multimedia
Abstract: Effectively aligning with human judgment when evaluating machine-generated image captions represents a complex yet intriguing challenge. Existing evaluation metrics like CIDEr or CLIP-Score fall short in this regard as they do not take into account the corresponding image or lack the capability of encoding fine-grained details and penalizing hallucinations. To overcome these issues, in this paper, we propose BRIDGE, a new learnable and reference-free image captioning metric that employs a novel module to map visual features into dense vectors and integrates them into multi-modal pseudo-captions which are built during the evaluation process. This approach results in a multimodal metric that properly incorporates information from the input image without relying on reference captions, bridging the gap between human judgment and machine-generated image captions. Experiments spanning several datasets demonstrate that our proposal achieves state-of-the-art results compared to existing reference-free evaluation scores. Our source code and trained models are publicly available at: https://github.com/aimagelab/bridge-score.
Comment: ECCV 2024
Published: 2024

11. Towards Retrieval-Augmented Architectures for Image Captioning

Author: Sarto, Sara, Cornia, Marcella, Baraldi, Lorenzo, Nicolosi, Alessandro, and Cucchiara, Rita
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Multimedia
Abstract: The objective of image captioning models is to bridge the gap between the visual and linguistic modalities by generating natural language descriptions that accurately reflect the content of input images. In recent years, researchers have leveraged deep learning-based models and made advances in the extraction of visual features and the design of multimodal connections to tackle this task. This work presents a novel approach towards developing image captioning models that utilize an external kNN memory to improve the generation process. Specifically, we propose two model variants that incorporate a knowledge retriever component that is based on visual similarities, a differentiable encoder to represent input images, and a kNN-augmented language model to predict tokens based on contextual cues and text retrieved from the external memory. We experimentally validate our approach on COCO and nocaps datasets and demonstrate that incorporating an explicit external memory can significantly enhance the quality of captions, especially with a larger retrieval corpus. This work provides valuable insights into retrieval-augmented captioning models and opens up new avenues for improving image captioning at a larger scale.
Comment: ACM Transactions on Multimedia Computing, Communications and Applications (2024)
Published: 2024

12. Unlearning Vision Transformers Without Retaining Data via Low-Rank Decompositions

Author: Poppi, Samuele, Sarto, Sara, Cornia, Marcella, Baraldi, Lorenzo, and Cucchiara, Rita
Editor: Antonacopoulos, Apostolos, Chaudhuri, Subhasis, Chellappa, Rama, Liu, Cheng-Lin, Bhattacharya, Saumik, and Pal, Umapada
Series Editors and Board: Goos, Gerhard (Series Editor); Hartmanis, Juris (Founding Editor); Bertino, Elisa, Gao, Wen, Steffen, Bernhard, and Yung, Moti (Editorial Board Members)
Published: 2025

13. Adapt to Scarcity: Few-Shot Deepfake Detection via Low-Rank Adaptation

Author: Cappelletti, Silvia, Baraldi, Lorenzo, Cocchi, Federico, Cornia, Marcella, and Cucchiara, Rita
Editor: Antonacopoulos, Apostolos, Chaudhuri, Subhasis, Chellappa, Rama, Liu, Cheng-Lin, Bhattacharya, Saumik, and Pal, Umapada
Series Editors and Board: Goos, Gerhard (Series Editor); Hartmanis, Juris (Founding Editor); Bertino, Elisa, Gao, Wen, Steffen, Bernhard, and Yung, Moti (Editorial Board Members)
Published: 2025

14. Fluent and Accurate Image Captioning with a Self-trained Reward Model

Author: Moratelli, Nicholas, Cornia, Marcella, Baraldi, Lorenzo, and Cucchiara, Rita
Editor: Antonacopoulos, Apostolos, Chaudhuri, Subhasis, Chellappa, Rama, Liu, Cheng-Lin, Bhattacharya, Saumik, and Pal, Umapada
Series Editors and Board: Goos, Gerhard (Series Editor); Hartmanis, Juris (Founding Editor); Bertino, Elisa, Gao, Wen, Steffen, Bernhard, and Yung, Moti (Editorial Board Members)
Published: 2025

15. Safe-CLIP: Removing NSFW Concepts from Vision-and-Language Models

Author: Poppi, Samuele, Poppi, Tobia, Cocchi, Federico, Cornia, Marcella, Baraldi, Lorenzo, and Cucchiara, Rita
Editor: Leonardis, Aleš, Ricci, Elisa, Roth, Stefan, Russakovsky, Olga, Sattler, Torsten, and Varol, Gül
Series Editors and Board: Goos, Gerhard (Series Editor); Hartmanis, Juris (Founding Editor); Bertino, Elisa, Gao, Wen, Steffen, Bernhard, and Yung, Moti (Editorial Board Members)
Published: 2025

16. Contrasting Deepfakes Diffusion via Contrastive Learning and Global-Local Similarities

Author: Baraldi, Lorenzo, Cocchi, Federico, Cornia, Marcella, Nicolosi, Alessandro, and Cucchiara, Rita
Editor: Leonardis, Aleš, Ricci, Elisa, Roth, Stefan, Russakovsky, Olga, Sattler, Torsten, and Varol, Gül
Series Editors and Board: Goos, Gerhard (Series Editor); Hartmanis, Juris (Founding Editor); Bertino, Elisa, Gao, Wen, Steffen, Bernhard, and Yung, Moti (Editorial Board Members)
Published: 2025

17. BRIDGE: Bridging Gaps in Image Captioning Evaluation with Stronger Visual Cues

Author: Sarto, Sara, Cornia, Marcella, Baraldi, Lorenzo, and Cucchiara, Rita
Editor: Leonardis, Aleš, Ricci, Elisa, Roth, Stefan, Russakovsky, Olga, Sattler, Torsten, and Varol, Gül
Series Editors and Board: Goos, Gerhard (Series Editor); Hartmanis, Juris (Founding Editor); Bertino, Elisa, Gao, Wen, Steffen, Bernhard, and Yung, Moti (Editorial Board Members)
Published: 2025

18. Wiki-LLaVA: Hierarchical Retrieval-Augmented Generation for Multimodal LLMs

Author: Caffagni, Davide, Cocchi, Federico, Moratelli, Nicholas, Sarto, Sara, Cornia, Marcella, Baraldi, Lorenzo, and Cucchiara, Rita
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Multimedia
Abstract: Multimodal LLMs are the natural evolution of LLMs, and enlarge their capabilities so as to work beyond the pure textual modality. As research is being carried out to design novel architectures and vision-and-language adapters, in this paper we concentrate on endowing such models with the capability of answering questions that require external knowledge. Our approach, termed Wiki-LLaVA, aims at integrating an external knowledge source of multimodal documents, which is accessed through a hierarchical retrieval pipeline. Relevant passages, using this approach, are retrieved from the external knowledge source and employed as additional context for the LLM, augmenting the effectiveness and precision of generated dialogues. We conduct extensive experiments on datasets tailored for visual question answering with external data and demonstrate the appropriateness of our approach.
Comment: CVPR 2024 Workshop on What is Next in Multimodal Foundation Models
Published: 2024

19. Training-Free Open-Vocabulary Segmentation with Offline Diffusion-Augmented Prototype Generation

Author: Barsellotti, Luca, Amoroso, Roberto, Cornia, Marcella, Baraldi, Lorenzo, and Cucchiara, Rita
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Open-vocabulary semantic segmentation aims at segmenting arbitrary categories expressed in textual form. Previous works have trained over large amounts of image-caption pairs to enforce pixel-level multimodal alignments. However, captions provide global information about the semantics of a given image but lack direct localization of individual concepts. Further, training on large-scale datasets inevitably brings significant computational costs. In this paper, we propose FreeDA, a training-free diffusion-augmented method for open-vocabulary semantic segmentation, which leverages the ability of diffusion models to visually localize generated concepts and local-global similarities to match class-agnostic regions with semantic classes. Our approach involves an offline stage in which textual-visual reference embeddings are collected, starting from a large set of captions and leveraging visual and semantic contexts. At test time, these are queried to support the visual matching process, which is carried out by jointly considering class-agnostic regions and global semantic similarities. Extensive analyses demonstrate that FreeDA achieves state-of-the-art performance on five datasets, surpassing previous methods by more than 7.0 average points in terms of mIoU and without requiring any training.
Comment: CVPR 2024. Project page: https://aimagelab.github.io/freeda/
Published: 2024

20. Multimodal-Conditioned Latent Diffusion Models for Fashion Image Editing

Author: Baldrati, Alberto, Morelli, Davide, Cornia, Marcella, Bertini, Marco, and Cucchiara, Rita
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Fashion illustration is a crucial medium for designers to convey their creative vision and transform design concepts into tangible representations that showcase the interplay between clothing and the human body. In the context of fashion design, computer vision techniques have the potential to enhance and streamline the design process. Departing from prior research primarily focused on virtual try-on, this paper tackles the task of multimodal-conditioned fashion image editing. Our approach aims to generate human-centric fashion images guided by multimodal prompts, including text, human body poses, garment sketches, and fabric textures. To address this problem, we propose extending latent diffusion models to incorporate these multiple modalities and modifying the structure of the denoising network, taking multimodal prompts as input. To condition the proposed architecture on fabric textures, we employ textual inversion techniques and let diverse cross-attention layers of the denoising network attend to textual and texture information, thus incorporating different granularity conditioning details. Given the lack of datasets for the task, we extend two existing fashion datasets, Dress Code and VITON-HD, with multimodal annotations. Experimental evaluations demonstrate the effectiveness of our proposed approach in terms of realism and coherence concerning the provided multimodal inputs.
Published: 2024

21. Unveiling the Truth: Exploring Human Gaze Patterns in Fake Images

Author: Cartella, Giuseppe, Cuculo, Vittorio, Cornia, Marcella, and Cucchiara, Rita
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: Creating high-quality and realistic images is now possible thanks to the impressive advancements in image generation. A description in natural language of your desired output is all you need to obtain breathtaking results. However, as the use of generative models grows, so do concerns about the propagation of malicious content and misinformation. Consequently, the research community is actively working on the development of novel fake detection techniques, primarily focusing on low-level features and possible fingerprints left by generative models during the image generation process. In a different vein, in our work, we leverage human semantic knowledge to investigate the possibility of being included in frameworks of fake image detection. To achieve this, we collect a novel dataset of partially manipulated images using diffusion models and conduct an eye-tracking experiment to record the eye movements of different observers while viewing real and fake stimuli. A preliminary statistical analysis is conducted to explore the distinctive patterns in how humans perceive genuine and altered images. Statistical findings reveal that, when perceiving counterfeit samples, humans tend to focus on more confined regions of the image, in contrast to the more dispersed observational pattern observed when viewing genuine images. Our dataset is publicly available at: https://github.com/aimagelab/unveiling-the-truth.
Comment: Accepted to IEEE Signal Processing Letters 2024
Published: 2024

22. Trends, Applications, and Challenges in Human Attention Modelling

Author: Cartella, Giuseppe, Cornia, Marcella, Cuculo, Vittorio, D'Amelio, Alessandro, Zanca, Dario, Boccignone, Giuseppe, and Cucchiara, Rita
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: Human attention modelling has proven, in recent years, to be particularly useful not only for understanding the cognitive processes underlying visual exploration, but also for providing support to artificial intelligence models that aim to solve problems in various domains, including image and video processing, vision-and-language applications, and language modelling. This survey offers a reasoned overview of recent efforts to integrate human attention mechanisms into contemporary deep learning models and discusses future research directions and challenges. For a comprehensive overview on the ongoing research refer to our dedicated repository available at https://github.com/aimagelab/awesome-human-visual-attention.
Comment: Accepted at IJCAI 2024 Survey Track
Published: 2024

23. The Revolution of Multimodal Large Language Models: A Survey

Author: Caffagni, Davide, Cocchi, Federico, Barsellotti, Luca, Moratelli, Nicholas, Sarto, Sara, Baraldi, Lorenzo, Cornia, Marcella, and Cucchiara, Rita
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Multimedia
Abstract: Connecting text and visual modalities plays an essential role in generative intelligence. For this reason, inspired by the success of large language models, significant research efforts are being devoted to the development of Multimodal Large Language Models (MLLMs). These models can seamlessly integrate visual and textual modalities, while providing a dialogue-based interface and instruction-following capabilities. In this paper, we provide a comprehensive review of recent visual-based MLLMs, analyzing their architectural choices, multimodal alignment strategies, and training techniques. We also conduct a detailed analysis of these models across a wide range of tasks, including visual grounding, image generation and editing, visual understanding, and domain-specific applications. Additionally, we compile and describe training datasets and evaluation benchmarks, conducting comparisons among existing models in terms of performance and computational requirements. Overall, this survey offers a comprehensive overview of the current state of the art, laying the groundwork for future MLLMs.
Comment: ACL 2024 (Findings)
Published: 2024

24. Safe-CLIP: Removing NSFW Concepts from Vision-and-Language Models

Author: Poppi, Samuele, Poppi, Tobia, Cocchi, Federico, Cornia, Marcella, Baraldi, Lorenzo, and Cucchiara, Rita
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Multimedia
Abstract: Large-scale vision-and-language models, such as CLIP, are typically trained on web-scale data, which can introduce inappropriate content and lead to the development of unsafe and biased behavior. This, in turn, hampers their applicability in sensitive and trustworthy contexts and could raise significant concerns in their adoption. Our research introduces a novel approach to enhancing the safety of vision-and-language models by diminishing their sensitivity to NSFW (not safe for work) inputs. In particular, our methodology seeks to sever "toxic" linguistic and visual concepts, unlearning the linkage between unsafe linguistic or visual items and unsafe regions of the embedding space. We show how this can be done by fine-tuning a CLIP model on synthetic data obtained from a large language model trained to convert between safe and unsafe sentences, and a text-to-image generator. We conduct extensive experiments on the resulting embedding space for cross-modal retrieval, text-to-image, and image-to-text generation, where we show that our model can be remarkably employed with pre-trained generative models. Our source code and trained models are available at: https://github.com/aimagelab/safe-clip., Comment: ECCV 2024
Published: 2023

25. OpenFashionCLIP: Vision-and-Language Contrastive Learning with Open-Source Fashion Data

Author: Cartella, Giuseppe, Baldrati, Alberto, Morelli, Davide, Cornia, Marcella, Bertini, Marco, and Cucchiara, Rita
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: The inexorable growth of online shopping and e-commerce demands scalable and robust machine learning-based solutions to accommodate customer requirements. In the context of automatic tagging classification and multimodal retrieval, prior works either defined a low generalizable supervised learning approach or more reusable CLIP-based techniques while, however, training on closed source data. In this work, we propose OpenFashionCLIP, a vision-and-language contrastive learning method that only adopts open-source fashion data stemming from diverse domains, and characterized by varying degrees of specificity. Our approach is extensively validated across several tasks and benchmarks, and experimental results highlight a significant out-of-domain generalization capability and consistent improvements over state-of-the-art methods both in terms of accuracy and recall. Source code and trained models are publicly available at: https://github.com/aimagelab/open-fashion-clip., Comment: International Conference on Image Analysis and Processing (ICIAP) 2023
Published: 2023

26. With a Little Help from your own Past: Prototypical Memory Networks for Image Captioning

Author: Barraco, Manuele, Sarto, Sara, Cornia, Marcella, Baraldi, Lorenzo, and Cucchiara, Rita
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Multimedia
Abstract: Image captioning, like many tasks involving vision and language, currently relies on Transformer-based architectures for extracting the semantics in an image and translating it into linguistically coherent descriptions. Although successful, the attention operator only considers a weighted summation of projections of the current input sample, therefore ignoring the relevant semantic information which can come from the joint observation of other samples. In this paper, we devise a network which can perform attention over activations obtained while processing other training samples, through a prototypical memory model. Our memory models the distribution of past keys and values through the definition of prototype vectors which are both discriminative and compact. Experimentally, we assess the performance of the proposed model on the COCO dataset, in comparison with carefully designed baselines and state-of-the-art approaches, and by investigating the role of each of the proposed components. We demonstrate that our proposal can increase the performance of an encoder-decoder Transformer by 3.7 CIDEr points both when training in cross-entropy only and when fine-tuning with self-critical sequence training. Source code and trained models are available at: https://github.com/aimagelab/PMA-Net., Comment: ICCV 2023
Published: 2023

27. Generating More Pertinent Captions by Leveraging Semantics and Style on Multi-Source Datasets

Author: Cornia, Marcella, Baraldi, Lorenzo, Fiameni, Giuseppe, and Cucchiara, Rita
Published: 2024
Full Text: View/download PDF

28. Learning to Mask and Permute Visual Tokens for Vision Transformer Pre-Training

Author: Baraldi, Lorenzo, Amoroso, Roberto, Cornia, Marcella, Pilzer, Andrea, and Cucchiara, Rita
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Multimedia
Abstract: The use of self-supervised pre-training has emerged as a promising approach to enhance the performance of many different visual tasks. In this context, recent approaches have employed the Masked Image Modeling paradigm, which pre-trains a backbone by reconstructing visual tokens associated with randomly masked image patches. This masking approach, however, introduces noise into the input data during pre-training, leading to discrepancies that can impair performance during the fine-tuning phase. Furthermore, input masking neglects the dependencies between corrupted patches, increasing the inconsistencies observed in downstream fine-tuning tasks. To overcome these issues, we propose a new self-supervised pre-training approach, named Masked and Permuted Vision Transformer (MaPeT), that employs autoregressive and permuted predictions to capture intra-patch dependencies. In addition, MaPeT employs auxiliary positional information to reduce the disparity between the pre-training and fine-tuning phases. In our experiments, we employ a fair setting to ensure reliable and meaningful comparisons and conduct investigations on multiple visual tokenizers, including our proposed k-CLIP which directly employs discretized CLIP features. Our results demonstrate that MaPeT achieves competitive performance on ImageNet, compared to baselines and competitors under the same model setting. We release an implementation of our code and models at https://github.com/aimagelab/MaPeT., Comment: Computer Vision and Image Understanding (2025)
Published: 2023

29. LaDI-VTON: Latent Diffusion Textual-Inversion Enhanced Virtual Try-On

Author: Morelli, Davide, Baldrati, Alberto, Cartella, Giuseppe, Cornia, Marcella, Bertini, Marco, and Cucchiara, Rita
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Multimedia
Abstract: The rapidly evolving fields of e-commerce and metaverse continue to seek innovative approaches to enhance the consumer experience. At the same time, recent advancements in the development of diffusion models have enabled generative networks to create remarkably realistic images. In this context, image-based virtual try-on, which consists in generating a novel image of a target model wearing a given in-shop garment, has yet to capitalize on the potential of these powerful generative solutions. This work introduces LaDI-VTON, the first Latent Diffusion textual Inversion-enhanced model for the Virtual Try-ON task. The proposed architecture relies on a latent diffusion model extended with a novel additional autoencoder module that exploits learnable skip connections to enhance the generation process preserving the model's characteristics. To effectively maintain the texture and details of the in-shop garment, we propose a textual inversion component that can map the visual features of the garment to the CLIP token embedding space and thus generate a set of pseudo-word token embeddings capable of conditioning the generation process. Experimental results on Dress Code and VITON-HD datasets demonstrate that our approach outperforms the competitors by a consistent margin, achieving a significant milestone for the task. Source code and trained models are publicly available at: https://github.com/miccunifi/ladi-vton., Comment: ACM Multimedia 2023
Published: 2023

30. Multimodal Garment Designer: Human-Centric Latent Diffusion Models for Fashion Image Editing

Author: Baldrati, Alberto, Morelli, Davide, Cartella, Giuseppe, Cornia, Marcella, Bertini, Marco, and Cucchiara, Rita
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Multimedia
Abstract: Fashion illustration is used by designers to communicate their vision and to bring the design idea from conceptualization to realization, showing how clothes interact with the human body. In this context, computer vision can thus be used to improve the fashion design process. Differently from previous works that mainly focused on the virtual try-on of garments, we propose the task of multimodal-conditioned fashion image editing, guiding the generation of human-centric fashion images by following multimodal prompts, such as text, human body poses, and garment sketches. We tackle this problem by proposing a new architecture based on latent diffusion models, an approach that has not been used before in the fashion domain. Given the lack of existing datasets suitable for the task, we also extend two existing fashion datasets, namely Dress Code and VITON-HD, with multimodal annotations collected in a semi-automatic manner. Experimental results on these new datasets demonstrate the effectiveness of our proposal, both in terms of realism and coherence with the given multimodal inputs. Source code and collected multimodal annotations are publicly available at: https://github.com/aimagelab/multimodal-garment-designer., Comment: ICCV 2023
Published: 2023

31. Multi-Class Unlearning for Image Classification via Weight Filtering

Author: Poppi, Samuele, Sarto, Sara, Cornia, Marcella, Baraldi, Lorenzo, and Cucchiara, Rita
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: Machine Unlearning is an emerging paradigm for selectively removing the impact of training datapoints from a network. Unlike existing methods that target a limited subset or a single class, our framework unlearns all classes in a single round. We achieve this by modulating the network's components using memory matrices, enabling the network to demonstrate selective unlearning behavior for any class after training. By discovering weights that are specific to each class, our approach also recovers a representation of the classes which is explainable by design. We test the proposed framework on small- and medium-scale image classification datasets, with both convolution- and Transformer-based backbones, showcasing the potential for explainable solutions through unlearning., Comment: IEEE Intelligent Systems (2024)
Published: 2023

32. Parents and Children: Distinguishing Multimodal DeepFakes from Natural Images

Author: Amoroso, Roberto, Morelli, Davide, Cornia, Marcella, Baraldi, Lorenzo, Del Bimbo, Alberto, and Cucchiara, Rita
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Multimedia
Abstract: Recent advancements in diffusion models have enabled the generation of realistic deepfakes from textual prompts in natural language. While these models have numerous benefits across various sectors, they have also raised concerns about the potential misuse of fake images and cast new pressures on fake image detection. In this work, we pioneer a systematic study on deepfake detection generated by state-of-the-art diffusion models. Firstly, we conduct a comprehensive analysis of the performance of contrastive and classification-based visual features, respectively extracted from CLIP-based models and ResNet or ViT-based architectures trained on image classification datasets. Our results demonstrate that fake images share common low-level cues, which render them easily recognizable. Further, we devise a multimodal setting wherein fake images are synthesized by different textual captions, which are used as seeds for a generator. Under this setting, we quantify the performance of fake detection strategies and introduce a contrastive-based disentangling method that lets us analyze the role of the semantics of textual descriptions and low-level perceptual cues. Finally, we release a new dataset, called COCOFake, containing about 1.2M images generated from the original COCO image-caption pairs using two recent text-to-image diffusion models, namely Stable Diffusion v1.4 and v2.0., Comment: ACM Transactions on Multimedia Computing, Communications and Applications (2024)
Published: 2023

33. Positive-Augmented Contrastive Learning for Image and Video Captioning Evaluation

Author: Sarto, Sara, Barraco, Manuele, Cornia, Marcella, Baraldi, Lorenzo, and Cucchiara, Rita
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Multimedia
Abstract: The CLIP model has been recently proven to be very effective for a variety of cross-modal tasks, including the evaluation of captions generated from vision-and-language architectures. In this paper, we propose a new recipe for a contrastive-based evaluation metric for image captioning, namely Positive-Augmented Contrastive learning Score (PAC-S), that in a novel way unifies the learning of a contrastive visual-semantic space with the addition of generated images and text on curated data. Experiments spanning several datasets demonstrate that our new metric achieves the highest correlation with human judgments on both images and videos, outperforming existing reference-based metrics like CIDEr and SPICE and reference-free metrics like CLIP-Score. Finally, we test the system-level correlation of the proposed metric when considering popular image captioning approaches, and assess the impact of employing different cross-modal features. Our source code and trained models are publicly available at: https://github.com/aimagelab/pacscore., Comment: CVPR 2023 (highlight paper)
Published: 2023

34. Embodied Agents for Efficient Exploration and Smart Scene Description

Author: Bigazzi, Roberto, Cornia, Marcella, Cascianelli, Silvia, Baraldi, Lorenzo, and Cucchiara, Rita
Subjects: Computer Science - Robotics, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Computer Vision and Pattern Recognition
Abstract: The development of embodied agents that can communicate with humans in natural language has gained increasing interest over the last years, as it facilitates the diffusion of robotic platforms in human-populated environments. As a step towards this objective, in this work, we tackle a setting for visual navigation in which an autonomous agent needs to explore and map an unseen indoor environment while portraying interesting scenes with natural language descriptions. To this end, we propose and evaluate an approach that combines recent advances in visual robotic exploration and image captioning on images generated through agent-environment interaction. Our approach can generate smart scene descriptions that maximize semantic knowledge of the environment and avoid repetitions. Further, such descriptions offer user-understandable insights into the robot's representation of the environment by highlighting the prominent objects and the correlation between them as encountered during the exploration. To quantitatively assess the performance of the proposed approach, we also devise a specific score that takes into account both exploration and description skills. The experiments carried out on both photorealistic simulated environments and real-world ones demonstrate that our approach can effectively describe the robot's point of view during exploration, improving the human-friendly interpretability of its observations., Comment: Accepted by IEEE International Conference on Robotics and Automation (ICRA 2023)
Published: 2023

35. Boosting Modern and Historical Handwritten Text Recognition with Deformable Convolutions

Author: Cascianelli, Silvia, Cornia, Marcella, Baraldi, Lorenzo, and Cucchiara, Rita
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Handwritten Text Recognition (HTR) in free-layout pages is a challenging image understanding task that can provide a relevant boost to the digitization of handwritten documents and reuse of their content. The task becomes even more challenging when dealing with historical documents due to the variability of the writing style and degradation of the page quality. State-of-the-art HTR approaches typically couple recurrent structures for sequence modeling with Convolutional Neural Networks for visual feature extraction. Since convolutional kernels are defined on fixed grids and focus on all input pixels independently while moving over the input image, this strategy disregards the fact that handwritten characters can vary in shape, scale, and orientation even within the same document and that the ink pixels are more relevant than the background ones. To cope with these specific HTR difficulties, we propose to adopt deformable convolutions, which can deform depending on the input at hand and better adapt to the geometric variations of the text. We design two deformable architectures and conduct extensive experiments on both modern and historical datasets. Experimental results confirm the suitability of deformable convolutions for the HTR task.
Published: 2022

36. The LAM Dataset: A Novel Benchmark for Line-Level Handwritten Text Recognition

Author: Cascianelli, Silvia, Pippi, Vittorio, Maarand, Martin, Cornia, Marcella, Baraldi, Lorenzo, Kermorvant, Christopher, and Cucchiara, Rita
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Digital Libraries
Abstract: Handwritten Text Recognition (HTR) is an open problem at the intersection of Computer Vision and Natural Language Processing. The main challenges, when dealing with historical manuscripts, are due to the preservation of the paper support, the variability of the handwriting -- even of the same author over a wide time-span -- and the scarcity of data from ancient, poorly represented languages. With the aim of fostering the research on this topic, in this paper we present the Ludovico Antonio Muratori (LAM) dataset, a large line-level HTR dataset of Italian ancient manuscripts edited by a single author over 60 years. The dataset comes in two configurations: a basic splitting and a date-based splitting which takes into account the age of the author. The first setting is intended to study HTR on ancient documents in Italian, while the second focuses on the ability of HTR systems to recognize text written by the same writer in time periods for which training data are not available. For both configurations, we analyze quantitative and qualitative characteristics, also with respect to other line-level HTR benchmarks, and present the recognition performance of state-of-the-art HTR architectures. The dataset is available for download at https://aimagelab.ing.unimore.it/go/lam., Comment: Accepted at ICPR 2022
Published: 2022

37. Wiki-LLaVA: Hierarchical Retrieval-Augmented Generation for Multimodal LLMs.

Author: Davide Caffagni, Federico Cocchi, Nicholas Moratelli, Sara Sarto, Marcella Cornia, Lorenzo Baraldi 0002, and Rita Cucchiara
Published: 2024
Full Text: View/download PDF

38. Training-Free Open-Vocabulary Segmentation with Offline Diffusion-Augmented Prototype Generation.

Author: Luca Barsellotti, Roberto Amoroso, Marcella Cornia, Lorenzo Baraldi 0001, and Rita Cucchiara
Published: 2024
Full Text: View/download PDF

39. Trends, Applications, and Challenges in Human Attention Modelling.

Author: Giuseppe Cartella, Marcella Cornia, Vittorio Cuculo, Alessandro D'Amelio, Dario Zanca, Giuseppe Boccignone, and Rita Cucchiara
Published: 2024

40. The Revolution of Multimodal Large Language Models: A Survey.

Author: Davide Caffagni, Federico Cocchi, Luca Barsellotti, Nicholas Moratelli, Sara Sarto, Lorenzo Baraldi 0001, Marcella Cornia, and Rita Cucchiara
Published: 2024
Full Text: View/download PDF

41. Understanding Fatigue Through Biosignals: A Comprehensive Dataset.

Author: Marta Gabbi, Luca Cornia, Valeria Villani, and Lorenzo Sabattini
Published: 2024
Full Text: View/download PDF

42. ALADIN: Distilling Fine-grained Alignment Scores for Efficient Image-Text Matching and Retrieval

Author: Messina, Nicola, Stefanini, Matteo, Cornia, Marcella, Baraldi, Lorenzo, Falchi, Fabrizio, Amato, Giuseppe, and Cucchiara, Rita
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Multimedia
Abstract: Image-text matching is gaining a leading role among tasks involving the joint understanding of vision and language. In literature, this task is often used as a pre-training objective to forge architectures able to jointly deal with images and texts. Nonetheless, it has a direct downstream application: cross-modal retrieval, which consists in finding images related to a given query text or vice-versa. Solving this task is of critical importance in cross-modal search engines. Many recent methods proposed effective solutions to the image-text matching problem, mostly using recent large vision-language (VL) Transformer networks. However, these models are often computationally expensive, especially at inference time. This prevents their adoption in large-scale cross-modal retrieval scenarios, where results should be provided to the user almost instantaneously. In this paper, we propose to fill in the gap between effectiveness and efficiency by proposing an ALign And DIstill Network (ALADIN). ALADIN first produces high-effective scores by aligning at fine-grained level images and texts. Then, it learns a shared embedding space - where an efficient kNN search can be performed - by distilling the relevance scores obtained from the fine-grained alignments. We obtained remarkable results on MS-COCO, showing that our method can compete with state-of-the-art VL Transformers while being almost 90 times faster. The code for reproducing our results is available at https://github.com/mesnico/ALADIN., Comment: CBMI 2022
Published: 2022

43. Retrieval-Augmented Transformer for Image Captioning

Author: Sarto, Sara, Cornia, Marcella, Baraldi, Lorenzo, and Cucchiara, Rita
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Multimedia
Abstract: Image captioning models aim at connecting Vision and Language by providing natural language descriptions of input images. In the past few years, the task has been tackled by learning parametric models and proposing visual feature extraction advancements or by modeling better multi-modal connections. In this paper, we investigate the development of an image captioning approach with a kNN memory, with which knowledge can be retrieved from an external corpus to aid the generation process. Our architecture combines a knowledge retriever based on visual similarities, a differentiable encoder, and a kNN-augmented attention layer to predict tokens based on the past context and on text retrieved from the external memory. Experimental results, conducted on the COCO dataset, demonstrate that employing an explicit external memory can aid the generation process and increase caption quality. Our work opens up new avenues for improving image captioning models at larger scale., Comment: CBMI 2022
Published: 2022

44. Learning to mask and permute visual tokens for Vision Transformer pre-training

Author: Baraldi, Lorenzo, Amoroso, Roberto, Cornia, Marcella, Pilzer, Andrea, and Cucchiara, Rita
Published: 2025
Full Text: View/download PDF

45. Embodied Navigation at the Art Gallery

Author: Bigazzi, Roberto, Landi, Federico, Cascianelli, Silvia, Cornia, Marcella, Baraldi, Lorenzo, and Cucchiara, Rita
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Robotics
Abstract: Embodied agents, trained to explore and navigate indoor photorealistic environments, have achieved impressive results on standard datasets and benchmarks. So far, experiments and evaluations have involved domestic and working scenes like offices, flats, and houses. In this paper, we build and release a new 3D space with unique characteristics: the one of a complete art museum. We name this environment ArtGallery3D (AG3D). Compared with existing 3D scenes, the collected space is ampler, richer in visual features, and provides very sparse occupancy information. This feature is challenging for occupancy-based agents which are usually trained in crowded domestic environments with plenty of occupancy information. Additionally, we annotate the coordinates of the main points of interest inside the museum, such as paintings, statues, and other items. Thanks to this manual process, we deliver a new benchmark for PointGoal navigation inside this new space. Trajectories in this dataset are far more complex and lengthy than existing ground-truth paths for navigation in Gibson and Matterport3D. We carry on extensive experimental evaluation using our new space for evaluation and prove that existing methods hardly adapt to this scenario. As such, we believe that the availability of this 3D model will foster future research and help improve existing solutions., Comment: Accepted by 21st International Conference on Image Analysis and Processing (ICIAP 2021)
Published: 2022
Full Text: View/download PDF

46. Dress Code: High-Resolution Multi-Category Virtual Try-On

Author: Morelli, Davide, Fincato, Matteo, Cornia, Marcella, Landi, Federico, Cesari, Fabio, and Cucchiara, Rita
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Graphics, Computer Science - Multimedia
Abstract: Image-based virtual try-on strives to transfer the appearance of a clothing item onto the image of a target person. Prior work focuses mainly on upper-body clothes (e.g. t-shirts, shirts, and tops) and neglects full-body or lower-body items. This shortcoming arises from a main factor: current publicly available datasets for image-based virtual try-on do not account for this variety, thus limiting progress in the field. To address this deficiency, we introduce Dress Code, which contains images of multi-category clothes. Dress Code is more than 3x larger than publicly available datasets for image-based virtual try-on and features high-resolution paired images (1024x768) with front-view, full-body reference models. To generate HD try-on images with high visual quality and rich in details, we propose to learn fine-grained discriminating features. Specifically, we leverage a semantic-aware discriminator that makes predictions at pixel-level instead of image- or patch-level. Extensive experimental evaluation demonstrates that the proposed approach surpasses the baselines and state-of-the-art competitors in terms of visual quality and quantitative results. The Dress Code dataset is publicly available at https://github.com/aimagelab/dress-code., Comment: ECCV 2022 - Video Demo: https://www.youtube.com/watch?v=qr6TW3uTHG4
Published: 2022

47. Spot the Difference: A Novel Task for Embodied Agents in Changing Environments

Author: Landi, Federico, Bigazzi, Roberto, Cornia, Marcella, Cascianelli, Silvia, Baraldi, Lorenzo, and Cucchiara, Rita
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Robotics
Abstract: Embodied AI is a recent research area that aims at creating intelligent agents that can move and operate inside an environment. Existing approaches in this field demand the agents to act in completely new and unexplored scenes. However, this setting is far from realistic use cases that instead require executing multiple tasks in the same environment. Even if the environment changes over time, the agent could still count on its global knowledge about the scene while trying to adapt its internal representation to the current state of the environment. To make a step towards this setting, we propose Spot the Difference: a novel task for Embodied AI where the agent has access to an outdated map of the environment and needs to recover the correct layout in a fixed time budget. To this end, we collect a new dataset of occupancy maps starting from existing datasets of 3D spaces and generating a number of possible layouts for a single environment. This dataset can be employed in the popular Habitat simulator and is fully compliant with existing methods that employ reconstructed occupancy maps during navigation. Furthermore, we propose an exploration policy that can take advantage of previous knowledge of the environment and identify changes in the scene faster and more effectively than existing agents. Experimental results show that the proposed architecture outperforms existing state-of-the-art models for exploration on this new setting., Comment: Accepted by 26th International Conference on Pattern Recognition (ICPR 2022)
Published: 2022
Full Text: View/download PDF

48. CaMEL: Mean Teacher Learning for Image Captioning

Author: Barraco, Manuele, Stefanini, Matteo, Cornia, Marcella, Cascianelli, Silvia, Baraldi, Lorenzo, and Cucchiara, Rita
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Multimedia
Abstract: Describing images in natural language is a fundamental step towards the automatic modeling of connections between the visual and textual modalities. In this paper we present CaMEL, a novel Transformer-based architecture for image captioning. Our proposed approach leverages the interaction of two interconnected language models that learn from each other during the training phase. The interplay between the two language models follows a mean teacher learning paradigm with knowledge distillation. Experimentally, we assess the effectiveness of the proposed solution on the COCO dataset and in conjunction with different visual feature extractors. When comparing with existing proposals, we demonstrate that our model provides state-of-the-art caption quality with a significantly reduced number of parameters. According to the CIDEr metric, we obtain a new state of the art on COCO when training without using external data. The source code and trained models are publicly available at: https://github.com/aimagelab/camel.
Published: 2022

49. Single-Photon Detectors for Quantum Integrated Photonics

Author: Thu Ha Dao, Francesco Amanti, Greta Andrini, Fabrizio Armani, Fabrizio Barbato, Vittorio Bellani, Vincenzo Bonaiuto, Simone Cammarata, Matteo Campostrini, Samuele Cornia, Fabio De Matteis, Valeria Demontis, Giovanni Di Giuseppe, Sviatoslav Ditalia Tchernij, Simone Donati, Andrea Fontana, Jacopo Forneris, Roberto Francini, Luca Frontini, Gian Carlo Gazzadi, Roberto Gunnella, Ali Emre Kaplan, Cosimo Lacava, Valentino Liberali, Leonardo Martini, Francesco Marzioni, Claudia Menozzi, Elena Nieto Hernández, Elena Pedreschi, Paolo Piergentili, Paolo Prosposito, Valentino Rigato, Carlo Roncolato, Francesco Rossella, Matteo Salvato, Fausto Sargeni, Jafar Shojaii, Franco Spinella, Alberto Stabile, Alessandra Toncelli, Gabriella Trucco, and Valerio Vitali
Subjects: quantum information, integrated photonics, single-photon detectors, Applied optics. Photonics, TA1501-1820
Abstract: Single-photon detectors have gained significant attention recently, driven by advancements in quantum information technology. Applications such as quantum key distribution, quantum cryptography, and quantum computation demand the ability to detect individual quanta of light and distinguish between single-photon states and multi-photon states, particularly when operating within waveguide systems. Although single-photon detector fabrication has been established for some time, integrating detectors with waveguides using new materials with suitable structural and electronic properties, especially at telecommunication wavelengths, creates more compact source-line-detector systems. This review explores the state of the art of single-photon detector research and examines the potential breakthroughs offered by novel low-dimensional materials in this field.
Published: 2024

50. Universal Captioner: Inducing Content-Style Separation in Vision-and-Language Model Training

Author: Cornia, Marcella, Baraldi, Lorenzo, Fiameni, Giuseppe, and Cucchiara, Rita
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Multimedia
Abstract: While captioning models have obtained compelling results in describing natural images, there is a growing effort to increase their capability of dealing with real-world concepts. In this paper, we address the task of generating fluent descriptions by training on a non-uniform combination of data sources, containing both human- and automatically-collected captions. To this end, we propose a model which induces a separation between content and descriptive style through the incorporation of stylistic parameters and keywords extracted from large-scale multi-modal models as pivotal data. In terms of visual features, our model avoids the need of object detectors and employs grid-like features together with a single objective of prompt language modeling. Experimentally, we consistently outperform existing methods in terms of caption quality and capability of describing out-of-domain concepts. Finally, our model obtains a new state of the art on both COCO and nocaps.
Published: 2021
