Author: "Pan, Mianzhi" / Publication Year Range: Last 50 years - Searchworks@Jio Institute Digital Library Search Results

Your search for "Pan, Mianzhi" returned 5 results

1. Probing Commonsense Reasoning Capability of Text-to-Image Generative Models via Non-visual Description

Author: Pan, Mianzhi; Li, Jianfei; Yu, Mingyue; Ma, Zheng; Cheng, Kanzhi; Zhang, Jianbing; and Chen, Jiajun
Subjects: Computer Science - Multimedia
Abstract: Commonsense reasoning, the ability to make logical assumptions about daily scenes, is one core intelligence of human beings. In this work, we present a novel task and dataset for evaluating the ability of text-to-image generative models to conduct commonsense reasoning, which we call PAINTaboo. Given a description with few visual clues of one object, the goal is to generate images illustrating the object correctly. The dataset was carefully hand-curated and covered diverse object categories to analyze model performance comprehensively. Our investigation of several prevalent text-to-image generative models reveals that these models are not proficient in commonsense reasoning, as anticipated. We trust that PAINTaboo can improve our understanding of the reasoning abilities of text-to-image generative models.
Comment: It is an incomplete work
Published: 2023

2. Food-500 Cap: A Fine-Grained Food Caption Benchmark for Evaluating Vision-Language Models

Author: Ma, Zheng; Pan, Mianzhi; Wu, Wenhan; Cheng, Kanzhi; Zhang, Jianbing; Huang, Shujian; and Chen, Jiajun
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computation and Language
Abstract: Vision-language models (VLMs) have shown impressive performance in substantial downstream multi-modal tasks. However, only comparing the fine-tuned performance on downstream tasks leads to the poor interpretability of VLMs, which is adverse to their future improvement. Several prior works have identified this issue and used various probing methods under a zero-shot setting to detect VLMs' limitations, but they all examine VLMs using general datasets instead of specialized ones. In practical applications, VLMs are usually applied to specific scenarios, such as e-commerce and news fields, so the generalization of VLMs in specific domains should be given more attention. In this paper, we comprehensively investigate the capabilities of popular VLMs in a specific field, the food domain. To this end, we build a food caption dataset, Food-500 Cap, which contains 24,700 food images with 494 categories. Each image is accompanied by a detailed caption, including fine-grained attributes of food, such as the ingredient, shape, and color. We also provide a culinary culture taxonomy that classifies each food category based on its geographic origin in order to better analyze the performance differences of VLM in different regions. Experiments on our proposed datasets demonstrate that popular VLMs underperform in the food domain compared with their performance in the general domain. Furthermore, our research reveals severe bias in VLMs' ability to handle food items from different geographic regions. We adopt diverse probing methods and evaluate nine VLMs belonging to different architectures to verify the aforementioned observations. We hope that our study will bring researchers' attention to VLM's limitations when applying them to the domain of food or culinary cultures, and spur further investigations to address this issue.
Comment: Accepted at ACM Multimedia (ACMMM) 2023
Published: 2023

3. Probing Cross-modal Semantics Alignment Capability from the Textual Perspective

Author: Ma, Zheng; Zong, Shi; Pan, Mianzhi; Zhang, Jianbing; Huang, Shujian; Dai, Xinyu; and Chen, Jiajun
Subjects: Computer Science - Computation and Language
Abstract: In recent years, vision and language pre-training (VLP) models have advanced the state-of-the-art results in a variety of cross-modal downstream tasks. Aligning cross-modal semantics is claimed to be one of the essential capabilities of VLP models. However, it still remains unclear about the inner working mechanism of alignment in VLP models. In this paper, we propose a new probing method that is based on image captioning to first empirically study the cross-modal semantics alignment of VLP models. Our probing method is built upon the fact that given an image-caption pair, the VLP models will give a score, indicating how well two modalities are aligned; maximizing such scores will generate sentences that VLP models believe are of good alignment. Analyzing these sentences thus will reveal in what way different modalities are aligned and how well these alignments are in VLP models. We apply our probing method to five popular VLP models, including UNITER, ROSITA, ViLBERT, CLIP, and LXMERT, and provide a comprehensive analysis of the generated captions guided by these models. Our results show that VLP models (1) focus more on just aligning objects with visual words, while neglecting global semantics; (2) prefer fixed sentence patterns, thus ignoring more important textual information including fluency and grammar; and (3) deem the captions with more visual words are better aligned with images. These findings indicate that VLP models still have weaknesses in cross-modal semantics alignment and we hope this work will draw researchers' attention to such problems when designing a new VLP model.
Comment: Findings of EMNLP2022
Published: 2022

4. Food-500 Cap: A Fine-Grained Food Caption Benchmark for Evaluating Vision-Language Models

Author: Ma, Zheng (primary); Pan, Mianzhi (additional); Wu, Wenhan (additional); Cheng, Kanzhi (additional); Zhang, Jianbing (additional); Huang, Shujian (additional); and Chen, Jiajun (additional)
Published: 2023
Full Text: View/download PDF

5. Probing Cross-modal Semantics Alignment Capability from the Textual Perspective

Author: Ma, Zheng (primary); Zong, Shi (additional); Pan, Mianzhi (additional); Zhang, Jianbing (additional); Huang, Shujian (additional); Dai, Xinyu (additional); and Chen, Jiajun (additional)
Published: 2022
Full Text: View/download PDF
