1,432 results for "CLIP"
Search Results
2. CLIP-AGIQA: Boosting the Performance of AI-Generated Image Quality Assessment with CLIP
- Author
-
Tang, Zhenchen, Wang, Zichuan, Peng, Bo, and Dong, Jing
- Published
- 2025
- Full Text
- View/download PDF
3. Prompting Language-Informed Distribution for Compositional Zero-Shot Learning
- Author
-
Bao, Wentao, Chen, Lichang, Huang, Heng, and Kong, Yu
- Published
- 2025
- Full Text
- View/download PDF
4. Contour-Guided Context Learning for Scene Text Recognition
- Author
-
Hsieh, Wei-Chun, Hsu, Gee-Sern, Chen, Jun-Yi, Yap, Moi Hoon, and Chao, Zi-Chun
- Published
- 2025
- Full Text
- View/download PDF
5. Boosting Fine-Grained Oriented Object Detection via Text Features
- Author
-
Zhou, Beichen, Bi, Qi, Ding, Jian, and Xia, Gui-Song
- Published
- 2025
- Full Text
- View/download PDF
6. DATR: Domain Agnostic Text Recognizer
- Author
-
Purkayastha, Kunal, Sarkar, Shashwat, Palaiahnakote, Shivakumara, Pal, Umapada, and Ghosal, Palash
- Published
- 2025
- Full Text
- View/download PDF
7. Teach CLIP to Develop a Number Sense for Ordinal Regression
- Author
-
Du, Yao, Zhai, Qiang, Dai, Weihang, and Li, Xiaomeng
- Published
- 2025
- Full Text
- View/download PDF
8. Deciphering the Role of Representation Disentanglement: Investigating Compositional Generalization in CLIP Models
- Author
-
Abbasi, Reza, Rohban, Mohammad Hossein, and Baghshah, Mahdieh Soleymani
- Published
- 2025
- Full Text
- View/download PDF
9. In Defense of Lazy Visual Grounding for Open-Vocabulary Semantic Segmentation
- Author
-
Kang, Dahyun and Cho, Minsu
- Published
- 2025
- Full Text
- View/download PDF
10. R²-Tuning: Efficient Image-to-Video Transfer Learning for Video Temporal Grounding
- Author
-
Liu, Ye, He, Jixuan, Li, Wanhua, Kim, Junsik, Wei, Donglai, Pfister, Hanspeter, and Chen, Chang Wen
- Published
- 2025
- Full Text
- View/download PDF
11. A Decoupling Video Frame Selection Method for Action Recognition
- Author
-
Zhu, Qingmeng, He, Yanan, Lan, Tianxing, Gu, Ziyin, Li, Yi, Wu, Qihuan, Yu, Zhipeng, and He, Hao
- Published
- 2025
- Full Text
- View/download PDF
12. Zero-Shot Referring Image Segmentation with Hierarchical Prompts and Frequency Domain Fusion
- Author
-
Li, Changlong, Zhuang, Jiedong, Hu, Jiaqi, and Hu, Haoji
- Published
- 2025
- Full Text
- View/download PDF
13. A Proposal for Explainable Fruit Quality Recognition Using Multimodal Models
- Author
-
Nuñez, Felipe, Peralta, Billy, Nicolis, Orietta, Caro, Luis, and Mora, Marco
- Published
- 2025
- Full Text
- View/download PDF
14. Expanding Design Horizons: Evolutionary Tool for Parametric Design Exploration with Interactive and CLIP-Based Evaluation
- Author
-
Sacadura, Ricardo, Gonçalo, Luís, Martins, Tiago, and Machado, Penousal
- Published
- 2025
- Full Text
- View/download PDF
15. Enhancing Zero-Shot Anomaly Detection: CLIP-SAM Collaboration with Cascaded Prompts
- Author
-
Hou, Yanning, Xu, Ke, Li, Junfa, Ruan, Yanran, and Qiu, Jianfeng
- Published
- 2025
- Full Text
- View/download PDF
16. Explore the Potential of CLIP for Training-Free Open Vocabulary Semantic Segmentation
- Author
-
Shao, Tong, Tian, Zhuotao, Zhao, Hang, and Su, Jingyong
- Published
- 2025
- Full Text
- View/download PDF
17. SCLIP: Rethinking Self-Attention for Dense Vision-Language Inference
- Author
-
Wang, Feng, Mei, Jieru, and Yuille, Alan
- Published
- 2025
- Full Text
- View/download PDF
18. Unleashing the Class-Incremental Learning Potential of Foundation Models by Virtual Feature Generation and Replay
- Author
-
Xun, Tianci, Zheng, Zhong, He, Yulin, Chen, Wei, and Zheng, Weiwei
- Published
- 2025
- Full Text
- View/download PDF
19. Multi-layer Tuning CLIP for Few-Shot Image Classification
- Author
-
Zhang, Ruihao, Geng, Jinsong, Liu, Cenyu, Zhang, Wei, Feng, Zunlei, Xue, Liang, and Bei, Yijun
- Published
- 2025
- Full Text
- View/download PDF
20. mCLIP: Multimodal Approach to Classify Memes
- Author
-
Shahid, M Kaab Bin, Husain, Hamid, and Javed, Hira
- Published
- 2025
- Full Text
- View/download PDF
21. UniCrossAdapter: Multimodal Adaptation of CLIP for Radiology Report Generation
- Author
-
Chen, Yaxiong, Du, Chuang, Li, Chunlei, Hu, Jingliang, Shi, Yilei, Xiong, Shengwu, Zhu, Xiao Xiang, and Mou, Lichao
- Published
- 2025
- Full Text
- View/download PDF
22. Unleashing Text-to-Image Diffusion Prior for Zero-Shot Image Captioning
- Author
-
Luo, Jianjie, Chen, Jingwen, Li, Yehao, Pan, Yingwei, Feng, Jianlin, Chao, Hongyang, and Yao, Ting
- Published
- 2025
- Full Text
- View/download PDF
23. TmfimCLIP: Text-Driven Multi-Attribute Face Image Manipulation.
- Author
-
Yaermaimaiti, Yilihamu, Wang, Ruohao, Lou, Xudong, Liu, Yajie, and Xi, Linfei
- Abstract
Text-to-image conversion has garnered significant research attention, with contemporary methods leveraging the latent space analysis of StyleGAN. However, issues with latent code decoupling, interpretability, and controllability often remain, leading to misaligned image attributes. To address these challenges, we propose a refined approach that segments StyleGAN’s latent code using the Visual Language Model (CLIP). Our method aligns the latent code segments with text embeddings via an image-text alignment module and modulates them through a text injection module. Additionally, we incorporate semantic segmentation loss and mouth loss to constrain operations that affect irrelevant attributes. Compared to previous CLIP-driven techniques, our approach significantly enhances decoupling, interpretability, and controllability. Experiments on the CelebA-HQ and FFHQ datasets validate our model’s efficacy through both qualitative and quantitative comparisons. Our model effectively handles a wide range of style variations, achieving an FID score of 21.15 for facial attributes and an ID metric of 0.88 for hair attributes. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
24. Local part attention for image stylization with text prompt.
- Author
-
Truong, Quoc-Truong, Nguyen, Vinh-Tiep, Nguyen, Lan-Phuong, Cao, Hung-Phu, and Luu, Duc-Tuan
- Subjects
LIPS, HAIR
- Abstract
Prompt-based portrait image style transfer aims at translating an input content image to a desired style described by text without a style image. In many practical situations, users may not only attend to the entire portrait image but also the local parts (e.g., eyes, lips, and hair). To address such applications, we propose a new framework that enables style transfer on specific regions described by a text description of the desired style. Specifically, we incorporate semantic segmentation to identify the intended area without requiring edit masks from the user while utilizing a pre-trained CLIP-based model for stylizing. Besides, we propose a text-to-patch matching loss by randomly dividing the stylized image into smaller patches to ensure the consistent quality of the result. To comprehensively evaluate the proposed method, we use several metrics, such as FID, SSIM, and PSNR on a dataset consisting of portraits from the CelebAMask-HQ dataset and style descriptions of other related works. Extensive experimental results demonstrate that our framework outperforms other state-of-the-art methods in terms of both stylization quality and inference time. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
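The text-to-patch matching idea in the abstract above (randomly dividing the stylized image into patches and scoring each against the style text with CLIP) can be sketched compactly. The sketch below is an illustration under assumptions, not the authors' code: it uses the stock openai/clip-vit-base-patch32 checkpoint from Hugging Face transformers, and the function name, crop size, and patch count are invented for the example.

```python
# Hedged sketch of a text-to-patch matching loss: random crops of the stylized
# image are encoded with a frozen CLIP and pulled toward the style text.
import torch
import torch.nn.functional as F
from transformers import CLIPModel, CLIPTokenizer

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
for p in model.parameters():   # keep CLIP frozen; gradients reach the image only
    p.requires_grad_(False)

CLIP_MEAN = torch.tensor([0.48145466, 0.4578275, 0.40821073]).view(1, 3, 1, 1)
CLIP_STD = torch.tensor([0.26862954, 0.26130258, 0.27577711]).view(1, 3, 1, 1)

def text_to_patch_loss(stylized, style_text, n_patches=8, patch_size=128):
    """stylized: (1, 3, H, W) tensor in [0, 1] that requires grad."""
    _, _, h, w = stylized.shape
    patches = []
    for _ in range(n_patches):
        top = torch.randint(0, h - patch_size + 1, (1,)).item()
        left = torch.randint(0, w - patch_size + 1, (1,)).item()
        patches.append(stylized[:, :, top:top + patch_size, left:left + patch_size])
    patches = torch.cat(patches, dim=0)
    patches = F.interpolate(patches, size=224, mode="bilinear", align_corners=False)
    patches = (patches - CLIP_MEAN) / CLIP_STD          # CLIP input normalization

    img_emb = model.get_image_features(pixel_values=patches)
    txt_emb = model.get_text_features(**tokenizer([style_text], return_tensors="pt"))

    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    return (1.0 - img_emb @ txt_emb.T).mean()           # averaged over patches
```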
25. Hierarchical bi-directional conceptual interaction for text-video retrieval.
- Author
-
Han, Wenpeng, Niu, Guanglin, Zhou, Mingliang, and Zhang, Xiaowei
- Abstract
The large pre-trained vision-language models (VLMs) utilized in text-video retrieval have demonstrated strong cross image-text understanding ability. Existing works leverage VLMs to extract features and design fine-grained uni-directional interaction from text to video to enhance the visual understanding ability of the model. However, the vast cross-modal gap makes it difficult to fully match video-text mutual information solely through uni-directional cross-modal interaction techniques. To this end, we propose a novel hierarchical bi-directional conceptual interaction (HBCI) method, which utilizes multi-granularity video-text decoupled features mutual attention to enhance cross-modal alignment. Firstly, we introduce the text-guided attention to extract visual representations among hierarchical concepts, and decouple the multi-granularity features from video and text to find representation subspaces with maximal relevance to each other. Furthermore, we construct an iterative bi-directional conceptual interaction (BCI) module to reason semantic associations across text and video modalities, which generates attention weights adaptively based on video-text decoupled concepts and projects them into the other modality to facilitate profound cross-modal interaction. Finally, we implement the cross-level similarity distillation to progressively propagate the knowledge-aware similarity. Extensive experiments consistently deliver exceptional performance of our proposed HBCI across MSR-VTT, DiDeMo and ActivityNet datasets. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
26. Swimtrans Net: a multimodal robotic system for swimming action recognition driven via Swin-Transformer.
- Author
-
Chen, He and Yue, Xiaoyu
- Subjects
SWIMMING techniques, MOTION analysis, FEATURE extraction, MACHINE learning, DATA extraction
- Abstract
Introduction: Currently, using machine learning methods for precise analysis and improvement of swimming techniques holds significant research value and application prospects. The existing machine learning methods have improved the accuracy of action recognition to some extent. However, they still face several challenges such as insufficient data feature extraction, limited model generalization ability, and poor real-time performance. Methods: To address these issues, this paper proposes an innovative approach called Swimtrans Net: A multimodal robotic system for swimming action recognition driven via Swin-Transformer. By leveraging the powerful visual data feature extraction capabilities of Swin-Transformer, Swimtrans Net effectively extracts swimming image information. Additionally, to meet the requirements of multimodal tasks, we integrate the CLIP model into the system. Swin-Transformer serves as the image encoder for CLIP, and through fine-tuning the CLIP model, it becomes capable of understanding and interpreting swimming action data, learning relevant features and patterns associated with swimming. Finally, we introduce transfer learning for pre-training to reduce training time and lower computational resources, thereby providing real-time feedback to swimmers. Results and discussion: Experimental results show that Swimtrans Net has achieved a 2.94% improvement over the current state-of-the-art methods in swimming motion analysis and prediction, making significant progress. This study introduces an innovative machine learning method that can help coaches and swimmers better understand and improve swimming techniques, ultimately improving swimming performance. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
27. Development of a biocompatible 3D hydrogel scaffold using continuous liquid interface production for the delivery of cell therapies to treat recurrent glioblastoma.
- Author
-
Kass, Lauren, Thang, Morrent, Zhang, Yu, DeVane, Cathleen, Logan, Julia, Tessema, Addis, Perry, Jillian, and Hingtgen, Shawn
- Subjects
NEURAL stem cells, BRAIN tumors, TUMOR growth, CELLULAR therapy, GLIOBLASTOMA multiforme
- Abstract
Glioblastoma (GBM) is the most common primary malignant brain tumor diagnosed in adults, carrying with it an extremely poor prognosis and limited options for effective treatment. Various cell therapies have emerged as promising candidates for GBM treatment but fail in the clinic due to poor tumor trafficking, poor transplantation efficiency, and high systemic toxicity. In this study, we design, characterize, and test a 3D‐printed cell delivery platform that can enhance the survival of therapeutic cells implanted in the GBM resection cavity. Using continuous liquid interface production (CLIP) to generate a biocompatible 3D hydrogel, we demonstrate that we can effectively seed neural stem cells (NSCs) onto the surface of the hydrogel, and that the cells can proliferate to high densities when cultured for 14 days in vitro. We show that NSCs seeded on CLIP scaffolds persist longer than freely injected cells in vivo, proliferating to 20% higher than their original density in 6 days after implantation. Finally, we demonstrate that therapeutic fibroblasts seeded on CLIP more effectively suppress tumor growth and extend survival in a mouse model of LN229 GBM resection compared to the scaffold or therapeutic cells alone. These promising results demonstrate the potential to leverage CLIP to design hydrogels with various features to control the delivery of different types of cell therapies. Future work will include a more thorough evaluation of the immunological response to the material and improvement of the printing resolution for biocompatible aqueous resins. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
28. DecoupleCLIP: A Novel Cross-Modality Decouple Model for Painting Captioning.
- Author
-
Zhang, Mingliang, Hou, Xia, Yan, Yujing, and Sun, Meng
- Subjects
ENCODING
- Abstract
Image captioning aims to describe the content in an image, which plays a critical role in image understanding. Existing methods tend to generate the text for more distinct natural images. These models can not be well for paintings containing more abstract meaning due to the limitation of objective parsing without related knowledge. To alleviate, we propose a novel cross-modality decouple model to generate the objective and subjective parsing separately. Concretely, we propose to encode both subjective semantic and implied knowledge contained in the paintings. The key point of our framework is decoupled CLIP-based branches (DecoupleCLIP). For the objective caption branch, we utilize the CLIP model as the global feature extractor and construct a feature fusion module for global clues. Based on the objective caption branch structure, we add a multimodal fusion module called the artistic conception branch. In this way, the objective captions can constrain artistic conception content. We conduct extensive experiments to demonstrate our DecoupleCLIP's superior ability over our new dataset. Our model achieves nearly 2% improvement over other comparison models on CIDEr. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
29. QR-CLIP: Introducing Explicit Knowledge for Location and Time Reasoning.
- Author
-
Shi, Weimin, Gao, Dehong, Xiong, Yuan, and Zhou, Zhong
- Subjects
VISUAL learning, SOURCE code, COGNITION
- Abstract
This article focuses on reasoning about the location and time behind images. Given that pre-trained vision-language models (VLMs) exhibit excellent image and text understanding capabilities, most existing methods leverage them to match visual cues with location and time-related descriptions. However, these methods cannot look beyond the actual content of an image, failing to produce satisfactory reasoning results, as such reasoning requires connecting visual details with rich external cues (e.g., relevant event contexts). To this end, we propose a novel reasoning method, QR-CLIP, that aims at enhancing the model's ability to reason about location and time through interaction with external explicit knowledge such as Wikipedia. Specifically, QR-CLIP consists of two modules: (1) The Quantity module abstracts the image into multiple distinct representations and uses them to search and gather external knowledge from different perspectives that are beneficial to model reasoning. (2) The Relevance module filters the visual features and the searched explicit knowledge and dynamically integrates them to form a comprehensive reasoning result. Extensive experiments demonstrate the effectiveness and generalizability of QR-CLIP. On the WikiTiLo dataset, QR-CLIP boosts the accuracy of location (country) and time reasoning by 7.03% and 2.22%, respectively, over previous SOTA methods. On the more challenging TARA dataset, it improves the accuracy for location and time reasoning by 3.05% and 2.45%, respectively. The source code is at https://github.com/Shi-Wm/QR-CLIP. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
30. CAM-Vtrans: real-time sports training utilizing multi-modal robot data.
- Author
-
Hong LinLin, Lee Sangheang, and Song GuanTing
- Subjects
HUMAN-robot interaction, PHYSICAL training & conditioning, RECOVERY movement, ROBOTS, ROBOTICS
- Abstract
Introduction: Assistive robots and human-robot interaction have become integral parts of sports training. However, existing methods often fail to provide real-time and accurate feedback, and they often lack integration of comprehensive multi-modal data. Methods: To address these issues, we propose a groundbreaking and innovative approach: CAM-Vtrans--Cross-Attention Multi-modal Visual Transformer. By leveraging the strengths of state-of-the-art techniques such as Visual Transformers (ViT) and models like CLIP, along with cross-attention mechanisms, CAM-Vtrans harnesses the power of visual and textual information to provide athletes with highly accurate and timely feedback. Through the utilization of multi-modal robot data, CAM-Vtrans offers valuable assistance, enabling athletes to optimize their performance while minimizing potential injury risks. This novel approach represents a significant advancement in the field, offering an innovative solution to overcome the limitations of existing methods and enhance the precision and efficiency of sports training programs. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
31. Grounded situation recognition under data scarcity.
- Author
-
Zhou, Jing, Liu, Zhiqiang, Hu, Siying, Li, Xiaoxue, Wang, Zhiguang, and Lu, Qiang
- Subjects
FEATURE extraction, TRANSFORMER models, SCARCITY, VERBS, NOUNS, LOCALIZATION (Mathematics)
- Abstract
Grounded Situation Recognition (GSR) aims to generate structured image descriptions. For a given image, GSR needs to identify the key verb, the nouns corresponding to roles, and their bounding-box groundings. However, current GSR research demands numerous meticulously labeled images, which are labor-intensive and time-consuming, making it costly to expand detection categories. Our study enhances model accuracy in detecting and localizing under data scarcity, reducing dependency on large datasets and paving the way for broader detection capabilities. In this paper, we propose the Grounded Situation Recognition under Data Scarcity (GSRDS) model, which uses the CoFormer model as the baseline and optimizes three subtasks: image feature extraction, verb classification, and bounding-box localization, to better adapt to data-scarce scenarios. Specifically, we replace ResNet50 with EfficientNetV2-M for advanced image feature extraction. Additionally, we introduce the Transformer Combined with CLIP for Verb Classification (TCCV) module, utilizing features extracted by CLIP's image encoder to enhance verb classification accuracy. Furthermore, we design the Multi-source Verb-Role Queries (Multi-VR Queries) and the Dual Parallel Decoders (DPD) modules to improve the accuracy of bounding-box localization. Through extensive comparative experiments and ablation studies, we demonstrate that our method achieves higher accuracy than mainstream approaches in data-scarce scenarios. Our code will be available at https://github.com/Zhou-maker-oss/GSRDS. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
32. Swimtrans Net: a multimodal robotic system for swimming action recognition driven via Swin-Transformer.
- Author
-
He Chen and Xiaoyu Yue
- Subjects
SWIMMING techniques, MOTION analysis, FEATURE extraction, MACHINE learning, DATA extraction
- Abstract
Introduction: Currently, using machine learning methods for precise analysis and improvement of swimming techniques holds significant research value and application prospects. The existing machine learning methods have improved the accuracy of action recognition to some extent. However, they still face several challenges such as insufficient data feature extraction, limited model generalization ability, and poor real-time performance. Methods: To address these issues, this paper proposes an innovative approach called Swimtrans Net: A multimodal robotic system for swimming action recognition driven via Swin-Transformer. By leveraging the powerful visual data feature extraction capabilities of Swin-Transformer, Swimtrans Net effectively extracts swimming image information. Additionally, to meet the requirements of multimodal tasks, we integrate the CLIP model into the system. Swin-Transformer serves as the image encoder for CLIP, and through fine-tuning the CLIP model, it becomes capable of understanding and interpreting swimming action data, learning relevant features and patterns associated with swimming. Finally, we introduce transfer learning for pre-training to reduce training time and lower computational resources, thereby providing real-time feedback to swimmers. Results and discussion: Experimental results show that Swimtrans Net has achieved a 2.94% improvement over the current state-of-the-art methods in swimming motion analysis and prediction, making significant progress. This study introduces an innovative machine learning method that can help coaches and swimmers better understand and improve swimming techniques, ultimately improving swimming performance. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
33. Endoclip-Assisted Cannulation for a Hidden Duodenal Papilla: Three Cases.
- Author
-
Jung, Il Soon, Kim, Ki Bae, Lee, Jun Su, Han, Joung-Ho, and Park, Seon Mee
- Subjects
ENDOSCOPIC retrograde cholangiopancreatography, DIVERTICULUM, CATHETERIZATION, CATHETERS
- Abstract
Selective cannulation during endoscopic retrograde cholangiopancreatography can be particularly challenging when the papilla is invisible, either due to an intradiverticular papilla (IDP) or because it is covered by a mucosal fold. Endoclip-assisted cannulation is an effective and safe technique for everting and fixing the papilla and it can be used alone or in combination with other devices. In this report, we achieved successful papillary cannulation in two cases of IDP and one case where the papilla was covered by a mucosal fold. In two cases, cannulation was accomplished by repositioning the invisible papilla using an endoclip alone, while in one case, we used an endoclip-assisted technique to push a redundant fold with a catheter. Endoclip-assisted papillary cannulation can be applied in different situations, either alone or in combination with other devices. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
34. Open-world barely-supervised learning via augmented pseudo labels.
- Author
-
Li, Zhongnian, Ding, Yanyan, Wei, Meng, and Xu, Xinzheng
- Subjects
ALGAL blooms, TRANSFORMER models, VISUAL programming languages (Computer science), DATA augmentation
- Abstract
Open-world semi-supervised learning (OWSSL) has received significant attention since it addresses the issue of unlabeled data containing classes not present in the labeled data. Unfortunately, existing OWSSL methods still rely on a large amount of labeled data from seen classes, overlooking the reality that a substantial amount of labels is difficult to obtain in real scenarios. In this paper, we explored a new setting called open-world barely-supervised learning (OWBSL), where only a single label was provided for each seen class, greatly reducing labeling costs. To tackle the OWBSL task, we proposed a novel framework that leveraged augmented pseudo-labels generated for the unlabeled data. Specifically, we first generated initial pseudo-labels for the unlabeled data using visual-language models. Subsequently, to ensure that the pseudo-labels remained reliable while being updated during model training, we enhanced them using predictions from weak data augmentation. This way, we obtained the augmented pseudo-labels. Additionally, to fully exploit the information from unlabeled data, we incorporated consistency regularization based on strong and weak augmentations into our framework. Our experimental results on multiple benchmark datasets demonstrated the effectiveness of our method. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
35. Dual-stream multi-label image classification model enhanced by feature reconstruction.
- Author
-
Hu, Liming, Chen, Mingxuan, Wang, Anjie, and Fang, Zhijun
- Abstract
Multi-label image classification (MLIC) is a highly practical and challenging task in computer vision. Compared to traditional single-label image classification, MLIC not only focuses on the dependencies between images and labels but also places significant emphasis on the spatial relationships within images and the internal dependencies of labels. In this paper, we propose the Dual-Stream Classification Network (DSCN) for multi-label image classification. In one branch, we capture more spatial information by segmenting the image. A feature reconstruction layer based on self-attention mechanism is used to recover the boundary information lost after segmentation, while the dependency between the image and label is captured by a transformer encoder. The other branch enhances the label’s semantics using multimodal features by employing templates to extend categories into prompts, thus improving the reliability of the features. The CLIP model provides multimodal association features between images and prompts. The final labels of the images are generated by a weighted fusion of the results from the two branches. We tested our model on three popular datasets: MSCOCO2014, VOC2007 and NUS-WIDE. DSCN outperformed state-of-the-art methods, demonstrating the effectiveness of our approach. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
36. Grounded situation recognition under data scarcity
- Author
-
Jing Zhou, Zhiqiang Liu, Siying Hu, Xiaoxue Li, Zhiguang Wang, and Qiang Lu
- Subjects
Grounded Situation Recognition, Data Scarcity, Transformer, CLIP, Medicine, Science
- Abstract
Abstract Grounded Situation Recognition (GSR) aims to generate structured image descriptions. For a given image, GSR needs to identify the key verb, the nouns corresponding to roles, and their bounding-box groundings. However, current GSR research demands numerous meticulously labeled images, which are labor-intensive and time-consuming, making it costly to expand detection categories. Our study enhances model accuracy in detecting and localizing under data scarcity, reducing dependency on large datasets and paving the way for broader detection capabilities. In this paper, we propose the Grounded Situation Recognition under Data Scarcity (GSRDS) model, which uses the CoFormer model as the baseline and optimizes three subtasks: image feature extraction, verb classification, and bounding-box localization, to better adapt to data-scarce scenarios. Specifically, we replace ResNet50 with EfficientNetV2-M for advanced image feature extraction. Additionally, we introduce the Transformer Combined with CLIP for Verb Classification (TCCV) module, utilizing features extracted by CLIP’s image encoder to enhance verb classification accuracy. Furthermore, we design the Multi-source Verb-Role Queries (Multi-VR Queries) and the Dual Parallel Decoders (DPD) modules to improve the accuracy of bounding-box localization. Through extensive comparative experiments and ablation studies, we demonstrate that our method achieves higher accuracy than mainstream approaches in data-scarce scenarios. Our code will be available at https://github.com/Zhou-maker-oss/GSRDS .
- Published
- 2024
- Full Text
- View/download PDF
37. Based-CLIP early fusion transformer for image caption.
- Author
-
Guo, Jinyu, Li, Yuejia, Cheng, Guanghui, and Li, Wenrui
- Abstract
Image captioning is a task in the bimodal context of computer vision and natural language processing, where the model outputs textual information captions for given input images. Traditional Transformer architectures based on image encoder and language decoder have shown promising results in the image captioning domain. However, there are still two challenges present: heavy parameters and additional data preprocessing. In this paper, we propose a lightweight based-CLIP early fusion transformer (BCEFT) to tackle this challenge. The BCEFT use CLIP as the data encoder for images and text, then add a multi-modal fusion model to generate image captions. Specifically, the multi-modal fusion model comprises a multi-modal fusion attention module, which reduces computational complexity by more than a half. At last, we utilize reinforcement learning to train our model with beam search algorithm after cross-entropy training. Our approach only requires relatively quick training to produce a high-qualified captioning model. Without the demand for additional annotations or pre-training, it can effectively generate meaningful captions for large-scale and diverse datasets. The experimental results on the MSCOCO dataset demonstrate the superiority of our model. Meanwhile, our model achieves significant efficiency gains, including a nearly 50% decrease in model parameters and an eight-fold improvement in runtime speed. [ABSTRACT FROM AUTHOR]
- Published
- 2025
- Full Text
- View/download PDF
38. RBP-Tar – a searchable database for experimental RBP binding sites [version 3; peer review: 1 approved, 1 approved with reservations]
- Author
-
Katarina Gresova, Tomas Racek, Vlastimil Martinek, David Cechak, Radka Svobodova, and Panagiotis Alexiou
- Subjects
Software Tool Article, Articles, RNA Binding Proteins, CLIP, RBP, Web-server
- Abstract
Background RNA-binding proteins (RBPs) play a critical role in regulating gene expression by binding to specific sites on RNA molecules. Identifying these binding sites is crucial for understanding the many functions of RBPs in cellular function, development and disease. Current experimental methods for identifying RBP binding sites, such as ultra-violet (UV) crosslinking and immunoprecipitation (CLIP), and especially the enhanced CLIP (eCLIP) protocol, were developed to identify authentic RBP binding sites experimentally. Methods To make this data more accessible to the scientific community, we have developed RBP-Tar ( https://ncbr.muni.cz/RBP-Tar ), a web server and database that utilises eCLIP data for 167 RBPs mapped on the human genome. The web server allows researchers to easily search and retrieve binding site information by genomic location and RBP name. Use case Researchers can produce lists of all known RBP binding sites on a gene of interest, or produce lists of binding sites for one RBP on different genomic loci. Conclusions Our future goal is to continue to populate the web server with additional experimental datasets from CLIP experiments as they become available and processed, making it an increasingly valuable resource for the scientific community.
- Published
- 2024
- Full Text
- View/download PDF
39. Efficacy of a novel traction method: outside-lesion clip-thread method for gastric endoscopic submucosal dissection of lesions of the greater curvature of the upper/middle stomach (with video).
- Author
-
Yamada, Keisaku, Tajika, Masahiro, Tanaka, Tsutomu, Ito, Nobuhito, Takagi, Akihiro, and Niwa, Yasumasa
- Subjects
STOMACH surgery, DATA analysis, HUMAN dissection, STOMACH, SCIENTIFIC observation, FISHER exact test, LOGISTIC regression analysis, TREATMENT effectiveness, RETROSPECTIVE studies, MANN Whitney U Test, VETERINARY dissection, SURGICAL complications, ORTHOPEDIC traction, ENDOSCOPIC gastrointestinal surgery, MEDICAL records, ACQUISITION of data, ONE-way analysis of variance, STATISTICS, VIDEO recording
- Abstract
Background: Gastric endoscopic submucosal dissection (ESD) for lesions located on the greater curvature of the upper and middle (U/M) third of the stomach remains challenging, even for experienced endoscopists. Accordingly, we have developed a novel traction technique, termed the outside-lesion clip-thread method (O-CTM). In this method, a clip thread is attached to the healthy mucosa outside the circumferential incision line, and traction is applied to bring the scope and lesion into proximity for ESD. Here, we assessed the efficacy of ESD using the O-CTM compared to ESD without the O-CTM. Methods: We retrospectively reviewed data from 63 consecutive patients who underwent gastric ESD for 63 lesions located on the greater curvature of the U/M third of the stomach between September 2015 and April 2024. The primary outcome was the operation time, and secondary outcomes were resection speed, en bloc resection, R0 resection and complications in the O-CTM and without O-CTM ESD groups. Results: Of the 63 included lesions, 37 were resected without the O-CTM between September 2015 and June 2022 (without O-CTM group), and 26 lesions were resected using the O-CTM between July 2022 and April 2024 (O-CTM group). The O-CTM group had significantly shorter operation times (40 min vs. 77 min, p = 0.01) than the without O-CTM group. The resection speed was also significantly faster (20.1 mm2/min vs. 11.3 mm2/min, p = 0.02). No significant differences in en bloc resection rate, R0 resection rate, and complications were observed. Conclusions: Gastric ESD using O-CTM is beneficial when compared with the ESD without O-CTM in reducing operation time and improving resection speeds for treating lesions located on the greater curvature of the U/M region. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
40. WildCLIP: Scene and Animal Attribute Retrieval from Camera Trap Data with Domain-Adapted Vision-Language Models.
- Author
-
Gabeff, Valentin, Rußwurm, Marc, Tuia, Devis, and Mathis, Alexander
- Subjects
ANIMAL behavior, ACQUISITION of data, CAMERAS, ANNOTATIONS, VOCABULARY
- Abstract
Wildlife observation with camera traps has great potential for ethology and ecology, as it gathers data non-invasively in an automated way. However, camera traps produce large amounts of uncurated data, which is time-consuming to annotate. Existing methods to label these data automatically commonly use a fixed pre-defined set of distinctive classes and require many labeled examples per class to be trained. Moreover, the attributes of interest are sometimes rare and difficult to find in large data collections. Large pretrained vision-language models, such as contrastive language image pretraining (CLIP), offer great promises to facilitate the annotation process of camera-trap data. Images can be described with greater detail, the set of classes is not fixed and can be extensible on demand and pretrained models can help to retrieve rare samples. In this work, we explore the potential of CLIP to retrieve images according to environmental and ecological attributes. We create WildCLIP by fine-tuning CLIP on wildlife camera-trap images and to further increase its flexibility, we add an adapter module to better expand to novel attributes in a few-shot manner. We quantify WildCLIP's performance and show that it can retrieve novel attributes in the Snapshot Serengeti dataset. Our findings outline new opportunities to facilitate annotation processes with complex and multi-attribute captions. The code is available at https://github.com/amathislab/wildclip. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
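The attribute retrieval step described above (ranking camera-trap images by their similarity to a text query) reduces to CLIP's image-text cosine similarity. A minimal sketch with the stock openai/clip-vit-base-patch32 checkpoint follows; WildCLIP's fine-tuned weights and adapter module are not reproduced, and the image paths and query are placeholders.

```python
# Sketch: rank camera-trap images by CLIP similarity to an attribute-style query
# (stock checkpoint; WildCLIP's fine-tuned weights and adapter are not included).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image_paths = ["trap_0001.jpg", "trap_0002.jpg"]   # placeholder camera-trap frames
query = "a photo of a lion resting in tall grass"  # attribute-style caption

images = [Image.open(p).convert("RGB") for p in image_paths]
inputs = processor(text=[query], images=images, return_tensors="pt", padding=True)

with torch.no_grad():
    image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])

# Cosine similarity between the query and every image, best match first.
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
scores = (image_emb @ text_emb.T).squeeze(-1)
for idx in scores.argsort(descending=True):
    print(image_paths[int(idx)], float(scores[idx]))
```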
41. A Lightweight Enhancement Approach for Real-Time Semantic Segmentation by Distilling Rich Knowledge from Pre-Trained Vision-Language Model.
- Author
-
Lin, Chia-Yi, Chen, Jun-Cheng, Wu, Ja-Ling, Wang, Jia-Ching, Wang, Hsin-Min, Peng, Wen-Hsiao, and Yeh, Chia-Hung
- Subjects
LEARNING strategies, SPINE
- Abstract
In this work, we propose a lightweight approach to enhance real-time semantic segmentation by leveraging the pre-trained vision-language models, specifically utilizing the text encoder of Contrastive Language-Image Pretraining (CLIP) to generate rich semantic embeddings for text labels. Then, our method distills this textual knowledge into the segmentation model, integrating the image and text embeddings to align visual and textual information. Additionally, we implement learnable prompt embeddings for better class-specific semantic comprehension. We propose a two-stage training strategy for efficient learning: the segmentation backbone initially learns from fixed text embeddings and subsequently optimizes prompt embeddings to streamline the learning process. The extensive evaluations and ablation studies validate our approach's ability to effectively improve the semantic segmentation model's performance over the compared methods. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
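The first step described above, turning class labels into semantic embeddings with CLIP's text encoder so that a lightweight segmentation backbone can be aligned to them, can be sketched as follows. This assumes the stock openai/clip-vit-base-patch32 checkpoint and example class names; the paper's distillation loss and learnable prompts are not shown.

```python
# Sketch: encode class labels with CLIP's text encoder to obtain fixed semantic
# anchors for a segmentation backbone to align with (distillation loss not shown).
import torch
from transformers import CLIPModel, CLIPTokenizer

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

class_names = ["road", "sidewalk", "building", "vegetation", "sky"]  # example labels
prompts = [f"a photo of a {name}" for name in class_names]

with torch.no_grad():
    text_emb = model.get_text_features(**tokenizer(prompts, padding=True,
                                                   return_tensors="pt"))
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)  # (num_classes, 512) anchors

# A per-pixel alignment term could then pull projected visual features toward the
# anchor of their ground-truth class, e.g. 1 - cosine(pixel_feature, text_emb[label]).
```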
42. Unified View Empirical Study for Large Pretrained Model on Cross-Domain Few-Shot Learning.
- Author
-
Zhuo, Linhai, Fu, Yuqian, Chen, Jingjing, Cao, Yixin, and Jiang, Yu-Gang
- Subjects
DATA augmentation, GENERALIZATION, EMPIRICAL research
- Abstract
The challenge of cross-domain few-shot learning (CD-FSL) stems from the substantial distribution disparities between target and source domain images, necessitating a model with robust generalization capabilities. In this work, we posit that large-scale pretrained models are pivotal in addressing the CD-FSL task owing to their exceptional representational and generalization prowess. To our knowledge, no existing research comprehensively investigates the utility of large-scale pretrained models in the CD-FSL context. Addressing this gap, our study presents an exhaustive empirical assessment of the Contrastive Language–Image Pre-Training model within the CD-FSL task. We undertake a comparison spanning six dimensions: base model, transfer module, classifier, loss, data augmentation, and training schedule. Furthermore, we establish a straightforward baseline model, E-base, based on our empirical analysis, underscoring the importance of our investigation. Experimental results substantiate the efficacy of our model, yielding a mean gain of 1.2% in 5-way 5-shot evaluations on the BSCD dataset. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
43. RL-CWtrans Net: multimodal swimming coaching driven via robot vision.
- Author
-
Guanlin Wang
- Subjects
SWIMMING coaching, SWIMMING coaches, ARTIFICIAL neural networks, ROBOT vision, REINFORCEMENT learning
- Abstract
In swimming, the posture and technique of athletes are crucial for improving performance. However, traditional swimming coaches often struggle to capture and analyze athletes' movements in real-time, which limits the effectiveness of coaching. Therefore, this paper proposes RL-CWtrans Net: a robot vision-driven multimodal swimming training system that provides precise and real-time guidance and feedback to swimmers. The system utilizes the Swin-Transformer as a computer vision model to effectively extract the motion and posture features of swimmers. Additionally, with the help of the CLIP model, the system can understand natural language instructions and descriptions related to swimming. By integrating visual and textual features, the system achieves a more comprehensive and accurate information representation. Finally, by employing reinforcement learning to train an intelligent agent, the system can provide personalized guidance and feedback based on multimodal inputs. Experimental results demonstrate significant advancements in accuracy and practicality for this multimodal robot swimming coaching system. The system is capable of capturing real-time movements and providing immediate feedback, thereby enhancing the effectiveness of swimming instruction. This technology holds promise. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
44. Adapting CLIP for Action Recognition via Dual Semantic Supervision and Temporal Prompt Reparameterization.
- Author
-
Deng, Lujuan, Tan, Jieqing, and Liu, Fangmei
- Subjects
COGNITIVE styles, VIDEO excerpts, MODELS & modelmaking, CHINESE language, ENGLISH language
- Abstract
The contrastive vision–language pre-trained model CLIP, driven by large-scale open-vocabulary image–text pairs, has recently demonstrated remarkable zero-shot generalization capabilities in diverse downstream image tasks, which has made numerous models dominated by the "image pre-training followed by fine-tuning" paradigm exhibit promising results on standard video benchmarks. However, as models scale up, full fine-tuning adaptive strategy for specific tasks becomes difficult in terms of training and storage. In this work, we propose a novel method that adapts CLIP to the video domain for efficient recognition without destroying the original pre-trained parameters. Specifically, we introduce temporal prompts to realize the object of reasoning about the dynamic content of videos for pre-trained models that lack temporal cues. Then, by replacing the direct learning style of prompt vectors with a lightweight reparameterization encoder, the model can be adapted to domain-specific adjustment to learn more generalizable representations. Furthermore, we predefine a Chinese label dictionary to enhance video representation by co-supervision of Chinese and English semantics. Extensive experiments on video action recognition benchmarks show that our method achieves competitive or even better performance than most existing methods with fewer trainable parameters in both general and few-shot recognition scenarios. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
45. Dose multimodal machine translation can improve translation performance?
- Author
-
Cui, ShaoDong, Duan, Kaibo, Ma, Wen, and Shinnou, Hiroyuki
- Subjects
MACHINE translating, JUDGES, TRANSLATING & interpreting
- Abstract
Multimodal machine translation (MMT) is a method that uses visual information to guide text translation. However, recent studies have engendered controversy regarding the extent to which MMT can contribute to the improvement of text-enhanced translation. To explore whether the MMT model can improve translation performance, we use the current Neural Machine Translation (NMT) system for evaluation at Multi30K dataset. Specifically, we judge the performance of the MMT model by comparing the difference between the NMT model and the MMT model. At the same time, we conduct text and multimodal degradation experiments to verify whether vision can play a role. We explored the performance of the NMT model and the MMT model for sentences of different lengths to clarify the pros and cons of the MMT model. We found that the performance of the current NMT model surpasses that of the MMT model, suggesting that the impact of visual features might be less significant. Visual features seem to exert influence primarily when a substantial number of words in the source text are masked. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
46. CLIP feature-based randomized control using images and text for multiple tasks and robots.
- Author
-
Shibata, Kazuki, Deguchi, Hideki, and Taguchi, Shun
- Subjects
LANGUAGE models, ROBOT control systems, COST control, CLASSROOM environment, ROBOTS
- Abstract
This study presents a control framework leveraging vision language models (VLMs) for multiple tasks and robots. Notably, existing control methods using VLMs have achieved high performance in various tasks and robots in the training environment. However, these methods incur high costs for learning control policies for tasks and robots other than those in the training environment. Considering the application of industrial and household robots, learning in novel environments where robots are introduced is challenging. To address this issue, we propose a control framework that does not require learning control policies. Our framework combines the vision-language CLIP model with a randomized control. CLIP computes the similarity between images and texts by embedding them in the feature space. This study employs CLIP to compute the similarity between camera images and text representing the target state. In our method, the robot is controlled by a randomized controller that simultaneously explores and increases the similarity gradients. Moreover, we fine-tune the CLIP to improve the performance of the proposed method. Consequently, we confirm the effectiveness of our approach through a multitask simulation and a real robot experiment using a two-wheeled robot and robot arm. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
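The core mechanism described above, scoring a camera image against a text description of the target state with CLIP and letting a randomized controller climb that similarity, can be sketched as follows. The checkpoint, the predict_image callable, and the perturbation scheme are assumptions for illustration, not the paper's implementation.

```python
# Hedged sketch: CLIP image-text similarity as the objective of a randomized
# controller. The robot/simulator interface (predict_image) is hypothetical.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_similarity(image, text):
    """Cosine similarity between one camera image (PIL) and a target-state text."""
    inputs = processor(text=[text], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        img = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return float(img @ txt.T)

def choose_action(current_action, predict_image, target_text,
                  n_candidates=8, noise_scale=0.1):
    """Randomized search: perturb the current action and keep the candidate whose
    predicted camera view is most similar to the target-state description."""
    candidates = [current_action + noise_scale * torch.randn_like(current_action)
                  for _ in range(n_candidates)]
    scores = [clip_similarity(predict_image(a), target_text) for a in candidates]
    best = max(range(n_candidates), key=scores.__getitem__)
    return candidates[best], scores[best]

# Example call (placeholder interface): next_cmd, score = choose_action(
#     torch.zeros(2), camera_preview_fn, "the robot arm is above the red cube")
```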
47. Partial coil embolization before surgical clipping of ruptured intracranial aneurysms.
- Author
-
Mistry, Akshitkumar M., Naidugari, Janki, Meyer, Kimberly S., Chen, Ching-Jen, Williams, Brian J., Morton, Ryan P., Abecassis, Isaac J., and Ding, Dale
- Subjects
INTRACRANIAL aneurysm ruptures, THERAPEUTIC embolization, RUPTURED aneurysms, SUBARACHNOID hemorrhage, ANEURYSMS
- Abstract
Objective: Intraoperative rupture (IOR) is the most common adverse event encountered during surgical clip obliteration of ruptured intracranial aneurysms. Besides increasing surgeon experience and early proximal control, no methods exist to decrease IOR risk. Thus, our objective was to assess if partial endovascular coil embolization to protect the aneurysm before clipping decreases IOR. Methods: We conducted a retrospective analysis of patients with ruptured intracranial aneurysms that were treated with surgical clipping at two tertiary academic centers. We compared patient characteristics and outcomes of those who underwent partial endovascular coil embolization to protect the aneurysm before clipping to those who did not. The primary outcome was IOR. Secondary outcomes were inpatient mortality and discharge destination. Results: We analyzed 100 patients. Partial endovascular aneurysm protection was performed in 27 patients. Age, sex, subarachnoid hemorrhage severity, and aneurysm location were similar between the partially-embolized and non-embolized groups. The median size of the partially-embolized aneurysms was larger (7.0 mm [interquartile range 5.95–8.7] vs. 4.6 mm [3.3–6.0]; P < 0.001). During surgical clipping, IOR occurred less frequently in the partially-embolized aneurysms than non-embolized aneurysms (2/27, 7.4%, vs. 30/73, 41%; P = 0.001). Inpatient mortality was 14.8% (4/27) in patients with partially-embolized aneurysms and 28.8% (21/73) in patients without embolization (P = 0.20). Discharge to home or inpatient rehabilitation was 74.0% in patients with partially-embolized aneurysms and 56.2% in patients without embolization (P = 0.11). A complication from partial embolization occurred in 2/27 (7.4%) patients. Conclusions: Preoperative partial endovascular coil embolization of ruptured aneurysms is associated with a reduced frequency of IOR during definitive treatment with surgical clip obliteration. These results and the impact of preoperative partial endovascular coil embolization on functional outcomes should be confirmed with a randomized trial. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
48. The Usefulness of Extradural Anterior Clinoidectomy for Low-Lying Posterior Communicating Artery Aneurysms : A Cadaveric Study.
- Author
-
Byoun, Hyoung Soo, Choi, Kyu-Sun, Na, Min Kyun, Kwon, Sae Min, and Nam, Yong Seok
- Subjects
CRANIOTOMY, ANEURYSMS, HUMAN dissection, ANTERIOR cerebral artery, INTERNAL carotid artery, SUBARACHNOID hemorrhage, ARTERIES, MEDICAL cadavers
- Abstract
Objective: To confirm the usefulness of the extradural anterior clinoidectomy during the clipping of a low riding posterior communicating artery (PCoA) aneurysm through cadaver dissection. Methods: Anatomic measurements of 12 adult cadaveric heads (24 sides total) were performed to compare the microsurgical exposure of the PCoA and internal carotid artery (ICA) before and after clinoidectomy. A standard pterional craniotomy and transsylvian approach were performed in all cadavers. The distance from the ICA bifurcation to the origin of PCoA (D1), pre-anterior clinoidectomy distance from the ICA bifurcation to tentorium (D2), post-anterior clinoidectomy distance from the ICA bifurcation to tentorium (D3), pre-anterior clinoidectomy distance from the tentorium to the origin of PCoA (D4) and post-anterior clinoidectomy distance from the tentorium to the origin of PCoA (D5) and the distance of the ICA obtained after anterior clinoidectomy (D6) were measured. We measured the precise thickness of the blade for the Yasargil clip with a digital precision ruler to confirm the usefulness of the extradural anterior clinoidectomy. Results: Twenty-four sites were dissected from 12 cadavers. The age of the cadavers was 79.83±6.25 years. The number of males was the same as the females. The space from the proximal origin of the PCoA to the preclinoid-tentorium (D4) was 1.45±1.08 mm (max, 4.01; min, 0.56). After the clinoidectomy, the space from the proximal origin of the PCoA to the postclinoid-tentorium (D5) was 3.612±1.15 mm (max, 6.14; min, 1.83). The length (D6) of the exposed proximal ICA after the extradural clinoididectomy was 2.17±1.04 mm on the lateral side and 2.16±0.89 mm on the medial side. The thickness of the Yasargil clip blade used during the clipping surgery was 1.35 mm measured with a digital precision ruler. Conclusion: The proximal length obtained by performing an external anterior clinoidectomy is about 2 mm, sufficient for proximal control during PCoA aneurysm surgery, considering the thickness of the aneurysm clips. In a subarachnoid hemorrhage, performing an extradural anterior clinoidectomy could prevent a devastating situation during PCoA aneurysm clipping. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
49. Zero-shot urban function inference with street view images through prompting a pretrained vision-language model.
- Author
-
Huang, Weiming, Wang, Jing, and Cong, Gao
- Subjects
URBAN land use
- Abstract
Inferring urban functions using street view images (SVIs) has gained tremendous momentum. The recent prosperity of large-scale vision-language pretrained models sheds light on addressing some long-standing challenges in this regard, for example, heavy reliance on labeled samples and computing resources. In this paper, we present a novel prompting framework for enabling the pretrained vision-language model CLIP to effectively infer fine-grained urban functions with SVIs in a zero-shot manner, that is, without labeled samples and model training. The prompting framework UrbanCLIP comprises an urban taxonomy and several urban function prompt templates, in order to (1) bridge the abstract urban function categories and concrete urban object types that can be readily understood by CLIP, and (2) mitigate the interference in SVIs, for example, street-side trees and vehicles. We conduct extensive experiments to verify the effectiveness of UrbanCLIP. The results indicate that the zero-shot UrbanCLIP largely surpasses several competitive supervised baselines, e.g. a fine-tuned ResNet, and its advantages become more prominent in cross-city transfer tests. In addition, UrbanCLIP's zero-shot performance is considerably better than the vanilla CLIP. Overall, UrbanCLIP is a simple yet effective framework for urban function inference, and showcases the potential of foundation models for geospatial applications. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
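The zero-shot prompting recipe described above amounts to scoring a street view image against prompt templates filled with candidate urban function categories. A minimal sketch with stock CLIP follows; the categories and templates are invented examples, not UrbanCLIP's taxonomy or prompt set, and the image path is a placeholder.

```python
# Hedged sketch: zero-shot scoring of one street view image against prompted
# urban function categories, averaging similarities over prompt templates.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

categories = ["residential buildings", "commercial shops", "industrial facilities"]
templates = ["a street view photo of {}.", "an urban area dominated by {}."]
prompts = [t.format(c) for c in categories for t in templates]

image = Image.open("street_view.jpg")  # placeholder SVI path
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image          # (1, num_prompts) similarities
scores = logits.reshape(len(categories), len(templates)).mean(dim=1)
for category, prob in zip(categories, scores.softmax(dim=0)):
    print(f"{category}: {float(prob):.3f}")
```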
50. Detecting images generated by diffusers.
- Author
-
Coccomini, Davide Alessandro, Esuli, Andrea, Falchi, Fabrizio, Gennaro, Claudio, and Amato, Giuseppe
- Subjects
MULTILAYER perceptrons, CONVOLUTIONAL neural networks, STABLE Diffusion, COMPUTER vision, ARTIFICIAL intelligence
- Abstract
In recent years, the field of artificial intelligence has witnessed a remarkable surge in the generation of synthetic images, driven by advancements in deep learning techniques. These synthetic images, often created through complex algorithms, closely mimic real photographs, blurring the lines between reality and artificiality. This proliferation of synthetic visuals presents a pressing challenge: how to accurately and reliably distinguish between genuine and generated images. This article, in particular, explores the task of detecting images generated by text-to-image diffusion models, highlighting the challenges and peculiarities of this field. To evaluate this, we consider images generated from captions in the MSCOCO and Wikimedia datasets using two state-of-the-art models: Stable Diffusion and GLIDE. Our experiments show that it is possible to detect the generated images using simple multi-layer perceptrons (MLPs), starting from features extracted by CLIP or RoBERTa, or using traditional convolutional neural networks (CNNs). These latter models achieve remarkable performances in particular when pretrained on large datasets. We also observe that models trained on images generated by Stable Diffusion can occasionally detect images generated by GLIDE, but only on the MSCOCO dataset. However, the reverse is not true. Lastly, we find that incorporating the associated textual information with the images in some cases can lead to a better generalization capability, especially if textual features are closely related to visual ones. We also discovered that the type of subject depicted in the image can significantly impact performance. This work provides insights into the feasibility of detecting generated images and has implications for security and privacy concerns in real-world applications. The code to reproduce our results is available at: https://github.com/davide-coccomini/Detecting-Images-Generated-by-Diffusers. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
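The detection recipe reported above (a small MLP on CLIP image features separating real from generated images) can be sketched as follows, assuming the stock openai/clip-vit-base-patch32 checkpoint, placeholder file paths, and an arbitrary MLP width; the paper's training setup is not reproduced.

```python
# Sketch: frozen CLIP image features + a small MLP head that classifies an image
# as real (0) or generated (1). Dataset loading and the evaluation loop are omitted.
import torch
import torch.nn as nn
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
clip.eval()

def clip_features(paths):
    images = [Image.open(p).convert("RGB") for p in paths]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = clip.get_image_features(pixel_values=inputs["pixel_values"])
    return feats / feats.norm(dim=-1, keepdim=True)

detector = nn.Sequential(              # MLP on 512-d ViT-B/32 features
    nn.Linear(512, 256), nn.ReLU(),
    nn.Linear(256, 1),                 # logit > 0 means "generated"
)
optimizer = torch.optim.Adam(detector.parameters(), lr=1e-3)

def train_step(real_paths, fake_paths):
    feats = torch.cat([clip_features(real_paths), clip_features(fake_paths)])
    labels = torch.cat([torch.zeros(len(real_paths)), torch.ones(len(fake_paths))])
    loss = nn.functional.binary_cross_entropy_with_logits(
        detector(feats).squeeze(-1), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)

# train_step(["coco_0001.jpg"], ["stable_diffusion_0001.jpg"])  # placeholder files
```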