
SingleS2R: Single sample driven Sim-to-Real transfer for Multi-Source Visual-Tactile Information Understanding using multi-scale vision transformers.

Authors :
Tang, Jing
Gong, Zeyu
Tao, Bo
Yin, Zhouping
Source :
Information Fusion, Aug 2024, Vol. 108.
Publication Year :
2024

Abstract

Due to variations in light transmission and wear on the contact head, existing visual-tactile dataset building methods typically require a large amount of real-world data, making the dataset building process time-consuming and labor-intensive. Sim-to-Real learning has been proposed to realize Multi-Source Visual-Tactile Information Understanding (MSVTIU) across simulated and real environments, which can efficiently promote visual-tactile dataset building through simulation for emerging robotic applications. However, existing Sim-to-Real learning still requires more than 10,000 real samples, and the data must be re-collected whenever the sensor version changes. To address this challenge, we propose a powerful Sim-to-Real transfer method for MSVTIU that requires only a single real-world tactile sample. To effectively extract features from this single real tactile sample, a multi-scale vision transformer-based Generative Adversarial Network (GAN) is proposed to handle the MSVTIU task under extremely limited data. We introduce a novel scale-dependent self-attention mechanism that allows attention layers to adapt their behavior at different stages of the generation process. In addition, we introduce a residual block for capturing contextual information between adjacent scales, which uses shortcut connections to fully preserve texture and structure information. We further enhance the model's understanding of visual-tactile information using an elastic transform and an adaptive adversarial training strategy, both designed specifically for MSVTIU. Experiments on two public datasets with diverse objects show that our Sim-to-Real transfer approach, using only a single real-world visual-tactile sample, outperforms state-of-the-art methods that require tens of thousands of samples.
• Single-sample-driven, new state-of-the-art method.
• Our method outperforms others requiring 10,000+ samples.
• A scale-dependent self-attention mechanism to extract features efficiently.
• An adaptive residual block to better capture contextual features.
[ABSTRACT FROM AUTHOR]
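Because the abstract describes the architecture only at a high level, the PyTorch sketch below is purely illustrative: the module names, head-count schedule, channel sizes, and the way coarse-scale features are fused are assumptions rather than the authors' implementation. It sketches the two ideas the abstract highlights: an attention layer whose behavior depends on the generation scale, and a residual block whose shortcut connection carries contextual information between adjacent scales.

```python
# Illustrative sketch only; not the SingleS2R implementation.
import torch
import torch.nn as nn


class ScaleDependentSelfAttention(nn.Module):
    """Self-attention whose head count grows with the generation scale
    (assumed scheme: coarse scales model global structure, fine scales texture).
    `channels` must be divisible by the resulting head count."""

    def __init__(self, channels: int, scale_idx: int):
        super().__init__()
        num_heads = min(2 ** scale_idx, 8)  # assumed head schedule
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) generator feature map at one scale.
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)      # (B, H*W, C)
        attended, _ = self.attn(tokens, tokens, tokens)
        tokens = self.norm(tokens + attended)       # residual connection + norm
        return tokens.transpose(1, 2).reshape(b, c, h, w)


class CrossScaleResidualBlock(nn.Module):
    """Residual block whose shortcut fuses the upsampled coarser-scale features
    with the current scale, preserving texture and structure information."""

    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, x: torch.Tensor, coarse: torch.Tensor) -> torch.Tensor:
        # Bring the previous (coarser) scale to the current resolution,
        # then add it through the shortcut path before the convolutional body.
        coarse_up = nn.functional.interpolate(
            coarse, size=x.shape[-2:], mode="bilinear", align_corners=False)
        return x + self.body(x + coarse_up)
```

For the augmentation side, an off-the-shelf elastic deformation such as `torchvision.transforms.ElasticTransform` could play the role of the elastic transform mentioned in the abstract, though the paper's exact parameterization is not given in this record.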

Details

Language :
English
ISSN :
15662535
Volume :
108
Database :
Academic Search Index
Journal :
Information Fusion
Publication Type :
Academic Journal
Accession number :
176811350
Full Text :
https://doi.org/10.1016/j.inffus.2024.102390