Descriptor: "Vision Transformer (ViT)" / Journal: ieee transactions on geoscience & remote sensing - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Vision Transformer (ViT)"' showing total 3 results

Start Over Descriptor "Vision Transformer (ViT)" Journal ieee transactions on geoscience & remote sensing

3 results on '"Vision Transformer (ViT)"'

1. Building Extraction With Vision Transformer.

Author: Wang, Libo, Fang, Shenghui, Meng, Xiaoliang, and Li, Rui
Subjects: *BUILDING inspection, *CONVOLUTIONAL neural networks, *REMOTE sensing, *VISION, *FEATURE extraction
Abstract: As an important carrier of human productive activities, the extraction of buildings is not only essential for urban dynamic monitoring but also necessary for suburban construction inspection. Nowadays, accurate building extraction from remote sensing images remains a challenge due to the complex background and diverse appearances of buildings. The convolutional neural network (CNN)-based building extraction methods, although increased the accuracy significantly, are criticized for their inability for modeling global dependencies. Thus, this article applies the vision transformer (ViT) for building extraction. However, the actual utilization of the ViT often comes with two limitations. First, the ViT requires more GPU memory and computational costs compared with CNNs. This limitation is further magnified when encountering large-sized inputs like fine-resolution remote sensing images. Second, spatial details are not sufficiently preserved during the feature extraction of the ViT, resulting in the inability for fine-grained building segmentation. To handle these issues, we propose a novel ViT (BuildFormer), with a dual-path structure. Specifically, we design a spatial-detailed context path to encode rich spatial details and a global context path to capture global dependencies. Besides, we design a window-based linear multihead self-attention whose complexity is linear with the window size. Such a design allows the BuildFormer to apply large windows for capturing global context, which greatly improves its potential in processing large-sized remote sensing images. The proposed method yields the state-of-the-art performance (75.74% IoU) on the Massachusetts building dataset. Code will be available at https://github.com/WangLibo1995/BuildFormer. [ABSTRACT FROM AUTHOR]
Published: 2022
Full Text: View/download PDF

2. Looking Outside the Window: Wide-Context Transformer for the Semantic Segmentation of High-Resolution Remote Sensing Images.

Author: Ding, Lei, Lin, Dong, Lin, Shaofu, Zhang, Jing, Cui, Xiaojie, Wang, Yuebin, Tang, Hao, and Bruzzone, Lorenzo
Subjects: *REMOTE sensing, *IMAGE enhancement (Imaging systems), *CONVOLUTIONAL neural networks
Abstract: Long-range contextual information is crucial for the semantic segmentation of high-resolution (HR) remote sensing images (RSIs). However, image cropping operations, commonly used for training neural networks, limit the perception of long-range contexts in large RSIs. To overcome this limitation, we propose a wide-context network (WiCoNet) for the semantic segmentation of HR RSIs. Apart from extracting local features with a conventional convolutional neural network (CNN), the WiCoNet has an extra context branch to aggregate information from a larger image area. Moreover, we introduce a context transformer to embed contextual information from the context branch and selectively project it onto the local features. The context transformer extends the vision transformer, an emerging kind of neural networks, to model the dual-branch semantic correlations. It overcomes the locality limitation of CNNs and enables the WiCoNet to see the bigger picture before segmenting the land-cover/land-use (LCLU) classes. Ablation studies and comparative experiments conducted on several benchmark datasets demonstrate the effectiveness of the proposed method. In addition, we present a new Beijing Land-Use (BLU) dataset. This is a large-scale HR satellite dataset with high-quality and fine-grained reference labels, which can facilitate future studies in this field. [ABSTRACT FROM AUTHOR]
Published: 2022
Full Text: View/download PDF

3. Exploring Vision Transformers for Polarimetric SAR Image Classification.

Author: Dong, Hongwei, Zhang, Lamei, and Zou, Bin
Subjects: *SYNTHETIC aperture radar, *NATURAL language processing, *SUPERVISED learning, *CONVOLUTIONAL neural networks, *SYNTHETIC apertures, *DEEP learning, *PIXELS
Abstract: As one of the most popular topics in polarimetric synthetic aperture radar (PolSAR) community, PolSAR image classification has always been an important way for PolSAR applications. Constructing representations is the most critical part of PolSAR image classification. With the maturity of deep learning technique, many data-driven PolSAR representation methods have been proposed, most of which are based on convolutional neural networks (CNNs). Despite some achievements, the bottleneck of CNN-based methods may be related to the locality induced by their inductive biases. Considering this problem, the state-of-the-art method in natural language processing, i.e., transformer, is introduced into PolSAR image classification for the first time. Specifically, a vision transformer (ViT)-based representation learning framework is proposed in this article, which covers both supervised learning and unsupervised learning. For supervised learning, we use self-attention to replace convolution, which shifts the focus from the information in local neighborhoods to the long-range interactions between each pixel. Beyond supervised learning, we introduce an improved contrastive-based strategy to implement simple unsupervised representation learning. Compared with CNN and its variants, ViT constructs more global representations by explicitly modeling the relationship between each pixel, so as to improve the classification performance. Experimental results on four widely used PolSAR image datasets indicate that the representation obtained by the ViT-based methods is better for PolSAR image classification, whether supervised (up to about 5% accuracy improvement) or unsupervised (up to about 4%). In addition, we also prove the robustness of ViT to the initial input form. These discoveries may arouse rethinking of the dominance of CNNs in PolSAR image classification. [ABSTRACT FROM AUTHOR]
Published: 2022
Full Text: View/download PDF

Catalog

Books, media, physical & digital resources

See catalog results

Searchworks

Select search scope, currently: Articles

Catalog

books, media & more in Jio Institute collections

Articles

journal articles & other e-resources

Refine your results

3 results on '"Vision Transformer (ViT)"'

1. Building Extraction With Vision Transformer.

2. Looking Outside the Window: Wide-Context Transformer for the Semantic Segmentation of High-Resolution Remote Sensing Images.

3. Exploring Vision Transformers for Polarimetric SAR Image Classification.

Catalog

Searchworks

Select search scope, currently: Articles Catalog books, media & more in Jio Institute collections Articles journal articles & other e-resources

Search

Search Constraints

Refine your results

Search Limiters

Topic

Publication Year Range

Language

Publication Type

Database

3 results on '"Vision Transformer (ViT)"'

Search Results

Catalog

Select search scope, currently: Articles

Catalog

books, media & more in Jio Institute collections

Articles

journal articles & other e-resources