2,596 results for "Vision transformer"
Search Results
2. PYRA: Parallel Yielding Re-activation for Training-Inference Efficient Task Adaptation
- Author
Xiong, Yizhe, Chen, Hui, Hao, Tianxiang, Lin, Zijia, Han, Jungong, Zhang, Yuesong, Wang, Guoxin, Bao, Yongjun, Ding, Guiguang, Goos, Gerhard, Series Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Leonardis, Aleš, editor, Ricci, Elisa, editor, Roth, Stefan, editor, Russakovsky, Olga, editor, Sattler, Torsten, editor, and Varol, Gül, editor
- Published
- 2025
3. Industrial Gearbox Fault Diagnosis Based on Vision Transformer and Infrared Thermal Imaging
- Author
Li, Yan, Cao, Xunqi, Wang, Haoyu, Yu, Kun, Zhang, Yongchao, Ceccarelli, Marco, Series Editor, Corves, Burkhard, Advisory Editor, Glazunov, Victor, Advisory Editor, Hernández, Alfonso, Advisory Editor, Huang, Tian, Advisory Editor, Jauregui Correa, Juan Carlos, Advisory Editor, Takeda, Yukio, Advisory Editor, Agrawal, Sunil K., Advisory Editor, Wang, Zuolu, editor, Zhang, Kai, editor, Feng, Ke, editor, Xu, Yuandong, editor, and Yang, Wenxian, editor
- Published
- 2025
4. Multi-modal Knowledge-Enhanced Fine-Grained Image Classification
- Author
Cheng, Suyan, Zhang, Feifei, Zhou, Haoliang, Xu, Changsheng, Goos, Gerhard, Series Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Lin, Zhouchen, editor, Cheng, Ming-Ming, editor, He, Ran, editor, Ubul, Kurban, editor, Silamu, Wushouer, editor, Zha, Hongbin, editor, Zhou, Jie, editor, and Liu, Cheng-Lin, editor
- Published
- 2025
5. Multi-center Ovarian Tumor Classification Using Hierarchical Transformer-Based Multiple-Instance Learning
- Author
H.B. Claessens, Cris, W.R. Schultz, Eloy, Koch, Anna, Nies, Ingrid, A.E. Hellström, Terese, Nederend, Joost, Niers-Stobbe, Ilse, Bruining, Annemarie, M.J. Piek, Jurgen, H.N. De With, Peter, van der Sommen, Fons, Goos, Gerhard, Series Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Ali, Sharib, editor, van der Sommen, Fons, editor, Papież, Bartłomiej Władysław, editor, Ghatwary, Noha, editor, Jin, Yueming, editor, and Kolenbrander, Iris, editor
- Published
- 2025
6. Self-Supervised Image Aesthetic Assessment Based on Transformer.
- Author
Jia, Minrui, Wang, Guangao, Wang, Zibei, Yang, Shuai, Ke, Yongzhen, and Wang, Kai
- Abstract
Visual aesthetics has always been an important area of computational vision, and researchers have continued exploring it. To further improve the performance of the image aesthetic evaluation task, we introduce a Transformer into the image aesthetic evaluation task. This paper pioneers a novel self-supervised image aesthetic evaluation model founded upon Transformers. Meanwhile, we expand the pretext task to capture rich visual representations, adding a branch for inpainting the masked images in parallel with the tasks related to aesthetic quality degradation operations. Our model’s refinement employs the innovative uncertainty weighting method, seamlessly amalgamating three distinct losses into a unified objective. On the AVA dataset, our approach surpasses the efficacy of prevailing self-supervised image aesthetic assessment methods. Remarkably, we attain results approaching those of supervised methods, even while operating with a limited dataset. On the AADB dataset, our approach improves the aesthetic binary classification accuracy by roughly 16% compared to other self-supervised image aesthetic assessment methods and improves the prediction of aesthetic attributes. [ABSTRACT FROM AUTHOR]
- Published
- 2024
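Entry 6 above combines three self-supervised losses into one objective with an uncertainty weighting method. As a rough illustration only, not the authors' exact formulation, the commonly used homoscedastic-uncertainty scheme learns one log-variance per loss term; the class and variable names below are hypothetical.

```python
import torch
import torch.nn as nn

class UncertaintyWeightedLoss(nn.Module):
    """Combine several task losses with learned per-task log-variances.

    total = sum_i( exp(-s_i) * L_i + s_i ), so noisier tasks are
    down-weighted instead of relying on hand-tuned coefficients.
    """
    def __init__(self, num_tasks: int = 3):
        super().__init__()
        self.log_vars = nn.Parameter(torch.zeros(num_tasks))

    def forward(self, losses):
        total = 0.0
        for i, loss in enumerate(losses):
            total = total + torch.exp(-self.log_vars[i]) * loss + self.log_vars[i]
        return total

# Toy usage: an inpainting loss plus two aesthetic-degradation losses.
criterion = UncertaintyWeightedLoss(num_tasks=3)
l_inpaint, l_degrade1, l_degrade2 = torch.rand(3, requires_grad=True)
criterion([l_inpaint, l_degrade1, l_degrade2]).backward()
```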
7. Deep learning and computer vision approach - a vision transformer based classification of fruits and vegetable diseases (DLCVA-FVDC).
- Author
N. A., Deepak
- Subjects
CONVOLUTIONAL neural networks, TRANSFORMER models, IMAGE recognition (Computer vision), FOOD industry, COMPUTER vision
- Abstract
As technology progresses, automation gains importance, whether in a large-scale industry with many employees and heavy capital investment or in a small-scale industry in which manufacturing, services, and production are performed on a smaller or micro scale. The food processing industry is similar, where fruits and vegetables are some of the most popular products that enhance our health and help us stay fit. In this approach, we have developed a framework that classifies fruits and vegetables using computer vision and deep learning-based methods. We test the proposed framework on Kaggle's fresh/stale fruits and vegetables image dataset and IEEE DataPort's FruitsGB dataset. Experiments were conducted in multiple trials to extract model parameters and were analyzed to classify the fruits and vegetables as fresh/stale. The classification depends on the selection of the optimizer and on varying hyperparameter values such as batch size, learning rate, kernel size, number of kernels, patch size, etc. The proposed custom CNN model achieves the highest classification accuracy of 97.65% and 95.86% using Kaggle's and FruitsGB datasets, respectively. Similarly, in the second approach, the vision transformer (ViT) achieves the highest classification accuracy of 98.34% and 96.75% on the same datasets, respectively. The results of these methods outperform the results of the baseline algorithm used in the classification of the images. [ABSTRACT FROM AUTHOR]
- Published
- 2024
8. GroupFormer for hyperspectral image classification through group attention.
- Author
Khan, Rahim, Arshad, Tahir, Ma, Xuefei, Zhu, Haifeng, Wang, Chen, Khan, Javed, Khan, Zahid Ullah, and Khan, Sajid Ullah
- Subjects
CONVOLUTIONAL neural networks, TRANSFORMER models, IMAGE recognition (Computer vision), FEATURE extraction, RESEARCH personnel
- Abstract
Hyperspectral image (HSI) data has a wide range of valuable spectral information for numerous tasks. HSI data encounters challenges such as small training samples, scarcity, and redundant information. Researchers have introduced various research works to address these challenges. Convolution Neural Network (CNN) has gained significant success in the field of HSI classification. CNN's primary focus is to extract low-level features from HSI data, and it has a limited ability to detect long-range dependencies due to the confined filter size. In contrast, vision transformers exhibit great success in the HSI classification field due to the use of attention mechanisms to learn the long-range dependencies. As mentioned earlier, the primary issue with these models is that they require sufficient labeled training data. To address this challenge, we proposed a spectral-spatial feature extractor group attention transformer that consists of a multiscale feature extractor to extract low-level or shallow features. For high-level semantic feature extraction, we proposed a group attention mechanism. Our proposed model is evaluated using four publicly available HSI datasets, which are Indian Pines, Pavia University, Salinas, and the KSC dataset. Our proposed approach achieved the best classification results in terms of overall accuracy (OA), average accuracy (AA), and Kappa coefficient. As mentioned earlier, the proposed approach utilized only 5%, 1%, 1%, and 10% of the training samples from the publicly available four datasets. [ABSTRACT FROM AUTHOR]
- Published
- 2024
9. Building a pelvic organ prolapse diagnostic model using vision transformer on multi‐sequence MRI.
- Author
Zhu, Shaojun, Zhu, Xiaoxuan, Zheng, Bo, Wu, Maonian, Li, Qiongshan, and Qian, Cheng
- Subjects
TRANSFORMER models, MAGNETIC resonance imaging, PELVIC organ prolapse, VALSALVA'S maneuver, FEATURE extraction, DEEP learning, KEGEL exercises, PELVIC floor
- Abstract
Background: Although the uterus, bladder, and rectum are distinct organs, their muscular fasciae are often interconnected. Clinical experience suggests that they may share common risk factors and associations. When one organ experiences prolapse, it can potentially affect the neighboring organs. However, the current assessment of disease severity still relies on manual measurements, which can yield varying results depending on the physician, thereby leading to diagnostic inaccuracies. Purpose: This study aims to develop a multilabel grading model based on deep learning to classify the degree of prolapse of three organs in the female pelvis using stress magnetic resonance imaging (MRI) and provide interpretable result analysis. Methods: We utilized sagittal MRI sequences taken at rest and during maximum Valsalva maneuver from 662 subjects. The training set included 464 subjects, the validation set included 98 subjects, and the test set included 100 subjects (training set n = 464, validation set n = 98, test set n = 100). We designed a feature extraction module specifically for pelvic floor MRI using the vision transformer architecture and employed a label masking training strategy and pre-training methods to enhance model convergence. The grading results were evaluated using Precision, Kappa, Recall, and Area Under the Curve (AUC). To validate the effectiveness of the model, the designed model was compared with classic grading methods. Finally, we provided interpretability charts illustrating the model's operational principles on the grading task. Results: In terms of POP grading detection, the model achieved an average Precision, Kappa coefficient, Recall, and AUC of 0.86, 0.77, 0.76, and 0.86, respectively. Compared to existing studies, our model achieved the highest performance metrics. The average time taken to diagnose a patient was 0.38 s. Conclusions: The proposed model achieved detection accuracy that is comparable to or even exceeds that of physicians, demonstrating the effectiveness of the vision transformer architecture and label masking training strategy for assisting in the grading of POP under static and maximum Valsalva conditions. This offers a promising option for computer-aided diagnosis and treatment planning of POP. [ABSTRACT FROM AUTHOR]
- Published
- 2024
10. Distributed Fire Classification and Localization Model Based on Federated Learning with Image Clustering.
- Author
Lee, Jiwon, Kang, Jeongheun, Park, Chun-Su, and Jeong, Jongpil
- Abstract
In this study, we propose a fire classification system using image clustering based on a federated learning (FL) structure. This system enables fire detection in various industries, including manufacturing. The accurate classification of fire, smoke, and normal conditions is an important element of fire prevention and response systems in industrial sites. The server in the proposed system extracts data features using a pretrained vision transformer model and clusters the data using the bisecting K-means algorithm to obtain weights. The clients utilize these weights to cluster local data with the K-means algorithm and measure the difference in data distribution using the Kullback–Leibler divergence. Experimental results show that the proposed model achieves nearly 99% accuracy on the server, and the clustering accuracy on the clients remains high. In addition, the normalized mutual information value remains above 0.6 and the silhouette score reaches 0.9 as the rounds progress, indicating improved clustering quality. This study shows that the accuracy of fire classification is enhanced by using FL and clustering techniques and has a high potential for real-time detection. [ABSTRACT FROM AUTHOR]
- Published
- 2024
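Entry 10 outlines a server that clusters pretrained-ViT features with bisecting K-means and clients that reuse the resulting centroids with K-means, comparing data distributions via KL divergence. The sketch below is a loose reading of that pipeline (requires scikit-learn 1.1 or newer); exactly which distributions the paper compares with the KL term is an assumption, and all names and the random data are illustrative.

```python
import numpy as np
from sklearn.cluster import BisectingKMeans, KMeans
from scipy.stats import entropy

# Server side: cluster reference ViT features, share centroids as "weights".
def server_round(server_features, k=3):
    model = BisectingKMeans(n_clusters=k, random_state=0).fit(server_features)
    return model.cluster_centers_

# Client side: cluster local ViT features starting from the server centroids,
# then compare the local cluster-occupancy distribution against uniform
# with a KL divergence (assumed choice of reference distribution).
def client_round(local_features, centroids):
    km = KMeans(n_clusters=len(centroids), init=centroids, n_init=1).fit(local_features)
    local_dist = np.bincount(km.labels_, minlength=len(centroids)) / len(km.labels_)
    uniform = np.full(len(centroids), 1.0 / len(centroids))
    kl = entropy(local_dist + 1e-9, uniform)
    return km.labels_, kl

server_feats = np.random.randn(300, 768)   # stand-in for pretrained-ViT features
client_feats = np.random.randn(120, 768)
centroids = server_round(server_feats)
labels, kl = client_round(client_feats, centroids)
```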
11. Multi-label remote sensing classification with self-supervised gated multi-modal transformers.
- Author
Na Liu, Ye Yuan, Guodong Wu, Sai Zhang, Jie Leng, and Lihong Wan
- Subjects
TRANSFORMER models, SYNTHETIC aperture radar, REMOTE sensing, MULTISENSOR data fusion, RESEARCH personnel
- Abstract
Introduction: With the great success of Transformers in the field of machine learning, they are also gradually attracting widespread interest in the field of remote sensing (RS). However, research in the field of remote sensing has been hampered by the lack of large labeled data sets and the inconsistency of data modes caused by the diversity of RS platforms. With the rise of self-supervised learning (SSL) algorithms in recent years, RS researchers began to pay attention to the application of the "pre-training and fine-tuning" paradigm in RS. However, there is little research on multi-modal data fusion in the remote sensing field. Most of it uses only one of the modal data or simply splices multiple modal data roughly. Method: In order to study a more efficient multi-modal data fusion scheme, we propose a multi-modal fusion mechanism based on gated unit control (MGSViT). In this paper, we pretrain the ViT model on the BigEarthNet dataset by combining two commonly used SSL algorithms, and propose an intra-modal and inter-modal gated fusion unit for feature learning by combining multispectral (MS) and synthetic aperture radar (SAR) data. Our method can effectively combine different modal data to extract key feature information. Results and discussion: After fine-tuning and comparison experiments, we outperform the most advanced algorithms in all downstream classification tasks. The validity of our proposed method is verified. [ABSTRACT FROM AUTHOR]
- Published
- 2024
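Entry 11 builds intra- and inter-modal gated fusion units for multispectral and SAR features. A minimal sigmoid-gated fusion of two modality embeddings might look like the sketch below; the paper's MGSViT gating is more elaborate, so treat this as illustrative only.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Gated fusion of two modality embeddings (e.g. multispectral and SAR):
    a sigmoid gate decides, per feature, how much of each modality to keep."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, f_ms, f_sar):
        g = self.gate(torch.cat([f_ms, f_sar], dim=-1))
        return g * f_ms + (1 - g) * f_sar

fused = GatedFusion(768)(torch.randn(4, 768), torch.randn(4, 768))
print(fused.shape)  # torch.Size([4, 768])
```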
12. UTran-DSR: a novel transformer-based model using feature enhancement for dysarthric speech recognition.
- Author
Irshad, Usama, Mahum, Rabbia, Ganiyu, Ismaila, Butt, Faisal Shafique, Hidri, Lotfi, Ali, Tamer G., and El-Sherbeeny, Ahmed M.
- Subjects
TRANSFORMER models, DATA recovery, SPEECH perception, DEEP learning, SPASTIC paralysis
- Abstract
Over the past decade, the prevalence of neurological diseases has significantly risen due to population growth and aging. Individuals suffering from spastic paralysis, brain attack, and idiopathic Parkinson's disease (PD), among other neurological illnesses, commonly suffer from dysarthria. Early detection and treatment of dysarthria in these patients are essential for effectively managing the progression of their disease. This paper provides UTrans-DSR, a novel encoder-decoder architecture for analyzing Mel-spectrograms (generated from audios) and classifying speech as healthy or dysarthric. Our model employs transformer encoder features based on a hybrid design, which includes the feature enhancement block (FEB) and the vision transformer (ViT) encoders. This combination effectively extracts global and local pixel information regarding localization while optimizing the mel-spectrograms feature extraction process. We keep up with the original class-token grouping sequence in the vision transformer while generating a new equivalent expanding route. More specifically, two unique growing pathways use a deep-supervision approach to increase spatial data recovery and expedite model convergence. We add consecutive residual connections to the system to reduce feature loss while increasing spatial data retrieval. Our technique is based on identifying gaps in mel-spectrograms distinguishing between normal and dysarthric speech. We conducted several experiments on UTrans-DSR using the UA speech and TORGO datasets, and it outperformed the existing top models. The model performed significantly in pixel's localized and spatial feature extraction, effectively detecting and classifying spectral gaps. The Tran-DSR model outperforms previous research models, achieving an accuracy of 97.75%. [ABSTRACT FROM AUTHOR]
- Published
- 2024
13. Equipping computational pathology systems with artifact processing pipelines: a showcase for computation and performance trade-offs.
- Author
Kanwal, Neel, Khoraminia, Farbod, Kiraz, Umay, Mosquera-Zamudio, Andrés, Monteagudo, Carlos, Janssen, Emiel A. M., Zuiverloon, Tahlita C. M., Rong, Chunming, and Engan, Kjersti
- Subjects
TRANSFORMER models, CONVOLUTIONAL neural networks, DISTRIBUTION (Probability theory), DEEP learning, CANCER hospitals, COHEN'S kappa coefficient (Statistics)
- Abstract
Background: Histopathology is a gold standard for cancer diagnosis. It involves extracting tissue specimens from suspicious areas to prepare a glass slide for a microscopic examination. However, histological tissue processing procedures result in the introduction of artifacts, which are ultimately transferred to the digitized version of glass slides, known as whole slide images (WSIs). Artifacts are diagnostically irrelevant areas and may result in wrong predictions from deep learning (DL) algorithms. Therefore, detecting and excluding artifacts in the computational pathology (CPATH) system is essential for reliable automated diagnosis. Methods: In this paper, we propose a mixture of experts (MoE) scheme for detecting five notable artifacts, including damaged tissue, blur, folded tissue, air bubbles, and histologically irrelevant blood from WSIs. First, we train independent binary DL models as experts to capture particular artifact morphology. Then, we ensemble their predictions using a fusion mechanism. We apply probabilistic thresholding over the final probability distribution to improve the sensitivity of the MoE. We developed four DL pipelines to evaluate computational and performance trade-offs. These include two MoEs and two multiclass models of state-of-the-art deep convolutional neural networks (DCNNs) and vision transformers (ViTs). These DL pipelines are quantitatively and qualitatively evaluated on external and out-of-distribution (OoD) data to assess generalizability and robustness for artifact detection application. Results: We extensively evaluated the proposed MoE and multiclass models. DCNNs-based MoE and ViTs-based MoE schemes outperformed simpler multiclass models and were tested on datasets from different hospitals and cancer types, where MoE using (MobileNet) DCNNs yielded the best results. The proposed MoE yields 86.15 % F1 and 97.93% sensitivity scores on unseen data, retaining less computational cost for inference than MoE using ViTs. This best performance of MoEs comes with relatively higher computational trade-offs than multiclass models. Furthermore, we apply post-processing to create an artifact segmentation mask, a potential artifact-free RoI map, a quality report, and an artifact-refined WSI for further computational analysis. During the qualitative evaluation, field experts assessed the predictive performance of MoEs over OoD WSIs. They rated artifact detection and artifact-free area preservation, where the highest agreement translated to a Cohen Kappa of 0.82, indicating substantial agreement for the overall diagnostic usability of the DCNN-based MoE scheme. Conclusions: The proposed artifact detection pipeline will not only ensure reliable CPATH predictions but may also provide quality control. In this work, the best-performing pipeline for artifact detection is MoE with DCNNs. Our detailed experiments show that there is always a trade-off between performance and computational complexity, and no straightforward DL solution equally suits all types of data and applications. The code and HistoArtifacts dataset can be found online at Github and Zenodo, respectively. [ABSTRACT FROM AUTHOR]
- Published
- 2024
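Entry 13 ensembles per-artifact binary experts and applies probabilistic thresholding to raise sensitivity. The sketch below assumes a simple max-probability fusion with per-artifact thresholds, which may differ from the paper's fusion mechanism; the inputs are synthetic.

```python
import numpy as np

def moe_artifact_fusion(expert_probs, thresholds):
    """Fuse per-artifact binary experts and apply probabilistic thresholding.

    expert_probs: (n_experts, n_patches) probabilities, one expert per artifact.
    thresholds:   (n_experts,) per-artifact cutoffs; lowering a cutoff raises
                  that artifact's sensitivity.
    Returns the winning artifact index per patch, or -1 for artifact-free.
    """
    probs = np.asarray(expert_probs)
    over = probs >= np.asarray(thresholds)[:, None]   # which experts fire
    best = probs.argmax(axis=0)                       # highest-probability expert
    return np.where(over[best, np.arange(probs.shape[1])], best, -1)

probs = np.array([[0.1, 0.8, 0.4], [0.7, 0.2, 0.45]])   # 2 experts, 3 patches
print(moe_artifact_fusion(probs, thresholds=[0.5, 0.5]))  # -> [ 1  0 -1]
```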
14. Plant Disease Identification Based on Encoder–Decoder Model.
- Author
Feng, Wenfeng, Sun, Guoying, and Zhang, Xin
- Abstract
Plant disease identification is a crucial issue in agriculture, and with the advancement of deep learning techniques, early and accurate identification of plant diseases has become increasingly critical. In recent years, the rise of vision transformers has attracted significant attention from researchers in various vision-based application areas. We designed a model with an encoder–decoder architecture to efficiently classify plant diseases using a transfer learning approach, which effectively recognizes a large number of plant diseases in multiple crops. The model was tested on the "PlantVillage", "FGVC8", and "EMBRAPA" datasets, which contain leaf information from crops such as apples, soybeans, tomatoes, and potatoes. These datasets cover diseases caused by fungi, including rust, spot, and scab, as well as viral diseases such as leaf curl. The model's performance was rigorously evaluated on datasets, and the results demonstrated its high accuracy. The model achieved 99.9% accuracy on the "PlantVillage" dataset, 97.4% on the "EMBRAPA" dataset, and 91.5% on the "FGVC8" dataset, showcasing its competitiveness with other state-of-the-art models. This study provides a robust and reliable solution for plant disease classification and contributes to the advancement of precision agriculture. [ABSTRACT FROM AUTHOR]
- Published
- 2024
15. Res-MGCA-SE: a lightweight convolutional neural network based on vision transformer for medical image classification.
- Author
Soleimani-Fard, Sina and Ko, Seok-bum
- Subjects
CONVOLUTIONAL neural networks, TRANSFORMER models, IMAGE recognition (Computer vision), COMPUTED tomography, X-ray imaging, X-rays
- Abstract
This paper presents a lightweight and accurate convolutional neural network (CNN) based on the encoder in the vision transformer structure, which uses multigroup convolution rather than a multilayer perceptron and multiheaded self-attention. We propose a group convolution block called multigroup convolution attention (MGCA) and squeeze and excitation (SE). The MGCA includes two parts: three 1 × 1 convolutions concatenated along the channel dimension and a depth-wise separable convolution. SE is used as a skip connection to provide long-range dependencies. MGCA-SE is introduced to reduce the number of parameters in state-of-the-art networks in order to use fewer datasets for training the CNN. Furthermore, we provide a lightweight network based on MGCA-SE in the Resnet architecture, called Resnet-multigroup convolution attention-squeeze and excitation (Res-MGCA-SE), to enable early detection and treatment from medical images. Finally, Res-MGCA-SE is evaluated on lung cancer and COVID-19 chest X-ray and CT images. According to our research findings, MGCA-SE can replace convolutional layers in state-of-the-art networks and turn them into lightweight networks with properties comparable to heavy-weight networks. [ABSTRACT FROM AUTHOR]
- Published
- 2024
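Entry 15 describes the MGCA-SE block as three 1 × 1 convolutions concatenated along the channel dimension, a depthwise separable convolution, and squeeze-and-excitation used as a skip connection. The following is one plausible arrangement of those pieces, not the authors' verified layout.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-excitation: channel-wise gating, used here as a skip path."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):
        w = self.fc(x.mean(dim=(2, 3)))          # squeeze to (B, C)
        return x * w[:, :, None, None]           # excite each channel

class MGCA_SE(nn.Module):
    """Sketch of a multigroup convolution attention block with an SE skip."""
    def __init__(self, channels):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(channels, channels, kernel_size=1) for _ in range(3)])
        self.depthwise = nn.Conv2d(3 * channels, 3 * channels, kernel_size=3,
                                   padding=1, groups=3 * channels)
        self.pointwise = nn.Conv2d(3 * channels, channels, kernel_size=1)
        self.se = SEBlock(channels)

    def forward(self, x):
        y = torch.cat([b(x) for b in self.branches], dim=1)  # concat on channels
        y = self.pointwise(self.depthwise(y))                # depthwise separable
        return y + self.se(x)                                # SE as skip connection

out = MGCA_SE(32)(torch.randn(2, 32, 56, 56))
print(out.shape)  # torch.Size([2, 32, 56, 56])
```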
16. Deep learning for pore-scale two-phase flow: Modelling drainage in realistic porous media.
- Author
ASADOLAHPOUR Seyed Reza, JIANG Zeyun, LEWIS Helen, and MIN Chao
- Subjects
DEEP learning, POROUS materials, COMPUTED tomography, MICROFLUIDICS, SANDSTONE
- Abstract
In order to predict phase distributions within complex pore structures during two-phase capillary-dominated drainage, we select subsamples from computerized tomography (CT) images of rocks and simulated porous media, and develop a pore morphology-based simulator (PMS) to create a diverse dataset of phase distributions. With pixel size, interfacial tension, contact angle, and pressure as input parameters, convolutional neural network (CNN), recurrent neural network (RNN) and vision transformer (ViT) are transformed, trained and evaluated to select the optimal model for predicting phase distribution. It is found that commonly used CNN and RNN have deficiencies in capturing phase connectivity. Subsequently, we develop a higher-dimensional vision transformer (HD-ViT) that drains pores solely based on their size, regardless of their spatial location, with phase connectivity enforced as a post-processing step. This approach enables inference for images of varying sizes and resolutions with inlet-outlet setup at any coordinate directions. We demonstrate that HD-ViT maintains its effectiveness, accuracy and speed advantage on larger sandstone and carbonate images, compared with the microfluidic-based displacement experiment. In the end, we train and validate a 3D version of the model. [ABSTRACT FROM AUTHOR]
- Published
- 2024
17. Enhancing Autonomous Visual Perception in Challenging Environments: Bilateral Models with Vision Transformer and Multilayer Perceptron for Traversable Area Detection.
- Author
Urrea, Claudio and Vélez, Maximiliano
- Abstract
The development of autonomous vehicles has grown significantly recently due to the promise of improving safety and productivity in cities and industries. The scene perception module has benefited from the latest advances in computer vision and deep learning techniques, allowing the creation of more accurate and efficient models. This study develops and evaluates semantic segmentation models based on a bilateral architecture to enhance the detection of traversable areas for autonomous vehicles on unstructured routes, particularly in datasets where the distinction between the traversable area and the surrounding ground is minimal. The proposed hybrid models combine Convolutional Neural Networks (CNNs), Vision Transformer (ViT), and Multilayer Perceptron (MLP) techniques, achieving a balance between precision and computational efficiency. The results demonstrate that these models outperform the base architectures in prediction accuracy, capturing distant details more effectively while maintaining real-time operational capabilities. [ABSTRACT FROM AUTHOR]
- Published
- 2024
18. Lightweight vision image transformer (LViT) model for skin cancer disease classification.
- Author
Dwivedi, Tanay, Chaurasia, Brijesh Kumar, and Shukla, Man Mohan
- Abstract
Skin cancer (SC) is a lethal disease not only in India but also in the world; there are more than a million cases of melanoma per year in India. Early detection of skin cancer through accurate classification of skin lesions is essential for effective treatment. Visual inspection by clinical screening, dermoscopy, or histological tests is strongly emphasised in today's skin cancer diagnosis. It can be challenging to determine the kind of skin cancer, especially in the early stages, due to the resemblance among cancer types. However, the precise classification of skin lesions could be time-consuming and challenging for dermatologists. To address these issues, we propose transfer learning to accurately classify skin lesions into several forms of skin cancer using a lightweight B-16 Vision Image Transformer model (LViT). An extensive dataset is used in the experiment to verify the efficiency of the proposed LViT model. The LViT model can classify skin cancer with high accuracy, sensitivity, and specificity and generalise favourably to new images. The proposed model has a 93.17% accuracy rating for classifying SC images over 25 epochs and a remarkable accuracy of 95.82% over 100 epochs. The proposed LViT model is lightweight, requires minimal processing resources, and achieves good accuracy on small and enormous data sets. [ABSTRACT FROM AUTHOR]
- Published
- 2024
19. OSTNet: overlapping splitting transformer network with integrated density loss for vehicle density estimation.
- Author
Qu, Yang, Yang, Liran, Zhong, Ping, and Li, Qiuyue
- Subjects
CONVOLUTIONAL neural networks, TRANSFORMER models, TRAFFIC monitoring, TRAFFIC flow, TRAFFIC safety, CONVOLUTION codes
- Abstract
Vehicle density estimation plays a crucial role in traffic monitoring, providing the traffic management department with the traffic volume and traffic flow to monitor traffic safety. Currently, all vehicle density estimation methods based on Convolutional Neural Network (CNN) fall short in extracting global information due to the limited receptive field of the convolution kernel, resulting in the loss of vehicle information. Vision Transformer can capture long-distance dependencies and establish global context information through the self-attention mechanism, and is expected to be applied to vehicle density estimation. However, directly using Vision Transformer will result in the discontinuity of vehicle information between patches. In addition, the completion of vehicle density estimation also faces challenges, such as vehicle multi-scale changes, occlusion, and background noise. To solve the above challenges, a novel Overlapping Splitting Transformer Network (OSTNet) tailored for vehicle density estimation is designed. Overlapping splitting is proposed so that each patch shares half of its area, ensuring the continuity of vehicle information between patches. Dilation convolution is introduced to remove fixed-size position codes in order to provide accurate vehicle localization information. Meanwhile, Feature Pyramid Aggregation (FPA) module is utilized to obtain different scale information, which can tackle the issue of multi-scale changes. Moreover, a novel loss function called integrated density loss is designed to address the existing vehicle occlusion and background noise problems. The extensive experimental results on four open source datasets have shown that OSTNet outperforms the SOTA methods and can help traffic management department to better estimate vehicle density. The source code and pre-trained models are available at: https://github.com/quyang-hub/vehicle-density-estimation. [ABSTRACT FROM AUTHOR]
- Published
- 2024
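Entry 19's overlapping splitting makes neighbouring patches share half of their area. A common way to realise that idea is a strided-convolution patch embedding whose stride equals half the patch size, as sketched below; the dimensions are arbitrary and the padding choice is an assumption, not the paper's exact setting.

```python
import torch
import torch.nn as nn

class OverlappingPatchEmbed(nn.Module):
    """Patch embedding where neighbouring patches overlap by half their area.

    Implemented as a strided convolution: kernel = patch size, stride = patch/2,
    so every patch shares half of its pixels with the next one.
    """
    def __init__(self, patch_size=16, in_chans=3, embed_dim=256):
        super().__init__()
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size,
                              stride=patch_size // 2,
                              padding=patch_size // 2)

    def forward(self, x):                      # x: (B, C, H, W)
        x = self.proj(x)                       # (B, D, H', W')
        return x.flatten(2).transpose(1, 2)    # (B, N, D) token sequence

tokens = OverlappingPatchEmbed()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 841, 256]) -> 29 x 29 overlapping patches
```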
20. FPGA‐Based Implementation of Real‐Time Cardiologist‐Level Arrhythmia Detection and Classification in Electrocardiograms Using Novel Deep Learning.
- Author
Chandrasekaran, Saravanakumar, Chandran, Srinivasan, and Selvam, Immaculate Joy
- Subjects
TRANSFORMER models, CAPSULE neural networks, FEATURE extraction, FEATURE selection, DEEP learning, ARRHYTHMIA
- Abstract
Cardiac arrhythmia refers to irregular heartbeats caused by anomalies in electrical transmission in the heart muscle, and it is an important threat to cardiovascular health. Conventional monitoring and diagnosis still depend on the laborious visual examination of electrocardiogram (ECG) devices, even though ECG signals are dynamic and complex. This paper discusses the need for an automated system to assist clinicians in efficiently recognizing arrhythmias. The existing machine-learning (ML) algorithms have extensive training cycles and require manual feature selection; to eliminate this, we present a novel deep learning (DL) architecture. Our research introduces a novel approach to ECG classification by combining the vision transformer (ViT) and the capsule network (CapsNet) into a hybrid model named ViT-Cap. We conduct necessary preprocessing operations, including noise removal and signal-to-image conversion using short-time Fourier transform (STFT) and continuous wavelet transform (CWT) algorithms, on both normal and abnormal ECG data obtained from the MIT-BIH database. The proposed model intelligently focuses on crucial features by leveraging global and local attention to explore spectrogram and scalogram image data. Initially, the model divides the images into smaller patches and linearly embeds each patch. Features are then extracted using a transformer encoder, followed by classification using the capsule module with feature vectors from the ViT module. Comparisons with existing conventional models show that our proposed model outperforms the original ViT and CapsNet in terms of classification accuracy for both binary and multi-class ECG classification. The experimental findings demonstrate an accuracy of 99% on both scalogram and spectrogram images. Comparative analysis with state-of-the-art methodologies confirms the superiority of our framework. Additionally, we configure a field-programmable gate array (FPGA) to implement the proposed model for real-time arrhythmia classification, aiming to enhance user-friendliness and speed. Despite numerous suggestions for high-performance FPGA accelerators in the literature, our FPGA-based accelerator utilizes optimization of loop parallelization, FP data, and the multiply accumulation (MAC) unit. Our accelerator architecture achieves a 57% reduction in processing time and utilizes fewer resources compared to a floating-point (FlP) design. [ABSTRACT FROM AUTHOR]
- Published
- 2024
21. A vision transformer‐based deep transfer learning nomogram for predicting lymph node metastasis in lung adenocarcinoma.
- Author
Chen, Chuanyu, Luo, Yi, Hou, Qiuyang, Qiu, Jun, Yuan, Shuya, and Deng, Kexue
- Abstract
Background: Lymph node metastasis (LNM) plays a crucial role in the management of lung cancer; however, the ability of chest computed tomography (CT) imaging to detect LNM status is limited. Purpose: This study aimed to develop and validate a vision transformer-based deep transfer learning nomogram for predicting LNM in lung adenocarcinoma patients using preoperative unenhanced chest CT imaging. Methods: This study included 528 patients with lung adenocarcinoma who were randomly divided into training and validation cohorts at a 7:3 ratio. The pretrained vision transformer (ViT) was utilized to extract deep transfer learning (DTL) features, and logistic regression was employed to construct a ViT-based DTL model. Subsequently, the model was compared with six classical convolutional neural network (CNN) models. Finally, the ViT-based DTL signature was combined with independent clinical predictors to construct a ViT-based deep transfer learning nomogram (DTLN). Results: The ViT-based DTL model showed good performance, with an area under the curve (AUC) of 0.821 (95% CI, 0.775–0.867) in the training cohort and 0.825 (95% CI, 0.758–0.891) in the validation cohort. The ViT-based DTL model demonstrated comparable performance to classical CNN models in predicting LNM, and the ViT-based DTL signature was then used to construct the ViT-based DTLN with independent clinical predictors such as tumor maximum diameter, location, and density. The DTLN achieved the best predictive performance, with AUCs of 0.865 (95% CI, 0.827–0.903) and 0.894 (95% CI, 0.845–0.942), respectively, surpassing both the clinical factor model and the ViT-based DTL model (p < 0.001). Conclusion: This study developed a new DTL model based on ViT to predict LNM status in lung adenocarcinoma patients and revealed that the performance of the ViT-based DTL model was comparable to that of classical CNN models, confirming that ViT was viable for deep learning tasks involving medical images. The ViT-based DTLN performed exceptionally well and can assist clinicians and radiologists in making accurate judgments and formulating appropriate treatment plans. [ABSTRACT FROM AUTHOR]
- Published
- 2024
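Entry 21 extracts deep transfer learning features with a pretrained ViT and fits a logistic-regression signature on them. A generic version of that two-step pipeline, using an off-the-shelf torchvision ViT rather than the authors' setup, could look like the sketch below; X_train and y_train are placeholder names.

```python
import torch
import torch.nn as nn
from torchvision.models import vit_b_16, ViT_B_16_Weights
from sklearn.linear_model import LogisticRegression

# Pretrained ViT as a frozen feature extractor (classification head removed).
weights = ViT_B_16_Weights.DEFAULT
vit = vit_b_16(weights=weights)
vit.heads = nn.Identity()
vit.eval()

preprocess = weights.transforms()   # resize / crop / normalize

@torch.no_grad()
def extract_features(images):
    """images: list of PIL images or (3, H, W) uint8 tensors -> (N, 768) features."""
    batch = torch.stack([preprocess(img) for img in images])
    return vit(batch).numpy()

# Fit a logistic-regression "signature" on the extracted features
# (X_train: (N, 768) array from extract_features, y_train: binary labels):
# clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
```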
22. Towards improved fundus disease detection using Swin Transformers.
- Author
Jawad, M Abdul, Khursheed, Farida, Nawaz, Shah, and Mir, A. H.
- Subjects
TRANSFORMER models, MACULAR degeneration, COMPUTER-aided diagnosis, MACHINE learning, EQUILIBRIUM testing, DEEP learning
- Abstract
Ocular diseases can have debilitating consequences on visual acuity if left untreated, necessitating early and accurate diagnosis to improve patients' quality of life. Although the contemporary clinical prognosis involving fundus screening is a cost-effective method for detecting ocular abnormalities, however, it is time-intensive due to limited resources and expert ophthalmologists. While computer-aided detection, including traditional machine learning and deep learning, has been employed for enhanced prognosis from fundus images, conventional deep learning models often face challenges due to limited global modeling ability, inducing bias and suboptimal performance on unbalanced datasets. Presently, most studies on ocular disease detection focus on cataract detection or diabetic retinopathy severity prediction, leaving a myriad of vision-impairing conditions unexplored. Minimal research has been conducted utilizing deep models for identifying diverse ocular abnormalities from fundus images, with limited success. The study leveraged the capabilities of four Swin Transformer models (Swin-T, Swin-S, Swin-B, and Swin-L) for detecting various significant ocular diseases (including Cataracts, Hypertensive Retinopathy, Diabetic Retinopathy, Myopia, and Age-Related Macular Degeneration) from fundus images of the ODIR dataset. Swin Transformer models, confining self-attention to local windows while enabling cross-window interactions, demonstrated superior performance and computational efficiency. Upon assessment across three specific ODIR test sets, utilizing metrics such as AUC, F1-score, Kappa score, and a composite metric representing an average of these three (referred to as the final score), all Swin models exhibited superior performance metric scores than those documented in contemporary studies. The Swin-L model, in particular, achieved final scores of 0.8501, 0.8211, and 0.8616 on the Off-site, On-site, and Balanced ODIR test sets, respectively. An external validation on a Retina dataset further substantiated the generalizability of Swin models, with the models reporting final scores of 0.9058 (Swin-T), 0.92907 (Swin-S), 0.95917 (Swin-B), and 0.97042 (Swin-L). The results, corroborated by statistical analysis, underline the consistent and stable performance of Swin models across varied datasets, emphasizing their potential as reliable tools for multi-ocular disease detection from fundus images, thereby aiding in the early diagnosis and intervention of ocular abnormalities. [ABSTRACT FROM AUTHOR]
- Published
- 2024
23. DeepCPD: deep learning with vision transformer for colorectal polyp detection.
- Author
T.P, Raseena, Kumar, Jitendra, and Balasundaram, S. R.
- Subjects
TRANSFORMER models, COLON polyps, DATA augmentation, MEDICAL screening, EARLY diagnosis
- Abstract
One of the most severe cancers worldwide is Colorectal Cancer (CRC), which has the third-highest incidence of cancer cases and the second-highest rate of cancer mortality. Early diagnosis and treatment are receiving much attention globally due to the increasing incidence and death rates. Colonoscopy is acknowledged as the gold standard for screening CRC. Despite early screening, doctors miss approximately 25% of polyps during a colonoscopy examination because the diagnosis varies from expert to expert. After a few years, this missing polyp may develop into cancer. This study is focused on addressing such diagnostic challenges, aiming to minimize the risk of misdiagnosis and enhance the overall accuracy of diagnostic procedures. Thus, we propose an efficient deep learning method, DeepCPD, combining transformer architecture and Linear Multihead Self-Attention (LMSA) mechanism with data augmentation to classify colonoscopy images into two categories: polyp versus non-polyp and hyperplastic versus adenoma based on the dataset. The experiments are conducted on four benchmark datasets: PolypsSet, CP-CHILD-A, CP-CHILD-B, and Kvasir V2. The proposed model demonstrated superior performance compared to the existing state-of-the-art methods with an accuracy above 98.05%, precision above 97.71%, and recall above 98.10%. Notably, the model exhibited a training time improvement of over 1.2x across all datasets. The strong performance of the recall metric shows the ability of DeepCPD to detect polyps by minimizing the false negative rate. These results indicate that this model can be used effectively to create a diagnostic tool with computer assistance that can be highly helpful to clinicians during the diagnosing process. [ABSTRACT FROM AUTHOR]
- Published
- 2024
24. Application of machine learning for the differentiation of thymomas and thymic cysts using deep transfer learning: A multi‐center comparison of diagnostic performance based on different dimensional models.
- Author
Yang, Yuhua, Cheng, Jia, Chen, Liang, Cui, Can, Liu, Shaoqiang, and Zuo, Minjing
- Subjects
TRANSFORMER models, DEEP learning, CONVOLUTIONAL neural networks, MACHINE learning, FEATURE extraction
- Abstract
Objective: This study aimed to evaluate the feasibility and performance of deep transfer learning (DTL) networks with different types and dimensions in differentiating thymomas from thymic cysts in a retrospective cohort. Materials and Methods: Based on chest-enhanced computed tomography (CT), the region of interest was delineated, and the maximum cross section of the lesion was selected as the input image. Five convolutional neural networks (CNNs) and Vision Transformer (ViT) were used to construct a 2D DTL model. The 2D model constructed by the maximum section (n) and the upper and lower layers (n − 1, n + 1) of the lesion was used for feature extraction, and the features were selected. The remaining features were pre-fused to construct a 2.5D model. The whole lesion image was selected as input to construct a 3D model. Results: In the 2D model, the area under curve (AUC) of Resnet50 was 0.950 in the training cohort and 0.907 in the internal validation cohort. In the 2.5D model, the AUCs of Vgg11 in the internal validation cohort and external validation cohort 1 were 0.937 and 0.965, respectively. The AUCs of Inception_v3 in the training cohort and external validation cohort 2 were 0.981 and 0.950, respectively. The AUC values of 3D_Resnet50 in the four cohorts were 0.987, 0.937, 0.938, and 0.905. Conclusions: The DTL model based on multiple different dimensions can be used as a highly sensitive and specific tool for the non-invasive differential diagnosis of thymomas and thymic cysts to assist clinicians in decision-making. [ABSTRACT FROM AUTHOR]
- Published
- 2024
25. Integrating neural networks with advanced optimization techniques for accurate kidney disease diagnosis.
- Author
Elbedwehy, Samar, Hassan, Esraa, Saber, Abeer, and Elmonier, Rady
- Subjects
CONVOLUTIONAL neural networks, KIDNEY stones, TRANSFORMER models, IMAGE recognition (Computer vision), KIDNEY disease diagnosis
- Abstract
Kidney diseases pose a significant global health challenge, requiring precise diagnostic tools to improve patient outcomes. This study addresses this need by investigating three main categories of renal diseases: kidney stones, cysts, and tumors. Utilizing a comprehensive dataset of 12,446 CT whole abdomen and urogram images, this study developed an advanced AI-driven diagnostic system specifically tailored for kidney disease classification. The innovative approach of this study combines the strengths of traditional convolutional neural network architecture (AlexNet) with modern advancements in ConvNeXt architectures. By integrating AlexNet's robust feature extraction capabilities with ConvNeXt's advanced attention mechanisms, the paper achieved an exceptional classification accuracy of 99.85%. A key advancement in this study's methodology lies in the strategic amalgamation of features from both networks. This paper concatenated hierarchical spatial information and incorporated self-attention mechanisms to enhance classification performance. Furthermore, the study introduced a custom optimization technique inspired by the Adam optimizer, which dynamically adjusts the step size based on gradient norms. This tailored optimizer facilitated faster convergence and more effective weight updates, improving model performance. The model of this study demonstrated outstanding performance across various metrics, with an average precision of 99.89%, recall of 99.95%, and specificity of 99.83%. These results highlight the efficacy of the hybrid architecture and optimization strategy in accurately diagnosing kidney diseases. Additionally, the methodology of this paper emphasizes interpretability and explainability, which are crucial for the clinical deployment of deep learning models. [ABSTRACT FROM AUTHOR]
- Published
- 2024
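Entry 25 mentions a custom, Adam-inspired optimizer that adapts the step size to gradient norms. The authors' exact rule is not given here, so the toy step below only illustrates the general idea of shrinking the learning rate as the global gradient norm grows; it is an SGD-style sketch, not the paper's optimizer.

```python
import torch

def gradnorm_scaled_sgd_step(params, base_lr=1e-3):
    """Toy step: scale the learning rate by the inverse of the global gradient
    norm, so large gradients take proportionally smaller steps (illustrative)."""
    with torch.no_grad():
        grads = [p.grad for p in params if p.grad is not None]
        if not grads:
            return
        total_norm = torch.sqrt(sum((g ** 2).sum() for g in grads))
        lr = base_lr / (1.0 + total_norm)
        for p in params:
            if p.grad is not None:
                p.add_(p.grad, alpha=-lr.item())

# Minimal usage on a single parameter.
w = torch.nn.Parameter(torch.randn(10))
loss = (w ** 2).sum()
loss.backward()
gradnorm_scaled_sgd_step([w])
```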
26. Ultra-lightweight convolution-transformer network for early fire smoke detection.
- Author
Chaturvedi, Shubhangi, Shubham Arun, Chandravanshi, Singh Thakur, Poornima, Khanna, Pritee, and Ojha, Aparajita
- Subjects
MODIS (Spectroradiometer), ARTIFICIAL neural networks, CONVOLUTIONAL neural networks, TRANSFORMER models, ARTIFICIAL intelligence, FIRE detectors
- Published
- 2024
27. An explainable artificial intelligence‐based approach for reliable damage detection in polymer composite structures using deep learning.
- Author
Azad, Muhammad Muzammil and Kim, Heung Soo
- Subjects
TRANSFORMER models, ARTIFICIAL intelligence, COMPOSITE structures, POLYMER structure, LAMINATED materials, DEEP learning, STRUCTURAL health monitoring
- Abstract
Artificial intelligence (AI) techniques are increasingly used for structural health monitoring (SHM) of polymer composite structures. However, to be confident in the trustworthiness of AI models, the models must be reliable, interpretable, and explainable. The use of explainable artificial intelligence (XAI) is critical to ensure that the AI model is transparent in the decision-making process and that the predictions it provides can be trusted and understood by users. However, existing SHM methods for polymer composite structures lack explainability and transparency, and therefore reliable damage detection. Therefore, an interpretable deep learning model based on an explainable vision transformer (X-ViT) is proposed for the SHM of composites, leading to improved repair planning, maintenance, and performance. The proposed approach has been validated on carbon fiber reinforced polymer (CFRP) composites with multiple health states. The X-ViT model exhibited better damage detection performance compared to existing popular methods. Moreover, the X-ViT approach effectively highlighted the area of interest related to the prediction of each health condition in composites through the patch attention aggregation process, emphasizing their influence on the decision-making process. Consequently, integrating the ViT-based explainable deep-learning model into the SHM of polymer composites provided improved diagnostics along with increased transparency and reliability. Highlights: Autonomous damage detection of polymer composites using a vision transformer-based deep learning model. Explainable artificial intelligence by highlighting regions of interest using patch attention. Comparison with existing state-of-the-art structural health monitoring methods. [ABSTRACT FROM AUTHOR]
- Published
- 2024
28. Comparison of Vision Transformers and Convolutional Neural Networks in Medical Image Analysis: A Systematic Review.
- Author
Takahashi, Satoshi, Sakaguchi, Yusuke, Kouno, Nobuji, Takasawa, Ken, Ishizu, Kenichi, Akagi, Yu, Aoyama, Rina, Teraya, Naoki, Bolatkan, Amina, Shinkai, Norio, Machino, Hidenori, Kobayashi, Kazuma, Asada, Ken, Komatsu, Masaaki, Kaneko, Syuzo, Sugiyama, Masashi, and Hamamoto, Ryuji
- Subjects
STATISTICAL models, COMPUTER simulation, DIAGNOSTIC imaging, ARTIFICIAL intelligence, SYSTEMATIC reviews, ATTENTION, DEEP learning, ARTIFICIAL neural networks, DIGITAL image processing, MACHINE learning
- Abstract
In the rapidly evolving field of medical image analysis utilizing artificial intelligence (AI), the selection of appropriate computational models is critical for accurate diagnosis and patient care. This literature review provides a comprehensive comparison of vision transformers (ViTs) and convolutional neural networks (CNNs), the two leading techniques in the field of deep learning in medical imaging. We conducted a survey systematically. Particular attention was given to the robustness, computational efficiency, scalability, and accuracy of these models in handling complex medical datasets. The review incorporates findings from 36 studies and indicates a collective trend that transformer-based models, particularly ViTs, exhibit significant potential in diverse medical imaging tasks, showcasing superior performance when contrasted with conventional CNN models. Additionally, it is evident that pre-training is important for transformer applications. We expect this work to help researchers and practitioners select the most appropriate model for specific medical image analysis tasks, accounting for the current state of the art and future trends in the field. [ABSTRACT FROM AUTHOR]
- Published
- 2024
29. VT-BPAN: vision transformer-based bilinear pooling and attention network fusion of RGB and skeleton features for human action recognition.
- Author
Sun, Yaohui, Xu, Weiyao, Yu, Xiaoyi, and Gao, Ju
- Subjects
HUMAN activity recognition, TRANSFORMER models, KINECT (Motion sensor), HUMAN skeleton, SKELETON
- Abstract
The recent generation of the Microsoft Kinect camera captures a series of multimodal signals that provide RGB video, depth sequences, and skeleton information, making it an option to achieve enhanced human action recognition performance by fusing different data modalities. However, most existing fusion methods simply fuse different features, which ignores the underlying semantics between the different modalities, leading to a lack of accuracy. In addition, there exists a large amount of background noise. In this work, we propose a Vision Transformer-based Bilinear Pooling and Attention Network (VT-BPAN) fusion mechanism for human action recognition. This work improves the recognition accuracy in the following ways: 1) An effective two-stream feature pooling and fusion mechanism is proposed. The RGB frames and skeleton are fused to enhance the spatio-temporal feature representation. 2) A spatial lightweight multiscale vision Transformer is proposed, which can reduce the cost of computing. The framework is evaluated on three widely used video action datasets, and the proposed approach achieves performance comparable to the state-of-the-art methods. [ABSTRACT FROM AUTHOR]
- Published
- 2024
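Entry 29 fuses RGB and skeleton features with bilinear pooling and attention. The snippet below shows only the classic bilinear-pooling part (outer product, signed square root, L2 normalisation); the attention components of VT-BPAN are omitted, and the feature dimensions are arbitrary.

```python
import torch
import torch.nn.functional as F

def bilinear_pool_fusion(rgb_feat, skel_feat):
    """Fuse RGB and skeleton embeddings with bilinear pooling:
    outer product -> flatten -> signed square root -> L2 normalisation."""
    outer = torch.einsum("bi,bj->bij", rgb_feat, skel_feat)   # (B, Dr, Ds)
    x = outer.flatten(1)                                      # (B, Dr*Ds)
    x = torch.sign(x) * torch.sqrt(x.abs() + 1e-9)            # signed sqrt
    return F.normalize(x, dim=1)                              # unit L2 norm

fused = bilinear_pool_fusion(torch.randn(4, 256), torch.randn(4, 256))
print(fused.shape)  # torch.Size([4, 65536])
```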
30. Strawberry disease identification with vision transformer-based models.
- Author
Nguyen, Hai Thanh, Tran, Tri Dac, Nguyen, Thanh Tuong, Pham, Nhi Minh, Nguyen Ly, Phuc Hoang, and Luong, Huong Hoang
- Subjects
CONVOLUTIONAL neural networks, TRANSFORMER models, IMAGE recognition (Computer vision), NOSOLOGY, MANUFACTURING processes
- Abstract
Strawberry is a healthy, beneficial fruit and one of the most valuable exports for most countries. However, diseases could produce poor-quality strawberries and affect the consumer's health. Thus, quality inspection is a crucial stage in processing production. Convolutional Neural Network (CNN) models can be used to identify specific diseases. Even yet, the performance of Vision Transformer (ViT) has recently improved by using transfer learning to detect strawberry diseases. The goal is to train this model to recognize those diseases, applying fine-tuning to increase the precision of the results to obtain high accuracy. Strawberry photos from the collection are divided into seven classes and mainly focus on strawberry leaves, berries, and flower diseases. The findings demonstrate the benefits of using the ViT model, which outperforms a similar approach to strawberry disease classification with accuracy and an F1-score of 0.927 and 0.927, respectively, on the Strawberry Disease Detection dataset. [ABSTRACT FROM AUTHOR]
- Published
- 2024
31. Enhancing oil palm segmentation model with GAN-based augmentation.
- Author
Kwong, Qi Bin, Kon, Yee Thung, Rusik, Wan Rusydiah W., Shabudin, Mohd Nor Azizi, Rahman, Shahirah Shazana A., Kulaveerasingam, Harikrishna, and Appleton, David Ross
- Subjects
TRANSFORMER models, DATA augmentation, OIL palm, GENERATIVE adversarial networks, TILES, PALMS
- Abstract
In digital agriculture, accurate crop detection is fundamental to developing automated systems for efficient plantation management. For oil palm, the main challenge lies in developing robust models that perform well in different environmental conditions. This study addresses the feasibility of using GAN augmentation methods to improve palm detection models. For this purpose, drone images of young palms (< 5 years old) from eight different estates were collected, annotated, and used to build a baseline detection model based on DETR. StyleGAN2 was trained on the extracted palms and then used to generate a series of synthetic palms, which were then inserted into tiles representing different environments. CycleGAN networks were trained for bidirectional translation between synthetic and real tiles, subsequently utilized to augment the authenticity of synthetic tiles. Both synthetic and real tiles were used to train the GAN-based detection model. The baseline model achieved precision and recall values of 95.8% and 97.2%. The GAN-based model achieved a comparable result, with precision and recall values of 98.5% and 98.6%. In challenge dataset 1, consisting of older palms (> 5 years old), both models also achieved similar accuracies, with the baseline model achieving precision and recall of 93.1% and 99.4%, and the GAN-based model achieving 95.7% and 99.4%. As for challenge dataset 2, consisting of storm-affected palms, the baseline model achieved a precision of 100% but a recall of only 13%. The GAN-based model achieved a significantly better result, with precision and recall values of 98.7% and 95.3%. This result demonstrates that images generated by GANs have the potential to enhance the accuracies of palm detection models. [ABSTRACT FROM AUTHOR]
- Published
- 2024
32. Few-Shot Semantic Segmentation Algorithm Based on Meta-Learning (基于元学习的小样本语义分割算法).
- Author
王兰忠 and 牟昌善
- Subjects
TRANSFORMER models, MACHINE learning, SPINE, PROBLEM solving
- Abstract
To solve the problem of low segmentation accuracy for unknown novel classes in existing few-shot semantic segmentation models, a few-shot semantic segmentation algorithm based on meta-learning was proposed. Depthwise separable convolutions were utilized to improve the traditional backbone network, and the encoder was pre-trained on the ImageNet dataset. The pre-trained backbone network was used to map the support and query images into a deep feature space. Using the ground truth masks of the support images, the support features were separated into object foreground and background, and an adaptive meta-learning classifier was constructed using a vision transformer. Extensive experiments were completed on the PASCAL-5i dataset. The results show that the proposed model achieves an mIoU (mean Intersection over Union) (1-shot) of 47.1%, 58.3% and 60.4% on the VGG-16, ResNet-50 and ResNet-101 backbone networks, respectively, and it achieves an mIoU of 49.6%, 60.2% and 62.1% under the 5-shot setting. On the COCO-20i dataset, mIoU (1-shot) values of 23.6%, 30.3% and 30.7% are achieved, with mIoU values of 30.1%, 34.7% and 35.2% under the 5-shot setting. [ABSTRACT FROM AUTHOR]
- Published
- 2024
33. HCFormer: A Lightweight Pest Detection Model Combining CNN and ViT.
- Author
Zeng, Meiqi, Chen, Shaonan, Liu, Hongshan, Wang, Weixing, and Xie, Jiaxing
- Subjects
TRANSFORMER models, CONVOLUTIONAL neural networks, DEEP learning, AGRICULTURAL pests, FEATURE extraction
- Abstract
Pests are widely distributed in nature, characterized by their small size, which, along with environmental factors such as lighting conditions, makes their identification challenging. A lightweight pest detection network, HCFormer, combining convolutional neural networks (CNNs) and a vision transformer (ViT) is proposed in this study. Data preprocessing is conducted using a bottleneck-structured convolutional network and a Stem module to reduce computational latency. CNNs with various kernel sizes capture local information at different scales, while the ViT network's attention mechanism and global feature extraction enhance pest feature representation. A down-sampling method reduces the input image size, decreasing computational load and preventing overfitting while enhancing model robustness. Improved attention mechanisms effectively capture feature relationships, balancing detection accuracy and speed. The experimental results show that HCFormer achieves 98.17% accuracy, 91.98% recall, and a mean average precision (mAP) of 90.57%. Compared with SENet, CrossViT, and YOLOv8, HCFormer improves the average accuracy by 7.85%, 2.01%, and 3.55%, respectively, outperforming the overall mainstream detection models. Ablation experiments indicate that the model's parameter count is 26.5 M, demonstrating advantages in lightweight design and detection accuracy. HCFormer's efficiency and flexibility in deployment, combined with its high detection accuracy and precise classification, make it a valuable tool for identifying and classifying crop pests in complex environments, providing essential guidance for future pest monitoring and control. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
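The CNN-then-ViT pattern the HCFormer entry describes (a convolutional stem for local features and down-sampling, followed by attention for global context) can be sketched as below; the layer sizes and block layout are placeholders, not the published architecture.

```python
# Generic hybrid conv-stem + transformer block, assuming placeholder dimensions.
import torch
import torch.nn as nn

class ConvStem(nn.Module):
    def __init__(self, out_ch: int = 96):
        super().__init__()
        # Stride-2 convs act as the down-sampling stem, cutting the token
        # count (and hence attention cost) by 16x.
        self.stem = nn.Sequential(
            nn.Conv2d(3, out_ch // 2, 3, stride=2, padding=1), nn.GELU(),
            nn.Conv2d(out_ch // 2, out_ch, 3, stride=2, padding=1), nn.GELU(),
        )

    def forward(self, x):
        return self.stem(x)

class HybridBlock(nn.Module):
    def __init__(self, dim: int = 96, heads: int = 4):
        super().__init__()
        self.local = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)  # depthwise, local
        self.attn = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                               batch_first=True)    # global context

    def forward(self, x):
        x = x + self.local(x)
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)    # (B, H*W, C)
        tokens = self.attn(tokens)
        return tokens.transpose(1, 2).reshape(b, c, h, w)

model = nn.Sequential(ConvStem(96), HybridBlock(96))
out = model(torch.randn(1, 3, 224, 224))         # (1, 96, 56, 56)
```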
34. EfficientUNetViT: Efficient Breast Tumor Segmentation Utilizing UNet Architecture and Pretrained Vision Transformer.
- Author
-
Anari, Shokofeh, de Oliveira, Gabriel Gomes, Ranjbarzadeh, Ramin, Alves, Angela Maria, Vaz, Gabriel Caumo, and Bendechache, Malika
- Subjects
- *
TRANSFORMER models , *IMAGE processing , *FEATURE extraction , *BREAST tumors , *COMPUTATIONAL complexity - Abstract
This study introduces a sophisticated neural network structure for segmenting breast tumors. It achieves this by combining a pretrained Vision Transformer (ViT) model with a UNet framework. The UNet architecture, commonly employed for biomedical image segmentation, is further enhanced with depthwise separable convolutional blocks to decrease computational complexity and parameter count, resulting in better efficiency and less overfitting. The ViT, renowned for its robust feature extraction capabilities utilizing self-attention processes, efficiently captures the overall context within images, surpassing the performance of conventional convolutional networks. By using a pretrained ViT as the encoder in our UNet model, we take advantage of its extensive feature representations acquired from extensive datasets, resulting in a major enhancement in the model's ability to generalize and train efficiently. The suggested model demonstrates exceptional performance in segmenting breast tumors from medical images, highlighting the advantages of integrating transformer-based encoders with efficient UNet topologies. This hybrid methodology emphasizes the capabilities of transformers in the field of medical image processing and establishes a new standard for accuracy and efficiency in activities related to tumor segmentation. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
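The depthwise separable convolution block mentioned in the EfficientUNetViT entry can be written in a few lines; the sketch below is a generic version (not the authors' exact block) and also compares its parameter count against a plain 3x3 convolution.

```python
# Generic depthwise-separable block; channel counts are placeholders.
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1)
        self.norm = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.norm(self.pointwise(self.depthwise(x))))

# Parameter comparison against a plain 3x3 convolution at 256 channels:
plain = nn.Conv2d(256, 256, 3, padding=1)
sep = DepthwiseSeparableConv(256, 256)
print(sum(p.numel() for p in plain.parameters()),   # ~590k parameters
      sum(p.numel() for p in sep.parameters()))     # ~69k parameters
```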
35. CA-ViT: Contour-Guided and Augmented Vision Transformers to Enhance Glaucoma Classification Using Fundus Images.
- Author
-
Tohye, Tewodros Gizaw, Qin, Zhiguang, Al-antari, Mugahed A., Ukwuoma, Chiagoziem C., Lonseko, Zenebe Markos, and Gu, Yeong Hyeon
- Subjects
- *
TRANSFORMER models , *GENERATIVE adversarial networks , *VISION disorders , *OPTIC disc , *GLAUCOMA - Abstract
Glaucoma, a predominant cause of visual impairment on a global scale, poses notable challenges in diagnosis owing to its initially asymptomatic presentation. Early identification is vital to prevent irreversible vision impairment. Cutting-edge deep learning techniques, such as vision transformers (ViTs), have been employed to tackle the challenge of early glaucoma detection. Nevertheless, limited approaches have been suggested to improve glaucoma classification due to issues like inadequate training data, variations in feature distribution, and the overall quality of samples. Furthermore, fundus images display significant similarities and slight discrepancies in lesion sizes, complicating glaucoma classification when utilizing ViTs. To address these obstacles, we introduce the contour-guided and augmented vision transformer (CA-ViT) for enhanced glaucoma classification using fundus images. We employ a Conditional Variational Generative Adversarial Network (CVGAN) to enhance and diversify the training dataset by incorporating conditional sample generation and reconstruction. Subsequently, a contour-guided approach is integrated to offer crucial insights into the disease, particularly concerning the optic disc and optic cup regions. Both the original images and extracted contours are given to the ViT backbone; then, feature alignment is performed with a weighted cross-entropy loss. Finally, in the inference phase, the ViT backbone, trained on the original fundus images and augmented data, is used for multi-class glaucoma categorization. By utilizing the Standardized Multi-Channel Dataset for Glaucoma (SMDG), which encompasses various datasets (e.g., EYEPACS, DRISHTI-GS, RIM-ONE, REFUGE), we conducted thorough testing. The results indicate that the proposed CA-ViT model significantly outperforms current methods, achieving a precision of 93.0%, a recall of 93.08%, an F1 score of 92.9%, and an accuracy of 93.0%. Therefore, the integration of augmentation with the CVGAN and contour guidance can effectively enhance glaucoma classification tasks. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
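One reading of the dual-input setup in the CA-ViT entry is that the same backbone sees both the fundus image and its extracted contour, with a weighted cross-entropy combining the two predictions. The sketch below follows that reading only as an illustration; the contour weight and the tiny stand-in backbone are assumptions, not values from the paper.

```python
# Hedged sketch of a dual-input classifier with a weighted cross-entropy loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualInputClassifier(nn.Module):
    def __init__(self, backbone: nn.Module, contour_weight: float = 0.3):
        super().__init__()
        self.backbone = backbone          # any image -> logits classifier
        self.w = contour_weight           # assumed weighting, not from the paper

    def loss(self, image, contour, target):
        logits_img = self.backbone(image)
        logits_ctr = self.backbone(contour)
        # Weighted cross-entropy over the two views of the same eye.
        return (1 - self.w) * F.cross_entropy(logits_img, target) \
               + self.w * F.cross_entropy(logits_ctr, target)

# A tiny CNN keeps the sketch self-contained; a ViT backbone would slot in the same way.
backbone = nn.Sequential(nn.Conv2d(3, 16, 3, stride=4), nn.ReLU(),
                         nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 3))
model = DualInputClassifier(backbone)
loss = model.loss(torch.randn(2, 3, 224, 224), torch.randn(2, 3, 224, 224),
                  torch.tensor([0, 2]))
```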
36. A Comparative Analysis of U-Net and Vision Transformer Architectures in Semi-Supervised Prostate Zonal Segmentation.
- Author
-
Huang, Guantian, Xia, Bixuan, Zhuang, Haoming, Yan, Bohan, Wei, Cheng, Qi, Shouliang, Qian, Wei, and He, Dianning
- Subjects
- *
TRANSFORMER models , *DIAGNOSTIC imaging , *AUTODIDACTICISM , *TIME-varying networks , *ENTROPY - Abstract
The precise segmentation of different regions of the prostate is crucial in the diagnosis and treatment of prostate-related diseases. However, the scarcity of labeled prostate data poses a challenge for the accurate segmentation of its different regions. We perform the segmentation of different regions of the prostate using U-Net- and Vision Transformer (ViT)-based architectures. We use five semi-supervised learning methods, including entropy minimization, cross pseudo-supervision, mean teacher, uncertainty-aware mean teacher (UAMT), and interpolation consistency training (ICT) to compare the results with the state-of-the-art prostate semi-supervised segmentation network uncertainty-aware temporal self-learning (UATS). The UAMT method improves the prostate segmentation accuracy and provides stable prostate region segmentation results. ICT plays a more stable role in the prostate region segmentation results, which provides strong support for the medical image segmentation task, and demonstrates the robustness of U-Net for medical image segmentation. UATS is still more applicable to the U-Net backbone and has a very significant effect on a positive prediction rate. However, the performance of ViT in combination with semi-supervision still requires further optimization. This comparative analysis applies various semi-supervised learning methods to prostate zonal segmentation. It guides future prostate segmentation developments and offers insights into utilizing limited labeled data in medical imaging. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
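One of the semi-supervised ingredients named in the entry above, the mean-teacher scheme, reduces to an exponential-moving-average weight update plus a consistency loss on unlabeled scans. The sketch below is generic; the prostate segmentation networks and UAMT's uncertainty weighting are not reproduced, and the stand-in model is a placeholder.

```python
# Mean-teacher building blocks: EMA weight update + consistency loss.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

@torch.no_grad()
def ema_update(teacher: nn.Module, student: nn.Module, alpha: float = 0.99):
    # Teacher weights track an exponential moving average of the student.
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(alpha).add_(s_param, alpha=1 - alpha)

def consistency_loss(student_logits, teacher_logits):
    # MSE between softened predictions on the same unlabeled batch.
    return F.mse_loss(student_logits.softmax(dim=1), teacher_logits.softmax(dim=1))

student = nn.Conv2d(1, 4, 3, padding=1)          # stand-in segmentation head
teacher = copy.deepcopy(student)
unlabeled = torch.randn(2, 1, 64, 64)
loss = consistency_loss(student(unlabeled), teacher(unlabeled).detach())
loss.backward()
ema_update(teacher, student)
```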
37. VTCNet: A Feature Fusion DL Model Based on CNN and ViT for the Classification of Cervical Cells.
- Author
-
Li, Mingzhe, Que, Ningfeng, Zhang, Juanhua, Du, Pingfang, and Dai, Yin
- Subjects
- *
TRANSFORMER models , *EARLY detection of cancer , *DEEP learning , *PAP test ,DEVELOPING countries - Abstract
Cervical cancer is a common malignancy worldwide with high incidence and mortality rates in underdeveloped countries. The Pap smear test, widely used for early detection of cervical cancer, aims to minimize missed diagnoses, which sometimes results in higher false-positive rates. To enhance manual screening practices, computer-aided diagnosis (CAD) systems based on machine learning (ML) and deep learning (DL) for classifying cervical Pap cells have been extensively researched. In our study, we introduced a DL-based method named VTCNet for the task of cervical cell classification. Our approach combines CNN-SPPF and ViT components, integrating modules like Focus and SeparableC3, to capture more potential information, extract local and global features, and merge them to enhance classification performance. We evaluated our method on the public SIPaKMeD dataset, achieving accuracy, precision, recall, and F1 scores of 97.16%, 97.22%, 97.19%, and 97.18%, respectively. We also conducted additional experiments on the Herlev dataset, where our results outperformed previous methods. Through this integration, the VTCNet method achieved higher classification accuracy than traditional ML or shallow DL models. Related code: https://github.com/Camellia-0892/VTCNet/tree/main. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
38. Multimodal Neuroimaging Fusion for Alzheimer's Disease: An Image Colorization Approach With Mobile Vision Transformer.
- Author
-
Odusami, Modupe, Damasevicius, Robertas, Milieskaite‐Belousoviene, Egle, and Maskeliunas, Rytis
- Subjects
- *
TRANSFORMER models , *ALZHEIMER'S disease , *MILD cognitive impairment , *FEATURE extraction , *DIAGNOSTIC imaging - Abstract
Multimodal neuroimaging, combining data from different sources, has shown promise in classifying the Alzheimer's disease (AD) stage. Existing multimodal neuroimaging fusion methods exhibit certain limitations, which require advancements to enhance their objective performance, sensitivity, and specificity for AD classification. This study uses a Pareto-optimal cosine color map to enhance classification performance and the visual clarity of fused images. A mobile vision transformer (ViT) model, incorporating the swish activation function, is introduced for effective feature extraction and classification. Fused images from the Alzheimer's Disease Neuroimaging Initiative (ADNI), the Whole Brain Atlas (AANLIB), and Open Access Series of Imaging Studies (OASIS) datasets, obtained through optimized transposed convolution, are utilized for model training, while evaluation is carried out on unfused images from the same databases. The proposed model demonstrates high accuracy in AD classification across different datasets, achieving 98.76% accuracy for Early Mild Cognitive Impairment (EMCI) versus Late Mild Cognitive Impairment (LMCI), 98.65% for LMCI versus AD, 98.60% for EMCI versus AD, and 99.25% for AD versus Cognitively Normal (CN) in the ADNI dataset. Similarly, on OASIS and AANLIB, the precision of the AD versus CN classification is 99.50% and 96.00%, respectively. Evaluation metrics showcase the model's precision, recall, and F1 score for various binary classifications, emphasizing its robust performance. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
39. Breaking Barriers in Cancer Diagnosis: Super‐Light Compact Convolution Transformer for Colon and Lung Cancer Detection.
- Author
-
Maurya, Ritesh, Pandey, Nageshwar Nath, Karnati, Mohan, and Sahu, Geet
- Subjects
- *
CONVOLUTIONAL neural networks , *TRANSFORMER models , *COLON cancer , *LUNG cancer , *COLON cancer diagnosis - Abstract
According to the World Health Organization, lung and colon cancers are known for their high mortality rates, which necessitates diagnosing these cancers at an early stage. However, the limited availability of data such as the histopathology images used for diagnosing these cancers poses a significant challenge when developing computer-aided detection systems. This makes it necessary to keep a check on the number of parameters in the artificial intelligence (AI) model used for the detection of these cancers, considering the limited availability of the data. In this work, a customised, compact and efficient convolution transformer architecture, termed C3-Transformer, has been proposed for the diagnosis of colon and lung cancers using histopathological images. The proposed C3-Transformer relies on a convolutional tokenisation and sequence pooling approach to keep a check on the number of parameters and to combine the advantages of convolutional neural networks with those of transformer models. The novelty of the proposed method lies in the efficient classification of colon and lung cancers using the proposed C3-Transformer architecture. The performance of the proposed method has been evaluated on the 'LC25000' dataset. Experimental results show that the proposed method achieves average classification accuracy, precision and recall values of 99.30%, 0.9941 and 0.9950 in classifying the five different classes of colon and lung cancer with only 0.0316 million parameters. Thus, the present computer-aided detection system developed using the proposed C3-Transformer can efficiently detect colon and lung cancers from histopathology images with high detection accuracy. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
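The sequence pooling step named in the C3-Transformer entry can be illustrated as a learned attention weight per token that replaces a class token, keeping the classification head almost parameter-free. The dimensions below are placeholders, and this is a generic sketch rather than the published model.

```python
# Generic sequence pooling over transformer tokens.
import torch
import torch.nn as nn

class SequencePooling(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.attn = nn.Linear(dim, 1)     # one scalar score per token

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, C) output of the transformer encoder
        weights = self.attn(tokens).softmax(dim=1)    # (B, N, 1)
        return (weights * tokens).sum(dim=1)          # (B, C)

pool = SequencePooling(dim=192)
head = nn.Linear(192, 5)                  # five classes, matching LC25000
logits = head(pool(torch.randn(8, 196, 192)))
```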
40. RetinaViT: Efficient Visual Backbone for Online Video Streams.
- Author
-
Suzuki, Tomoyuki and Aoki, Yoshimitsu
- Subjects
- *
STREAMING video & television , *TRANSFORMER models , *RECOGNITION (Psychology) , *FEATURE extraction , *NEIGHBORHOODS - Abstract
In online video understanding, which has a wide range of real-world applications, inference speed is crucial. Many approaches involve frame-level visual feature extraction, which often represents the biggest bottleneck. We propose RetinaViT, an efficient method for extracting frame-level visual features in an online video stream, aiming to fundamentally enhance the efficiency of online video understanding tasks. RetinaViT is composed of efficiently approximated Transformer blocks that only take changed tokens (event tokens) as queries and reuse the already processed tokens from the previous timestep for the others. Furthermore, we restrict keys and values to the spatial neighborhoods of event tokens to further improve efficiency. RetinaViT involves tuning multiple parameters, which we determine through a multi-step process. During model training, we randomly vary these parameters and then perform black-box optimization to maximize accuracy and efficiency on the pre-trained model. We conducted extensive experiments on various online video recognition tasks, including action recognition, pose estimation, and object segmentation, validating the effectiveness of each component in RetinaViT and demonstrating improvements in the speed/accuracy trade-off compared to baselines. In particular, for action recognition, RetinaViT built on ViT-B16 reduces inference time by approximately 61.9% on the CPU and 50.8% on the GPU, while achieving slight accuracy improvements rather than degradation. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
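The event-token idea in the RetinaViT entry can be approximated by patchifying two consecutive frames and keeping only the patches whose content changed beyond a threshold, reusing cached features for the rest. The patch size and threshold below are illustrative, not the tuned values from the paper.

```python
# Toy event-token selection between two frames of a video stream.
import torch
import torch.nn.functional as F

def event_token_mask(prev_frame: torch.Tensor, cur_frame: torch.Tensor,
                     patch: int = 16, threshold: float = 0.05) -> torch.Tensor:
    """Frames: (B, C, H, W). Returns a (B, N) boolean mask of changed tokens."""
    diff = (cur_frame - prev_frame).abs()
    # Average the per-pixel change inside every non-overlapping patch.
    per_patch = F.avg_pool2d(diff.mean(dim=1, keepdim=True), patch)  # (B, 1, H/p, W/p)
    return per_patch.flatten(1) > threshold

prev, cur = torch.rand(1, 3, 224, 224), torch.rand(1, 3, 224, 224)
mask = event_token_mask(prev, cur)        # (1, 196) booleans
print("tokens to recompute:", int(mask.sum()))
```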
41. Scale-aware token-matching for transformer-based object detector.
- Author
-
Jung, Aecheon, Hong, Sungeun, and Hyun, Yoonsuk
- Subjects
- *
TRANSFORMER models , *DETECTORS - Published
- 2024
- Full Text
- View/download PDF
42. Self-supervised Domain Adaptation with Significance-Oriented Masking for Pelvic Organ Prolapse detection.
- Author
-
Li, Shichang, Wu, Hongjie, Tang, Chenwei, Chen, Dongdong, Chen, Yueyue, Mei, Ling, Yang, Fan, and Lv, Jiancheng
- Subjects
- *
PELVIC organ prolapse , *TRANSFORMER models - Published
- 2024
- Full Text
- View/download PDF
43. VerFormer: Vertebrae-Aware Transformer for Automatic Spine Segmentation from CT Images.
- Author
-
Li, Xinchen, Hong, Yuan, Xu, Yang, and Hu, Mu
- Subjects
- *
TRANSFORMER models , *CONVOLUTIONAL neural networks , *VERTEBRAL fractures , *COMPUTED tomography , *SPINE - Abstract
The accurate and efficient segmentation of the spine is important in the diagnosis and treatment of spine malfunctions and fractures. However, it is still challenging because of large inter-vertebra variations in shape and cross-image localization of the spine. In previous methods, convolutional neural networks (CNNs) have been widely applied as a vision backbone to tackle this task. However, these methods are challenged in utilizing the global contextual information across the whole image for accurate spine segmentation because of the inherent locality of the convolution operation. Compared with CNNs, the Vision Transformer (ViT) has been proposed as another vision backbone with a high capacity to capture global contextual information. However, when the ViT is employed for spine segmentation, it treats all input tokens equally, including vertebrae-related tokens and non-vertebrae-related tokens. Additionally, it lacks the capability to locate regions of interest, thus lowering the accuracy of spine segmentation. To address this limitation, we propose a novel Vertebrae-aware Vision Transformer (VerFormer) for automatic spine segmentation from CT images. Our VerFormer is designed by incorporating a novel Vertebrae-aware Global (VG) block into the ViT backbone. In the VG block, the vertebrae-related global contextual information is extracted by a Vertebrae-aware Global Query (VGQ) module. Then, this information is incorporated into query tokens to highlight vertebrae-related tokens in the multi-head self-attention module. Thus, this VG block can leverage global contextual information to effectively and efficiently locate spines across the whole input, thus improving the segmentation accuracy of VerFormer. Driven by this design, the VerFormer demonstrates a solid capacity to capture more discriminative dependencies and vertebrae-related context in automatic spine segmentation. The experimental results on two spine CT segmentation tasks demonstrate the effectiveness of our VG block and the superiority of our VerFormer in spine segmentation. Compared with other popular CNN- or ViT-based segmentation models, our VerFormer shows superior segmentation accuracy and generalization. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
44. Robust anomaly detection in industrial images by blending global–local features.
- Author
-
Pei, Mingjing, Liu, Ningzhong, and Xia, Shifeng
- Subjects
- *
TRANSFORMER models , *DATA mining , *DEEP learning , *FEATURE extraction , *RANDOM noise theory - Abstract
Industrial image anomaly detection achieves automated detection and localization of defects or abnormal regions in images through image processing and deep learning techniques. Currently, the reverse knowledge distillation approach has yielded favourable outcomes. However, challenges remain in the model's image feature extraction capability and in the robustness of the student network's decoding. This study first addresses the teacher network's limited ability to extract global information. To acquire more global information, a vision transformer network is introduced to enhance the model's global information extraction capability, obtaining better features to further assist the student network in decoding. Second, for anomalous samples, to address the low similarity between features extracted by the teacher network and features restored by the student network, Gaussian noise is introduced. This further increases the probability that the features decoded by the student model match normal sample features, enhancing the robustness of the student model. Extensive experiments were conducted on the industrial image datasets AeBAD, MVTec AD, and BTAD. On the AeBAD dataset, the model reaches 89.83% under the PRO performance metric, achieving state-of-the-art performance, and 83.35% under the AUROC metric. Similarly, good results were achieved on the MVTec AD and BTAD datasets. The proposed method's effectiveness and performance advantages were validated across multiple industrial datasets, providing a valuable reference for the application of industrial image anomaly detection methods. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
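For the teacher/student setup described in the entry above, a common way to score anomalies is the per-location cosine distance between teacher features and the student's reconstruction, with Gaussian noise injected during training to toughen the student decoder. The sketch below is generic; the backbones and noise level are placeholders.

```python
# Feature-distance anomaly scoring with optional Gaussian perturbation.
import torch
import torch.nn.functional as F

def anomaly_map(teacher_feat: torch.Tensor, student_feat: torch.Tensor) -> torch.Tensor:
    """Both (B, C, H, W); returns (B, H, W) with high values on anomalies."""
    return 1 - F.cosine_similarity(teacher_feat, student_feat, dim=1)

def add_feature_noise(feat: torch.Tensor, sigma: float = 0.1) -> torch.Tensor:
    # Gaussian perturbation used only at training time, as motivated in the abstract.
    return feat + sigma * torch.randn_like(feat)

t_feat = torch.randn(2, 256, 28, 28)
s_feat = add_feature_noise(t_feat)        # stand-in for the student's output
score = anomaly_map(t_feat, s_feat)       # image-level score: score.amax(dim=(1, 2))
```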
45. Parallel desires: unifying local and semantic feature representations in marine species images for classification.
- Author
-
Manikandan, Dhana Lakshmi and Santhanam, Sakthivel Murugan
- Abstract
Accurate identification of marine species is essential for ecological monitoring, habitat assessment, biodiversity conservation, and sustainable resource management. To address the challenges associated with diverse and complex marine environments, the paper proposes an integrated model that combines the strengths of a Vision Transformer (ViT) and Transfer Learning (TL). The paper introduces a novel methodology for the classification of marine species images by integrating the capabilities of an Amended Dual Attention oN Self-locale and External (ADANSE) Vision Transformer and a DenseNet-169 Transfer Learning model. The ADANSE-ViT, serving as the foundational architecture, excels in capturing long-range dependencies and intricate patterns in large-scale images, forming a robust basis for subsequent classification tasks. Further fine-tuning customizes the model for marine species images. Additionally, we utilize transfer learning with the DenseNet-169 architecture, pre-trained on a comprehensive dataset, to extract relevant features and enhance classification effectiveness specifically for marine species. This synergistic combination enables a comprehensive analysis of both local and semantic features in species images, leading to accurate classification results. Experimental evaluations conducted on self-collected and benchmark datasets showcase the efficacy of our approach, surpassing existing fish classifiers and TL variants in terms of classification accuracy. Our integrated model achieves an impressive accuracy of 96.21% on the self-collected dataset and 95.09% on the benchmark dataset. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
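One simple way to realize the ViT-plus-DenseNet combination described above is late fusion: pool a feature vector from each branch, concatenate, and classify. The sketch below uses tiny stand-in backbones and a placeholder class count; in practice the branches would be the pretrained ADANSE-ViT and DenseNet-169 encoders, and the exact fusion used in the paper may differ.

```python
# Generic late-fusion classifier over two feature-extraction branches.
import torch
import torch.nn as nn

class LateFusionClassifier(nn.Module):
    def __init__(self, branch_a: nn.Module, branch_b: nn.Module,
                 dim_a: int, dim_b: int, num_classes: int):
        super().__init__()
        self.branch_a, self.branch_b = branch_a, branch_b
        self.head = nn.Linear(dim_a + dim_b, num_classes)

    def forward(self, x):
        fused = torch.cat([self.branch_a(x), self.branch_b(x)], dim=1)
        return self.head(fused)

def tiny_backbone(dim):
    # Stand-in feature extractor; swap in real pretrained encoders here.
    return nn.Sequential(nn.Conv2d(3, dim, 7, stride=8), nn.ReLU(),
                         nn.AdaptiveAvgPool2d(1), nn.Flatten())

model = LateFusionClassifier(tiny_backbone(64), tiny_backbone(96), 64, 96,
                             num_classes=10)     # placeholder species count
logits = model(torch.randn(4, 3, 224, 224))
```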
46. Attention-based multi-scale feature fusion network for myopia grading using optical coherence tomography images.
- Author
-
Huang, Gengyou, Wen, Yang, Qian, Bo, Bi, Lei, Chen, Tingli, and Sheng, Bin
- Subjects
- *
MYOPIA , *TRANSFORMER models , *OPTICAL coherence tomography , *DEEP learning - Abstract
Myopia is a serious threat to eye health and can even cause blindness. It is important to grade myopia and carry out targeted intervention. Various studies have used deep learning models based on optical coherence tomography (OCT) images to screen for high myopia. However, since the regions of interest (ROIs) of pre-myopia and low myopia on OCT images are relatively small, detailed myopia grading from OCT images is rather difficult, and few studies have attempted it. To address these problems, we propose a novel attention-based multi-scale feature fusion network named AMFF for myopia grading using OCT images. The proposed AMFF mainly consists of five modules: a pre-trained vision transformer (ViT) module, a multi-scale convolutional module, an attention feature fusion module, an Avg-TopK pooling module and a fully connected (FC) classifier. Firstly, unsupervised pre-training of the ViT on the training set allows better feature maps to be extracted. Secondly, multi-scale convolutional layers further extract multi-scale feature maps to obtain larger receptive fields and scale-invariant features. Thirdly, feature maps of different scales are fused through channel attention and spatial attention to obtain more meaningful features. Lastly, the most prominent features are obtained by the weighted average of the highest activation values of each channel, and they are then used to classify myopia through a fully connected layer. Extensive experiments show that our proposed model achieves superior performance compared with other state-of-the-art myopia grading models. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
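Taking the abstract's description of Avg-TopK pooling at face value (a weighted average of the highest activation values of each channel), a minimal version looks like the sketch below; k is an assumed hyper-parameter, and the paper's exact weighting may differ.

```python
# Average of the top-k activations per channel, as described in the abstract.
import torch

def avg_topk_pool(feat: torch.Tensor, k: int = 8) -> torch.Tensor:
    """feat: (B, C, H, W) -> (B, C), averaging the k largest values per channel."""
    flat = feat.flatten(2)                   # (B, C, H*W)
    topk = flat.topk(k, dim=2).values        # (B, C, k)
    return topk.mean(dim=2)

pooled = avg_topk_pool(torch.randn(2, 512, 14, 14))   # (2, 512)
```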
47. Performance of vision transformer and swin transformer models for lemon quality classification in fruit juice factories.
- Author
-
Dümen, Sezer, Kavalcı Yılmaz, Esra, Adem, Kemal, and Avaroglu, Erdinç
- Subjects
- *
TRANSFORMER models , *MACHINE learning , *DEEP learning , *FRUIT juices , *ARTIFICIAL intelligence - Abstract
Assessing the quality of agricultural products holds vital significance in enhancing production efficiency and market viability. The adoption of artificial intelligence (AI) has notably surged for this purpose, employing deep learning and machine learning techniques to process and classify agricultural product images according to defined standards. This study focuses on a lemon dataset encompassing 'good' and 'bad' quality classes, first augmenting the data through rescaling, random zoom, flip, and rotation methods. Subsequently, eight diverse deep learning approaches and two transformer methods were employed for classification, with the ViT method achieving 99.84% accuracy, 99.95% recall, and 99.66% precision, the highest accuracy documented. These findings strongly advocate for the efficacy of the ViT method in successfully classifying lemon quality, spotlighting its potential impact on agricultural quality assessment. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
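The augmentation pipeline listed in the entry above (rescaling, random zoom, flip, rotation) maps naturally onto torchvision transforms; in the sketch below the classifier is a placeholder rather than the ViT or Swin configuration used in the study, and the image is a stand-in.

```python
# Augmentation pipeline feeding a binary good/bad quality classifier.
import torch
import torch.nn as nn
from PIL import Image
from torchvision import transforms

train_tf = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),  # rescale + random zoom
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(degrees=15),
    transforms.ToTensor(),
])

# Placeholder classifier; a pretrained ViT with a 2-way head would sit here.
classifier = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 2))

img = Image.new("RGB", (512, 512))                  # stand-in for a lemon photo
logits = classifier(train_tf(img).unsqueeze(0))     # (1, 2): good vs. bad
```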
48. SSG2: A new modeling paradigm for semantic segmentation.
- Author
-
Diakogiannis, Foivos I., Furby, Suzanne, Caccetta, Peter, Wu, Xiaoliang, Ibata, Rodrigo, Hlinka, Ondrej, and Taylor, John
- Subjects
- *
CONVOLUTIONAL neural networks , *TRANSFORMER models , *SPATIAL resolution , *ERROR rates , *SPECTRAL imaging - Abstract
State-of-the-art models in semantic segmentation primarily operate on single, static images, generating corresponding segmentation masks. This one-shot approach leaves little room for error correction, as the models lack the capability to integrate multiple observations for enhanced accuracy. Inspired by work on semantic change detection, we address this limitation by introducing a methodology that leverages a sequence of observables generated for each static input image. By adding this "temporal" dimension, we exploit strong signal correlations between successive observations in the sequence to reduce error rates. Our framework, dubbed SSG2 (Semantic Segmentation Generation 2), employs a dual-encoder, single-decoder base network augmented with a sequence model. The base model learns to predict the set intersection, union, and difference of labels from dual-input images. Given a fixed target input image and a set of support images, the sequence model builds the predicted mask of the target by synthesizing the partial views from each sequence step and filtering out noise. We evaluate SSG2 across four diverse datasets: UrbanMonitor, featuring orthoimage tiles from Darwin, Australia with four spectral bands at 0.2 m spatial resolution and a surface model; ISPRS Potsdam, which includes true orthophoto images with multiple spectral bands and a 5 cm ground sampling distance; ISPRS Vaihingen, which also includes true orthophoto images and a 9 cm ground sampling distance; and ISIC2018, a medical dataset focused on skin lesion segmentation, particularly melanoma. The SSG2 model demonstrates rapid convergence within the first few tens of epochs and significantly outperforms UNet-like baseline models with the same number of gradient updates. However, the addition of the temporal dimension results in an increased memory footprint. While this could be a limitation, it is offset by the advent of higher-memory GPUs and coding optimizations. Our code is available at https://github.com/feevos/ssg2. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
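The set-style targets that SSG2's base network predicts from a dual input can be derived directly from two binary masks, as in the sketch below; multi-class labels would be handled per class in the same way, and this is only an illustration of the target construction, not the SSG2 network itself.

```python
# Set intersection, union, and difference targets from two binary label masks.
import torch

def dual_label_targets(mask_a: torch.Tensor, mask_b: torch.Tensor):
    """mask_a, mask_b: boolean (H, W) label masks for the two input images."""
    intersection = mask_a & mask_b
    union = mask_a | mask_b
    difference = mask_a & ~mask_b          # labels in A but not in B
    return intersection, union, difference

a = torch.rand(256, 256) > 0.5
b = torch.rand(256, 256) > 0.5
inter, uni, diff = dual_label_targets(a, b)
```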
49. EATFormer: Improving Vision Transformer Inspired by Evolutionary Algorithm.
- Author
-
Zhang, Jiangning, Li, Xiangtai, Wang, Yabiao, Wang, Chengjie, Yang, Yibo, Liu, Yong, and Tao, Dacheng
- Subjects
- *
IMAGE recognition (Computer vision) , *TRANSFORMER models , *OBJECT recognition (Computer vision) , *COMPUTER vision , *BIOLOGICAL evolution - Abstract
Motivated by biological evolution, this paper explains the rationality of Vision Transformer by analogy with the proven practical evolutionary algorithm (EA) and derives that both have consistent mathematical formulation. Then, inspired by effective EA variants, we propose a novel pyramid EATFormer backbone that only contains the proposed EA-based transformer (EAT) block, which consists of three residual parts, i.e., multi-scale region aggregation, global and local interaction, and feed-forward network modules, to model multi-scale, interactive, and individual information separately. Moreover, we design a task-related head docked with the transformer backbone to complete final information fusion more flexibly and improve a modulated deformable MSA to dynamically model irregular locations. Extensive quantitative and qualitative experiments on image classification, downstream tasks, and explanatory experiments demonstrate the effectiveness and superiority of our approach over state-of-the-art methods. E.g., our Mobile (1.8 M), Tiny (6.1 M), Small (24.3 M), and Base (49.0 M) models achieve 69.4, 78.4, 83.1, and 83.9 Top-1 accuracy when trained only on ImageNet-1K with a naive training recipe; EATFormer-Tiny/Small/Base armed Mask-R-CNN obtain 45.4/47.4/49.0 box AP and 41.4/42.9/44.2 mask AP on COCO detection, surpassing contemporary MPViT-T, Swin-T, and Swin-S by 0.6/1.4/0.5 box AP and 0.4/1.3/0.9 mask AP separately with less FLOPs; our EATFormer-Small/Base achieve 47.3/49.3 mIoU on ADE20K by UperNet, exceeding Swin-T/S by 2.8/1.7. Code is available at https://github.com/zhangzjn/EATFormer. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
50. Remote Sensing Image Super-Resolution Reconstruction Combining Multi-Scale and Multi-Attention.
- Author
-
熊承义, 郑瑞华, 高志荣, 何缘, and 完颜静萱
- Subjects
TRANSFORMER models ,REMOTE sensing ,HIGH resolution imaging - Abstract
Copyright of Journal of South-Central Minzu University (Natural Science Edition) is the property of Journal of South-Central Minzu University (Natural Science Edition) Editorial Office and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.)
- Published
- 2024
- Full Text
- View/download PDF