Scene representation using a new two-branch neural network model.
- Source :
- Visual Computer. Sep 2024, Vol. 40, Issue 9, p6219-6244. 26p.
- Publication Year :
- 2024
Abstract
- Scene classification and recognition have long been among the most challenging tasks in scene understanding due to the inherent ambiguity of visual scenes. The core of scene classification and recognition is scene representation. Advances in deep learning for computer vision, especially deep CNNs, have significantly improved scene representation over the last decade. Deep convolutional features extracted from deep CNNs provide discriminative representations of images and are widely used in various computer vision tasks, such as scene classification. Deep convolutional features capture the appearance characteristics of an image as well as spatial information about its regions. Meanwhile, semantic and context information obtained from high-level concepts about scene images, such as objects and their relationships, can contribute significantly to identifying scene images. Therefore, in this paper, we divide visual scenes into two categories: object-based and layout-based. Object-based scenes contain scene-specific objects and can be described and identified based on those objects. In contrast, layout-based scenes do not contain scene-specific objects and are described and identified based on the appearance and layout of the image. This paper proposes a new neural network model for representing and classifying visual scenes, which we call G-CNN (GNN-CNN). The proposed model includes two modules, feature extraction and feature fusion; the feature extraction module comprises visual and semantic branches. The visual branch extracts deep CNN features from the image, and the semantic branch extracts semantic GNN features from the scene graph corresponding to the image. The feature fusion module is a novel two-stream neural network that fuses the CNN and GNN feature vectors to produce a comprehensive representation of the scene image. Finally, a fully connected classifier assigns the resulting comprehensive feature vector to one of the pre-defined categories. The proposed model has been evaluated on three benchmark scene datasets, UIUC Sports, MIT67, and SUN397, achieving classification accuracies of 99.91%, 96.01%, and 85.32%, respectively. In addition, a new dataset named Scene40, introduced in our previous paper, is used for further evaluation of the proposed method. Comparisons based on classification accuracy show that the proposed model outperforms the best previous methods on the three benchmark scene datasets. [ABSTRACT FROM AUTHOR]
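The abstract describes a two-branch architecture: a CNN branch for deep visual features, a GNN branch over the image's scene graph, a two-stream fusion module, and a fully connected classifier. The sketch below illustrates that overall structure in PyTorch; the backbone choice (ResNet-18), the mean-aggregation GNN layer, the concatenation-based fusion, and all dimensions are illustrative assumptions, not the authors' published architecture, whose fusion network is described as novel and would differ in detail.

```python
# Minimal sketch of a two-branch (GNN + CNN) scene classifier in the spirit
# of G-CNN. Module names, dimensions, and the fusion scheme are assumptions
# for illustration, not the paper's exact design.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.models as models


class SimpleGNNLayer(nn.Module):
    """One round of mean-aggregation message passing over a scene graph."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, node_feats, adj):
        # adj: (N, N) adjacency (self-loops included); mean-aggregate
        # neighbor features, then apply a shared linear transform.
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1.0)
        agg = (adj @ node_feats) / deg
        return F.relu(self.linear(agg))


class GCNNSketch(nn.Module):
    def __init__(self, node_dim=300, hidden=256, fused=512, num_classes=40):
        super().__init__()
        # Visual branch: CNN backbone producing a global image descriptor.
        backbone = models.resnet18(weights=None)
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])  # drop FC head
        self.cnn_proj = nn.Linear(512, hidden)
        # Semantic branch: GNN over the scene graph (objects as nodes).
        self.gnn1 = SimpleGNNLayer(node_dim, hidden)
        self.gnn2 = SimpleGNNLayer(hidden, hidden)
        # Fusion: concatenation plus projection stands in for the paper's
        # two-stream fusion network (an assumption).
        self.fuse = nn.Sequential(nn.Linear(2 * hidden, fused), nn.ReLU())
        self.classifier = nn.Linear(fused, num_classes)

    def forward(self, image, node_feats, adj):
        v = self.cnn(image).flatten(1)           # (B, 512) CNN features
        v = F.relu(self.cnn_proj(v))             # (B, hidden)
        g = self.gnn2(self.gnn1(node_feats, adj), adj)
        # Mean readout over nodes; one shared graph per batch for simplicity.
        g = g.mean(dim=0, keepdim=True).expand(v.size(0), -1)
        return self.classifier(self.fuse(torch.cat([v, g], dim=1)))


# Usage with dummy data: one image and a 5-node scene graph.
model = GCNNSketch()
img = torch.randn(1, 3, 224, 224)
nodes = torch.randn(5, 300)   # e.g., word embeddings of detected objects
adj = torch.eye(5)            # self-loops only, for illustration
logits = model(img, nodes, adj)
print(logits.shape)           # torch.Size([1, 40])
```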
Details
- Language :
- English
- ISSN :
- 0178-2789
- Volume :
- 40
- Issue :
- 9
- Database :
- Academic Search Index
- Journal :
- Visual Computer
- Publication Type :
- Academic Journal
- Accession Number :
- 179041379
- Full Text :
- https://doi.org/10.1007/s00371-023-03162-9