Author: "Didier Stricker" / Publisher: mdpi ag - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Didier Stricker"' showing total 44 results

1. PLG-ViT: Vision Transformer with Parallel Local and Global Self-Attention

Author: Nikolas Ebert, Didier Stricker, and Oliver Wasenmüller
Subjects: transformer, self-attention, image classification, object detection, semantic segmentation, Chemical technology, TP1-1185
Abstract: Recently, transformer architectures have shown superior performance compared to their CNN counterparts in many computer vision tasks. The self-attention mechanism enables transformer networks to connect visual dependencies over short as well as long distances, thus generating a large, sometimes even a global receptive field. In this paper, we propose our Parallel Local-Global Vision Transformer (PLG-ViT), a general backbone model that fuses local window self-attention with global self-attention. By merging these local and global features, short- and long-range spatial interactions can be effectively and efficiently represented without the need for costly computational operations such as shifted windows. In a comprehensive evaluation, we demonstrate that our PLG-ViT outperforms CNN-based as well as state-of-the-art transformer-based architectures in image classification and in complex downstream tasks such as object detection, instance segmentation, and semantic segmentation. In particular, our PLG-ViT models outperformed similarly sized networks like ConvNeXt and Swin Transformer, achieving Top-1 accuracy values of 83.4%, 84.0%, and 84.5% on ImageNet-1K with 27M, 52M, and 91M parameters, respectively.
Published: 2023
Full Text: View/download PDF

2. Formation of a Lightweight, Deep Learning-Based Weed Detection System for a Commercial Autonomous Laser Weeding Robot

Author: Hafiza Sundus Fatima, Imtiaz ul Hassan, Shehzad Hasan, Muhammad Khurram, Didier Stricker, and Muhammad Zeshan Afzal
Subjects: real-time detection, deep-learning, single-shot detector (SSD) model, light-weight, YOLO, weed dataset, Technology, Engineering (General). Civil engineering (General), TA1-2040, Biology (General), QH301-705.5, Physics, QC1-999, Chemistry, QD1-999
Abstract: Weed management is becoming increasingly important for sustainable crop production. Weeds cause an average yield loss of 11.5% in Pakistan, which amounts to more than PKR 65 billion per year. A real-time laser weeding robot can increase crop yield by efficiently removing weeds, and it helps decrease the environmental risks associated with traditional weed management approaches. However, to work efficiently and accurately, the weeding robot must have a robust weed detection mechanism to avoid physical damage to the targeted crops. This work focuses on developing a lightweight weed detection mechanism to assist laser weeding robots. The weed images were collected from six different agriculture farms in Pakistan. The dataset consisted of 9000 images of three crops (okra, bitter gourd, and sponge gourd) and four weed species (horseweed, herb paris, grasses, and small weeds). We chose a single-shot object detection model, YOLOv5. The selected model achieved a mAP of 0.88 at an IoU threshold of 0.5, indicating that the model predicted a large number of true positives (TP) with far fewer false positives (FP) and false negatives (FN). In contrast, SSD-ResNet50 achieved a mAP of 0.53 at an IoU threshold of 0.5, predicting fewer TP and a significant number of FP or FN. The superior performance of the YOLOv5 model made it suitable for detecting and classifying weeds and crops within fields. Furthermore, the model was ported to an Nvidia Xavier AGX standalone device to make it a high-performance and low-power computation detection system. The model achieved an FPS rate of 27. Therefore, it is highly compatible with the laser weeding robot, which takes approximately 22.04 h at a velocity of 0.25 feet per second to remove weeds from a one-acre plot.
Published: 2023
Full Text: View/download PDF
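
The "mAP at IoU 0.5" figures in the abstract above depend on an intersection-over-union test that decides whether a predicted box counts as a true positive. The following is a minimal sketch of that test only, not the authors' evaluation code. The function names and the (x1, y1, x2, y2) box layout are assumptions made for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap rectangle (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(pred_box, gt_box, threshold=0.5):
    """A prediction matches a ground-truth box when their IoU reaches the threshold."""
    return iou(pred_box, gt_box) >= threshold


# Hypothetical boxes: the prediction is shifted 2 px to the right of the ground truth.
print(is_true_positive((2, 0, 12, 10), (0, 0, 10, 10)))
```

A full mAP computation additionally ranks predictions by confidence and averages precision over recall levels and classes; that part is omitted here.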

3. Development of Cost-Effective and Easily Replicable Robust Weeding Machine—Premiering Precision Agriculture in Pakistan

Author: Azmat Hussain, Hafiza Sundus Fatima, Syed Mohiuddin Zia, Shehzad Hasan, Muhammad Khurram, Didier Stricker, and Muhammad Zeshan Afzal
Subjects: weeding machine, mobile robot, laser weeding, precision agriculture, Computer-Aided Design, Mechanical engineering and machinery, TJ1-1570
Abstract: Weed management has become a highly labor-intensive activity, which is the reason for decreased yields and high costs. Moreover, the lack of skilled labor and weed-resistant herbicides severely impacts the agriculture sector and food production, hence increasing the need for automation in agriculture. The use of agricultural robots will help ensure higher yields and proactive control of the crops. This study proposes a laser-based weeding vehicle with a unique mechanical body that is adjustable to the field structure and a robust control system based on the Robot Operating System (ROS); the design is customizable, cost-effective, and easily replicable. Hence, an autonomous mobile agricultural robot with a 20 watt laser has been developed for the precise removal of weed plants. The assembled robot was tested in the agro living lab. The field trials demonstrated that the robot takes approximately 23.7 h at a linear velocity of 0.07 m/s to weed a one-acre plot. This includes 5 s of laser exposure to kill each weed plant. By comparison, the primitive weeding technique is highly labor intensive and takes several days to complete a one-acre plot. The data presented herein indicate that implementing this technology could become an excellent approach to removing unwanted plants from agricultural fields. This solution is relatively cost-efficient and provides an alternative to expensive human labor in the face of increasing labor wages.
Published: 2023
Full Text: View/download PDF

4. Driving Activity Recognition Using UWB Radar and Deep Neural Networks

Author: Iuliia Brishtel, Stephan Krauss, Mahdi Chamseddine, Jason Raphael Rambach, and Didier Stricker
Subjects: modern radar applications, artificial intelligence and machine learning for radar, radar sensors for driver monitoring, radar signal processing techniques, Chemical technology, TP1-1185
Abstract: In-car activity monitoring is a key enabler of various automotive safety functions. Existing approaches are largely based on vision systems. Radar, however, can provide a low-cost, privacy-preserving alternative. To date, radar-based systems of this kind have not been widely researched. In our work, we introduce a novel approach that uses the Doppler signal of an ultra-wideband (UWB) radar as an input to deep neural networks for the classification of driving activities. In contrast to previous work in the domain, we focus on generalization to unseen persons and make a new radar driving activity dataset (RaDA) available to the scientific community to encourage comparison and the benchmarking of future methods.
Published: 2023
Full Text: View/download PDF

5. INV-Flow2PoseNet: Light-Resistant Rigid Object Pose from Optical Flow of RGB-D Images Using Images, Normals and Vertices

Author: Torben Fetzer, Gerd Reis, and Didier Stricker
Subjects: light resistance, optical flow, scene flow, rigid alignment, point cloud registration, Chemical technology, TP1-1185
Abstract: This paper presents a novel architecture for simultaneous estimation of highly accurate optical flows and rigid scene transformations for difficult scenarios where the brightness assumption is violated by strong shading changes. In the case of rotating objects or moving light sources, such as those encountered for driving cars in the dark, the scene appearance often changes significantly from one view to the next. Unfortunately, standard methods for calculating optical flows or poses are based on the expectation that the appearance of features in the scene remains constant between views. These methods may fail frequently in the investigated cases. The presented method fuses texture and geometry information by combining image, vertex and normal data to compute an illumination-invariant optical flow. By using a coarse-to-fine strategy, globally anchored optical flows are learned, reducing the impact of erroneous shading-based pseudo-correspondences. Based on the learned optical flows, a second architecture is proposed that predicts robust rigid transformations from the warped vertex and normal maps. Particular attention is paid to situations with strong rotations, which often cause such shading changes. Therefore, a 3-step procedure is proposed that profitably exploits correlations between the normals and vertices. The method has been evaluated on a newly created dataset containing both synthetic and real data with strong rotations and shading effects. These data represent the typical use case in 3D reconstruction, where the object often rotates in large steps between the partial reconstructions. Additionally, we apply the method to the well-known KITTI Odometry dataset. Although this is not the typical use case of the method, because the brightness assumption is fulfilled, it establishes the applicability to standard situations and the relation to other methods.
Published: 2022
Full Text: View/download PDF

6. Unsupervised Image-to-Image Translation: A Review

Author: Henri Hoyez, Cédric Schockaert, Jason Rambach, Bruno Mirbach, and Didier Stricker
Subjects: unsupervised image-to-image translation, machine learning, computer vision, deep learning, generative adversarial networks, review, Chemical technology, TP1-1185
Abstract: Supervised image-to-image translation has been proven to generate realistic images with sharp details and to have good quantitative performance. Such methods are trained on a paired dataset, where an image from the source domain already has a corresponding translated image in the target domain. However, this paired dataset requirement imposes a huge practical constraint, requires domain knowledge or is even impossible to obtain in certain cases. Due to these problems, unsupervised image-to-image translation has been proposed, which does not require domain expertise and can take advantage of a large unlabeled dataset. Although such models perform well, they are hard to train due to the major constraints induced in their loss functions, which make training unstable. Since CycleGAN was released, numerous methods have been proposed that try to address various problems from different perspectives. In this review, we first describe the general image-to-image translation framework and discuss the datasets and metrics involved in the topic. Furthermore, we review the current state-of-the-art with a classification of existing works. This part is followed by a small quantitative evaluation, for which results were taken from papers.
Published: 2022
Full Text: View/download PDF

7. Attention-Guided Disentangled Feature Aggregation for Video Object Detection

Author: Shishir Muralidhara, Khurram Azeem Hashmi, Alain Pagani, Marcus Liwicki, Didier Stricker, and Muhammad Zeshan Afzal
Subjects: object detection, video object detection, attention, computer vision, deep learning, Chemical technology, TP1-1185
Abstract: Object detection is a computer vision task that involves localisation and classification of objects in an image. Video data implicitly introduces several challenges, such as blur, occlusion and defocus, making video object detection more challenging in comparison to still image object detection, which is performed on individual and independent images. This paper tackles these challenges by proposing an attention-heavy framework for video object detection that aggregates the disentangled features extracted from individual frames. The proposed framework is a two-stage object detector based on the Faster R-CNN architecture. The disentanglement head integrates scale, spatial and task-aware attention and applies it to the features extracted by the backbone network across all the frames. Subsequently, the aggregation head incorporates temporal attention and improves detection in the target frame by aggregating the features of the support frames. These include the features extracted from the disentanglement network along with the temporal features. We evaluate the proposed framework using the ImageNet VID dataset and achieve a mean Average Precision (mAP) of 49.8 and 52.5 using the backbones of ResNet-50 and ResNet-101, respectively. The improvement in performance over the individual baseline methods validates the efficacy of the proposed approach.
Published: 2022
Full Text: View/download PDF

8. Rethinking Learnable Proposals for Graphical Object Detection in Scanned Document Images

Author: Sankalp Sinha, Khurram Azeem Hashmi, Alain Pagani, Marcus Liwicki, Didier Stricker, and Muhammad Zeshan Afzal
Subjects: graphical page object detection, deep learning, computer vision, proposals, document image analysis, Technology, Engineering (General). Civil engineering (General), TA1-2040, Biology (General), QH301-705.5, Physics, QC1-999, Chemistry, QD1-999
Abstract: In the age of deep learning, researchers have looked at domain adaptation under the pre-training and fine-tuning paradigm to leverage the gains in the natural image domain. These backbones and subsequent networks are designed for object detection in the natural image domain. They do not consider some of the critical characteristics of document images. Document images are sparse in contextual information, and the graphical page objects are logically clustered. This paper investigates the effectiveness of deep and robust backbones in the document image domain. Further, it explores the idea of learnable object proposals through Sparse R-CNN. This paper shows that simple domain adaptation of top-performing object detectors to the document image domain does not lead to better results. Furthermore, it shows empirically that detectors based on dense object priors, such as Faster R-CNN, Mask R-CNN, and Cascade Mask R-CNN, are perhaps not best suited for graphical page object detection. Detectors that reduce the number of object candidates while making them learnable are a step towards a better approach. We formulate and evaluate the Sparse R-CNN (SR-CNN) model on the IIIT-AR-13k, PubLayNet, and DocBank datasets and hope to inspire a rethinking of object proposals in the domain of graphical page object detection.
Published: 2022
Full Text: View/download PDF

9. Advanced Scene Perception for Augmented Reality

Author: Jason Rambach and Didier Stricker
Subjects: n/a, Photography, TR1-1050, Computer applications to medicine. Medical informatics, R858-859.7, Electronic computers. Computer science, QA75.5-76.95
Abstract: Augmented reality (AR), combining virtual elements with the real world, has demonstrated impressive results in a variety of application fields and gained significant research attention in recent years due to its limitless potential [...]
Published: 2022
Full Text: View/download PDF

10. Continual Learning for Table Detection in Document Images

Author: Mohammad Minouei, Khurram Azeem Hashmi, Mohammad Reza Soheili, Muhammad Zeshan Afzal, and Didier Stricker
Subjects: table detection, document layout analysis, continual learning, incremental learning, experience replay, Technology, Engineering (General). Civil engineering (General), TA1-2040, Biology (General), QH301-705.5, Physics, QC1-999, Chemistry, QD1-999
Abstract: The growing amount of data demands methods that can gradually learn from new samples. However, it is not trivial to continually train a network. Retraining a network with new data usually results in a phenomenon called “catastrophic forgetting”. In a nutshell, the performance of the model on the previous data drops by learning from the new instances. This paper explores this issue in the table detection problem. While there are multiple datasets and sophisticated methods for table detection, the utilization of continual learning techniques in this domain has not been studied. We employed an effective technique called experience replay and performed extensive experiments on several datasets to investigate the effects of catastrophic forgetting. The results show that our proposed approach mitigates the performance drop by 15 percent. To the best of our knowledge, this is the first time that continual learning techniques have been adopted for table detection, and we hope this stands as a baseline for future research.
Published: 2022
Full Text: View/download PDF
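
The experience replay technique named in the abstract above keeps a small memory of earlier samples and mixes them into each batch of new data, so that the model keeps seeing old examples while it learns new ones. The following is a minimal sketch of that idea, not the authors' implementation. The abstract does not say how the memory is filled, so the reservoir-sampling policy is an assumption, as are the `ReplayBuffer` class and its method names.

```python
import random


class ReplayBuffer:
    """Fixed-size memory of past samples, filled by reservoir sampling (assumed policy)."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.samples = []
        self.seen = 0  # total number of samples offered to the buffer so far
        self.rng = random.Random(seed)

    def add(self, sample):
        """Offer one sample; every sample seen so far is kept with equal probability."""
        self.seen += 1
        if len(self.samples) < self.capacity:
            self.samples.append(sample)
        else:
            slot = self.rng.randrange(self.seen)
            if slot < self.capacity:
                self.samples[slot] = sample

    def mixed_batch(self, new_samples, replay_count):
        """Return the new samples followed by up to replay_count stored old samples."""
        k = min(replay_count, len(self.samples))
        return list(new_samples) + self.rng.sample(self.samples, k)


# Hypothetical usage: store samples from an earlier dataset, then train on new data.
buffer = ReplayBuffer(capacity=3)
for old_page in range(10):
    buffer.add(("old_dataset", old_page))
batch = buffer.mixed_batch([("new_dataset", 0), ("new_dataset", 1)], replay_count=2)
print(len(batch))
```

In a training loop, each mixed batch would be passed to the detector in place of a batch drawn only from the new dataset.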

11. A Comprehensive Survey of Depth Completion Approaches

Author: Muhammad Ahmed Ullah Khan, Danish Nazir, Alain Pagani, Hamam Mokayed, Marcus Liwicki, Didier Stricker, and Muhammad Zeshan Afzal
Subjects: depth completion, depth maps, image-guidance, Chemical technology, TP1-1185
Abstract: Depth maps produced by LiDAR-based approaches are sparse. Even high-end LiDAR sensors produce highly sparse depth maps, which are also noisy around the object boundaries. Depth completion is the task of generating a dense depth map from a sparse depth map. While the earlier approaches focused on directly completing this sparsity from the sparse depth maps, modern techniques use RGB images as a guidance tool to resolve this problem, whilst many others rely on affinity matrices for depth completion. Based on these approaches, we have divided the literature into two major categories: unguided methods and image-guided methods. The latter is further subdivided into multi-branch and spatial propagation networks. The multi-branch networks further have a sub-category named image-guided filtering. In this paper, for the first time, we present a comprehensive survey of depth completion methods. We present a novel taxonomy of depth completion approaches, review in detail different state-of-the-art techniques within each category for depth completion of LiDAR data, and provide quantitative results for the approaches on KITTI and NYUv2 depth completion benchmark datasets.
Published: 2022
Full Text: View/download PDF

12. Mask-Aware Semi-Supervised Object Detection in Floor Plans

Author: Tahira Shehzadi, Khurram Azeem Hashmi, Alain Pagani, Marcus Liwicki, Didier Stricker, and Muhammad Zeshan Afzal
Subjects: object detection, semi-supervised learning, Mask R-CNN, floor-plan images, computer vision, Technology, Engineering (General). Civil engineering (General), TA1-2040, Biology (General), QH301-705.5, Physics, QC1-999, Chemistry, QD1-999
Abstract: Research has been growing on object detection using semi-supervised methods in the past few years. We examine the intersection of these two areas for floor-plan objects to promote the research objective of detecting more accurate objects with less labeled data. The floor-plan objects include different furniture items with multiple types of the same class, and this high inter-class similarity impacts the performance of prior methods. In this paper, we present a Mask R-CNN-based semi-supervised approach that provides pixel-to-pixel alignment to generate individual annotation masks for each class to mine the inter-class similarity. The semi-supervised approach has a student–teacher network that pulls information from the teacher network and feeds it to the student network. The teacher network uses unlabeled data to form pseudo-boxes, and the student network uses both unlabeled data with the pseudo-boxes and labeled data with the ground truth for training. It learns representations of furniture items by combining labeled and unlabeled data. On the Mask R-CNN detector with ResNet-101 backbone network, the proposed approach achieves a mAP of 98.8%, 99.7%, and 99.8% with only 1%, 5% and 10% labeled data, respectively. Our experiment affirms the efficiency of the proposed approach, as it outperforms the previous semi-supervised approaches using only 1% of the labels.
Published: 2022
Full Text: View/download PDF

13. Three-Dimensional Reconstruction from a Single RGB Image Using Deep Learning: A Review

Author: Muhammad Saif Ullah Khan, Alain Pagani, Marcus Liwicki, Didier Stricker, and Muhammad Zeshan Afzal
Subjects: deep learning, 3D reconstruction, convolutional neural networks, textureless surfaces, Photography, TR1-1050, Computer applications to medicine. Medical informatics, R858-859.7, Electronic computers. Computer science, QA75.5-76.95
Abstract: Performing 3D reconstruction from a single 2D input is a challenging problem that is trending in literature. Until recently, it was an ill-posed optimization problem, but with the advent of learning-based methods, the performance of 3D reconstruction has also significantly improved. Infinitely many different 3D objects can be projected onto the same 2D plane, which makes the reconstruction task very difficult. It is even more difficult for objects with complex deformations or no textures. This paper serves as a review of recent literature on 3D reconstruction from a single view, with a focus on deep learning methods from 2018 to 2021. Due to the lack of standard datasets or 3D shape representation methods, it is hard to compare all reviewed methods directly. However, this paper reviews different approaches for reconstructing 3D shapes as depth maps, surface normals, point clouds, and meshes; along with various loss functions and metrics used to train and evaluate these methods.
Published: 2022
Full Text: View/download PDF

14. Investigating Attention Mechanism for Page Object Detection in Document Images

Author: Shivam Naik, Khurram Azeem Hashmi, Alain Pagani, Marcus Liwicki, Didier Stricker, and Muhammad Zeshan Afzal
Subjects: attention mechanism, page object detection, transfer learning, document image analysis, Technology, Engineering (General). Civil engineering (General), TA1-2040, Biology (General), QH301-705.5, Physics, QC1-999, Chemistry, QD1-999
Abstract: Page object detection in scanned document images is a complex task due to varying document layouts and diverse page objects. In the past, traditional methods such as Optical Character Recognition (OCR)-based techniques have been employed to extract textual information. However, these methods fail to comprehend complex page objects such as tables and figures. This paper addresses the localization and classification of graphical objects that visually summarize vital information in documents. Furthermore, this work examines the benefit of incorporating attention mechanisms in different object detection networks to perform page object detection on scanned document images. The model is designed with a PyTorch-based framework called Detectron2. The proposed pipelines can be optimized end-to-end and are exhaustively evaluated on publicly available datasets such as DocBank, PubLayNet, and IIIT-AR-13K. The achieved results reflect the effectiveness of incorporating the attention mechanism for page object detection in documents.
Published: 2022
Full Text: View/download PDF

15. Toward Semi-Supervised Graphical Object Detection in Document Images

Author: Goutham Kallempudi, Khurram Azeem Hashmi, Alain Pagani, Marcus Liwicki, Didier Stricker, and Muhammad Zeshan Afzal
Subjects: graphical page objects, object detection, document image analysis, semi-supervised, soft teacher, Information technology, T58.5-58.64
Abstract: Graphical page object detection classifies and localizes objects such as Tables and Figures in a document. As deep learning techniques for object detection become increasingly successful, many supervised deep neural network-based methods have been introduced to recognize graphical objects in documents. However, these models necessitate a substantial amount of labeled data for the training process. This paper presents an end-to-end semi-supervised framework for graphical object detection in scanned document images to address this limitation. Our method is based on a recently proposed Soft Teacher mechanism that examines the effects of small percentage-labeled data on the classification and localization of graphical objects. On both the PubLayNet and the IIIT-AR-13K datasets, the proposed approach outperforms the supervised models by a significant margin in all labeling ratios (1%, 5%, and 10%). Furthermore, the 10% PubLayNet Soft Teacher model improves the average precision of Table, Figure, and List by +5.4, +1.2, and +3.2 points, respectively, with a similar total mAP as the Faster-RCNN baseline. Moreover, our model trained on 10% of IIIT-AR-13K labeled data beats the previous fully supervised method by +4.5 points.
Published: 2022
Full Text: View/download PDF

16. Autoencoder and Partially Impossible Reconstruction Losses

Author: Steve Dias Da Cruz, Bertram Taetz, Thomas Stifter, and Didier Stricker
Subjects: autoencoder, generalization, Sim2Real, illumination, reconstruction, sampling, Chemical technology, TP1-1185
Abstract: The generally unsupervised nature of autoencoder models implies that the main training metric is formulated as the error between input images and their corresponding reconstructions. Different reconstruction loss variations and latent space regularizations have been shown to improve model performances depending on the tasks to solve and to induce new desirable properties such as disentanglement. Nevertheless, measuring the success in, or enforcing properties by, the input pixel space is a challenging endeavour. In this work, we want to make use of the available data more efficiently and provide design choices to be considered in the recording or generation of future datasets to implicitly induce desirable properties during training. To this end, we propose a new sampling technique which matches semantically important parts of the image while randomizing the other parts, leading to salient feature extraction and a neglect of unimportant details. The proposed method can be combined with any existing reconstruction loss and the performance gain is superior to the triplet loss. We analyse the resulting properties on various datasets and show improvements on several computer vision tasks: illumination and unwanted features can be normalized or smoothed out and shadows are removed such that classification or other tasks work more reliably; better invariance with respect to unwanted features is induced; the generalization capacity from synthetic to real images is improved, such that more of the semantics are preserved; uncertainty estimation is superior to Monte Carlo Dropout and an ensemble of models, particularly for datasets of higher visual complexity. Finally, classification accuracy by means of simple linear classifiers in the latent space is improved compared to the triplet loss. For each task, the improvements are highlighted on several datasets commonly used by the research community, as well as in automotive applications.
Published: 2022
Full Text: View/download PDF

17. Exploiting Concepts of Instance Segmentation to Boost Detection in Challenging Environments

Author: Khurram Azeem Hashmi, Alain Pagani, Marcus Liwicki, Didier Stricker, and Muhammad Zeshan Afzal
Subjects: object detection, challenging environments, low-light, complex environments, deep neural networks, computer vision, Chemical technology, TP1-1185
Abstract: In recent years, due to the advancements in machine learning, object detection has become a mainstream task in the computer vision domain. The first phase of object detection is to find the regions where objects can exist. With the improvements in deep learning, traditional approaches, such as sliding windows and manual feature selection techniques, have been replaced with deep learning techniques. However, object detection algorithms face a problem when performed in low light, challenging weather, and crowded scenes, similar to any other task. Such an environment is termed a challenging environment. This paper exploits pixel-level information to improve detection under challenging situations. To this end, we exploit the recently proposed hybrid task cascade network. This network works collaboratively with detection and segmentation heads at different cascade levels. We evaluate the proposed methods on three complex datasets of ExDark, CURE-TSD, and RESIDE, and achieve a mAP of 0.71, 0.52, and 0.43, respectively. Our experimental results assert the efficacy of the proposed approach.
Published: 2022
Full Text: View/download PDF

18. TIMo—A Dataset for Indoor Building Monitoring with a Time-of-Flight Camera

Author: Pascal Schneider, Yuriy Anisimov, Raisul Islam, Bruno Mirbach, Jason Rambach, Didier Stricker, and Frédéric Grandidier
Subjects: time-of-flight, depth imaging, person detection, anomaly detection, dataset, machine learning, Chemical technology, TP1-1185
Abstract: We present TIMo (Time-of-flight Indoor Monitoring), a dataset for video-based monitoring of indoor spaces captured using a time-of-flight (ToF) camera. The resulting depth videos feature people performing a set of different predefined actions, for which we provide detailed annotations. Person detection for people counting and anomaly detection are the two targeted applications. Most existing surveillance video datasets provide either grayscale or RGB videos. Depth information, on the other hand, is still a rarity in this class of datasets in spite of being popular and much more common in other research fields within computer vision. Our dataset addresses this gap in the landscape of surveillance video datasets. The recordings took place at two different locations with the ToF camera set up either in a top-down or a tilted perspective on the scene. Moreover, we provide experimental evaluation results from baseline algorithms.
Published: 2022
Full Text: View/download PDF

19. AnyGesture: Arbitrary One-Handed Gestures for Augmented, Virtual, and Mixed Reality Applications

Author: Alexander Schäfer, Gerd Reis, and Didier Stricker
Subjects: gestures, interaction, natural user interface, gestural input, freehand, hands-free, Technology, Engineering (General). Civil engineering (General), TA1-2040, Biology (General), QH301-705.5, Physics, QC1-999, Chemistry, QD1-999
Abstract: Natural user interfaces based on hand gestures are becoming increasingly popular. The need for expensive hardware left a wide range of interaction possibilities that hand tracking enables largely unexplored. Recently, hand tracking has been built into inexpensive and widely available hardware, allowing more and more people access to this technology. This work provides researchers and users with a simple yet effective way to implement various one-handed gestures to enable deeper exploration of gesture-based interactions and interfaces. To this end, this work provides a framework for design, prototyping, testing, and implementation of one-handed gestures. The proposed framework was implemented with two main goals: First, it should be able to recognize any one-handed gesture. Secondly, the design and implementation of gestures should be as simple as performing the gesture and pressing a button to record it. The contribution of this paper is a simple yet unique way to record and recognize static and dynamic one-handed gestures. A static gesture can be captured with a template matching approach, while dynamic gestures use previously captured spatial information. The presented approach was evaluated in a user study with 33 participants and the implementable gestures received high accuracy and user acceptance.
Published: 2022
Full Text: View/download PDF

20. EmmDocClassifier: Efficient Multimodal Document Image Classifier for Scarce Data

Author: Shrinidhi Kanchi, Alain Pagani, Hamam Mokayed, Marcus Liwicki, Didier Stricker, and Muhammad Zeshan Afzal
Subjects: BERT, document image classification, EfficientNet, fine-tuned BERT, hierarchical attention networks, Multimodal, Technology, Engineering (General). Civil engineering (General), TA1-2040, Biology (General), QH301-705.5, Physics, QC1-999, Chemistry, QD1-999
Abstract: Document classification is one of the most critical steps in the document analysis pipeline. There are two types of approaches for document classification, known as image-based and multimodal approaches. Image-based document classification approaches are solely based on the inherent visual cues of the document images. In contrast, the multimodal approach co-learns the visual and textual features, and it has proved to be more effective. Nonetheless, these approaches require a huge amount of data. This paper presents a novel approach for document classification that works with a small amount of data and outperforms other approaches. The proposed approach incorporates a hierarchical attention network (HAN) for the textual stream and the EfficientNet-B0 for the image stream. The hierarchical attention network in the textual stream uses dynamic word embedding through fine-tuned BERT. HAN incorporates both the word level and sentence level features. While earlier approaches rely on training on a large corpus (RVL-CDIP), we show that our approach works with a small amount of data (Tobacco-3482). To this end, we trained the neural network on Tobacco-3482 from scratch. Therefore, we outperform the state-of-the-art by obtaining an accuracy of 90.3%. This results in a relative error reduction rate of 7.9%.
Published: 2022
Full Text: View/download PDF

21. Nonlinear Optimization of Light Field Point Cloud

Author: Yuriy Anisimov, Jason Raphael Rambach, and Didier Stricker
Subjects: light field, depth estimation, point cloud, Chemical technology, TP1-1185
Abstract: The problem of accurate three-dimensional reconstruction is important for many research and industrial applications. Light field depth estimation utilizes many observations of the scene and hence can provide accurate reconstruction. We present a method that enhances an existing reconstruction algorithm with per-layer disparity filtering and consistency-based hole filling. In addition, we reformulate the reconstruction result as a point cloud from different light field viewpoints and propose a non-linear optimization of it. The capability of our method to reconstruct scenes with acceptable quality was verified by evaluation on a publicly available dataset.
Published: 2022
Full Text: View/download PDF

22. Towards Robust Object Detection in Floor Plan Images: A Data Augmentation Approach

Author: Shashank Mishra, Khurram Azeem Hashmi, Alain Pagani, Marcus Liwicki, Didier Stricker, and Muhammad Zeshan Afzal
Subjects: object detection, Cascade Mask R-CNN, floor plan images, deep learning, transfer learning, dataset augmentation, Technology, Engineering (General). Civil engineering (General), TA1-2040, Biology (General), QH301-705.5, Physics, QC1-999, Chemistry, QD1-999
Abstract: Object detection is one of the most critical tasks in the field of Computer vision. This task comprises identifying and localizing an object in the image. Architectural floor plans represent the layout of buildings and apartments. The floor plans consist of walls, windows, stairs, and other furniture objects. While recognizing floor plan objects is straightforward for humans, automatically processing floor plans and recognizing objects is challenging. In this work, we investigate the performance of the recently introduced Cascade Mask R-CNN network to solve object detection in floor plan images. Furthermore, we experimentally establish that deformable convolution works better than conventional convolutions in the proposed framework. Prior datasets for object detection in floor plan images are either publicly unavailable or contain few samples. We introduce SFPI, a novel synthetic floor plan dataset consisting of 10,000 images to address this issue. Our proposed method conveniently exceeds the previous state-of-the-art results on the SESYD dataset with an mAP of 98.1%. Moreover, it sets impressive baseline results on our novel SFPI dataset with an mAP of 99.8%. We believe that introducing the modern dataset enables the researcher to enhance the research in this domain.
Published: 2021
Full Text: View/download PDF

23. Contrastive Learning for 3D Point Clouds Classification and Shape Completion

Author: Danish Nazir, Muhammad Zeshan Afzal, Alain Pagani, Marcus Liwicki, and Didier Stricker
Subjects: point cloud classification, point cloud shape completion, AutoEncoders, contrastive AutoEncoders, contrastive learning for point clouds, self-supervised learning for point cloud shape completion, Chemical technology, TP1-1185
Abstract: In this paper, we present the idea of self-supervised learning for the shape completion and classification of point clouds. Most 3D shape completion pipelines utilize AutoEncoders to extract features from point clouds used in downstream tasks such as classification, segmentation, detection, and other related applications. Our idea is to add contrastive learning into AutoEncoders to encourage global feature learning of the point cloud classes. It is performed by optimizing triplet loss. Furthermore, local feature representations of the point cloud are learned by adding the Chamfer distance function. To evaluate the performance of our approach, we utilize the PointNet classifier. We also extend the number of classes for evaluation from 4 to 10 to show the generalization ability of the learned features. Based on our results, embeddings generated from the contrastive AutoEncoder enhance shape completion and classification performance on point clouds from 84.2% to 84.9%, achieving state-of-the-art results with 10 classes.
Published: 2021
Full Text: View/download PDF

24. To Drive or to Be Driven? The Impact of Autopilot, Navigation System, and Printed Maps on Driver’s Cognitive Workload and Spatial Knowledge

Author: Iuliia Brishtel, Thomas Schmidt, Igor Vozniak, Jason Raphael Rambach, Bruno Mirbach, and Didier Stricker
Subjects: spatial cognition, spatial knowledge, cognitive workload, electrodermal activity, autonomous driving, human navigation, Geography (General), G1-922
Abstract: The technical advances in navigation systems should enhance the driving experience, supporting drivers’ spatial decision making and learning in less familiar or unfamiliar environments. Furthermore, autonomous driving systems are expected to take over navigation and driving in the near future. Yet, previous studies pointed at a still unresolved gap between environmental exploration using topographical maps and technical navigation means. Less is known about the impact of the autonomous system on the driver’s spatial learning. The present study investigates the development of spatial knowledge and cognitive workload by comparing printed maps, navigation systems, and autopilot in an unfamiliar virtual environment. Learning of a new route with printed maps was associated with a higher cognitive demand compared to the navigation system and autopilot. In contrast, driving a route by memory resulted in an increased level of cognitive workload if the route had been previously learned with the navigation system or autopilot. Way-finding performance was found to be less prone to errors when learning a route from a printed map. The exploration of the environment with the autopilot was not found to provide any compelling advantages for landmark knowledge. Our findings suggest long-term disadvantages of self-driving vehicles for spatial memory representations.
Published: 2021
Full Text: View/download PDF

25. CasTabDetectoRS: Cascade Network for Table Detection in Document Images with Recursive Feature Pyramid and Switchable Atrous Convolution

Author: Khurram Azeem Hashmi, Alain Pagani, Marcus Liwicki, Didier Stricker, and Muhammad Zeshan Afzal
Subjects: table detection, table recognition, cascade Mask R-CNN, atrous convolution, recursive feature pyramid networks, document image analysis, Photography, TR1-1050, Computer applications to medicine. Medical informatics, R858-859.7, Electronic computers. Computer science, QA75.5-76.95
Abstract: Table detection is a preliminary step in extracting reliable information from tables in scanned document images. We present CasTabDetectoRS, a novel end-to-end trainable table detection framework that operates on Cascade Mask R-CNN, including Recursive Feature Pyramid network and Switchable Atrous Convolution in the existing backbone architecture. By utilizing a comparatively lightweight backbone of ResNet-50, this paper demonstrates that superior results are attainable without relying on pre- and post-processing methods, heavier backbone networks (ResNet-101, ResNeXt-152), and memory-intensive deformable convolutions. We evaluate the proposed approach on five different publicly available table detection datasets. Our CasTabDetectoRS outperforms the previous state-of-the-art results on four datasets (ICDAR-19, TableBank, UNLV, and Marmot) and accomplishes comparable results on ICDAR-17 POD. Upon comparing with previous state-of-the-art results, we obtain a significant relative error reduction of 56.36%, 20%, 4.5%, and 3.5% on the datasets of ICDAR-19, TableBank, UNLV, and Marmot, respectively. Furthermore, this paper sets a new benchmark by performing exhaustive cross-datasets evaluations to exhibit the generalization capabilities of the proposed method.
Published: 2021
Full Text: View/download PDF

26. HybridTabNet: Towards Better Table Detection in Scanned Document Images

Author: Danish Nazir, Khurram Azeem Hashmi, Alain Pagani, Marcus Liwicki, Didier Stricker, and Muhammad Zeshan Afzal
Subjects: table detection, table localization, deep learning, hybrid task cascade, object detection, deformable convolution, Technology, Engineering (General). Civil engineering (General), TA1-2040, Biology (General), QH301-705.5, Physics, QC1-999, Chemistry, QD1-999
Abstract: Tables in document images are an important entity since they contain crucial information. Therefore, accurate table detection can significantly improve the information extraction from documents. In this work, we present a novel end-to-end trainable pipeline, HybridTabNet, for table detection in scanned document images. Our two-stage table detector uses the ResNeXt-101 backbone for feature extraction and Hybrid Task Cascade (HTC) to localize the tables in scanned document images. Moreover, we replace conventional convolutions with deformable convolutions in the backbone network. This enables our network to detect tables of arbitrary layouts precisely. We evaluate our approach comprehensively on ICDAR-13, ICDAR-17 POD, ICDAR-19, TableBank, Marmot, and UNLV. Apart from the ICDAR-17 POD dataset, our proposed HybridTabNet outperformed earlier state-of-the-art results without depending on pre- and post-processing steps. Furthermore, to investigate how the proposed method generalizes unseen data, we conduct an exhaustive leave-one-out-evaluation. In comparison to prior state-of-the-art results, our method reduced the relative error by 27.57% on ICDAR-2019-TrackA-Modern, 42.64% on TableBank (Latex), 41.33% on TableBank (Word), 55.73% on TableBank (Latex + Word), 10% on Marmot, and 9.67% on the UNLV dataset. The achieved results reflect the superior performance of the proposed method.
Published: 2021
Full Text: View/download PDF

27. Cascade Network with Deformable Composite Backbone for Formula Detection in Scanned Document Images

Author: Khurram Azeem Hashmi, Alain Pagani, Marcus Liwicki, Didier Stricker, and Muhammad Zeshan Afzal
Subjects: formula detection, Cascade Mask R-CNN, mathematical expression detection, document image analysis, deep neural networks, computer vision, Technology, Engineering (General). Civil engineering (General), TA1-2040, Biology (General), QH301-705.5, Physics, QC1-999, Chemistry, QD1-999
Abstract: This paper presents a novel architecture for detecting mathematical formulas in document images, which is an important step for reliable information extraction in several domains. Recently, Cascade Mask R-CNN networks have been introduced to solve object detection in computer vision. In this paper, we suggest a couple of modifications to the existing Cascade Mask R-CNN architecture: First, the proposed network uses deformable convolutions instead of conventional convolutions in the backbone network to spot areas of interest better. Second, it uses a dual backbone of ResNeXt-101, having composite connections at the parallel stages. Finally, our proposed network is end-to-end trainable. We evaluate the proposed approach on the ICDAR-2017 POD and Marmot datasets. The proposed approach demonstrates state-of-the-art performance on ICDAR-2017 POD at a higher IoU threshold with an f1-score of 0.917, reducing the relative error by 7.8%. Moreover, we accomplished correct detection accuracy of 81.3% on embedded formulas on the Marmot dataset, which results in a relative error reduction of 30%.
Published: 2021
Full Text: View/download PDF

28. Survey and Performance Analysis of Deep Learning Based Object Detection in Challenging Environments

Author: Muhammad Ahmed, Khurram Azeem Hashmi, Alain Pagani, Marcus Liwicki, Didier Stricker, and Muhammad Zeshan Afzal
Subjects: object detection, challenging environments, low light, image enhancement, complex environments, state of the art, Chemical technology, TP1-1185
Abstract: Recent progress in deep learning has led to accurate and efficient generic object detection networks. Training of highly reliable models depends on large datasets with highly textured and rich images. However, in real-world scenarios, the performance of the generic object detection system decreases when (i) occlusions hide the objects, (ii) objects are present in low-light images, or (iii) they are merged with background information. In this paper, we refer to all these situations as challenging environments. With the recent rapid development in generic object detection algorithms, notable progress has been observed in the field of deep learning-based object detection in challenging environments. However, there is no consolidated reference to cover the state of the art in this domain. To the best of our knowledge, this paper presents the first comprehensive overview, covering recent approaches that have tackled the problem of object detection in challenging environments. Furthermore, we present a quantitative and qualitative performance analysis of these approaches and discuss the currently available challenging datasets. Moreover, this paper investigates the performance of current state-of-the-art generic object detection algorithms by benchmarking results on the three well-known challenging datasets. Finally, we highlight several current shortcomings and outline future directions.
Published: 2021
Full Text: View/download PDF

29. A Survey of Graphical Page Object Detection with Deep Neural Networks

Author: Jwalin Bhatt, Khurram Azeem Hashmi, Muhammad Zeshan Afzal, and Didier Stricker
Subjects: deep neural network, document images, review paper, deep learning, performance evaluation, page object detection, Technology, Engineering (General). Civil engineering (General), TA1-2040, Biology (General), QH301-705.5, Physics, QC1-999, Chemistry, QD1-999
Abstract: In any document, graphical elements like tables, figures, and formulas contain essential information. The processing and interpretation of such information require specialized algorithms. Off-the-shelf OCR components cannot process this information reliably. Therefore, an essential step in document analysis pipelines is to detect these graphical components. It leads to a high-level conceptual understanding of the documents that makes the digitization of documents viable. Since the advent of deep learning, deep learning-based object detection performance has improved manyfold. This work outlines and summarizes the deep learning approaches for detecting graphical page objects in document images. Therefore, we discuss the most relevant deep learning-based approaches and state-of-the-art graphical page object detection in document images. This work provides a comprehensive understanding of the current state-of-the-art and related challenges. Furthermore, we discuss leading datasets along with the quantitative evaluation. Moreover, we briefly discuss the promising directions that can be utilized for further improvements.
Published: 2021
Full Text: View/download PDF

30. From IR Images to Point Clouds to Pose: Point Cloud-Based AR Glasses Pose Estimation

Author: Ahmet Firintepe, Carolin Vey, Stylianos Asteriadis, Alain Pagani, and Didier Stricker
Subjects: computer vision, augmented reality, object pose estimation, point clouds, deep learning, Photography, TR1-1050, Computer applications to medicine. Medical informatics, R858-859.7, Electronic computers. Computer science, QA75.5-76.95
Abstract: In this paper, we propose two novel AR glasses pose estimation algorithms from single infrared images by using 3D point clouds as an intermediate representation. Our first approach “PointsToRotation” is based on a Deep Neural Network alone, whereas our second approach “PointsToPose” is a hybrid model combining Deep Learning and a voting-based mechanism. Our methods utilize a point cloud estimator, which we trained on multi-view infrared images in a semi-supervised manner, generating point clouds based on one image only. We generate a point cloud dataset with our point cloud estimator using the HMDPose dataset, consisting of multi-view infrared images of various AR glasses with the corresponding 6-DoF poses. In comparison to another point cloud-based 6-DoF pose estimation named CloudPose, we achieve an error reduction of around 50%. Compared to a state-of-the-art image-based method, we reduce the pose estimation error by around 96%.
Published: 2021
Full Text: View/download PDF

31. SynPo-Net—Accurate and Fast CNN-Based 6DoF Object Pose Estimation Using Synthetic Training

Author: Yongzhi Su, Jason Rambach, Alain Pagani, and Didier Stricker
Subjects: object pose estimation, convolutional neural networks, training with synthetic images, deep learning, domain adaptation, 6DoF object pose, Chemical technology, TP1-1185
Abstract: Estimation and tracking of 6DoF poses of objects in images is a challenging problem of great importance for robotic interaction and augmented reality. Recent approaches applying deep neural networks for pose estimation have shown encouraging results. However, most of them rely on training with real images of objects with severe limitations concerning ground truth pose acquisition, full coverage of possible poses, and training dataset scaling and generalization capability. This paper presents a novel approach using a Convolutional Neural Network (CNN) trained exclusively on single-channel Synthetic images of objects to regress 6DoF object Poses directly (SynPo-Net). The proposed SynPo-Net is a network architecture specifically designed for pose regression and a proposed domain adaptation scheme transforming real and synthetic images into an intermediate domain that is better fit for establishing correspondences. The extensive evaluation shows that our approach significantly outperforms the state-of-the-art using synthetic training in terms of both accuracy and speed. Our system can be used to estimate the 6DoF pose from a single frame, or be integrated into a tracking system to provide the initial pose.
Published: 2021
Full Text: View/download PDF

32. Real-Time Energy Efficient Hand Pose Estimation: A Case Study

Author: Mhd Rashed Al Koutayni, Vladimir Rybalkin, Jameel Malik, Ahmed Elhayek, Christian Weis, Gerd Reis, Norbert Wehn, and Didier Stricker
Subjects: hardware architecture, FPGA, Zynq, UltraScale+, HLS, PyTorch, Chemical technology, TP1-1185
Abstract: The estimation of human hand pose has become the basis for many vital applications where the user depends mainly on the hand pose as a system input. Virtual reality (VR) headsets, the shadow dexterous hand, and in-air signature verification are a few examples of applications that require tracking hand movements in real time. The state-of-the-art 3D hand pose estimation methods are based on the Convolutional Neural Network (CNN). These methods are implemented on Graphics Processing Units (GPUs) mainly due to their extensive computational requirements. However, GPUs are not suitable for practical application scenarios where low power consumption is crucial. Furthermore, the difficulty of embedding a bulky GPU into a small device prevents the portability of such applications on mobile devices. The goal of this work is to provide an energy efficient solution for an existing depth camera based hand pose estimation algorithm. First, we compress the deep neural network model by applying the dynamic quantization techniques on different layers to achieve maximum compression without compromising accuracy. Afterwards, we design a custom hardware architecture. For our device we selected the FPGA as a target platform because FPGAs provide high energy efficiency and can be integrated in portable devices. Our solution implemented on Xilinx UltraScale+ MPSoC FPGA is 4.2× faster and 577.3× more energy efficient than the original implementation of the hand pose estimation algorithm on NVIDIA GeForce GTX 1070.
Published: 2020
Full Text: View/download PDF

33. Amharic OCR: An End-to-End Learning

Author: Birhanu Belay, Tewodros Habtegebrial, Million Meshesha, Marcus Liwicki, Gebeyehu Belay, and Didier Stricker
Subjects: amharic script, cnn, ctc, end-to-end learning, lstm, ocr, pattern recognition, text-line image, Technology, Engineering (General). Civil engineering (General), TA1-2040, Biology (General), QH301-705.5, Physics, QC1-999, Chemistry, QD1-999
Abstract: In this paper, we introduce an end-to-end Amharic text-line image recognition approach based on recurrent neural networks. Amharic is an indigenous Ethiopic script which follows a unique syllabic writing system adopted from an ancient Geez script. This script uses 34 consonant characters with the seven vowel variants of each (called basic characters) and other labialized characters derived by adding diacritical marks and/or removing parts of the basic characters. These associated diacritics on basic characters are relatively smaller in size, visually similar, and challenging to distinguish from the derived characters. Motivated by the recent success of end-to-end learning in pattern recognition, we propose a model which integrates a feature extractor, sequence learner, and transcriber in a unified module and is then trained in an end-to-end fashion. The experimental results, on a printed and synthetic benchmark Amharic Optical Character Recognition (OCR) database called ADOCR, demonstrated that the proposed model outperforms state-of-the-art methods by 6.98% and 1.05%, respectively.
Published: 2020
Full Text: View/download PDF

34. Structure from Articulated Motion: Accurate and Stable Monocular 3D Reconstruction without Training Data

Author: Onorina Kovalenko, Vladislav Golyanik, Jameel Malik, Ahmed Elhayek, and Didier Stricker
Subjects: structure from motion, human pose estimation, articulated structure recovery, Chemical technology, TP1-1185
Abstract: Recovery of articulated 3D structure from 2D observations is a challenging computer vision problem with many applications. Current learning-based approaches achieve state-of-the-art accuracy on public benchmarks but are restricted to specific types of objects and motions covered by the training datasets. Model-based approaches do not rely on training data but show lower accuracy on these datasets. In this paper, we introduce a model-based method called Structure from Articulated Motion (SfAM), which can recover multiple object and motion types without training on extensive data collections. At the same time, it performs on par with learning-based state-of-the-art approaches on public benchmarks and outperforms previous non-rigid structure from motion (NRSfM) methods. SfAM is built upon a general-purpose NRSfM technique while integrating a soft spatio-temporal constraint on the bone lengths. We use alternating optimization strategy to recover optimal geometry (i.e., bone proportions) together with 3D joint positions by enforcing the bone lengths consistency over a series of frames. SfAM is highly robust to noisy 2D annotations, generalizes to arbitrary objects and does not rely on training data, which is shown in extensive experiments on public benchmarks and real video sequences. We believe that it brings a new perspective on the domain of monocular 3D recovery of articulated structures, including human motion capture.
Published: 2019
Full Text: View/download PDF

35. WHSP-Net: A Weakly-Supervised Approach for 3D Hand Shape and Pose Recovery from a Single Depth Image

Author: Jameel Malik, Ahmed Elhayek, and Didier Stricker
Subjects: depth sensor, convolutional neural network (CNN), 3D hand pose, 3D hand shape, Chemical technology, TP1-1185
Abstract: Hand shape and pose recovery is essential for many computer vision applications such as animation of a personalized hand mesh in a virtual environment. Although there are many hand pose estimation methods, only a few deep learning based algorithms target 3D hand shape and pose from a single RGB or depth image. Jointly estimating hand shape and pose is very challenging because none of the existing real benchmarks provides ground truth hand shape. For this reason, we propose a novel weakly-supervised approach for 3D hand shape and pose recovery (named WHSP-Net) from a single depth image by learning shapes from unlabeled real data and labeled synthetic data. To this end, we propose a novel framework which consists of three novel components. The first is the Convolutional Neural Network (CNN) based deep network which produces 3D joints positions from learned 3D bone vectors using a new layer. The second is a novel shape decoder that recovers dense 3D hand mesh from sparse joints. The third is a novel depth synthesizer which reconstructs 2D depth image from 3D hand mesh. The whole pipeline is fine-tuned in an end-to-end manner. We demonstrate that our approach recovers reasonable hand shapes from real world datasets as well as from live stream of depth camera in real-time. Our algorithm outperforms state-of-the-art methods that output more than the joint positions and shows competitive performance on 3D pose estimation task.
Published: 2019
Full Text: View/download PDF

36. 3DAirSig: A Framework for Enabling In-Air Signatures Using a Multi-Modal Depth Sensor

Author: Jameel Malik, Ahmed Elhayek, Sheraz Ahmed, Faisal Shafait, Muhammad Imran Malik, and Didier Stricker
Subjects: in-air signature, depth sensor, convolutional neural network (CNN), 3D hand pose estimation, multidimensional dynamic time warping (MD-DTW), Chemical technology, TP1-1185
Abstract: In-air signature is a new modality which is essential for user authentication and access control in noncontact mode and has been actively studied in recent years. However, it has been treated as a conventional online signature, which is essentially a 2D spatial representation. Notably, this modality bears a lot more potential due to an important hidden depth feature. Existing methods for in-air signature verification neither capture this unique depth feature explicitly nor fully explore its potential in verification. Moreover, these methods are based on heuristic approaches for fingertip or hand palm center detection, which are not feasible in practice. Inspired by the great progress in deep-learning-based hand pose estimation, we propose a real-time in-air signature acquisition method which estimates hand joint positions in 3D using a single depth image. The predicted 3D position of fingertip is recorded for each frame. We present four different implementations of a verification module, which are based on the extracted depth and spatial features. An ablation study was performed to explore the impact of the depth feature in particular. For matching, we employed the most commonly used multidimensional dynamic time warping (MD-DTW) algorithm. We created a new database which contains 600 signatures recorded from 15 different subjects. Extensive evaluations were performed on our database. Our method, called 3DAirSig, achieved an equal error rate (EER) of 0.46 %. Experiments showed that depth itself is an important feature, which is sufficient for in-air signature verification.
Published: 2018
Full Text: View/download PDF

37. 3D Reconstruction from a Single RGB Image using Deep Learning: A Review

Author: Muhammad Saif Ullah Khan, Alain Pagani, Marcus Liwicki, Didier Stricker, and Muhammad Zeshan Afzal
Subjects: artificial_intelligence_robotics
Abstract: 3D reconstruction from a single 2D input is a classic problem in the field of computer vision. With the advancements in deep learning, the performance of 3D reconstruction has also significantly improved. The reconstruction task is more difficult for objects with no textures or complex deformations. This paper serves as a review of recent literature on 3D reconstruction from a single view, with a focus on deep learning methods from 2018 to 2021. Due to the lack of standard datasets or 3D shape representation methods, it is hard to make direct comparisons between all reviewed methods. However, this paper reviews different approaches for reconstructing 3D shape as depth maps, surface normals, point clouds, and meshes, along with various loss functions and evaluation metrics used to train and evaluate these methods.
Published: 2022

38. DeHyFoNet: Deformable Hybrid Network for Formula Detection in Scanned Document Images

Author: Muhammad Zeshan Afzal, Khurram Azeem Hashmi, Alain Pagani, Marcus Liwicki, and Didier Stricker
Subjects: artificial_intelligence_robotics
Abstract: This work presents an approach for detecting mathematical formulas in scanned document images. The proposed approach is end-to-end trainable. Since many OCR engines cannot reliably work with the formulas, it is essential to isolate them to obtain the clean text for information extraction from the document. Our proposed pipeline comprises a hybrid task cascade network with deformable convolutions and a Resnext101 backbone. Both of these modifications help in better detection. We evaluate the proposed approaches on the ICDAR-2017 POD and Marmot datasets and achieve an overall accuracy of 96% for the ICDAR-2017 POD dataset. We achieve an overall reduction of error of 13%. Furthermore, the results on Marmot datasets are improved for the isolated and embedded formulas. We achieved an accuracy of 98.78% for the isolated formula and 90.21% overall accuracy for embedded formulas. Consequently, it results in an error reduction rate of 43% for isolated and 17.9% for embedded formulas.
Published: 2022

39. HybridTabNet: Towards Better Table Detection in Scanned Document Images

Author: Marcus Liwicki, Danish Nazir, Didier Stricker, Muhammad Zeshan Afzal, Alain Pagani, and Khurram Azeem Hashmi
Subjects: Technology, Computer science, QH301-705.5, QC1-999, computer vision, artificial_intelligence_robotics, Computer vision, table detection, Biology (General), QD1-999, scanned document images, table localization, business.industry, Deep learning, Physics, deep learning, object detection, hybrid task cascade, Engineering (General). Civil engineering (General), Object detection, deformable convolution, Chemistry, deep neural networks, document image analysis, Table (database), Deep neural networks, Artificial intelligence, TA1-2040, business
Abstract: Tables in document images are an important entity since they contain crucial information. Therefore, accurate table detection can significantly improve the information extraction from documents. In this work, we present a novel end-to-end trainable pipeline, HybridTabNet, for table detection in scanned document images. Our two-stage table detector uses the ResNeXt-101 backbone for feature extraction and Hybrid Task Cascade (HTC) to localize the tables in scanned document images. Moreover, we replace conventional convolutions with deformable convolutions in the backbone network. This enables our network to detect tables of arbitrary layouts precisely. We evaluate our approach comprehensively on ICDAR-13, ICDAR-17 POD, ICDAR-19, TableBank, Marmot, and UNLV. Apart from the ICDAR-17 POD dataset, our proposed HybridTabNet outperformed earlier state-of-the-art results without depending on pre- and post-processing steps. Furthermore, to investigate how the proposed method generalizes unseen data, we conduct an exhaustive leave-one-out-evaluation. In comparison to prior state-of-the-art results, our method reduced the relative error by 27.57% on ICDAR-2019-TrackA-Modern, 42.64% on TableBank (Latex), 41.33% on TableBank (Word), 55.73% on TableBank (Latex + Word), 10% on Marmot, and 9.67% on the UNLV dataset. The achieved results reflect the superior performance of the proposed method.
Published: 2021

40. Cascade Network with Deformable Composite Backbone for Formula Detection in Scanned Document Images

Author: Alain Pagani, Muhammad Zeshan Afzal, Marcus Liwicki, Didier Stricker, and Khurram Azeem Hashmi
Subjects: Technology, QH301-705.5, Computer science, QC1-999, computer.software_genre, computer vision, Reduction (complexity), Datorseende och robotik (autonoma system), Approximation error, Cascade network, Biology (General), QD1-999, Computer Vision and Robotics (Autonomous Systems), Backbone network, Physics, DUAL (cognitive architecture), Engineering (General). Civil engineering (General), algebra_number_theory, Object detection, Cascade Mask R-CNN, formula detection, Chemistry, Information extraction, deep neural networks, Cascade, document image analysis, TA1-2040, mathematical expression detection, computer, Algorithm
Abstract: This paper presents a novel architecture for detecting mathematical formulas in document images, which is an important step for reliable information extraction in several domains. Recently, Cascade Mask R-CNN networks have been introduced to solve object detection in computer vision. In this paper, we suggest a couple of modifications to the existing Cascade Mask R-CNN architecture: First, the proposed network uses deformable convolutions instead of conventional convolutions in the backbone network to spot areas of interest better. Second, it uses a dual backbone of ResNeXt-101, having composite connections at the parallel stages. Finally, our proposed network is end-to-end trainable. We evaluate the proposed approach on the ICDAR-2017 POD and Marmot datasets. The proposed approach demonstrates state-of-the-art performance on ICDAR-2017 POD at a higher IoU threshold with an F1-score of 0.917, reducing the relative error by 7.8%. Moreover, we accomplished correct detection accuracy of 81.3% on embedded formulas on the Marmot dataset, which results in a relative error reduction of 30%. Research funder: European project INFINITY (883293).
Published: 2021

41. Controlling Teleportation-Based Locomotion in Virtual Reality with Hand Gestures: A Comparative Evaluation of Two-Handed and One-Handed Techniques

Author: Didier Stricker, Alexander Schäfer, and Gerd Reis
Subjects: VR, Computer Networks and Communications, Computer science, Headset, lcsh:TK7800-8360, 02 engineering and technology, Virtual reality, Metaverse, Teleportation, Human–computer interaction, Component (UML), 0202 electrical engineering, electronic engineering, information engineering, Immersion (virtual reality), 0501 psychology and cognitive sciences, Electrical and Electronic Engineering, navigation, 050107 human factors, gestures, 05 social sciences, lcsh:Electronics, 020207 software engineering, hands-free, freehand, locomotion, Hardware and Architecture, Control and Systems Engineering, Signal Processing, Eye tracking, virtual reality, gestural input, movement, Gesture, bare hand
Abstract: Virtual Reality (VR) technology offers users the possibility to immerse and freely navigate through virtual worlds. An important component for achieving a high degree of immersion in VR is locomotion. Although often discussed in the literature, a natural and effective way of controlling locomotion remains an open problem. Recently, VR headset manufacturers have been integrating more sensors, allowing hand or eye tracking without any additional required equipment. This enables a wide range of application scenarios with natural freehand interaction techniques where no additional hardware is required. This paper focuses on techniques to control teleportation-based locomotion with hand gestures, where users are able to move around in VR using their hands only. With the help of a comprehensive study involving 21 participants, four different techniques are evaluated. The effectiveness and efficiency as well as user preferences of the presented techniques are determined. Two two-handed and two one-handed techniques are evaluated, revealing that it is possible to move comfortably and effectively through virtual worlds with a single hand only.
Published: 2021

42. Structure from Articulated Motion: Accurate and Stable Monocular 3D Reconstruction without Training Data

Author: Ahmed Elhayek, Vladislav Golyanik, Didier Stricker, Onorina Kovalenko, and Jameel Malik
Subjects: FOS: Computer and information sciences, Computer science, Computer Vision and Pattern Recognition (cs.CV), Constraint (computer-aided design), Computer Science - Computer Vision and Pattern Recognition, 02 engineering and technology, lcsh:Chemical technology, Biochemistry, Motion (physics), Article, Analytical Chemistry, Consistency (database systems), articulated structure recovery, 0202 electrical engineering, electronic engineering, information engineering, Structure from motion, Computer vision, lcsh:TP1-1185, Electrical and Electronic Engineering, Instrumentation, Monocular, business.industry, structure from motion, 3D reconstruction, Perspective (graphical), human pose estimation, 020207 software engineering, Object (computer science), Atomic and Molecular Physics, and Optics, 020201 artificial intelligence & image processing, Artificial intelligence, business
Abstract: Recovery of articulated 3D structure from 2D observations is a challenging computer vision problem with many applications. Current learning-based approaches achieve state-of-the-art accuracy on public benchmarks but are restricted to specific types of objects and motions covered by the training datasets. Model-based approaches do not rely on training data but show lower accuracy on these datasets. In this paper, we introduce a model-based method called Structure from Articulated Motion (SfAM), which can recover multiple object and motion types without training on extensive data collections. At the same time, it performs on par with learning-based state-of-the-art approaches on public benchmarks and outperforms previous non-rigid structure from motion (NRSfM) methods. SfAM is built upon a general-purpose NRSfM technique while integrating a soft spatio-temporal constraint on the bone lengths. We use an alternating optimization strategy to recover optimal geometry (i.e., bone proportions) together with 3D joint positions by enforcing bone-length consistency over a series of frames. SfAM is highly robust to noisy 2D annotations, generalizes to arbitrary objects and does not rely on training data, which is shown in extensive experiments on public benchmarks and real video sequences. We believe that it brings a new perspective on the domain of monocular 3D recovery of articulated structures, including human motion capture.
Published: 2019

43. 3DAirSig: A Framework for Enabling In-Air Signatures Using a Multi-Modal Depth Sensor

Author: Didier Stricker, Jameel Malik, Muhammad Imran Malik, Faisal Shafait, Sheraz Ahmed, and Ahmed Elhayek
Subjects: Dynamic time warping, Computer science, 3D hand pose estimation, Word error rate, 02 engineering and technology, lcsh:Chemical technology, Biochemistry, Article, Analytical Chemistry, 0202 electrical engineering, electronic engineering, information engineering, depth sensor, multidimensional dynamic time warping (MD-DTW), lcsh:TP1-1185, Electrical and Electronic Engineering, Instrumentation, Pose, convolutional neural network (CNN), Modality (human–computer interaction), business.industry, Frame (networking), 020207 software engineering, Pattern recognition, Atomic and Molecular Physics, and Optics, Signature (logic), in-air signature, Feature (computer vision), 020201 artificial intelligence & image processing, Artificial intelligence, business
Abstract: In-air signature is a new modality which is essential for user authentication and access control in noncontact mode and has been actively studied in recent years. However, it has been treated as a conventional online signature, which is essentially a 2D spatial representation. Notably, this modality bears a lot more potential due to an important hidden depth feature. Existing methods for in-air signature verification neither capture this unique depth feature explicitly nor fully explore its potential in verification. Moreover, these methods are based on heuristic approaches for fingertip or hand palm center detection, which are not feasible in practice. Inspired by the great progress in deep-learning-based hand pose estimation, we propose a real-time in-air signature acquisition method which estimates hand joint positions in 3D using a single depth image. The predicted 3D position of the fingertip is recorded for each frame. We present four different implementations of a verification module, which are based on the extracted depth and spatial features. An ablation study was performed to explore the impact of the depth feature in particular. For matching, we employed the most commonly used multidimensional dynamic time warping (MD-DTW) algorithm. We created a new database which contains 600 signatures recorded from 15 different subjects. Extensive evaluations were performed on our database. Our method, called 3DAirSig, achieved an equal error rate (EER) of 0.46%. Experiments showed that depth itself is an important feature, which is sufficient for in-air signature verification.
Published: 2018

44. SynPo-Net—Accurate and Fast CNN-Based 6DoF Object Pose Estimation Using Synthetic Training

Author: Didier Stricker, Yongzhi Su, Alain Pagani, and Jason Rambach
Subjects: object pose estimation, domain adaptation, Computer science, 02 engineering and technology, lcsh:Chemical technology, Biochemistry, Convolutional neural network, Article, Analytical Chemistry, convolutional neural networks, 0202 electrical engineering, electronic engineering, information engineering, lcsh:TP1-1185, Computer vision, Electrical and Electronic Engineering, Instrumentation, Pose, Ground truth, business.industry, Deep learning, deep learning, Tracking system, training with synthetic images, 021001 nanoscience & nanotechnology, Real image, Atomic and Molecular Physics, and Optics, 6DoF object tracking, 020201 artificial intelligence & image processing, Augmented reality, Artificial intelligence, 6DoF object pose, 0210 nano-technology, business
Abstract: Estimation and tracking of 6DoF poses of objects in images is a challenging problem of great importance for robotic interaction and augmented reality. Recent approaches applying deep neural networks for pose estimation have shown encouraging results. However, most of them rely on training with real images of objects, with severe limitations concerning ground truth pose acquisition, full coverage of possible poses, and training dataset scaling and generalization capability. This paper presents a novel approach using a Convolutional Neural Network (CNN) trained exclusively on single-channel Synthetic images of objects to regress 6DoF object Poses directly (SynPo-Net). The proposed SynPo-Net combines a network architecture specifically designed for pose regression with a domain adaptation scheme that transforms real and synthetic images into an intermediate domain better suited for establishing correspondences. The extensive evaluation shows that our approach significantly outperforms the state-of-the-art using synthetic training in terms of both accuracy and speed. Our system can be used to estimate the 6DoF pose from a single frame, or be integrated into a tracking system to provide the initial pose.
Published: 2021
