Author: "Pantic, Maja" / Publication Type: Dissertations - Searchworks@Jio Institute Digital Library Search Results

29 results on '"Pantic, Maja"'

1. Deep audio-visual speech recognition

Author: Ma, Pingchuan and Pantic, Maja
Abstract: Decades of research in acoustic speech recognition have led to systems that we use in our everyday life. However, even the most advanced speech recognition systems fail in the presence of noise. The degraded performance can be compensated by introducing visual speech information. However, Visual Speech Recognition (VSR) in naturalistic conditions is very challenging, in part due to the lack of architectures and annotations. This thesis contributes towards the problem of Audio-Visual Speech Recognition (AVSR) from different aspects. Firstly, we develop AVSR models for isolated words. In contrast to previous state-of-the-art methods that consist of a two-step approach, feature extraction and recognition, we present an End-to-End (E2E) approach inside a deep neural network, and this has led to a significant improvement in audio-only, visual-only and audio-visual experiments. We further replace Bi-directional Gated Recurrent Unit (BGRU) with Temporal Convolutional Networks (TCN) to greatly simplify the training procedure. Secondly, we extend our AVSR model for continuous speech by presenting a hybrid Connectionist Temporal Classification (CTC)/Attention model that can be trained in an end-to-end manner. We then propose the addition of prediction-based auxiliary tasks to a VSR model and highlight the importance of hyper-parameter optimisation and appropriate data augmentations. Next, we present a self-supervised framework, Learning visual speech Representations from Audio via self-supervision (LiRA). Specifically, we train a ResNet+Conformer model to predict acoustic features from unlabelled visual speech, and find that this pre-trained model can be leveraged towards word-level and sentence-level lip-reading. We also investigate the influence of the Lombard effect in an end-to-end AVSR system, which is the first work using end-to-end deep architectures and presents results on unseen speakers.
We show that even if a relatively small amount of Lombard speech is added to the training set then the performance in a real scenario, where noisy Lombard speech is present, can be significantly improved. Lastly, we propose a detection method against adversarial examples in an AVSR system, where the strong correlation between audio and visual streams is leveraged. The synchronisation confidence score is leveraged as a proxy for audio-visual correlation and based on it, we can detect adversarial attacks. We apply recent adversarial attacks on two AVSR models and the experimental results demonstrate that the proposed approach is an effective way for detecting such attacks.
Published: 2022
Full Text: View/download PDF

2. Human-controllable and structured deep generative models

Author: Tran, Dieu Linh and Pantic, Maja
Abstract: Deep generative models are a class of probabilistic models that attempts to learn the underlying data distribution. These models are usually trained in an unsupervised way and thus, do not require any labels. Generative models such as Variational Autoencoders and Generative Adversarial Networks have made astounding progress over the last years. These models have several benefits: eased sampling and evaluation, efficient learning of low-dimensional representations for downstream tasks, and better understanding through interpretable representations. However, even though the quality of these models has improved immensely, the ability to control their style and structure is limited. Structured and human-controllable representations of generative models are essential for human-machine interaction and other applications, including fairness, creativity, and entertainment. This thesis investigates learning human-controllable and structured representations with deep generative models. In particular, we focus on generative modelling of 2D images. For the first part, we focus on learning clustered representations. We propose semi-parametric hierarchical variational autoencoders to estimate the intensity of facial action units. The semi-parametric model forms a hybrid generative-discriminative model and leverages both parametric Variational Autoencoder and non-parametric Gaussian Process autoencoder. We show superior performance in comparison with existing facial action unit estimation approaches. Based on the results and analysis of the learned representation, we focus on learning Mixture-of-Gaussians representations in an autoencoding framework. We deviate from the conventional autoencoding framework and consider a regularized objective with the Cauchy-Schwarz divergence. The Cauchy-Schwarz divergence allows a closed-form solution for Mixture-of-Gaussian distributions and, thus, efficiently optimizing the autoencoding objective. 
We show that our model outperforms existing Variational Autoencoders in density estimation, clustering, and semi-supervised facial action detection. For the second part, we focus on learning disentangled representations for conditional generation and fair facial attribute classification. Conditional image generation relies on access to large-scale annotated datasets. Nevertheless, the geometry of visual objects, such as in faces, cannot be learned implicitly, which deteriorates image fidelity. We propose incorporating facial landmarks with a statistical shape model and a differentiable piecewise affine transformation to separate the representation for appearance and shape. The goal of incorporating facial landmarks is that generation is controlled and can separate different appearances and geometries. In our last work, we use weak supervision for disentangling groups of variations. Works on learning disentangled representation have been done in an unsupervised fashion. However, recent works have shown that learning disentangled representations is not identifiable without any inductive biases. Since then, there has been a shift towards weakly-supervised disentanglement learning. We investigate using regularization based on the Kullback-Leibler divergence to disentangle groups of variations. The goal is to have consistent and separated subspaces for different groups, e.g., for content-style learning. Our evaluation shows increased disentanglement abilities and competitive performance for image clustering and fair facial attribute classification with weak supervision compared to supervised and semi-supervised approaches.
Published: 2022
Full Text: View/download PDF

3. Generation of realistic human behaviour

Author: Vougioukas, Konstantinos and Pantic, Maja
Abstract: As the use of computers and robots in our everyday lives increases so does the need for better interaction with these devices. Human-computer interaction relies on the ability to understand and generate human behavioural signals such as speech, facial expressions and motion. This thesis deals with the synthesis and evaluation of such signals, focusing not only on their intelligibility but also on their realism. Since these signals are often correlated, it is common for methods to drive the generation of one signal using another. The thesis begins by tackling the problem of speech-driven facial animation and proposing models capable of producing realistic animations from a single image and an audio clip. The goal of these models is to produce a video of a target person, whose lips move in accordance with the driving audio. Particular focus is also placed on a) generating spontaneous expression such as blinks, b) achieving audio-visual synchrony and c) transferring or producing natural head motion. The second problem addressed in this thesis is that of video-driven speech reconstruction, which aims at converting a silent video into waveforms containing speech. The method proposed for solving this problem is capable of generating intelligible and accurate speech for both seen and unseen speakers. The spoken content is correctly captured thanks to a perceptual loss, which uses features from pre-trained speech-driven animation models. The ability of the video-to-speech model to run in real-time allows its use in hearing assistive devices and telecommunications. The final work proposed in this thesis is a generic domain translation system, that can be used for any translation problem including those mapping across different modalities. 
The framework is made up of two networks performing translations in opposite directions and can be successfully applied to solve diverse sets of translation problems, including speech-driven animation and video-driven speech reconstruction.
Published: 2022
Full Text: View/download PDF

4. Bias in deep learning and applications to face analysis

Author: Georgopoulos, Markos and Pantic, Maja
Abstract: Deep learning has fostered the progress in the field of face analysis, resulting in the integration of these models in multiple aspects of society. Even though the majority of research has focused on optimizing standard evaluation metrics, recent work has exposed the bias of such algorithms as well as the dangers of their unaccountable utilization. In this thesis, we explore the bias of deep learning models in the discriminative and the generative setting. We begin by investigating the bias of face analysis models with regard to different demographics. To this end, we collect KANFace, a large-scale video and image dataset of faces captured 'in-the-wild'. The rich set of annotations allows us to expose the demographic bias of deep learning models, which we mitigate by utilizing adversarial learning to debias the deep representations. Furthermore, we explore neural augmentation as a strategy towards training fair classifiers. We propose a style-based multi-attribute transfer framework that is able to synthesize photo-realistic faces of the underrepresented demographics. This is achieved by introducing a multi-attribute extension to Adaptive Instance Normalisation that captures the multiplicative interactions between the representations of different attributes. Focusing on bias in gender recognition, we showcase the efficacy of the framework in training classifiers that are more fair compared to generative and fairness-aware methods. In the second part, we focus on bias in deep generative models. In particular, we start by studying the generalization of generative models on images of unseen attribute combinations. To this end, we extend the conditional Variational Autoencoder by introducing a multilinear conditioning framework. The proposed method is able to synthesize unseen attribute combinations by modeling the multiplicative interactions between the attributes.
Lastly, in order to control protected attributes, we investigate controlled image generation without training on a labelled dataset. We leverage pre-trained Generative Adversarial Networks that are trained in an unsupervised fashion and exploit the clustering that occurs in the representation space of intermediate layers of the generator. We show that these clusters capture semantic attribute information and condition image synthesis on the cluster assignment using Implicit Maximum Likelihood Estimation.
Published: 2022
Full Text: View/download PDF

5. Deep face tracking and parsing in the wild

Author: Lin, Yiming and Pantic, Maja
Abstract: Face analysis has been a long-standing research direction in the field of computer vision and pattern recognition. A complete face analysis system involves solving several tasks including face detection, face tracking, face parsing, and face recognition. Recently, the performance of methods in all tasks has significantly improved thanks to the employment of Deep Convolutional Neural Networks (DCNNs). However, existing face analysis algorithms mainly focus on solving facial images captured in the constrained laboratory environment, and their performance on real-world images has remained less explored. Compared with the lab environment, the in-the-wild settings involve greater diversity in face sizes, poses, facial expressions, background clutters, lighting conditions and imaging quality. This thesis investigates two fundamental tasks in face analysis under in-the-wild settings: face tracking and face parsing. Both tasks serve as important prerequisites for downstream face analysis applications. However, in-the-wild datasets remain scarce in both fields and models have not been rigorously evaluated in such settings. In this thesis, we aim to bridge the gap left by the lack of in-the-wild data, evaluate existing methods in these settings, and develop accurate, robust and efficient deep learning-based methods for the two tasks. For face tracking in the wild, we introduce the first in-the-wild face tracking dataset, MobiFace, that consists of 80 videos captured by mobile phones during mobile live-streaming. The environment of the live-streaming performance is fully unconstrained and the interactions between users and mobile phones are natural and spontaneous. Next, we evaluate existing tracking methods, including generic object trackers and dedicated face trackers. The results show that MobiFace represents unique challenges in face tracking in the wild and cannot be readily solved by existing methods.
Finally, we present a DCNN-based framework, FT-RCNN, that significantly outperforms other methods in face tracking in the wild. For face parsing in the wild, we introduce the first large-scale in-the-wild face dataset, iBugMask, that contains 21,866 training images and 1,000 testing images. Unlike existing datasets, the images in iBugMask are captured in the fully unconstrained environment and are not cropped or preprocessed in any way. Manually annotated per-pixel labels for eleven facial regions are provided for each target face. Next, we benchmark existing parsing methods and the results show that iBugMask is extremely challenging for all methods. By rigorous benchmarking, we observe that the pre-processing of facial images with bounding boxes in face parsing in the wild introduces bias. When cropping the face with a bounding box, a cropping margin has to be hand-picked. If face alignment is used, fiducial landmarks are required and a predefined alignment template has to be selected. These additional hyper-parameters have to be carefully considered and can have a significant impact on the face parsing performance. To solve this, we propose the Region-of-Interest (RoI) Tanh-polar transform that warps the whole image to a fixed-sized representation. Moreover, the RoI Tanh-polar transform is differentiable and allows for rotation equivariance in DCNNs. We show that when coupled with a simple Fully Convolutional Network, our RoI Tanh-polar transformer Network has achieved state-of-the-art results on face parsing in the wild. This thesis contributes towards in-the-wild face tracking and face parsing by providing novel datasets and proposing effective frameworks. Both tasks can benefit real-world downstream applications such as facial age estimation, facial expression recognition and lip-reading.
The proposed RoI Tanh-polar transform also provides a new perspective in how to preprocess the face images and make the DCNNs truly end-to-end for real-world face analysis applications.
Published: 2021
Full Text: View/download PDF

6. Dynamic face parsing in the wild

Author: Wang, Yujiang and Pantic, Maja
Abstract: Landmark-based facial descriptors are widely utilised in face video analysis to obtain useful facial dynamics. Such a sparse facial representation is not capable of constructing the full dynamics of each facial component like eyes and mouths, while such dynamics can be essential to the recognition of higher-level features like facial expressions, emotions, identity, and so on. A dense facial descriptor such as the segmentation masks generated by face parsing, however, can effectively overcome those limitations by providing pixel-wise semantic information that is generally more discriminative and more desirable to facial analysis tasks. Recently, Deep Convolutional Neural Networks (DCNNs) have made impressive progress in semantic image segmentation, a task that performs per-pixel classifications. Those deep segmentation models can naturally generate pixel-level predictions for facial images, however, face parsing in the wild is still a challenging task. The model's ability to accurately segment different facial regions is crucial to generate high-quality face masks. Besides, how to adapt the segmentation models designed for static images to the continuous environment of face videos also requires consideration. To satisfy the real-time requirement under realistic scenarios, the acceleration problem needs to be resolved. This thesis investigates different aspects of in-the-wild face parsing and proposes several novel approaches of constructing robust face segmentation masks. To increase the robustness of eye segmentation against low-quality video scenarios, we encode the shape priors of eyes into the training procedure of deep segmentation model. Additionally, the segmentation model's sensitivity to semantic facial contours is enhanced by introducing the Dilated Convolutions with Lateral Inhibitions, which is a convolutional operator biologically inspired by human visual systems. 
To exploit information from both temporal and spatial domains in face videos, we propose a ConvLSTM-FCN model to generate temporally smoothed face segmentation masks which are more tolerant to video variations. Finally, we consider accelerating the process of dynamic face parsing via Reinforcement Learning to learn a globally-optimised key scheduler. This thesis contributes towards in-the-wild face parsing from different aspects such as improving the fundamental network architectures and optimising the performance under realistic scenarios. It can benefit downstream tasks that require detailed facial dynamics such as facial expression recognition and lip-reading. It can also be inspiring to future works on semantic image/video segmentation, and to other works pursuing face parsing with higher visual qualities or with better working efficiency.
Published: 2021
Full Text: View/download PDF

7. Machine learning methods for audio-visual event analysis

Author: Pu, Jie and Pantic, Maja
Abstract: This thesis studies machine learning techniques for localizing, separating and recognizing audio-visual events. Until recently, the widely-used methods for analyzing audio-visual events involve either laborious pre-processing/post-processing steps to handle videos, or a huge amount of training data to supervise their learning processes. To overcome these limitations, we aim to develop novel approaches that have one end-to-end framework, and much less dependence on training data than state-of-the-art deep-learning approaches. In particular, we propose novel low-rank and sparse matrix decomposition methods, kernelized matrix decomposition methods and deep neural networks for audio-visual event analysis, with a particular focus on their data-efficiency and computational-efficiency. First, we investigate the problem of blind audio-visual localization and separation, which aims to localize visual objects associated with an audio signal and simultaneously separate the audio signal from irrelevant audio components. To this end, a novel low-rank and sparse matrix decomposition method is devised, where we use a sparse matrix capturing the correlated components between visual and audio modalities, and hence uncovering the sound source in visual modality and the associated sound in audio modality. After that, we propose a novel kernelized sparse and low-rank decomposition method, which generalizes the linear correlation of our low-rank and sparse model into non-linear ones. This leads to superior performances in the application of active speaker detection and localization in audio-visual recordings. Given the recent surge of deep learning, we also propose a novel deep low-rank and sparse model, which uses deep neural networks to achieve low-rank and sparse decomposition. It is a further extension of our kernelized matrix decomposition method, where the non-linear correlation between visual and audio modalities is captured by deep networks.
Finally, we have worked on the problem of sound event detection and localization. A novel filterbank learning approach is proposed, which takes the raw waveform as input and produces its auditory spectro-temporal representation. In contrast to standard convolutional neural networks that learn all elements of each filter, the proposed filters are parameterized with only two learnable variables. This offers the interpretability of our model after training, and greatly lessens its requirement for training data.
Published: 2020
Full Text: View/download PDF

8. Machine learning methods for face modelling and analysis in-the-wild

Author: Kossaifi, Jean and Pantic, Maja
Subjects: 006.3
Abstract: Automatic facial analysis is at the intersection between computer vision and machine learning. It consists of two main steps. First, facial alignment, which typically consists of the detection of a set of fiducial points, or landmarks, on the face. Secondly, the aligned faces are then used, either directly as pixel intensities, or after extracting more robust features (either hand-crafted ones such as histograms of oriented gradients, or learned with a deep neural network), as input to estimate emotional states. In this thesis, we develop a complete pipeline for facial analysis in real-life, naturalistic conditions (in-the-wild), covering both steps. We first explore generative models for the task of facial alignment, Active Appearance Models (AAMs). Specifically, we introduce a new second order method for fitting AAMs. We then introduce a bidirectional method that simultaneously deforms the model and the image, leading to faster convergence. In both cases, we leverage the structure in the problem to obtain exact solutions with better computational complexity. We show that, when trained in-the-wild, they achieve state-of-the-art performance, while requiring smaller datasets than discriminative methods. We also demonstrate how to leverage the statistical shape model and motion model from AAMs to constrain generative adversarial networks. We then build on the facial alignment framework to estimate dimensional measures of emotion. Specifically, we estimate continuous levels of valence (how positive or negative a state of mind is) and arousal (how exciting or calming the experience is). To do so, we introduce a new database of images collected in-the-wild, and annotated per-frame in terms of continuous levels of valence and arousal, along with accurate facial landmarks. We then demonstrate the importance of training models on data collected in-the-wild as opposed to existing databases, mainly collected in laboratory, or controlled environments.
While developing tools for better facial analysis, it became clear that, while the data we work with has a rich multi-linear structure (e.g. spatial and temporal), this is discarded by current methods. We therefore endeavoured to devise new methods able to leverage that structure. In particular, given the absence of software for tensor methods, we created TensorLy, a high level API for tensor algebra, decomposition and regression in Python. Its flexible backend system makes it possible to seamlessly run computation on various hardware with several libraries, including deep learning libraries such as PyTorch, TensorFlow or MXNet. This allowed us to introduce new ways of combining tensor methods with deep learning, such as tensor contraction and regression layers. This type of hybrid method combines the power of tensor algebra with the efficiency of deep learning. It makes it possible to devise efficient algorithms that achieve state-of-the-art performance and are scalable to very large datasets, while enabling large parameter space savings.
Published: 2019
Full Text: View/download PDF

9. Structured machine learning methods for automated analysis of facial expressions

Author: Walecki, Robert and Pantic, Maja
Subjects: 6.2
Abstract: Automated recognition of facial expressions, and detection of facial action units (AUs) from videos depends critically on modeling of their dynamics. Some of these dynamics are characterized by changes in temporal phases (onset-apex-offset) and intensity of emotion expressions and AUs. The appearance of these changes may vary considerably among subjects, making the recognition/detection task very challenging. Recent advances in deep neural networks (DNN) and, in particular, convolutional models have facilitated “end-to-end” learning and reduced or even completely eliminated the dependence and need for physics-based models and/or other pre-processing techniques. While the effectiveness of these models has been demonstrated on many computer vision problems, only baseline tasks such as expression recognition, AU detection and AU intensity estimation have been investigated. The structure of facial expressions arises from statistically induced co-occurrence patterns of AU intensity levels. Our goal is to model this structure by combining conditional random fields (CRF) with deep learning. The contribution of this thesis is two-fold. First, we introduce a novel Latent-CRF model for classification of image sequences. Second, we propose a deep probabilistic framework for modeling multivariate ordinal variables. Latent-CRFs efficiently encode dynamics through latent states accounting for temporal consistency. These latent states are typically assumed to be either unordered (nominal) or fully ordered (ordinal). Yet, while the video segments containing activation of the target AU may better be described using ordinal latent states (corresponding to the AU intensity levels), the segments where this AU does not occur, may better be described using unordered (nominal) latent states.
To address this, we propose the Variable-state L-CRF model that automatically selects the optimal latent states for the target image sequence, based on the input data and underlying dynamics of the sequence. The deep probabilistic framework introduced in the second part of this thesis accounts for ordinal structure in the output variables and their non-linear dependencies via Copula functions modeled as cliques of a CRF. These are jointly optimized with deep CNN feature encoding layers using a newly introduced balanced batch iterative training algorithm. We show that joint learning of the deep features and the target output structure results in significant performance gains compared to existing deep structured models for analysis of facial expressions. We show that the proposed models consistently outperform (i) independent modeling of AU intensities, (ii) the state-of-the-art approach for the target task, and (iii) deep convolutional neural networks.
Published: 2018
Full Text: View/download PDF

10. Robust statistical deformable models

Author: Antonakos, Epameinondas, Zafeiriou, Stefanos, and Pantic, Maja
Subjects: 004
Abstract: During the last few years, we have witnessed tremendous advances in the field of 2D Deformable Models for the problem of landmark localization. These advances, which are mainly reported on the task of face alignment, have created two major and opposing families of methodologies. On the one hand, there are the generative Deformable Models that utilize a Newton-type optimization. This family of techniques has attracted extensive research effort during the last two decades, but has lately been criticized for achieving inaccurate performance. On the other hand, there is the currently predominant family of discriminative Deformable Models that treat the problem of landmark localization as a regression problem. These techniques commonly employ cascaded linear regression and have proved to be very accurate. In this thesis, we argue that even though generative Deformable Models are less accurate than discriminative ones, they are still very valuable for several tasks. In the first part of the thesis, we propose two novel generative Deformable Models. In the second part of the thesis, we show that the combination of generative and discriminative Deformable Models achieves state-of-the-art results on the tasks of (i) landmark localization and (ii) semi-supervised annotation of large visual data.
Published: 2018
Full Text: View/download PDF

11. Machine learning for high-level social behaviour

Author: Bilakhia, Sanjay and Pantic, Maja
Subjects: 004
Abstract: The ability to recognize and interpret the complex displays of nonverbal behavioral cues that arise in social interaction comes naturally to humans. Indeed, the survival and flourishing of early groups of homo sapiens may have depended on this ability to share implicit social information. It is a process so innate that complex social behaviours can occur without conscious awareness, even in young babies. Though we would benefit from artificial devices having the ability to understand these nonverbal cues, it has proven an elusive goal. In this thesis we are primarily motivated by the problem of recognizing and exploiting displays of high–level social behavior, focusing on behavioural mimicry. Mimicry describes the tendency of individuals to adopt the postures, gestures and expressions of social interaction partners. We first provide a background to the phenomenon of behavioural mimicry, disambiguate it from other related phenomena in social interaction, and survey its surprisingly complex dependencies on the broader social context. We then discuss a number of methods that could be used to recognize mimicry behaviour in naturalistic interaction. We list some publicly available databases these tools could be trained on for the analysis of spontaneous instances of mimicry. We also examine the scarce prior work on recognition of naturalistic mimicry behaviour, and we discuss the challenges in automatically recognizing mimicry in spontaneous data. Subsequently we present a database of naturalistic social interactions, designed for analysis of spontaneous mimicry behaviour. This has been annotated for mimicry episodes, low-level non-verbal behavioural cues, and continuous affect. We also present a new software package for web-based annotation, AstAn, which has been extensively deployed for temporal event segmentation and continuous annotation. Collecting annotation data for high-level social affect is a difficult problem. 
This is due to inter-annotator variance, dependent on a variety of factors including i) the content of the data to annotate ii) the complexity of the variables to annotate, and iii) the annotators' cultures and personality traits. AstAn is the first software package to enable large-scale collection of annotations relevant to affective computing, without the costly manual distribution and management of (perhaps sensitive) data. Large-scale and cost-effective data collection can significantly help to overcome the aforementioned difficulties. We present experiments showing that prevailing methods for mimicry recognition on posed data, generalize suboptimally to spontaneous data. These include methods based on cross-correlation and dynamic time warping, which are prevalent in current work on recognition of interpersonal co-ordination, including mimicry and synchrony. We also show that popular temporal models such as recurrent neural networks, when applied in a straightforward classification approach, also find it challenging to discriminate between mimicry and non-mimicry. We expand upon these baseline results using methods adapted from work on multimodal classification. Nonlinear regression models are used to learn the relationships between the non-verbal cues from each subject. Namely, for mimicry and non-mimicry classes, we learn a set of neural networks to forecast the behaviour of each subject, given the behaviour of their counterpart. The set of networks that produces the best behavioural forecast corresponds to the predicted class. Subsequently, we investigate whether high-level social affect like mimicry, conflict, valence and arousal are uniquely displayed between individuals. Specifically, we show that for episodes of a given behavioural display such as mimicry or high-conflict, the spatiotemporal movement characteristics are unique enough to construct a "kinematic template" for that behaviour. 
Given an unseen episode of the same behavioural display, we can compare it against the template in order to verify identity. This is useful in verification contexts where facial appearance and geometry can change due to lighting, facial hair, facial decoration, or weight loss. We present a new method, Multi-Sequence Robust Canonical Time Warping (M-RCTW), in order to construct this subject- and behaviour-specific template. Unlike prior methods, M-RCTW can warp together multiple multivariate sequences in the presence of large non-Gaussian errors, which can occur due to e.g. tracking artefacts in naturalistic behaviour, such as those resulting from occlusions. We show on two databases of natural interaction that identity verification is possible from a number of high- and low-level behaviours, and that M-RCTW outperforms existing methods for multiple sequence warping on the task of subject verification.
Published: 2018
Full Text: View/download PDF

12. Robust deformable model for 3D face alignment and tracking

Author: Cheng, Shiyang, Zafeiriou, Stefanos, and Pantic, Maja
Subjects: 004
Abstract: This thesis investigates the use of robust deformable models for 3D face alignment and tracking. Our main objective is to establish accurate and reliable dense correspondence for 4D faces. Our contribution is two-fold. First, we build a robust face alignment framework to establish dense correspondence between 4D faces. Second, we develop 4DFAB, the first large-scale high-resolution 4D face database for facial expression analysis and biometrics. Our deformable 4D face alignment framework contains three parts: (1) robust 2D face alignment; (2) active non-rigid 3D face registration; (3) deformable 4D face tracking. For 2D face alignment, we start with the study of robust 2D and 3D geometry features (and their fusion) for Constrained Local Models (CLMs). We show that by leveraging robust features, CLMs can handle faces captured in both controlled and uncontrolled environments. To exploit the discriminative power of CLMs, we propose an alignment framework based on the texture model of response maps. Under this framework, we devise two generative fitting methods (GFRM-Alt and GFRM-PO) and one part-based discriminative fitting method (DFRM), which achieve favorable performance in generic face alignment in-the-wild. Additionally, we implement real-time face tracking software using DFRM. Next, we study the registration problem for high-quality 3D facial scans. We build a part-based statistical face model and combine it with non-rigid ICP. We name the method Active Non-rigid ICP (ANICP). ANICP is integrated into a dynamic local fitting framework and produces accurate fitting. Note that DFRM is also used to provide better initialisation. Finally, we develop a dense 4D alignment framework that capitalises on GFRM-Alt and ANICP. This framework is employed to align faces from the 4DFAB database, which contains over 1,800,000 meshes from 180 subjects captured in four different sessions over 5 years. 
As the subjects display both spontaneous and posed expressions, it can also be used for facial expression modeling and analysis. To demonstrate the usefulness of 4DFAB, we conduct extensive expression recognition, face recognition, and speech recognition experiments. We also investigate, for the first time, the use of spontaneous 4D behaviour for biometric applications.
Published: 2018
Full Text: View/download PDF

13. Gaussian processes for modeling of facial expressions

Author: Eleftheriadis, Stefanos and Pantic, Maja
Subjects: 006.3
Abstract: Automated analysis of facial expressions has been gaining significant attention over the past years. This stems from the fact that it constitutes the primal step toward developing some of the next-generation computer technologies that can make an impact in many domains, ranging from medical imaging and health assessment to marketing and education. No matter the target application, the need to deploy systems under demanding, real-world conditions that can generalize well across the population is urgent. Hence, careful consideration of numerous factors has to be taken prior to designing such a system. The work presented in this thesis focuses on tackling two important problems in automated analysis of facial expressions: (i) view-invariant facial expression analysis; (ii) modeling of the structural patterns in the face, in terms of well coordinated facial muscle movements. Driven by the necessity for efficient and accurate inference mechanisms we explore machine learning techniques based on the probabilistic framework of Gaussian processes (GPs). Our ultimate goal is to design powerful models that can efficiently handle imagery with spontaneously displayed facial expressions, and explain in detail the complex configurations behind the human face in real-world situations. To effectively decouple the head pose and expression in the presence of large out-of-plane head rotations we introduce a manifold learning approach based on multi-view learning strategies. Contrary to the majority of existing methods that typically treat the numerous poses as individual problems, in this model we first learn a discriminative manifold shared by multiple views of a facial expression. Subsequently, we perform facial expression classification in the expression manifold. Hence, the pose normalization problem is solved by aligning the facial expressions from different poses in a common latent space. 
We demonstrate that the recovered manifold can efficiently generalize to various poses and expressions even from a small amount of training data, while also being largely robust to corrupted image features due to illumination variations. State-of-the-art performance is achieved in the task of facial expression classification of basic emotions. The methods that we propose for learning the structure in the configuration of the muscle movements represent some of the first attempts in the field of analysis and intensity estimation of facial expressions. In these models, we extend our multi-view approach to exploit relationships not only in the input features but also in the multi-output labels. The structure of the outputs is imposed into the recovered manifold either from heuristically defined hard constraints, or in an auto-encoded manner, where the structure is learned automatically from the input data. The resulting models are proven to be robust to data with imbalanced expression categories, due to our proposed Bayesian learning of the target manifold. We also propose a novel regression approach based on product of GP experts where we take into account people's individual expressiveness in order to adapt the learned models on each subject. We demonstrate the superior performance of our proposed models on the task of facial expression recognition and intensity estimation.
Published: 2017
Full Text: View/download PDF

14. Dense 3D facial shape recovery employing shading and correspondences

Author: Snape, Patrick, Zafeiriou, Stefanos, and Pantic, Maja
Subjects: 006.4
Abstract: Human faces are one of the most frequently captured objects in both videos and photographs due to their fundamental role in communication and social interactions. The variability of this facial imagery makes it difficult to automate the understanding of scenes containing faces under unconstrained conditions. For faces, recovering accurate dense 3D facial shape from images and videos enables much richer understanding of the human face and its interaction with the scene. In this thesis, we seek to extend the work in the area of dense 3D facial shape recovery under challenging unconstrained conditions. There are a wealth of ways to recover 3D shape from images and videos, all of which make specific assumptions about the relationship between the individual images and the construction of the scene. Given this broad selection of methods available, we examine three different scenarios for dense 3D facial shape recovery: i) recovery from a single image, ii) recovery from an unconstrained image collection without any explicit 3D shape priors and iii) recovery from a video sequence. We focus on these three cases and show how facial priors can be introduced to tackle the dense 3D facial surface recovery problem. We propose to investigate the use of shading constraints for dense shape recovery from unconstrained images. Given the challenging nature of these images, the introduction of priors greatly improves performance over the generic shape-from-shading literature. However, the introduction of explicit priors comes with a further problem, that of correspondence. That is, recovering the relationship between pixels in the image and the structure of our model. For this reason, we also investigate the importance of finding dense correspondences between facial images. We show that it is possible to recover plausible dense 3D facial surfaces under a variety of different input conditions.
Published: 2017
Full Text: View/download PDF

15. Component analysis of complex-valued data for machine learning and computer vision tasks

Author: Papaioannou, Athanasios, Zafeiriou, Stefanos, and Pantic, Maja
Subjects: 006.3
Abstract: This thesis studies component analysis techniques for complex-valued data. Until recently, these techniques have been applied to complex-valued data in one of two ways: either by splitting the data into their real and imaginary parts and then applying the techniques to the resulting real data, or by applying the techniques directly in the complex domain under some mild assumptions. We have focused on bridging the gap between the two alternatives and have shown how to work natively with complex data. To this end, we introduce the recently proposed widely linear model into component analysis and compare it with the classical approaches. Applications and experiments for these methods have been carried out on computer vision problems such as face reconstruction, face recognition, expression recognition and video tracking. To explain the whole approach in the complex domain, we present an introduction to the theory of complex calculus, illustrating the concepts of widely linear transformations, augmented matrix algebra, and Wirtinger and complex-valued matrix derivatives, and showing how all these principles can be used. We also give an overview of recent advances in the field of augmented statistics and widely linear modeling. The theory of complex-valued kernels and its use in Principal Component Analysis (PCA) is analysed in depth, and the widely linear version of PCA is presented along with applications of the proposed method. Furthermore, we examine shape representation in the real and complex domains and compare alternative representations for computer vision tasks. Finally, we have worked towards the unification of component analysis methods in a least-squares framework, and we examine similarities and differences between the circular and widely linear models.
Published: 2017
Full Text: View/download PDF

16. Robust subspace learning for static and dynamic affect and behaviour modelling

Author: Georgakis, Christos and Pantic, Maja
Subjects: 006.3
Abstract: Machine analysis of human affect and behavior in naturalistic contexts has witnessed a growing attention in the last decade from various disciplines ranging from social and cognitive sciences to machine learning and computer vision. Endowing machines with the ability to seamlessly detect, analyze, model, predict as well as simulate and synthesize manifestations of internal emotional and behavioral states in real-world data is deemed essential for the deployment of next-generation, emotionally- and socially-competent human-centered interfaces. In this thesis, we are primarily motivated by the problem of modeling, recognizing and predicting spontaneous expressions of non-verbal human affect and behavior manifested through either low-level facial attributes in static images or high-level semantic events in image sequences. Both visual data and annotations of naturalistic affect and behavior naturally contain noisy measurements of unbounded magnitude at random locations, commonly referred to as ‘outliers’. We present here machine learning methods that are robust to such gross, sparse noise. First, we deal with static analysis of face images, viewing the latter as a superposition of mutually-incoherent, low-complexity components corresponding to facial attributes, such as facial identity, expressions and activation of atomic facial muscle actions. We develop a robust, discriminant dictionary learning framework to extract these components from grossly corrupted training data and combine it with sparse representation to recognize the associated attributes. We demonstrate that our framework can jointly address interrelated classification tasks such as face and facial expression recognition. Inspired by the well-documented importance of the temporal aspect in perceiving affect and behavior, we direct the bulk of our research efforts into continuous-time modeling of dimensional affect and social behavior. 
Having identified a gap in the literature, namely the lack of data containing annotations of social attitudes in continuous time and scale, we first curate a new audio-visual database of multi-party conversations from political debates annotated frame-by-frame in terms of real-valued conflict intensity and use it to conduct the first study on continuous-time conflict intensity estimation. Our experimental findings corroborate previous evidence indicating the inability of existing classifiers to capture the hidden temporal structures of affective and behavioral displays. We present here a novel dynamic behavior analysis framework which models temporal dynamics in an explicit way, based on the natural assumption that continuous-time annotations of smoothly-varying affect or behavior can be viewed as outputs of a low-complexity linear dynamical system when behavioral cues (features) act as system inputs. A novel robust structured rank minimization framework is proposed to estimate the system parameters in the presence of gross corruptions and partially missing data. Experiments on prediction of dimensional conflict and affect as well as multi-object tracking from detection validate the effectiveness of our predictive framework and demonstrate for the first time that complex human behavior and affect can be learned and predicted based on small training sets of person(s)-specific observations.
Published: 2017
Full Text: View/download PDF

17. Advances in compositional fitting of active appearance models

Author: Alabort Medina, Joan, Zafeiriou, Stefanos, and Pantic, Maja
Subjects: 006.4
Abstract: This thesis presents a detailed and complete study of compositional gradient descent (CGD) algorithms for fitting active appearance models (AAM) and advances the state-of-the-art in generative AAM fitting by incorporating: (i) novel robust texture representations; (ii) novel cost functions and compositional types; and (iii) combined fitting approaches with complementary deformable models; into the original CGD framework. In particular, a robust texture representation based on image gradient orientations is used to define a new type of generative deformable model that generalizes well to variations in identity, pose, expression, illumination and occlusions and that can be fitted to images using standard CGD algorithms. Moreover, a novel Bayesian formulation of the AAM fitting problem, which can be interpreted as a probabilistic generalization of the well-known project-out inverse compositional (PIC) algorithm, is proposed along with two new types of composition, asymmetric and bidirectional, that lead to better convergent and more robust CGD fitting algorithms. At the same time, interesting insights into existent strategies used to derive fast and exact simultaneous CGD algorithms are provided by reinterpreting them as direct applications of the Schur complement and the Wiberg method. Finally, CGD algorithms are combined with similar generative fitting techniques for constrained local models (CLM) to create a unified probabilistic fitting framework that combines the strengths of both models (AAM and CLM) and produces state-of-the-art results on the problem of non-rigid face alignment in the wild.
Published: 2017
Full Text: View/download PDF

18. Robust bespoke facial deformable models

Author: Sagonas, Christos, Pantic, Maja, and Zafeiriou, Stefanos
Subjects: 006.6
Abstract: Automatic analysis of facial images is a problem of paramount importance due to its application in numerous real-world scenarios including security, entertainment, medicine, health care, multimedia, human-computer, and human-robot interactions. Arguably the most important step of an automatic face analysis system is the localization of the facial landmarks, because it has a crucial impact on the robustness and accuracy of the designed system. Facial landmarks localization is a very challenging Computer Vision problem, since the face is a highly deformable object and its appearance changes drastically under different poses, expressions, and illumination conditions. Recently, Computer Vision has witnessed great research advances towards automatic facial landmarks localization. Numerous methodologies have been proposed during the last few years that achieve accurate and efficient performance. The most successful methods are based on statistical deformable models. Developing powerful facial deformable models requires massive, annotated facial databases on which techniques can be trained, validated and tested. Over the past twenty years, the research community has collected and annotated a number of facial databases captured under both constrained and unconstrained (in-the-wild) conditions. However, the existing facial databases cannot be utilised for training the aforementioned models due to several limitations of the provided annotations. More specifically, most databases have been annotated using different mark-ups, and in most cases, the accuracy of the provided annotations is low. In addition to the aforementioned problems, the use of different training/testing sets and different error metrics makes a fair comparison between the existing methodologies almost infeasible. 
In this Thesis, we first aim to overcome the aforementioned problems by (a) proposing a semi-automatic annotation technique that was employed to re-annotate most existing facial databases under a unified protocol, and (b) presenting the 300 Faces In-The-Wild Challenge (300-W), the first facial landmark localization challenge that was organized twice, in 2013 and 2015. This is the first effort towards a unified annotation scheme of massive databases and a fair experimental comparison of existing facial landmarks localization systems. By tracking the published papers in recent Computer Vision conferences it can be seen that the produced annotations allowed the researchers to propose powerful generic facial deformable models for facial landmarks localization in still images. Nevertheless, when it comes to applications that require perfect facial landmarks localization and tracking accuracy, such as the analysis of human facial behaviour and facial motion capture, generic facial deformable models could be insufficient. In this case, person-specific facial deformable models are mainly employed, requiring manual annotation of facial landmarks for each person and subsequently person-specific training. In this Thesis, a novel method for the automatic construction of person-specific facial deformable models is proposed. To this end, an orthonormal subspace which is suitable for facial image reconstruction is learned. Next, to correct the erroneous fittings produced by a generic facial deformable model, image congealing (i.e., ensemble image alignment) is performed by employing only the learned orthonormal subspace. The image congealing problem is solved by formulating a suitable sparsity regularized rank minimization problem. After correcting the fittings, the next step is to construct the person-specific facial deformable model which could be further used to localize or track the facial landmarks in images that depict the same subject. 
This constitutes another contribution of this Thesis. After applying a generic or person-specific facial deformable model to a still image or a sequence of facial images, the next step of an automatic face analysis system is to remove the pose effect from the faces. To do that, landmark points-driven normalization (i.e., warping) of the faces into a common frame (e.g., frontal-view frame) is performed. However, most face normalization (pose correction) methods can be greatly affected by large poses, illumination variations, occlusions and poor localization of the facial landmarks. A final, significant contribution of this Thesis is the development of a novel method, robust to the aforementioned problems, for joint face frontalization (i.e., pose correction) and facial landmarks localization. Unlike the state-of-the-art methods for facial landmarks localization and pose correction, where a large number of manually annotated images or 3D facial models are required, the proposed method relies on a small set of frontal images only. By observing that the frontal facial image of both humans and animals is the one having the minimum rank of all different poses, a model which is able to jointly recover the frontalized version of the face as well as the facial landmarks is devised. Therefore, we solve an optimization problem that minimizes the nuclear norm together with a matrix ℓ1 norm that accounts for occlusions. This method is assessed for frontal view reconstruction of human and animal faces, landmark localization, pose-invariant face recognition, face verification in unconstrained conditions, and video in-painting by conducting experiments on nine databases. The experimental results demonstrate the effectiveness of the proposed method in comparison to the state-of-the-art methods for the target problems.
Published: 2017
Full Text: View/download PDF

19. Unsupervised analysis of behaviour dynamics

Author: Zafeiriou, Lazaros and Pantic, Maja
Subjects: 006.3
Abstract: Human facial behaviour analysis is an important task in developing automatic Human-Computer Interaction systems, and has received rapidly increasing attention over the past two decades. Dynamics of facial behaviour convey important information (e.g., for discriminating posed from spontaneous expressions) and remain to date a largely unexploited field. This thesis presents machine learning algorithms that focus on solving the relatively unexplored problem of extracting features that can efficiently and effectively capture the temporal dynamics of the behaviour, and can hence also be used for temporal alignment. The proposed methods are all unsupervised, i.e. they do not exploit any label information. The motivation behind the development of unsupervised algorithms lies in the fact that labelled/annotated data are very hard to obtain, since annotating behaviour dynamics is a very time-consuming, expensive and labour-intensive procedure. Additionally, in these models we incorporate temporal alignment, enabling a joint temporal decomposition of two or more time-series into a common expression manifold by employing either low-dimensional sets of landmarks or raw pixel intensities. This is a challenging problem for many scientific disciplines in which the observation samples need to be aligned in time. It is particularly significant for facial expressions, where the activation of facial muscles (Action Units) typically follows a set of predefined temporal phases. The methods that we propose for capturing the dynamics of facial expressions use Component Analysis (CA), which is a fundamental step in most computer vision applications, especially in terms of reducing the usually high-dimensional input data in a meaningful manner by preserving a certain function. These CA methodologies can be divided into deterministic and probabilistic techniques. 
In deterministic CA, the noise cannot be modelled and these methods do not provide prior information. On the other hand, probabilistic CA is a very powerful framework that naturally allows the incorporation of noise and a-priori knowledge in the developed models. A significant contribution of our work lies in proposing an Expectation Maximization (EM) algorithm for performing inference in a probabilistic formulation of Slow Feature Analysis (SFA) and extending it in order to handle more than one time-varying data sequence. Moreover, we demonstrate that the probabilistic SFA (EM-SFA) algorithm that discovers the common slowest varying latent space of multiple sequences can be combined with Dynamic Time Warping (DTW) techniques for robust sequence time-alignment. Most unsupervised learning techniques, such as Principal Components Analysis (PCA), enforce only a weak orthogonality constraint, resulting in a very distributed representation that uses cancellations to generate variability. This results in a holistic representation which makes the latent features difficult to interpret. To alleviate this, a group of unsupervised learning algorithms known as Non-negative Matrix Factorization (NMF) has been proposed. These algorithms enforce non-negativity constraints, resulting in a part-based representation, since they allow only additive and not subtractive combinations. Another major contribution of this thesis lies in proposing a model that combines the properties of temporal slowness and nonnegative parts-based learning into a common framework that aims to learn slowly varying parts-based representations of time-varying sequences. The proposed representations can be used in order to capture the underlying dynamics of temporal phenomena such as facial behaviour. Furthermore, we extend the above framework in order to align two visual sequences that display the same dynamic phenomenon by proposing a novel joint NMF. 
The proposed framework enables a joint temporal decomposition of two non-negative time-series into a non-negative shared latent space, where they can be temporally aligned. The proposed method is tailored for the temporal alignment of facial events since it is able to discover the facial parts that are jointly activated in the sequences along with their temporal activation envelope. We demonstrate the power of the proposed decompositions in unsupervised analysis of dynamic visual phenomena, as well as temporal alignment of facial behaviour. The predominant strategy for facial expression analysis and temporal analysis of facial events is the following: a generic facial landmarks tracker, usually trained on thousands of carefully annotated examples, is applied to track the landmark points, and then analysis is performed using mostly the shape and more rarely the facial texture. In this thesis, we challenge the above framework by showing that it is feasible to perform joint landmarks localization and temporal analysis of behavioural sequences with the use of a simple face detector and a simple shape model. To this end, we formulate a generative model which jointly describes the data and also captures temporal dependencies by incorporating an autoregressive chain in the latent space. We also extend this model by integrating a temporal alignment process in order to align two unsynchronized sequences of observations displaying highly deformable texture-varying objects. The resulting model is the first to perform simultaneous spatial and temporal alignment, showing that by treating the problems of deformable spatial and temporal alignment jointly, we achieve better results than by considering the problems independently.
Published: 2016
Full Text: View/download PDF

20. Towards spatial and temporal analysis of facial expressions in 3D data

Author: Rajamanoharan, Georgia, Zafeiriou, Stefanos, and Pantic, Maja
Subjects: 006.3
Abstract: Facial expressions are one of the most important means for communication of emotions and meaning. They are used to clarify and give emphasis, to express intentions, and form a crucial part of any human interaction. The ability to automatically recognise and analyse expressions could therefore prove to be vital in human behaviour understanding, which has applications in a number of areas such as psychology, medicine and security. 3D and 4D (3D+time) facial expression analysis is an expanding field, providing the ability to deal with problems inherent to 2D images, such as out-of-plane motion, head pose, and lighting and illumination issues. Analysis of data of this kind requires extending successful approaches applied to the 2D problem, as well as the development of new techniques. The introduction of recent new databases containing appropriate expression data, recorded in 3D or 4D, has allowed research into this exciting area for the first time. This thesis develops a number of techniques, both in 2D and 3D, that build towards a complete system for analysis of 4D expressions. Suitable feature types, designed by employing binary pattern methods, are developed for analysis of 3D facial geometry data. The full dynamics of 4D expressions are modelled, through a system reliant on motion-based features, to demonstrate how the different components of the expression (neutral-onset-apex-offset) can be distinguished and harnessed. Further, the spatial structure of expressions is harnessed to improve expression component intensity estimation in 2D videos. Finally, it is discussed how this latter step could be extended to 3D facial expression analysis, and also combined with temporal analysis. Thus, it is demonstrated that both spatial and temporal information, when combined with appropriate 3D features, is critical in analysis of 4D expression data.
Published: 2016
Full Text: View/download PDF

21. Robust subspace learning techniques for tracking and recognition of human faces

Author: Marras, Ioannis and Pantic, Maja
Subjects: 006.3
Abstract: Computer vision, in general, aims to duplicate (or in some cases compensate for) human vision, and has traditionally been used to perform routine, repetitive tasks, such as classification in massive assembly lines. Today, research on computer vision is spreading so widely that it is almost impossible to itemize all of its subtopics. Despite this, one can list several relevant applications, such as face processing (i.e. face, expression, and gesture recognition), computer human interaction, crowd surveillance, and content-based image retrieval. In this thesis, we propose subspace learning algorithms that head toward solving two important but largely understudied problems in automated face analysis: robust 2D plus 3D face tracking and robust 2D/3D face recognition in the wild. The methods that we propose for the former represent the pioneering work on face tracking and recognition. After describing all the unsolved problems a computer vision method for automated facial analysis has to deal with, we propose algorithms to deal with these problems. More specifically, we propose a subspace technique for robust rigid object tracking by fusing appearance models created based on different modalities. The proposed learning and fusing framework is robust, exact, computationally efficient and does not require off-line training. By using 3D information and an appropriate 3D motion model, pose and appearance are decoupled, and therefore learning and maintaining an updated model for appearance only is feasible by using efficient online subspace learning schemes, achieving in that way robust performance in very difficult tracking scenarios including extreme pose variations. Furthermore, we propose an efficient and robust subspace-based gradient ascent method for automatic face recognition, which is based on a correlation-based approach to parametric object alignment. 
Our algorithm performs the face recognition task by registering two face images by iteratively maximizing their correlation coefficient using gradient ascent as well as an appropriate motion model. We show the robustness of our algorithm for the problem of face recognition in the presence of occlusions and non-uniform illumination changes. In addition, we introduce a simple, efficient and robust subspace-based method for learning from the azimuth angle of surface normals for 3D face recognition. We show that an efficient subspace-based data representation based on the normal azimuth angles can be used for robust face recognition from facial surfaces. We demonstrated some of the favourable properties of this framework for the application of 3D face recognition. Extensions of our scheme span a wide range of theoretical topics and applications, from statistical machine learning and clustering to 3D object recognition. An important aspect of this method is that it can achieve good face recognition/ verification performance by using raw 3D scans without any heavy preprocessing (i.e., model fitting, surface smoothing etc.). Finally, we propose a methodology that jointly learns a generative deformable model with minimal human intervention by using only a simple shape model of the object and images automatically downloaded from the Internet, and also extracts features appropriate for classification. The proposed algorithm is tested on various classification problems such as 'in-the-wild' face recognition, as well as, Internet image based vision applications such as gender classification and eye glasses detection on data collected automatically by querying into a web image search engine.
Published: 2016
Full Text: View/download PDF

22. Infinite hidden conditional random fields for the recognition of human behaviour

Author: Bousmalis, Konstantinos and Pantic, Maja
Subjects: 004
Abstract: While detecting and interpreting temporal patterns of nonverbal behavioral cues in a given context is a natural and often unconscious process for humans, it remains a rather difficult task for computer systems. In this thesis we are primarily motivated by the problem of recognizing expressions of high-level behavior, and specifically agreement and disagreement. We thoroughly dissect the problem by surveying the nonverbal behavioral cues that could be present during displays of agreement and disagreement; we discuss a number of methods that could be used or adapted to detect these suggested cues; we list some publicly available databases these tools could be trained on for the analysis of spontaneous, audiovisual instances of agreement and disagreement; we examine the few existing attempts at agreement and disagreement classification; and we discuss the challenges in automatically detecting agreement and disagreement. We present experiments that show that an existing discriminative graphical model, the Hidden Conditional Random Field (HCRF), is the best performing on this task. The HCRF is a discriminative latent variable model which has been previously shown to successfully learn the hidden structure of a given classification problem (provided an appropriate validation of the number of hidden states). We show here that HCRFs are also able to capture what makes each of these social attitudes unique. We present an efficient technique to analyze the concepts learned by the HCRF model and show that these coincide with the findings from social psychology regarding which cues are most prevalent in agreement and disagreement. Our experiments are performed on a spontaneous expressions dataset curated from real televised debates. The HCRF model outperforms conventional approaches such as Hidden Markov Models and Support Vector Machines.
Subsequently, we examine existing graphical models that use Bayesian nonparametrics to have a countably infinite number of hidden states and adapt their complexity to the data at hand. We identify a gap in the literature, namely the lack of a discriminative graphical model of this kind, and we present our suggestion for the first such model: an HCRF with an infinite number of hidden states, the Infinite Hidden Conditional Random Field (IHCRF). In summary, the IHCRF is an undirected discriminative graphical model for sequence classification and uses a countably infinite number of hidden states. We present two variants of this model. The first is a fully nonparametric model that relies on Hierarchical Dirichlet Processes and a Markov Chain Monte Carlo inference approach. The second is a semi-parametric model that uses Dirichlet Process Mixtures and relies on a mean-field variational inference approach. We show that both models are able to converge to a correct number of represented hidden states, and perform as well as the best finite HCRFs (chosen via cross-validation) for the difficult tasks of recognizing instances of agreement, disagreement, and pain in audiovisual sequences.
Published: 2015
Full Text: View/download PDF

23. Robust online subspace learning

Author: Liwicki, Stephan and Pantic, Maja
Subjects: 004
Abstract: In this thesis, I aim to advance the theories of online non-linear subspace learning through the development of strategies which are both efficient and robust. The use of subspace learning methods is very popular in computer vision and they have been applied to numerous tasks. With the increasing need for real-time applications, the formulation of online (i.e. incremental and real-time) learning methods is a vibrant research field and has received much attention from the research community. A major advantage of incremental systems is that they update the hypothesis during execution, thus allowing for the incorporation of the real data seen in the testing phase. Tracking acts as an attractive and popular evaluation tool for incremental systems, and thus, the connection between online learning and adaptive tracking is seen commonly in the literature. The proposed system in this thesis facilitates learning from noisy input data, e.g. caused by occlusions, cast shadows and pose variations, which are challenging problems in general tracking frameworks. First, a fast and robust alternative to standard L2-norm principal component analysis (PCA) is introduced, which I coin Euler PCA (e-PCA). The formulation of e-PCA is based on robust, non-linear kernel PCA (KPCA) with a cosine-based kernel function that is expressed via an explicit feature space. When applied to tracking, face reconstruction and background modeling, promising results are achieved. In the second part, the problem of matching vectors of 3D rotations is explicitly targeted. A novel distance which is robust for 3D rotations is introduced, and formulated as a kernel function. The kernel leads to a new representation of 3D rotations, the full-angle quaternion (FAQ) representation. Finally, I propose 3D object recognition from point clouds, and object tracking with color values using FAQs. A domain-specific kernel function designed for visual data is then presented.
KPCA with Krein space kernels is introduced, as this kernel is indefinite, and an exact incremental learning framework for the new kernel is developed. In a tracker framework, the presented online learning outperforms the competitors in nine popular and challenging video sequences. In the final part, the generalized eigenvalue problem is studied. Specifically, incremental slow feature analysis (SFA) with indefinite kernels is proposed, and applied to temporal video segmentation and tracking with change detection. As online SFA allows for drift detection, further improvements are achieved in the evaluation of the tracking task.
Published: 2015
Full Text: View/download PDF

24. Regression-based estimation of pain and facial expression intensity

Author: Kaltwang, Sebastian and Pantic, Maja
Subjects: 616.89
Abstract: Human inner feelings and psychological states like pain are subjective states that cannot be directly measured, but can be estimated from non-verbal behaviour such as spontaneous facial expressions. Since these expressions are typically characterized by subtle movements of facial parts, analysis of the facial details is required. The contribution of this thesis is two-fold. First, we propose a novel set of Bayesian regression-based learning methods for intensity estimation of facial expressions. Second, we create and publicly release the first multi-modal database of patients experiencing chronic pain, in order to facilitate further research into machine learning for automated analysis of pain. We formulate three novel regression methods for continuous estimation of the intensity of facial expressions of pain and facial muscle groups (AUs). The first regression model treats the observed face holistically and estimates the intensity of target expressions using the framework of Relevance Vector Machine (RVM) and the newly proposed fusion of the shape and appearance features. This is the first method in the field that addresses automated continuous intensity estimation of facial expressions of pain. We then extend this approach to the Doubly Sparse RVM (DSRVM) that automatically learns the importance of various facial parts for the target task at hand. DSRVM achieves this by enforcing double sparsity by jointly selecting the most relevant training examples (a.k.a. relevance vectors) and the most important kernels associated with the informative facial parts for estimation of facial expression intensity. This advances prior work on multiple-kernel learning (MKL), where the kernel sparsity is typically ignored, leading to improved intensity estimation performance over existing MKL methods and the state-of-the-art methods for intensity estimation of pain and AUs.
Lastly, we introduce a regression-based approach that jointly learns the inter-dependence of facial parts and multiple AU or pain targets. This is accomplished by a newly formulated latent tree (LT) model, that efficiently learns a hidden inference structure between features and targets. The proposed approach is the first that addresses the joint estimation of continuous intensity of multiple AU outputs in a principled manner. We show that this joint approach achieves better intensity estimation of AUs compared to existing methods, especially in the presence of noisy inputs. The proposed regression methods have been evaluated on two established datasets of naturalistic facial expressions, i.e., DISFA and ShoulderPain, and our newly created dataset, named EmoPain. The new database consists of spontaneously displayed pain-related facial expressions and body movements recorded by multiple modalities, while patients with chronic back-pain were performing instructed physical exercises. Facial expression videos have been annotated frame-wise in terms of the continuous pain intensity. We empirically demonstrate on all three datasets the advantages of using the proposed methods for intensity estimation of facial expressions. In particular, we show that the proposed local methods, which model the face explicitly as the sum of its parts, outperform the existing state-of-the-art methods for the target tasks. This supports the findings in psychology research which suggest that components of expressions, rather than the holistic face, play the key role in the interpretation of human facial expressions and, in particular, in the estimation of their intensity.
Published: 2015
Full Text: View/download PDF

25. Realising affect-sensitive multimodal human-computer interface : hardware and software infrastructure

Author: Shen, Jie and Pantic, Maja
Subjects: 004
Abstract: With the industry's recent paradigm shift from PC-centred applications to services delivered through ubiquitous computing in a more human centred manner, multimodal human-computer interfaces (MHCI) became an emerging research topic. As an important but often neglected aspect, the lack of appropriate system integration tools hinders the development of MHCI systems. Therefore, the work presented in this thesis aims at delivering hardware / software infrastructure to facilitate the full development cycle of MHCI systems. Specifically, we first built a hardware platform for synchronised, multimodal-data capturing to support and facilitate automatic human behaviour understanding from multiple audiovisual sensors. Then we developed a software framework, called the HCI^2 Framework, to facilitate the modular development and rapid prototyping of readily-applicable MHCI systems. As a proof of concept, we also present an affect-sensitive game with humanoid robot NAO developed using the HCI^2 Framework. Studies on automatic human behaviour understanding require high-bandwidth recording from multiple cameras, as well as from other sensors such as microphones and eye-gaze trackers. In addition, sensor fusion should be realised with high accuracy as to achieve tight synchronisation between sensors and, in turn, enable studies of correlation between various behavioural signals. Using commercial off-the-shelf components may compromise quality and accuracy due to several issues including handling the combined data rate from multiple sensors, unknown offset and rate discrepancies between independent hardware clocks, the absence of trigger inputs or -outputs in the hardware, as well as the existence of different methods for time-stamping the recorded data. To achieve accurate synchronisation, we centralise the synchronisation task by recording all trigger or timestamp signals with a multi-channel audio interface. 
For sensors not having an external trigger signal, we let the computer that captures the sensor data periodically generate timestamp signals from its serial port output. These signals can also be used as a common time base to synchronise multiple asynchronous audio interfaces. The resulting data recording platform, which is built upon two consumer-grade PCs, is capable of capturing 8-bit video data with 1024 x 1024 spatial- and 59.1 Hz temporal resolution, from at least 14 cameras, together with 8 channels of 24-bit audio at 96 kHz and eye-gaze tracking results sampled at a frequency of 60 or 120 Hz. The attained synchronisation accuracy is unprecedented to date. To facilitate rapid development of readily-applicable MHCI systems using algorithms designed to detect and track behavioural signals (e.g. face detector, facial fiducial point tracker, expression recogniser, etc.), a software integration framework is required. The proposed software framework, which is called the HCI^2 Framework, is built upon publish/subscribe (P/S) architecture. It implements a shared-memory-based data transport protocol for message delivery and a TCP-based system management protocol. The latter ensures that the integrity of system structure is maintained at runtime. With the inclusion of 'bridging modules', the HCI^2 Framework is interoperable with other software frameworks including Psyclone and ActiveMQ. In addition to the core communication middleware, we also present the integrated development environment (IDE) of the HCI^2 Framework. It provides a complete graphical environment to support every step in a typical MHCI system development process, including module development, debugging, packaging, and management, as well as the whole system management and testing. The quantitative evaluation indicates that our framework outperforms other similar tools in terms of average message latency and maximum data throughput under a typical single PC scenario.
To demonstrate HCI^2 Framework's capabilities in integrating heterogeneous modules, we present several example modules working with a variety of hardware and software. We also present two use cases of the HCI^2 Framework: a computer game, called CamGame, based on hand-held marker(s) and low-cost camera(s) and the human affective signal analysis component of the Fun Robotic Outdoor Guide (FROG) project (http://www.frogrobot.eu/). Using the HCI^2 Framework, we further developed the Mimic-Me Game, which consists of an interactive game played with the NAO humanoid robot. The game involves the robot 'mimicking' the player's facial expression using a combination of body gestures and audio cues. A multimodal dialogue model has been designed and implemented to enable the robot to interact with the human player in a naturalistic way using only natural language, head movement and facial expressions.
Published: 2014
Full Text: View/download PDF

26. Machine learning techniques for automated analysis of facial expressions

Author: Rudovic, Ognjen and Pantic, Maja
Subjects: 004
Abstract: Automated analysis of facial expressions paves the way for numerous next-generation computing tools including affective computing technologies (proactive and affective user interfaces), learner-adaptive tutoring systems, medical and marketing applications, etc. In this thesis, we propose machine learning algorithms that head toward solving two important but largely understudied problems in automated analysis of facial expressions from facial images: pose-invariant facial expression classification, and modeling of dynamics of facial expressions, in terms of their temporal segments and intensity. The methods that we propose for the former represent the pioneering work on pose-invariant facial expression analysis. In these methods, we use our newly introduced models for pose normalization that achieve successful decoupling of head pose and expression in the presence of large out-of-plane head rotations, followed by facial expression classification. This is in contrast to most existing works, which can deal only with small in-plane head rotations. We derive our models for pose normalization using the Gaussian Process (GP) framework for regression and manifold learning. In these, we model the structure encoded in relationships between facial expressions from different poses and also in facial shapes. This results in the models that can successfully perform pose normalization either by warping facial expressions from non-frontal poses to the frontal pose, or by aligning facial expressions from different poses on a common expression manifold. These models solve some of the most important challenges of pose-invariant facial expression classification by being able to generalize to various poses and expressions from a small amount of training data, while also being largely robust to corrupted image features and imbalanced examples of different facial expression categories.
We demonstrate this on the task of pose-invariant facial expression classification of six basic emotions. The methods that we propose for temporal segmentation and intensity estimation of facial expressions represent some of the first attempts in the field to model facial expression dynamics. In these methods, we use the Conditional Random Fields (CRF) framework to define dynamic models that encode the spatio-temporal structure of the expression data, reflected in ordinal and temporal relationships between temporal segments and intensity levels of facial expressions. We also propose several means of addressing the subject variability in the data by simultaneously exploiting various priors, and the effects of heteroscedasticity and context of target facial expressions. The resulting models are the first to address simultaneous classification and temporal segmentation of facial expressions of six basic emotions, and dynamic modeling of intensity of facial expressions of pain. Moreover, the context-sensitive model that we propose for intensity estimation of spontaneously displayed facial expressions of pain and Action Units (AUs), is the first approach in the field that performs context-sensitive modeling of facial expressions in a principled manner.
Published: 2014
Full Text: View/download PDF

27. Spatial and temporal analysis of facial actions

Author: Jiang, Bihan and Pantic, Maja
Subjects: 004
Abstract: Facial expression recognition has been an active topic in computer vision since the 1990s due to its wide applications in human-computer interaction, entertainment, security, and health care. Previous works on automatic analysis of facial expressions have focused mostly on detecting prototypic expressions of basic emotions like happiness and anger. In contrast, the Facial Action Coding System (FACS) is one of the most comprehensive and objective ways to describe facial expressions. It associates facial expressions with the actions of the muscles that produce them by defining a set of atomic movements called Action Units (AUs). The system allows any facial expression to be uniquely described by a combination of AUs. Over the past decades, extensive research has been conducted by psychologists and neuroscientists on various applications of facial expression analysis using FACS. Automating FACS coding would make this research faster and more widely applicable, opening up new avenues to understanding how we communicate through facial expressions. Morphology and dynamics are the two aspects of facial actions that are crucial for the interpretation of human facial behaviour. The focus of this thesis is how to represent and learn the rich facial texture changes in both the spatial and temporal domain. The effectiveness of spatial and spatio-temporal facial representations and their roles in detecting the activation and temporal dynamics of facial actions are explored. In the spatial domain, a novel feature extraction strategy is proposed based on heuristically defined regions, for each of which a separate classifier is trained, with the classifiers fused at the decision level. In the temporal domain, a novel dynamic appearance descriptor is presented by extending the static appearance descriptor Local Phase Quantisation (LPQ) to the temporal domain by using the Three Orthogonal Planes (TOP).
The resulting dynamic appearance descriptor LPQ-TOP is applied to detect the latent temporal information representing facial appearance changes and explicitly model facial dynamics of AUs in terms of their temporal segments. Finally, a parametric temporal alignment method is proposed. Such a strategy can accommodate very flexible time warp functions and is able to deal with both sequence-to-sequence and sub-sequence alignment. This method also opens up a new approach to the problem of AU temporal segment detection. This thesis contributes to facial action recognition by modelling the spatial and temporal texture changes for AU activation detection and AU temporal segmentation. We advance the performance of state-of-the-art facial action recognition systems and this has been demonstrated on a number of commonly used databases.
Published: 2014
Full Text: View/download PDF

28. Spatiotemporal visual analysis of human actions

Author: Oikonomopoulos, Antonios, Pantic, Maja, and Davison, Andrew
Subjects: 006.3
Abstract: In this dissertation we propose four methods for the recognition of human activities. In all four of them, the representation of the activities is based on spatiotemporal features that are automatically detected at areas where there is a significant amount of independent motion, that is, motion that is due to ongoing activities in the scene. We propose the use of spatiotemporal salient points as features throughout this dissertation. The algorithms presented, however, can be used with any kind of features, as long as the latter are well localized and have a well-defined area of support in space and time. We introduce the utilized spatiotemporal salient points in the first method presented in this dissertation. By extending previous work on spatial saliency, we measure the variations in the information content of pixel neighborhoods both in space and time, and detect the points at the locations and scales for which this information content is locally maximized. In this way, an activity is represented as a collection of spatiotemporal salient points. We propose an iterative linear space-time warping technique in order to align the representations in space and time and propose to use Relevance Vector Machines (RVM) in order to classify each example into an action category. In the second method proposed in this dissertation we propose to enhance the acquired representations of the first method. More specifically, we propose to track each detected point in time, and create representations based on sets of trajectories, where each trajectory expresses how the information engulfed by each salient point evolves over time. In order to deal with imperfect localization of the detected points, we augment the observation model of the tracker with background information, acquired using a fully automatic background estimation algorithm. In this way, the tracker favors solutions that contain a large number of foreground pixels. 
In addition, we perform experiments where the tracked templates are localized on specific parts of the body, like the hands and the head, and we further augment the tracker’s observation model using a human skin color model. Finally, we use a variant of the Longest Common Subsequence algorithm (LCSS) in order to acquire a similarity measure between the resulting trajectory representations, and RVMs for classification. In the third method that we propose, we assume that neighboring salient points follow a similar motion. This is in contrast to the previous method, where each salient point was tracked independently of its neighbors. More specifically, we propose to extract a novel set of visual descriptors that are based on geometrical properties of three-dimensional piece-wise polynomials. The latter are fitted on the spatiotemporal locations of salient points that fall within local spatiotemporal neighborhoods, and are assumed to follow a similar motion. The extracted descriptors are invariant in translation and scaling in space-time. Coupling the neighborhood dimensions to the scale at which the corresponding spatiotemporal salient points are detected ensures the latter. The descriptors that are extracted across the whole dataset are subsequently clustered in order to create a codebook, which is used in order to represent the overall motion of the subjects within small temporal windows. Finally, we use boosting in order to select the most discriminative of these windows for each class, and RVMs for classification. The fourth and last method addresses the joint problem of localization and recognition of human activities depicted in unsegmented image sequences. Its main contribution is the use of an implicit representation of the spatiotemporal shape of the activity, which relies on the spatiotemporal localization of characteristic ensembles of spatiotemporal features. The latter are localized around automatically detected salient points.
Evidence for the spatiotemporal localization of the activity is accumulated in a probabilistic spatiotemporal voting scheme. During training, we use boosting in order to create codebooks of characteristic feature ensembles for each class. Subsequently, we construct class-specific spatiotemporal models, which encode where in space and time each codeword ensemble appears in the training set. During testing, each activated codeword ensemble casts probabilistic votes concerning the spatiotemporal localization of the activity, according to the information stored during training. We use a Mean Shift Mode estimation algorithm in order to extract the most probable hypotheses from each resulting voting space. Each hypothesis corresponds to a spatiotemporal volume which potentially engulfs the activity, and is verified by performing action category classification with an RVM classifier.
Published: 2010
Full Text: View/download PDF

29. Timing is everything : a spatio-temporal approach to the analysis of facial actions

Author: Valstar, Michel Francois, Rueckert, Daniel, and Pantic, Maja
Subjects: 006.42
Abstract: This thesis presents a fully automatic facial expression analysis system based on the Facial Action Coding System (FACS). FACS is the best known and the most commonly used system to describe facial activity in terms of facial muscle actions (i.e., action units, AUs). We will present our research on the analysis of the morphological, spatio-temporal and behavioural aspects of facial expressions. In contrast with most other researchers in the field who use appearance based techniques, we use a geometric feature based approach. We will argue that that approach is more suitable for analysing facial expression temporal dynamics. Our system is capable of explicitly exploring the temporal aspects of facial expressions from an input colour video in terms of their onset (start), apex (peak) and offset (end). The fully automatic system presented here detects 20 facial points in the first frame and tracks them throughout the video. From the tracked points we compute geometry-based features which serve as the input to the remainder of our systems. The AU activation detection system uses GentleBoost feature selection and a Support Vector Machine (SVM) classifier to find which AUs were present in an expression. Temporal dynamics of active AUs are recognised by a hybrid GentleBoost-SVM-Hidden Markov model classifier. The system is capable of analysing 23 out of 27 existing AUs with high accuracy. The main contributions of the work presented in this thesis are the following: we have created a method for fully automatic AU analysis with state-of-the-art recognition results. We have proposed for the first time a method for recognition of the four temporal phases of an AU. We have built the largest comprehensive database of facial expressions to date. We also present for the first time in the literature two studies for automatic distinction between posed and spontaneous expressions.
Published: 2008
Full Text: View/download PDF
