Author: "Shi, Bertram E." - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Shi, Bertram E."' showing total 702 results

1. Merging Multiple Datasets for Improved Appearance-Based Gaze Estimation

Author: Wu, Liang and Shi, Bertram E.
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Multiple datasets have been created for training and testing appearance-based gaze estimators. Intuitively, more data should lead to better performance. However, combining datasets to train a single estimator rarely improves gaze estimation performance. One reason may be differences in the experimental protocols used to obtain the gaze samples, resulting in differences in the distributions of head poses, gaze angles, illumination, etc. Another reason may be the inconsistency between methods used to define gaze angles (label mismatch). We propose two innovations to improve the performance of gaze estimation by leveraging multiple datasets: a change in the estimator architecture and the introduction of a gaze adaptation module. Most state-of-the-art estimators merge information extracted from images of the two eyes and the entire face either in parallel or combine information from the eyes first then with the face. Our proposed Two-stage Transformer-based Gaze-feature Fusion (TTGF) method uses transformers to merge information from each eye and the face separately and then merge across the two eyes. We argue that this improves head pose invariance since changes in head pose affect left and right eye images in different ways. Our proposed Gaze Adaptation Module (GAM) method handles annotation inconsistency by applying a Gaze Adaptation Module for each dataset to correct gaze estimates from a single shared estimator. This enables us to combine information across datasets despite differences in labeling. Our experiments show that these innovations improve gaze estimation performance over the SOTA both individually and collectively (by 10% - 20%). Our code is available at https://github.com/HKUST-NISL/GazeSetMerge., Comment: 14 pages
Published: 2024

2. Survey of Design Paradigms for Social Robots

Author: Frieske, Rita, Mo, Xiaoyu, Fang, Yini, Nieles, Jay, and Shi, Bertram E.
Subjects: Computer Science - Robotics, Computer Science - Computation and Language, Computer Science - Computers and Society
Abstract: The demand for social robots in fields like healthcare, education, and entertainment increases due to their emotional adaptation features. These robots leverage multimodal communication, incorporating speech, facial expressions, and gestures to enhance user engagement and emotional support. The understanding of design paradigms of social robots is obstructed by the complexity of the system and the necessity to tune it to a specific task. This article provides a structured review of social robot design paradigms, categorizing them into cognitive architectures, role design models, linguistic models, communication flow, activity system models, and integrated design models. By breaking down the articles on social robot design and application based on these paradigms, we highlight the strengths and areas for improvement in current approaches. We further propose our original integrated design model that combines the most important aspects of the design of social robots. Our approach shows the importance of integrating operational, communicational, and emotional dimensions to create more adaptive and empathetic interactions between robots and humans.
Published: 2024

3. ERIT Lightweight Multimodal Dataset for Elderly Emotion Recognition and Multimodal Fusion Evaluation

Author: Frieske, Rita and Shi, Bertram E.
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computation and Language, Computer Science - Computers and Society
Abstract: ERIT is a novel multimodal dataset designed to facilitate research in lightweight multimodal fusion. It contains text and image data collected from videos of elderly individuals reacting to various situations, as well as seven emotion labels for each data sample. Because it uses labeled images of elderly users reacting emotionally, it also facilitates research on emotion recognition in an age group that is underrepresented in machine learning visual emotion recognition. The dataset is validated through comprehensive experiments indicating its importance in neural multimodal fusion research.
Published: 2024

4. Hallucinations in Neural Automatic Speech Recognition: Identifying Errors and Hallucinatory Models

Author: Frieske, Rita and Shi, Bertram E.
Subjects: Computer Science - Computation and Language, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Hallucinations are a type of output error produced by deep neural networks. While they have been studied in natural language processing, they have not previously been researched in automatic speech recognition (ASR). Here, we define hallucinations in ASR as transcriptions generated by a model that are semantically unrelated to the source utterance, yet still fluent and coherent. The similarity of hallucinations to probable natural language outputs of the model creates a danger of deception and impacts the credibility of the system. We show that commonly used metrics, such as word error rates, cannot differentiate between hallucinatory and non-hallucinatory models. To address this, we propose a perturbation-based method for assessing the susceptibility of an ASR model to hallucination at test time, which does not require access to the training dataset. We demonstrate that this method helps to distinguish between hallucinatory and non-hallucinatory models that have similar baseline word error rates. We further explore the relationship between the types of ASR errors and the types of dataset noise to determine what types of noise are most likely to create hallucinatory outputs. We devise a framework for identifying hallucinations by analysing their semantic connection with the ground truth and their fluency. Finally, we discover how to induce hallucinations with a random noise injection to the utterance.
Published: 2024

5. Reducing Stress and Anxiety in the Metaverse: A Systematic Review of Meditation, Mindfulness and Virtual Reality

Author: Wang, Xian, Mo, Xiaoyu, Fan, Mingming, Lee, Lik-Hang, Shi, Bertram E., and Hui, Pan
Subjects: Computer Science - Human-Computer Interaction
Abstract: Meditation, or mindfulness, is widely used to improve mental health. With the emergence of Virtual Reality technology, many studies have provided evidence that meditation with VR can bring health benefits. However, to our knowledge, there are no guidelines or comprehensive reviews in the literature on how to conduct such research in virtual reality. In order to understand the role of VR technology in meditation and future research opportunities, we conducted a systematic literature review in the IEEE and ACM databases. Our process yielded 19 eligible papers, on which we conducted a structured analysis. From these papers, we summarize the state of the art in meditation types, design considerations, and VR technology, and identify research opportunities and challenges for the future.
Published: 2022

6. CI-AVSR: A Cantonese Audio-Visual Speech Dataset for In-car Command Recognition

Author: Dai, Wenliang, Cahyawijaya, Samuel, Yu, Tiezheng, Barezi, Elham J., Xu, Peng, Yiu, Cheuk Tung Shadow, Frieske, Rita, Lovenia, Holy, Winata, Genta Indra, Chen, Qifeng, Ma, Xiaojuan, Shi, Bertram E., and Fung, Pascale
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: With the rise of deep learning and intelligent vehicles, the smart assistant has become an essential in-car component to facilitate driving and provide extra functionalities. In-car smart assistants should be able to process general as well as car-related commands and perform corresponding actions, which eases driving and improves safety. However, there is a data scarcity issue for low resource languages, hindering the development of research and applications. In this paper, we introduce a new dataset, Cantonese In-car Audio-Visual Speech Recognition (CI-AVSR), for in-car command recognition in the Cantonese language with both video and audio data. It consists of 4,984 samples (8.3 hours) of 200 in-car commands recorded by 30 native Cantonese speakers. Furthermore, we augment our dataset using common in-car background noises to simulate real environments, producing a dataset 10 times larger than the collected one. We provide detailed statistics of both the clean and the augmented versions of our dataset. Moreover, we implement two multimodal baselines to demonstrate the validity of CI-AVSR. Experiment results show that leveraging the visual signal improves the overall performance of the model. Although our best model can achieve a considerable quality on the clean test set, the speech recognition quality on the noisy data is still inferior and remains an extremely challenging task for real in-car speech recognition systems. The dataset and code will be released at https://github.com/HLTCHKUST/CI-AVSR., Comment: 6 pages
Published: 2022

7. Automatic Speech Recognition Datasets in Cantonese: A Survey and New Dataset

Author: Yu, Tiezheng, Frieske, Rita, Xu, Peng, Cahyawijaya, Samuel, Yiu, Cheuk Tung Shadow, Lovenia, Holy, Dai, Wenliang, Barezi, Elham J., Chen, Qifeng, Ma, Xiaojuan, Shi, Bertram E., and Fung, Pascale
Subjects: Computer Science - Computation and Language, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Automatic speech recognition (ASR) on low resource languages improves the access of linguistic minorities to technological advantages provided by artificial intelligence (AI). In this paper, we address the problem of data scarcity for the Hong Kong Cantonese language by creating a new Cantonese dataset. Our dataset, Multi-Domain Cantonese Corpus (MDCC), consists of 73.6 hours of clean read speech paired with transcripts, collected from Cantonese audiobooks from Hong Kong. It comprises philosophy, politics, education, culture, lifestyle and family domains, covering a wide range of topics. We also review all existing Cantonese datasets and analyze them according to their speech type, data source, total size and availability. We further conduct experiments with Fairseq S2T Transformer, a state-of-the-art ASR model, on the biggest existing dataset, Common Voice zh-HK, and our proposed MDCC, and the results show the effectiveness of our dataset. In addition, we create a powerful and robust Cantonese ASR model by applying multi-dataset learning on MDCC and Common Voice zh-HK.
Published: 2022

8. ASCEND: A Spontaneous Chinese-English Dataset for Code-switching in Multi-turn Conversation

Author: Lovenia, Holy, Cahyawijaya, Samuel, Winata, Genta Indra, Xu, Peng, Yan, Xu, Liu, Zihan, Frieske, Rita, Yu, Tiezheng, Dai, Wenliang, Barezi, Elham J., Chen, Qifeng, Ma, Xiaojuan, Shi, Bertram E., and Fung, Pascale
Subjects: Computer Science - Computation and Language
Abstract: Code-switching is a speech phenomenon occurring when a speaker switches language during a conversation. Despite the spontaneous nature of code-switching in conversational spoken language, most existing works collect code-switching data from read speech instead of spontaneous speech. ASCEND (A Spontaneous Chinese-English Dataset) is a high-quality Mandarin Chinese-English code-switching corpus built on spontaneous multi-turn conversational dialogue sources collected in Hong Kong. We report ASCEND's design and procedure for collecting the speech data, including annotations. ASCEND consists of 10.62 hours of clean speech, collected from 23 bilingual speakers of Chinese and English. Furthermore, we conduct baseline experiments using pre-trained wav2vec 2.0 models, achieving a best performance of 22.69% character error rate and 27.05% mixed error rate.
Published: 2021
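
Note: the character error rate quoted in the abstract above is conventionally computed as the character-level edit distance between the reference and the hypothesis, divided by the reference length. The following is a minimal sketch under that standard definition, not the authors' evaluation code; the function names are hypothetical, and the mixed error rate is not covered.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, 1):
        curr = [i]  # deleting the first i reference symbols
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(ref, hyp):
    """Edit distance over characters, normalized by the reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    print(character_error_rate("abcd", "abxd"))
```

Because the score is normalized by the reference length only, a hypothesis with many insertions can score above 1.0.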

9. Iterative Distillation for Better Uncertainty Estimates in Multitask Emotion Recognition

Author: Deng, Didan, Wu, Liang, and Shi, Bertram E.
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: When recognizing emotions, subtle nuances in displays of emotion generate ambiguity or uncertainty in emotion perception. Emotion uncertainty has been previously interpreted as inter-rater disagreement among multiple annotators. In this paper, we consider a more common and challenging scenario: modeling emotion uncertainty when only single emotion labels are available. From a Bayesian perspective, we propose to use deep ensembles to capture uncertainty for multiple emotion descriptors, i.e., action units, discrete expression labels and continuous descriptors. We further apply iterative self-distillation. Iterative distillation over multiple generations significantly improves performance in both emotion recognition and uncertainty estimation. Our method generates single student models that provide accurate estimates of uncertainty for in-domain samples and a student ensemble that can detect out-of-domain samples. Our experiments on emotion recognition and uncertainty estimation using the Aff-wild2 dataset demonstrate that our algorithm gives more reliable uncertainty estimates than both Temperature Scaling and Monte Carlo Dropout., Comment: Accepted as a Workshop paper in ICCV2021 proceeding
Published: 2021

10. Learning Hierarchical Integration of Foveal and Peripheral Vision for Vergence Control by Active Efficient Coding

Author: Zhao, Zhetuo, Triesch, Jochen, and Shi, Bertram E.
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: The active efficient coding (AEC) framework parsimoniously explains the joint development of visual processing and eye movements, e.g., the emergence of binocular disparity selective neurons and fusional vergence, the disjunctive eye movements that align left and right eye images. Vergence can be driven by information in both the fovea and periphery, which play complementary roles. The high resolution fovea can drive precise short range movements. The lower resolution periphery supports coarser long range movements. The fovea and periphery may also contain conflicting information, e.g. due to objects at different depths. While past AEC models did integrate peripheral and foveal information, they did not explicitly take into account these characteristics. We propose here a two-level hierarchical approach that does. The bottom level generates different vergence actions from foveal and peripheral regions. The top level selects one. We demonstrate that the hierarchical approach performs better than prior approaches in realistic environments, exhibiting better alignment and less oscillation.
Published: 2021

11. Self-Calibrating Active Binocular Vision via Active Efficient Coding with Deep Autoencoders

Author: Wilmot, Charles, Shi, Bertram E., and Triesch, Jochen
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: We present a model of the self-calibration of active binocular vision comprising the simultaneous learning of visual representations, vergence, and pursuit eye movements. The model follows the principle of Active Efficient Coding (AEC), a recent extension of the classic Efficient Coding Hypothesis to active perception. In contrast to previous AEC models, the present model uses deep autoencoders to learn sensory representations. We also propose a new formulation of the intrinsic motivation signal that guides the learning of behavior. We demonstrate the performance of the model in simulations.
Published: 2021

12. AVGCN: Trajectory Prediction using Graph Convolutional Networks Guided by Human Attention

Author: Liu, Congcong, Chen, Yuying, Liu, Ming, and Shi, Bertram E.
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Robotics
Abstract: Pedestrian trajectory prediction is a critical yet challenging task, especially for crowded scenes. We suggest that introducing an attention mechanism to infer the importance of different neighbors is critical for accurate trajectory prediction in scenes with varying crowd size. In this work, we propose a novel method, AVGCN, for trajectory prediction utilizing graph convolutional networks (GCN) based on human attention (A denotes attention, V denotes visual field constraints). First, we train an attention network that estimates the importance of neighboring pedestrians, using gaze data collected as subjects perform a bird's eye view crowd navigation task. Then, we incorporate the learned attention weights modulated by constraints on the pedestrian's visual field into a trajectory prediction network that uses a GCN to aggregate information from neighbors efficiently. AVGCN also considers the stochastic nature of pedestrian trajectories by taking advantage of variational trajectory prediction. Our approach achieves state-of-the-art performance on several trajectory prediction benchmarks, and the lowest average prediction error over all considered benchmarks., Comment: 7 pages, 4 figures
Published: 2021

13. HGCN-GJS: Hierarchical Graph Convolutional Network with Groupwise Joint Sampling for Trajectory Prediction

Author: Chen, Yuying, Liu, Congcong, Mei, Xiaodong, Shi, Bertram E., and Liu, Ming
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Accurate pedestrian trajectory prediction is of great importance for downstream tasks such as autonomous driving and mobile robot navigation. Fully investigating the social interactions within the crowd is crucial for accurate pedestrian trajectory prediction. However, most existing methods do not capture group level interactions well, focusing only on pairwise interactions and neglecting group-wise interactions. In this work, we propose a hierarchical graph convolutional network, HGCN-GJS, for trajectory prediction which effectively leverages group level interactions within the crowd. Furthermore, we introduce a novel joint sampling scheme for modeling the joint distribution of the future trajectories of multiple pedestrians. Based on the group information, this scheme associates the trajectory of one person with the trajectory of other people in the group, but maintains the independence of the trajectories of outsiders. We demonstrate the performance of our network on several trajectory prediction datasets, achieving state-of-the-art results on all datasets considered., Comment: 6 pages, 8 figures, accepted by IROS 2022
Published: 2020

14. Multitask Emotion Recognition with Incomplete Labels

Author: Deng, Didan, Chen, Zhaokang, and Shi, Bertram E.
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Multimedia, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: We train a unified model to perform three tasks: facial action unit detection, expression classification, and valence-arousal estimation. We address two main challenges of learning the three tasks. First, most existing datasets are highly imbalanced. Second, most existing datasets do not contain labels for all three tasks. To tackle the first challenge, we apply data balancing techniques to experimental datasets. To tackle the second challenge, we propose an algorithm for the multitask model to learn from missing (incomplete) labels. This algorithm has two steps. We first train a teacher model to perform all three tasks, where each instance is trained by the ground truth label of its corresponding task. Secondly, we refer to the outputs of the teacher model as the soft labels. We use the soft labels and the ground truth to train the student model. We find that most of the student models outperform their teacher model on all the three tasks. Finally, we use model ensembling to boost performance further on the three tasks., Comment: Accepted by FG2020
Published: 2020

15. Towards High Performance Low Complexity Calibration in Appearance Based Gaze Estimation

Author: Chen, Zhaokang and Shi, Bertram E.
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Appearance-based gaze estimation from RGB images provides relatively unconstrained gaze tracking. We have previously proposed a gaze decomposition method that decomposes the gaze angle into the sum of a subject-independent gaze estimate from the image and a subject-dependent bias. This paper extends that work with a more complete characterization of the interplay between the complexity of the calibration dataset and estimation accuracy. We analyze the effect of the number of gaze targets, the number of images used per gaze target and the number of head positions in calibration data using a new NISLGaze dataset, which is well suited for analyzing these effects as it includes more diversity in head positions and orientations for each subject than other datasets. A better understanding of these factors enables low complexity high performance calibration. Our results indicate that using only a single gaze target and single head position is sufficient to achieve high quality calibration, outperforming state-of-the-art methods by more than 6.3%. One of the surprising findings is that the same estimator yields the best performance both with and without calibration. To better understand the reasons, we provide a new theoretical analysis that specifies the conditions under which this can be expected., Comment: Accepted by IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)
Published: 2020

16. Robot Navigation in Crowds by Graph Convolutional Networks with Attention Learned from Human Gaze

Author: Chen, Yuying, Liu, Congcong, Liu, Ming, and Shi, Bertram E.
Subjects: Computer Science - Robotics, Computer Science - Artificial Intelligence, Computer Science - Computer Vision and Pattern Recognition
Abstract: Safe and efficient crowd navigation for mobile robots is a crucial yet challenging task. Previous work has shown the power of deep reinforcement learning frameworks to train efficient policies. However, their performance deteriorates when the crowd size grows. We suggest that this can be addressed by enabling the network to identify and pay attention to the humans in the crowd that are most critical to navigation. We propose a novel network utilizing a graph representation to learn the policy. We first train a graph convolutional network based on human gaze data that accurately predicts human attention to different agents in the crowd. Then we incorporate the learned attention into a graph-based reinforcement learning architecture. The proposed attention mechanism enables the assignment of meaningful weightings to the neighbors of the robot, and has the additional benefit of interpretability. Experiments on real-world dense pedestrian datasets with various crowd sizes demonstrate that our model outperforms state-of-the-art methods by 18.4% in task accomplishment and by 16.4% in time efficiency., Comment: 8 pages, 7 figures
Published: 2019

17. Offset Calibration for Appearance-Based Gaze Estimation via Gaze Decomposition

Author: Chen, Zhaokang and Shi, Bertram E.
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Appearance-based gaze estimation provides relatively unconstrained gaze tracking. However, subject-independent models achieve limited accuracy partly due to individual variations. To improve estimation, we propose a novel gaze decomposition method and a single gaze point calibration method, motivated by our finding that the inter-subject squared bias exceeds the intra-subject variance for a subject-independent estimator. We decompose the gaze angle into a subject-dependent bias term and a subject-independent term between the gaze angle and the bias. The subject-independent term is estimated by a deep convolutional network. For calibration-free tracking, we set the subject-dependent bias term to zero. For single gaze point calibration, we estimate the bias from a few images taken as the subject gazes at a point. Experiments on three datasets indicate that as a calibration-free estimator, the proposed method outperforms the state-of-the-art methods by up to 10.0%. The proposed calibration method is robust and reduces estimation error significantly (up to 35.6%), achieving state-of-the-art performance for appearance-based eye trackers with calibration., Comment: Accepted by WACV2020. This is not the camera-ready version
Published: 2019
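
Note: the decomposition described in the abstract above (gaze angle = subject-independent estimate + subject-dependent bias) can be illustrated with a short sketch. It assumes gaze angles are (yaw, pitch) pairs in degrees and that the bias is estimated as the mean residual over a few images of one known gaze target; the function names and that averaging rule are illustrative assumptions, not the authors' implementation, and the deep network that produces the subject-independent estimates is replaced by plain numbers.

```python
def estimate_bias(estimates, target):
    """Estimate the subject-dependent bias from a single calibration target.

    estimates: subject-independent (yaw, pitch) estimates for a few images
               taken while the subject looks at one known point.
    target:    the true (yaw, pitch) of that point.
    Assumption: the bias is the mean residual between target and estimates.
    """
    n = len(estimates)
    if n == 0:
        raise ValueError("need at least one calibration image")
    return tuple(target[k] - sum(e[k] for e in estimates) / n for k in range(2))


def apply_bias(estimate, bias=(0.0, 0.0)):
    """Final gaze = subject-independent estimate + bias.

    Leaving the bias at its zero default corresponds to calibration-free use.
    """
    return tuple(e + b for e, b in zip(estimate, bias))


if __name__ == "__main__":
    calib = [(1.0, 2.0), (3.0, 4.0)]           # hypothetical network outputs
    bias = estimate_bias(calib, (5.0, 5.0))    # subject looked at (5, 5)
    print(bias, apply_bias((0.0, 1.0), bias))
```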

18. Gaze Training by Modulated Dropout Improves Imitation Learning

Author: Chen, Yuying, Liu, Congcong, Tai, Lei, Liu, Ming, and Shi, Bertram E.
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Robotics
Abstract: Imitation learning by behavioral cloning is a prevalent method that has achieved some success in vision-based autonomous driving. The basic idea behind behavioral cloning is to have the neural network learn from observing a human expert's behavior. Typically, a convolutional neural network learns to predict the steering commands from raw driver-view images by mimicking the behaviors of human drivers. However, there are other cues, such as gaze behavior, available from human drivers that have yet to be exploited. Previous research has shown that novice human learners can benefit from observing experts' gaze patterns. We show here that deep neural networks can also profit from this. We propose a method, gaze-modulated dropout, for integrating this gaze information into a deep driving network implicitly rather than as an additional input. Our experimental results demonstrate that gaze-modulated dropout enhances the generalization capability of the network to unseen scenes. Prediction error in steering commands is reduced by 23.5% compared to uniform dropout. Running closed loop in the simulator, the gaze-modulated dropout net increased the average distance travelled between infractions by 58.5%. Consistent with these results, the gaze-modulated dropout net shows lower model uncertainty., Comment: 6 pages, 4 figures
Published: 2019

19. Appearance-Based Gaze Estimation Using Dilated-Convolutions

Author: Chen, Zhaokang and Shi, Bertram E.
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Appearance-based gaze estimation has attracted more and more attention because of its wide range of applications. The use of deep convolutional neural networks has improved the accuracy significantly. In order to improve the estimation accuracy further, we focus on extracting better features from eye images. Relatively large changes in gaze angles may result in relatively small changes in eye appearance. We argue that current architectures for gaze estimation may not be able to capture such small changes, as they apply multiple pooling layers or other downsampling layers so that the spatial resolution of the high-level layers is reduced significantly. To evaluate whether the use of features extracted at high resolution can benefit gaze estimation, we adopt dilated-convolutions to extract high-level features without reducing spatial resolution. In cross-subject experiments on the Columbia Gaze dataset for eye contact detection and the MPIIGaze dataset for 3D gaze vector regression, the resulting Dilated-Nets achieve significant (up to 20.8%) gains when compared to similar networks without dilated-convolutions. Our proposed Dilated-Net achieves state-of-the-art results on both the Columbia Gaze and the MPIIGaze datasets., Comment: 16 pages, 7 figures. To appear in ACCV2018
Published: 2019

20. Coupled Recurrent Network (CRN)

Author: Sun, Lin, Jia, Kui, Shen, Yuejia, Savarese, Silvio, Yeung, Dit Yan, and Shi, Bertram E.
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: Many semantic video analysis tasks can benefit from multiple, heterogeneous signals. For example, in addition to the original RGB input sequences, sequences of optical flow are usually used to boost the performance of human action recognition in videos. To learn from these heterogeneous input sources, existing methods rely on two-stream architectural designs that contain independent, parallel streams of Recurrent Neural Networks (RNNs). However, two-stream RNNs do not fully exploit the reciprocal information contained in the multiple signals, let alone exploit it in a recurrent manner. To this end, we propose in this paper a novel recurrent architecture, termed Coupled Recurrent Network (CRN), to deal with multiple input sources. In CRN, the parallel streams of RNNs are coupled together. The key design of CRN is a Recurrent Interpretation Block (RIB) that supports learning of reciprocal feature representations from multiple signals in a recurrent manner. Different from RNNs which stack the training loss at each time step or the last time step, we propose an effective and efficient training strategy for CRN. Experiments show the efficacy of the proposed CRN. In particular, we achieve the new state of the art on the benchmark datasets of human action recognition and multi-person pose estimation.
Published: 2018

21. Multimodal Utterance-level Affect Analysis using Visual, Audio and Text Features

Author: Deng, Didan, Zhou, Yuqian, Pi, Jimin, and Shi, Bertram E.
Subjects: Electrical Engineering and Systems Science - Image and Video Processing, Computer Science - Computation and Language, Computer Science - Computer Vision and Pattern Recognition
Abstract: The integration of information across multiple modalities and across time is a promising way to enhance the emotion recognition performance of affective systems. Much previous work has focused on instantaneous emotion recognition. The 2018 One-Minute Gradual-Emotion Recognition (OMG-Emotion) challenge, which was held in conjunction with the IEEE World Congress on Computational Intelligence, encouraged participants to address long-term emotion recognition by integrating cues from multiple modalities, including facial expression, audio and language. Intuitively, a multi-modal inference network should be able to leverage information from each modality and their correlations to improve recognition over that achievable by a single modality network. We describe here a multi-modal neural architecture that integrates visual information over time using an LSTM, and combines it with utterance level audio and text cues to recognize human sentiment from multimodal clips. Our model outperforms the unimodal baseline, achieving concordance correlation coefficients (CCC) of 0.400 on the arousal task and 0.353 on the valence task., Comment: 5 pages, 1 figure, subject to the 2018 IJCNN challenge on One-Minute Gradual-Emotion Recognition
Published: 2018
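
Note: the concordance correlation coefficient (CCC) reported in the abstract above has a standard closed form, CCC = 2·cov(x, y) / (var(x) + var(y) + (mean(x) − mean(y))²). The following is a minimal sketch under that textbook definition with population (biased) variances; it is not the challenge's official scoring script, and the function name is hypothetical.

```python
def concordance_correlation(y_true, y_pred):
    """Concordance correlation coefficient between two equal-length sequences."""
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    # Population variances and covariance (divide by n, not n - 1).
    var_t = sum((t - mean_t) ** 2 for t in y_true) / n
    var_p = sum((p - mean_p) ** 2 for p in y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred)) / n
    denom = var_t + var_p + (mean_t - mean_p) ** 2
    if denom == 0:
        raise ValueError("CCC is undefined for identical constant inputs")
    return 2 * cov / denom


if __name__ == "__main__":
    # A constant offset lowers CCC even though the Pearson correlation is 1.
    print(concordance_correlation([1, 2, 3], [2, 3, 4]))
```

Unlike Pearson correlation, the squared mean difference in the denominator penalizes predictions that are systematically shifted from the labels.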

22. Lattice Long Short-Term Memory for Human Action Recognition

Author: Sun, Lin, Jia, Kui, Chen, Kevin, Yeung, Dit Yan, Shi, Bertram E., and Savarese, Silvio
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Human actions captured in video sequences are three-dimensional signals characterizing visual appearance and motion dynamics. To learn action patterns, existing methods adopt Convolutional and/or Recurrent Neural Networks (CNNs and RNNs). CNN based methods are effective in learning spatial appearances, but are limited in modeling long-term motion dynamics. RNNs, especially Long Short-Term Memory (LSTM), are able to learn temporal motion dynamics. However, naively applying RNNs to video sequences in a convolutional manner implicitly assumes that motions in videos are stationary across different spatial locations. This assumption is valid for short-term motions but invalid when the duration of the motion is long. In this work, we propose Lattice-LSTM (L2STM), which extends LSTM by learning independent hidden state transitions of memory cells for individual spatial locations. This method effectively enhances the ability to model dynamics across time and addresses the non-stationary issue of long-term motion dynamics without significantly increasing the model complexity. Additionally, we introduce a novel multi-modal training procedure for training our network. Unlike traditional two-stream architectures which use RGB and optical flow information as input, our two-stream model leverages both modalities to jointly train both input gates and both forget gates in the network rather than treating the two streams as separate entities with no information about the other. We apply this end-to-end system to benchmark datasets (UCF-101 and HMDB-51) of human action recognition. Experiments show that on both datasets, our proposed method outperforms all existing ones that are based on LSTM and/or CNNs of similar model complexities., Comment: ICCV2017
Published: 2017

23. Using Variable Dwell Time to Accelerate Gaze-Based Web Browsing with Two-Step Selection

Author: Chen, Zhaokang and Shi, Bertram E.
Subjects: Computer Science - Human-Computer Interaction
Abstract: In order to avoid the "Midas Touch" problem, gaze-based interfaces for selection often introduce a dwell time: a fixed amount of time the user must fixate upon an object before it is selected. Past interfaces have used a uniform dwell time across all objects. Here, we propose a gaze-based browser using a two-step selection policy with variable dwell time. In the first step, a command, e.g. "back" or "select", is chosen from a menu using a dwell time that is constant across the different commands. In the second step, if the "select" command is chosen, the user selects a hyperlink using a dwell time that varies between different hyperlinks. We assign shorter dwell times to more likely hyperlinks and longer dwell times to less likely hyperlinks. In order to infer the likelihood each hyperlink will be selected, we have developed a probabilistic model of natural gaze behavior while surfing the web. We have evaluated a number of heuristic and probabilistic methods for varying the dwell times using both simulation and experiment. Our results demonstrate that varying dwell time improves the user experience in comparison with fixed dwell time, resulting in fewer errors and increased speed. While all of the methods for varying dwell time resulted in improved performance, the probabilistic models yielded much greater gains than the simple heuristics. The best performing model reduces error rate by 50% compared to 100ms uniform dwell time while maintaining a similar response time. It reduces response time by 60% compared to 300ms uniform dwell time while maintaining a similar error rate., Comment: This is an Accepted Manuscript of an article published by Taylor & Francis in the International Journal of Human-Computer Interaction on 30 March, 2018, available online: http://www.tandfonline.com/10.1080/10447318.2018.1452351 . For an eprint of the final published article, please access: https://www.tandfonline.com/eprint/T9d4cNwwRUqXPPiZYm8Z/full . Correct Figure 14
Published: 2017
Full Text: View/download PDF

24. Active efficient coding explains the development of binocular vision and its failure in amblyopia

Author: Eckmann, Samuel, Klimmasch, Lukas, Shi, Bertram E., and Triesch, Jochen
Published: 2020

25. Developing and Assessing MATLAB Exercises for Active Concept Learning

Author: Song, S. H., Antonelli, Marco, Fung, Tony, Armstrong, Brandon D., Chong, Amy, Lo, Albert, and Shi, Bertram E.
Subjects: Computer Science - Computers and Society
Abstract: New technologies, such as MOOCs, provide innovative methods to tackle new challenges in teaching and learning, such as globalization and changing contemporary culture and to remove the limits of conventional classrooms. However, they also bring challenges in course delivery and assessment, due to factors such as less direct student-instructor interaction. These challenges are especially severe in engineering education, which relies heavily on experiential learning, such as computer simulations and laboratory exercises, to assist students in understanding concepts. As a result, effective design of experiential learning components is extremely critical for engineering MOOCs. In this paper, we will share our experience gained through developing and offering a MOOC on communication systems, with special focus on the development and assessment of MATLAB exercises for active concept learning. Our approach introduced students to concepts using learning components commonly provided by many MOOC platforms (e.g., online lectures and quizzes), and augmented the student experience with MATLAB based computer simulations and exercises to enable more concrete and detailed understanding of the material. We describe here a systematic approach to MATLAB problem design and assessment, based on our experience with the MATLAB server provided by MathWorks and integrated with the edX MOOC platform. We discuss the effectiveness of the instructional methods as evaluated through students' learning performance. We analyze the impact of the course design tools from both the instructor and the student perspective., Comment: Submitted to IEEE Transactions on Education
Published: 2016

26. An active efficient coding model of the optokinetic nystagmus

Author: Zhang, Chong, Triesch, Jochen, and Shi, Bertram E.
Subjects: Quantitative Biology - Neurons and Cognition, Computer Science - Neural and Evolutionary Computing
Abstract: Optokinetic nystagmus (OKN) is an involuntary eye movement responsible for stabilizing retinal images in the presence of relative motion between an observer and the environment. Fully understanding the development of optokinetic nystagmus requires a neurally plausible computational model that accounts for the neural development and the behavior. To date, work in this area has been limited. We propose a neurally plausible framework for the joint development of disparity and motion tuning in the visual cortex, the optokinetic and vergence eye movements. This framework models the joint emergence of both perception and behavior, and accounts for the importance of the development of normal vergence control and binocular vision in achieving normal monocular OKN (mOKN) behaviors. Because the model includes behavior, we can simulate the same perturbations as performed in past experiments, such as artificially induced strabismus. The proposed model agrees both qualitatively and quantitatively with a number of findings from the literature on both binocular vision and the optokinetic reflex. Finally, our model also makes quantitative predictions about the OKN behavior using the same methods used to characterize the OKN in the experimental literature.
Published: 2016

27. Invariant feature extraction from event based stimuli

Author: Chandrapala, Thusitha N. and Shi, Bertram E.
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: We propose a novel architecture, the event-based GASSOM, for learning and extracting invariant representations from event streams originating from neuromorphic vision sensors. The framework is inspired by feed-forward cortical models for visual processing. The model, which is based on the concepts of sparsity and temporal slowness, is able to learn feature extractors that resemble neurons in the primary visual cortex. Layers of units in the proposed model can be cascaded to learn feature extractors with different levels of complexity and selectivity. We explore the applicability of the framework on real world tasks by using the learned network for object recognition. The proposed model achieves higher classification accuracy compared to other state-of-the-art event based processing methods. Our results also demonstrate the generality and robustness of the method, as the recognizers for different data sets and different tasks all used the same set of learned feature detectors, which were trained on data collected independently of the testing data., Comment: 6 pages
Published: 2016

28. Human Action Recognition using Factorized Spatio-Temporal Convolutional Networks

Author: Sun, Lin, Jia, Kui, Yeung, Dit-Yan, and Shi, Bertram E.
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Human actions in video sequences are three-dimensional (3D) spatio-temporal signals characterizing both the visual appearance and motion dynamics of the involved humans and objects. Inspired by the success of convolutional neural networks (CNN) for image classification, recent attempts have been made to learn 3D CNNs for recognizing human actions in videos. However, partly due to the high complexity of training 3D convolution kernels and the need for large quantities of training videos, only limited success has been reported. This has triggered us to investigate in this paper a new deep architecture which can handle 3D signals more effectively. Specifically, we propose factorized spatio-temporal convolutional networks (FstCN) that factorize the original 3D convolution kernel learning as a sequential process of learning 2D spatial kernels in the lower layers (called spatial convolutional layers), followed by learning 1D temporal kernels in the upper layers (called temporal convolutional layers). We introduce a novel transformation and permutation operator to make factorization in FstCN possible. Moreover, to address the issue of sequence alignment, we propose an effective training and inference strategy based on sampling multiple video clips from a given action video sequence. We have tested FstCN on two commonly used benchmark datasets (UCF-101 and HMDB-51). Without using auxiliary training videos to boost the performance, FstCN outperforms existing CNN based methods and achieves comparable performance with a recent method that benefits from using auxiliary training videos.
Published: 2015

29. Developing and Assessing MATLAB Exercises for Active Concept Learning

Author: Song, S. H., Antonelli, Marco, Fung, Tony W. K., Armstrong, Brandon D., Chong, Amy, Lo, Albert, and Shi, Bertram E.
Abstract: Contribution: A systematic approach to MATLAB problem design and automated assessment is described, based on the experience working with the MATLAB server provided by MathWorks and integrated with the edX massive online open class (MOOC) platform. Background: New technologies, such as MOOCs, provide innovative methods to tackle new challenges in teaching and learning. However, they also bring challenges in course delivery and assessment, due to factors such as less direct student-instructor interaction. These challenges are especially severe in engineering education, which relies heavily on experiential learning, such as laboratory exercises and computer simulations, to assist students in understanding concepts. As a result, effective design of experiential learning components is extremely critical for engineering MOOCs. Intended Outcomes: This paper shares the experience gained through developing and offering an MOOC on communication systems, with special focus on the development and the automated assessment of MATLAB exercises for active concept learning. Application Design: The proposed approach introduced students to concepts by using learning components commonly provided by many MOOC platforms (e.g., online lectures and quizzes), and augmented the student experience with MATLAB-based computer simulations and exercises to enable more concrete and detailed understanding of the material. Findings: The effectiveness of the instructional methods was supported by evaluation of students' learning performance.
Published: 2019
Full Text: View/download PDF

30. Intrinsically Motivated Learning of Visual Motion Perception and Smooth Pursuit

Author: Zhang, Chong, Zhao, Yu, Triesch, Jochen, and Shi, Bertram E.
Subjects: Computer Science - Computer Vision and Pattern Recognition, Quantitative Biology - Neurons and Cognition
Abstract: We extend the framework of efficient coding, which has been used to model the development of sensory processing in isolation, to model the development of the perception/action cycle. Our extension combines sparse coding and reinforcement learning so that sensory processing and behavior co-develop to optimize a shared intrinsic motivational signal: the fidelity of the neural encoding of the sensory input under resource constraints. Applying this framework to a model system consisting of an active eye behaving in a time varying environment, we find that this generic principle leads to the simultaneous development of both smooth pursuit behavior and model neurons whose properties are similar to those of primary visual cortical neurons selective for different directions of visual motion. We suggest that this general principle may form the basis for a unified and integrated explanation of many perception/action loops., Comment: 6 pages, 5 figures
Published: 2014

31. Learning Hierarchical Integration of Foveal and Peripheral Vision for Vergence Control by Active Efficient Coding

Author: Zhao, Zhetuo, Triesch, Jochen, and Shi, Bertram E.
Editor: Manoonpong, Poramate, Larsen, Jørgen Christian, Xiong, Xiaofeng, Hallam, John, and Triesch, Jochen
Series Editor: Hutchison, David, Kanade, Takeo, Kittler, Josef, Kleinberg, Jon M., Mattern, Friedemann, Mitchell, John C., Naor, Moni, Pandu Rangan, C., Steffen, Bernhard, Terzopoulos, Demetri, Tygar, Doug, and Weikum, Gerhard
Published: 2018
Full Text: View/download PDF

32. An Active Efficient Coding Model of Binocular Vision Development Under Normal and Abnormal Rearing Conditions

Author: Klimmasch, Lukas, Schneider, Johann, Lelais, Alexander, Shi, Bertram E., and Triesch, Jochen
Editor: Manoonpong, Poramate, Larsen, Jørgen Christian, Xiong, Xiaofeng, Hallam, John, and Triesch, Jochen
Series Editor: Hutchison, David, Kanade, Takeo, Kittler, Josef, Kleinberg, Jon M., Mattern, Friedemann, Mitchell, John C., Naor, Moni, Pandu Rangan, C., Steffen, Bernhard, Terzopoulos, Demetri, Tygar, Doug, and Weikum, Gerhard
Published: 2018
Full Text: View/download PDF

33. Eye-Hand Coordination Develops from Active Multimodal Compression

Author: López, Francisco M., primary, Shi, Bertram E., additional, and Triesch, Jochen, additional
Published: 2023
Full Text: View/download PDF

34. Saccade Amplitude Statistics are Explained by Cortical Magnification

Author: Raabe, Marcel C., primary, López, Francisco M., additional, Yu, Zhengyang, additional, Caplan, Spencer, additional, Yu, Chen, additional, Shi, Bertram E., additional, and Triesch, Jochen, additional
Published: 2023
Full Text: View/download PDF

35. Appearance-Based Gaze Estimation Using Dilated-Convolutions

Author: Chen, Zhaokang, primary and Shi, Bertram E., additional
Published: 2019
Full Text: View/download PDF

36. Learning Hierarchical Integration of Foveal and Peripheral Vision for Vergence Control by Active Efficient Coding

Author: Zhao, Zhetuo, primary, Triesch, Jochen, additional, and Shi, Bertram E., additional
Published: 2018
Full Text: View/download PDF

37. An Active Efficient Coding Model of Binocular Vision Development Under Normal and Abnormal Rearing Conditions

Author: Klimmasch, Lukas, primary, Schneider, Johann, additional, Lelais, Alexander, additional, Shi, Bertram E., additional, and Triesch, Jochen, additional
Published: 2018
Full Text: View/download PDF

38. Towards High Performance Low Complexity Calibration in Appearance Based Gaze Estimation

Author: Chen, Zhaokang, primary and Shi, Bertram E., additional
Published: 2023
Full Text: View/download PDF

39. The Nonlinear Memristive Grid

Author: Jiang, Feijun and Shi, Bertram E.
Editor: Zarándy, Ákos
Published: 2011
Full Text: View/download PDF

40. HGCN-GJS: Hierarchical Graph Convolutional Network with Groupwise Joint Sampling for Trajectory Prediction

Author: Chen, Yuying, primary, Liu, Congcong, additional, Mei, Xiaodong, additional, Shi, Bertram E., additional, and Liu, Ming, additional
Published: 2022
Full Text: View/download PDF

41. A Two Stage Energy Model Exhibiting Selectivity to Changing Disparity

Author: Guo, Xiaojiang and Shi, Bertram E.
Editor: Hutchison, David, Kanade, Takeo, Kittler, Josef, Kleinberg, Jon M., Mattern, Friedemann, Mitchell, John C., Naor, Moni, Nierstrasz, Oscar, Pandu Rangan, C., Steffen, Bernhard, Sudan, Madhu, Terzopoulos, Demetri, Tygar, Doug, Vardi, Moshe Y., Weikum, Gerhard, Sun, Fuchun, Zhang, Jianwei, Tan, Ying, Cao, Jinde, and Yu, Wen
Published: 2008
Full Text: View/download PDF

42. Learning torsional eye movements through active efficient coding

Author: Zhu, Qingpeng, primary, Zhang, Chong, additional, Triesch, Jochen, additional, and Shi, Bertram E., additional
Published: 2022
Full Text: View/download PDF

43. Estimating Multiple Emotion Descriptors by Separating Description and Inference

Author: Deng, Didan, primary and Shi, Bertram E., additional
Published: 2022
Full Text: View/download PDF

44. Combining texture and stereo disparity cues for real-time face detection

Author: Jiang, Feijun, Fischer, Mika, Ekenel, Hazım Kemal, and Shi, Bertram E.
Published: 2013
Full Text: View/download PDF

45. Gaze-controlled Robot-assisted Painting in Virtual Reality for Upper-limb Rehabilitation

Author: Zhang, Yawen, primary, Wang, Haofei, additional, and Shi, Bertram E., additional
Published: 2021
Full Text: View/download PDF

46. A Multimodal Direct Gaze Interface for Wheelchairs and Teleoperated Robots

Author: Poy, Isamu, primary, Wu, Liang, additional, and Shi, Bertram E., additional
Published: 2021
Full Text: View/download PDF

47. Iterative Distillation for Better Uncertainty Estimates in Multitask Emotion Recognition

Author: Deng, Didan, primary, Wu, Liang, additional, and Shi, Bertram E., additional
Published: 2021
Full Text: View/download PDF

48. Integrating contrast invariance into a model for cortical orientation map formation

Author: Zhao, Laura Y. and Shi, Bertram E.
Published: 2009
Full Text: View/download PDF

49. Neural population models for perception of motion in depth

Author: Peng, Qiuyan and Shi, Bertram E.
Published: 2014
Full Text: View/download PDF

50. AVGCN: Trajectory Prediction using Graph Convolutional Networks Guided by Human Attention

Author: Liu, Congcong, primary, Chen, Yuying, additional, Liu, Ming, additional, and Shi, Bertram E., additional
Published: 2021
Full Text: View/download PDF
