Author: "Shi, Bertram E." / Publication Type: Electronic Resources - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Shi, Bertram E."' showing total 404 results

1. Hallucinations in Neural Automatic Speech Recognition: Identifying Errors and Hallucinatory Models

Author: Frieske, Rita, and Shi, Bertram E.
Abstract: Hallucinations are a type of output error produced by deep neural networks. While they have been studied in natural language processing, they have not previously been researched in automatic speech recognition. Here, we define hallucinations in ASR as transcriptions generated by a model that are semantically unrelated to the source utterance, yet still fluent and coherent. The similarity of hallucinations to probable natural language outputs of the model creates a danger of deception and impacts the credibility of the system. We show that commonly used metrics, such as word error rates, cannot differentiate between hallucinatory and non-hallucinatory models. To address this, we propose a perturbation-based method for assessing the susceptibility of an automatic speech recognition (ASR) model to hallucination at test time, which does not require access to the training dataset. We demonstrate that this method helps to distinguish between hallucinatory and non-hallucinatory models that have similar baseline word error rates. We further explore the relationship between the types of ASR errors and the types of dataset noise to determine what types of noise are most likely to create hallucinatory outputs. We devise a framework for identifying hallucinations by analysing their semantic connection with the ground truth and their fluency. Finally, we discover how to induce hallucinations with a random noise injection to the utterance.
Published: 2024

2. Reducing Stress and Anxiety in the Metaverse: A Systematic Review of Meditation, Mindfulness and Virtual Reality

Author: Wang, Xian, Mo, Xiaoyu, Fan, Mingming, Lee, Lik-Hang, Shi, Bertram E., and Hui, Pan
Abstract: Meditation, or mindfulness, is widely used to improve mental health. With the emergence of Virtual Reality technology, many studies have provided evidence that meditation with VR can bring health benefits. However, to our knowledge, there are no guidelines and comprehensive reviews in the literature on how to conduct such research in virtual reality. In order to understand the role of VR technology in meditation and future research opportunities, we conducted a systematic literature review in the IEEE and ACM databases. Our process yielded 19 eligible papers, on which we conducted a structured analysis. From these papers, we characterize the state of the art in meditation type, design considerations, and VR and technology, and identify research opportunities and challenges for the future.
Published: 2022

3. CI-AVSR: A Cantonese Audio-Visual Speech Dataset for In-car Command Recognition

Author: Dai, Wenliang, Cahyawijaya, Samuel, Yu, Tiezheng, Barezi, Elham J., Xu, Peng, Yiu, Cheuk Tung Shadow, Frieske, Rita, Lovenia, Holy, Winata, Genta Indra, Chen, Qifeng, Ma, Xiaojuan, Shi, Bertram E., and Fung, Pascale
Abstract: With the rise of deep learning and intelligent vehicles, the smart assistant has become an essential in-car component to facilitate driving and provide extra functionalities. In-car smart assistants should be able to process general as well as car-related commands and perform corresponding actions, which eases driving and improves safety. However, there is a data scarcity issue for low resource languages, hindering the development of research and applications. In this paper, we introduce a new dataset, Cantonese In-car Audio-Visual Speech Recognition (CI-AVSR), for in-car command recognition in the Cantonese language with both video and audio data. It consists of 4,984 samples (8.3 hours) of 200 in-car commands recorded by 30 native Cantonese speakers. Furthermore, we augment our dataset using common in-car background noises to simulate real environments, producing a dataset 10 times larger than the collected one. We provide detailed statistics of both the clean and the augmented versions of our dataset. Moreover, we implement two multimodal baselines to demonstrate the validity of CI-AVSR. Experiment results show that leveraging the visual signal improves the overall performance of the model. Although our best model can achieve a considerable quality on the clean test set, the speech recognition quality on the noisy data is still inferior and remains an extremely challenging task for real in-car speech recognition systems. The dataset and code will be released at https://github.com/HLTCHKUST/CI-AVSR.
Comment: 6 pages
Published: 2022

4. Automatic Speech Recognition Datasets in Cantonese: A Survey and New Dataset

Author: Yu, Tiezheng, Frieske, Rita, Xu, Peng, Cahyawijaya, Samuel, Yiu, Cheuk Tung Shadow, Lovenia, Holy, Dai, Wenliang, Barezi, Elham J., Chen, Qifeng, Ma, Xiaojuan, Shi, Bertram E., and Fung, Pascale
Abstract: Automatic speech recognition (ASR) on low resource languages improves the access of linguistic minorities to technological advantages provided by artificial intelligence (AI). In this paper, we address the problem of data scarcity for the Hong Kong Cantonese language by creating a new Cantonese dataset. Our dataset, Multi-Domain Cantonese Corpus (MDCC), consists of 73.6 hours of clean read speech paired with transcripts, collected from Cantonese audiobooks from Hong Kong. It comprises philosophy, politics, education, culture, lifestyle and family domains, covering a wide range of topics. We also review all existing Cantonese datasets and analyze them according to their speech type, data source, total size and availability. We further conduct experiments with Fairseq S2T Transformer, a state-of-the-art ASR model, on the biggest existing dataset, Common Voice zh-HK, and our proposed MDCC, and the results show the effectiveness of our dataset. In addition, we create a powerful and robust Cantonese ASR model by applying multi-dataset learning on MDCC and Common Voice zh-HK.
Published: 2022

5. ASCEND: A Spontaneous Chinese-English Dataset for Code-switching in Multi-turn Conversation

Author: Lovenia, Holy, Cahyawijaya, Samuel, Winata, Genta Indra, Xu, Peng, Yan, Xu, Liu, Zihan, Frieske, Rita, Yu, Tiezheng, Dai, Wenliang, Barezi, Elham J., Chen, Qifeng, Ma, Xiaojuan, Shi, Bertram E., and Fung, Pascale
Abstract: Code-switching is a speech phenomenon occurring when a speaker switches language during a conversation. Despite the spontaneous nature of code-switching in conversational spoken language, most existing works collect code-switching data from read speech instead of spontaneous speech. ASCEND (A Spontaneous Chinese-English Dataset) is a high-quality Mandarin Chinese-English code-switching corpus built on spontaneous multi-turn conversational dialogue sources collected in Hong Kong. We report ASCEND's design and procedure for collecting the speech data, including annotations. ASCEND consists of 10.62 hours of clean speech, collected from 23 bilingual speakers of Chinese and English. Furthermore, we conduct baseline experiments using pre-trained wav2vec 2.0 models, achieving a best performance of 22.69% character error rate and 27.05% mixed error rate.
Published: 2021

6. Iterative Distillation for Better Uncertainty Estimates in Multitask Emotion Recognition

Author: Deng, Didan, Wu, Liang, and Shi, Bertram E.
Abstract: When recognizing emotions, subtle nuances in displays of emotion generate ambiguity or uncertainty in emotion perception. Emotion uncertainty has been previously interpreted as inter-rater disagreement among multiple annotators. In this paper, we consider a more common and challenging scenario: modeling emotion uncertainty when only single emotion labels are available. From a Bayesian perspective, we propose to use deep ensembles to capture uncertainty for multiple emotion descriptors, i.e., action units, discrete expression labels and continuous descriptors. We further apply iterative self-distillation. Iterative distillation over multiple generations significantly improves performance in both emotion recognition and uncertainty estimation. Our method generates single student models that provide accurate estimates of uncertainty for in-domain samples and a student ensemble that can detect out-of-domain samples. Our experiments on emotion recognition and uncertainty estimation using the Aff-wild2 dataset demonstrate that our algorithm gives more reliable uncertainty estimates than both Temperature Scaling and Monte Carlo Dropout.
Comment: Accepted as a Workshop paper in ICCV2021 proceeding
Published: 2021

7. Learning Hierarchical Integration of Foveal and Peripheral Vision for Vergence Control by Active Efficient Coding

Author: Zhao, Zhetuo, Triesch, Jochen, and Shi, Bertram E.
Abstract: The active efficient coding (AEC) framework parsimoniously explains the joint development of visual processing and eye movements, e.g., the emergence of binocular disparity selective neurons and fusional vergence, the disjunctive eye movements that align left and right eye images. Vergence can be driven by information in both the fovea and periphery, which play complementary roles. The high resolution fovea can drive precise short range movements. The lower resolution periphery supports coarser long range movements. The fovea and periphery may also contain conflicting information, e.g. due to objects at different depths. While past AEC models did integrate peripheral and foveal information, they did not explicitly take into account these characteristics. We propose here a two-level hierarchical approach that does. The bottom level generates different vergence actions from foveal and peripheral regions. The top level selects one. We demonstrate that the hierarchical approach performs better than prior approaches in realistic environments, exhibiting better alignment and less oscillation.
Published: 2021

8. Self-Calibrating Active Binocular Vision via Active Efficient Coding with Deep Autoencoders

Author: Wilmot, Charles, Shi, Bertram E., and Triesch, Jochen
Abstract: We present a model of the self-calibration of active binocular vision comprising the simultaneous learning of visual representations, vergence, and pursuit eye movements. The model follows the principle of Active Efficient Coding (AEC), a recent extension of the classic Efficient Coding Hypothesis to active perception. In contrast to previous AEC models, the present model uses deep autoencoders to learn sensory representations. We also propose a new formulation of the intrinsic motivation signal that guides the learning of behavior. We demonstrate the performance of the model in simulations.
Published: 2021

9. AVGCN: Trajectory Prediction using Graph Convolutional Networks Guided by Human Attention

Author: Liu, Congcong, Chen, Yuying, Liu, Ming, and Shi, Bertram E.
Abstract: Pedestrian trajectory prediction is a critical yet challenging task, especially for crowded scenes. We suggest that introducing an attention mechanism to infer the importance of different neighbors is critical for accurate trajectory prediction in scenes with varying crowd size. In this work, we propose a novel method, AVGCN, for trajectory prediction utilizing graph convolutional networks (GCN) based on human attention (A denotes attention, V denotes visual field constraints). First, we train an attention network that estimates the importance of neighboring pedestrians, using gaze data collected as subjects perform a bird's eye view crowd navigation task. Then, we incorporate the learned attention weights modulated by constraints on the pedestrian's visual field into a trajectory prediction network that uses a GCN to aggregate information from neighbors efficiently. AVGCN also considers the stochastic nature of pedestrian trajectories by taking advantage of variational trajectory prediction. Our approach achieves state-of-the-art performance on several trajectory prediction benchmarks, and the lowest average prediction error over all considered benchmarks.
Comment: 7 pages, 4 figures
Published: 2021

10. The development of active binocular vision under normal and alternate rearing conditions

Author: Klimmasch, Lukas, Schneider, Johann, Lelais, Alexander, Shi, Bertram E., and Triesch, Jochen
Abstract: The development of binocular vision is an active learning process comprising the development of disparity tuned neurons in visual cortex and the establishment of precise vergence control of the eyes. We present a computational model for the learning and self-calibration of active binocular vision based on the Active Efficient Coding framework, an extension of classic efficient coding ideas to active perception. Under normal rearing conditions, the model develops disparity tuned neurons and precise vergence control, allowing it to correctly interpret random dot stereograms. Under altered rearing conditions modeled after neurophysiological experiments, the model qualitatively reproduces key experimental findings on changes in binocularity and disparity tuning. Furthermore, the model makes testable predictions regarding how altered rearing conditions impede the learning of precise vergence control. Finally, the model predicts a surprising new effect that impaired vergence control affects the statistics of orientation tuning in visual cortical neurons.
Published: 2020

11. Active efficient coding explains the development of binocular vision and its failure in amblyopia

Author: Eckmann, Samuel, Klimmasch, Lukas, Shi, Bertram E., and Triesch, Jochen
Abstract: The development of vision during the first months of life is an active process that comprises the learning of appropriate neural representations and the learning of accurate eye movements. While it has long been suspected that the two learning processes are coupled, there is still no widely accepted theoretical framework describing this joint development. Here we propose a computational model of the development of active binocular vision to fill this gap. The model is based on a new formulation of the Active Efficient Coding theory, which proposes that eye movements, as well as stimulus encoding, are jointly adapted to maximize the overall coding efficiency. Under healthy conditions, the model self-calibrates to perform accurate vergence and accommodation eye movements. It exploits disparity cues to deduce the direction of defocus, which leads to co-ordinated vergence and accommodation responses. In a simulated anisometropic case, where the refraction power of the two eyes differs, an amblyopia-like state develops, in which the foveal region of one eye is suppressed due to inputs from the other eye. After correcting for refractive errors, the model can only reach healthy performance levels if receptive fields are still plastic, in line with findings on a critical period for binocular vision development. Overall, our model offers a unifying conceptual framework for understanding the development of binocular vision.
Published: 2020

12. Active efficient coding explains the development of binocular vision and its failure in amblyopia

Author: Eckmann, Samuel, Klimmasch, Lukas, Shi, Bertram E., and Triesch, Jochen
Abstract: The development of vision during the first months of life is an active process that comprises the learning of appropriate neural representations and the learning of accurate eye movements. While it has long been suspected that the two learning processes are coupled, there is still no widely accepted theoretical framework describing this joint development. Here, we propose a computational model of the development of active binocular vision to fill this gap. The model is based on a formulation of the active efficient coding theory, which proposes that eye movements as well as stimulus encoding are jointly adapted to maximize the overall coding efficiency. Under healthy conditions, the model self-calibrates to perform accurate vergence and accommodation eye movements. It exploits disparity cues to deduce the direction of defocus, which leads to coordinated vergence and accommodation responses. In a simulated anisometropic case, where the refraction power of the two eyes differs, an amblyopia-like state develops in which the foveal region of one eye is suppressed due to inputs from the other eye. After correcting for refractive errors, the model can only reach healthy performance levels if receptive fields are still plastic, in line with findings on a critical period for binocular vision development. Overall, our model offers a unifying conceptual framework for understanding the development of binocular vision.
Published: 2020

13. Multitask Emotion Recognition with Incomplete Labels

Author: Deng, Didan, Chen, Zhaokang, and Shi, Bertram E.
Abstract: We train a unified model to perform three tasks: facial action unit detection, expression classification, and valence-arousal estimation. We address two main challenges of learning the three tasks. First, most existing datasets are highly imbalanced. Second, most existing datasets do not contain labels for all three tasks. To tackle the first challenge, we apply data balancing techniques to experimental datasets. To tackle the second challenge, we propose an algorithm for the multitask model to learn from missing (incomplete) labels. This algorithm has two steps. We first train a teacher model to perform all three tasks, where each instance is trained by the ground truth label of its corresponding task. Secondly, we refer to the outputs of the teacher model as the soft labels. We use the soft labels and the ground truth to train the student model. We find that most of the student models outperform their teacher model on all the three tasks. Finally, we use model ensembling to boost performance further on the three tasks.
Comment: Accepted by FG2020
Published: 2020

14. Towards High Performance Low Complexity Calibration in Appearance Based Gaze Estimation

Author: Chen, Zhaokang, and Shi, Bertram E.
Abstract: Appearance-based gaze estimation from RGB images provides relatively unconstrained gaze tracking. We have previously proposed a gaze decomposition method that decomposes the gaze angle into the sum of a subject-independent gaze estimate from the image and a subject-dependent bias. This paper extends that work with a more complete characterization of the interplay between the complexity of the calibration dataset and estimation accuracy. We analyze the effect of the number of gaze targets, the number of images used per gaze target and the number of head positions in calibration data using a new NISLGaze dataset, which is well suited for analyzing these effects as it includes more diversity in head positions and orientations for each subject than other datasets. A better understanding of these factors enables low complexity high performance calibration. Our results indicate that using only a single gaze target and single head position is sufficient to achieve high quality calibration, outperforming state-of-the-art methods by more than 6.3%. One of the surprising findings is that the same estimator yields the best performance both with and without calibration. To better understand the reasons, we provide a new theoretical analysis that specifies the conditions under which this can be expected.
Comment: Accepted by IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)
Published: 2020

15. HGCN-GJS: Hierarchical Graph Convolutional Network with Groupwise Joint Sampling for Trajectory Prediction

Author: Chen, Yuying, Liu, Congcong, Mei, Xiaodong, Shi, Bertram E., and Liu, Ming
Abstract: Accurate pedestrian trajectory prediction is of great importance for downstream tasks such as autonomous driving and mobile robot navigation. Fully investigating the social interactions within the crowd is crucial for accurate pedestrian trajectory prediction. However, most existing methods do not capture group level interactions well, focusing only on pairwise interactions and neglecting group-wise interactions. In this work, we propose a hierarchical graph convolutional network, HGCN-GJS, for trajectory prediction that effectively leverages group level interactions within the crowd. Furthermore, we introduce a novel joint sampling scheme for modeling the joint distribution of multiple pedestrians in the future trajectories. Based on the group information, this scheme associates the trajectory of one person with the trajectory of other people in the group, but maintains the independence of the trajectories of outsiders. We demonstrate the performance of our network on several trajectory prediction datasets, achieving state-of-the-art results on all datasets considered.
Comment: 6 pages, 8 figures, accepted by IROS 2022
Published: 2020

16. Autonomous development of active binocular and motion vision through active efficient coding

Author: Röhrbein, Florian, Lelais, Alexander, Mahn, Jonas, Narayan, Vikram, Zhang, Chong, Shi, Bertram E., and Triesch, Jochen
Abstract: We present a model for the autonomous and simultaneous learning of active binocular and motion vision. The model is based on the Active Efficient Coding (AEC) framework, a recent generalization of classic efficient coding theories to active perception. The model learns how to efficiently encode the incoming visual signals generated by an object moving in 3-D through sparse coding. Simultaneously, it learns how to produce eye movements that further improve the efficiency of the sensory coding. This learning is driven by an intrinsic motivation to maximize the system's coding efficiency. We test our approach on the humanoid robot iCub using simulations. The model demonstrates self-calibration of accurate object fixation and tracking of moving objects. Our results show that the model keeps improving until it hits physical constraints such as camera or motor resolution, or limits on its internal coding capacity. Furthermore, we show that the emerging sensory tuning properties are in line with results on disparity, motion, and motion-in-depth tuning in the visual cortex of mammals. The model suggests that vergence and tracking eye movements can be viewed as fundamentally having the same objective of maximizing the coding efficiency of the visual system and that they can be learned and calibrated jointly through AEC.
Published: 2019

17. Robot Navigation in Crowds by Graph Convolutional Networks with Attention Learned from Human Gaze

Author: Chen, Yuying, Liu, Congcong, Liu, Ming, and Shi, Bertram E.
Abstract: Safe and efficient crowd navigation for mobile robots is a crucial yet challenging task. Previous work has shown the power of deep reinforcement learning frameworks to train efficient policies. However, their performance deteriorates when the crowd size grows. We suggest that this can be addressed by enabling the network to identify and pay attention to the humans in the crowd that are most critical to navigation. We propose a novel network utilizing a graph representation to learn the policy. We first train a graph convolutional network based on human gaze data that accurately predicts human attention to different agents in the crowd. Then we incorporate the learned attention into a graph-based reinforcement learning architecture. The proposed attention mechanism enables the assignment of meaningful weightings to the neighbors of the robot, and has the additional benefit of interpretability. Experiments on real-world dense pedestrian datasets with various crowd sizes demonstrate that our model outperforms state-of-the-art methods by 18.4% in task accomplishment and by 16.4% in time efficiency.
Comment: 8 pages, 7 figures
Published: 2019

18. Offset Calibration for Appearance-Based Gaze Estimation via Gaze Decomposition

Author: Chen, Zhaokang, and Shi, Bertram E.
Abstract: Appearance-based gaze estimation provides relatively unconstrained gaze tracking. However, subject-independent models achieve limited accuracy partly due to individual variations. To improve estimation, we propose a novel gaze decomposition method and a single gaze point calibration method, motivated by our finding that the inter-subject squared bias exceeds the intra-subject variance for a subject-independent estimator. We decompose the gaze angle into a subject-dependent bias term and a subject-independent term between the gaze angle and the bias. The subject-independent term is estimated by a deep convolutional network. For calibration-free tracking, we set the subject-dependent bias term to zero. For single gaze point calibration, we estimate the bias from a few images taken as the subject gazes at a point. Experiments on three datasets indicate that as a calibration-free estimator, the proposed method outperforms the state-of-the-art methods by up to 10.0%. The proposed calibration method is robust and reduces estimation error significantly (up to 35.6%), achieving state-of-the-art performance for appearance-based eye trackers with calibration.
Comment: Accepted by WACV2020. This is not the camera-ready version
Published: 2019

19. Gaze Training by Modulated Dropout Improves Imitation Learning

Author: Chen, Yuying, Liu, Congcong, Tai, Lei, Liu, Ming, and Shi, Bertram E.
Abstract: Imitation learning by behavioral cloning is a prevalent method that has achieved some success in vision-based autonomous driving. The basic idea behind behavioral cloning is to have the neural network learn from observing a human expert's behavior. Typically, a convolutional neural network learns to predict the steering commands from raw driver-view images by mimicking the behaviors of human drivers. However, there are other cues, such as gaze behavior, available from human drivers that have yet to be exploited. Previous research has shown that novice human learners can benefit from observing experts' gaze patterns. We show here that deep neural networks can also profit from this. We propose a method, gaze-modulated dropout, for integrating this gaze information into a deep driving network implicitly rather than as an additional input. Our experimental results demonstrate that gaze-modulated dropout enhances the generalization capability of the network to unseen scenes. Prediction error in steering commands is reduced by 23.5% compared to uniform dropout. Running closed loop in the simulator, the gaze-modulated dropout net increased the average distance travelled between infractions by 58.5%. Consistent with these results, the gaze-modulated dropout net shows lower model uncertainty.
Comment: 6 pages, 4 figures
Published: 2019

20. Appearance-Based Gaze Estimation Using Dilated-Convolutions

Author: Chen, Zhaokang, and Shi, Bertram E.
Abstract: Appearance-based gaze estimation has attracted more and more attention because of its wide range of applications. The use of deep convolutional neural networks has improved the accuracy significantly. In order to improve the estimation accuracy further, we focus on extracting better features from eye images. Relatively large changes in gaze angles may result in relatively small changes in eye appearance. We argue that current architectures for gaze estimation may not be able to capture such small changes, as they apply multiple pooling layers or other downsampling layers so that the spatial resolution of the high-level layers is reduced significantly. To evaluate whether the use of features extracted at high resolution can benefit gaze estimation, we adopt dilated-convolutions to extract high-level features without reducing spatial resolution. In cross-subject experiments on the Columbia Gaze dataset for eye contact detection and the MPIIGaze dataset for 3D gaze vector regression, the resulting Dilated-Nets achieve significant (up to 20.8%) gains when compared to similar networks without dilated-convolutions. Our proposed Dilated-Net achieves state-of-the-art results on both the Columbia Gaze and the MPIIGaze datasets.
Comment: 16 pages, 7 figures. To appear in ACCV2018
Published: 2019

21. Coupled Recurrent Network (CRN)

Author: Sun, Lin, Jia, Kui, Shen, Yuejia, Savarese, Silvio, Yeung, Dit Yan, and Shi, Bertram E.
Abstract: Many semantic video analysis tasks can benefit from multiple, heterogeneous signals. For example, in addition to the original RGB input sequences, sequences of optical flow are usually used to boost the performance of human action recognition in videos. To learn from these heterogeneous input sources, existing methods rely on two-stream architectural designs that contain independent, parallel streams of Recurrent Neural Networks (RNNs). However, two-stream RNNs do not fully exploit the reciprocal information contained in the multiple signals, let alone exploit it in a recurrent manner. To this end, we propose in this paper a novel recurrent architecture, termed Coupled Recurrent Network (CRN), to deal with multiple input sources. In CRN, the parallel streams of RNNs are coupled together. The key design of CRN is a Recurrent Interpretation Block (RIB) that supports learning of reciprocal feature representations from multiple signals in a recurrent manner. Different from RNNs which stack the training loss at each time step or the last time step, we propose an effective and efficient training strategy for CRN. Experiments show the efficacy of the proposed CRN. In particular, we achieve the new state of the art on the benchmark datasets of human action recognition and multi-person pose estimation.
Published: 2018

22. Multimodal Utterance-level Affect Analysis using Visual, Audio and Text Features

Author: Deng, Didan, Zhou, Yuqian, Pi, Jimin, and Shi, Bertram E.
Abstract: The integration of information across multiple modalities and across time is a promising way to enhance the emotion recognition performance of affective systems. Much previous work has focused on instantaneous emotion recognition. The 2018 One-Minute Gradual-Emotion Recognition (OMG-Emotion) challenge, which was held in conjunction with the IEEE World Congress on Computational Intelligence, encouraged participants to address long-term emotion recognition by integrating cues from multiple modalities, including facial expression, audio and language. Intuitively, a multi-modal inference network should be able to leverage information from each modality and their correlations to improve recognition over that achievable by a single modality network. We describe here a multi-modal neural architecture that integrates visual information over time using an LSTM, and combines it with utterance level audio and text cues to recognize human sentiment from multimodal clips. Our model outperforms the unimodal baseline, achieving the concordance correlation coefficients (CCC) of 0.400 on the arousal task, and 0.353 on the valence task.
Comment: 5 pages, 1 figure, subject to the 2018 IJCNN challenge on One-Minute Gradual-Emotion Recognition
Published: 2018

23. Feedback Networks

Author: Zamir, Amir R., Wu, Te-Lin, Sun, Lin, Shen, William B., Shi, Bertram E., Malik, Jitendra, and Savarese, Silvio
Abstract: Currently, the most successful learning models in computer vision are based on learning successive representations followed by a decision layer. This is usually actualized through feedforward multilayer neural networks, e.g. ConvNets, where each layer forms one of such successive representations. However, an alternative that can achieve the same goal is a feedback based approach in which the representation is formed in an iterative manner based on a feedback received from previous iteration's output. We establish that a feedback based approach has several core advantages over feedforward: it enables making early predictions at the query time, its output naturally conforms to a hierarchical structure in the label space (e.g. a taxonomy), and it provides a new basis for Curriculum Learning. We observe that feedback develops a considerably different representation compared to feedforward counterparts, in line with the aforementioned advantages. We provide a general feedback based learning architecture, instantiated using existing RNNs, with the endpoint results on par or better than existing feedforward networks and the addition of the above advantages.
Published: 2017

24. Probabilistic Adjustment of Dwell Time for Eye Typing

Author: Pi, Jimin ECE, and Shi, Bertram E.
Abstract: Requiring a dwell time before selection is a common way to solve the 'Midas-touch problem' in gaze-based interaction. Choosing the dwell time involves a tradeoff between unintentional selection for short dwell times and slow text entry for long dwell times. We propose a probabilistic model for gaze-based selection, which adjusts the dwell time according to the probability of each letter given the past letters selected. By reformulating the entire problem of gaze-based selection probabilistically, we can integrate the probability of each character naturally, with very few prior assumptions and very few free parameters. The model automatically assigns shorter dwell times to more likely characters and longer dwell times to less likely characters. Our experimental results demonstrate that the proposed technique speeds up typing without loss in accuracy. This concept can be generalized to other dwell-based applications, leading to more efficient gaze system interaction. © 2017 IEEE.
Published: 2017

25. Joint Learning of Binocularly Driven Saccades and Vergence by Active Efficient Coding

Author: Zhu, Qingpeng ECE, Triesch, Jochen, and Shi, Bertram E.
Abstract: This paper investigates two types of eye movements: vergence and saccades. Vergence eye movements are responsible for bringing the images of the two eyes into correspondence, whereas saccades drive gaze to interesting regions in the scene. Control of both vergence and saccades develops during early infancy. To date, these two types of eye movements have been studied separately. Here, we propose a computational model of an active vision system that integrates these two types of eye movements. We hypothesize that incorporating a saccade strategy driven by bottom-up attention will benefit the development of vergence control. The integrated system is based on the active efficient coding framework, which describes the joint development of sensory-processing and eye movement control to jointly optimize the coding efficiency of the sensory system. In the integrated system, we propose a binocular saliency model to drive saccades based on learned binocular feature extractors, which simultaneously encode both depth and texture information. Saliency in our model also depends on the current fixation point. This extends prior work, which focused on monocular images and saliency measures that are independent of the current fixation. Our results show that the proposed saliency driven saccades lead to better vergence performance and faster learning in the overall system than random saccades. Faster learning is significant because it indicates that the system actively selects inputs for the most effective learning. This work suggests that saliency-driven saccades provide a scaffold for the development of vergence control during infancy.
Published: 2017

26. HOTS: A Hierarchy of Event-Based Time-Surfaces for Pattern Recognition

Author: Lagorce, Xavier, Orchard, Garrick, Galluppi, Francesco, Shi, Bertram E., and Benosman, Ryad B.
Abstract: This paper describes novel event-based spatio-temporal features called time-surfaces and how they can be used to create a hierarchical event-based pattern recognition architecture. Unlike existing hierarchical architectures for pattern recognition, the presented model relies on a time oriented approach to extract spatio-temporal features from the asynchronously acquired dynamics of a visual scene. These dynamics are acquired using biologically inspired frameless asynchronous event-driven vision sensors. Similarly to cortical structures, subsequent layers in our hierarchy extract increasingly abstract features using increasingly large spatio-temporal windows. The central concept is to use the rich temporal information provided by events to create contexts in the form of time-surfaces which represent the recent temporal activity within a local spatial neighborhood. We demonstrate that this concept can robustly be used at all stages of an event-based hierarchical model. First layer feature units operate on groups of pixels, while subsequent layer feature units operate on the output of lower level feature units. We report results on a previously published 36 class character recognition task and a four class canonical dynamic card pip task, achieving near 100 percent accuracy on each. We introduce a new seven class moving face recognition task, achieving 79 percent accuracy. © 2017 IEEE.
Published: 2017

27. Using Point Cloud Data to Improve Three Dimensional Gaze Estimation

Author: Wang, Haofei ECE, Antonelli, Marco, and Shi, Bertram E.
Abstract: This paper addresses the problem of estimating gaze location in the 3D environment using a remote eye tracker. Instead of relying only on data provided by the eye tracker, we investigate how to integrate gaze direction with the point-cloud-based representation of the scene provided by a Kinect sensor. The algorithm first combines the gaze vectors for the two eyes provided by the eye tracker into a single gaze vector emanating from a point in between the two eyes. The gaze target in the three dimensional environment is then identified by finding the point in the 3D point cloud that is closest to the gaze vector. Our experimental results demonstrate that the estimate of the gaze target location provided by this method is significantly better than that provided when considering gaze information alone. It is also better than two other methods for integrating point cloud information: (1) finding the 3D point closest to the gaze location as estimated by triangulating the gaze vectors from the two eyes, and (2) finding the 3D point with smallest average distance to the two gaze vectors considered individually. The proposed method has an average error of 1.7 cm in a workspace of 25 × 23 × 24 cm located at a distance of 60 cm from the user.
Published: 2017

28. Using Point Cloud Data to Improve Three Dimensional Gaze Estimation

Author: Wang, Haofei ECE, Antonelli, Marco, and Shi, Bertram E.
Abstract: This paper addresses the problem of estimating gaze location in the 3D environment using a remote eye tracker. Instead of relying only on data provided by the eye tracker, we investigate how to integrate gaze direction with the point-cloud-based representation of the scene provided by a Kinect sensor. The algorithm first combines the gaze vectors for the two eyes provided by the eye tracker into a single gaze vector emanating from a point in between the two eyes. The gaze target in the three dimensional environment is then identified by finding the point in the 3D point cloud that is closest to the gaze vector. Our experimental results demonstrate that the estimate of the gaze target location provided by this method is significantly better than that provided when considering gaze information alone. It is also better than two other methods for integrating point cloud information: (1) finding the 3D point closest to the gaze location as estimated by triangulating the gaze vectors from the two eyes, and (2) finding the 3D point with smallest average distance to the two gaze vectors considered individually. The proposed method has an average error of 1.7 cm in a workspace of 25 × 23 × 24 cm located at a distance of 60 cm from the user.
Published: 2017

29. Feedback Networks

Author: Zamir, Amir R., Wu, Te-Lin, Sun, Lin, Shen, William B., Shi, Bertram E., Malik, Jitendra, and Savarese, Silvio
Abstract: Currently, the most successful learning models in computer vision are based on learning successive representations followed by a decision layer. This is usually actualized through feedforward multilayer neural networks, e.g. ConvNets, where each layer forms one of such successive representations. However, an alternative that can achieve the same goal is a feedback based approach in which the representation is formed in an iterative manner based on a feedback received from previous iteration's output. We establish that a feedback based approach has several core advantages over feedforward: it enables making early predictions at the query time, its output naturally conforms to a hierarchical structure in the label space (e.g. a taxonomy), and it provides a new basis for Curriculum Learning. We observe that feedback develops a considerably different representation compared to feedforward counterparts, in line with the aforementioned advantages. We provide a general feedback based learning architecture, instantiated using existing RNNs, with the endpoint results on par or better than existing feedforward networks and the addition of the above advantages.
Published: 2017

30. Probabilistic Adjustment of Dwell Time for Eye Typing

Author: Pi, Jimin ECE, and Shi, Bertram E.
Abstract: Requiring a dwell time before selection is a common way to solve the 'Midas-touch problem' in gaze-based interaction. Choosing the dwell time involves a tradeoff between unintentional selection for short dwell times and slow text entry for long dwell times. We propose a probabilistic model for gaze-based selection, which adjusts the dwell time according to the probability of each letter given the past letters selected. By reformulating the entire problem of gaze-based selection probabilistically, we can integrate the probability of each character naturally, with very few prior assumptions and very few free parameters. The model automatically assigns shorter dwell times to more likely characters and longer dwell times to less likely characters. Our experimental results demonstrate that the proposed technique speeds up typing without loss in accuracy. This concept can be generalized to other dwell-based applications, leading to more efficient gaze system interaction. © 2017 IEEE.
Published: 2017

31. Joint Learning of Binocularly Driven Saccades and Vergence by Active Efficient Coding

Author: Zhu, Qingpeng ECE, Triesch, Jochen, and Shi, Bertram E.
Abstract: This paper investigates two types of eye movements: vergence and saccades. Vergence eye movements are responsible for bringing the images of the two eyes into correspondence, whereas saccades drive gaze to interesting regions in the scene. Control of both vergence and saccades develops during early infancy. To date, these two types of eye movements have been studied separately. Here, we propose a computational model of an active vision system that integrates these two types of eye movements. We hypothesize that incorporating a saccade strategy driven by bottom-up attention will benefit the development of vergence control. The integrated system is based on the active efficient coding framework, which describes the joint development of sensory-processing and eye movement control to jointly optimize the coding efficiency of the sensory system. In the integrated system, we propose a binocular saliency model to drive saccades based on learned binocular feature extractors, which simultaneously encode both depth and texture information. Saliency in our model also depends on the current fixation point. This extends prior work, which focused on monocular images and saliency measures that are independent of the current fixation. Our results show that the proposed saliency driven saccades lead to better vergence performance and faster learning in the overall system than random saccades. Faster learning is significant because it indicates that the system actively selects inputs for the most effective learning. This work suggests that saliency-driven saccades provide a scaffold for the development of vergence control during infancy.
Published: 2017

32. HOTS: A Hierarchy of Event-Based Time-Surfaces for Pattern Recognition

Author: Lagorce, Xavier, Orchard, Garrick, Galluppi, Francesco, Shi, Bertram E., and Benosman, Ryad B.
Abstract: This paper describes novel event-based spatio-temporal features called time-surfaces and how they can be used to create a hierarchical event-based pattern recognition architecture. Unlike existing hierarchical architectures for pattern recognition, the presented model relies on a time oriented approach to extract spatio-temporal features from the asynchronously acquired dynamics of a visual scene. These dynamics are acquired using biologically inspired frameless asynchronous event-driven vision sensors. Similarly to cortical structures, subsequent layers in our hierarchy extract increasingly abstract features using increasingly large spatio-temporal windows. The central concept is to use the rich temporal information provided by events to create contexts in the form of time-surfaces which represent the recent temporal activity within a local spatial neighborhood. We demonstrate that this concept can robustly be used at all stages of an event-based hierarchical model. First layer feature units operate on groups of pixels, while subsequent layer feature units operate on the output of lower level feature units. We report results on a previously published 36 class character recognition task and a four class canonical dynamic card pip task, achieving near 100 percent accuracy on each. We introduce a new seven class moving face recognition task, achieving 79 percent accuracy. © 2017 IEEE.
Published: 2017

33. Learning of active binocular vision in a biomechanical model of the oculomotor system

Author: Klimmasch, Lukas, Lelais, Alexander, Lichtenstein, Alexander, Shi, Bertram E., and Triesch, Jochen
Abstract: We present a model for the autonomous learning of active binocular vision using a recently developed biomechanical model of the human oculomotor system. The model is formulated in the Active Efficient Coding (AEC) framework, a recent generalization of classic efficient coding theories to active perception. The model simultaneously learns how to efficiently encode binocular images and how to generate accurate vergence eye movements that facilitate efficient encoding of the visual input. In order to resolve the redundancy problem arising from the actuation of the eyes through antagonistic muscle pairs, we consider the metabolic costs associated with eye movements. We show that the model successfully learns to trade off vergence accuracy against the associated metabolic costs, producing high fidelity vergence eye movements obeying Sherrington’s law of reciprocal innervation.
Published: 2017

34. Using Point Cloud Data to Improve Three Dimensional Gaze Estimation

Author: Wang, Haofei ECE, Antonelli, Marco, and Shi, Bertram E.
Abstract: This paper addresses the problem of estimating gaze location in the 3D environment using a remote eye tracker. Instead of relying only on data provided by the eye tracker, we investigate how to integrate gaze direction with the point-cloud-based representation of the scene provided by a Kinect sensor. The algorithm first combines the gaze vectors for the two eyes provided by the eye tracker into a single gaze vector emanating from a point in between the two eyes. The gaze target in the three dimensional environment is then identified by finding the point in the 3D point cloud that is closest to the gaze vector. Our experimental results demonstrate that the estimate of the gaze target location provided by this method is significantly better than that provided when considering gaze information alone. It is also better than two other methods for integrating point cloud information: (1) finding the 3D point closest to the gaze location as estimated by triangulating the gaze vectors from the two eyes, and (2) finding the 3D point with smallest average distance to the two gaze vectors considered individually. The proposed method has an average error of 1.7 cm in a workspace of 25 × 23 × 24 cm located at a distance of 60 cm from the user.
Published: 2017

35. Feedback Networks

Author: Zamir, Amir R., Wu, Te-Lin, Sun, Lin, Shen, William B., Shi, Bertram E., Malik, Jitendra, and Savarese, Silvio
Abstract: Currently, the most successful learning models in computer vision are based on learning successive representations followed by a decision layer. This is usually actualized through feedforward multilayer neural networks, e.g. ConvNets, where each layer forms one of such successive representations. However, an alternative that can achieve the same goal is a feedback based approach in which the representation is formed in an iterative manner based on a feedback received from previous iteration's output. We establish that a feedback based approach has several core advantages over feedforward: it enables making early predictions at the query time, its output naturally conforms to a hierarchical structure in the label space (e.g. a taxonomy), and it provides a new basis for Curriculum Learning. We observe that feedback develops a considerably different representation compared to feedforward counterparts, in line with the aforementioned advantages. We provide a general feedback based learning architecture, instantiated using existing RNNs, with the endpoint results on par or better than existing feedforward networks and the addition of the above advantages.
Published: 2017

36. Probabilistic Adjustment of Dwell Time for Eye Typing

Author: Pi, Jimin ECE, and Shi, Bertram E.
Abstract: Requiring a dwell time before selection is a common way to solve the 'Midas-touch problem' in gaze-based interaction. Choosing the dwell time involves a tradeoff between unintentional selection for short dwell times and slow text entry for long dwell times. We propose a probabilistic model for gaze-based selection, which adjusts the dwell time according to the probability of each letter given the past letters selected. By reformulating the entire problem of gaze-based selection probabilistically, we can integrate the probability of each character naturally, with very few prior assumptions and very few free parameters. The model automatically assigns shorter dwell times to more likely characters and longer dwell times to less likely characters. Our experimental results demonstrate that the proposed technique speeds up typing without loss in accuracy. This concept can be generalized to other dwell-based applications, leading to more efficient gaze system interaction. © 2017 IEEE.
Published: 2017

37. Joint Learning of Binocularly Driven Saccades and Vergence by Active Efficient Coding

Author: Zhu, Qingpeng ECE, Triesch, Jochen, and Shi, Bertram E.
Abstract: This paper investigates two types of eye movements: vergence and saccades. Vergence eye movements are responsible for bringing the images of the two eyes into correspondence, whereas saccades drive gaze to interesting regions in the scene. Control of both vergence and saccades develops during early infancy. To date, these two types of eye movements have been studied separately. Here, we propose a computational model of an active vision system that integrates these two types of eye movements. We hypothesize that incorporating a saccade strategy driven by bottom-up attention will benefit the development of vergence control. The integrated system is based on the active efficient coding framework, which describes the joint development of sensory-processing and eye movement control to jointly optimize the coding efficiency of the sensory system. In the integrated system, we propose a binocular saliency model to drive saccades based on learned binocular feature extractors, which simultaneously encode both depth and texture information. Saliency in our model also depends on the current fixation point. This extends prior work, which focused on monocular images and saliency measures that are independent of the current fixation. Our results show that the proposed saliency driven saccades lead to better vergence performance and faster learning in the overall system than random saccades. Faster learning is significant because it indicates that the system actively selects inputs for the most effective learning. This work suggests that saliency-driven saccades provide a scaffold for the development of vergence control during infancy.
Published: 2017

38. HOTS: A Hierarchy of Event-Based Time-Surfaces for Pattern Recognition

Author: Lagorce, Xavier, Orchard, Garrick, Galluppi, Francesco, Shi, Bertram E., and Benosman, Ryad B.
Abstract: This paper describes novel event-based spatio-temporal features called time-surfaces and how they can be used to create a hierarchical event-based pattern recognition architecture. Unlike existing hierarchical architectures for pattern recognition, the presented model relies on a time oriented approach to extract spatio-temporal features from the asynchronously acquired dynamics of a visual scene. These dynamics are acquired using biologically inspired frameless asynchronous event-driven vision sensors. Similarly to cortical structures, subsequent layers in our hierarchy extract increasingly abstract features using increasingly large spatio-temporal windows. The central concept is to use the rich temporal information provided by events to create contexts in the form of time-surfaces which represent the recent temporal activity within a local spatial neighborhood. We demonstrate that this concept can robustly be used at all stages of an event-based hierarchical model. First layer feature units operate on groups of pixels, while subsequent layer feature units operate on the output of lower level feature units. We report results on a previously published 36 class character recognition task and a four class canonical dynamic card pip task, achieving near 100 percent accuracy on each. We introduce a new seven class moving face recognition task, achieving 79 percent accuracy. © 2017 IEEE.
Published: 2017

39. Lattice Long Short-Term Memory for Human Action Recognition

Author: Sun, Lin, Jia, Kui, Chen, Kevin, Yeung, Dit Yan, Shi, Bertram E., and Savarese, Silvio
Abstract: Human actions captured in video sequences are three-dimensional signals characterizing visual appearance and motion dynamics. To learn action patterns, existing methods adopt Convolutional and/or Recurrent Neural Networks (CNNs and RNNs). CNN based methods are effective in learning spatial appearances, but are limited in modeling long-term motion dynamics. RNNs, especially Long Short-Term Memory (LSTM), are able to learn temporal motion dynamics. However, naively applying RNNs to video sequences in a convolutional manner implicitly assumes that motions in videos are stationary across different spatial locations. This assumption is valid for short-term motions but invalid when the duration of the motion is long. In this work, we propose Lattice-LSTM (L2STM), which extends LSTM by learning independent hidden state transitions of memory cells for individual spatial locations. This method effectively enhances the ability to model dynamics across time and addresses the non-stationary issue of long-term motion dynamics without significantly increasing the model complexity. Additionally, we introduce a novel multi-modal training procedure for training our network. Unlike traditional two-stream architectures which use RGB and optical flow information as input, our two-stream model leverages both modalities to jointly train both input gates and both forget gates in the network rather than treating the two streams as separate entities with no information about the other. We apply this end-to-end system to benchmark datasets (UCF-101 and HMDB-51) of human action recognition. Experiments show that on both datasets, our proposed method outperforms all existing ones that are based on LSTM and/or CNNs of similar model complexities. Comment: ICCV2017
Published: 2017

40. Using Variable Dwell Time to Accelerate Gaze-Based Web Browsing with Two-Step Selection

Author: Chen, Zhaokang, and Shi, Bertram E.
Abstract: In order to avoid the "Midas Touch" problem, gaze-based interfaces for selection often introduce a dwell time: a fixed amount of time the user must fixate upon an object before it is selected. Past interfaces have used a uniform dwell time across all objects. Here, we propose a gaze-based browser using a two-step selection policy with variable dwell time. In the first step, a command, e.g. "back" or "select", is chosen from a menu using a dwell time that is constant across the different commands. In the second step, if the "select" command is chosen, the user selects a hyperlink using a dwell time that varies between different hyperlinks. We assign shorter dwell times to more likely hyperlinks and longer dwell times to less likely hyperlinks. In order to infer the likelihood each hyperlink will be selected, we have developed a probabilistic model of natural gaze behavior while surfing the web. We have evaluated a number of heuristic and probabilistic methods for varying the dwell times using both simulation and experiment. Our results demonstrate that varying dwell time improves the user experience in comparison with fixed dwell time, resulting in fewer errors and increased speed. While all of the methods for varying dwell time resulted in improved performance, the probabilistic models yielded much greater gains than the simple heuristics. The best performing model reduces error rate by 50% compared to 100ms uniform dwell time while maintaining a similar response time. It reduces response time by 60% compared to 300ms uniform dwell time while maintaining a similar error rate. Comment: This is an Accepted Manuscript of an article published by Taylor & Francis in the International Journal of Human-Computer Interaction on 30 March, 2018, available online: http://www.tandfonline.com/10.1080/10447318.2018.1452351 . For an eprint of the final published article, please access: https://www.tandfonline.com/eprint/T9d4cNwwRUqXPPiZYm8Z/full . Correct Figure 14
Published: 2017
Full Text: View/download PDF

41. Exoskeletons for the Elderly: Challenges in Control Interfaces

Author: Shi, Bertram E., and So, Richard H.Y.
Published: 2016

42. Unsupervised Learning of Depth During Coordinated Head/eye Movements

Author: Antonelli, Marco, Rucci, Michele R., and Shi, Bertram E.
Abstract: Autonomous robots and humans need to create a coherent 3D representation of their peripersonal space in order to interact with nearby objects. Recent studies in visual neuroscience suggest that the small coordinated head/eye movements that humans continually perform during fixation provide useful depth information. In this work, we mimic this behavior on a humanoid robot and propose a computational model that extracts depth information without requiring the kinematic model of the robot. First, we show that, during fixational head/eye movements, proprioceptive cues and optic flow lie on a low dimensional subspace that is a function of the depth of the target. Then, we use the generative adaptive subspace self-organizing map (GASSOM) to learn these depth-dependent subspaces. The depth of the target is eventually decoded using a winner-take-all strategy. The proposed model is validated on a simulated model of the iCub robot. © 2016 IEEE.
Published: 2016

43. OpenEyeSim: A Biomechanical Model For Simulation Of Closed-Loop Visual Perception

Author: Priamikov, Alexander, Fronius, Maria, Shi, Bertram E., and Triesch, Jochen
Abstract: We introduce OpenEyeSim, a detailed three-dimensional biomechanical model of the human extraocular eye muscles including a visualization of a virtual environment. The main purpose of OpenEyeSim is to serve as a platform for developing models of the joint learning of visual representations and eye-movement control in the perception-action cycle. The architecture and dynamic muscle properties are based on measurements of the human oculomotor system. We show that our model can reproduce different types of eye movements. Additionally, our model is able to calculate metabolic costs of eye movements. It is also able to simulate different eye disorders, such as different forms of strabismus. We propose OpenEyeSim as a platform for studying many of the complexities of oculomotor control and learning during normal and abnormal visual development.
Published: 2016

44. Learning Visuomotor Transformations and End Effector Appearance by Local Visual Consistency

Author: Zhou, Tao, and Shi, Bertram E.
Abstract: We present an algorithm that enables a robot to learn the visuomotor transformation from its joint angle space to visual space. The learned transformation can accurately predict the location and shape of the image projection of the robot's end effector. This paper extends past work by approximating the end effector by a planar region, rather than a point, in 3-D space, and through its use of spatially and temporally local, rather than global, measures of image consistency. Our robotic experiments demonstrate that the proposed algorithm can learn the location and shape of the image region corresponding to the end effector, and how it deforms as the arm moves randomly in front of the camera. Our approach does not require that the end effector be identified with a specific marker. We also demonstrate that the region corresponding to the end effector can adapt to changes in the end effector shape. © 2015 IEEE.
Published: 2016

45. Invariant Feature Extraction From Event Based Stimuli

Author: Chandrapala, Thusitha N., and Shi, Bertram E.
Abstract: We propose a novel architecture, the event-based Generative Adaptive Subspace Self-Organizing Map (GASSOM), for learning and extracting invariant representations from event streams originating from neuromorphic vision sensors. The framework is inspired by feed-forward cortical models for visual processing. The model, which is based on the concepts of sparsity and temporal slowness, is able to learn feature extractors that resemble neurons in the primary visual cortex. Layers of units in the proposed model can be cascaded to learn feature extractors with different levels of complexity and selectivity. We explore the applicability of the framework on real-world tasks by using the learned network for object recognition. The proposed model achieves higher classification accuracy compared to other state-of-the-art event-based processing methods. Our results also demonstrate the generality and robustness of the method, as the recognizers for different data sets and different tasks all used the same set of learned feature detectors, which were trained on data collected independently of the testing data. © 2016 IEEE.
Published: 2016

46. Exoskeletons for the Elderly: Challenges in Control Interfaces

Author: Shi, Bertram E., and So, Richard H.Y.
Published: 2016

47. Invariant Feature Extraction From Event Based Stimuli

Author: Chandrapala, Thusitha N., and Shi, Bertram E.
Abstract: We propose a novel architecture, the event-based Generative Adaptive Subspace Self-Organizing Map (GASSOM), for learning and extracting invariant representations from event streams originating from neuromorphic vision sensors. The framework is inspired by feed-forward cortical models for visual processing. The model, which is based on the concepts of sparsity and temporal slowness, is able to learn feature extractors that resemble neurons in the primary visual cortex. Layers of units in the proposed model can be cascaded to learn feature extractors with different levels of complexity and selectivity. We explore the applicability of the framework on real-world tasks by using the learned network for object recognition. The proposed model achieves higher classification accuracy compared to other state-of-the-art event-based processing methods. Our results also demonstrate the generality and robustness of the method, as the recognizers for different data sets and different tasks all used the same set of learned feature detectors, which were trained on data collected independently of the testing data. © 2016 IEEE.
Published: 2016

48. Unsupervised Learning of Depth During Coordinated Head/eye Movements

Author: Antonelli, Marco, Rucci, Michele R., and Shi, Bertram E.
Abstract: Autonomous robots and humans need to create a coherent 3D representation of their peripersonal space in order to interact with nearby objects. Recent studies in visual neuroscience suggest that the small coordinated head/eye movements that humans continually perform during fixation provide useful depth information. In this work, we mimic such behavior on a humanoid robot and propose a computational model that extracts depth information without requiring the kinematic model of the robot. First, we show that, during fixational head/eye movements, proprioceptive cues and optic flow lie on a low-dimensional subspace that is a function of the depth of the target. Then, we use the generative adaptive subspace self-organizing map (GASSOM) to learn these depth-dependent subspaces. The depth of the target is eventually decoded using a winner-take-all strategy. The proposed model is validated on a simulated model of the iCub robot. © 2016 IEEE.
Published: 2016

49. OpenEyeSim: A Biomechanical Model For Simulation Of Closed-loop Visual Perception

Author: Priamikov, Alexander, Fronius, Maria, Shi, Bertram E., and Triesch, Jochen
Abstract: We introduce OpenEyeSim, a detailed three-dimensional biomechanical model of the human extraocular eye muscles including a visualization of a virtual environment. The main purpose of OpenEyeSim is to serve as a platform for developing models of the joint learning of visual representations and eye-movement control in the perception-action cycle. The architecture and dynamic muscle properties are based on measurements of the human oculomotor system. We show that our model can reproduce different types of eye movements. Additionally, our model is able to calculate metabolic costs of eye movements. It is also able to simulate different eye disorders, such as different forms of strabismus. We propose OpenEyeSim as a platform for studying many of the complexities of oculomotor control and learning during normal and abnormal visual development.
Published: 2016

50. Exoskeletons for the Elderly: Challenges in Control Interfaces

Author: Shi, Bertram E., and So, Richard H.Y.
Published: 2016
