Author: "Bhattacharya, Uttaran" / Publication Type: Reports - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Bhattacharya, Uttaran"' showing total 27 results

Start Over Author "Bhattacharya, Uttaran" Publication Type Reports

27 results on '"Bhattacharya, Uttaran"'

1. Speech2UnifiedExpressions: Synchronous Synthesis of Co-Speech Affective Face and Body Expressions from Affordable Inputs

Author: Bhattacharya, Uttaran, Bera, Aniket, and Manocha, Dinesh
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: We present a multimodal learning-based method to simultaneously synthesize co-speech facial expressions and upper-body gestures for digital characters using RGB video data captured using commodity cameras. Our approach learns from sparse face landmarks and upper-body joints, estimated directly from video data, to generate plausible emotive character motions. Given a speech audio waveform and a token sequence of the speaker's face landmark motion and body-joint motion computed from a video, our method synthesizes the motion sequences for the speaker's face landmarks and body joints to match the content and the affect of the speech. We design a generator consisting of a set of encoders to transform all the inputs into a multimodal embedding space capturing their correlations, followed by a pair of decoders to synthesize the desired face and pose motions. To enhance the plausibility of synthesis, we use an adversarial discriminator that learns to differentiate between the face and pose motions computed from the original videos and our synthesized motions based on their affective expressions. To evaluate our approach, we extend the TED Gesture Dataset to include view-normalized, co-speech face landmarks in addition to body gestures. We demonstrate the performance of our method through thorough quantitative and qualitative experiments on multiple evaluation metrics and via a user study. We observe that our method results in low reconstruction error and produces synthesized samples with diverse facial expressions and body gestures for digital characters., Comment: 14 pages, 7 figures, 2 tables
Published: 2024

2. Evaluation and Continual Improvement for an Enterprise AI Assistant

Author: Maharaj, Akash V., Qian, Kun, Bhattacharya, Uttaran, Fang, Sally, Galatanu, Horia, Garg, Manas, Hanessian, Rachel, Kapoor, Nishant, Russell, Ken, Vaithyanathan, Shivakumar, and Li, Yunyao
Subjects: Computer Science - Human-Computer Interaction
Abstract: The development of conversational AI assistants is an iterative process with multiple components. As such, the evaluation and continual improvement of these assistants is a complex and multifaceted problem. This paper introduces the challenges in evaluating and improving a generative AI assistant for enterprises, which is under active development, and how we address these challenges. We also share preliminary results and discuss lessons learned., Comment: Accepted to DaSH Workshop at NAACL 2024
Published: 2024

3. HanDiffuser: Text-to-Image Generation With Realistic Hand Appearances

Author: Narasimhaswamy, Supreeth, Bhattacharya, Uttaran, Chen, Xiang, Dasgupta, Ishita, Mitra, Saayan, and Hoai, Minh
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: Text-to-image generative models can generate high-quality humans, but realism is lost when generating hands. Common artifacts include irregular hand poses, shapes, incorrect numbers of fingers, and physically implausible finger orientations. To generate images with realistic hands, we propose a novel diffusion-based architecture called HanDiffuser that achieves realism by injecting hand embeddings in the generative process. HanDiffuser consists of two components: a Text-to-Hand-Params diffusion model to generate SMPL-Body and MANO-Hand parameters from input text prompts, and a Text-Guided Hand-Params-to-Image diffusion model to synthesize images by conditioning on the prompts and hand parameters generated by the previous component. We incorporate multiple aspects of hand representation, including 3D shapes and joint-level finger positions, orientations and articulations, for robust learning and reliable performance during inference. We conduct extensive quantitative and qualitative experiments and perform user studies to demonstrate the efficacy of our method in generating images with high-quality hands., Comment: Revisions: 1. Added a link to project page in the abstract, 2. Updated references and related work, 3. Fixed some grammatical errors
Published: 2024

4. VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding

Author: Wang, Yizhou, Zhang, Ruiyi, Wang, Haoliang, Bhattacharya, Uttaran, Fu, Yun, and Wu, Gang
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: Recent advancements in language-model-based video understanding have been progressing at a remarkable pace, spurred by the introduction of Large Language Models (LLMs). However, the focus of prior research has been predominantly on devising a projection layer that maps video features to tokens, an approach that is both rudimentary and inefficient. In our study, we introduce a cutting-edge framework, VaQuitA, designed to refine the synergy between video and textual information. At the data level, instead of sampling frames uniformly, we implement a sampling method guided by CLIP-score rankings, which enables a more aligned selection of frames with the given question. At the feature level, we integrate a trainable Video Perceiver alongside a Visual-Query Transformer (abbreviated as VQ-Former), which bolsters the interplay between the input question and the video features. We also discover that incorporating a simple prompt, "Please be critical", into the LLM input can substantially enhance its video comprehension capabilities. Our experimental results indicate that VaQuitA consistently sets a new benchmark for zero-shot video question-answering tasks and is adept at producing high-quality, multi-turn video dialogues with users.
Published: 2023

5. Large Content And Behavior Models To Understand, Simulate, And Optimize Content And Behavior

Author: Khandelwal, Ashmit, Agrawal, Aditya, Bhattacharyya, Aanisha, Singla, Yaman K, Singh, Somesh, Bhattacharya, Uttaran, Dasgupta, Ishita, Petrangeli, Stefano, Shah, Rajiv Ratn, Chen, Changyou, and Krishnamurthy, Balaji
Subjects: Computer Science - Computation and Language, Computer Science - Computer Vision and Pattern Recognition
Abstract: Shannon and Weaver's seminal information theory divides communication into three levels: technical, semantic, and effectiveness. While the technical level deals with the accurate reconstruction of transmitted symbols, the semantic and effectiveness levels deal with the inferred meaning and its effect on the receiver. Large Language Models (LLMs), with their wide generalizability, make some progress towards the second level. However, LLMs and other communication models are not conventionally designed for predicting and optimizing communication for desired receiver behaviors and intents. As a result, the effectiveness level remains largely untouched by modern communication systems. In this paper, we introduce the receivers' "behavior tokens," such as shares, likes, clicks, purchases, and retweets, in the LLM's training corpora to optimize content for the receivers and predict their behaviors. Other than showing similar performance to LLMs on content understanding tasks, our trained models show generalization capabilities on the behavior dimension for behavior simulation, content simulation, behavior understanding, and behavior domain adaptation. We show results on all these capabilities using a wide range of tasks on three corpora. We call these models Large Content and Behavior Models (LCBMs). Further, to spur more research on LCBMs, we release our new Content Behavior Corpus (CBC), a repository containing communicator, message, and corresponding receiver behavior (https://behavior-in-the-wild.github.io/LCBM).
Published: 2023

6. DanceAnyWay: Synthesizing Beat-Guided 3D Dances with Randomized Temporal Contrastive Learning

Author: Bhattacharya, Aneesh, Paranjape, Manas, Bhattacharya, Uttaran, and Bera, Aniket
Subjects: Computer Science - Sound, Computer Science - Graphics, Computer Science - Multimedia, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: We present DanceAnyWay, a generative learning method to synthesize beat-guided dances of 3D human characters synchronized with music. Our method learns to disentangle the dance movements at the beat frames from the dance movements at all the remaining frames by operating at two hierarchical levels. At the coarser "beat" level, it encodes the rhythm, pitch, and melody information of the input music via dedicated feature representations only at the beat frames. It leverages them to synthesize the beat poses of the target dances using a sequence-to-sequence learning framework. At the finer "repletion" level, our method encodes similar rhythm, pitch, and melody information from all the frames of the input music via dedicated feature representations. It generates the full dance sequences by combining the synthesized beat and repletion poses and enforcing plausibility through an adversarial learning framework. Our training paradigm also enforces fine-grained diversity in the synthesized dances through a randomized temporal contrastive loss, which ensures different segments of the dance sequences have different movements and avoids motion freezing or collapsing to repetitive movements. We evaluate the performance of our approach through extensive experiments on the benchmark AIST++ dataset and observe improvements of about 7%-12% in motion quality metrics and 1.5%-4% in motion diversity metrics over the current baselines, respectively. We also conducted a user study to evaluate the visual quality of our synthesized dances. We note that, on average, the samples generated by our method were about 9-48% more preferred by the participants and had a 4-27% better five-point Likert-scale score over the best available current baseline in terms of motion quality and synchronization. Our source code and project page are available at https://github.com/aneeshbhattacharya/DanceAnyWay., Comment: 11 pages, 7 figures, 3 tables. To appear as part of the proceedings of the 38th Annual AAAI Conference on Artificial Intelligence, 2024
Published: 2023

7. Show Me What I Like: Detecting User-Specific Video Highlights Using Content-Based Multi-Head Attention

Author: Bhattacharya, Uttaran, Wu, Gang, Petrangeli, Stefano, Swaminathan, Viswanathan, and Manocha, Dinesh
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: We propose a method to detect individualized highlights for users on given target videos based on their preferred highlight clips marked on previous videos they have watched. Our method explicitly leverages the contents of both the preferred clips and the target videos using pre-trained features for the objects and the human activities. We design a multi-head attention mechanism to adaptively weigh the preferred clips based on their object- and human-activity-based contents, and fuse them using these weights into a single feature representation for each user. We compute similarities between these per-user feature representations and the per-frame features computed from the desired target videos to estimate the user-specific highlight clips from the target videos. We test our method on a large-scale highlight detection dataset containing the annotated highlights of individual users. Compared to current baselines, we observe an absolute improvement of 2-4% in the mean average precision of the detected highlights. We also perform extensive ablation experiments on the number of preferred highlight clips associated with each user as well as on the object- and human-activity-based feature representations to validate that our method is indeed both content-based and user-specific., Comment: 14 pages, 5 figures, 7 tables
Published: 2022
Full Text: View/download PDF

8. HighlightMe: Detecting Highlights from Human-Centric Videos

Author: Bhattacharya, Uttaran, Wu, Gang, Petrangeli, Stefano, Swaminathan, Viswanathan, and Manocha, Dinesh
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: We present a domain- and user-preference-agnostic approach to detect highlightable excerpts from human-centric videos. Our method works on the graph-based representation of multiple observable human-centric modalities in the videos, such as poses and faces. We use an autoencoder network equipped with spatial-temporal graph convolutions to detect human activities and interactions based on these modalities. We train our network to map the activity- and interaction-based latent structural representations of the different modalities to per-frame highlight scores based on the representativeness of the frames. We use these scores to compute which frames to highlight and stitch contiguous frames to produce the excerpts. We train our network on the large-scale AVA-Kinetics action dataset and evaluate it on four benchmark video highlight datasets: DSH, TVSum, PHD2, and SumMe. We observe a 4-12% improvement in the mean average precision of matching the human-annotated highlights over state-of-the-art methods in these datasets, without requiring any user-provided preferences or dataset-specific fine-tuning., Comment: 10 pages, 5 figures, 5 tables. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021
Published: 2021

9. Speech2AffectiveGestures: Synthesizing Co-Speech Gestures with Generative Adversarial Affective Expression Learning

Author: Bhattacharya, Uttaran, Childs, Elizabeth, Rewkowski, Nicholas, and Manocha, Dinesh
Subjects: Computer Science - Multimedia, Computer Science - Machine Learning
Abstract: We present a generative adversarial network to synthesize 3D pose sequences of co-speech upper-body gestures with appropriate affective expressions. Our network consists of two components: a generator to synthesize gestures from a joint embedding space of features encoded from the input speech and the seed poses, and a discriminator to distinguish between the synthesized pose sequences and real 3D pose sequences. We leverage the Mel-frequency cepstral coefficients and the text transcript computed from the input speech in separate encoders in our generator to learn the desired sentiments and the associated affective cues. We design an affective encoder using multi-scale spatial-temporal graph convolutions to transform 3D pose sequences into latent, pose-based affective features. We use our affective encoder in both our generator, where it learns affective features from the seed poses to guide the gesture synthesis, and our discriminator, where it enforces the synthesized gestures to contain the appropriate affective expressions. We perform extensive evaluations on two benchmark datasets for gesture synthesis from the speech, the TED Gesture Dataset and the GENEA Challenge 2020 Dataset. Compared to the best baselines, we improve the mean absolute joint error by 10--33%, the mean acceleration difference by 8--58%, and the Fr\'echet Gesture Distance by 21--34%. We also conduct a user study and observe that compared to the best current baselines, around 15.28% of participants indicated our synthesized gestures appear more plausible, and around 16.32% of participants felt the gestures had more appropriate affective expressions aligned with the speech., Comment: 11 pages, 4 figures, 2 tables. Proceedings of the 29th ACM International Conference on Multimedia, October 20-24, 2021, Virtual Event, China
Published: 2021
Full Text: View/download PDF

10. Text2Gestures: A Transformer-Based Network for Generating Emotive Body Gestures for Virtual Agents

Author: Bhattacharya, Uttaran, Rewkowski, Nicholas, Banerjee, Abhishek, Guhan, Pooja, Bera, Aniket, and Manocha, Dinesh
Subjects: Computer Science - Human-Computer Interaction, Computer Science - Graphics
Abstract: We present Text2Gestures, a transformer-based learning method to interactively generate emotive full-body gestures for virtual agents aligned with natural language text inputs. Our method generates emotionally expressive gestures by utilizing the relevant biomechanical features for body expressions, also known as affective features. We also consider the intended task corresponding to the text and the target virtual agents' intended gender and handedness in our generation pipeline. We train and evaluate our network on the MPI Emotional Body Expressions Database and observe that our network produces state-of-the-art performance in generating gestures for virtual agents aligned with the text for narration or conversation. Our network can generate these gestures at interactive rates on a commodity GPU. We conduct a web-based user study and observe that around 91% of participants indicated our generated gestures to be at least plausible on a five-point Likert Scale. The emotions perceived by the participants from the gestures are also strongly positively correlated with the corresponding intended emotions, with a minimum Pearson coefficient of 0.77 in the valence dimension., Comment: 10 pages, 8 figures, 2 tables
Published: 2021
Full Text: View/download PDF

11. Generating Emotive Gaits for Virtual Agents Using Affect-Based Autoregression

Author: Bhattacharya, Uttaran, Rewkowski, Nicholas, Guhan, Pooja, Williams, Niall L., Mittal, Trisha, Bera, Aniket, and Manocha, Dinesh
Subjects: Computer Science - Graphics
Abstract: We present a novel autoregression network to generate virtual agents that convey various emotions through their walking styles or gaits. Given the 3D pose sequences of a gait, our network extracts pertinent movement features and affective features from the gait. We use these features to synthesize subsequent gaits such that the virtual agents can express and transition between emotions represented as combinations of happy, sad, angry, and neutral. We incorporate multiple regularizations in the training of our network to simultaneously enforce plausible movements and noticeable emotions on the virtual agents. We also integrate our approach with an AR environment using a Microsoft HoloLens and can generate emotive gaits at interactive rates to increase the social presence. We evaluate how human observers perceive both the naturalness and the emotions from the generated gaits of the virtual agents in a web-based study. Our results indicate around 89% of the users found the naturalness of the gaits satisfactory on a five-point Likert scale, and the emotions they perceived from the virtual agents are statistically similar to the intended emotions of the virtual agents. We also use our network to augment existing gait datasets with emotive gaits and will release this augmented dataset for future research in emotion prediction and emotive gait synthesis. Our project website is available at https://gamma.umd.edu/gen_emotive_gaits/., Comment: 12 pages, 9 figures, 3 tables
Published: 2020
Full Text: View/download PDF

12. Learning Unseen Emotions from Gestures via Semantically-Conditioned Zero-Shot Perception with Adversarial Autoencoders

Author: Banerjee, Abhishek, Bhattacharya, Uttaran, and Bera, Aniket
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: We present a novel generalized zero-shot algorithm to recognize perceived emotions from gestures. Our task is to map gestures to novel emotion categories not encountered in training. We introduce an adversarial, autoencoder-based representation learning that correlates 3D motion-captured gesture sequence with the vectorized representation of the natural-language perceived emotion terms using word2vec embeddings. The language-semantic embedding provides a representation of the emotion label space, and we leverage this underlying distribution to map the gesture-sequences to the appropriate categorical emotion labels. We train our method using a combination of gestures annotated with known emotion terms and gestures not annotated with any emotions. We evaluate our method on the MPI Emotional Body Expressions Database (EBEDB) and obtain an accuracy of $58.43\%$. This improves the performance of current state-of-the-art algorithms for generalized zero-shot learning by $25$--$27\%$ on the absolute.
Published: 2020

13. Emotions Don't Lie: An Audio-Visual Deepfake Detection Method Using Affective Cues

Author: Mittal, Trisha, Bhattacharya, Uttaran, Chandra, Rohan, Bera, Aniket, and Manocha, Dinesh
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning, Computer Science - Sound
Abstract: We present a learning-based method for detecting real and fake deepfake multimedia content. To maximize information for learning, we extract and analyze the similarity between the two audio and visual modalities from within the same video. Additionally, we extract and compare affective cues corresponding to perceived emotion from the two modalities within a video to infer whether the input video is "real" or "fake". We propose a deep learning network, inspired by the Siamese network architecture and the triplet loss. To validate our model, we report the AUC metric on two large-scale deepfake detection datasets, DeepFake-TIMIT Dataset and DFDC. We compare our approach with several SOTA deepfake detection methods and report per-video AUC of 84.4% on the DFDC and 96.6% on the DF-TIMIT datasets, respectively. To the best of our knowledge, ours is the first approach that simultaneously exploits audio and video modalities and also perceived emotions from the two modalities for deepfake detection., Comment: Accepted to ACMMM-2020
Published: 2020

14. EmotiCon: Context-Aware Multimodal Emotion Recognition using Frege's Principle

Author: Mittal, Trisha, Guhan, Pooja, Bhattacharya, Uttaran, Chandra, Rohan, Bera, Aniket, and Manocha, Dinesh
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Human-Computer Interaction, Electrical Engineering and Systems Science - Image and Video Processing
Abstract: We present EmotiCon, a learning-based algorithm for context-aware perceived human emotion recognition from videos and images. Motivated by Frege's Context Principle from psychology, our approach combines three interpretations of context for emotion recognition. Our first interpretation is based on using multiple modalities(e.g. faces and gaits) for emotion recognition. For the second interpretation, we gather semantic context from the input image and use a self-attention-based CNN to encode this information. Finally, we use depth maps to model the third interpretation related to socio-dynamic interactions and proximity among agents. We demonstrate the efficiency of our network through experiments on EMOTIC, a benchmark dataset. We report an Average Precision (AP) score of 35.48 across 26 classes, which is an improvement of 7-8 over prior methods. We also introduce a new dataset, GroupWalk, which is a collection of videos captured in multiple real-world settings of people walking. We report an AP of 65.83 across 4 categories on GroupWalk, which is also an improvement over prior methods.
Published: 2020

15. CMetric: A Driving Behavior Measure Using Centrality Functions

Author: Chandra, Rohan, Bhattacharya, Uttaran, Mittal, Trisha, Bera, Aniket, and Manocha, Dinesh
Subjects: Computer Science - Robotics
Abstract: We present a new measure, CMetric, to classify driver behaviors using centrality functions. Our formulation combines concepts from computational graph theory and social traffic psychology to quantify and classify the behavior of human drivers. CMetric is used to compute the probability of a vehicle executing a driving style, as well as the intensity used to execute the style. Our approach is designed for realtime autonomous driving applications, where the trajectory of each vehicle or road-agent is extracted from a video. We compute a dynamic geometric graph (DGG) based on the positions and proximity of the road-agents and centrality functions corresponding to closeness and degree. These functions are used to compute the CMetric based on style likelihood and style intensity estimates. Our approach is general and makes no assumption about traffic density, heterogeneity, or how driving behaviors change over time. We present an algorithm to compute CMetric and demonstrate its performance on real-world traffic datasets. To test the accuracy of CMetric, we introduce a new evaluation protocol (called "Time Deviation Error") that measures the difference between human prediction and the prediction made by CMetric., Comment: Accepted to IROS 2020
Published: 2020

16. The Liar's Walk: Detecting Deception with Gait and Gesture

Author: Randhavane, Tanmay, Bhattacharya, Uttaran, Kapsaskis, Kyra, Gray, Kurt, Bera, Aniket, and Manocha, Dinesh
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: We present a data-driven deep neural algorithm for detecting deceptive walking behavior using nonverbal cues like gaits and gestures. We conducted an elaborate user study, where we recorded many participants performing tasks involving deceptive walking. We extract the participants' walking gaits as series of 3D poses. We annotate various gestures performed by participants during their tasks. Based on the gait and gesture data, we train an LSTM-based deep neural network to obtain deep features. Finally, we use a combination of psychology-based gait, gesture, and deep features to detect deceptive walking with an accuracy of 88.41%. This is an improvement of 10.6% over handcrafted gait and gesture features and an improvement of 4.7% and 9.2% over classifiers based on the state-of-the-art emotion and action classification algorithms, respectively. Additionally, we present a novel dataset, DeceptiveWalk, that contains gaits and gestures with their associated deception labels. To the best of our knowledge, ours is the first algorithm to detect deceptive behavior using non-verbal cues of gait and gesture., Comment: 10 pages, 6 Figures
Published: 2019

17. Forecasting Trajectory and Behavior of Road-Agents Using Spectral Clustering in Graph-LSTMs

Author: Chandra, Rohan, Guan, Tianrui, Panuganti, Srujan, Mittal, Trisha, Bhattacharya, Uttaran, Bera, Aniket, and Manocha, Dinesh
Subjects: Computer Science - Robotics
Abstract: We present a novel approach for traffic forecasting in urban traffic scenarios using a combination of spectral graph analysis and deep learning. We predict both the low-level information (future trajectories) as well as the high-level information (road-agent behavior) from the extracted trajectory of each road-agent. Our formulation represents the proximity between the road agents using a weighted dynamic geometric graph (DGG). We use a two-stream graph-LSTM network to perform traffic forecasting using these weighted DGGs. The first stream predicts the spatial coordinates of road-agents, while the second stream predicts whether a road-agent is going to exhibit overspeeding, underspeeding, or neutral behavior by modeling spatial interactions between road-agents. Additionally, we propose a new regularization algorithm based on spectral clustering to reduce the error margin in long-term prediction (3-5 seconds) and improve the accuracy of the predicted trajectories. Moreover, we prove a theoretical upper bound on the regularized prediction error. We evaluate our approach on the Argoverse, Lyft, Apolloscape, and NGSIM datasets and highlight the benefits over prior trajectory prediction methods. In practice, our approach reduces the average prediction error by approximately 75% over prior algorithms and achieves a weighted average accuracy of 91.2% for behavior prediction. Additionally, our spectral regularization improves long-term prediction by up to 70%., Comment: Accepted to RAL/IROS 2020 as a dual journal and conference submission
Published: 2019

18. Take an Emotion Walk: Perceiving Emotions from Gaits Using Hierarchical Attention Pooling and Affective Mapping

Author: Bhattacharya, Uttaran, Roncal, Christian, Mittal, Trisha, Chandra, Rohan, Kapsaskis, Kyra, Gray, Kurt, Bera, Aniket, and Manocha, Dinesh
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: We present an autoencoder-based semi-supervised approach to classify perceived human emotions from walking styles obtained from videos or motion-captured data and represented as sequences of 3D poses. Given the motion on each joint in the pose at each time step extracted from 3D pose sequences, we hierarchically pool these joint motions in a bottom-up manner in the encoder, following the kinematic chains in the human body. We also constrain the latent embeddings of the encoder to contain the space of psychologically-motivated affective features underlying the gaits. We train the decoder to reconstruct the motions per joint per time step in a top-down manner from the latent embeddings. For the annotated data, we also train a classifier to map the latent embeddings to emotion labels. Our semi-supervised approach achieves a mean average precision of 0.84 on the Emotion-Gait benchmark dataset, which contains both labeled and unlabeled gaits collected from multiple sources. We outperform current state-of-art algorithms for both emotion recognition and action recognition from 3D gaits by 7%--23% on the absolute. More importantly, we improve the average precision by 10%--50% on the absolute on classes that each makes up less than 25% of the labeled part of the Emotion-Gait benchmark dataset., Comment: 18 pages, 5 figures, 3 tables
Published: 2019
Full Text: View/download PDF

19. M3ER: Multiplicative Multimodal Emotion Recognition Using Facial, Textual, and Speech Cues

Author: Mittal, Trisha, Bhattacharya, Uttaran, Chandra, Rohan, Bera, Aniket, and Manocha, Dinesh
Subjects: Electrical Engineering and Systems Science - Signal Processing, Computer Science - Computation and Language, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: We present M3ER, a learning-based method for emotion recognition from multiple input modalities. Our approach combines cues from multiple co-occurring modalities (such as face, text, and speech) and also is more robust than other methods to sensor noise in any of the individual modalities. M3ER models a novel, data-driven multiplicative fusion method to combine the modalities, which learn to emphasize the more reliable cues and suppress others on a per-sample basis. By introducing a check step which uses Canonical Correlational Analysis to differentiate between ineffective and effective modalities, M3ER is robust to sensor noise. M3ER also generates proxy features in place of the ineffectual modalities. We demonstrate the efficiency of our network through experimentation on two benchmark datasets, IEMOCAP and CMU-MOSEI. We report a mean accuracy of 82.7% on IEMOCAP and 89.0% on CMU-MOSEI, which, collectively, is an improvement of about 5% over prior work.
Published: 2019

20. STEP: Spatial Temporal Graph Convolutional Networks for Emotion Perception from Gaits

Author: Bhattacharya, Uttaran, Mittal, Trisha, Chandra, Rohan, Randhavane, Tanmay, Bera, Aniket, and Manocha, Dinesh
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Robotics
Abstract: We present a novel classifier network called STEP, to classify perceived human emotion from gaits, based on a Spatial Temporal Graph Convolutional Network (ST-GCN) architecture. Given an RGB video of an individual walking, our formulation implicitly exploits the gait features to classify the emotional state of the human into one of four emotions: happy, sad, angry, or neutral. We use hundreds of annotated real-world gait videos and augment them with thousands of annotated synthetic gaits generated using a novel generative network called STEP-Gen, built on an ST-GCN based Conditional Variational Autoencoder (CVAE). We incorporate a novel push-pull regularization loss in the CVAE formulation of STEP-Gen to generate realistic gaits and improve the classification accuracy of STEP. We also release a novel dataset (E-Gait), which consists of $2,177$ human gaits annotated with perceived emotions along with thousands of synthetic gaits. In practice, STEP can learn the affective features and exhibits classification accuracy of 89% on E-Gait, which is 14 - 30% more accurate over prior methods., Comment: 11 pages, 7 figures, 1 table
Published: 2019
Full Text: View/download PDF

21. GraphRQI: Classifying Driver Behaviors Using Graph Spectrums

Author: Chandra, Rohan, Bhattacharya, Uttaran, Mittal, Trisha, Li, Xiaoyu, Bera, Aniket, and Manocha, Dinesh
Subjects: Computer Science - Robotics
Abstract: We present a novel algorithm (GraphRQI) to identify driver behaviors from road-agent trajectories. Our approach assumes that the road-agents exhibit a range of driving traits, such as aggressive or conservative driving. Moreover, these traits affect the trajectories of nearby road-agents as well as the interactions between road-agents. We represent these inter-agent interactions using unweighted and undirected traffic graphs. Our algorithm classifies the driver behavior using a supervised learning algorithm by reducing the computation to the spectral analysis of the traffic graph. Moreover, we present a novel eigenvalue algorithm to compute the spectrum efficiently. We provide theoretical guarantees for the running time complexity of our eigenvalue algorithm and show that it is faster than previous methods by 2 times. We evaluate the classification accuracy of our approach on traffic videos and autonomous driving datasets corresponding to urban traffic. In practice, GraphRQI achieves an accuracy improvement of up to 25% over prior driver behavior classification algorithms. We also use our classification algorithm to predict the future trajectories of road-agents., Comment: Final Pre-print. Accepted at ICRA 2020
Published: 2019

22. RobustTP: End-to-End Trajectory Prediction for Heterogeneous Road-Agents in Dense Traffic with Noisy Sensor Inputs

Author: Chandra, Rohan, Bhattacharya, Uttaran, Roncal, Christian, Bera, Aniket, and Manocha, Dinesh
Subjects: Computer Science - Robotics
Abstract: We present RobustTP, an end-to-end algorithm for predicting future trajectories of road-agents in dense traffic with noisy sensor input trajectories obtained from RGB cameras (either static or moving) through a tracking algorithm. In this case, we consider noise as the deviation from the ground truth trajectory. The amount of noise depends on the accuracy of the tracking algorithm. Our approach is designed for dense heterogeneous traffic, where the road agents corresponding to a mixture of buses, cars, scooters, bicycles, or pedestrians. RobustTP is an approach that first computes trajectories using a combination of a non-linear motion model and a deep learning-based instance segmentation algorithm. Next, these noisy trajectories are trained using an LSTM-CNN neural network architecture that models the interactions between road-agents in dense and heterogeneous traffic. Our trajectory prediction algorithm outperforms state-of-the-art methods for end-to-end trajectory prediction using sensor inputs. We achieve an improvement of upto 18% in average displacement error and an improvement ofup to 35.5% in final displacement error at the end of the prediction window (5 seconds) over the next best method. All experiments were set up on an Nvidia TiTan Xp GPU. Additionally, we release a software framework, TrackNPred. The framework consists of implementations of state-of-the-art tracking and trajectory prediction methods and tools to benchmark and evaluate them on real-world dense traffic datasets.
Published: 2019

23. RoadTrack: Realtime Tracking of Road Agents in Dense and Heterogeneous Environments

Author: Chandra, Rohan, Bhattacharya, Uttaran, Randhavane, Tanmay, Bera, Aniket, and Manocha, Dinesh
Subjects: Computer Science - Robotics
Abstract: We present a realtime tracking algorithm, RoadTrack, to track heterogeneous road-agents in dense traffic videos. Our approach is designed for traffic scenarios that consist of different road-agents such as pedestrians, two-wheelers, cars, buses, etc. sharing the road. We use the tracking-by-detection approach where we track a road-agent by matching the appearance or bounding box region in the current frame with the predicted bounding box region propagated from the previous frame. RoadTrack uses a novel motion model called the Simultaneous Collision Avoidance and Interaction (SimCAI) model to predict the motion of road-agents by modeling collision avoidance and interactions between the road-agents for the next frame. We demonstrate the advantage of RoadTrack on a dataset of dense traffic videos and observe an accuracy of 75.8% on this dataset, outperforming prior state-of-the-art tracking algorithms by at least 5.2%. RoadTrack operates in realtime at approximately 30 fps and is at least 4 times faster than prior tracking algorithms on standard tracking datasets., Comment: Final pre-print. Accepted at ICRA 2020
Published: 2019

24. DensePeds: Pedestrian Tracking in Dense Crowds Using Front-RVO and Sparse Features

Author: Chandra, Rohan, Bhattacharya, Uttaran, Bera, Aniket, and Manocha, Dinesh
Subjects: Computer Science - Robotics
Abstract: We present a pedestrian tracking algorithm, DensePeds, that tracks individuals in highly dense crowds (greater than 2 pedestrians per square meter). Our approach is designed for videos captured from front-facing or elevated cameras. We present a new motion model called Front-RVO (FRVO) for predicting pedestrian movements in dense situations using collision avoidance constraints and combine it with state-of-the-art Mask R-CNN to compute sparse feature vectors that reduce the loss of pedestrian tracks (false negatives). We evaluate DensePeds on the standard MOT benchmarks as well as a new dense crowd dataset. In practice, our approach is 4.5 times faster than prior tracking algorithms on the MOT benchmark and we are state-of-the-art in dense crowd videos by over 2.6% on the absolute scale on average., Comment: added more refs
Published: 2019

25. Identifying Emotions from Walking using Affective and Deep Features

Author: Randhavane, Tanmay, Bhattacharya, Uttaran, Kapsaskis, Kyra, Gray, Kurt, Bera, Aniket, and Manocha, Dinesh
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Robotics
Abstract: We present a new data-driven model and algorithm to identify the perceived emotions of individuals based on their walking styles. Given an RGB video of an individual walking, we extract his/her walking gait in the form of a series of 3D poses. Our goal is to exploit the gait features to classify the emotional state of the human into one of four emotions: happy, sad, angry, or neutral. Our perceived emotion recognition approach uses deep features learned via LSTM on labeled emotion datasets. Furthermore, we combine these features with affective features computed from gaits using posture and movement cues. These features are classified using a Random Forest Classifier. We show that our mapping between the combined feature space and the perceived emotional state provides 80.07% accuracy in identifying the perceived emotions. In addition to classifying discrete categories of emotions, our algorithm also predicts the values of perceived valence and arousal from gaits. We also present an EWalk (Emotion Walk) dataset that consists of videos of walking individuals with gaits and labeled emotions. To the best of our knowledge, this is the first gait-based model to identify perceived emotions from videos of walking individuals.
Published: 2019

26. Efficient and Robust Registration on the 3D Special Euclidean Group

Author: Bhattacharya, Uttaran and Govindu, Venu Madhav
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: We present an accurate, robust and fast method for registration of 3D scans. Our motion estimation optimizes a robust cost function on the intrinsic representation of rigid motions, i.e., the Special Euclidean group $\mathbb{SE}(3)$. We exploit the geometric properties of Lie groups as well as the robustness afforded by an iteratively reweighted least squares optimization. We also generalize our approach to a joint multiview method that simultaneously solves for the registration of a set of scans. We demonstrate the efficacy of our approach by thorough experimental validation. Our approach significantly outperforms the state-of-the-art robust 3D registration method based on a line process in terms of both speed and accuracy. We also show that this line process method is a special case of our principled geometric solution. Finally, we also present scenarios where global registration based on feature correspondences fails but multiview ICP based on our robust motion estimation is successful., Comment: 18 pages, 7 figures, 7 tables
Published: 2019
Full Text: View/download PDF

27. TraPHic: Trajectory Prediction in Dense and Heterogeneous Traffic Using Weighted Interactions

Author: Chandra, Rohan, Bhattacharya, Uttaran, Bera, Aniket, and Manocha, Dinesh
Subjects: Computer Science - Robotics
Abstract: We present a new algorithm for predicting the near-term trajectories of road-agents in dense traffic videos. Our approach is designed for heterogeneous traffic, where the road-agents may correspond to buses, cars, scooters, bicycles, or pedestrians. We model the interactions between different road-agents using a novel LSTM-CNN hybrid network for trajectory prediction. In particular, we take into account heterogeneous interactions that implicitly accounts for the varying shapes, dynamics, and behaviors of different road agents. In addition, we model horizon-based interactions which are used to implicitly model the driving behavior of each road-agent. We evaluate the performance of our prediction algorithm, TraPHic, on the standard datasets and also introduce a new dense, heterogeneous traffic dataset corresponding to urban Asian videos and agent trajectories. We outperform state-of-the-art methods on dense traffic datasets by 30%., Comment: 10 pages, 5 figures, 3 tables
Published: 2018
Full Text: View/download PDF

Catalog

Books, media, physical & digital resources

See catalog results

Searchworks

Select search scope, currently: Articles

Catalog

books, media & more in Jio Institute collections

Articles

journal articles & other e-resources

Refine your results

27 results on '"Bhattacharya, Uttaran"'

1. Speech2UnifiedExpressions: Synchronous Synthesis of Co-Speech Affective Face and Body Expressions from Affordable Inputs

2. Evaluation and Continual Improvement for an Enterprise AI Assistant

3. HanDiffuser: Text-to-Image Generation With Realistic Hand Appearances

4. VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding

5. Large Content And Behavior Models To Understand, Simulate, And Optimize Content And Behavior

6. DanceAnyWay: Synthesizing Beat-Guided 3D Dances with Randomized Temporal Contrastive Learning

7. Show Me What I Like: Detecting User-Specific Video Highlights Using Content-Based Multi-Head Attention

8. HighlightMe: Detecting Highlights from Human-Centric Videos

9. Speech2AffectiveGestures: Synthesizing Co-Speech Gestures with Generative Adversarial Affective Expression Learning

10. Text2Gestures: A Transformer-Based Network for Generating Emotive Body Gestures for Virtual Agents

11. Generating Emotive Gaits for Virtual Agents Using Affect-Based Autoregression

12. Learning Unseen Emotions from Gestures via Semantically-Conditioned Zero-Shot Perception with Adversarial Autoencoders

13. Emotions Don't Lie: An Audio-Visual Deepfake Detection Method Using Affective Cues

14. EmotiCon: Context-Aware Multimodal Emotion Recognition using Frege's Principle

15. CMetric: A Driving Behavior Measure Using Centrality Functions

16. The Liar's Walk: Detecting Deception with Gait and Gesture

17. Forecasting Trajectory and Behavior of Road-Agents Using Spectral Clustering in Graph-LSTMs

18. Take an Emotion Walk: Perceiving Emotions from Gaits Using Hierarchical Attention Pooling and Affective Mapping

19. M3ER: Multiplicative Multimodal Emotion Recognition Using Facial, Textual, and Speech Cues

20. STEP: Spatial Temporal Graph Convolutional Networks for Emotion Perception from Gaits

21. GraphRQI: Classifying Driver Behaviors Using Graph Spectrums

22. RobustTP: End-to-End Trajectory Prediction for Heterogeneous Road-Agents in Dense Traffic with Noisy Sensor Inputs

23. RoadTrack: Realtime Tracking of Road Agents in Dense and Heterogeneous Environments

24. DensePeds: Pedestrian Tracking in Dense Crowds Using Front-RVO and Sparse Features

25. Identifying Emotions from Walking using Affective and Deep Features

26. Efficient and Robust Registration on the 3D Special Euclidean Group

27. TraPHic: Trajectory Prediction in Dense and Heterogeneous Traffic Using Weighted Interactions

Catalog

Searchworks

Select search scope, currently: Articles Catalog books, media & more in Jio Institute collections Articles journal articles & other e-resources

Search

Search Constraints

Refine your results

Search Limiters

Topic

Publication Year Range

Publication Type

Database

27 results on '"Bhattacharya, Uttaran"'

Search Results

Catalog

Select search scope, currently: Articles

Catalog

books, media & more in Jio Institute collections

Articles

journal articles & other e-resources