Author: "Manocha, Dinesh" / Language: undetermined - Searchworks@Jio Institute Digital Library Search Results

1. Video Manipulations Beyond Faces: A Dataset with Human-Machine Analysis

Author: Mittal, Trisha, Sinha, Ritwik, Swaminathan, Viswanathan, Collomosse, John, and Manocha, Dinesh
Subjects: FOS: Computer and information sciences, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, Computer Science - Multimedia, Multimedia (cs.MM)
Abstract: As tools for content editing mature, and artificial intelligence (AI) based algorithms for synthesizing media grow, the presence of manipulated content across online media is increasing. This phenomenon causes the spread of misinformation, creating a greater need to distinguish between "real" and "manipulated" content. To this end, we present VideoSham, a dataset consisting of 826 videos (413 real and 413 manipulated). Many of the existing deepfake datasets focus exclusively on two types of facial manipulations: swapping with a different subject's face or altering the existing face. VideoSham, on the other hand, contains more diverse, context-rich, and human-centric, high-resolution videos manipulated using a combination of 6 different spatial and temporal attacks. Our analysis shows that state-of-the-art manipulation detection algorithms only work for a few specific attacks and do not scale well on VideoSham. We performed a user study on Amazon Mechanical Turk with 1200 participants to understand if they can differentiate between the real and manipulated videos in VideoSham. Finally, we dig deeper into the strengths and weaknesses of performances by humans and SOTA algorithms to identify gaps that need to be filled with better AI algorithms. We present the dataset at https://github.com/adobe-research/VideoSham-dataset.
Comment: Accepted to WACV2023 - Workshop on Manipulation, Adversarial, and Presentation Attacks in Biometrics
Published: 2023
Full Text: View/download PDF

2. Listen2Scene: Interactive material-aware binaural sound propagation for reconstructed 3D scenes

Author: Ratnarajah, Anton and Manocha, Dinesh
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Sound (cs.SD), Audio and Speech Processing (eess.AS), Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, FOS: Electrical engineering, electronic engineering, information engineering, Computer Science - Multimedia, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing, Machine Learning (cs.LG), Multimedia (cs.MM)
Abstract: We present an end-to-end binaural audio rendering approach (Listen2Scene) for virtual reality (VR) and augmented reality (AR) applications. We propose a novel neural-network-based binaural sound propagation method to generate acoustic effects for 3D models of real environments. Any clean audio or dry audio can be convolved with the generated acoustic effects to render audio corresponding to the real environment. We propose a graph neural network that uses both the material and the topology information of the 3D scenes and generates a scene latent vector. Moreover, we use a conditional generative adversarial network (CGAN) to generate acoustic effects from the scene latent vector. Our network is able to handle holes or other artifacts in the reconstructed 3D mesh model. We present an efficient cost function to the generator network to incorporate spatial audio effects. Given the source and the listener position, our learning-based binaural sound propagation approach can generate an acoustic effect in 0.1 milliseconds on an NVIDIA GeForce RTX 2080 Ti GPU and can easily handle multiple sources. We have evaluated the accuracy of our approach with binaural acoustic effects generated using an interactive geometric sound propagation algorithm and captured real acoustic effects. We also performed a perceptual evaluation and observed that the audio rendered by our approach is more plausible as compared to audio rendered using prior learning-based sound propagation algorithms.
Comment: Project page: https://anton-jeran.github.io/Listen2Scene/
Published: 2023
Full Text: View/download PDF
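
Note on record 2: the abstract states that any dry audio can be convolved with the generated acoustic effects. The following is a minimal sketch of that rendering step only, not of the authors' network. It assumes the acoustic effect is available as a pair of equal-length left/right impulse responses; the helper name `render_binaural` and the toy arrays are hypothetical.

```python
import numpy as np

def render_binaural(dry, ir_left, ir_right):
    """Convolve a mono dry signal with left/right impulse responses.

    Hypothetical helper for illustration. Both impulse responses are
    assumed to have the same length, so the result has shape
    (2, len(dry) + len(ir_left) - 1).
    """
    dry = np.asarray(dry, dtype=float)
    left = np.convolve(dry, np.asarray(ir_left, dtype=float))
    right = np.convolve(dry, np.asarray(ir_right, dtype=float))
    return np.stack([left, right])

# Toy data: a unit impulse as the dry signal reproduces each impulse response.
dry = np.array([1.0, 0.0, 0.0])
ir_left = np.array([1.0, 0.5, 0.25])
ir_right = np.array([0.8, 0.4, 0.2])
out = render_binaural(dry, ir_left, ir_right)
```

For long signals, an FFT-based convolution would usually be preferred over direct convolution; the direct form is used here only to keep the sketch short.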

3. STEERING: Stein Information Directed Exploration for Model-Based Reinforcement Learning

Author: Chakraborty, Souradip, Bedi, Amrit Singh, Koppel, Alec, Wang, Mengdi, Huang, Furong, and Manocha, Dinesh
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, Statistics - Machine Learning, Machine Learning (stat.ML), Machine Learning (cs.LG)
Abstract: Directed exploration is a crucial challenge in reinforcement learning (RL), especially when rewards are sparse. Information-directed sampling (IDS), which optimizes the information ratio, seeks to do so by augmenting regret with information gain. However, estimating information gain is computationally intractable or relies on restrictive assumptions which prohibit its use in many practical instances. In this work, we posit an alternative exploration incentive in terms of the integral probability metric (IPM) between a current estimate of the transition model and the unknown optimal, which under suitable conditions, can be computed in closed form with the kernelized Stein discrepancy (KSD). Based on KSD, we develop a novel algorithm STEERING: STEin information dirEcted exploration for model-based Reinforcement LearnING. To enable its derivation, we develop fundamentally new variants of KSD for discrete conditional distributions. We further establish that STEERING achieves sublinear Bayesian regret, improving upon prior learning rates of information-augmented MBRL, IDS included. Experimentally, we show that the proposed algorithm is computationally affordable and outperforms several prior approaches.
Published: 2023
Full Text: View/download PDF

4. VERN: Vegetation-aware Robot Navigation in Dense Unstructured Outdoor Environments

Author: Sathyamoorthy, Adarsh Jagan, Weerakoon, Kasun, Guan, Tianrui, Russell, Mason, Conover, Damon, Pusey, Jason, and Manocha, Dinesh
Subjects: FOS: Computer and information sciences, Computer Science - Robotics, Robotics (cs.RO)
Abstract: We propose a novel method for autonomous legged robot navigation in densely vegetated environments with a variety of pliable/traversable and non-pliable/untraversable vegetation. We present a novel few-shot learning classifier that can be trained on a few hundred RGB images to differentiate flora that can be navigated through, from the ones that must be circumvented. Using the vegetation classification and 2D lidar scans, our method constructs a vegetation-aware traversability cost map that accurately represents the pliable and non-pliable obstacles with lower, and higher traversability costs, respectively. Our cost map construction accounts for misclassifications of the vegetation and further lowers the risk of collisions, freezing and entrapment in vegetation during navigation. Furthermore, we propose holonomic recovery behaviors for the robot for scenarios where it freezes, or gets physically entrapped in dense, pliable vegetation. We demonstrate our method on a Boston Dynamics Spot robot in real-world unstructured environments with sparse and dense tall grass, bushes, trees, etc. We observe an increase of 25-90% in success rates, 10-90% decrease in freezing rate, and up to 65% decrease in the false positive rate compared to existing methods.
Comment: 8 Pages, 5 figures
Published: 2023
Full Text: View/download PDF

5. Beyond Exponentially Fast Mixing in Average-Reward Reinforcement Learning via Multi-Level Monte Carlo Actor-Critic

Author: Suttle, Wesley A., Bedi, Amrit Singh, Patel, Bhrij, Sadler, Brian M., Koppel, Alec, and Manocha, Dinesh
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Statistics - Machine Learning, Optimization and Control (math.OC), FOS: Mathematics, Machine Learning (stat.ML), Mathematics - Optimization and Control, Machine Learning (cs.LG)
Abstract: Many existing reinforcement learning (RL) methods employ stochastic gradient iteration on the back end, whose stability hinges upon a hypothesis that the data-generating process mixes exponentially fast with a rate parameter that appears in the step-size selection. Unfortunately, this assumption is violated for large state spaces or settings with sparse rewards, and the mixing time is unknown, making the step size inoperable. In this work, we propose an RL methodology attuned to the mixing time by employing a multi-level Monte Carlo estimator for the critic, the actor, and the average reward embedded within an actor-critic (AC) algorithm. This method, which we call Multi-level Actor-Critic (MAC), is developed especially for infinite-horizon average-reward settings and neither relies on oracle knowledge of the mixing time in its parameter selection nor assumes its exponential decay; it, therefore, is readily applicable to applications with slower mixing times. Nonetheless, it achieves a convergence rate comparable to the state-of-the-art AC algorithms. We experimentally show that these alleviated restrictions on the technical conditions required for stability translate to superior performance in practice for RL problems with sparse rewards.
Published: 2023
Full Text: View/download PDF

6. Dynamic EM Ray Tracing for Large Urban Scenes with Multiple Receivers

Author: Wang, Ruichen and Manocha, Dinesh
Subjects: Signal Processing (eess.SP), FOS: Electrical engineering, electronic engineering, information engineering, Electrical Engineering and Systems Science - Signal Processing
Abstract: Radio applications are increasingly being used in urban environments for cellular radio systems and safety applications that use vehicle-to-vehicle and vehicle-to-infrastructure communication. We present a novel ray tracing-based radio propagation algorithm that can handle large urban scenes with hundreds or thousands of dynamic objects and receivers. Our approach is based on the use of coherence-based techniques that exploit spatial and temporal coherence for efficient wireless propagation and radio network planning. Our formulation also utilizes channel coherence, which is used to determine the effectiveness of the propagation model within a certain time in dynamically generated paths, and spatial consistency, which is used to estimate the similarity and accuracy of changes in a dynamic environment with varying propagation models and blocking obstacles. We highlight the performance of our simulator in large urban traffic scenes with an area of 2 × 2 km² and more than 10,000 users and devices. We evaluate the accuracy by comparing the results with discrete model simulations performed using WinProp. In practice, our approach scales linearly with the area of the urban environment and the number of dynamic obstacles or receivers.
Comment: 7 pages, 14 figures, conference
Published: 2023
Full Text: View/download PDF

7. CoSyn: Detecting Implicit Hate Speech in Online Conversations Using a Context Synergized Hyperbolic Network

Author: Ghosh, Sreyan, Suri, Manan, Chiniya, Purva, Tyagi, Utkarsh, Kumar, Sonal, and Manocha, Dinesh
Subjects: Social and Information Networks (cs.SI), FOS: Computer and information sciences, Computer Science - Machine Learning, Computer Science - Computation and Language, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, Computer Science - Social and Information Networks, Computation and Language (cs.CL), Machine Learning (cs.LG)
Abstract: The tremendous growth of social media users interacting in online conversations has also led to significant growth in hate speech. Most of the prior works focus on detecting explicit hate speech, which is overt and leverages hateful phrases, with very little work focusing on detecting hate speech that is implicit or denotes hatred through indirect or coded language. In this paper, we present CoSyn, a user- and conversational-context synergized network for detecting implicit hate speech in online conversation trees. CoSyn first models the user's personal historical and social context using a novel hyperbolic Fourier attention mechanism and hyperbolic graph convolution network. Next, we jointly model the user's personal context and the conversational context using a novel context interaction mechanism in the hyperbolic space that clearly captures the interplay between the two and makes independent assessments on the amounts of information to be retrieved from both contexts. CoSyn performs all operations in the hyperbolic space to account for the scale-free dynamics of social media. We demonstrate the effectiveness of CoSyn both qualitatively and quantitatively on an open-source hate speech dataset with Twitter conversations and show that CoSyn outperforms all our baselines in detecting implicit hate speech with absolute improvements in the range of 8.15% - 19.50%.
Comment: Under review at IJCAI 2023
Published: 2023
Full Text: View/download PDF

8. Real-Time Decentralized Navigation of Nonholonomic Agents Using Shifted Yielding Areas

Author: He, Liang, Pan, Zherong, and Manocha, Dinesh
Subjects: FOS: Computer and information sciences, Computer Science - Robotics, Robotics (cs.RO)
Abstract: We present a lightweight, decentralized algorithm for navigating multiple nonholonomic agents through challenging environments with narrow passages. Our key idea is to allow agents to yield to each other in large open areas instead of narrow passages, to increase the success rate of conventional decentralized algorithms. At pre-processing time, our method computes a medial axis for the freespace. A reference trajectory is then computed and projected onto the medial axis for each agent. During run time, when an agent senses other agents moving in the opposite direction, our algorithm uses the medial axis to estimate a Point of Impact (POI) as well as the available area around the POI. If the area around the POI is not large enough for yielding behaviors to be successful, we shift the POI to nearby large areas by modulating the agent's reference trajectory and traveling speed. We evaluate our method on a row of 4 environments with up to 15 robots, and we find our method incurs a marginal computational overhead of 10-30 ms on average, achieving real-time performance. Afterward, our planned reference trajectories can be tracked using local navigation algorithms to achieve up to a 100% higher success rate over local navigation algorithms alone.
Published: 2023
Full Text: View/download PDF

9. ACLM: A Selective-Denoising based Generative Data Augmentation Approach for Low-Resource Complex NER

Author: Ghosh, Sreyan, Tyagi, Utkarsh, Suri, Manan, Kumar, Sonal, Ramaneswaran, S, and Manocha, Dinesh
Subjects: FOS: Computer and information sciences, Computer Science - Computation and Language, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, Computation and Language (cs.CL), Information Retrieval (cs.IR), Computer Science - Information Retrieval
Abstract: Complex Named Entity Recognition (NER) is the task of detecting linguistically complex named entities in low-context text. In this paper, we present ACLM (Attention-map aware keyword selection for Conditional Language Model fine-tuning), a novel data augmentation approach based on conditional generation to address the data scarcity problem in low-resource complex NER. ACLM alleviates the context-entity mismatch issue, a problem that existing NER data augmentation techniques suffer from and that often leads them to generate incoherent augmentations by placing complex named entities in the wrong context. ACLM builds on BART and is optimized on a novel text reconstruction or denoising task: we use selective masking (aided by attention maps) to retain the named entities and certain keywords in the input sentence that provide contextually relevant additional knowledge or hints about the named entities. Compared with other data augmentation strategies, ACLM can generate more diverse and coherent augmentations preserving the true word sense of complex entities in the sentence. We demonstrate the effectiveness of ACLM both qualitatively and quantitatively on monolingual, cross-lingual, and multilingual complex NER across various low-resource settings. ACLM outperforms all our neural baselines by a significant margin (1%-36%). In addition, we demonstrate the application of ACLM to other domains that suffer from data scarcity (e.g., biomedical). In practice, ACLM generates more effective and factual augmentations for these domains than prior methods. Code: https://github.com/Sreyan88/ACLM
Comment: ACL 2023 Main Conference
Published: 2023
Full Text: View/download PDF

10. Aerial Diffusion: Text Guided Ground-to-Aerial View Translation from a Single Image using Diffusion Models

Author: Kothandaraman, Divya, Zhou, Tianyi, Lin, Ming, and Manocha, Dinesh
Subjects: FOS: Computer and information sciences, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition
Abstract: We present a novel method, Aerial Diffusion, for generating aerial views from a single ground-view image using text guidance. Aerial Diffusion leverages a pretrained text-image diffusion model for prior knowledge. We address two main challenges corresponding to the domain gap between the ground view and the aerial view and the two views being far apart in the text-image embedding manifold. Our approach uses a homography inspired by inverse perspective mapping prior to finetuning the pretrained diffusion model. Additionally, using the text corresponding to the ground view to finetune the model helps us capture the details in the ground-view image at a relatively low bias towards the ground-view image. Aerial Diffusion uses an alternating sampling strategy to compute the optimal solution on a complex high-dimensional manifold and generate a high-fidelity (w.r.t. ground view) aerial image. We demonstrate the quality and versatility of Aerial Diffusion on a plethora of images from various domains including nature, human actions, indoor scenes, etc. We qualitatively prove the effectiveness of our method with extensive ablations and comparisons. To the best of our knowledge, Aerial Diffusion is the first approach that performs ground-to-aerial translation in an unsupervised manner.
Comment: Code: https://github.com/divyakraman/AerialDiffusion
Published: 2023
Full Text: View/download PDF

11. CrossLoc3D: Aerial-Ground Cross-Source 3D Place Recognition

Author: Guan, Tianrui, Muthuselvam, Aswath, Hoover, Montana, Wang, Xijun, Liang, Jing, Sathyamoorthy, Adarsh Jagan, Conover, Damon, and Manocha, Dinesh
Subjects: FOS: Computer and information sciences, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition
Abstract: We present CrossLoc3D, a novel 3D place recognition method that solves a large-scale point matching problem in a cross-source setting. Cross-source point cloud data corresponds to point sets captured by depth sensors with different accuracies or from different distances and perspectives. We address the challenges in terms of developing 3D place recognition methods that account for the representation gap between points captured by different sources. Our method handles cross-source data by utilizing multi-grained features and selecting convolution kernel sizes that correspond to the most prominent features. Inspired by diffusion models, our method uses a novel iterative refinement process that gradually shifts the embedding spaces from different sources to a single canonical space for better metric learning. In addition, we present CS-Campus3D, the first 3D aerial-ground cross-source dataset consisting of point cloud data from both aerial and ground LiDAR scans. The point clouds in CS-Campus3D have representation gaps and other features like different views, point densities, and noise patterns. We show that our CrossLoc3D algorithm can achieve an improvement of 4.74% - 15.37% in terms of the top 1 average recall on our CS-Campus3D benchmark and achieves performance comparable to state-of-the-art 3D place recognition methods on the Oxford RobotCar. We will release the code and CS-Campus3D benchmark.
Published: 2023
Full Text: View/download PDF

12. BEDRF: Bidirectional Edge Diffraction Response Function for Interactive Sound Propagation

Author: Cao, Chunxiao, An, Zili, Ren, Zhong, Manocha, Dinesh, and Zhou, Kun
Subjects: FOS: Computer and information sciences, Sound (cs.SD), Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: We introduce the bidirectional edge diffraction response function (BEDRF), a new approach to model wave diffraction around edges with path tracing. The diffraction part of the wave is expressed as an integration on path space, and the wave-edge interaction is expressed using only the localized information around points on the edge, similar to a bidirectional scattering distribution function (BSDF) for visual rendering. For an infinite single wedge, our model generates the same result as the analytic solution. Our approach can be easily integrated into interactive geometric sound propagation algorithms that use path tracing to compute specular and diffuse reflections. Our resulting propagation algorithm can approximate complex wave propagation phenomena involving high-order diffraction, and is able to handle dynamic, deformable objects and moving sources and listeners. We highlight the performance of our approach in different scenarios to generate smooth auralization.
Published: 2023
Full Text: View/download PDF

13. Binaural audio generation via multi-task learning

Author: Li, Sijia, Liu, Shiguang, and Manocha, Dinesh
Subjects: FOS: Computer and information sciences, Sound (cs.SD), Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, Computer Graphics and Computer-Aided Design, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: We present a learning-based approach for generating binaural audio from mono audio using multi-task learning. Our formulation leverages additional information from two related tasks: the binaural audio generation task and the flipped audio classification task. Our learning model extracts spatialization features from the visual and audio input, predicts the left and right audio channels, and judges whether the left and right channels are flipped. First, we extract visual features using ResNet from the video frames. Next, we perform binaural audio generation and flipped audio classification using separate subnetworks based on visual features. Our learning method optimizes the overall loss based on the weighted sum of the losses of the two tasks. We train and evaluate our model on the FAIR-Play dataset and the YouTube-ASMR dataset. We perform quantitative and qualitative evaluations to demonstrate the benefits of our approach over prior techniques.
Published: 2021
Full Text: View/download PDF

14. Show Me What I Like: Detecting User-Specific Video Highlights Using Content-Based Multi-Head Attention

Author: Bhattacharya, Uttaran, Wu, Gang, Petrangeli, Stefano, Swaminathan, Viswanathan, and Manocha, Dinesh
Subjects: FOS: Computer and information sciences, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition
Abstract: We propose a method to detect individualized highlights for users on given target videos based on their preferred highlight clips marked on previous videos they have watched. Our method explicitly leverages the contents of both the preferred clips and the target videos using pre-trained features for the objects and the human activities. We design a multi-head attention mechanism to adaptively weigh the preferred clips based on their object- and human-activity-based contents, and fuse them using these weights into a single feature representation for each user. We compute similarities between these per-user feature representations and the per-frame features computed from the desired target videos to estimate the user-specific highlight clips from the target videos. We test our method on a large-scale highlight detection dataset containing the annotated highlights of individual users. Compared to current baselines, we observe an absolute improvement of 2-4% in the mean average precision of the detected highlights. We also perform extensive ablation experiments on the number of preferred highlight clips associated with each user as well as on the object- and human-activity-based feature representations to validate that our method is indeed both content-based and user-specific.
Comment: 14 pages, 5 figures, 7 tables
Published: 2022
Full Text: View/download PDF
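
Note on record 14: the abstract describes computing similarities between a per-user feature representation and per-frame features of a target video. The sketch below illustrates that scoring step under stated assumptions: the abstract does not name the similarity measure, so cosine similarity is assumed here, and the function name `frame_scores` and the toy vectors are hypothetical.

```python
import numpy as np

def frame_scores(user_feature, frame_features):
    """Cosine similarity between one user vector (D,) and frames (T, D).

    Cosine similarity is an assumption made for this illustration;
    the small epsilon guards against division by zero.
    """
    u = np.asarray(user_feature, dtype=float)
    f = np.asarray(frame_features, dtype=float)
    u_unit = u / (np.linalg.norm(u) + 1e-12)
    f_unit = f / (np.linalg.norm(f, axis=1, keepdims=True) + 1e-12)
    return f_unit @ u_unit

# Toy data: three frames scored against a single user representation.
user = np.array([1.0, 0.0])
frames = np.array([[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]])
scores = frame_scores(user, frames)
best_frame = int(np.argmax(scores))  # index of the highest-scoring frame
```

Frames whose scores are highest would be the candidate highlight frames; how scores are grouped into clips is not specified in the abstract and is left out of the sketch.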

15. Predicting Loose-Fitting Garment Deformations Using Bone-Driven Motion Networks

Author: Pan, Xiaoyu, Mai, Jiaming, Jiang, Xinwei, Tang, Dongxue, Li, Jingxiang, Shao, Tianjia, Zhou, Kun, Jin, Xiaogang, and Manocha, Dinesh
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Computer Science - Graphics, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, GeneralLiterature_MISCELLANEOUS, Graphics (cs.GR), ComputingMethodologies_COMPUTERGRAPHICS, Machine Learning (cs.LG)
Abstract: We present a learning algorithm that uses bone-driven motion networks to predict the deformation of loose-fitting garment meshes at interactive rates. Given a garment, we generate a simulation database and extract virtual bones from simulated mesh sequences using skin decomposition. At runtime, we separately compute low- and high-frequency deformations in a sequential manner. The low-frequency deformations are predicted by transferring body motions to virtual bones' motions, and the high-frequency deformations are estimated leveraging the global information of virtual bones' motions and local information extracted from low-frequency meshes. In addition, our method can estimate garment deformations caused by variations of the simulation parameters (e.g., fabric's bending stiffness) using an RBF kernel ensembling trained networks for different sets of simulation parameters. Through extensive comparisons, we show that our method outperforms state-of-the-art methods in terms of prediction accuracy of mesh deformations by about 20% in RMSE and 10% in Hausdorff distance and STED. The code and data are available at https://github.com/non-void/VirtualBones.
Comment: SIGGRAPH 22 Conference Paper
Published: 2022
Full Text: View/download PDF

16. STCrowd: A Multimodal Dataset for Pedestrian Perception in Crowded Scenes

Author: Cong, Peishan, Zhu, Xinge, Qiao, Feng, Ren, Yiming, Peng, Xidong, Hou, Yuenan, Xu, Lan, Yang, Ruigang, Manocha, Dinesh, and Ma, Yuexin
Subjects: FOS: Computer and information sciences, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION
Abstract: Accurately detecting and tracking pedestrians in 3D space is challenging due to large variations in rotations, poses and scales. The situation becomes even worse for dense crowds with severe occlusions. However, existing benchmarks either only provide 2D annotations, or have limited 3D annotations with low-density pedestrian distribution, making it difficult to build a reliable pedestrian perception system especially in crowded scenes. To better evaluate pedestrian perception algorithms in crowded scenarios, we introduce a large-scale multimodal dataset, STCrowd. Specifically, in STCrowd, there are a total of 219K pedestrian instances and 20 persons per frame on average, with various levels of occlusion. We provide synchronized LiDAR point clouds and camera images as well as their corresponding 3D labels and joint IDs. STCrowd can be used for various tasks, including LiDAR-only, image-only, and sensor-fusion based pedestrian detection and tracking. We provide baselines for most of the tasks. In addition, considering the property of sparse global distribution and density-varying local distribution of pedestrians, we further propose a novel method, Density-aware Hierarchical heatmap Aggregation (DHA), to enhance pedestrian perception in crowded scenes. Extensive experiments show that our new method achieves state-of-the-art performance for pedestrian detection on various datasets.
Comment: Accepted at CVPR2022
Published: 2022
Full Text: View/download PDF

17. FAST-RIR: Fast Neural Diffuse Room Impulse Response Generator

Author: Ratnarajah, Anton, Zhang, Shi-Xiong, Yu, Meng, Tang, Zhenyu, Manocha, Dinesh, and Yu, Dong
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Sound (cs.SD), Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing, Machine Learning (cs.LG)
Abstract: We present a neural-network-based fast diffuse room impulse response generator (FAST-RIR) for generating room impulse responses (RIRs) for a given acoustic environment. Our FAST-RIR takes rectangular room dimensions, listener and speaker positions, and reverberation time as inputs and generates specular and diffuse reflections for a given acoustic environment. Our FAST-RIR is capable of generating RIRs for a given input reverberation time with an average error of 0.02s. We evaluate our generated RIRs in automatic speech recognition (ASR) applications using Google Speech API, Microsoft Speech API, and Kaldi tools. We show that our proposed FAST-RIR with batch size 1 is 400 times faster than a state-of-the-art diffuse acoustic simulator (DAS) on a CPU and gives similar performance to DAS in ASR experiments. Our FAST-RIR is 12 times faster than an existing GPU-based RIR generator (gpuRIR). We show that our FAST-RIR outperforms gpuRIR by 2.5% in an AMI far-field ASR benchmark.
Comment: Accepted to ICASSP 2022. More results and source code are available at https://anton-jeran.github.io/FRIR/
Published: 2022
Full Text: View/download PDF

18. SelfTune: Metrically Scaled Monocular Depth Estimation through Self-Supervised Learning

Author: Choi, Jaehoon, Jung, Dongki, Lee, Yonghan, Kim, Deokhwa, Manocha, Dinesh, and Lee, Donghwan
Subjects: FOS: Computer and information sciences, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition
Abstract: Monocular depth estimation in the wild inherently predicts depth up to an unknown scale. To resolve the scale ambiguity issue, we present a learning algorithm that leverages monocular simultaneous localization and mapping (SLAM) with proprioceptive sensors. Such monocular SLAM systems can provide metrically scaled camera poses. Given these metric poses and monocular sequences, we propose a self-supervised learning method for the pre-trained supervised monocular depth networks to enable metrically scaled depth estimation. Our approach is based on a teacher-student formulation which guides our network to predict high-quality depths. We demonstrate that our approach is useful for various applications such as mobile robot navigation and is applicable to diverse environments. Our full system shows improvements over recent self-supervised depth estimation and completion methods on EuRoC, OpenLORIS, and ScanNet datasets.
Published: 2022
Full Text: View/download PDF

19. M-MELD: A Multilingual Multi-Party Dataset for Emotion Recognition in Conversations

Author: Ghosh, Sreyan, Ramaneswaran, S, Tyagi, Utkarsh, Srivastava, Harshvardhan, Lepcha, Samden, Sakshi, S, and Manocha, Dinesh
Subjects: FOS: Computer and information sciences, Computer Science - Computation and Language, Computation and Language (cs.CL)
Abstract: Expression of emotions is a crucial part of daily human communication. Emotion recognition in conversations (ERC) is an emerging field of study, where the primary task is to identify the emotion behind each utterance in a conversation. Though a lot of work has been done on ERC in the past, these works focus only on ERC in the English language, thereby ignoring all other languages. In this paper, we present Multilingual MELD (M-MELD), where we extend the Multimodal EmotionLines Dataset (MELD) (Poria et al., 2018) to 4 other languages beyond English, namely Greek, Polish, French, and Spanish. Beyond just establishing strong baselines for all of these 4 languages, we also propose a novel architecture, DiscLSTM, that uses both sequential and conversational discourse context in a conversational dialogue for ERC. Our proposed approach is computationally efficient, can transfer across languages using just a cross-lingual encoder, and achieves better performance than most uni-modal text approaches in the literature on both MELD and M-MELD. We make our data and code publicly available on GitHub.
Published: 2022
Full Text: View/download PDF

20. DMCA: Dense Multi-agent Navigation using Attention and Communication

Author: Arul, Senthil Hariharan, Bedi, Amrit Singh, and Manocha, Dinesh
Subjects: FOS: Computer and information sciences, Computer Science - Robotics, Robotics (cs.RO)
Abstract: In decentralized multi-robot navigation, the agents lack the world knowledge to make safe and (near-)optimal plans reliably and base their decisions on their neighbors' observable states. We present a reinforcement learning based multi-agent navigation algorithm that performs inter-agent communications. In order to deal with the variable number of neighbors for each agent, we use a multi-head self-attention mechanism to encode neighbor information and create a fixed-length observation vector. We pose communication selection as a link prediction problem, where the network predicts whether communication is necessary given the observable information. The communicated information augments the observed neighbor information and is used to select a suitable navigation plan. We highlight the benefits of our approach by performing safe and efficient navigation among multiple robots in dense and challenging benchmarks. We also compare the performance with other learning-based methods and highlight improvements in terms of fewer collisions and time-to-goal in dense scenarios.
Published: 2022
Full Text: View/download PDF

21. A Repulsive Force Unit for Garment Collision Handling in Neural Networks

Author: Tan, Qingyang, Zhou, Yi, Wang, Tuanfeng, Ceylan, Duygu, Sun, Xin, and Manocha, Dinesh
Subjects: FOS: Computer and information sciences, Computer Science - Graphics, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, Graphics (cs.GR)
Abstract: Despite recent success, deep learning-based methods for predicting 3D garment deformation under body motion suffer from interpenetration problems between the garment and the body. To address this problem, we propose a novel collision handling neural network layer called Repulsive Force Unit (ReFU). Based on the signed distance function (SDF) of the underlying body and the current garment vertex positions, ReFU predicts the per-vertex offsets that push any interpenetrating vertex to a collision-free configuration while preserving the fine geometric details. We show that ReFU is differentiable with trainable parameters and can be integrated into different network backbones that predict 3D garment deformations. Our experiments show that ReFU significantly reduces the number of collisions between the body and the garment and better preserves geometric details compared to prior methods based on collision loss or post-processing optimization., Comment: ECCV 2022
Published: 2022
Full Text: View/download PDF

22. DC-MRTA: Decentralized Multi-Robot Task Allocation and Navigation in Complex Environments

Author: Agrawal, Aakriti, Hariharan, Senthil, Bedi, Amrit Singh, and Manocha, Dinesh
Subjects: FOS: Computer and information sciences, Computer Science - Robotics, Computer Science - Machine Learning, Computer Science - Multiagent Systems, Robotics (cs.RO), Machine Learning (cs.LG), Multiagent Systems (cs.MA)
Abstract: We present a novel reinforcement learning (RL) based task allocation and decentralized navigation algorithm for mobile robots in warehouse environments. Our approach is designed for scenarios in which multiple robots are used to perform various pick up and delivery tasks. We consider the problem of joint decentralized task allocation and navigation and present a two-level approach to solve it. At the higher level, we solve the task allocation by formulating it in terms of Markov Decision Processes and choosing the appropriate rewards to minimize the Total Travel Delay (TTD). At the lower level, we use a decentralized navigation scheme based on ORCA that enables each robot to perform these tasks in an independent manner, and avoid collisions with other robots and dynamic obstacles. We combine these lower and upper levels by defining rewards for the higher level as the feedback from the lower level navigation algorithm. We perform extensive evaluation in complex warehouse layouts with a large number of agents and highlight the benefits over state-of-the-art algorithms based on myopic pickup distance minimization and regret-based task selection. We observe an improvement of up to 14% in terms of task completion time and up to 40% in terms of computing collision-free trajectories for the robots.
Published: 2022
Full Text: View/download PDF

23. 3MASSIV: Multilingual, Multimodal and Multi-Aspect dataset of Social Media Short Videos

Author: Gupta, Vikram, Mittal, Trisha, Mathur, Puneet, Mishra, Vaibhav, Maheshwari, Mayank, Bera, Aniket, Mukherjee, Debdoot, and Manocha, Dinesh
Subjects: FOS: Computer and information sciences, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, Computer Science - Multimedia, Multimedia (cs.MM)
Abstract: We present 3MASSIV, a multilingual, multimodal and multi-aspect, expertly annotated dataset of diverse short videos extracted from the short-video social media platform Moj. 3MASSIV comprises 50K short videos (20 seconds average duration) and 100K unlabeled videos in 11 different languages and captures popular short video trends like pranks, fails, romance, and comedy, expressed via unique audio-visual formats like self-shot videos, reaction videos, lip-synching, self-sung songs, etc. 3MASSIV presents an opportunity for multimodal and multilingual semantic understanding on these unique videos by annotating them for concepts, affective states, media types, and audio language. We present a thorough analysis of 3MASSIV and highlight the variety and unique aspects of our dataset compared to other contemporary popular datasets with strong baselines. We also show how the social media content in 3MASSIV is dynamic and temporal in nature, which can be used for semantic understanding tasks and cross-lingual analysis., Comment: Accepted in CVPR 2022
Published: 2022
Full Text: View/download PDF

24. Placing Human Animations into 3D Scenes by Learning Interaction- and Geometry-Driven Keyframes

Author: Mullen Jr, James F., Kothandaraman, Divya, Bera, Aniket, and Manocha, Dinesh
Subjects: FOS: Computer and information sciences, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition
Abstract: We present a novel method for placing a 3D human animation into a 3D scene while maintaining any human-scene interactions in the animation. We use the notion of computing the most important meshes in the animation for the interaction with the scene, which we call "keyframes." These keyframes allow us to better optimize the placement of the animation into the scene such that interactions in the animations (standing, lying, sitting, etc.) match the affordances of the scene (e.g., standing on the floor or lying in a bed). We compare our method, which we call PAAK, with prior approaches, including POSA, PROX ground truth, and a motion synthesis method, and highlight the benefits of our method with a perceptual study. Human raters preferred our PAAK method over the PROX ground truth data 64.6% of the time. Additionally, in direct comparisons, the raters preferred PAAK over competing methods, including 61.5% of the time over POSA., Comment: WACV 2023. Our project website is available at https://gamma.umd.edu/paak/
Published: 2022
Full Text: View/download PDF

25. Dense Multi-Agent Navigation Using Voronoi Cells and Congestion Metric-based Replanning

Author: Arul, Senthil Hariharan and Manocha, Dinesh
Subjects: Computer Science::Robotics, FOS: Computer and information sciences, Computer Science - Robotics, Robotics (cs.RO)
Abstract: We present a decentralized path-planning algorithm for navigating multiple differential-drive robots in dense environments. In contrast to prior decentralized methods, we propose a novel congestion metric-based replanning that couples local and global planning techniques to efficiently navigate in scenarios with multiple corridors. To handle dense scenes with narrow passages, our approach computes the initial path for each agent to its assigned goal using a lattice planner. Based on neighbors' information, each agent performs online replanning using a congestion metric that tends to reduce the collisions and improves the navigation performance. Furthermore, we use the Voronoi cells of each agent to plan the local motion as well as a corridor selection strategy to limit the congestion in narrow passages. We evaluate the performance of our approach in complex warehouse-like scenes and demonstrate improved performance and efficiency over prior methods. In addition, our approach results in a higher success rate in terms of collision-free navigation to the goals.
Published: 2022
Full Text: View/download PDF

26. RTAW: An Attention Inspired Reinforcement Learning Method for Multi-Robot Task Allocation in Warehouse Environments

Author: Agrawal, Aakriti, Bedi, Amrit Singh, and Manocha, Dinesh
Subjects: FOS: Computer and information sciences, Computer Science - Robotics, Computer Science - Multiagent Systems, Robotics (cs.RO), Multiagent Systems (cs.MA)
Abstract: We present a novel reinforcement learning based algorithm for the multi-robot task allocation problem in warehouse environments. We formulate it as a Markov Decision Process and solve it via a novel deep multi-agent reinforcement learning method (called RTAW) with an attention-inspired policy architecture. Hence, our proposed policy network uses global embeddings that are independent of the number of robots/tasks. We utilize the proximal policy optimization algorithm for training and use a carefully designed reward to obtain a converged policy. The converged policy ensures cooperation among different robots to minimize total travel delay (TTD), which ultimately improves the makespan for a sufficiently large task-list. In our extensive experiments, we compare the performance of our RTAW algorithm to state-of-the-art methods such as myopic pickup distance minimization (greedy) and regret-based baselines on different navigation schemes. We show an improvement of up to 14% (25-1000 seconds) in TTD on scenarios with hundreds or thousands of tasks for different challenging warehouse layouts and task generation schemes. We also demonstrate the scalability of our approach by showing performance with up to 1000 robots in simulations.
Published: 2022
Full Text: View/download PDF

27. Dealing with Sparse Rewards in Continuous Control Robotics via Heavy-Tailed Policies

Author: Chakraborty, Souradip, Bedi, Amrit Singh, Koppel, Alec, Tokekar, Pratap, and Manocha, Dinesh
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Computer Science - Robotics, FOS: Electrical engineering, electronic engineering, information engineering, Systems and Control (eess.SY), Electrical Engineering and Systems Science - Systems and Control, Robotics (cs.RO), Machine Learning (cs.LG)
Abstract: In this paper, we present a novel Heavy-Tailed Stochastic Policy Gradient (HT-SPG) algorithm to deal with the challenges of sparse rewards in continuous control problems. Sparse reward is common in continuous control robotics tasks such as manipulation and navigation, and makes the learning problem hard due to non-trivial estimation of value functions over the state space. This demands either reward shaping or expert demonstrations for the sparse reward environment. However, obtaining high-quality demonstrations is quite expensive and sometimes even impossible. We propose a heavy-tailed policy parametrization along with a modified momentum-based policy gradient tracking scheme (HT-SPG) to induce a stable exploratory behavior in the algorithm. The proposed algorithm does not require access to expert demonstrations. We test the performance of HT-SPG on various benchmark tasks of continuous control with sparse rewards such as 1D Mario, Pathological Mountain Car, Sparse Pendulum in OpenAI Gym, and Sparse MuJoCo environments (Hopper-v2). We show consistent performance improvement across all tasks in terms of high average cumulative reward. HT-SPG also demonstrates improved convergence speed with minimum samples, thereby emphasizing the sample efficiency of our proposed algorithm.
Published: 2022
Full Text: View/download PDF

28. GWA: A Large High-Quality Acoustic Dataset for Audio Processing

Author: Tang, Zhenyu, Aralikatti, Rohith, Ratnarajah, Anton, and Manocha, Dinesh
Subjects: FOS: Computer and information sciences, Sound (cs.SD), Computer Science::Sound, Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: We present the Geometric-Wave Acoustic (GWA) dataset, a large-scale audio dataset of about 2 million synthetic room impulse responses (IRs) and their corresponding detailed geometric and simulation configurations. Our dataset samples acoustic environments from over 6.8K high-quality diverse and professionally designed houses represented as semantically labeled 3D meshes. We also present a novel real-world acoustic materials assignment scheme based on semantic matching that uses a sentence transformer model. We compute high-quality impulse responses corresponding to accurate low-frequency and high-frequency wave effects by automatically calibrating geometric acoustic ray-tracing with a finite-difference time-domain wave solver. We demonstrate the higher accuracy of our IRs by comparing with recorded IRs from complex real-world environments. Moreover, we highlight the benefits of GWA on audio deep learning tasks such as automated speech recognition, speech enhancement, and speech separation. This dataset is the first data with accurate wave acoustic simulations in complex scenes. Codes and data are available at https://gamma.umd.edu/pro/sound/gwa.
Published: 2022
Full Text: View/download PDF

29. WGICP: Differentiable Weighted GICP-Based Lidar Odometry

Author: Son, Sanghyun, Liang, Jing, Lin, Ming, and Manocha, Dinesh
Subjects: FOS: Computer and information sciences, Computer Science - Robotics, Robotics (cs.RO)
Abstract: We present a novel differentiable weighted generalized iterative closest point (WGICP) method applicable to general 3D point cloud data, including that from Lidar. Our method builds on differentiable generalized ICP (GICP), and we propose using the differentiable K-Nearest Neighbor (KNN) algorithm to enhance differentiability. The differentiable GICP algorithm provides the gradient of output pose estimation with respect to each input point, which allows us to train a neural network to predict its importance, or weight, in estimating the correct pose. In contrast to the other ICP-based methods that use voxel-based downsampling or matching methods to reduce the computational cost, our method directly reduces the number of points used for GICP by only selecting those with the highest weights and ignoring redundant ones with lower weights. We show that our method improves both accuracy and speed of the GICP algorithm for the KITTI dataset and can be used to develop a more robust and efficient SLAM system., Comment: 6 pages
Published: 2022
Full Text: View/download PDF

30. MSVIPER: Improved Policy Distillation for Reinforcement-Learning-Based Robot Navigation

Author: Roth, Aaron M., Liang, Jing, Sriram, Ram, Tabassi, Elham, and Manocha, Dinesh
Subjects: FOS: Computer and information sciences, Computer Science - Robotics, Computer Science - Machine Learning, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, Computer Science - Human-Computer Interaction, Robotics (cs.RO), Human-Computer Interaction (cs.HC), Machine Learning (cs.LG)
Abstract: We present Multiple Scenario Verifiable Reinforcement Learning via Policy Extraction (MSVIPER), a new method for policy distillation to decision trees for improved robot navigation. MSVIPER learns an "expert" policy using any Reinforcement Learning (RL) technique involving learning a state-action mapping and then uses imitation learning to learn a decision-tree policy from it. We demonstrate that MSVIPER results in efficient decision trees and can accurately mimic the behavior of the expert policy. Moreover, we present efficient policy distillation and tree-modification techniques that take advantage of the decision tree structure to allow improvements to a policy without retraining. We use our approach to improve the performance of RL-based robot navigation algorithms for indoor and outdoor scenes. We demonstrate the benefits in terms of reduced freezing and oscillation behaviors (by up to 95% reduction) for mobile robots navigating among dynamic obstacles and reduced vibrations and oscillation (by up to 17%) for outdoor robot navigation on complex, uneven terrains., Comment: 6 pages main paper, 2 pages of references, 5 page appendix (13 pages total) 5 tables, 9 algorithms, 4 figures
Published: 2022
Full Text: View/download PDF

31. Vision-Centric BEV Perception: A Survey

Author: Ma, Yuexin, Wang, Tai, Bai, Xuyang, Yang, Huitong, Hou, Yuenan, Wang, Yaming, Qiao, Yu, Yang, Ruigang, Manocha, Dinesh, and Zhu, Xinge
Subjects: FOS: Computer and information sciences, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition
Abstract: In recent years, vision-centric Bird's Eye View (BEV) perception has garnered significant interest from both industry and academia due to its inherent advantages, such as providing an intuitive representation of the world and being conducive to data fusion. The rapid advancements in deep learning have led to the proposal of numerous methods for addressing vision-centric BEV perception challenges. However, there has been no recent survey encompassing this novel and burgeoning research field. To catalyze future research, this paper presents a comprehensive survey of the latest developments in vision-centric BEV perception and its extensions. It compiles and organizes up-to-date knowledge, offering a systematic review and summary of prevalent algorithms. Additionally, the paper provides in-depth analyses and comparative results on various BEV perception tasks, facilitating the evaluation of future works and sparking new research directions. Furthermore, the paper discusses and shares valuable empirical implementation details to aid in the advancement of related algorithms., Comment: project page at https://github.com/4DVLab/Vision-Centric-BEV-Perception; 22 pages, 15 figures
Published: 2022
Full Text: View/download PDF

32. SLICER: Learning universal audio representations using low-resource self-supervised pre-training

Author: Seth, Ashish, Ghosh, Sreyan, Umesh, S., and Manocha, Dinesh
Subjects: FOS: Computer and information sciences, Sound (cs.SD), Computer Science - Computation and Language, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, Computation and Language (cs.CL), Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: We present a new Self-Supervised Learning (SSL) approach to pre-train encoders on unlabeled audio data that reduces the need for large amounts of labeled data for audio and speech classification. Our primary aim is to learn audio representations that can generalize across a large variety of speech and non-speech tasks in a low-resource unlabeled audio pre-training setting. Inspired by the recent success of clustering and contrasting learning paradigms for SSL-based speech representation learning, we propose SLICER (Symmetrical Learning of Instance and Cluster-level Efficient Representations), which brings together the best of both clustering and contrasting learning paradigms. We use a symmetric loss between latent representations from student and teacher encoders and simultaneously solve instance and cluster-level contrastive learning tasks. We obtain cluster representations online by just projecting the input spectrogram into an output subspace with dimensions equal to the number of clusters. In addition, we propose a novel mel-spectrogram augmentation procedure, k-mix, based on mixup, which does not require labels and aids unsupervised representation learning for audio. Overall, SLICER achieves state-of-the-art results on the LAPE Benchmark, significantly outperforming DeLoRes-M and other prior approaches, which are pre-trained on 10× more unsupervised data. We will make all our code available on GitHub., Comment: ICASSP 2023
Published: 2022
Full Text: View/download PDF

33. VRDoc: Gaze-based Interactions for VR Reading Experience

Author: Lee, Geonsun, Healey, Jennifer, and Manocha, Dinesh
Subjects: FOS: Computer and information sciences, D.2.2, I.3.7, Computer Science - Human-Computer Interaction, Human-Computer Interaction (cs.HC)
Abstract: Virtual reality (VR) offers the promise of an infinite office and remote collaboration; however, existing interactions in VR do not strongly support one of the most essential tasks for most knowledge workers: reading. This paper presents VRDoc, a set of gaze-based interaction methods designed to improve the reading experience in VR. We introduce three key components: Gaze Select-and-Snap for document selection, Gaze MagGlass for enhanced text legibility, and Gaze Scroll for ease of document traversal. We implemented each of these tools using a commodity VR headset with eye-tracking. In a series of user studies with 13 participants, we show that VRDoc makes VR reading both more efficient (p < 0.01) and less demanding (p < 0.01), and when given a choice, users preferred to use our tools over the current VR reading methods., 8 pages, 4 figures, ISMAR 2022
Published: 2022
Full Text: View/download PDF

34. HighlightMe: Detecting Highlights from Human-Centric Videos

Author: Bhattacharya, Uttaran, Wu, Gang, Petrangeli, Stefano, Swaminathan, Viswanathan, and Manocha, Dinesh
Subjects: FOS: Computer and information sciences, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition
Abstract: We present a domain- and user-preference-agnostic approach to detect highlightable excerpts from human-centric videos. Our method works on the graph-based representation of multiple observable human-centric modalities in the videos, such as poses and faces. We use an autoencoder network equipped with spatial-temporal graph convolutions to detect human activities and interactions based on these modalities. We train our network to map the activity- and interaction-based latent structural representations of the different modalities to per-frame highlight scores based on the representativeness of the frames. We use these scores to compute which frames to highlight and stitch contiguous frames to produce the excerpts. We train our network on the large-scale AVA-Kinetics action dataset and evaluate it on four benchmark video highlight datasets: DSH, TVSum, PHD2, and SumMe. We observe a 4-12% improvement in the mean average precision of matching the human-annotated highlights over state-of-the-art methods in these datasets, without requiring any user-provided preferences or dataset-specific fine-tuning., 10 pages, 5 figures, 5 tables. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021
Published: 2021
Full Text: View/download PDF

35. DnD: Dense Depth Estimation in Crowded Dynamic Indoor Scenes

Author: Jung, Dongki, Choi, Jaehoon, Lee, Yonghan, Kim, Deokhwa, Kim, Changick, Manocha, Dinesh, and Lee, Donghwan
Subjects: FOS: Computer and information sciences, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION
Abstract: We present a novel approach for estimating depth from a monocular camera as it moves through complex and crowded indoor environments, e.g., a department store or a metro station. Our approach predicts absolute scale depth maps over the entire scene consisting of a static background and multiple moving people, by training on dynamic scenes. Since it is difficult to collect dense depth maps from crowded indoor environments, we design our training framework without requiring depths produced from depth sensing devices. Our network leverages RGB images and sparse depth maps generated from traditional 3D reconstruction methods to estimate dense depth maps. We use two constraints to handle depth for non-rigidly moving people without tracking their motion explicitly. We demonstrate that our approach offers consistent improvements over recent depth estimation methods on the NAVERLABS dataset, which includes complex and crowded scenes.
Published: 2021
Full Text: View/download PDF

36. XAI-N: Sensor-based Robot Navigation using Expert Policies and Decision Trees

Author: Roth, Aaron M., Liang, Jing, and Manocha, Dinesh
Subjects: FOS: Computer and information sciences, Computer Science - Robotics, Computer Science - Machine Learning, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, Robotics (cs.RO), Machine Learning (cs.LG)
Abstract: We present a novel sensor-based learning navigation algorithm to compute a collision-free trajectory for a robot in dense and dynamic environments with moving obstacles or targets. Our approach uses a deep reinforcement learning-based expert policy that is trained using a sim2real paradigm. In order to increase the reliability and handle the failure cases of the expert policy, we combine it with a policy extraction technique to transform the resulting policy into a decision tree format. The resulting decision tree has properties which we use to analyze and modify the policy and improve performance on navigation metrics including smoothness, frequency of oscillation, frequency of immobilization, and obstruction of target. We are able to modify the policy to address these imperfections without retraining, combining the learning power of deep learning with the control of domain-specific algorithms. We highlight the benefits of our algorithm in simulated environments and navigating a Clearpath Jackal robot among moving pedestrians. (Videos at this url: https://gamma.umd.edu/researchdirections/xrl/navviper)
Published: 2021
Full Text: View/download PDF

37. METEOR: A Dense, Heterogeneous, and Unstructured Traffic Dataset With Rare Behaviors

Author: Chandra, Rohan, Wang, Xijun, Mahajan, Mridul, Kala, Rahul, Palugulla, Rishitha, Naidu, Chandrababu, Jain, Alok, and Manocha, Dinesh
Subjects: FOS: Computer and information sciences, Computer Science - Robotics, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, Robotics (cs.RO)
Abstract: We present a new traffic dataset, METEOR, which captures traffic patterns and multi-agent driving behaviors in unstructured scenarios. METEOR consists of more than 1000 one-minute videos, over 2 million annotated frames with bounding boxes and GPS trajectories for 16 unique agent categories, and more than 13 million bounding boxes for traffic agents. METEOR is a dataset for rare and interesting, multi-agent driving behaviors that are grouped into traffic violations, atypical interactions, and diverse scenarios. Every video in METEOR is tagged using a diverse range of factors corresponding to weather, time of the day, road conditions, and traffic density. We use METEOR to benchmark perception methods for object detection and multi-agent behavior prediction. Our key finding is that state-of-the-art models for object detection and behavior prediction, which otherwise succeed on existing datasets such as Waymo, fail on the METEOR dataset. METEOR marks the first step towards the development of more sophisticated perception models for dense, heterogeneous, and unstructured scenarios., Comment: Under review at IROS 2022
Published: 2021
Full Text: View/download PDF

38. Example-based Real-time Clothing Synthesis for Virtual Agents

Author: Wu, Nannan, Chao, Qianwen, Chen, Yanzhen, Xu, Weiwei, Liu, Chen, Manocha, Dinesh, Sun, Wenxin, Han, Yi, Yao, Xinran, and Jin, Xiaogang
Subjects: FOS: Computer and information sciences, Computer Science - Graphics, Graphics (cs.GR), ComputingMethodologies_COMPUTERGRAPHICS
Abstract: We present a real-time cloth animation method for dressing virtual humans of various shapes and poses. Our approach formulates the clothing deformation as a high-dimensional function of body shape parameters and pose parameters. In order to accelerate the computation, our formulation factorizes the clothing deformation into two independent components: the deformation introduced by body pose variation (Clothing Pose Model) and the deformation from body shape variation (Clothing Shape Model). Furthermore, we sample and cluster the poses spanning the entire pose space and use those clusters to efficiently calculate the anchoring points. We also introduce a sensitivity-based distance measurement to both find nearby anchoring points and evaluate their contributions to the final animation. Given a query shape and pose of the virtual agent, we synthesize the resulting clothing deformation by blending the Taylor expansion results of nearby anchoring points. Compared to previous methods, our approach is general and able to add the shape dimension to any clothing pose model. Furthermore, we can animate clothing represented with tens of thousands of vertices at 50+ FPS on a CPU. Moreover, our example database is more representative and can be generated in parallel, thereby saving training time. We also conduct a user evaluation and show that our method can improve a user's perception of dressed virtual agents in an immersive virtual environment compared to a conventional linear blend skinning method.
Published: 2021
Full Text: View/download PDF

39. TS-RIR: Translated synthetic room impulse responses for speech augmentation

Author: Ratnarajah, Anton, Tang, Zhenyu, and Manocha, Dinesh
Subjects: FOS: Computer and information sciences, Sound (cs.SD), Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: We present a method for improving the quality of synthetic room impulse responses for far-field speech recognition. We bridge the gap between the fidelity of synthetic room impulse responses (RIRs) and the real room impulse responses using our novel TS-RIRGAN architecture. Given a synthetic RIR in the form of raw audio, we use TS-RIRGAN to translate it into a real RIR. We also perform real-world sub-band room equalization on the translated synthetic RIR. Our overall approach improves the quality of synthetic RIRs by compensating for low-frequency wave effects, similar to those in real RIRs. We evaluate the performance of improved synthetic RIRs on a far-field speech dataset augmented by convolving the LibriSpeech clean speech dataset [1] with RIRs and adding background noise. We show that far-field speech augmented using our improved synthetic RIRs reduces the word error rate by up to 19.9% in the Kaldi far-field automatic speech recognition benchmark [2]., Comment: Accepted to IEEE ASRU 2021. Source code is available at https://github.com/GAMMA-UMD/TS-RIR
Published: 2021
Full Text: View/download PDF
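
The augmentation step described in the abstract above (convolving clean speech with an RIR, then adding background noise) can be sketched as follows. This is a minimal illustration and not the authors' pipeline: the function names, the white Gaussian noise model, and the 20 dB default signal-to-noise ratio are all assumptions made for the example.

```python
import math
import random

def convolve(signal, rir):
    """Full linear convolution of a dry signal with a room impulse response."""
    out = [0.0] * (len(signal) + len(rir) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(rir):
            out[i + j] += s * h
    return out

def add_noise(signal, snr_db, rng):
    """Add white Gaussian noise scaled to a target SNR in dB (assumed noise model)."""
    power = sum(x * x for x in signal) / len(signal)
    noise_power = power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in signal]

def augment(clean, rir, snr_db=20.0, seed=0):
    """Hypothetical helper: reverberate clean speech, then add background noise."""
    rng = random.Random(seed)
    return add_noise(convolve(clean, rir), snr_db, rng)

if __name__ == "__main__":
    clean = [1.0, 0.0, -1.0, 0.0] * 4   # toy stand-in for a speech waveform
    rir = [1.0, 0.5, 0.25]              # toy stand-in for an impulse response
    print(len(augment(clean, rir)))
```

A real system would use FFT-based convolution and recorded noise rather than the quadratic loop and synthetic noise shown here; the loop is kept only so the sketch has no dependencies.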

40. Improving Reverberant Speech Separation with Multi-stage Training and Curriculum Learning

Author: Aralikatti, Rohith, Ratnarajah, Anton, Tang, Zhenyu, and Manocha, Dinesh
Subjects: FOS: Computer and information sciences, Sound (cs.SD), Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: We present a novel approach that improves the performance of reverberant speech separation. Our approach is based on an accurate geometric acoustic simulator (GAS) which generates realistic room impulse responses (RIRs) by modeling both specular and diffuse reflections. We also propose three training methods (pre-training, multi-stage training, and curriculum learning) that significantly improve separation quality in the presence of reverberation. We also demonstrate that mixing the synthetic RIRs with a small number of real RIRs during training enhances separation performance. We evaluate our approach on reverberant mixtures generated from real, recorded data (in several different room configurations) from the VOiCES dataset. Our novel approach (curriculum learning + pre-training + multi-stage training) results in a significant relative improvement over prior techniques based on the image source method (ISM).
Published: 2021
Full Text: View/download PDF

41. N-Cloth: Predicting 3D Cloth Deformation with Mesh-Based Networks

Author: Li, Yudi, Tang, Min, Yang, Yun, Huang, Zi, Tong, Ruofeng, Yang, Shuangcai, Li, Yao, and Manocha, Dinesh
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Computer Science - Graphics, GeneralLiterature_MISCELLANEOUS, Graphics (cs.GR), ComputingMethodologies_COMPUTERGRAPHICS, Machine Learning (cs.LG)
Abstract: We present a novel mesh-based learning approach (N-Cloth) for plausible 3D cloth deformation prediction. Our approach is general and can handle cloth or obstacles represented by triangle meshes with arbitrary topologies. We use graph convolution to transform the cloth and object meshes into a latent space to reduce the non-linearity in the mesh space. Our network can predict the target 3D cloth mesh deformation based on the initial state of the cloth mesh template and the target obstacle mesh. Our approach can handle complex cloth meshes with up to 100K triangles and scenes with various objects corresponding to SMPL humans, non-SMPL humans or rigid bodies. In practice, our approach can be used to generate plausible cloth simulation at 30-45 fps on an NVIDIA GeForce RTX 3090 GPU. We highlight its benefits over prior learning-based methods and physically-based cloth simulators., Comment: 12 pages
Published: 2021
Full Text: View/download PDF

42. Robust 2D/3D Vehicle Parsing in CVIS

Author: Miao, Hui, Lu, Feixiang, Liu, Zongdai, Zhang, Liangjun, Manocha, Dinesh, and Zhou, Bin
Subjects: FOS: Computer and information sciences, Computer Science - Robotics, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, Robotics (cs.RO)
Abstract: We present a novel approach to robustly detect and perceive vehicles in different camera views as part of a cooperative vehicle-infrastructure system (CVIS). Our formulation is designed for arbitrary camera views and makes no assumptions about intrinsic or extrinsic parameters. First, to deal with multi-view data scarcity, we propose a part-assisted novel view synthesis algorithm for data augmentation. We train a part-based texture inpainting network in a self-supervised manner. Then we render the textured model into the background image with the target 6-DoF pose. Second, to handle various camera parameters, we present a new method that produces dense mappings between image pixels and 3D points to perform robust 2D/3D vehicle parsing. Third, we build the first CVIS dataset for benchmarking, which annotates more than 1540 images (14017 instances) from real-world traffic scenarios. We combine these novel algorithms and datasets to develop a robust approach for 2D/3D vehicle parsing for CVIS. In practice, our approach outperforms SOTA methods on 2D detection, instance segmentation, and 6-DoF pose estimation, by 4.5%, 4.3%, and 2.9%, respectively. More details and results are included in the supplement. To facilitate future research, we will release the source code and the dataset on GitHub.
Published: 2021
Full Text: View/download PDF

43. Dynamic Coherence-Based EM Ray Tracing Simulations in Vehicular Environments

Author: Wang, Ruichen and Manocha, Dinesh
Subjects: Computer Science - Networking and Internet Architecture, Networking and Internet Architecture (cs.NI), Signal Processing (eess.SP), FOS: Computer and information sciences, FOS: Electrical engineering, electronic engineering, information engineering, Electrical Engineering and Systems Science - Signal Processing
Abstract: 5G applications have become increasingly popular in recent years as the spread of fifth-generation (5G) network deployment has grown. For vehicular networks, mmWave band signals have been well studied and used for communication and sensing. In this work, we propose a new dynamic ray tracing algorithm that exploits spatial and temporal coherence. We evaluate the performance by comparing the results on typical vehicular communication scenarios with GEMV^2, which uses a combination of deterministic and stochastic models, and WinProp, which utilizes the deterministic model for simulations with given environment information. We also compare the performance of our algorithm on complex, urban models and observe a reduction in computation time by 36% compared to GEMV^2 and by 30% compared to WinProp, while maintaining similar prediction accuracy., Comment: 7 pages, 15 figures, conference
Published: 2021
Full Text: View/download PDF

44. SPA: Verbal Interactions between Agents and Avatars in Shared Virtual Environments using Propositional Planning

Author: Best, Andrew, Narang, Sahil, and Manocha, Dinesh
Subjects: FOS: Computer and information sciences, Computer Science - Computation and Language, Computer Science - Multiagent Systems, Computation and Language (cs.CL), Multiagent Systems (cs.MA)
Abstract: We present a novel approach for generating plausible verbal interactions between virtual human-like agents and user avatars in shared virtual environments. Sense-Plan-Ask, or SPA, extends prior work in propositional planning and natural language processing to enable agents to plan with uncertain information, and leverage question and answer dialogue with other agents and avatars to obtain the needed information and complete their goals. The agents are additionally able to respond to questions from the avatars and other agents using natural language, enabling real-time multi-agent multi-avatar communication environments. Our algorithm can simulate tens of virtual agents interacting, moving, communicating, planning, and replanning at interactive rates. We find that our algorithm incurs a small runtime cost and enables agents to complete their goals more effectively than agents without the ability to leverage natural-language communication. We demonstrate quantitative results on a set of simulated benchmarks and detail the results of a preliminary user study conducted to evaluate the plausibility of the virtual interactions generated by SPA. Overall, we find that participants prefer SPA to prior techniques in 84% of responses, including significant benefits in terms of the plausibility of natural-language interactions and the positive impact of those interactions.
Published: 2020
Full Text: View/download PDF

45. AutoTrajectory: Label-free Trajectory Extraction and Prediction from Videos using Dynamic Points

Author: Ma, Yuexin, Zhu, Xinge, Cheng, Xinjing, Yang, Ruigang, Liu, Jiming, and Manocha, Dinesh
Subjects: FOS: Computer and information sciences, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition
Abstract: Current methods for trajectory prediction operate in a supervised manner, and therefore require vast quantities of corresponding ground truth data for training. In this paper, we present a novel, label-free algorithm, AutoTrajectory, for trajectory extraction and prediction that uses raw videos directly. To better capture the moving objects in videos, we introduce dynamic points. We use them to model dynamic motions by using a forward-backward extractor to keep temporal consistency and using image reconstruction to keep spatial consistency in an unsupervised manner. Then we aggregate dynamic points to instance points, which stand for moving objects such as pedestrians in videos. Finally, we extract trajectories by matching instance points for prediction training. To the best of our knowledge, our method is the first to achieve unsupervised learning of trajectory extraction and prediction. We evaluate the performance on well-known trajectory datasets and show that our method is effective for real-world videos and can use raw videos to further improve the performance of existing models.
Published: 2020
Full Text: View/download PDF

46. Multi-Window Data Augmentation Approach for Speech Emotion Recognition

Author: Padi, Sarala, Manocha, Dinesh, and Sriram, Ram D.
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Sound (cs.SD), Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing, Machine Learning (cs.LG)
Abstract: We present a Multi-Window Data Augmentation (MWA-SER) approach for speech emotion recognition. MWA-SER is a unimodal approach that focuses on two key concepts: designing the speech augmentation method and building the deep learning model to recognize the underlying emotion of an audio signal. Our proposed multi-window augmentation approach generates additional data samples from the speech signal by employing multiple window sizes in the audio feature extraction process. We show that our augmentation method, combined with a deep learning model, improves speech emotion recognition performance. We evaluate the performance of our approach on three benchmark datasets: IEMOCAP, SAVEE, and RAVDESS. We show that the multi-window model improves the SER performance and outperforms a single-window model. Finding the best window size is an essential step in audio feature extraction. We perform extensive experimental evaluations to find the best window choice and explore the windowing effect for SER analysis.
Published: 2020
Full Text: View/download PDF
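
The idea of generating additional samples by varying the analysis window, as described in the abstract above, can be sketched as follows. This is a simplified illustration under stated assumptions, not the MWA-SER implementation: per-frame mean energy stands in for the real audio features, and the function names, window sizes, and 50% hop ratio are hypothetical.

```python
def frame_signal(samples, window, hop):
    """Split samples into overlapping frames of length `window`, stepping by `hop`."""
    return [samples[start:start + window]
            for start in range(0, len(samples) - window + 1, hop)]

def frame_energies(samples, window, hop):
    """Stand-in feature (an assumption): mean energy of each frame."""
    return [sum(x * x for x in frame) / window
            for frame in frame_signal(samples, window, hop)]

def multi_window_views(samples, windows, hop_ratio=0.5):
    """Return one feature sequence per window size for the same utterance."""
    views = {}
    for window in windows:
        hop = max(1, int(window * hop_ratio))
        views[window] = frame_energies(samples, window, hop)
    return views

if __name__ == "__main__":
    utterance = [float(i % 5) for i in range(16)]   # toy waveform
    for window, feats in multi_window_views(utterance, (4, 8)).items():
        print(window, len(feats))
```

Each window size yields a differently shaped view of the same utterance, so one labeled recording contributes several training samples.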

47. AES: Autonomous Excavator System for Real-World and Hazardous Environments

Author: Zhao, Jinxin, Long, Pinxin, Wang, Liyang, Qian, Lingfeng, Lu, Feixiang, Song, Xibin, Manocha, Dinesh, and Zhang, Liangjun
Subjects: FOS: Computer and information sciences, Computer Science - Robotics, Robotics (cs.RO)
Abstract: Excavators are widely used for material-handling applications in unstructured environments, including mining and construction. The size of the global market for excavators was 44.12 billion USD in 2018 and is predicted to grow to 63.14 billion USD by 2026. Operating excavators in a real-world environment can be challenging due to extreme conditions such as rock sliding, ground collapse, or excessive dust. Multiple fatalities and injuries occur each year during excavations. An autonomous excavator that can substitute for human operators in these hazardous environments would substantially lower the number of injuries and could improve overall productivity.
Published: 2020
Full Text: View/download PDF

48. PerMO: Perceiving More at Once from a Single Image for Autonomous Driving

Author: Lu, Feixiang, Liu, Zongdai, Song, Xibin, Zhou, Dingfu, Li, Wei, Miao, Hui, Liao, Miao, Zhang, Liangjun, Zhou, Bin, Yang, Ruigang, and Manocha, Dinesh
Subjects: FOS: Computer and information sciences, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION
Abstract: We present a novel approach to detect, segment, and reconstruct complete textured 3D models of vehicles from a single image for autonomous driving. Our approach combines the strengths of deep learning and the elegance of traditional techniques from part-based deformable model representation to produce high-quality 3D models in the presence of severe occlusions. We present a new part-based deformable vehicle model that is used for instance segmentation and automatically generate a dataset that contains dense correspondences between 2D images and 3D models. We also present a novel end-to-end deep neural network to predict dense 2D/3D mapping and highlight its benefits. Based on the dense mapping, we are able to compute precise 6-DoF poses and 3D reconstruction results at almost interactive rates on a commodity GPU. We have integrated these algorithms with an autonomous driving system. In practice, our method outperforms the state-of-the-art methods for all major vehicle parsing tasks: 2D instance segmentation by 4.4 points (mAP), 6-DoF pose estimation by 9.11 points, and 3D detection by 1.37. Moreover, we have released all of the source code, dataset, and the trained model on GitHub.
Published: 2020
Full Text: View/download PDF

49. Reactive Navigation under Non-Parametric Uncertainty through Hilbert Space Embedding of Probabilistic Velocity Obstacles

Author: Jyotish, P. S. Naga, Gopalakrishnan, Bharath, Kumar, A. V. S. Sai Bhargav, Singh, Arun Kumar, Krishna, K. Madhava, and Manocha, Dinesh
Subjects: FOS: Computer and information sciences, Computer Science - Robotics, FOS: Electrical engineering, electronic engineering, information engineering, Systems and Control (eess.SY), Electrical Engineering and Systems Science - Systems and Control, Robotics (cs.RO)
Abstract: The probabilistic velocity obstacle (PVO) extends the concept of velocity obstacle (VO) to work in uncertain dynamic environments. In this paper, we show how a robust model predictive control (MPC) with PVO constraints under non-parametric uncertainty can be made computationally tractable. At the core of our formulation is a novel yet simple interpretation of our robust MPC as a problem of matching the distribution of PVO with a certain desired distribution. To this end, we propose two methods. Our first baseline method is based on approximating the distribution of PVO with a Gaussian Mixture Model (GMM) and subsequently performing distribution matching using the Kullback-Leibler (KL) divergence metric. Our second formulation is based on the possibility of representing arbitrary distributions as functions in Reproducing Kernel Hilbert Space (RKHS). We use this foundation to interpret our robust MPC as a problem of minimizing the distance between the desired distribution and the distribution of the PVO in the RKHS. Both the RKHS- and GMM-based formulations can work with any uncertainty distribution, thus allowing us to relax the prevalent Gaussian assumption in existing works. We validate our formulation by taking an example of 2D navigation of quadrotors with a realistic noise model for perception and ego-motion uncertainty. In particular, we present a systematic comparison between the GMM and the RKHS approach and show that while both approaches can produce safe trajectories, the former is highly conservative and leads to poor tracking and control costs. Furthermore, the RKHS-based approach gives better computation times that are up to one order of magnitude lower than the computation time of the GMM-based approach., Comment: 17 pages, 16 figures, 2 tables, accepted in IEEE Robotics and Automation Letters (RA-L)
Published: 2020
Full Text: View/download PDF
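
The abstract above describes measuring the distance between two distributions embedded in an RKHS. One standard sample-based distance of this kind is the maximum mean discrepancy (MMD), sketched below for 1-D samples with a Gaussian kernel. Whether the paper uses exactly this estimator is not stated in the abstract, so the estimator, the kernel choice, the `gamma` bandwidth, and the function names are all assumptions made for illustration.

```python
import math

def rbf_kernel(x, y, gamma=1.0):
    """Gaussian (RBF) kernel between two scalars; gamma is an assumed bandwidth."""
    return math.exp(-gamma * (x - y) ** 2)

def mean_kernel(xs, ys, gamma):
    """Average kernel value over all pairs drawn from xs and ys."""
    total = sum(rbf_kernel(a, b, gamma) for a in xs for b in ys)
    return total / (len(xs) * len(ys))

def mmd_squared(xs, ys, gamma=1.0):
    """Biased estimate of the squared MMD between two 1-D sample sets."""
    return (mean_kernel(xs, xs, gamma)
            + mean_kernel(ys, ys, gamma)
            - 2.0 * mean_kernel(xs, ys, gamma))

if __name__ == "__main__":
    desired = [0.0, 0.1, 0.2]
    close = [0.05, 0.15, 0.25]
    far = [5.0, 5.1, 5.2]
    # Samples near the desired distribution give a smaller distance than distant ones.
    print(mmd_squared(desired, close) < mmd_squared(desired, far))
```

Because the estimate depends only on samples, it needs no parametric form for either distribution, which is the property the abstract relies on to drop the Gaussian assumption.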

50. Fine-Grained Vehicle Perception via 3D Part-Guided Visual Data Augmentation

Author: Lu, Feixiang, Liu, Zongdai, Miao, Hui, Wang, Peng, Zhang, Liangjun, Yang, Ruigang, Manocha, Dinesh, and Zhou, Bin
Subjects: FOS: Computer and information sciences, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition
Abstract: Holistically understanding an object and its 3D movable parts through visual perception models is essential for enabling an autonomous agent to interact with the world. For autonomous driving, the dynamics and states of vehicle parts such as doors, the trunk, and the bonnet can provide meaningful semantic information and interaction states, which are essential to ensuring the safety of the self-driving vehicle. Existing visual perception models mainly focus on coarse parsing such as object bounding box detection or pose estimation and rarely tackle these situations. In this paper, we address this important autonomous driving problem by solving three critical issues. First, to deal with data scarcity, we propose an effective training data generation process by fitting a 3D car model with dynamic parts to vehicles in real images before reconstructing human-vehicle interaction (VHI) scenarios. Our approach is fully automatic without any human interaction, and can generate a large number of vehicles in uncommon states (VUS) for training deep neural networks (DNNs). Second, to perform fine-grained vehicle perception, we present a multi-task network for VUS parsing and a multi-stream network for VHI parsing. Third, to quantitatively evaluate the effectiveness of our data augmentation approach, we build the first VUS dataset in real traffic scenarios (e.g., getting on/out or placing/removing luggage). Experimental results show that our approach outperforms other baseline methods in 2D detection and instance segmentation by a large margin (over 8%). In addition, our network yields large improvements in discovering and understanding these uncommon cases. Moreover, we have released the source code, the dataset, and the trained model on GitHub (https://github.com/zongdai/EditingForDNN).
Published: 2020
Full Text: View/download PDF

67 results on '"Manocha, Dinesh"'