Author: "Wu, Jiajun" - Searchworks@Jio Institute Digital Library Search Results

1,873 results on '"Wu, Jiajun"'

1. ZeroHSI: Zero-Shot 4D Human-Scene Interaction by Video Generation

Author: Li, Hongjie, Yu, Hong-Xing, Li, Jiaman, and Wu, Jiajun
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Graphics
Abstract: Human-scene interaction (HSI) generation is crucial for applications in embodied AI, virtual reality, and robotics. While existing methods can synthesize realistic human motions in 3D scenes and generate plausible human-object interactions, they heavily rely on datasets containing paired 3D scene and motion capture data, which are expensive and time-consuming to collect across diverse environments and interactions. We present ZeroHSI, a novel approach that enables zero-shot 4D human-scene interaction synthesis by integrating video generation and neural human rendering. Our key insight is to leverage the rich motion priors learned by state-of-the-art video generation models, which have been trained on vast amounts of natural human movements and interactions, and use differentiable rendering to reconstruct human-scene interactions. ZeroHSI can synthesize realistic human motions in both static scenes and environments with dynamic objects, without requiring any ground-truth motion data. We evaluate ZeroHSI on a curated dataset of different types of various indoor and outdoor scenes with different interaction prompts, demonstrating its ability to generate diverse and contextually appropriate human-scene interactions., Comment: Project website: https://awfuact.github.io/zerohsi/
Published: 2024

2. Birth and Death of a Rose

Author: Geng, Chen, Zhang, Yunzhi, Wu, Shangzhe, and Wu, Jiajun
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Graphics, I.2.10
Abstract: We study the problem of generating temporal object intrinsics -- temporally evolving sequences of object geometry, reflectance, and texture, such as a blooming rose -- from pre-trained 2D foundation models. Unlike conventional 3D modeling and animation techniques that require extensive manual effort and expertise, we introduce a method that generates such assets with signals distilled from pre-trained 2D diffusion models. To ensure the temporal consistency of object intrinsics, we propose Neural Templates for temporal-state-guided distillation, derived automatically from image features from self-supervised learning. Our method can generate high-quality temporal object intrinsics for several natural phenomena and enable the sampling and controllable rendering of these dynamic objects from any viewpoint, under any environmental lighting conditions, at any time of their lifespan., Comment: Project website: https://chen-geng.com/rose4d
Published: 2024

3. ShapeCraft: Body-Aware and Semantics-Aware 3D Object Design

Author: Guo, Michelle, Tang, Mia, Cha, Hannah, Zhang, Ruohan, Liu, C. Karen, and Wu, Jiajun
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Graphics
Abstract: For designing a wide range of everyday objects, the design process should be aware of both the human body and the underlying semantics of the design specification. However, these two objectives present significant challenges to the current AI-based designing tools. In this work, we present a method to synthesize body-aware 3D objects from a base mesh given an input body geometry and either text or image as guidance. The generated objects can be simulated on virtual characters, or fabricated for real-world use. We propose to use a mesh deformation procedure that optimizes for both semantic alignment as well as contact and penetration losses. Using our method, users can generate both virtual or real-world objects from text, image, or sketch, without the need for manual artist intervention. We present both qualitative and quantitative results on various object categories, demonstrating the effectiveness of our approach., Comment: Project webpage: https://miatang13.github.io/Shape-Craft/
Published: 2024

4. Streaming Detection of Queried Event Start

Author: Eyzaguirre, Cristobal, Tang, Eric, Buch, Shyamal, Gaidon, Adrien, Wu, Jiajun, and Niebles, Juan Carlos
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Robotics, autonomous driving, augmented reality, and many embodied computer vision applications must quickly react to user-defined events unfolding in real time. We address this setting by proposing a novel task for multimodal video understanding: Streaming Detection of Queried Event Start (SDQES). The goal of SDQES is to identify the beginning of a complex event as described by a natural language query, with high accuracy and low latency. We introduce a new benchmark based on the Ego4D dataset, as well as new task-specific metrics to study streaming multimodal detection of diverse events in an egocentric video setting. Inspired by parameter-efficient fine-tuning methods in NLP and for video tasks, we propose adapter-based baselines that enable image-to-video transfer learning, allowing for efficient online video modeling. We evaluate three vision-language backbones and three adapter architectures on both short-clip and untrimmed video settings.
Published: 2024

5. LayoutVLM: Differentiable Optimization of 3D Layout via Vision-Language Models

Author: Sun, Fan-Yun, Liu, Weiyu, Gu, Siyi, Lim, Dylan, Bhat, Goutam, Tombari, Federico, Li, Manling, Haber, Nick, and Wu, Jiajun
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: Open-universe 3D layout generation arranges unlabeled 3D assets conditioned on language instruction. Large language models (LLMs) struggle with generating physically plausible 3D scenes and adherence to input instructions, particularly in cluttered scenes. We introduce LayoutVLM, a framework and scene layout representation that exploits the semantic knowledge of Vision-Language Models (VLMs) and supports differentiable optimization to ensure physical plausibility. LayoutVLM employs VLMs to generate two mutually reinforcing representations from visually marked images, and a self-consistent decoding process to improve VLMs' spatial planning. Our experiments show that LayoutVLM addresses the limitations of existing LLM and constraint-based approaches, producing physically plausible 3D layouts better aligned with the semantic intent of input language instructions. We also demonstrate that fine-tuning VLMs with the proposed scene layout representation extracted from existing scene datasets can improve performance., Comment: Project website: https://ai.stanford.edu/~sunfanyun/layoutvlm/
Published: 2024

6. Lifting Motion to the 3D World via 2D Diffusion

Author: Li, Jiaman, Liu, C. Karen, and Wu, Jiajun
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Estimating 3D motion from 2D observations is a long-standing research challenge. Prior work typically requires training on datasets containing ground truth 3D motions, limiting their applicability to activities well-represented in existing motion capture data. This dependency particularly hinders generalization to out-of-distribution scenarios or subjects where collecting 3D ground truth is challenging, such as complex athletic movements or animal motion. We introduce MVLift, a novel approach to predict global 3D motion -- including both joint rotations and root trajectories in the world coordinate system -- using only 2D pose sequences for training. Our multi-stage framework leverages 2D motion diffusion models to progressively generate consistent 2D pose sequences across multiple views, a key step in recovering accurate global 3D motion. MVLift generalizes across various domains, including human poses, human-object interactions, and animal poses. Despite not requiring 3D supervision, it outperforms prior work on five datasets, including those methods that require 3D supervision., Comment: project page: https://lijiaman.github.io/projects/mvlift/
Published: 2024

7. Diffusion Self-Distillation for Zero-Shot Customized Image Generation

Author: Cai, Shengqu, Chan, Eric, Zhang, Yunzhi, Guibas, Leonidas, Wu, Jiajun, and Wetzstein, Gordon
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Graphics, Computer Science - Machine Learning
Abstract: Text-to-image diffusion models produce impressive results but are frustrating tools for artists who desire fine-grained control. For example, a common use case is to create images of a specific instance in novel contexts, i.e., "identity-preserving generation". This setting, along with many other tasks (e.g., relighting), is a natural fit for image+text-conditional generative models. However, there is insufficient high-quality paired data to train such a model directly. We propose Diffusion Self-Distillation, a method for using a pre-trained text-to-image model to generate its own dataset for text-conditioned image-to-image tasks. We first leverage a text-to-image diffusion model's in-context generation ability to create grids of images and curate a large paired dataset with the help of a Visual-Language Model. We then fine-tune the text-to-image model into a text+image-to-image model using the curated paired dataset. We demonstrate that Diffusion Self-Distillation outperforms existing zero-shot methods and is competitive with per-instance tuning techniques on a wide range of identity-preservation generation tasks, without requiring test-time optimization., Comment: Project page: https://primecai.github.io/dsd/
Published: 2024

8. IKEA Manuals at Work: 4D Grounding of Assembly Instructions on Internet Videos

Author: Liu, Yunong, Eyzaguirre, Cristobal, Li, Manling, Khanna, Shubh, Niebles, Juan Carlos, Ravi, Vineeth, Mishra, Saumitra, Liu, Weiyu, and Wu, Jiajun
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Machine Learning, Computer Science - Robotics
Abstract: Shape assembly is a ubiquitous task in daily life, integral for constructing complex 3D structures like IKEA furniture. While significant progress has been made in developing autonomous agents for shape assembly, existing datasets have not yet tackled the 4D grounding of assembly instructions in videos, essential for a holistic understanding of assembly in 3D space over time. We introduce IKEA Video Manuals, a dataset that features 3D models of furniture parts, instructional manuals, assembly videos from the Internet, and most importantly, annotations of dense spatio-temporal alignments between these data modalities. To demonstrate the utility of IKEA Video Manuals, we present five applications essential for shape assembly: assembly plan generation, part-conditioned segmentation, part-conditioned pose estimation, video object segmentation, and furniture assembly based on instructional video manuals. For each application, we provide evaluation metrics and baseline methods. Through experiments on our annotated data, we highlight many challenges in grounding assembly instructions in videos to improve shape assembly, including handling occlusions, varying viewpoints, and extended assembly sequences., Comment: NeurIPS 2024 Datasets and Benchmarks Track
Published: 2024

9. HourVideo: 1-Hour Video-Language Understanding

Author: Chandrasegaran, Keshigeyan, Gupta, Agrim, Hadzic, Lea M., Kota, Taran, He, Jimming, Eyzaguirre, Cristóbal, Durante, Zane, Li, Manling, Wu, Jiajun, and Fei-Fei, Li
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: We present HourVideo, a benchmark dataset for hour-long video-language understanding. Our dataset consists of a novel task suite comprising summarization, perception (recall, tracking), visual reasoning (spatial, temporal, predictive, causal, counterfactual), and navigation (room-to-room, object retrieval) tasks. HourVideo includes 500 manually curated egocentric videos from the Ego4D dataset, spanning durations of 20 to 120 minutes, and features 12,976 high-quality, five-way multiple-choice questions. Benchmarking results reveal that multimodal models, including GPT-4 and LLaVA-NeXT, achieve marginal improvements over random chance. In stark contrast, human experts significantly outperform the state-of-the-art long-context multimodal model, Gemini Pro 1.5 (85.0% vs. 37.3%), highlighting a substantial gap in multimodal capabilities. Our benchmark, evaluation toolkit, prompts, and documentation are available at https://hourvideo.stanford.edu, Comment: NeurIPS 2024 Datasets and Benchmarks Track; 28 pages
Published: 2024

10. TATAA: Programmable Mixed-Precision Transformer Acceleration with a Transformable Arithmetic Architecture

Author: Wu, Jiajun, Song, Mo, Zhao, Jingmin, Gao, Yizhao, Li, Jia, and So, Hayden Kwok-Hay
Subjects: Computer Science - Hardware Architecture
Abstract: Modern transformer-based deep neural networks present unique technical challenges for effective acceleration in real-world applications. Apart from the vast amount of linear operations needed due to their sizes, modern transformer models are increasingly reliant on precise non-linear computations that make traditional low-bitwidth quantization methods and fixed-dataflow matrix accelerators ineffective for end-to-end acceleration. To address this need to accelerate both linear and non-linear operations in a unified and programmable framework, this paper introduces TATAA. TATAA employs 8-bit integer (int8) arithmetic for quantized linear layer operations through post-training quantization, while it relies on bfloat16 floating-point arithmetic to approximate non-linear layers of a transformer model. TATAA hardware features a transformable arithmetic architecture that supports both formats during runtime with minimal overhead, enabling it to switch between a systolic array mode for int8 matrix multiplications and a SIMD mode for vectorized bfloat16 operations. An end-to-end compiler is presented to enable flexible mapping from emerging transformer models to the proposed hardware. Experimental results indicate that our mixed-precision design incurs only 0.14% to 1.16% accuracy drop when compared with the pre-trained single-precision transformer models across a range of vision, language, and generative text applications. Our prototype implementation on the Alveo U280 FPGA currently achieves 2935.2 GOPS throughput on linear layers and a maximum of 189.5 GFLOPS for non-linear operations, outperforming related works by up to 1.45x in end-to-end throughput and 2.29x in DSP efficiency, while achieving 2.19x higher power efficiency than a modern NVIDIA RTX 4090 GPU.
Published: 2024

11. The Scene Language: Representing Scenes with Programs, Words, and Embeddings

Author: Zhang, Yunzhi, Li, Zizhang, Zhou, Matt, Wu, Shangzhe, and Wu, Jiajun
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: We introduce the Scene Language, a visual scene representation that concisely and precisely describes the structure, semantics, and identity of visual scenes. It represents a scene with three key components: a program that specifies the hierarchical and relational structure of entities in the scene, words in natural language that summarize the semantic class of each entity, and embeddings that capture the visual identity of each entity. This representation can be inferred from pre-trained language models via a training-free inference technique, given text or image inputs. The resulting scene can be rendered into images using traditional, neural, or hybrid graphics renderers. Together, this forms a robust, automated system for high-quality 3D and 4D scene generation. Compared with existing representations like scene graphs, our proposed Scene Language generates complex scenes with higher fidelity, while explicitly modeling the scene structures to enable precise control and editing., Comment: Project page: https://ai.stanford.edu/~yzzhang/projects/scene-language/
Published: 2024

12. Learning Smooth Humanoid Locomotion through Lipschitz-Constrained Policies

Author: Chen, Zixuan, He, Xialin, Wang, Yen-Jen, Liao, Qiayuan, Ze, Yanjie, Li, Zhongyu, Sastry, S. Shankar, Wu, Jiajun, Sreenath, Koushil, Gupta, Saurabh, and Peng, Xue Bin
Subjects: Computer Science - Robotics, Computer Science - Artificial Intelligence
Abstract: Reinforcement learning combined with sim-to-real transfer offers a general framework for developing locomotion controllers for legged robots. To facilitate successful deployment in the real world, smoothing techniques, such as low-pass filters and smoothness rewards, are often employed to develop policies with smooth behaviors. However, because these techniques are non-differentiable and usually require tedious tuning of a large set of hyperparameters, they tend to require extensive manual tuning for each robotic platform. To address this challenge and establish a general technique for enforcing smooth behaviors, we propose a simple and effective method that imposes a Lipschitz constraint on a learned policy, which we refer to as Lipschitz-Constrained Policies (LCP). We show that the Lipschitz constraint can be implemented in the form of a gradient penalty, which provides a differentiable objective that can be easily incorporated with automatic differentiation frameworks. We demonstrate that LCP effectively replaces the need for smoothing rewards or low-pass filters and can be easily integrated into training frameworks for many distinct humanoid robots. We extensively evaluate LCP in both simulation and real-world humanoid robots, producing smooth and robust locomotion controllers. All simulation and deployment code, along with complete checkpoints, is available on our project page: https://lipschitz-constrained-policy.github.io., Comment: 8 pages
Published: 2024

13. Generalizable Humanoid Manipulation with Improved 3D Diffusion Policies

Author: Ze, Yanjie, Chen, Zixuan, Wang, Wenhao, Chen, Tianyi, He, Xialin, Yuan, Ying, Peng, Xue Bin, and Wu, Jiajun
Subjects: Computer Science - Robotics, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: Humanoid robots capable of autonomous operation in diverse environments have long been a goal for roboticists. However, autonomous manipulation by humanoid robots has largely been restricted to one specific scene, primarily due to the difficulty of acquiring generalizable skills. Recent advances in 3D visuomotor policies, such as the 3D Diffusion Policy (DP3), have shown promise in extending these capabilities to wilder environments. However, 3D visuomotor policies often rely on camera calibration and point-cloud segmentation, which present challenges for deployment on mobile robots like humanoids. In this work, we introduce the Improved 3D Diffusion Policy (iDP3), a novel 3D visuomotor policy that eliminates these constraints by leveraging egocentric 3D visual representations. We demonstrate that iDP3 enables a full-sized humanoid robot to autonomously perform skills in diverse real-world scenarios, using only data collected in the lab. Videos are available at: https://humanoid-manipulation.github.io, Comment: Project website: https://humanoid-manipulation.github.io
Published: 2024

14. Automated Creation of Digital Cousins for Robust Policy Learning

Author: Dai, Tianyuan, Wong, Josiah, Jiang, Yunfan, Wang, Chen, Gokmen, Cem, Zhang, Ruohan, Wu, Jiajun, and Fei-Fei, Li
Subjects: Computer Science - Robotics
Abstract: Training robot policies in the real world can be unsafe, costly, and difficult to scale. Simulation serves as an inexpensive and potentially limitless source of training data, but suffers from the semantics and physics disparity between simulated and real-world environments. These discrepancies can be minimized by training in digital twins, which serve as virtual replicas of a real scene but are expensive to generate and cannot produce cross-domain generalization. To address these limitations, we propose the concept of digital cousins, a virtual asset or scene that, unlike a digital twin, does not explicitly model a real-world counterpart but still exhibits similar geometric and semantic affordances. As a result, digital cousins simultaneously reduce the cost of generating an analogous virtual environment while also facilitating better robustness during sim-to-real domain transfer by providing a distribution of similar training scenes. Leveraging digital cousins, we introduce a novel method for their automated creation, and propose a fully automated real-to-sim-to-real pipeline for generating fully interactive scenes and training robot policies that can be deployed zero-shot in the original scene. We find that digital cousin scenes that preserve geometric and semantic affordances can be produced automatically, and can be used to train policies that outperform policies trained on digital twins, achieving 90% vs. 25% success rates under zero-shot sim-to-real transfer. Additional details are available at https://digital-cousins.github.io/., Comment: CoRL 2024
Published: 2024

15. Embodied Agent Interface: Benchmarking LLMs for Embodied Decision Making

Author: Li, Manling, Zhao, Shiyu, Wang, Qineng, Wang, Kangrui, Zhou, Yu, Srivastava, Sanjana, Gokmen, Cem, Lee, Tony, Li, Li Erran, Zhang, Ruohan, Liu, Weiyu, Liang, Percy, Fei-Fei, Li, Mao, Jiayuan, and Wu, Jiajun
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Machine Learning, Computer Science - Robotics
Abstract: We aim to evaluate Large Language Models (LLMs) for embodied decision making. While a significant body of work has been leveraging LLMs for decision making in embodied environments, we still lack a systematic understanding of their performance because they are usually applied in different domains, for different purposes, and built based on different inputs and outputs. Furthermore, existing evaluations tend to rely solely on a final success rate, making it difficult to pinpoint what ability is missing in LLMs and where the problem lies, which in turn blocks embodied agents from leveraging LLMs effectively and selectively. To address these limitations, we propose a generalized interface (Embodied Agent Interface) that supports the formalization of various types of tasks and input-output specifications of LLM-based modules. Specifically, it allows us to unify 1) a broad set of embodied decision-making tasks involving both state and temporally extended goals, 2) four commonly-used LLM-based modules for decision making: goal interpretation, subgoal decomposition, action sequencing, and transition modeling, and 3) a collection of fine-grained metrics which break down evaluation into various types of errors, such as hallucination errors, affordance errors, various types of planning errors, etc. Overall, our benchmark offers a comprehensive assessment of LLMs' performance for different subtasks, pinpointing the strengths and weaknesses in LLM-powered embodied AI systems, and providing insights for effective and selective use of LLMs in embodied decision making., Comment: Accepted for oral presentation at NeurIPS 2024 in the Datasets and Benchmarks track. Camera-ready version
Published: 2024

16. Don't Cut Corners: Exact Conditions for Modularity in Biologically Inspired Representations

Author: Dorrell, Will, Hsu, Kyle, Hollingsworth, Luke, Lee, Jin Hwa, Wu, Jiajun, Finn, Chelsea, Latham, Peter E, Behrens, Tim EJ, and Whittington, James CR
Subjects: Quantitative Biology - Neurons and Cognition, Computer Science - Artificial Intelligence, Computer Science - Machine Learning, Computer Science - Neural and Evolutionary Computing
Abstract: Why do biological and artificial neurons sometimes modularise, each encoding a single meaningful variable, and sometimes entangle their representation of many variables? In this work, we develop a theory of when biologically inspired representations -- those that are nonnegative and energy efficient -- modularise with respect to source variables (sources). We derive necessary and sufficient conditions on a sample of sources that determine whether the neurons in an optimal biologically-inspired linear autoencoder modularise. Our theory applies to any dataset, extending far beyond the case of statistical independence studied in previous work. Rather, we show that sources modularise if their support is "sufficiently spread". From this theory, we extract and validate predictions in a variety of empirical studies on how data distribution affects modularisation in nonlinear feedforward and recurrent neural networks trained on supervised and unsupervised tasks. Furthermore, we apply these ideas to neuroscience data. First, we explain why two studies that recorded prefrontal activity in working memory tasks conflict on whether memories are encoded in orthogonal subspaces: the support of the sources differed due to a critical discrepancy in experimental protocol. Second, we use similar arguments to understand why preparatory and potent subspaces in RNN models of motor cortex are only sometimes orthogonal. Third, we study spatial and reward information mixing in entorhinal recordings, and show our theory matches data better than previous work. And fourth, we suggest a suite of surprising settings in which neurons can be (or appear) mixed selective, without requiring complex nonlinear readouts as in traditional theories. In sum, our theory prescribes precise conditions on when neural activities modularise, providing tools for inducing and elucidating modular representations in brains and machines., Comment: 47 pages, 23 figures. WD and KH contributed equally; LH and JHL contributed equally
Published: 2024

17. MARPLE: A Benchmark for Long-Horizon Inference

Author: Jin, Emily, Huang, Zhuoyi, Fränken, Jan-Philipp, Liu, Weiyu, Cha, Hannah, Brockbank, Erik, Wu, Sarah, Zhang, Ruohan, Wu, Jiajun, and Gerstenberg, Tobias
Subjects: Computer Science - Machine Learning
Abstract: Reconstructing past events requires reasoning across long time horizons. To figure out what happened, we need to use our prior knowledge about the world and human behavior and draw inferences from various sources of evidence including visual, language, and auditory cues. We introduce MARPLE, a benchmark for evaluating long-horizon inference capabilities using multi-modal evidence. Our benchmark features agents interacting with simulated households, supporting vision, language, and auditory stimuli, as well as procedurally generated environments and agent behaviors. Inspired by classic "whodunit" stories, we ask AI models and human participants to infer which agent caused a change in the environment based on a step-by-step replay of what actually happened. The goal is to correctly identify the culprit as early as possible. Our findings show that human participants outperform both traditional Monte Carlo simulation methods and an LLM baseline (GPT-4) on this task. Compared to humans, traditional inference models are less robust and performant, while GPT-4 has difficulty comprehending environmental changes. We analyze what factors influence inference performance and ablate different modes of evidence, finding that all modes are valuable for performance. Overall, our experiments demonstrate that the long-horizon, multimodal inference tasks in our benchmark present a challenge to current models., Comment: NeurIPS 2024. First two authors contributed equally. Project page: https://marple-benchmark.github.io/
Published: 2024

18. FactorSim: Generative Simulation via Factorized Representation

Author: Sun, Fan-Yun, Harini, S. I., Yi, Angela, Zhou, Yihan, Zook, Alex, Tremblay, Jonathan, Cross, Logan, Wu, Jiajun, and Haber, Nick
Subjects: Computer Science - Artificial Intelligence, Computer Science - Robotics
Abstract: Generating simulations to train intelligent agents in game-playing and robotics from natural language input, from user input or task documentation, remains an open-ended challenge. Existing approaches focus on parts of this challenge, such as generating reward functions or task hyperparameters. Unlike previous work, we introduce FACTORSIM that generates full simulations in code from language input that can be used to train agents. Exploiting the structural modularity specific to coded simulations, we propose to use a factored partially observable Markov decision process representation that allows us to reduce context dependence during each step of the generation. For evaluation, we introduce a generative simulation benchmark that assesses the generated simulation code's accuracy and effectiveness in facilitating zero-shot transfers in reinforcement learning settings. We show that FACTORSIM outperforms existing methods in generating simulations regarding prompt alignment (e.g., accuracy), zero-shot transfer abilities, and human evaluation. We also demonstrate its effectiveness in generating robotic tasks., Comment: NeurIPS 2024. Project website: https://cs.stanford.edu/~sunfanyun/factorsim/
Published: 2024

19. DiffSound: Differentiable Modal Sound Rendering and Inverse Rendering for Diverse Inference Tasks

Author: Jin, Xutong, Xu, Chenxi, Gao, Ruohan, Wu, Jiajun, Wang, Guoping, and Li, Sheng
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Accurately estimating and simulating the physical properties of objects from real-world sound recordings is of great practical importance in the fields of vision, graphics, and robotics. However, the progress in these directions has been limited -- prior differentiable rigid or soft body simulation techniques cannot be directly applied to modal sound synthesis due to the high sampling rate of audio, while previous audio synthesizers often do not fully model the accurate physical properties of the sounding objects. We propose DiffSound, a differentiable sound rendering framework for physics-based modal sound synthesis, which is based on an implicit shape representation, a new high-order finite element analysis module, and a differentiable audio synthesizer. Our framework can solve a wide range of inverse problems thanks to the differentiability of the entire pipeline, including physical parameter estimation, geometric shape reasoning, and impact position prediction. Experimental results demonstrate the effectiveness of our approach, highlighting its ability to accurately reproduce the target sound in a physics-based manner. DiffSound serves as a valuable tool for various sound synthesis and analysis applications., Comment: 12 pages, 10 figures. Published in Siggraph 2024. Project page: https://hellojxt.github.io/DiffSound/
Published: 2024

20. What Makes a Maze Look Like a Maze?

Author: Hsu, Joy, Mao, Jiayuan, Tenenbaum, Joshua B., Goodman, Noah D., and Wu, Jiajun
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: A unique aspect of human visual understanding is the ability to flexibly interpret abstract concepts: acquiring lifted rules explaining what they symbolize, grounding them across familiar and unfamiliar contexts, and making predictions or reasoning about them. While off-the-shelf vision-language models excel at making literal interpretations of images (e.g., recognizing object categories such as tree branches), they still struggle to make sense of such visual abstractions (e.g., how an arrangement of tree branches may form the walls of a maze). To address this challenge, we introduce Deep Schema Grounding (DSG), a framework that leverages explicit structured representations of visual abstractions for grounding and reasoning. At the core of DSG are schemas--dependency graph descriptions of abstract concepts that decompose them into more primitive-level symbols. DSG uses large language models to extract schemas, then hierarchically grounds concrete to abstract components of the schema onto images with vision-language models. The grounded schema is used to augment visual abstraction understanding. We systematically evaluate DSG and different methods in reasoning on our new Visual Abstractions Dataset, which consists of diverse, real-world images of abstract concepts and corresponding question-answer pairs labeled by humans. We show that DSG significantly improves the abstract visual reasoning performance of vision-language models, and is a step toward human-aligned understanding of visual abstractions.
Published: 2024

21. An Eulerian Vortex Method on Flow Maps

Author: Wang, Sinan, Deng, Yitong, Deng, Molin, Yu, Hong-Xing, Zhou, Junwei, Chen, Duowen, Komura, Taku, Wu, Jiajun, and Zhu, Bo
Subjects: Computer Science - Graphics, Mathematics - Numerical Analysis, Physics - Fluid Dynamics
Abstract: We present an Eulerian vortex method based on the theory of flow maps to simulate the complex vortical motions of incompressible fluids. Central to our method is the novel incorporation of the flow-map transport equations for line elements, which, in combination with a bi-directional marching scheme for flow maps, enables the high-fidelity Eulerian advection of vorticity variables. The fundamental motivation is that, compared to impulse $\mathbf{m}$, which has been recently bridged with flow maps to encouraging results, vorticity $\boldsymbol{\omega}$ promises to be preferable for its numerical stability and physical interpretability. To realize the full potential of this novel formulation, we develop a new Poisson solving scheme for vorticity-to-velocity reconstruction that is both efficient and able to accurately handle the coupling near solid boundaries. We demonstrate the efficacy of our approach with a range of vortex simulation examples, including leapfrog vortices, vortex collisions, cavity flow, and the formation of complex vortical structures due to solid-fluid interactions., Comment: Accepted at ACM Transactions on Graphics (SIGGRAPH Asia 2024)
Published: 2024

22. View-Invariant Policy Learning via Zero-Shot Novel View Synthesis

Author: Tian, Stephen, Wulfe, Blake, Sargent, Kyle, Liu, Katherine, Zakharov, Sergey, Guizilini, Vitor, and Wu, Jiajun
Subjects: Computer Science - Robotics, Computer Science - Artificial Intelligence, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: Large-scale visuomotor policy learning is a promising approach toward developing generalizable manipulation systems. Yet, policies that can be deployed on diverse embodiments, environments, and observational modalities remain elusive. In this work, we investigate how knowledge from large-scale visual data of the world may be used to address one axis of variation for generalizable manipulation: observational viewpoint. Specifically, we study single-image novel view synthesis models, which learn 3D-aware scene-level priors by rendering images of the same scene from alternate camera viewpoints given a single input image. For practical application to diverse robotic data, these models must operate zero-shot, performing view synthesis on unseen tasks and environments. We empirically analyze view synthesis models within a simple data-augmentation scheme that we call View Synthesis Augmentation (VISTA) to understand their capabilities for learning viewpoint-invariant policies from single-viewpoint demonstration data. Upon evaluating the robustness of policies trained with our method to out-of-distribution camera viewpoints, we find that they outperform baselines in both simulated and real-world manipulation tasks. Videos and additional visualizations are available at https://s-tian.github.io/projects/vista., Comment: Accepted to CoRL 2024
Published: 2024

23. D5RL: Diverse Datasets for Data-Driven Deep Reinforcement Learning

Author: Rafailov, Rafael, Hatch, Kyle, Singh, Anikait, Smith, Laura, Kumar, Aviral, Kostrikov, Ilya, Hansen-Estruch, Philippe, Kolev, Victor, Ball, Philip, Wu, Jiajun, Finn, Chelsea, and Levine, Sergey
Subjects: Computer Science - Machine Learning, Computer Science - Robotics
Abstract: Offline reinforcement learning algorithms hold the promise of enabling data-driven RL methods that do not require costly or dangerous real-world exploration and benefit from large pre-collected datasets. This in turn can facilitate real-world applications, as well as a more standardized approach to RL research. Furthermore, offline RL methods can provide effective initializations for online finetuning to overcome challenges with exploration. However, evaluating progress on offline RL algorithms requires effective and challenging benchmarks that capture properties of real-world tasks, provide a range of task difficulties, and cover a range of challenges both in terms of the parameters of the domain (e.g., length of the horizon, sparsity of rewards) and the parameters of the data (e.g., narrow demonstration data or broad exploratory data). While considerable progress in offline RL in recent years has been enabled by simpler benchmark tasks, the most widely used datasets are increasingly saturating in performance and may fail to reflect properties of realistic tasks. We propose a new benchmark for offline RL that focuses on realistic simulations of robotic manipulation and locomotion environments, based on models of real-world robotic systems, and comprising a variety of data sources, including scripted data, play-style data collected by human teleoperators, and other data sources. Our proposed benchmark covers state-based and image-based domains, and supports both offline RL and online fine-tuning evaluation, with some of the tasks specifically designed to require both pre-training and fine-tuning. We hope that our proposed benchmark will facilitate further progress on both offline RL and fine-tuning algorithms. Website with code, examples, tasks, and data is available at https://sites.google.com/view/d5rl/, Comment: RLC 2024
Published: 2024

24. Enhancing Equitable Access to AI in Housing and Homelessness System of Care through Federated Learning

Author: Taib, Musa, Wu, Jiajun, Drew, Steve, and Messier, Geoffrey G.
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Computers and Society
Abstract: The top priority of a Housing and Homelessness System of Care (HHSC) is to connect people experiencing homelessness to supportive housing. An HHSC typically consists of many agencies serving the same population. Information technology platforms differ in type and quality between agencies, so their data are usually isolated from one agency to another. Larger agencies may have sufficient data to train and test artificial intelligence (AI) tools but smaller agencies typically do not. To address this gap, we introduce a Federated Learning (FL) approach enabling all agencies to train a predictive model collaboratively without sharing their sensitive data. We demonstrate how FL can be used within an HHSC to provide all agencies equitable access to quality AI and further assist human decision-makers in the allocation of resources within HHSC. This is achieved while preserving the privacy of the people within the data by not sharing identifying information between agencies without their consent. Our experimental results using real-world HHSC data from Calgary, Alberta, demonstrate that our FL approach offers comparable performance with the idealized scenario of training the predictive model with data fully shared and linked between agencies., Comment: Accepted at the 2024 AAAI/ACM Conference on AI, Ethics, and Society (AIES)
Published: 2024

25. RoboPack: Learning Tactile-Informed Dynamics Models for Dense Packing

Author: Ai, Bo, Tian, Stephen, Shi, Haochen, Wang, Yixuan, Tan, Cheston, Li, Yunzhu, and Wu, Jiajun
Subjects: Computer Science - Robotics, Computer Science - Artificial Intelligence, Computer Science - Machine Learning, I.2.9, I.2.6, I.2.10
Abstract: Tactile feedback is critical for understanding the dynamics of both rigid and deformable objects in many manipulation tasks, such as non-prehensile manipulation and dense packing. We introduce an approach that combines visual and tactile sensing for robotic manipulation by learning a neural, tactile-informed dynamics model. Our proposed framework, RoboPack, employs a recurrent graph neural network to estimate object states, including particles and object-level latent physics information, from historical visuo-tactile observations and to perform future state predictions. Our tactile-informed dynamics model, learned from real-world data, can solve downstream robotics tasks with model-predictive control. We demonstrate our approach on a real robot equipped with a compliant Soft-Bubble tactile sensor on non-prehensile manipulation and dense packing tasks, where the robot must infer the physics properties of objects from direct and indirect interactions. Trained on only an average of 30 minutes of real-world interaction data per task, our model can perform online adaptation and make touch-informed predictions. Through extensive evaluations in both long-horizon dynamics prediction and real-world manipulation, our method demonstrates superior effectiveness compared to previous learning-based and physics-based simulation systems., Comment: Robotics: Science and Systems (RSS), 2024. Project page: https://robo-pack.github.io/
Published: 2024

26. WonderWorld: Interactive 3D Scene Generation from a Single Image

Author: Yu, Hong-Xing, Duan, Haoyi, Herrmann, Charles, Freeman, William T., and Wu, Jiajun
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Graphics
Abstract: We present WonderWorld, a novel framework for interactive 3D scene generation that enables users to interactively specify scene contents and layout and see the created scenes with low latency. The major challenge lies in achieving fast generation of 3D scenes. Existing scene generation approaches fall short on speed as they often require (1) progressively generating many views and depth maps, and (2) time-consuming optimization of the scene geometry representations. We introduce the Fast Layered Gaussian Surfels (FLAGS) as our scene representation and an algorithm to generate it from a single view. Our approach does not need multiple views, and it leverages a geometry-based initialization that significantly reduces optimization time. Another challenge is generating coherent geometry that allows all scenes to be connected. We introduce guided depth diffusion, which allows partial conditioning of depth estimation. WonderWorld generates connected and diverse 3D scenes in less than 10 seconds on a single A6000 GPU, enabling real-time user interaction and exploration. We demonstrate the potential of WonderWorld for user-driven content creation and exploration in virtual environments. We will release full code and software for reproducibility. Project website: https://kovenyu.com/WonderWorld/., Comment: Project website: https://kovenyu.com/WonderWorld/
Published: 2024

27. Hearing Anything Anywhere

Author: Wang, Mason, Sawata, Ryosuke, Clarke, Samuel, Gao, Ruohan, Wu, Shangzhe, and Wu, Jiajun
Subjects: Computer Science - Sound, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Audio and Speech Processing, I.2.10, I.4.8
Abstract: Recent years have seen immense progress in 3D computer vision and computer graphics, with emerging tools that can virtualize real-world 3D environments for numerous Mixed Reality (XR) applications. However, alongside immersive visual experiences, immersive auditory experiences are equally vital to our holistic perception of an environment. In this paper, we aim to reconstruct the spatial acoustic characteristics of an arbitrary environment given only a sparse set of (roughly 12) room impulse response (RIR) recordings and a planar reconstruction of the scene, a setup that is easily achievable by ordinary users. To this end, we introduce DiffRIR, a differentiable RIR rendering framework with interpretable parametric models of salient acoustic features of the scene, including sound source directivity and surface reflectivity. This allows us to synthesize novel auditory experiences through the space with any source audio. To evaluate our method, we collect a dataset of RIR recordings and music in four diverse, real environments. We show that our model outperforms state-of-the-art baselines on rendering monaural and binaural RIRs and music at unseen locations, and learns physically interpretable parameters characterizing acoustic properties of the sound source and surfaces in the scene., Comment: CVPR 2024. The first two authors contributed equally. Project page: https://masonlwang.com/hearinganythinganywhere/
Published: 2024

28. Efficient Imitation Learning with Conservative World Models

Author: Kolev, Victor, Rafailov, Rafael, Hatch, Kyle, Wu, Jiajun, and Finn, Chelsea
Subjects: Computer Science - Machine Learning
Abstract: We tackle the problem of policy learning from expert demonstrations without a reward function. A central challenge in this space is that these policies fail upon deployment due to issues of distributional shift, environment stochasticity, or compounding errors. Adversarial imitation learning alleviates this issue but requires additional on-policy training samples for stability, which presents a challenge in realistic domains due to inefficient learning and high sample complexity. One approach to this issue is to learn a world model of the environment, and use synthetic data for policy training. While successful in prior works, we argue that this is sub-optimal due to additional distribution shifts between the learned model and the real environment. Instead, we re-frame imitation learning as a fine-tuning problem, rather than a pure reinforcement learning one. Drawing theoretical connections to offline RL and fine-tuning algorithms, we argue that standard online world model algorithms are not well suited to the imitation learning problem. We derive a principled conservative optimization bound and demonstrate empirically that it leads to improved performance on two very challenging manipulation environments from high-dimensional raw pixel observations. We set a new state-of-the-art performance on the Franka Kitchen environment from images, requiring only 10 demos and no reward labels, as well as solving a complex dexterity manipulation task., Comment: Oral presentation, L4DC 2024
Published: 2024

29. TRANSIC: Sim-to-Real Policy Transfer by Learning from Online Correction

Author: Jiang, Yunfan, Wang, Chen, Zhang, Ruohan, Wu, Jiajun, and Fei-Fei, Li
Subjects: Computer Science - Robotics, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: Learning in simulation and transferring the learned policy to the real world has the potential to enable generalist robots. The key challenge of this approach is to address simulation-to-reality (sim-to-real) gaps. Previous methods often require domain-specific knowledge a priori. We argue that a straightforward way to obtain such knowledge is by asking humans to observe and assist robot policy execution in the real world. The robots can then learn from humans to close various sim-to-real gaps. We propose TRANSIC, a data-driven approach to enable successful sim-to-real transfer based on a human-in-the-loop framework. TRANSIC allows humans to augment simulation policies to overcome various unmodeled sim-to-real gaps holistically through intervention and online correction. Residual policies can be learned from human corrections and integrated with simulation policies for autonomous execution. We show that our approach can achieve successful sim-to-real transfer in complex and contact-rich manipulation tasks such as furniture assembly. Through synergistic integration of policies learned in simulation and from humans, TRANSIC is effective as a holistic approach to addressing various, often coexisting sim-to-real gaps. It displays attractive properties such as scaling with human effort. Videos and code are available at https://transic-robot.github.io/, Comment: 8th Conference on Robot Learning (CoRL 2024), Munich, Germany. Project website: https://transic-robot.github.io/
Published: 2024

30. BEHAVIOR Vision Suite: Customizable Dataset Generation via Simulation

Author: Ge, Yunhao, Tang, Yihe, Xu, Jiashu, Gokmen, Cem, Li, Chengshu, Ai, Wensi, Martinez, Benjamin Jose, Aydin, Arman, Anvari, Mona, Chakravarthy, Ayush K, Yu, Hong-Xing, Wong, Josiah, Srivastava, Sanjana, Lee, Sharon, Zha, Shengxin, Itti, Laurent, Li, Yunzhu, Martín-Martín, Roberto, Liu, Miao, Zhang, Pengchuan, Zhang, Ruohan, Fei-Fei, Li, and Wu, Jiajun
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: The systematic evaluation and understanding of computer vision models under varying conditions require large amounts of data with comprehensive and customized labels, which real-world vision datasets rarely satisfy. While current synthetic data generators offer a promising alternative, particularly for embodied AI tasks, they often fall short for computer vision tasks due to low asset and rendering quality, limited diversity, and unrealistic physical properties. We introduce the BEHAVIOR Vision Suite (BVS), a set of tools and assets to generate fully customized synthetic data for systematic evaluation of computer vision models, based on the newly developed embodied AI benchmark, BEHAVIOR-1K. BVS supports a large number of adjustable parameters at the scene level (e.g., lighting, object placement), the object level (e.g., joint configuration, attributes such as "filled" and "folded"), and the camera level (e.g., field of view, focal length). Researchers can arbitrarily vary these parameters during data generation to perform controlled experiments. We showcase three example application scenarios: systematically evaluating the robustness of models across different continuous axes of domain shift, evaluating scene understanding models on the same set of images, and training and evaluating simulation-to-real transfer for a novel vision task: unary and binary state prediction. Project website: https://behavior-vision-suite.github.io/, Comment: CVPR 2024 (Highlight). Project website: https://behavior-vision-suite.github.io/
Published: 2024

31. Evaluating Real-World Robot Manipulation Policies in Simulation

Author: Li, Xuanlin, Hsu, Kyle, Gu, Jiayuan, Pertsch, Karl, Mees, Oier, Walke, Homer Rich, Fu, Chuyuan, Lunawat, Ishikaa, Sieh, Isabel, Kirmani, Sean, Levine, Sergey, Wu, Jiajun, Finn, Chelsea, Su, Hao, Vuong, Quan, and Xiao, Ted
Subjects: Computer Science - Robotics, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: The field of robotics has made significant advances towards generalist robot manipulation policies. However, real-world evaluation of such policies is not scalable and faces reproducibility challenges, which are likely to worsen as policies broaden the spectrum of tasks they can perform. We identify control and visual disparities between real and simulated environments as key challenges for reliable simulated evaluation and propose approaches for mitigating these gaps without needing to craft full-fidelity digital twins of real-world environments. We then employ these approaches to create SIMPLER, a collection of simulated environments for manipulation policy evaluation on common real robot setups. Through paired sim-and-real evaluations of manipulation policies, we demonstrate strong correlation between policy performance in SIMPLER environments and in the real world. Additionally, we find that SIMPLER evaluations accurately reflect real-world policy behavior modes such as sensitivity to various distribution shifts. We open-source all SIMPLER environments along with our workflow for creating new environments at https://simpler-env.github.io to facilitate research on general-purpose manipulation policies and simulated evaluation frameworks.
Published: 2024

32. Composable Part-Based Manipulation

Author: Liu, Weiyu, Mao, Jiayuan, Hsu, Joy, Hermans, Tucker, Garg, Animesh, and Wu, Jiajun
Subjects: Computer Science - Robotics, Computer Science - Artificial Intelligence, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: In this paper, we propose composable part-based manipulation (CPM), a novel approach that leverages object-part decomposition and part-part correspondences to improve learning and generalization of robotic manipulation skills. By considering the functional correspondences between object parts, we conceptualize functional actions, such as pouring and constrained placing, as combinations of different correspondence constraints. CPM comprises a collection of composable diffusion models, where each model captures a different inter-object correspondence. These diffusion models can generate parameters for manipulation skills based on the specific object parts. Leveraging part-based correspondences coupled with the task decomposition into distinct constraints enables strong generalization to novel objects and object categories. We validate our approach in both simulated and real-world scenarios, demonstrating its effectiveness in achieving robust and generalized manipulation capabilities., Comment: Presented at CoRL 2023. For videos and additional results, see our website: https://cpmcorl2023.github.io/
Published: 2024

33. Learning Planning Abstractions from Language

Author: Liu, Weiyu, Chen, Geng, Hsu, Joy, Mao, Jiayuan, and Wu, Jiajun
Subjects: Computer Science - Robotics, Computer Science - Artificial Intelligence
Abstract: This paper presents a framework for learning state and action abstractions in sequential decision-making domains. Our framework, planning abstraction from language (PARL), utilizes language-annotated demonstrations to automatically discover a symbolic and abstract action space and induce a latent state abstraction based on it. PARL consists of three stages: 1) recovering object-level and action concepts, 2) learning state abstractions, abstract action feasibility, and transition models, and 3) applying low-level policies for abstract actions. During inference, given the task description, PARL first makes abstract action plans using the latent transition and feasibility functions, then refines the high-level plan using low-level policies. PARL generalizes across scenarios involving novel object instances and environments, unseen concept compositions, and tasks that require longer planning horizons than settings it is trained on., Comment: The first two authors contributed equally. The last two authors provide equal advising. Project website: https://parl2024.github.io/
Published: 2024

34. 6G autonomous radio access network empowered by artificial intelligence and network digital twin

Author: Liu, Guangyi, Deng, Juan, Zhu, Yanhong, Li, Na, Han, Boxiao, Wang, Shoufeng, Rui, Hua, Wang, Jingyu, Zhang, Jianhua, Cui, Ying, Cui, Yingping, Yang, Yang, Zhang, Yan, Wang, Jiangzhou, Ouyang, Ye, Ye, Xiaozhou, Chen, Tao, Li, Rongpeng, Zhu, Yongdong, Zhang, Yuanyuan, Yang, Li, Bian, Sen, Sun, Wanfei, Zheng, Qingbi, Tong, Zhou, Zhang, Huimin, Shao, Zecai, Wu, Jiajun, and Kang, Mancong
Published: 2024

35. The muscle–intervertebral disc interaction mediated by L-BAIBA modulates extracellular matrix homeostasis and PANoptosis in nucleus pulposus cells

Author: Qin, Tianyu, Shi, Ming, Zhang, Chao, Wu, Jiajun, Huang, Zhengqi, Zhang, Xiaohe, Li, Shuangxing, Wu, Yuliang, Han, Weitao, Gao, Bo, Xu, Kang, Jin, Song, and Ye, Wei
Published: 2024

36. Controllable Human-Object Interaction Synthesis

Author: Li, Jiaman, Clegg, Alexander, Mottaghi, Roozbeh, Wu, Jiajun, Puig, Xavier, Liu, C. Karen, Goos, Gerhard, Series Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Leonardis, Aleš, editor, Ricci, Elisa, editor, Roth, Stefan, editor, Russakovsky, Olga, editor, Sattler, Torsten, editor, and Varol, Gül, editor
Published: 2025

37. PhysDreamer: Physics-Based Interaction with 3D Objects via Video Generation

Author: Zhang, Tianyuan, Yu, Hong-Xing, Wu, Rundi, Feng, Brandon Y., Zheng, Changxi, Snavely, Noah, Wu, Jiajun, Freeman, William T., Goos, Gerhard, Series Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Leonardis, Aleš, editor, Ricci, Elisa, editor, Roth, Stefan, editor, Russakovsky, Olga, editor, Sattler, Torsten, editor, and Varol, Gül, editor
Published: 2025

38. Reconstruction and Simulation of Elastic Objects with Spring-Mass 3D Gaussians

Author: Zhong, Licheng, Yu, Hong-Xing, Wu, Jiajun, Li, Yunzhu, Goos, Gerhard, Series Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Leonardis, Aleš, editor, Ricci, Elisa, editor, Roth, Stefan, editor, Russakovsky, Olga, editor, Sattler, Torsten, editor, and Varol, Gül, editor
Published: 2025

39. Ponymation: Learning Articulated 3D Animal Motions from Unlabeled Online Videos

Author: Sun, Keqiang, Litvak, Dor, Zhang, Yunzhi, Li, Hongsheng, Wu, Jiajun, Wu, Shangzhe, Goos, Gerhard, Series Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Leonardis, Aleš, editor, Ricci, Elisa, editor, Roth, Stefan, editor, Russakovsky, Olga, editor, Sattler, Torsten, editor, and Varol, Gül, editor
Published: 2025

40. 3D Congealing: 3D-Aware Image Alignment in the Wild

Author: Zhang, Yunzhi, Li, Zizhang, Raj, Amit, Engelhardt, Andreas, Li, Yuanzhen, Hou, Tingbo, Wu, Jiajun, Jampani, Varun, Goos, Gerhard, Series Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Leonardis, Aleš, editor, Ricci, Elisa, editor, Roth, Stefan, editor, Russakovsky, Olga, editor, Sattler, Torsten, editor, and Varol, Gül, editor
Published: 2025

41. Naturally Supervised 3D Visual Grounding with Language-Regularized Concept Learners

Author: Feng, Chun, Hsu, Joy, Liu, Weiyu, and Wu, Jiajun
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: 3D visual grounding is a challenging task that often requires direct and dense supervision, notably the semantic label for each object in the scene. In this paper, we instead study the naturally supervised setting that learns from only 3D scene and QA pairs, where prior works underperform. We propose the Language-Regularized Concept Learner (LARC), which uses constraints from language as regularization to significantly improve the accuracy of neuro-symbolic concept learners in the naturally supervised setting. Our approach is based on two core insights: the first is that language constraints (e.g., a word's relation to another) can serve as effective regularization for structured representations in neuro-symbolic models; the second is that we can query large language models to distill such constraints from language properties. We show that LARC improves performance of prior works in naturally supervised 3D visual grounding, and demonstrates a wide range of 3D visual reasoning capabilities, from zero-shot composition to data efficiency and transferability. Our method represents a promising step towards regularizing structured visual reasoning frameworks with language-based priors, for learning in settings without dense supervision., Comment: CVPR 2024. The first two authors contributed equally
Published: 2024

42. PhysDreamer: Physics-Based Interaction with 3D Objects via Video Generation

Author: Zhang, Tianyuan, Yu, Hong-Xing, Wu, Rundi, Feng, Brandon Y., Zheng, Changxi, Snavely, Noah, Wu, Jiajun, and Freeman, William T.
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: Realistic object interactions are crucial for creating immersive virtual experiences, yet synthesizing realistic 3D object dynamics in response to novel interactions remains a significant challenge. Unlike unconditional or text-conditioned dynamics generation, action-conditioned dynamics requires perceiving the physical material properties of objects and grounding the 3D motion prediction on these properties, such as object stiffness. However, estimating physical material properties is an open problem due to the lack of material ground-truth data, as measuring these properties for real objects is highly difficult. We present PhysDreamer, a physics-based approach that endows static 3D objects with interactive dynamics by leveraging the object dynamics priors learned by video generation models. By distilling these priors, PhysDreamer enables the synthesis of realistic object responses to novel interactions, such as external forces or agent manipulations. We demonstrate our approach on diverse examples of elastic objects and evaluate the realism of the synthesized interactions through a user study. PhysDreamer takes a step towards more engaging and realistic virtual experiences by enabling static 3D objects to dynamically respond to interactive stimuli in a physically plausible manner. See our project page at https://physdreamer.github.io/., Comment: Project website: https://physdreamer.github.io/. Appears in ECCV 2024
Published: 2024

43. Tripod: Three Complementary Inductive Biases for Disentangled Representation Learning

Author: Hsu, Kyle, Hamid, Jubayer Ibn, Burns, Kaylee, Finn, Chelsea, and Wu, Jiajun
Subjects: Computer Science - Machine Learning, Computer Science - Computer Vision and Pattern Recognition
Abstract: Inductive biases are crucial in disentangled representation learning for narrowing down an underspecified solution set. In this work, we consider endowing a neural network autoencoder with three select inductive biases from the literature: data compression into a grid-like latent space via quantization, collective independence amongst latents, and minimal functional influence of any latent on how other latents determine data generation. In principle, these inductive biases are deeply complementary: they most directly specify properties of the latent space, encoder, and decoder, respectively. In practice, however, naively combining existing techniques instantiating these inductive biases fails to yield significant benefits. To address this, we propose adaptations to the three techniques that simplify the learning problem, equip key regularization terms with stabilizing invariances, and quash degenerate incentives. The resulting model, Tripod, achieves state-of-the-art results on a suite of four image disentanglement benchmarks. We also verify that Tripod significantly improves upon its naive incarnation and that all three of its "legs" are necessary for best performance., Comment: ICML 2024 camera-ready. 22 pages, 10 figures, code available at https://github.com/kylehkhsu/tripod
Published: 2024

44. Visually Descriptive Language Model for Vector Graphics Reasoning

Author: Wang, Zhenhailong, Hsu, Joy, Wang, Xingyao, Huang, Kuan-Hao, Li, Manling, Wu, Jiajun, and Ji, Heng
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Computer Vision and Pattern Recognition
Abstract: Despite significant advancements, large multimodal models (LMMs) still struggle to bridge the gap between low-level visual perception -- focusing on shapes, sizes, and layouts -- and high-level language reasoning, such as semantics and logic. This limitation is evident in tasks that require precise visual perception, like comparing geometric properties or solving visual reasoning problems. To study this failure mode, we focus on vector graphics -- images composed of 2D objects and shapes, prevalent in LMM-based tasks in web, design, and OS environments. We identify two key research questions: how can we enable precise visual perception, and how can we facilitate high-level reasoning based on such low-level perceptions? To capture fine visual details, we use Scalable Vector Graphics (SVG) for accurate encoding of visual scenes. However, SVGs are not readily interpretable by LMMs in a zero-shot manner. To tackle this, we propose the Visually Descriptive Language Model (VDLM), which introduces a Primal Visual Description (PVD) as an intermediate textual representation. PVD translates SVGs into a text-based abstraction consisting of primitive attributes (e.g., shape, position, measurement) and their corresponding values. PVD can be learned using task-agnostic synthesized data and represents visual primitives that are universal across vector graphics. This abstraction is more structured, allowing for direct interpretation by foundation models for zero-shot generalization. Without human-annotated data, empirical results show that VDLM significantly improves state-of-the-art LMMs like GPT-4o on various multimodal perception and reasoning tasks. Extensive analyses of VDLM show improved interpretability due to its disentangled perception and reasoning. We also demonstrate a positive correlation between PVD quality and task performance. Project page: https://mikewangwzhl.github.io/VDLM/, Comment: Project page: https://mikewangwzhl.github.io/VDLM/
Published: 2024

45. 3D Congealing: 3D-Aware Image Alignment in the Wild

Author: Zhang, Yunzhi, Li, Zizhang, Raj, Amit, Engelhardt, Andreas, Li, Yuanzhen, Hou, Tingbo, Wu, Jiajun, and Jampani, Varun
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: We propose 3D Congealing, a novel problem of 3D-aware alignment for 2D images capturing semantically similar objects. Given a collection of unlabeled Internet images, our goal is to associate the shared semantic parts from the inputs and aggregate the knowledge from 2D images to a shared 3D canonical space. We introduce a general framework that tackles the task without assuming shape templates, poses, or any camera parameters. At its core is a canonical 3D representation that encapsulates geometric and semantic information. The framework optimizes for the canonical representation together with the pose for each input image, and a per-image coordinate map that warps 2D pixel coordinates to the 3D canonical frame to account for the shape matching. The optimization procedure fuses prior knowledge from a pre-trained image generative model and semantic information from input images. The former provides strong knowledge guidance for this under-constrained task, while the latter provides the necessary information to mitigate the training data bias from the pre-trained model. Our framework can be used for various tasks such as correspondence matching, pose estimation, and image editing, achieving strong results on real-world image datasets under challenging illumination conditions and on in-the-wild online image collections., Comment: Project page: https://ai.stanford.edu/~yzzhang/projects/3d-congealing/
Published: 2024

46. DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset

Author: Khazatsky, Alexander, Pertsch, Karl, Nair, Suraj, Balakrishna, Ashwin, Dasari, Sudeep, Karamcheti, Siddharth, Nasiriany, Soroush, Srirama, Mohan Kumar, Chen, Lawrence Yunliang, Ellis, Kirsty, Fagan, Peter David, Hejna, Joey, Itkina, Masha, Lepert, Marion, Ma, Yecheng Jason, Miller, Patrick Tree, Wu, Jimmy, Belkhale, Suneel, Dass, Shivin, Ha, Huy, Jain, Arhan, Lee, Abraham, Lee, Youngwoon, Memmel, Marius, Park, Sungjae, Radosavovic, Ilija, Wang, Kaiyuan, Zhan, Albert, Black, Kevin, Chi, Cheng, Hatch, Kyle Beltran, Lin, Shan, Lu, Jingpei, Mercat, Jean, Rehman, Abdul, Sanketi, Pannag R, Sharma, Archit, Simpson, Cody, Vuong, Quan, Walke, Homer Rich, Wulfe, Blake, Xiao, Ted, Yang, Jonathan Heewon, Yavary, Arefeh, Zhao, Tony Z., Agia, Christopher, Baijal, Rohan, Castro, Mateo Guaman, Chen, Daphne, Chen, Qiuyu, Chung, Trinity, Drake, Jaimyn, Foster, Ethan Paul, Gao, Jensen, Herrera, David Antonio, Heo, Minho, Hsu, Kyle, Hu, Jiaheng, Jackson, Donovon, Le, Charlotte, Li, Yunshuang, Lin, Kevin, Lin, Roy, Ma, Zehan, Maddukuri, Abhiram, Mirchandani, Suvir, Morton, Daniel, Nguyen, Tony, O'Neill, Abigail, Scalise, Rosario, Seale, Derick, Son, Victor, Tian, Stephen, Tran, Emi, Wang, Andrew E., Wu, Yilin, Xie, Annie, Yang, Jingyun, Yin, Patrick, Zhang, Yunchu, Bastani, Osbert, Berseth, Glen, Bohg, Jeannette, Goldberg, Ken, Gupta, Abhinav, Gupta, Abhishek, Jayaraman, Dinesh, Lim, Joseph J, Malik, Jitendra, Martín-Martín, Roberto, Ramamoorthy, Subramanian, Sadigh, Dorsa, Song, Shuran, Wu, Jiajun, Yip, Michael C., Zhu, Yuke, Kollar, Thomas, Levine, Sergey, and Finn, Chelsea
Subjects: Computer Science - Robotics
Abstract: The creation of large, diverse, high-quality robot manipulation datasets is an important stepping stone on the path toward more capable and robust robotic manipulation policies. However, creating such datasets is challenging: collecting robot manipulation data in diverse environments poses logistical and safety challenges and requires substantial investments in hardware and human labour. As a result, even the most general robot manipulation policies today are mostly trained on data collected in a small number of environments with limited scene and task diversity. In this work, we introduce DROID (Distributed Robot Interaction Dataset), a diverse robot manipulation dataset with 76k demonstration trajectories or 350 hours of interaction data, collected across 564 scenes and 84 tasks by 50 data collectors in North America, Asia, and Europe over the course of 12 months. We demonstrate that training with DROID leads to policies with higher performance and improved generalization ability. We open source the full dataset, policy learning code, and a detailed guide for reproducing our robot hardware setup., Comment: Project website: https://droid-dataset.github.io/
Published: 2024

47. WGMR Self-Injection Locking Method Based on Enhanced Optical Feedback with Auxiliary Prism

Author: Wu, Jiajun, Zhong, Shan, and Kang, Songbai
Subjects: Physics - Optics
Abstract: The optical feedback intensity is an important parameter for realizing narrow linewidth lasers in Whispering-gallery-mode resonator (WGMR) self-injection locking technology. We proposed an approach that enhances the intensity of intracavity feedback in crystalline WGMR by using only a single coated auxiliary prism. Compared to the Rayleigh scattering, the feedback intensity of the enhanced scheme increased by more than a hundred times. Furthermore, we demonstrated that, with the enhanced approach, the instantaneous linewidth of the laser was suppressed to 7 Hz level, the locking range was expanded up to 8 GHz, and the relative intensity noise (RIN) was reduced to -152 dBc/Hz@10MHz. The feedback enhanced design is compact, easy to operate, and can be integrated with the WGMR. It provides a miniaturized solution for controlling optical feedback intensity in WGMR self-injection locking technology.
Published: 2024

48. Reconstruction and Simulation of Elastic Objects with Spring-Mass 3D Gaussians

Author: Zhong, Licheng, Yu, Hong-Xing, Wu, Jiajun, and Li, Yunzhu
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Reconstructing and simulating elastic objects from visual observations is crucial for applications in computer vision and robotics. Existing methods, such as 3D Gaussians, model 3D appearance and geometry, but lack the ability to estimate physical properties for objects and simulate them. The core challenge lies in integrating an expressive yet efficient physical dynamics model. We propose Spring-Gaus, a 3D physical object representation for reconstructing and simulating elastic objects from videos of the object from multiple viewpoints. In particular, we develop and integrate a 3D Spring-Mass model into 3D Gaussian kernels, enabling the reconstruction of the visual appearance, shape, and physical dynamics of the object. Our approach enables future prediction and simulation under various initial states and environmental properties. We evaluate Spring-Gaus on both synthetic and real-world datasets, demonstrating accurate reconstruction and simulation of elastic objects. Project page: https://zlicheng.com/spring_gaus/.
Published: 2024

49. BEHAVIOR-1K: A Human-Centered, Embodied AI Benchmark with 1,000 Everyday Activities and Realistic Simulation

Author: Li, Chengshu, Zhang, Ruohan, Wong, Josiah, Gokmen, Cem, Srivastava, Sanjana, Martín-Martín, Roberto, Wang, Chen, Levine, Gabrael, Ai, Wensi, Martinez, Benjamin, Yin, Hang, Lingelbach, Michael, Hwang, Minjune, Hiranaka, Ayano, Garlanka, Sujay, Aydin, Arman, Lee, Sharon, Sun, Jiankai, Anvari, Mona, Sharma, Manasi, Bansal, Dhruva, Hunter, Samuel, Kim, Kyu-Young, Lou, Alan, Matthews, Caleb R, Villa-Renteria, Ivan, Tang, Jerry Huayang, Tang, Claire, Xia, Fei, Li, Yunzhu, Savarese, Silvio, Gweon, Hyowon, Liu, C. Karen, Wu, Jiajun, and Fei-Fei, Li
Subjects: Computer Science - Robotics, Computer Science - Artificial Intelligence
Abstract: We present BEHAVIOR-1K, a comprehensive simulation benchmark for human-centered robotics. BEHAVIOR-1K includes two components, guided and motivated by the results of an extensive survey on "what do you want robots to do for you?". The first is the definition of 1,000 everyday activities, grounded in 50 scenes (houses, gardens, restaurants, offices, etc.) with more than 9,000 objects annotated with rich physical and semantic properties. The second is OMNIGIBSON, a novel simulation environment that supports these activities via realistic physics simulation and rendering of rigid bodies, deformable bodies, and liquids. Our experiments indicate that the activities in BEHAVIOR-1K are long-horizon and dependent on complex manipulation skills, both of which remain a challenge for even state-of-the-art robot learning solutions. To calibrate the simulation-to-reality gap of BEHAVIOR-1K, we provide an initial study on transferring solutions learned with a mobile manipulator in a simulated apartment to its real-world counterpart. We hope that BEHAVIOR-1K's human-grounded nature, diversity, and realism make it valuable for embodied AI and robot learning research. Project website: https://behavior.stanford.edu., Comment: A preliminary version was published at 6th Conference on Robot Learning (CoRL 2022)
Published: 2024

50. Unsupervised Discovery of Object-Centric Neural Fields

Author: Luo, Rundong, Yu, Hong-Xing, and Wu, Jiajun
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: We study inferring 3D object-centric scene representations from a single image. While recent methods have shown potential in unsupervised 3D object discovery from simple synthetic images, they fail to generalize to real-world scenes with visually rich and diverse objects. This limitation stems from their object representations, which entangle objects' intrinsic attributes like shape and appearance with extrinsic, viewer-centric properties such as their 3D location. To address this bottleneck, we propose Unsupervised discovery of Object-Centric neural Fields (uOCF). uOCF focuses on learning the intrinsics of objects and models the extrinsics separately. Our approach significantly improves systematic generalization, thus enabling unsupervised learning of high-fidelity object-centric scene representations from sparse real-world images. To evaluate our approach, we collect three new datasets, including two real kitchen environments. Extensive experiments show that uOCF enables unsupervised discovery of visually rich objects from a single real image, allowing applications such as 3D object segmentation and scene manipulation. Notably, uOCF demonstrates zero-shot generalization to unseen objects from a single real image. Project page: https://red-fairy.github.io/uOCF/
Published: 2024
