Author: "Fu, Yun" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Fu, Yun"' showing total 5,127 results

Start Over Author "Fu, Yun"

5,127 results on '"Fu, Yun"'

1. Accessing Vision Foundation Models at ImageNet-level Costs

Author: Zhang, Yitian, Ma, Xu, Bai, Yue, Wang, Huan, and Fu, Yun
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: Vision foundation models are renowned for their generalization ability due to massive training data. Nevertheless, they demand tremendous training resources, and the training data is often inaccessible, e.g., CLIP, DINOv2, posing great challenges to developing derivatives that could advance research in this field. In this work, we offer a very simple and general solution, named Proteus, to distill foundation models into smaller equivalents on ImageNet-1K without access to the original training data. Specifically, we remove the designs from conventional knowledge distillation settings that result in dataset bias and present three levels of training objectives, i.e., token, patch, and feature, to maximize the efficacy of knowledge transfer. In this manner, Proteus is trained at ImageNet-level costs with surprising ability, facilitating the accessibility of training foundation models for the broader research community. Leveraging DINOv2-g/14 as the teacher, Proteus-L/14 matches the performance of the Oracle method DINOv2-L/14 (142M training data) across 15 benchmarks and outperforms other vision foundation models including CLIP-L/14 (400M), OpenCLIP-L/14 (400M/2B) and SynCLR-L/14 (600M).
Published: 2024

2. SoupLM: Model Integration in Large Language and Multi-Modal Models

Author: Bai, Yue, Zhang, Zichen, Lu, Jiasen, and Fu, Yun
Subjects: Computer Science - Artificial Intelligence
Abstract: Training large language models (LLMs) and multimodal LLMs necessitates significant computing resources, and existing publicly available LLMs are typically pre-trained on diverse, privately curated datasets spanning various tasks. For instance, LLaMA, Vicuna, and LLaVA are three LLM variants trained with LLaMA base models using very different training recipes, tasks, and data modalities. The training cost and complexity for such LLM variants grow rapidly. In this study, we propose to use a soup strategy to assemble these LLM variants into a single well-generalized multimodal LLM (SoupLM) in a cost-efficient manner. Assembling these LLM variants efficiently brings knowledge and specialities trained from different domains and data modalities into an integrated one (e.g., chatbot speciality from user-shared conversations for Vicuna, and visual capacity from vision-language data for LLaVA), therefore, to avoid computing costs of repetitive training on several different domains. We propose series of soup strategies to systematically benchmark performance gains across various configurations, and probe the soup behavior across base models in the interpolation space.
Published: 2024

3. Through the Theory of Mind's Eye: Reading Minds with Multimodal Video Large Language Models

Author: Chen, Zhawnen, Wang, Tianchun, Wang, Yizhou, Kosinski, Michal, Zhang, Xiang, Fu, Yun, and Li, Sheng
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: Can large multimodal models have a human-like ability for emotional and social reasoning, and if so, how does it work? Recent research has discovered emergent theory-of-mind (ToM) reasoning capabilities in large language models (LLMs). LLMs can reason about people's mental states by solving various text-based ToM tasks that ask questions about the actors' ToM (e.g., human belief, desire, intention). However, human reasoning in the wild is often grounded in dynamic scenes across time. Thus, we consider videos a new medium for examining spatio-temporal ToM reasoning ability. Specifically, we ask explicit probing questions about videos with abundant social and emotional reasoning content. We develop a pipeline for multimodal LLM for ToM reasoning using video and text. We also enable explicit ToM reasoning by retrieving key frames for answering a ToM question, which reveals how multimodal LLMs reason about ToM.
Published: 2024

4. Lumina-Next: Making Lumina-T2X Stronger and Faster with Next-DiT

Author: Zhuo, Le, Du, Ruoyi, Xiao, Han, Li, Yangguang, Liu, Dongyang, Huang, Rongjie, Liu, Wenze, Zhao, Lirui, Wang, Fu-Yun, Ma, Zhanyu, Luo, Xu, Wang, Zehan, Zhang, Kaipeng, Zhu, Xiangyang, Liu, Si, Yue, Xiangyu, Liu, Dingning, Ouyang, Wanli, Liu, Ziwei, Qiao, Yu, Li, Hongsheng, and Gao, Peng
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: Lumina-T2X is a nascent family of Flow-based Large Diffusion Transformers that establishes a unified framework for transforming noise into various modalities, such as images and videos, conditioned on text instructions. Despite its promising capabilities, Lumina-T2X still encounters challenges including training instability, slow inference, and extrapolation artifacts. In this paper, we present Lumina-Next, an improved version of Lumina-T2X, showcasing stronger generation performance with increased training and inference efficiency. We begin with a comprehensive analysis of the Flag-DiT architecture and identify several suboptimal components, which we address by introducing the Next-DiT architecture with 3D RoPE and sandwich normalizations. To enable better resolution extrapolation, we thoroughly compare different context extrapolation methods applied to text-to-image generation with 3D RoPE, and propose Frequency- and Time-Aware Scaled RoPE tailored for diffusion transformers. Additionally, we introduced a sigmoid time discretization schedule to reduce sampling steps in solving the Flow ODE and the Context Drop method to merge redundant visual tokens for faster network evaluation, effectively boosting the overall sampling speed. Thanks to these improvements, Lumina-Next not only improves the quality and efficiency of basic text-to-image generation but also demonstrates superior resolution extrapolation capabilities and multilingual generation using decoder-based LLMs as the text encoder, all in a zero-shot manner. To further validate Lumina-Next as a versatile generative framework, we instantiate it on diverse tasks including visual recognition, multi-view, audio, music, and point cloud generation, showcasing strong performance across these domains. By releasing all codes and model weights, we aim to advance the development of next-generation generative AI capable of universal modeling., Comment: Code at: https://github.com/Alpha-VLLM/Lumina-T2X
Published: 2024

5. Phased Consistency Model

Author: Wang, Fu-Yun, Huang, Zhaoyang, Bergman, Alexander William, Shen, Dazhong, Gao, Peng, Lingelbach, Michael, Sun, Keqiang, Bian, Weikang, Song, Guanglu, Liu, Yu, Li, Hongsheng, and Wang, Xiaogang
Subjects: Computer Science - Machine Learning, Computer Science - Computer Vision and Pattern Recognition
Abstract: The consistency model (CM) has recently made significant progress in accelerating the generation of diffusion models. However, its application to high-resolution, text-conditioned image generation in the latent space (a.k.a., LCM) remains unsatisfactory. In this paper, we identify three key flaws in the current design of LCM. We investigate the reasons behind these limitations and propose the Phased Consistency Model (PCM), which generalizes the design space and addresses all identified limitations. Our evaluations demonstrate that PCM significantly outperforms LCM across 1--16 step generation settings. While PCM is specifically designed for multi-step refinement, it achieves even superior or comparable 1-step generation results to previously state-of-the-art specifically designed 1-step methods. Furthermore, we show that PCM's methodology is versatile and applicable to video generation, enabling us to train the state-of-the-art few-step text-to-video generator. More details are available at https://g-u-n.github.io/projects/pcm/.
Published: 2024

6. Deciphering Movement: Unified Trajectory Generation Model for Multi-Agent

Author: Xu, Yi and Fu, Yun
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Understanding multi-agent behavior is critical across various fields. The conventional approach involves analyzing agent movements through three primary tasks: trajectory prediction, imputation, and spatial-temporal recovery. Considering the unique input formulation and constraint of these tasks, most existing methods are tailored to address only one specific task. However, in real-world applications, these scenarios frequently occur simultaneously. Consequently, methods designed for one task often fail to adapt to others, resulting in performance drops. To overcome this limitation, we propose a Unified Trajectory Generation model, UniTraj, that processes arbitrary trajectories as masked inputs, adaptable to diverse scenarios. Specifically, we introduce a Ghost Spatial Masking (GSM) module embedded within a Transformer encoder for spatial feature extraction. We further extend recent successful State Space Models (SSMs), particularly the Mamba model, into a Bidirectional Temporal Mamba to effectively capture temporal dependencies. Additionally, we incorporate a Bidirectional Temporal Scaled (BTS) module to comprehensively scan trajectories while maintaining the temporal missing relationships within the sequence. We curate and benchmark three practical sports game datasets, Basketball-U, Football-U, and Soccer-U, for evaluation. Extensive experiments demonstrate the superior performance of our model. To the best of our knowledge, this is the first work that addresses this unified problem through a versatile generative framework, thereby enhancing our understanding of multi-agent movement. Our datasets, code, and model weights are available at https://github.com/colorfulfuture/UniTraj-pytorch., Comment: Datasets, code, and model weights at available at: https://github.com/colorfulfuture/UniTraj-pytorch
Published: 2024

7. Consistency and Uncertainty: Identifying Unreliable Responses From Black-Box Vision-Language Models for Selective Visual Question Answering

Author: Khan, Zaid and Fu, Yun
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: The goal of selective prediction is to allow an a model to abstain when it may not be able to deliver a reliable prediction, which is important in safety-critical contexts. Existing approaches to selective prediction typically require access to the internals of a model, require retraining a model or study only unimodal models. However, the most powerful models (e.g. GPT-4) are typically only available as black boxes with inaccessible internals, are not retrainable by end-users, and are frequently used for multimodal tasks. We study the possibility of selective prediction for vision-language models in a realistic, black-box setting. We propose using the principle of \textit{neighborhood consistency} to identify unreliable responses from a black-box vision-language model in question answering tasks. We hypothesize that given only a visual question and model response, the consistency of the model's responses over the neighborhood of a visual question will indicate reliability. It is impossible to directly sample neighbors in feature space in a black-box setting. Instead, we show that it is possible to use a smaller proxy model to approximately sample from the neighborhood. We find that neighborhood consistency can be used to identify model responses to visual questions that are likely unreliable, even in adversarial settings or settings that are out-of-distribution to the proxy model., Comment: CVPR 2024
Published: 2024

8. Rethinking the Spatial Inconsistency in Classifier-Free Diffusion Guidance

Author: Shen, Dazhong, Song, Guanglu, Xue, Zeyue, Wang, Fu-Yun, and Liu, Yu
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: Classifier-Free Guidance (CFG) has been widely used in text-to-image diffusion models, where the CFG scale is introduced to control the strength of text guidance on the whole image space. However, we argue that a global CFG scale results in spatial inconsistency on varying semantic strengths and suboptimal image quality. To address this problem, we present a novel approach, Semantic-aware Classifier-Free Guidance (S-CFG), to customize the guidance degrees for different semantic units in text-to-image diffusion models. Specifically, we first design a training-free semantic segmentation method to partition the latent image into relatively independent semantic regions at each denoising step. In particular, the cross-attention map in the denoising U-net backbone is renormalized for assigning each patch to the corresponding token, while the self-attention map is used to complete the semantic regions. Then, to balance the amplification of diverse semantic units, we adaptively adjust the CFG scales across different semantic regions to rescale the text guidance degrees into a uniform level. Finally, extensive experiments demonstrate the superiority of S-CFG over the original CFG strategy on various text-to-image diffusion models, without requiring any extra training cost. our codes are available at https://github.com/SmilesDZgk/S-CFG., Comment: accepted by CVPR-2024
Published: 2024

9. Self-Training Large Language Models for Improved Visual Program Synthesis With Visual Reinforcement

Author: Khan, Zaid, BG, Vijay Kumar, Schulter, Samuel, Fu, Yun, and Chandraker, Manmohan
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Visual program synthesis is a promising approach to exploit the reasoning abilities of large language models for compositional computer vision tasks. Previous work has used few-shot prompting with frozen LLMs to synthesize visual programs. Training an LLM to write better visual programs is an attractive prospect, but it is unclear how to accomplish this. No dataset of visual programs for training exists, and acquisition of a visual program dataset cannot be easily crowdsourced due to the need for expert annotators. To get around the lack of direct supervision, we explore improving the program synthesis abilities of an LLM using feedback from interactive experience. We propose a method where we exploit existing annotations for a vision-language task to improvise a coarse reward signal for that task, treat the LLM as a policy, and apply reinforced self-training to improve the visual program synthesis ability of the LLM for that task. We describe a series of experiments on object detection, compositional visual question answering, and image-text retrieval, and show that in each case, the self-trained LLM outperforms or performs on par with few-shot frozen LLMs that are an order of magnitude larger. Website: https://zaidkhan.me/ViReP, Comment: CVPR 2024
Published: 2024

10. OOSTraj: Out-of-Sight Trajectory Prediction With Vision-Positioning Denoising

Author: Zhang, Haichao, Xu, Yi, Lu, Hongsheng, Shimizu, Takayuki, and Fu, Yun
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Machine Learning, Computer Science - Robotics
Abstract: Trajectory prediction is fundamental in computer vision and autonomous driving, particularly for understanding pedestrian behavior and enabling proactive decision-making. Existing approaches in this field often assume precise and complete observational data, neglecting the challenges associated with out-of-view objects and the noise inherent in sensor data due to limited camera range, physical obstructions, and the absence of ground truth for denoised sensor data. Such oversights are critical safety concerns, as they can result in missing essential, non-visible objects. To bridge this gap, we present a novel method for out-of-sight trajectory prediction that leverages a vision-positioning technique. Our approach denoises noisy sensor observations in an unsupervised manner and precisely maps sensor-based trajectories of out-of-sight objects into visual trajectories. This method has demonstrated state-of-the-art performance in out-of-sight noisy sensor trajectory denoising and prediction on the Vi-Fi and JRDB datasets. By enhancing trajectory prediction accuracy and addressing the challenges of out-of-sight objects, our work significantly contributes to improving the safety and reliability of autonomous driving in complex environments. Our work represents the first initiative towards Out-Of-Sight Trajectory prediction (OOSTraj), setting a new benchmark for future research. The code is available at \url{https://github.com/Hai-chao-Zhang/OOSTraj}., Comment: In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition 2024 (CVPR)
Published: 2024

11. Adapting to Length Shift: FlexiLength Network for Trajectory Prediction

Author: Xu, Yi and Fu, Yun
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Trajectory prediction plays an important role in various applications, including autonomous driving, robotics, and scene understanding. Existing approaches mainly focus on developing compact neural networks to increase prediction precision on public datasets, typically employing a standardized input duration. However, a notable issue arises when these models are evaluated with varying observation lengths, leading to a significant performance drop, a phenomenon we term the Observation Length Shift. To address this issue, we introduce a general and effective framework, the FlexiLength Network (FLN), to enhance the robustness of existing trajectory prediction techniques against varying observation periods. Specifically, FLN integrates trajectory data with diverse observation lengths, incorporates FlexiLength Calibration (FLC) to acquire temporal invariant representations, and employs FlexiLength Adaptation (FLA) to further refine these representations for more accurate future trajectory predictions. Comprehensive experiments on multiple datasets, ie, ETH/UCY, nuScenes, and Argoverse 1, demonstrate the effectiveness and flexibility of our proposed FLN framework., Comment: Accepted by CVPR 2024
Published: 2024

12. Rewrite the Stars

Author: Ma, Xu, Dai, Xiyang, Bai, Yue, Wang, Yizhou, and Fu, Yun
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Recent studies have drawn attention to the untapped potential of the "star operation" (element-wise multiplication) in network design. While intuitive explanations abound, the foundational rationale behind its application remains largely unexplored. Our study attempts to reveal the star operation's ability to map inputs into high-dimensional, non-linear feature spaces -- akin to kernel tricks -- without widening the network. We further introduce StarNet, a simple yet powerful prototype, demonstrating impressive performance and low latency under compact network structure and efficient budget. Like stars in the sky, the star operation appears unremarkable but holds a vast universe of potential. Our work encourages further exploration across tasks, with codes available at https://github.com/ma-xu/Rewrite-the-Stars., Comment: Accepted by CVPR 2024. Codes are made publically available at https://github.com/ma-xu/Rewrite-the-Stars
Published: 2024

13. Efficient Modulation for Vision Networks

Author: Ma, Xu, Dai, Xiyang, Yang, Jianwei, Xiao, Bin, Chen, Yinpeng, Fu, Yun, and Yuan, Lu
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: In this work, we present efficient modulation, a novel design for efficient vision networks. We revisit the modulation mechanism, which operates input through convolutional context modeling and feature projection layers, and fuses features via element-wise multiplication and an MLP block. We demonstrate that the modulation mechanism is particularly well suited for efficient networks and further tailor the modulation design by proposing the efficient modulation (EfficientMod) block, which is considered the essential building block for our networks. Benefiting from the prominent representational ability of modulation mechanism and the proposed efficient design, our network can accomplish better trade-offs between accuracy and efficiency and set new state-of-the-art performance in the zoo of efficient networks. When integrating EfficientMod with the vanilla self-attention block, we obtain the hybrid architecture which further improves the performance without loss of efficiency. We carry out comprehensive experiments to verify EfficientMod's performance. With fewer parameters, our EfficientMod-s performs 0.6 top-1 accuracy better than EfficientFormerV2-s2 and is 25% faster on GPU, and 2.9 better than MobileViTv2-1.0 at the same GPU latency. Additionally, our method presents a notable improvement in downstream tasks, outperforming EfficientFormerV2-s by 3.6 mIoU on the ADE20K benchmark. Code and checkpoints are available at https://github.com/ma-xu/EfficientMod., Comment: Accepted by ICLR 2024. Codes are made publically available at https://github.com/ma-xu/EfficientMod
Published: 2024

14. Be-Your-Outpainter: Mastering Video Outpainting through Input-Specific Adaptation

Author: Wang, Fu-Yun, Wu, Xiaoshi, Huang, Zhaoyang, Shi, Xiaoyu, Shen, Dazhong, Song, Guanglu, Liu, Yu, and Li, Hongsheng
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Video outpainting is a challenging task, aiming at generating video content outside the viewport of the input video while maintaining inter-frame and intra-frame consistency. Existing methods fall short in either generation quality or flexibility. We introduce MOTIA Mastering Video Outpainting Through Input-Specific Adaptation, a diffusion-based pipeline that leverages both the intrinsic data-specific patterns of the source video and the image/video generative prior for effective outpainting. MOTIA comprises two main phases: input-specific adaptation and pattern-aware outpainting. The input-specific adaptation phase involves conducting efficient and effective pseudo outpainting learning on the single-shot source video. This process encourages the model to identify and learn patterns within the source video, as well as bridging the gap between standard generative processes and outpainting. The subsequent phase, pattern-aware outpainting, is dedicated to the generalization of these learned patterns to generate outpainting outcomes. Additional strategies including spatial-aware insertion and noise travel are proposed to better leverage the diffusion model's generative prior and the acquired video patterns from source videos. Extensive evaluations underscore MOTIA's superiority, outperforming existing state-of-the-art methods in widely recognized benchmarks. Notably, these advancements are achieved without necessitating extensive, task-specific tuning., Comment: Code will be available at https://github.com/G-U-N/Be-Your-Outpainter
Published: 2024

15. Don't Judge by the Look: Towards Motion Coherent Video Representation

Author: Zhang, Yitian, Bai, Yue, Wang, Huan, Wang, Yizhou, and Fu, Yun
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: Current training pipelines in object recognition neglect Hue Jittering when doing data augmentation as it not only brings appearance changes that are detrimental to classification, but also the implementation is inefficient in practice. In this study, we investigate the effect of hue variance in the context of video understanding and find this variance to be beneficial since static appearances are less important in videos that contain motion information. Based on this observation, we propose a data augmentation method for video understanding, named Motion Coherent Augmentation (MCA), that introduces appearance variation in videos and implicitly encourages the model to prioritize motion patterns, rather than static appearances. Concretely, we propose an operation SwapMix to efficiently modify the appearance of video samples, and introduce Variation Alignment (VA) to resolve the distribution shift caused by SwapMix, enforcing the model to learn appearance invariant representations. Comprehensive empirical evaluation across various architectures and different datasets solidly validates the effectiveness and generalization ability of MCA, and the application of VA in other augmentation methods. Code is available at https://github.com/BeSpontaneous/MCA-pytorch., Comment: Accepted by ICLR2024
Published: 2024

16. AnimateLCM: Accelerating the Animation of Personalized Diffusion Models and Adapters with Decoupled Consistency Learning

Author: Wang, Fu-Yun, Huang, Zhaoyang, Shi, Xiaoyu, Bian, Weikang, Song, Guanglu, Liu, Yu, and Li, Hongsheng
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: Video diffusion models has been gaining increasing attention for its ability to produce videos that are both coherent and of high fidelity. However, the iterative denoising process makes it computationally intensive and time-consuming, thus limiting its applications. Inspired by the Consistency Model (CM) that distills pretrained image diffusion models to accelerate the sampling with minimal steps and its successful extension Latent Consistency Model (LCM) on conditional image generation, we propose AnimateLCM, allowing for high-fidelity video generation within minimal steps. Instead of directly conducting consistency learning on the raw video dataset, we propose a decoupled consistency learning strategy that decouples the distillation of image generation priors and motion generation priors, which improves the training efficiency and enhance the generation visual quality. Additionally, to enable the combination of plug-and-play adapters in stable diffusion community to achieve various functions (e.g., ControlNet for controllable generation). we propose an efficient strategy to adapt existing adapters to our distilled text-conditioned video consistency model or train adapters from scratch without harming the sampling speed. We validate the proposed strategy in image-conditioned video generation and layout-conditioned video generation, all achieving top-performing results. Experimental results validate the effectiveness of our proposed method. Code and weights will be made public. More details are available at https://github.com/G-U-N/AnimateLCM., Comment: Project Page: https://animatelcm.github.io/
Published: 2024

17. Motion-I2V: Consistent and Controllable Image-to-Video Generation with Explicit Motion Modeling

Author: Shi, Xiaoyu, Huang, Zhaoyang, Wang, Fu-Yun, Bian, Weikang, Li, Dasong, Zhang, Yi, Zhang, Manyuan, Cheung, Ka Chun, See, Simon, Qin, Hongwei, Dai, Jifeng, and Li, Hongsheng
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: We introduce Motion-I2V, a novel framework for consistent and controllable image-to-video generation (I2V). In contrast to previous methods that directly learn the complicated image-to-video mapping, Motion-I2V factorizes I2V into two stages with explicit motion modeling. For the first stage, we propose a diffusion-based motion field predictor, which focuses on deducing the trajectories of the reference image's pixels. For the second stage, we propose motion-augmented temporal attention to enhance the limited 1-D temporal attention in video latent diffusion models. This module can effectively propagate reference image's feature to synthesized frames with the guidance of predicted trajectories from the first stage. Compared with existing methods, Motion-I2V can generate more consistent videos even at the presence of large motion and viewpoint variation. By training a sparse trajectory ControlNet for the first stage, Motion-I2V can support users to precisely control motion trajectories and motion regions with sparse trajectory and region annotations. This offers more controllability of the I2V process than solely relying on textual instructions. Additionally, Motion-I2V's second stage naturally supports zero-shot video-to-video translation. Both qualitative and quantitative comparisons demonstrate the advantages of Motion-I2V over prior approaches in consistent and controllable image-to-video generation. Please see our project page at https://xiaoyushi97.github.io/Motion-I2V/., Comment: Project page: https://xiaoyushi97.github.io/Motion-I2V/
Published: 2024

18. VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding

Author: Wang, Yizhou, Zhang, Ruiyi, Wang, Haoliang, Bhattacharya, Uttaran, Fu, Yun, and Wu, Gang
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: Recent advancements in language-model-based video understanding have been progressing at a remarkable pace, spurred by the introduction of Large Language Models (LLMs). However, the focus of prior research has been predominantly on devising a projection layer that maps video features to tokens, an approach that is both rudimentary and inefficient. In our study, we introduce a cutting-edge framework, VaQuitA, designed to refine the synergy between video and textual information. At the data level, instead of sampling frames uniformly, we implement a sampling method guided by CLIP-score rankings, which enables a more aligned selection of frames with the given question. At the feature level, we integrate a trainable Video Perceiver alongside a Visual-Query Transformer (abbreviated as VQ-Former), which bolsters the interplay between the input question and the video features. We also discover that incorporating a simple prompt, "Please be critical", into the LLM input can substantially enhance its video comprehension capabilities. Our experimental results indicate that VaQuitA consistently sets a new benchmark for zero-shot video question-answering tasks and is adept at producing high-quality, multi-turn video dialogues with users.
Published: 2023

19. HyperSTAR: Task-Aware Hyperparameter Recommendation for Training and Compression

Author: Liu, Chang, Mittal, Gaurav, Karianakis, Nikolaos, Fragoso, Victor, Yu, Ye, Fu, Yun, and Chen, Mei
Published: 2024
Full Text: View/download PDF

20. Exploring Question Decomposition for Zero-Shot VQA

Author: Khan, Zaid, BG, Vijay Kumar, Schulter, Samuel, Chandraker, Manmohan, and Fu, Yun
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Visual question answering (VQA) has traditionally been treated as a single-step task where each question receives the same amount of effort, unlike natural human question-answering strategies. We explore a question decomposition strategy for VQA to overcome this limitation. We probe the ability of recently developed large vision-language models to use human-written decompositions and produce their own decompositions of visual questions, finding they are capable of learning both tasks from demonstrations alone. However, we show that naive application of model-written decompositions can hurt performance. We introduce a model-driven selective decomposition approach for second-guessing predictions and correcting errors, and validate its effectiveness on eight VQA tasks across three domains, showing consistent improvements in accuracy, including improvements of >20% on medical VQA datasets and boosting the zero-shot performance of BLIP-2 above chance on a VQA reformulation of the challenging Winoground task. Project Site: https://zaidkhan.me/decomposition-0shot-vqa/, Comment: NeurIPS 2023 Camera Ready
Published: 2023

21. Layout Sequence Prediction From Noisy Mobile Modality

Author: Zhang, Haichao, Xu, Yi, Lu, Hongsheng, Shimizu, Takayuki, and Fu, Yun
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Machine Learning, Computer Science - Multimedia, Computer Science - Robotics
Abstract: Trajectory prediction plays a vital role in understanding pedestrian movement for applications such as autonomous driving and robotics. Current trajectory prediction models depend on long, complete, and accurately observed sequences from visual modalities. Nevertheless, real-world situations often involve obstructed cameras, missed objects, or objects out of sight due to environmental factors, leading to incomplete or noisy trajectories. To overcome these limitations, we propose LTrajDiff, a novel approach that treats objects obstructed or out of sight as equally important as those with fully visible trajectories. LTrajDiff utilizes sensor data from mobile phones to surmount out-of-sight constraints, albeit introducing new challenges such as modality fusion, noisy data, and the absence of spatial layout and object size information. We employ a denoising diffusion model to predict precise layout sequences from noisy mobile data using a coarse-to-fine diffusion strategy, incorporating the RMS, Siamese Masked Encoding Module, and MFM. Our model predicts layout sequences by implicitly inferring object size and projection status from a single reference timestamp or significantly obstructed sequences. Achieving SOTA results in randomly obstructed experiments and extremely short input experiments, our model illustrates the effectiveness of leveraging noisy mobile data. In summary, our approach offers a promising solution to the challenges faced by layout sequence and trajectory prediction models in real-world settings, paving the way for utilizing sensor data from mobile phones to accurately predict pedestrian bounding box trajectories. To the best of our knowledge, this is the first work that addresses severely obstructed and extremely short layout sequences by combining vision with noisy mobile modality, making it the pioneering work in the field of layout sequence trajectory prediction., Comment: In Proceedings of the 31st ACM International Conference on Multimedia 2023 (MM 23)
Published: 2023
Full Text: View/download PDF

22. Latent Graph Inference with Limited Supervision

Author: Lu, Jianglin, Xu, Yi, Wang, Huan, Bai, Yue, and Fu, Yun
Subjects: Computer Science - Machine Learning
Abstract: Latent graph inference (LGI) aims to jointly learn the underlying graph structure and node representations from data features. However, existing LGI methods commonly suffer from the issue of supervision starvation, where massive edge weights are learned without semantic supervision and do not contribute to the training loss. Consequently, these supervision-starved weights, which may determine the predictions of testing samples, cannot be semantically optimal, resulting in poor generalization. In this paper, we observe that this issue is actually caused by the graph sparsification operation, which severely destroys the important connections established between pivotal nodes and labeled ones. To address this, we propose to restore the corrupted affinities and replenish the missed supervision for better LGI. The key challenge then lies in identifying the critical nodes and recovering the corrupted affinities. We begin by defining the pivotal nodes as $k$-hop starved nodes, which can be identified based on a given adjacency matrix. Considering the high computational burden, we further present a more efficient alternative inspired by CUR matrix decomposition. Subsequently, we eliminate the starved nodes by reconstructing the destroyed connections. Extensive experiments on representative benchmarks demonstrate that reducing the starved nodes consistently improves the performance of state-of-the-art LGI methods, especially under extremely limited supervision (6.12% improvement on Pubmed with a labeling rate of only 0.3%).
Published: 2023

23. Camouflaged Image Synthesis Is All You Need to Boost Camouflaged Detection

Author: Zhang, Haichao, Qin, Can, Yin, Yu, and Fu, Yun
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: Camouflaged objects that blend into natural scenes pose significant challenges for deep-learning models to detect and synthesize. While camouflaged object detection is a crucial task in computer vision with diverse real-world applications, this research topic has been constrained by limited data availability. We propose a framework for synthesizing camouflage data to enhance the detection of camouflaged objects in natural scenes. Our approach employs a generative model to produce realistic camouflage images, which can be used to train existing object detection models. Specifically, we use a camouflage environment generator supervised by a camouflage distribution classifier to synthesize the camouflage images, which are then fed into our generator to expand the dataset. Our framework outperforms the current state-of-the-art method on three datasets (COD10k, CAMO, and CHAMELEON), demonstrating its effectiveness in improving camouflaged object detection. This approach can serve as a plug-and-play data generation and augmentation module for existing camouflaged object detection tasks and provides a novel way to introduce more diversity and distributions into current camouflage datasets.
Published: 2023

24. BEV-DG: Cross-Modal Learning under Bird's-Eye View for Domain Generalization of 3D Semantic Segmentation

Author: Li, Miaoyu, Zhang, Yachao, MA, Xu, Qu, Yanyun, and Fu, Yun
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Cross-modal Unsupervised Domain Adaptation (UDA) aims to exploit the complementarity of 2D-3D data to overcome the lack of annotation in a new domain. However, UDA methods rely on access to the target domain during training, meaning the trained model only works in a specific target domain. In light of this, we propose cross-modal learning under bird's-eye view for Domain Generalization (DG) of 3D semantic segmentation, called BEV-DG. DG is more challenging because the model cannot access the target domain during training, meaning it needs to rely on cross-modal learning to alleviate the domain gap. Since 3D semantic segmentation requires the classification of each point, existing cross-modal learning is directly conducted point-to-point, which is sensitive to the misalignment in projections between pixels and points. To this end, our approach aims to optimize domain-irrelevant representation modeling with the aid of cross-modal learning under bird's-eye view. We propose BEV-based Area-to-area Fusion (BAF) to conduct cross-modal learning under bird's-eye view, which has a higher fault tolerance for point-level misalignment. Furthermore, to model domain-irrelevant representations, we propose BEV-driven Domain Contrastive Learning (BDCL) with the help of cross-modal learning under bird's-eye view. We design three domain generalization settings based on three 3D datasets, and BEV-DG significantly outperforms state-of-the-art competitors with tremendous margins in all settings., Comment: Accepted by ICCV 2023
Published: 2023

25. Citing as an Online Learning Support Tool for Student-Generated Assessment

Author: Fu-Yun Yu and Ju-Ko Wei
Abstract: While enabling students to refer to and build on peer-produced work (termed 'citing' herein) during content creation appears pedagogically promising, its associated learning effects remain under-studied. This research aimed at examining the effects of online citing of peer-generated assessment items during student test-construction on promoting learning and task performance. Additionally, any negative effects this approach may have on inducing cognitive and emotional burdens were examined. A pretest-posttest quasi-experimental research design was adopted, where two treatment groups were devised (i.e. the citing and no-citing groups). A group of fifth-graders from six classes participated in an 11-week study. Based on the results of the analysis of covariance, it was found that students in the citing group scored significantly higher than those in the no-citing group on academic achievement and question-generation performance. In addition, citing did not induce significantly higher cognitive load or learning anxiety as compared to the no-citing situation.
Published: 2024
Full Text: View/download PDF

26. A Systematic Review of Published Student Question-Generation Systems: Supporting Functionalities and Design Features

Author: Fu-Yun Yu and Chih-Wei Kuo
Abstract: There has been an increasing interest in the development of student question-generation (SQG) systems since 2000. To offer a holistic and detailed view of extant SQG systems, a two-dimensional classification scheme was derived to identify commonly embedded ancillary functionalities and design features in the 54 SQG learning systems located through a systematic review. The results revealed that six ancillary functionalities most frequently included around question-generation are management, question-assessment, question-answering, question-viewing, communication, and gamification in that order. Additionally, SQG systems differ greatly in terms of design features, with the majority falling under the domain-specific type, supporting only one type of question-generation, gearing toward the free SQG type, not providing scaffolding of any type, and not allowing the inclusion of multimedia files for SQG.
Published: 2024
Full Text: View/download PDF

27. Does Consideration and Initiating Structure Leadership Help or Hinder Employees’ Creativity? The Role of Psychological Ownership and Time Availability

Author: Li, Zhengwei, Li, Wenxin, Fu, Yun, Xu, Zhoutong, and Wang, Feirong
Published: 2024
Full Text: View/download PDF

28. Construction and validation of a risk-prediction model for chemotherapy-related cognitive impairment in patients with breast cancer

Author: Liu, Yu, Li, Ruo-Lin, Chen, Lu, Zhao, Fu-Yun, Su, Ya-Li, Jin, Shuai, and Liu, Jun-E
Published: 2024
Full Text: View/download PDF

29. CFD simulation of pumping ventilation in a three-story isolated building with internal partitioning: Effects of partition widths, heights and locations

Author: Zhong, Huai-Yu, Sun, Jie, Lin, Chao, Wu, Song-Heng, Shang, Jin, Kikumoto, Hideki, Qian, Fu-Ping, Jimenez-Bescos, Carlos, and Zhao, Fu-Yun
Published: 2024
Full Text: View/download PDF

30. Q: How to Specialize Large Vision-Language Models to Data-Scarce VQA Tasks? A: Self-Train on Unlabeled Images!

Author: Khan, Zaid, BG, Vijay Kumar, Schulter, Samuel, Yu, Xiang, Fu, Yun, and Chandraker, Manmohan
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Finetuning a large vision language model (VLM) on a target dataset after large scale pretraining is a dominant paradigm in visual question answering (VQA). Datasets for specialized tasks such as knowledge-based VQA or VQA in non natural-image domains are orders of magnitude smaller than those for general-purpose VQA. While collecting additional labels for specialized tasks or domains can be challenging, unlabeled images are often available. We introduce SelTDA (Self-Taught Data Augmentation), a strategy for finetuning large VLMs on small-scale VQA datasets. SelTDA uses the VLM and target dataset to build a teacher model that can generate question-answer pseudolabels directly conditioned on an image alone, allowing us to pseudolabel unlabeled images. SelTDA then finetunes the initial VLM on the original dataset augmented with freshly pseudolabeled images. We describe a series of experiments showing that our self-taught data augmentation increases robustness to adversarially searched questions, counterfactual examples and rephrasings, improves domain generalization, and results in greater retention of numerical reasoning skills. The proposed strategy requires no additional annotations or architectural modifications, and is compatible with any modern encoder-decoder multimodal transformer. Code available at https://github.com/codezakh/SelTDA., Comment: CVPR 2023
Published: 2023

31. SnapFusion: Text-to-Image Diffusion Model on Mobile Devices within Two Seconds

Author: Li, Yanyu, Wang, Huan, Jin, Qing, Hu, Ju, Chemerys, Pavlo, Fu, Yun, Wang, Yanzhi, Tulyakov, Sergey, and Ren, Jian
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: Text-to-image diffusion models can create stunning images from natural language descriptions that rival the work of professional artists and photographers. However, these models are large, with complex network architectures and tens of denoising iterations, making them computationally expensive and slow to run. As a result, high-end GPUs and cloud-based inference are required to run diffusion models at scale. This is costly and has privacy implications, especially when user data is sent to a third party. To overcome these challenges, we present a generic approach that, for the first time, unlocks running text-to-image diffusion models on mobile devices in less than $2$ seconds. We achieve so by introducing efficient network architecture and improving step distillation. Specifically, we propose an efficient UNet by identifying the redundancy of the original model and reducing the computation of the image decoder via data distillation. Further, we enhance the step distillation by exploring training strategies and introducing regularization from classifier-free guidance. Our extensive experiments on MS-COCO show that our model with $8$ denoising steps achieves better FID and CLIP scores than Stable Diffusion v$1.5$ with $50$ steps. Our work democratizes content creation by bringing powerful text-to-image diffusion models to the hands of users., Comment: Our project webpage: https://snap-research.github.io/SnapFusion/
Published: 2023

32. Gen-L-Video: Multi-Text to Long Video Generation via Temporal Co-Denoising

Author: Wang, Fu-Yun, Chen, Wenshuo, Song, Guanglu, Ye, Han-Jia, Liu, Yu, and Li, Hongsheng
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Leveraging large-scale image-text datasets and advancements in diffusion models, text-driven generative models have made remarkable strides in the field of image generation and editing. This study explores the potential of extending the text-driven ability to the generation and editing of multi-text conditioned long videos. Current methodologies for video generation and editing, while innovative, are often confined to extremely short videos (typically less than 24 frames) and are limited to a single text condition. These constraints significantly limit their applications given that real-world videos usually consist of multiple segments, each bearing different semantic information. To address this challenge, we introduce a novel paradigm dubbed as Gen-L-Video, capable of extending off-the-shelf short video diffusion models for generating and editing videos comprising hundreds of frames with diverse semantic segments without introducing additional training, all while preserving content consistency. We have implemented three mainstream text-driven video generation and editing methodologies and extended them to accommodate longer videos imbued with a variety of semantic segments with our proposed paradigm. Our experimental outcomes reveal that our approach significantly broadens the generative and editing capabilities of video diffusion models, offering new possibilities for future research and applications. The code is available at https://github.com/G-U-N/Gen-L-Video., Comment: The code is available at https://github.com/G-U-N/Gen-L-Video
Published: 2023

33. UniControl: A Unified Diffusion Model for Controllable Visual Generation In the Wild

Author: Qin, Can, Zhang, Shu, Yu, Ning, Feng, Yihao, Yang, Xinyi, Zhou, Yingbo, Wang, Huan, Niebles, Juan Carlos, Xiong, Caiming, Savarese, Silvio, Ermon, Stefano, Fu, Yun, and Xu, Ran
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: Achieving machine autonomy and human control often represent divergent objectives in the design of interactive AI systems. Visual generative foundation models such as Stable Diffusion show promise in navigating these goals, especially when prompted with arbitrary languages. However, they often fall short in generating images with spatial, structural, or geometric controls. The integration of such controls, which can accommodate various visual conditions in a single unified model, remains an unaddressed challenge. In response, we introduce UniControl, a new generative foundation model that consolidates a wide array of controllable condition-to-image (C2I) tasks within a singular framework, while still allowing for arbitrary language prompts. UniControl enables pixel-level-precise image generation, where visual conditions primarily influence the generated structures and language prompts guide the style and context. To equip UniControl with the capacity to handle diverse visual conditions, we augment pretrained text-to-image diffusion models and introduce a task-aware HyperNet to modulate the diffusion models, enabling the adaptation to different C2I tasks simultaneously. Trained on nine unique C2I tasks, UniControl demonstrates impressive zero-shot generation abilities with unseen visual conditions. Experimental results show that UniControl often surpasses the performance of single-task-controlled methods of comparable model sizes. This control versatility positions UniControl as a significant advancement in the realm of controllable visual generation., Comment: NeurIPS 2023
Published: 2023

34. Uncovering the Missing Pattern: Unified Framework Towards Trajectory Imputation and Prediction

Author: Xu, Yi, Bazarjani, Armin, Chi, Hyung-gun, Choi, Chiho, and Fu, Yun
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Trajectory prediction is a crucial undertaking in understanding entity movement or human behavior from observed sequences. However, current methods often assume that the observed sequences are complete while ignoring the potential for missing values caused by object occlusion, scope limitation, sensor failure, etc. This limitation inevitably hinders the accuracy of trajectory prediction. To address this issue, our paper presents a unified framework, the Graph-based Conditional Variational Recurrent Neural Network (GC-VRNN), which can perform trajectory imputation and prediction simultaneously. Specifically, we introduce a novel Multi-Space Graph Neural Network (MS-GNN) that can extract spatial features from incomplete observations and leverage missing patterns. Additionally, we employ a Conditional VRNN with a specifically designed Temporal Decay (TD) module to capture temporal dependencies and temporal missing patterns in incomplete trajectories. The inclusion of the TD module allows for valuable information to be conveyed through the temporal flow. We also curate and benchmark three practical datasets for the joint problem of trajectory imputation and prediction. Extensive experiments verify the exceptional performance of our proposed method. As far as we know, this is the first work to address the lack of benchmarks and techniques for trajectory imputation and prediction in a unified manner., Comment: Accepted by CVPR 2023, Supplementary Material at https://github.com/colorfulfuture/GC-VRNN
Published: 2023

35. Frame Flexible Network

Author: Zhang, Yitian, Bai, Yue, Liu, Chang, Wang, Huan, Li, Sheng, and Fu, Yun
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: Existing video recognition algorithms always conduct different training pipelines for inputs with different frame numbers, which requires repetitive training operations and multiplying storage costs. If we evaluate the model using other frames which are not used in training, we observe the performance will drop significantly (see Fig.1), which is summarized as Temporal Frequency Deviation phenomenon. To fix this issue, we propose a general framework, named Frame Flexible Network (FFN), which not only enables the model to be evaluated at different frames to adjust its computation, but also reduces the memory costs of storing multiple models significantly. Concretely, FFN integrates several sets of training sequences, involves Multi-Frequency Alignment (MFAL) to learn temporal frequency invariant representations, and leverages Multi-Frequency Adaptation (MFAD) to further strengthen the representation abilities. Comprehensive empirical validations using various architectures and popular benchmarks solidly demonstrate the effectiveness and generalization of FFN (e.g., 7.08/5.15/2.17% performance gain at Frame 4/8/16 on Something-Something V1 dataset over Uniformer). Code is available at https://github.com/BeSpontaneous/FFN., Comment: Accepted by CVPR2023
Published: 2023

36. Contrastive Alignment of Vision to Language Through Parameter-Efficient Transfer Learning

Author: Khan, Zaid and Fu, Yun
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Contrastive vision-language models (e.g. CLIP) are typically created by updating all the parameters of a vision model and language model through contrastive training. Can such models be created by a small number of parameter updates to an already-trained language model and vision model? The literature describes techniques that can create vision-language models by updating a small number of parameters in a language model, but these require already aligned visual representations and are non-contrastive, hence unusable for latency-sensitive applications such as neural search. We explore the feasibility and benefits of parameter-efficient contrastive vision-language alignment through transfer learning: creating a model such as CLIP by minimally updating an already-trained vision and language model. We find that a minimal set of parameter updates ($<$7%) can achieve the same performance as full-model training, and updating specific components ($<$1% of parameters) can match 75% of full-model training. We describe a series of experiments: we show that existing knowledge is conserved more strongly in parameter-efficient training and that parameter-efficient scaling scales with model and dataset size. Where paired-image text data is scarce but strong multilingual language models exist (e.g. low resource languages), parameter-efficient training is even preferable to full-model training. Given a fixed compute budget, parameter-efficient training allows training larger models on the same hardware, achieving equivalent performance in less time. Parameter-efficient training hence constitutes an energy-efficient and effective training strategy for contrastive vision-language models that may be preferable to the full-model training paradigm for common use cases. Code and weights at https://github.com/codezakh/LilT., Comment: Accepted to ICLR 2023
Published: 2023

37. GlueGen: Plug and Play Multi-modal Encoders for X-to-image Generation

Author: Qin, Can, Yu, Ning, Xing, Chen, Zhang, Shu, Chen, Zeyuan, Ermon, Stefano, Fu, Yun, Xiong, Caiming, and Xu, Ran
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Multimedia
Abstract: Text-to-image (T2I) models based on diffusion processes have achieved remarkable success in controllable image generation using user-provided captions. However, the tight coupling between the current text encoder and image decoder in T2I models makes it challenging to replace or upgrade. Such changes often require massive fine-tuning or even training from scratch with the prohibitive expense. To address this problem, we propose GlueGen, which applies a newly proposed GlueNet model to align features from single-modal or multi-modal encoders with the latent space of an existing T2I model. The approach introduces a new training objective that leverages parallel corpora to align the representation spaces of different encoders. Empirical results show that GlueNet can be trained efficiently and enables various capabilities beyond previous state-of-the-art models: 1) multilingual language models such as XLM-Roberta can be aligned with existing T2I models, allowing for the generation of high-quality images from captions beyond English; 2) GlueNet can align multi-modal encoders such as AudioCLIP with the Stable Diffusion model, enabling sound-to-image generation; 3) it can also upgrade the current text encoder of the latent diffusion model for challenging case generation. By the alignment of various feature representations, the GlueNet allows for flexible and efficient integration of new functionality into existing T2I models and sheds light on X-to-image (X2I) generation., Comment: ICCV 2023
Published: 2023

38. Iterative Soft Shrinkage Learning for Efficient Image Super-Resolution

Author: Wang, Jiamian, Wang, Huan, Zhang, Yulun, Fu, Yun, and Tao, Zhiqiang
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Image super-resolution (SR) has witnessed extensive neural network designs from CNN to transformer architectures. However, prevailing SR models suffer from prohibitive memory footprint and intensive computations, which limits further deployment on edge devices. This work investigates the potential of network pruning for super-resolution to take advantage of off-the-shelf network designs and reduce the underlying computational overhead. Two main challenges remain in applying pruning methods for SR. First, the widely-used filter pruning technique reflects limited granularity and restricted adaptability to diverse network structures. Second, existing pruning methods generally operate upon a pre-trained network for the sparse structure determination, hard to get rid of dense model training in the traditional SR paradigm. To address these challenges, we adopt unstructured pruning with sparse models directly trained from scratch. Specifically, we propose a novel Iterative Soft Shrinkage-Percentage (ISS-P) method by optimizing the sparse structure of a randomly initialized network at each iteration and tweaking unimportant weights with a small amount proportional to the magnitude scale on-the-fly. We observe that the proposed ISS-P can dynamically learn sparse structures adapting to the optimization process and preserve the sparse model's trainability by yielding a more regularized gradient throughput. Experiments on benchmark datasets demonstrate the effectiveness of the proposed ISS-P over diverse network architectures. Code is available at https://github.com/Jiamian-Wang/Iterative-Soft-Shrinkage-SR, Comment: Accepted by ICCV 2023, code released at https://github.com/Jiamian-Wang/Iterative-Soft-Shrinkage-SR
Published: 2023

39. Image as Set of Points

Author: Ma, Xu, Zhou, Yuqian, Wang, Huan, Qin, Can, Sun, Bin, Liu, Chang, and Fu, Yun
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: What is an image and how to extract latent features? Convolutional Networks (ConvNets) consider an image as organized pixels in a rectangular shape and extract features via convolutional operation in local region; Vision Transformers (ViTs) treat an image as a sequence of patches and extract features via attention mechanism in a global range. In this work, we introduce a straightforward and promising paradigm for visual representation, which is called Context Clusters. Context clusters (CoCs) view an image as a set of unorganized points and extract features via simplified clustering algorithm. In detail, each point includes the raw feature (e.g., color) and positional information (e.g., coordinates), and a simplified clustering algorithm is employed to group and extract deep features hierarchically. Our CoCs are convolution- and attention-free, and only rely on clustering algorithm for spatial interaction. Owing to the simple design, we show CoCs endow gratifying interpretability via the visualization of clustering process. Our CoCs aim at providing a new perspective on image and visual representation, which may enjoy broad applications in different domains and exhibit profound insights. Even though we are not targeting SOTA performance, COCs still achieve comparable or even better results than ConvNets or ViTs on several benchmarks. Codes are available at: https://github.com/ma-xu/Context-Cluster., Comment: ICLR'23 Oral (top 5%); Codes: https://github.com/ma-xu/Context-Cluster
Published: 2023

40. Explainable Anomaly Detection in Images and Videos: A Survey

Author: Wang, Yizhou, Guo, Dongliang, Li, Sheng, Camps, Octavia, and Fu, Yun
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence
Abstract: Anomaly detection and localization of visual data, including images and videos, are of great significance in both machine learning academia and applied real-world scenarios. Despite the rapid development of visual anomaly detection techniques in recent years, the interpretations of these black-box models and reasonable explanations of why anomalies can be distinguished out are scarce. This paper provides the first survey concentrated on explainable visual anomaly detection methods. We first introduce the basic background of image-level and video-level anomaly detection. Then, as the main content of this survey, a comprehensive and exhaustive literature review of explainable anomaly detection methods for both images and videos is presented. Next, we analyze why some explainable anomaly detection methods can be applied to both images and videos and why others can be only applied to one modality. Additionally, we provide summaries of current 2D visual anomaly detection datasets and evaluation metrics. Finally, we discuss several promising future directions and open problems to explore the explainability of 2D visual anomaly detection. The related resource collection is given at https://github.com/wyzjack/Awesome-XAD.
Published: 2023

41. Pyramid Attention Network for Image Restoration

Author: Mei, Yiqun, Fan, Yuchen, Zhang, Yulun, Yu, Jiahui, Zhou, Yuqian, Liu, Ding, Fu, Yun, Huang, Thomas S., and Shi, Humphrey
Published: 2023
Full Text: View/download PDF

42. Prime editing-based gene correction alleviates the hyperexcitable phenotype and seizures of a genetic epilepsy mouse model

Author: Cao, Bi-rong, Huang, Yi-ming, Tian, Fu-yun, Li, Jin-hui, Xu, Chun-long, Wei, Yu, Liu, Ji-kai, Guo, Qian-bei, Xu, Hai-yan, Zhan, Li, Lv, Rui-min, Sun, Yi-di, Hu, Xin-de, Gao, Zhao-bing, and Zhou, Chang-yang
Published: 2023
Full Text: View/download PDF

43. Making Reconstruction-based Method Great Again for Video Anomaly Detection

Author: Wang, Yizhou, Qin, Can, Bai, Yue, Xu, Yi, Ma, Xu, and Fu, Yun
Subjects: Computer Science - Computer Vision and Pattern Recognition, Electrical Engineering and Systems Science - Image and Video Processing
Abstract: Anomaly detection in videos is a significant yet challenging problem. Previous approaches based on deep neural networks employ either reconstruction-based or prediction-based approaches. Nevertheless, existing reconstruction-based methods 1) rely on old-fashioned convolutional autoencoders and are poor at modeling temporal dependency; 2) are prone to overfit the training samples, leading to indistinguishable reconstruction errors of normal and abnormal frames during the inference phase. To address such issues, firstly, we get inspiration from transformer and propose ${\textbf S}$patio-${\textbf T}$emporal ${\textbf A}$uto-${\textbf T}$rans-${\textbf E}$ncoder, dubbed as $\textbf{STATE}$, as a new autoencoder model for enhanced consecutive frame reconstruction. Our STATE is equipped with a specifically designed learnable convolutional attention module for efficient temporal learning and reasoning. Secondly, we put forward a novel reconstruction-based input perturbation technique during testing to further differentiate anomalous frames. With the same perturbation magnitude, the testing reconstruction error of the normal frames lowers more than that of the abnormal frames, which contributes to mitigating the overfitting problem of reconstruction. Owing to the high relevance of the frame abnormality and the objects in the frame, we conduct object-level reconstruction using both the raw frame and the corresponding optical flow patches. Finally, the anomaly score is designed based on the combination of the raw and motion reconstruction errors using perturbed inputs. Extensive experiments on benchmark video anomaly detection datasets demonstrate that our approach outperforms previous reconstruction-based methods by a notable margin, and achieves state-of-the-art anomaly detection performance consistently. The code is available at https://github.com/wyzjack/MRMGA4VAD., Comment: Accepted by ICDM 2022
Published: 2023

44. Why is the State of Neural Network Pruning so Confusing? On the Fairness, Comparison Setup, and Trainability in Network Pruning

Author: Wang, Huan, Qin, Can, Bai, Yue, and Fu, Yun
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: The state of neural network pruning has been noticed to be unclear and even confusing for a while, largely due to "a lack of standardized benchmarks and metrics" [3]. To standardize benchmarks, first, we need to answer: what kind of comparison setup is considered fair? This basic yet crucial question has barely been clarified in the community, unfortunately. Meanwhile, we observe several papers have used (severely) sub-optimal hyper-parameters in pruning experiments, while the reason behind them is also elusive. These sub-optimal hyper-parameters further exacerbate the distorted benchmarks, rendering the state of neural network pruning even more obscure. Two mysteries in pruning represent such a confusing status: the performance-boosting effect of a larger finetuning learning rate, and the no-value argument of inheriting pretrained weights in filter pruning. In this work, we attempt to explain the confusing state of network pruning by demystifying the two mysteries. Specifically, (1) we first clarify the fairness principle in pruning experiments and summarize the widely-used comparison setups; (2) then we unveil the two pruning mysteries and point out the central role of network trainability, which has not been well recognized so far; (3) finally, we conclude the paper and give some concrete suggestions regarding how to calibrate the pruning benchmarks in the future. Code: https://github.com/mingsun-tse/why-the-state-of-pruning-so-confusing., Comment: 17 pages, v2, corrected wrong references
Published: 2023

45. An Automatic Method for Generating Symbolic Expressions of Zernike Circular Polynomials

Author: Zhang, Hong-Yan, Zhou, Yu, and Li, Fu-Yun
Subjects: Computer Science - Symbolic Computation, Computer Science - Mathematical Software
Abstract: Zernike circular polynomials (ZCP) play a significant role in optics engineering. The symbolic expressions for ZCP are valuable for theoretic analysis and engineering designs. However, there are still two problems which remain open: firstly, there is a lack of sufficient mathematical formulas of the ZCP for optics designers; secondly the formulas for inter-conversion of Noll's single index and Born-Wolf's double indices of ZCP are neither uniquely determinate nor satisfactory. An automatic method for generating symbolic expressions for ZCP is proposed based on five essential factors: the new theorems for converting the single/double indices of the ZCP, the robust and effective numeric algorithms for computing key parameters of ZCP, the symbolic algorithms for generating mathematical expressions of ZCP, and meta-programming \& \LaTeX{} programming for generating the table of ZCP. The theorems, method, algorithms and system architecture proposed are beneficial to both optics design process, optics software, computer-output typesetting in publishing industry as well as STEM education., Comment: This paper has proposed a method for creating mathematical table as demonstrated by generating Zenike circular polynomials. The code is available on GitHub site, please see: https://github.com/GrAbsRD/ZernikeSymbolicExpression
Published: 2023
Full Text: View/download PDF

46. A Close Look at Spatial Modeling: From Attention to Convolution

Author: Ma, Xu, Wang, Huan, Qin, Can, Li, Kunpeng, Zhao, Xingchen, Fu, Jie, and Fu, Yun
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: Vision Transformers have shown great promise recently for many vision tasks due to the insightful architecture design and attention mechanism. By revisiting the self-attention responses in Transformers, we empirically observe two interesting issues. First, Vision Transformers present a queryirrelevant behavior at deep layers, where the attention maps exhibit nearly consistent contexts in global scope, regardless of the query patch position (also head-irrelevant). Second, the attention maps are intrinsically sparse, few tokens dominate the attention weights; introducing the knowledge from ConvNets would largely smooth the attention and enhance the performance. Motivated by above observations, we generalize self-attention formulation to abstract a queryirrelevant global context directly and further integrate the global context into convolutions. The resulting model, a Fully Convolutional Vision Transformer (i.e., FCViT), purely consists of convolutional layers and firmly inherits the merits of both attention mechanism and convolutions, including dynamic property, weight sharing, and short- and long-range feature modeling, etc. Experimental results demonstrate the effectiveness of FCViT. With less than 14M parameters, our FCViT-S12 outperforms related work ResT-Lite by 3.7% top1 accuracy on ImageNet-1K. When scaling FCViT to larger models, we still perform better than previous state-of-the-art ConvNeXt with even fewer parameters. FCViT-based models also demonstrate promising transferability to downstream tasks, like object detection, instance segmentation, and semantic segmentation. Codes and models are made available at: https://github.com/ma-xu/FCViT.
Published: 2022

47. CDIO-CT collaborative strategy for solving complex STEM problems in system modeling and simulation: an illustration of solving the period of mathematical pendulum

Author: Zhang, Hong-Yan, Zhou, Yu, Li, Yu-Tao, Li, Fu-Yun, and Jiang, Yong-Hui
Subjects: Physics - Physics Education, Computer Science - Computers and Society, Mathematics - Numerical Analysis
Abstract: The problem-project-oriented STEM education plays a significant role in training students' ability of innovation. Although the conceive-design-implement-operate (CDIO) approach and the computational thinking (CT) are hot topics in recent decade, there are still two deficiencies: the CDIO approach and CT are discussed separately and a general framework of coping with complex STEM problems in system modeling and simulation is missing. In this paper, a collaborative strategy based on the CDIO and CT is proposed for solving complex STEM problems in system modeling and simulation with a general framework, in which the CDIO is about ``how to do", CT is about ``how to think", and the project means ``what to do". As an illustration, the problem of solving the period of mathematical pendulum (MP) is discussed in detail. The most challenging task involved in the problem is to compute the complete elliptic integral of the first kind (CEI-1). In the philosophy of STEM education, all problems have more than one solutions. For computing the CEI-1, four methods are discussed with a top-down strategy, which includes the infinite series method, arithmetic-geometric mean (AGM) method, Gauss-Chebyshev method and Gauss-Legendre method. The algorithms involved can be utilized for R & D projects of interest and be reused according to the requirements encountered. The general framework for solving complex STEM problem in system modeling and simulation is worth recommending to the college students and instructors., Comment: 27 pages, 12 figures, 11 tables
Published: 2022
Full Text: View/download PDF

48. Real-Time Neural Light Field on Mobile Devices

Author: Cao, Junli, Wang, Huan, Chemerys, Pavlo, Shakhrai, Vladislav, Hu, Ju, Fu, Yun, Makoviichuk, Denys, Tulyakov, Sergey, and Ren, Jian
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: Recent efforts in Neural Rendering Fields (NeRF) have shown impressive results on novel view synthesis by utilizing implicit neural representation to represent 3D scenes. Due to the process of volumetric rendering, the inference speed for NeRF is extremely slow, limiting the application scenarios of utilizing NeRF on resource-constrained hardware, such as mobile devices. Many works have been conducted to reduce the latency of running NeRF models. However, most of them still require high-end GPU for acceleration or extra storage memory, which is all unavailable on mobile devices. Another emerging direction utilizes the neural light field (NeLF) for speedup, as only one forward pass is performed on a ray to predict the pixel color. Nevertheless, to reach a similar rendering quality as NeRF, the network in NeLF is designed with intensive computation, which is not mobile-friendly. In this work, we propose an efficient network that runs in real-time on mobile devices for neural rendering. We follow the setting of NeLF to train our network. Unlike existing works, we introduce a novel network architecture that runs efficiently on mobile devices with low latency and small size, i.e., saving $15\times \sim 24\times$ storage compared with MobileNeRF. Our model achieves high-resolution generation while maintaining real-time inference for both synthetic and real-world scenes on mobile devices, e.g., $18.04$ms (iPhone 13) for rendering one $1008\times756$ image of real 3D scenes. Additionally, we achieve similar image quality as NeRF and better quality than MobileNeRF (PSNR $26.15$ vs. $25.91$ on the real-world forward-facing dataset)., Comment: CVPR 2023. Project page: https://snap-research.github.io/MobileR2L/ Code: https://github.com/snap-research/MobileR2L/
Published: 2022

49. NeRFInvertor: High Fidelity NeRF-GAN Inversion for Single-shot Real Image Animation

Author: Yin, Yu, Ghasedi, Kamran, Wu, HsiangTao, Yang, Jiaolong, Tong, Xin, and Fu, Yun
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Nerf-based Generative models have shown impressive capacity in generating high-quality images with consistent 3D geometry. Despite successful synthesis of fake identity images randomly sampled from latent space, adopting these models for generating face images of real subjects is still a challenging task due to its so-called inversion issue. In this paper, we propose a universal method to surgically fine-tune these NeRF-GAN models in order to achieve high-fidelity animation of real subjects only by a single image. Given the optimized latent code for an out-of-domain real image, we employ 2D loss functions on the rendered image to reduce the identity gap. Furthermore, our method leverages explicit and implicit 3D regularizations using the in-domain neighborhood samples around the optimized latent code to remove geometrical and visual artifacts. Our experiments confirm the effectiveness of our method in realistic, high-fidelity, and 3D consistent animation of real faces on multiple NeRF-GAN models across different datasets.
Published: 2022

50. Look More but Care Less in Video Recognition

Author: Zhang, Yitian, Bai, Yue, Wang, Huan, Xu, Yi, and Fu, Yun
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: Existing action recognition methods typically sample a few frames to represent each video to avoid the enormous computation, which often limits the recognition performance. To tackle this problem, we propose Ample and Focal Network (AFNet), which is composed of two branches to utilize more frames but with less computation. Specifically, the Ample Branch takes all input frames to obtain abundant information with condensed computation and provides the guidance for Focal Branch by the proposed Navigation Module; the Focal Branch squeezes the temporal size to only focus on the salient frames at each convolution block; in the end, the results of two branches are adaptively fused to prevent the loss of information. With this design, we can introduce more frames to the network but cost less computation. Besides, we demonstrate AFNet can utilize fewer frames while achieving higher accuracy as the dynamic selection in intermediate features enforces implicit temporal modeling. Further, we show that our method can be extended to reduce spatial redundancy with even less cost. Extensive experiments on five datasets demonstrate the effectiveness and efficiency of our method., Comment: Accepted by NeurIPS 2022
Published: 2022

Catalog

Books, media, physical & digital resources

See catalog results

Searchworks

Select search scope, currently: Articles Catalog books, media & more in Jio Institute collections Articles journal articles & other e-resources

Search

Search Constraints

Refine your results

Search Limiters

Topic

Publication Year Range

Language

Publication Type

Journal

Region

Database

Publisher

5,127 results on '"Fu, Yun"'

Search Results

Catalog

Select search scope, currently: Articles

Catalog

books, media & more in Jio Institute collections

Articles

journal articles & other e-resources